* [PATCH 0/5] RFC: CGroup Namespaces
From: Aditya Kali @ 2014-07-17 19:52 UTC
  To: tj, lizefan, cgroups, linux-kernel, linux-api, mingo
  Cc: containers

Background
  Cgroups and namespaces are used together to create “virtual”
  containers that isolate the host environment from the processes
  running in the container. But since cgroups themselves are not
  “virtualized”, a task can always see the global cgroup view
  through the cgroupfs mount and via the /proc/self/cgroup file.

  $ cat /proc/self/cgroup 
  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1

  This exposure of cgroup names to the processes running inside a
  container results in some problems:
  (1) The container names are typically data owned by the host's
      container-management agent (systemd, docker/libcontainer, etc.),
      and leaking a name (or the hierarchy) reveals too much
      information about the host system.
  (2) It makes container migration across machines (CRIU) more
      difficult, as the container names need to be unique across the
      machines in the migration domain.
  (3) It makes it difficult to run container management tools (like
      docker/libcontainer, lmctfy, etc.) within virtual containers
      without adding a dependency on some state/agent present outside
      the container.

  Note that the feature proposed here is completely different from the
  “ns cgroup” feature which existed in the Linux kernel until recently.
  The ns cgroup also attempted to connect cgroups and namespaces by
  creating a new cgroup every time a new namespace was created. It did
  not solve any of the above-mentioned problems and was later dropped
  from the kernel.

Introducing CGroup Namespaces
  With the unified cgroup hierarchy
  (Documentation/cgroups/unified-hierarchy.txt), containers can now
  have a much more coherent cgroup view and it is easy to associate a
  container with a single cgroup. This also allows us to virtualize the
  cgroup view for tasks inside the container.

  The new CGroup Namespace allows a process to “unshare” its cgroup
  hierarchy starting from the cgroup it is currently in.
  For example:
  $ cat /proc/self/cgroup
  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
  $ ls -l /proc/self/ns/cgroup
  lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
  $ ~/unshare -c  # calls unshare(CLONE_NEWCGROUP) and exec’s /bin/bash
  [ns]$ ls -l /proc/self/ns/cgroup
  lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
  # From within the new cgroupns, the process sees that it's in the root cgroup
  [ns]$ cat /proc/self/cgroup
  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/

  # From global cgroupns:
  $ cat /proc/<pid>/cgroup
  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1

  The virtualization of the /proc/self/cgroup file, combined with
  restricting the view of the cgroup hierarchy (by bind-mounting the
  $CGROUP_MOUNT/batchjobs/c_job_id1/ directory to
  $CONTAINER_CHROOT/sys/fs/cgroup/), should provide a completely
  isolated cgroup view inside the container.
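
  The ~/unshare helper used in the session above is not part of this
  posting. A minimal userspace sketch of what such a helper might do
  is given here. It assumes the flag is exposed as CLONE_NEWCGROUP;
  the fallback value 0x02000000 is the one mainline Linux uses and is
  only an assumption for this RFC. read_first_line() and
  show_cgroupns_effect() are hypothetical helper names.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

/* Assumption: mainline value of the flag; this RFC's value may differ. */
#ifndef CLONE_NEWCGROUP
#define CLONE_NEWCGROUP 0x02000000
#endif

/*
 * Read the first line of `path` into `buf`, without the trailing newline.
 * Returns 0 on success, -1 if `len` is 0 or the file cannot be read.
 */
int read_first_line(const char *path, char *buf, size_t len)
{
	FILE *f;

	assert(path != NULL && buf != NULL);
	if (len == 0)
		return -1;

	f = fopen(path, "r");
	if (!f)
		return -1;
	if (!fgets(buf, (int)len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	buf[strcspn(buf, "\n")] = '\0';
	return 0;
}

/*
 * Print this task's cgroup path, unshare the cgroup namespace, then
 * print the path again. Returns 0 on success, -1 if unshare() fails.
 */
int show_cgroupns_effect(void)
{
	char line[4096];

	if (read_first_line("/proc/self/cgroup", line, sizeof(line)) == 0)
		printf("before: %s\n", line);

	if (unshare(CLONE_NEWCGROUP) != 0) {
		perror("unshare(CLONE_NEWCGROUP)");
		return -1;
	}

	if (read_first_line("/proc/self/cgroup", line, sizeof(line)) == 0)
		printf("after:  %s\n", line);
	return 0;
}
```

  Calling show_cgroupns_effect() from main() should fail cleanly on a
  kernel without cgroup namespace support. In mainline the call also
  needs CAP_SYS_ADMIN, so an unprivileged run is expected to report an
  error rather than a changed path.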

  In their current simple form, cgroup namespaces provide the
  following behavior:

  (1) The “root” cgroup for a cgroup namespace is the cgroup in which
      the process calling unshare is running.
      For example, if a process in the /batchjobs/c_job_id1 cgroup calls
      unshare, cgroup /batchjobs/c_job_id1 becomes the cgroupns-root.
      For the init_cgroup_ns, this is the real root (“/”) cgroup
      (identified in code as cgrp_dfl_root.cgrp).

  (2) The cgroupns-root cgroup does not change even if the namespace
      creator process later moves to a different cgroup.
      $ ~/unshare -c # unshare cgroupns in some cgroup
      [ns]$ cat /proc/self/cgroup 
      0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/ 
      [ns]$ mkdir sub_cgrp_1
      [ns]$ echo 0 > sub_cgrp_1/cgroup.procs
      [ns]$ cat /proc/self/cgroup 
      0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1

  (3) Each process gets its cgroupns-specific view of
      /proc/<pid>/cgroup.
  (a) Processes running inside the cgroup namespace will be able to see
      cgroup paths (in /proc/self/cgroup) only inside their root cgroup:
      [ns]$ sleep 100000 &  # From within unshared cgroupns
      [1] 7353
      [ns]$ echo 7353 > sub_cgrp_1/cgroup.procs
      [ns]$ cat /proc/7353/cgroup
      0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1

  (b) From global cgroupns, the real cgroup path will be visible:
      $ cat /proc/7353/cgroup
      0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1

  (c) From a sibling cgroupns, the real path will be visible:
      [ns2]$ cat /proc/7353/cgroup
      0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
      (In a correct container setup though, it should not be possible
       to access PIDs in another container in the first place. This
       can be changed if desired.)

  (4) Processes inside a cgroupns are not allowed to move out of the
      cgroupns-root. This is true even if a privileged process in global
      cgroupns tries to move the process out of its cgroupns-root.

      # From global cgroupns
      $ cat /proc/7353/cgroup
      0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
      # cgroupns-root for 7353 is /batchjobs/c_job_id1
      $ echo 7353 > batchjobs/c_job_id2/cgroup.procs
      -bash: echo: write error: Operation not permitted

  (5) setns() is not supported for cgroup namespace in the initial
      version.

  (6) When a thread of a multi-threaded process unshares its cgroup
      namespace, the new cgroupns is applied to the entire process
      (all of its threads). This should be OK since the unified
      hierarchy only allows process-level containerization, so all
      the threads in a process are in the same cgroup. Both changing
      cgroups and unsharing namespaces are protected by
      threadgroup_lock(task).

  (7) The cgroup namespace is alive as long as there is at least one
      process inside it. When the last process exits, the cgroup
      namespace is destroyed. The cgroupns-root and the actual cgroups
      remain though.
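
  Items (3) and (4) above can be modelled in userspace with plain
  string paths. The sketch below is not the kernel implementation in
  this series; the function names are hypothetical, and paths are
  assumed to be absolute with no trailing slash (except "/" itself).

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* True if `path` is `root` itself or lies below it in the hierarchy. */
bool cgroup_is_descendant_path(const char *root, const char *path)
{
	size_t n;

	assert(root != NULL && path != NULL);
	assert(root[0] == '/' && path[0] == '/');

	n = strlen(root);
	if (n == 1)			/* root is "/": everything is below it */
		return true;
	if (strncmp(path, root, n) != 0)
		return false;
	/* Reject prefixes that end mid-name, e.g. c_job_id1 vs c_job_id10. */
	return path[n] == '\0' || path[n] == '/';
}

/*
 * Path of cgroup `real` as seen from a namespace rooted at `ns_root`.
 * As in item (3)(c), a cgroup outside the cgroupns-root is shown with
 * its real path. The result points into `real` or at a literal "/".
 */
const char *cgroupns_view(const char *ns_root, const char *real)
{
	size_t n = strlen(ns_root);

	if (n == 1 || !cgroup_is_descendant_path(ns_root, real))
		return real;
	return real[n] == '\0' ? "/" : real + n;
}

/* Item (4): a task may only be moved to a cgroup under its cgroupns-root. */
bool cgroupns_may_move(const char *ns_root, const char *dst)
{
	return cgroup_is_descendant_path(ns_root, dst);
}
```

  With ns_root "/batchjobs/c_job_id1", the cgroup
  "/batchjobs/c_job_id1/sub_cgrp_1" is viewed as "/sub_cgrp_1", and a
  move to "/batchjobs/c_job_id2" is refused, matching the shell
  sessions in items (3) and (4).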

Implementation
  The current patch-set is based on top of Tejun's cgroup tree (for-next
  branch). It is fairly non-intrusive and provides the above-mentioned
  features.

Possible extensions of CGROUPNS:
  (1) Documentation/cgroups/unified-hierarchy.txt mentions the use of
      capabilities to restrict cgroups to administrative users. CGroup
      namespaces could be of help here. With cgroup namespaces, it might
      be possible to delegate administration of sub-cgroups under a
      cgroupns-root to the cgroupns owner.

  (2) Provide a cgroupns-specific cgroupfs mount, i.e., the following
      command, when run from inside a cgroupns, should only mount the
      hierarchy from the cgroupns-root cgroup:
      $ mount -t cgroup cgroup <cgroup-mountpoint>
      # -o __DEVEL__sane_behavior should be implicit

      This is similar to how procfs can be mounted for every PIDNS. This
      may have some use cases.

---
 fs/kernfs/dir.c                  |  51 +++++++++++++---
 fs/proc/namespaces.c             |   3 +
 include/linux/cgroup.h           |  36 ++++++++++-
 include/linux/cgroup_namespace.h |  62 +++++++++++++++++++
 include/linux/kernfs.h           |   3 +
 include/linux/nsproxy.h          |   2 +
 include/linux/proc_ns.h          |   4 ++
 include/uapi/linux/sched.h       |   3 +-
 init/Kconfig                     |   9 +++
 kernel/Makefile                  |   1 +
 kernel/cgroup.c                  |  75 +++++++++++++++++------
 kernel/cgroup_namespace.c        | 128 +++++++++++++++++++++++++++++++++++++++
 kernel/fork.c                    |   2 +-
 kernel/nsproxy.c                 |  19 +++++-
 14 files changed, 364 insertions(+), 34 deletions(-)
 create mode 100644 include/linux/cgroup_namespace.h
 create mode 100644 kernel/cgroup_namespace.c

[PATCH 1/5] kernfs: Add API to get generate relative kernfs path
[PATCH 2/5] sched: new clone flag CLONE_NEWCGROUP for cgroup
[PATCH 3/5] cgroup: add function to get task's cgroup on default
[PATCH 4/5] cgroup: export cgroup_get() and cgroup_put()
[PATCH 5/5] cgroup: introduce cgroup namespaces



* [PATCH 1/5] kernfs: Add API to get generate relative kernfs path
From: Aditya Kali @ 2014-07-17 19:52 UTC
  To: tj, lizefan, cgroups, linux-kernel, linux-api, mingo
  Cc: containers

The new function kernfs_path_from_node() generates and returns the
kernfs path of a given kernfs_node relative to a given parent
kernfs_node.

Signed-off-by: Aditya Kali <adityakali@google.com>
---
 fs/kernfs/dir.c        | 51 ++++++++++++++++++++++++++++++++++++++++----------
 include/linux/kernfs.h |  3 +++
 2 files changed, 44 insertions(+), 10 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index a693f5b..2224f08 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -44,14 +44,22 @@ static int kernfs_name_locked(struct kernfs_node *kn, char *buf, size_t buflen)
 	return strlcpy(buf, kn->parent ? kn->name : "/", buflen);
 }
 
-static char * __must_check kernfs_path_locked(struct kernfs_node *kn, char *buf,
-					      size_t buflen)
+static char * __must_check kernfs_path_from_node_locked(
+	struct kernfs_node *kn_root,
+	struct kernfs_node *kn,
+	char *buf,
+	size_t buflen)
 {
 	char *p = buf + buflen;
 	int len;
 
 	*--p = '\0';
 
+	if (kn == kn_root) {
+		*--p = '/';
+		return p;
+	}
+
 	do {
 		len = strlen(kn->name);
 		if (p - buf < len + 1) {
@@ -63,6 +71,8 @@ static char * __must_check kernfs_path_locked(struct kernfs_node *kn, char *buf,
 		memcpy(p, kn->name, len);
 		*--p = '/';
 		kn = kn->parent;
+		if (kn == kn_root)
+			break;
 	} while (kn && kn->parent);
 
 	return p;
@@ -92,26 +102,47 @@ int kernfs_name(struct kernfs_node *kn, char *buf, size_t buflen)
 }
 
 /**
- * kernfs_path - build full path of a given node
+ * kernfs_path_from_node - build path of node @kn relative to @kn_root.
+ * @kn_root: parent kernfs_node relative to which we need to build the path
  * @kn: kernfs_node of interest
- * @buf: buffer to copy @kn's name into
+ * @buf: buffer to copy @kn's path into
  * @buflen: size of @buf
  *
- * Builds and returns the full path of @kn in @buf of @buflen bytes.  The
- * path is built from the end of @buf so the returned pointer usually
+ * Builds and returns @kn's path relative to @kn_root. @kn_root is expected to
+ * be parent of @kn at some level. If this is not true or if @kn_root is NULL,
+ * then full path of @kn is returned.
+ * The path is built from the end of @buf so the returned pointer usually
  * doesn't match @buf.  If @buf isn't long enough, @buf is nul terminated
  * and %NULL is returned.
  */
-char *kernfs_path(struct kernfs_node *kn, char *buf, size_t buflen)
+char *kernfs_path_from_node(struct kernfs_node *kn_root, struct kernfs_node *kn,
+			    char *buf, size_t buflen)
 {
 	unsigned long flags;
 	char *p;
 
 	spin_lock_irqsave(&kernfs_rename_lock, flags);
-	p = kernfs_path_locked(kn, buf, buflen);
+	p = kernfs_path_from_node_locked(kn_root, kn, buf, buflen);
 	spin_unlock_irqrestore(&kernfs_rename_lock, flags);
 	return p;
 }
+EXPORT_SYMBOL_GPL(kernfs_path_from_node);
+
+/**
+ * kernfs_path - build full path of a given node
+ * @kn: kernfs_node of interest
+ * @buf: buffer to copy @kn's name into
+ * @buflen: size of @buf
+ *
+ * Builds and returns the full path of @kn in @buf of @buflen bytes.  The
+ * path is built from the end of @buf so the returned pointer usually
+ * doesn't match @buf.  If @buf isn't long enough, @buf is nul terminated
+ * and %NULL is returned.
+ */
+char *kernfs_path(struct kernfs_node *kn, char *buf, size_t buflen)
+{
+	return kernfs_path_from_node(NULL, kn, buf, buflen);
+}
 EXPORT_SYMBOL_GPL(kernfs_path);
 
 /**
@@ -145,8 +176,8 @@ void pr_cont_kernfs_path(struct kernfs_node *kn)
 
 	spin_lock_irqsave(&kernfs_rename_lock, flags);
 
-	p = kernfs_path_locked(kn, kernfs_pr_cont_buf,
-			       sizeof(kernfs_pr_cont_buf));
+	p = kernfs_path_from_node_locked(NULL, kn, kernfs_pr_cont_buf,
+					 sizeof(kernfs_pr_cont_buf));
 	if (p)
 		pr_cont("%s", p);
 	else
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 20f4935..1627341 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -257,6 +257,9 @@ static inline bool kernfs_ns_enabled(struct kernfs_node *kn)
 }
 
 int kernfs_name(struct kernfs_node *kn, char *buf, size_t buflen);
+char * __must_check kernfs_path_from_node(struct kernfs_node *root_kn,
+					  struct kernfs_node *kn, char *buf,
+					  size_t buflen);
 char * __must_check kernfs_path(struct kernfs_node *kn, char *buf,
 				size_t buflen);
 void pr_cont_kernfs_name(struct kernfs_node *kn);
-- 
2.0.0.526.g5318336
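
The back-to-front path construction in the patch above can be tried in
userspace with a small model. This is a sketch under stated assumptions,
not the kernel code: struct node is a hypothetical stand-in for struct
kernfs_node, no locking is modelled, and the top-level node is handled
explicitly instead of relying on its name being empty.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct kernfs_node: a name and a parent. */
struct node {
	const char *name;
	struct node *parent;	/* NULL for the top of the hierarchy */
};

/*
 * Build the path of `n` relative to `root`, writing from the end of
 * `buf` towards its start. root == NULL, or a root that is not an
 * ancestor of `n`, yields the full path. Returns a pointer into `buf`,
 * or NULL if `buf` is too small.
 */
char *path_from_node(struct node *root, struct node *n,
		     char *buf, size_t buflen)
{
	char *p;
	size_t len;

	assert(n != NULL && buf != NULL);
	if (buflen < 2)
		return NULL;

	p = buf + buflen;
	*--p = '\0';

	/* The relative root itself, or the top node, is shown as "/". */
	if (n == root || n->parent == NULL) {
		*--p = '/';
		return p;
	}

	/* Prepend "/name" for each level until the root is reached. */
	while (n != root && n->parent != NULL) {
		len = strlen(n->name);
		if ((size_t)(p - buf) < len + 1)
			return NULL;
		p -= len;
		memcpy(p, n->name, len);
		*--p = '/';
		n = n->parent;
	}
	return p;
}
```

As in the patch, the returned pointer usually does not equal buf, so a
caller has to use the return value and not the start of the buffer.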


* [PATCH 2/5] sched: new clone flag CLONE_NEWCGROUP for cgroup namespace
       [not found]   ` <1405626731-12220-1-git-send-email-adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
  2014-07-17 19:52       ` Aditya Kali
@ 2014-07-17 19:52     ` Aditya Kali
  2014-07-17 19:52       ` Aditya Kali
                       ` (5 subsequent siblings)
  7 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-07-17 19:52 UTC (permalink / raw)
  To: tj-DgEjT+Ai2ygdnm+yROfE0A, lizefan-hv44wF8Li93QT0dZR+AlfA,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, mingo-H+wXaHxf7aLQT0dZR+AlfA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

CLONE_NEWCGROUP will be used to create a new cgroup namespace. It
reuses bit 0x02000000, which became available when the unused
CLONE_STOPPED flag was removed.

Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
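
From user space, the flag would be requested through unshare(2) like
the other CLONE_NEW* flags. The following is a minimal sketch, assuming
a kernel with this series applied; enter_new_cgroup_ns() is a
hypothetical helper name, and the fallback #define only mirrors the
value proposed in this patch.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>

/* Value proposed by this patch, for libc headers that do not define it. */
#ifndef CLONE_NEWCGROUP
#define CLONE_NEWCGROUP 0x02000000
#endif

static_assert(CLONE_NEWCGROUP == 0x02000000,
	      "flag value differs from the one proposed in this patch");

/*
 * Ask the kernel to move the calling process into a new cgroup
 * namespace.  Returns 0 on success or a negative errno value.
 */
static int enter_new_cgroup_ns(void)
{
	if (unshare(CLONE_NEWCGROUP) == -1)
		return -errno;
	return 0;
}
```

Going by patch 5/5, the call is expected to fail with EPERM when the
caller lacks CAP_SYS_ADMIN and with EINVAL when CONFIG_CGROUP_NS is
disabled.
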
 include/uapi/linux/sched.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 34f9d73..2f90d00 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -21,8 +21,7 @@
 #define CLONE_DETACHED		0x00400000	/* Unused, ignored */
 #define CLONE_UNTRACED		0x00800000	/* set if the tracing process can't force CLONE_PTRACE on this clone */
 #define CLONE_CHILD_SETTID	0x01000000	/* set the TID in the child */
-/* 0x02000000 was previously the unused CLONE_STOPPED (Start in stopped state)
-   and is now available for re-use. */
+#define CLONE_NEWCGROUP		0x02000000	/* New cgroup namespace */
 #define CLONE_NEWUTS		0x04000000	/* New utsname group? */
 #define CLONE_NEWIPC		0x08000000	/* New ipcs */
 #define CLONE_NEWUSER		0x10000000	/* New user namespace */
-- 
2.0.0.526.g5318336

^ permalink raw reply related	[flat|nested] 384+ messages in thread

* [PATCH 3/5] cgroup: add function to get task's cgroup on default hierarchy
  2014-07-17 19:52   ` Aditya Kali
@ 2014-07-17 19:52       ` Aditya Kali
  -1 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-07-17 19:52 UTC (permalink / raw)
  To: tj-DgEjT+Ai2ygdnm+yROfE0A, lizefan-hv44wF8Li93QT0dZR+AlfA,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, mingo-H+wXaHxf7aLQT0dZR+AlfA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

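get_task_cgroup() returns the cgroup of the given task on the default
hierarchy with its reference count incremented. The caller must drop
the reference with cgroup_put() when it is done with the cgroup.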

Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
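
The reference is taken inside the locked section so that the task
cannot be moved to another cgroup between the lookup and the pin. A
user-space sketch of that pattern follows, assuming C11 atomics; struct
group, struct task, the spinlock and the function names are
hypothetical stand-ins, not kernel APIs.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct cgroup and struct task_struct. */
struct group {
	atomic_int refcnt;
};

struct task {
	struct group *grp;	/* may be changed by a migration under the lock */
};

/* Simple spinlock standing in for cgroup_mutex / css_set_rwsem. */
static atomic_flag groups_lock = ATOMIC_FLAG_INIT;

static void groups_lock_acquire(void)
{
	while (atomic_flag_test_and_set_explicit(&groups_lock,
						 memory_order_acquire))
		;	/* spin until the holder releases the lock */
}

static void groups_lock_release(void)
{
	atomic_flag_clear_explicit(&groups_lock, memory_order_release);
}

/* Look up the task's group and pin it before the lock is dropped. */
static struct group *get_task_group(struct task *t)
{
	struct group *g;

	assert(t != NULL && t->grp != NULL);
	groups_lock_acquire();
	g = t->grp;
	atomic_fetch_add(&g->refcnt, 1);
	groups_lock_release();
	return g;
}

/* Drop a reference obtained from get_task_group(). */
static void put_group(struct group *g)
{
	atomic_fetch_sub(&g->refcnt, 1);
}
```

Every successful get_task_group() in this sketch must be paired with
one put_group(), which is the same contract get_task_cgroup() places on
its callers.
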
 include/linux/cgroup.h |  1 +
 kernel/cgroup.c        | 25 +++++++++++++++++++++++++
 2 files changed, 26 insertions(+)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index b5223c5..707c302 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -591,6 +591,7 @@ static inline void pr_cont_cgroup_path(struct cgroup *cgrp)
 }
 
 char *task_cgroup_path(struct task_struct *task, char *buf, size_t buflen);
+struct cgroup *get_task_cgroup(struct task_struct *task);
 
 int cgroup_add_dfl_cftypes(struct cgroup_subsys *ss, struct cftype *cfts);
 int cgroup_add_legacy_cftypes(struct cgroup_subsys *ss, struct cftype *cfts);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 1e94b71..1671345 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1937,6 +1937,31 @@ char *task_cgroup_path(struct task_struct *task, char *buf, size_t buflen)
 }
 EXPORT_SYMBOL_GPL(task_cgroup_path);
 
+/**
+ * get_task_cgroup - get a task's cgroup on the default cgroup hierarchy
+ * @task: target task
+ *
+ * Returns @task's cgroup on the default cgroup hierarchy.  The returned
+ * cgroup has its reference count incremented by cgroup_get(), so the
+ * caller must drop that reference with cgroup_put() once it is done
+ * with it.
+ */
+struct cgroup *get_task_cgroup(struct task_struct *task)
+{
+	struct cgroup *cgrp;
+
+	mutex_lock(&cgroup_mutex);
+	down_read(&css_set_rwsem);
+
+	cgrp = task_cgroup_from_root(task, &cgrp_dfl_root);
+	cgroup_get(cgrp);
+
+	up_read(&css_set_rwsem);
+	mutex_unlock(&cgroup_mutex);
+	return cgrp;
+}
+EXPORT_SYMBOL_GPL(get_task_cgroup);
+
 /* used to track tasks and other necessary states during migration */
 struct cgroup_taskset {
 	/* the src and dst cset list running through cset->mg_node */
-- 
2.0.0.526.g5318336

^ permalink raw reply related	[flat|nested] 384+ messages in thread

* [PATCH 4/5] cgroup: export cgroup_get() and cgroup_put()
  2014-07-17 19:52   ` Aditya Kali
@ 2014-07-17 19:52       ` Aditya Kali
  -1 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-07-17 19:52 UTC (permalink / raw)
  To: tj-DgEjT+Ai2ygdnm+yROfE0A, lizefan-hv44wF8Li93QT0dZR+AlfA,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, mingo-H+wXaHxf7aLQT0dZR+AlfA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Move cgroup_get() and cgroup_put(), along with the cgroup_is_dead()
helper they depend on, into cgroup.h as static inlines so that they
can be called from outside kernel/cgroup.c.

Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
 include/linux/cgroup.h | 17 +++++++++++++++++
 kernel/cgroup.c        | 18 ------------------
 2 files changed, 17 insertions(+), 18 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 707c302..4ea477f 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -530,6 +530,23 @@ static inline bool cgroup_on_dfl(const struct cgroup *cgrp)
 	return cgrp->root == &cgrp_dfl_root;
 }
 
+/* convenient tests for these bits */
+static inline bool cgroup_is_dead(const struct cgroup *cgrp)
+{
+	return !(cgrp->self.flags & CSS_ONLINE);
+}
+
+static inline void cgroup_get(struct cgroup *cgrp)
+{
+	WARN_ON_ONCE(cgroup_is_dead(cgrp));
+	css_get(&cgrp->self);
+}
+
+static inline void cgroup_put(struct cgroup *cgrp)
+{
+	css_put(&cgrp->self);
+}
+
 /* no synchronization, the result can only be used as a hint */
 static inline bool cgroup_has_tasks(struct cgroup *cgrp)
 {
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 1671345..8552513 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -185,7 +185,6 @@ static int need_forkexit_callback __read_mostly;
 static struct cftype cgroup_dfl_base_files[];
 static struct cftype cgroup_legacy_base_files[];
 
-static void cgroup_put(struct cgroup *cgrp);
 static int rebind_subsystems(struct cgroup_root *dst_root,
 			     unsigned int ss_mask);
 static int cgroup_destroy_locked(struct cgroup *cgrp);
@@ -286,12 +285,6 @@ static struct cgroup_subsys_state *cgroup_e_css(struct cgroup *cgrp,
 	return cgroup_css(cgrp, ss);
 }
 
-/* convenient tests for these bits */
-static inline bool cgroup_is_dead(const struct cgroup *cgrp)
-{
-	return !(cgrp->self.flags & CSS_ONLINE);
-}
-
 struct cgroup_subsys_state *of_css(struct kernfs_open_file *of)
 {
 	struct cgroup *cgrp = of->kn->parent->priv;
@@ -1029,17 +1022,6 @@ static umode_t cgroup_file_mode(const struct cftype *cft)
 	return mode;
 }
 
-static void cgroup_get(struct cgroup *cgrp)
-{
-	WARN_ON_ONCE(cgroup_is_dead(cgrp));
-	css_get(&cgrp->self);
-}
-
-static void cgroup_put(struct cgroup *cgrp)
-{
-	css_put(&cgrp->self);
-}
-
 /**
  * cgroup_refresh_child_subsys_mask - update child_subsys_mask
  * @cgrp: the target cgroup
-- 
2.0.0.526.g5318336

^ permalink raw reply related	[flat|nested] 384+ messages in thread

* [PATCH 5/5] cgroup: introduce cgroup namespaces
  2014-07-17 19:52   ` Aditya Kali
@ 2014-07-17 19:52       ` Aditya Kali
  -1 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-07-17 19:52 UTC (permalink / raw)
  To: tj-DgEjT+Ai2ygdnm+yROfE0A, lizefan-hv44wF8Li93QT0dZR+AlfA,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, mingo-H+wXaHxf7aLQT0dZR+AlfA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Introduce the ability to create a new cgroup namespace. A newly created
cgroup namespace remembers the cgroup of the creating task
('struct cgroup *root_cgrp') at the point of creation. The task that
creates the namespace, and all of its future children, are then
restricted to the cgroup hierarchy under this root_cgrp. In this first
version, setns() is not supported for cgroup namespaces.

The main purpose of a cgroup namespace is to virtualize the contents
of the /proc/self/cgroup file. Processes inside a cgroup namespace
can only see paths relative to their namespace root.
This allows container tools (libcontainer, lxc, lmctfy, etc.)
to create completely virtualized containers without leaking the
system-level cgroup hierarchy to the task.

Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
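
The restriction to the hierarchy under root_cgrp comes down to an
ancestry walk before attach, mkdir and rmdir. A minimal user-space
sketch of that check follows; struct group, group_is_descendant() and
check_in_namespace() are hypothetical names used only for illustration,
with a group counting as a descendant of itself.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct cgroup: only the parent link matters. */
struct group {
	struct group *parent;
};

/* True if @g is @ancestor itself or lies somewhere below it. */
static bool group_is_descendant(const struct group *g,
				const struct group *ancestor)
{
	while (g) {
		if (g == ancestor)
			return true;
		g = g->parent;
	}
	return false;
}

/*
 * Permission check in the style of the -EPERM tests in this patch:
 * an operation on @target is allowed only if @target lies inside the
 * subtree rooted at the caller's namespace root @ns_root.
 */
static int check_in_namespace(const struct group *target,
			      const struct group *ns_root)
{
	assert(ns_root != NULL);
	return group_is_descendant(target, ns_root) ? 0 : -EPERM;
}
```

In this sketch a task whose namespace root is /batchjobs/c_job_id1 may
operate on that group and anything below it, while its parent and its
siblings are rejected with -EPERM.
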
 fs/proc/namespaces.c             |   3 +
 include/linux/cgroup.h           |  18 +++++-
 include/linux/cgroup_namespace.h |  62 +++++++++++++++++++
 include/linux/nsproxy.h          |   2 +
 include/linux/proc_ns.h          |   4 ++
 init/Kconfig                     |   9 +++
 kernel/Makefile                  |   1 +
 kernel/cgroup.c                  |  32 ++++++++++
 kernel/cgroup_namespace.c        | 128 +++++++++++++++++++++++++++++++++++++++
 kernel/fork.c                    |   2 +-
 kernel/nsproxy.c                 |  19 +++++-
 11 files changed, 276 insertions(+), 4 deletions(-)
 create mode 100644 include/linux/cgroup_namespace.h
 create mode 100644 kernel/cgroup_namespace.c

diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index 8902609..e04ed4b 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -32,6 +32,9 @@ static const struct proc_ns_operations *ns_entries[] = {
 	&userns_operations,
 #endif
 	&mntns_operations,
+#ifdef CONFIG_CGROUP_NS
+	&cgroupns_operations,
+#endif
 };
 
 static const struct file_operations ns_file_operations = {
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 4ea477f..d3c6070 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -22,6 +22,8 @@
 #include <linux/seq_file.h>
 #include <linux/kernfs.h>
 #include <linux/wait.h>
+#include <linux/nsproxy.h>
+#include <linux/types.h>
 
 #ifdef CONFIG_CGROUPS
 
@@ -469,6 +471,13 @@ struct cftype {
 #endif
 };
 
+struct cgroup_namespace {
+	atomic_t		count;
+	unsigned int		proc_inum;
+	struct user_namespace	*user_ns;
+	struct cgroup		*root_cgrp;
+};
+
 extern struct cgroup_root cgrp_dfl_root;
 extern struct css_set init_css_set;
 
@@ -591,10 +600,17 @@ static inline int cgroup_name(struct cgroup *cgrp, char *buf, size_t buflen)
 	return kernfs_name(cgrp->kn, buf, buflen);
 }
 
+static inline char * __must_check cgroup_path_ns(struct cgroup_namespace *ns,
+						 struct cgroup *cgrp, char *buf,
+						 size_t buflen)
+{
+	return kernfs_path_from_node(ns->root_cgrp->kn, cgrp->kn, buf, buflen);
+}
+
 static inline char * __must_check cgroup_path(struct cgroup *cgrp, char *buf,
 					      size_t buflen)
 {
-	return kernfs_path(cgrp->kn, buf, buflen);
+	return cgroup_path_ns(current->nsproxy->cgroup_ns, cgrp, buf, buflen);
 }
 
 static inline void pr_cont_cgroup_name(struct cgroup *cgrp)
diff --git a/include/linux/cgroup_namespace.h b/include/linux/cgroup_namespace.h
new file mode 100644
index 0000000..9f637fe
--- /dev/null
+++ b/include/linux/cgroup_namespace.h
@@ -0,0 +1,62 @@
+#ifndef _LINUX_CGROUP_NAMESPACE_H
+#define _LINUX_CGROUP_NAMESPACE_H
+
+#include <linux/nsproxy.h>
+#include <linux/cgroup.h>
+#include <linux/types.h>
+#include <linux/user_namespace.h>
+
+extern struct cgroup_namespace init_cgroup_ns;
+
+static inline struct cgroup *task_cgroupns_root(struct task_struct *tsk)
+{
+	return tsk->nsproxy->cgroup_ns->root_cgrp;
+}
+
+#ifdef CONFIG_CGROUP_NS
+
+extern void free_cgroup_ns(struct cgroup_namespace *ns);
+
+static inline struct cgroup_namespace *get_cgroup_ns(
+		struct cgroup_namespace *ns)
+{
+	if (ns)
+		atomic_inc(&ns->count);
+	return ns;
+}
+
+static inline void put_cgroup_ns(struct cgroup_namespace *ns)
+{
+	if (ns && atomic_dec_and_test(&ns->count))
+		free_cgroup_ns(ns);
+}
+
+extern struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
+					       struct user_namespace *user_ns,
+					       struct cgroup_namespace *old_ns);
+
+#else  /* CONFIG_CGROUP_NS */
+
+static inline struct cgroup_namespace *get_cgroup_ns(
+		struct cgroup_namespace *ns)
+{
+	return &init_cgroup_ns;
+}
+
+static inline void put_cgroup_ns(struct cgroup_namespace *ns)
+{
+}
+
+static inline struct cgroup_namespace *copy_cgroup_ns(
+		unsigned long flags,
+		struct user_namespace *user_ns,
+		struct cgroup_namespace *old_ns) {
+	if (flags & CLONE_NEWCGROUP)
+		return ERR_PTR(-EINVAL);
+
+	return old_ns;
+}
+
+#endif  /* CONFIG_CGROUP_NS */
+
+#endif  /* _LINUX_CGROUP_NAMESPACE_H */
diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
index b4ec59d..44f388c 100644
--- a/include/linux/nsproxy.h
+++ b/include/linux/nsproxy.h
@@ -8,6 +8,7 @@ struct mnt_namespace;
 struct uts_namespace;
 struct ipc_namespace;
 struct pid_namespace;
+struct cgroup_namespace;
 struct fs_struct;
 
 /*
@@ -33,6 +34,7 @@ struct nsproxy {
 	struct mnt_namespace *mnt_ns;
 	struct pid_namespace *pid_ns_for_children;
 	struct net 	     *net_ns;
+	struct cgroup_namespace *cgroup_ns;
 };
 extern struct nsproxy init_nsproxy;
 
diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
index 34a1e10..e56dd73 100644
--- a/include/linux/proc_ns.h
+++ b/include/linux/proc_ns.h
@@ -6,6 +6,8 @@
 
 struct pid_namespace;
 struct nsproxy;
+struct task_struct;
+struct inode;
 
 struct proc_ns_operations {
 	const char *name;
@@ -27,6 +29,7 @@ extern const struct proc_ns_operations ipcns_operations;
 extern const struct proc_ns_operations pidns_operations;
 extern const struct proc_ns_operations userns_operations;
 extern const struct proc_ns_operations mntns_operations;
+extern const struct proc_ns_operations cgroupns_operations;
 
 /*
  * We always define these enumerators
@@ -37,6 +40,7 @@ enum {
 	PROC_UTS_INIT_INO	= 0xEFFFFFFEU,
 	PROC_USER_INIT_INO	= 0xEFFFFFFDU,
 	PROC_PID_INIT_INO	= 0xEFFFFFFCU,
+	PROC_CGROUP_INIT_INO	= 0xEFFFFFFBU,
 };
 
 #ifdef CONFIG_PROC_FS
diff --git a/init/Kconfig b/init/Kconfig
index 9d76b99..2f43ec9 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1101,6 +1101,15 @@ config DEBUG_BLK_CGROUP
 	Enable some debugging help. Currently it exports additional stat
 	files in a cgroup which can be useful for debugging.
 
+config CGROUP_NS
+	bool "CGroup Namespaces"
+	default n
+	help
+	  This option enables CGroup Namespaces, which can be used to isolate
+	  cgroup paths. This feature is only useful when the unified cgroup
+	  hierarchy is in use (i.e. cgroups are mounted with the sane_behavior
+	  option).
+
 endif # CGROUPS
 
 config CHECKPOINT_RESTORE
diff --git a/kernel/Makefile b/kernel/Makefile
index f2a8b62..61c5791 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -52,6 +52,7 @@ obj-$(CONFIG_KEXEC) += kexec.o
 obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
 obj-$(CONFIG_COMPAT) += compat.o
 obj-$(CONFIG_CGROUPS) += cgroup.o
+obj-$(CONFIG_CGROUP_NS) += cgroup_namespace.o
 obj-$(CONFIG_CGROUP_FREEZER) += cgroup_freezer.o
 obj-$(CONFIG_CPUSETS) += cpuset.o
 obj-$(CONFIG_UTS_NS) += utsname.o
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 8552513..c04e971 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -57,6 +57,8 @@
 #include <linux/vmalloc.h> /* TODO: replace with more sophisticated array */
 #include <linux/kthread.h>
 #include <linux/delay.h>
+#include <linux/proc_ns.h>
+#include <linux/cgroup_namespace.h>
 
 #include <linux/atomic.h>
 
@@ -196,6 +198,15 @@ static int cgroup_addrm_files(struct cgroup *cgrp, struct cftype cfts[],
 			      bool is_add);
 static void cgroup_pidlist_destroy_all(struct cgroup *cgrp);
 
+struct cgroup_namespace init_cgroup_ns = {
+	.count = {
+		.counter = 1,
+	},
+	.proc_inum = PROC_CGROUP_INIT_INO,
+	.user_ns = &init_user_ns,
+	.root_cgrp = &cgrp_dfl_root.cgrp,
+};
+
 /* IDR wrappers which synchronize using cgroup_idr_lock */
 static int cgroup_idr_alloc(struct idr *idr, void *ptr, int start, int end,
 			    gfp_t gfp_mask)
@@ -2333,6 +2344,12 @@ static int cgroup_attach_task(struct cgroup *dst_cgrp,
 	struct task_struct *task;
 	int ret;
 
+	/* Only allow changing cgroups accessible within task's cgroup
+	 * namespace. i.e. 'dst_cgrp' should be a descendant of task's
+	 * cgroupns->root_cgrp. */
+	if (!cgroup_is_descendant(dst_cgrp, task_cgroupns_root(leader)))
+		return -EPERM;
+
 	/* look up all src csets */
 	down_read(&css_set_rwsem);
 	rcu_read_lock();
@@ -4551,6 +4568,13 @@ static int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
 	parent = cgroup_kn_lock_live(parent_kn);
 	if (!parent)
 		return -ENODEV;
+
+	/* Allow mkdir only within process's cgroup namespace root. */
+	if (!cgroup_is_descendant(parent, task_cgroupns_root(current))) {
+		ret = -EPERM;
+		goto out_unlock;
+	}
+
 	root = parent->root;
 
 	/* allocate the cgroup and its ID, 0 is reserved for the root */
@@ -4819,6 +4843,14 @@ static int cgroup_rmdir(struct kernfs_node *kn)
 	cgrp = cgroup_kn_lock_live(kn);
 	if (!cgrp)
 		return 0;
+
+	/* Allow rmdir only within process's cgroup namespace root.
+	 * The process can't delete its own root anyway. */
+	if (!cgroup_is_descendant(cgrp, task_cgroupns_root(current))) {
+		cgroup_kn_unlock(kn);
+		return -EPERM;
+	}
+
 	cgroup_get(cgrp);	/* for @kn->priv clearing */
 
 	ret = cgroup_destroy_locked(cgrp);
diff --git a/kernel/cgroup_namespace.c b/kernel/cgroup_namespace.c
new file mode 100644
index 0000000..a2e6804
--- /dev/null
+++ b/kernel/cgroup_namespace.c
@@ -0,0 +1,128 @@
+
+#include <linux/cgroup.h>
+#include <linux/cgroup_namespace.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/nsproxy.h>
+#include <linux/proc_ns.h>
+
+static struct cgroup_namespace *alloc_cgroup_ns(void)
+{
+	struct cgroup_namespace *new_ns;
+
+	new_ns = kmalloc(sizeof(struct cgroup_namespace), GFP_KERNEL);
+	if (new_ns)
+		atomic_set(&new_ns->count, 1);
+	return new_ns;
+}
+
+void free_cgroup_ns(struct cgroup_namespace *ns)
+{
+	cgroup_put(ns->root_cgrp);
+	put_user_ns(ns->user_ns);
+	proc_free_inum(ns->proc_inum);
+}
+EXPORT_SYMBOL(free_cgroup_ns);
+
+struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
+					struct user_namespace *user_ns,
+					struct cgroup_namespace *old_ns)
+{
+	struct cgroup_namespace *new_ns = NULL;
+	struct cgroup *cgrp = NULL;
+	int err;
+
+	BUG_ON(!old_ns);
+
+	if (!(flags & CLONE_NEWCGROUP))
+		return get_cgroup_ns(old_ns);
+
+	/* Allow only sysadmin to create cgroup namespace. */
+	err = -EPERM;
+	if (!capable(CAP_SYS_ADMIN))
+		goto err_out;
+
+	/* Prevent cgroup changes for this task. */
+	threadgroup_lock(current);
+
+	cgrp = get_task_cgroup(current);
+
+	/* Creating new CGROUPNS is supported only when unified hierarchy is in
+	 * use. */
+	err = -EINVAL;
+	if (!cgroup_on_dfl(cgrp))
+		goto err_out_unlock;
+
+	err = -ENOMEM;
+	new_ns = alloc_cgroup_ns();
+	if (!new_ns)
+		goto err_out_unlock;
+
+	err = proc_alloc_inum(&new_ns->proc_inum);
+	if (err)
+		goto err_out_unlock;
+
+	new_ns->user_ns = get_user_ns(user_ns);
+	new_ns->root_cgrp = cgrp;
+
+	threadgroup_unlock(current);
+
+	return new_ns;
+
+err_out_unlock:
+	threadgroup_unlock(current);
+err_out:
+	if (cgrp)
+		cgroup_put(cgrp);
+	kfree(new_ns);
+	return ERR_PTR(err);
+}
+
+static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
+{
+	pr_info("setns not supported for cgroup namespace\n");
+	return -EINVAL;
+}
+
+static void *cgroupns_get(struct task_struct *task)
+{
+	struct cgroup_namespace *ns = NULL;
+	struct nsproxy *nsproxy;
+
+	rcu_read_lock();
+	nsproxy = task_nsproxy(task);
+	if (nsproxy) {
+		ns = nsproxy->cgroup_ns;
+		get_cgroup_ns(ns);
+	}
+	rcu_read_unlock();
+
+	return ns;
+}
+
+static void cgroupns_put(void *ns)
+{
+	put_cgroup_ns(ns);
+}
+
+static unsigned int cgroupns_inum(void *ns)
+{
+	struct cgroup_namespace *cgroup_ns = ns;
+
+	return cgroup_ns->proc_inum;
+}
+
+const struct proc_ns_operations cgroupns_operations = {
+	.name		= "cgroup",
+	.type		= CLONE_NEWCGROUP,
+	.get		= cgroupns_get,
+	.put		= cgroupns_put,
+	.install	= cgroupns_install,
+	.inum		= cgroupns_inum,
+};
+
+static __init int cgroup_namespaces_init(void)
+{
+	return 0;
+}
+subsys_initcall(cgroup_namespaces_init);
diff --git a/kernel/fork.c b/kernel/fork.c
index d2799d1..95981a1 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1747,7 +1747,7 @@ static int check_unshare_flags(unsigned long unshare_flags)
 	if (unshare_flags & ~(CLONE_THREAD|CLONE_FS|CLONE_NEWNS|CLONE_SIGHAND|
 				CLONE_VM|CLONE_FILES|CLONE_SYSVSEM|
 				CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWNET|
-				CLONE_NEWUSER|CLONE_NEWPID))
+				CLONE_NEWUSER|CLONE_NEWPID|CLONE_NEWCGROUP))
 		return -EINVAL;
 	/*
 	 * Not implemented, but pretend it works if there is nothing to
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index 8e78110..e20298c 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -25,6 +25,7 @@
 #include <linux/proc_ns.h>
 #include <linux/file.h>
 #include <linux/syscalls.h>
+#include <linux/cgroup_namespace.h>
 
 static struct kmem_cache *nsproxy_cachep;
 
@@ -39,6 +40,7 @@ struct nsproxy init_nsproxy = {
 #ifdef CONFIG_NET
 	.net_ns			= &init_net,
 #endif
+	.cgroup_ns		= &init_cgroup_ns,
 };
 
 static inline struct nsproxy *create_nsproxy(void)
@@ -92,6 +94,13 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
 		goto out_pid;
 	}
 
+	new_nsp->cgroup_ns = copy_cgroup_ns(flags, user_ns,
+					    tsk->nsproxy->cgroup_ns);
+	if (IS_ERR(new_nsp->cgroup_ns)) {
+		err = PTR_ERR(new_nsp->cgroup_ns);
+		goto out_cgroup;
+	}
+
 	new_nsp->net_ns = copy_net_ns(flags, user_ns, tsk->nsproxy->net_ns);
 	if (IS_ERR(new_nsp->net_ns)) {
 		err = PTR_ERR(new_nsp->net_ns);
@@ -101,6 +110,9 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
 	return new_nsp;
 
 out_net:
+	if (new_nsp->cgroup_ns)
+		put_cgroup_ns(new_nsp->cgroup_ns);
+out_cgroup:
 	if (new_nsp->pid_ns_for_children)
 		put_pid_ns(new_nsp->pid_ns_for_children);
 out_pid:
@@ -128,7 +140,8 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
 	struct nsproxy *new_ns;
 
 	if (likely(!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
-			      CLONE_NEWPID | CLONE_NEWNET)))) {
+			      CLONE_NEWPID | CLONE_NEWNET |
+			      CLONE_NEWCGROUP)))) {
 		get_nsproxy(old_ns);
 		return 0;
 	}
@@ -165,6 +178,8 @@ void free_nsproxy(struct nsproxy *ns)
 		put_ipc_ns(ns->ipc_ns);
 	if (ns->pid_ns_for_children)
 		put_pid_ns(ns->pid_ns_for_children);
+	if (ns->cgroup_ns)
+		put_cgroup_ns(ns->cgroup_ns);
 	put_net(ns->net_ns);
 	kmem_cache_free(nsproxy_cachep, ns);
 }
@@ -180,7 +195,7 @@ int unshare_nsproxy_namespaces(unsigned long unshare_flags,
 	int err = 0;
 
 	if (!(unshare_flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
-			       CLONE_NEWNET | CLONE_NEWPID)))
+			       CLONE_NEWNET | CLONE_NEWPID | CLONE_NEWCGROUP)))
 		return 0;
 
 	user_ns = new_cred ? new_cred->user_ns : current_user_ns();
-- 
2.0.0.526.g5318336

^ permalink raw reply related	[flat|nested] 384+ messages in thread

* [PATCH 5/5] cgroup: introduce cgroup namespaces
@ 2014-07-17 19:52       ` Aditya Kali
  0 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-07-17 19:52 UTC (permalink / raw)
  To: tj, lizefan, cgroups, linux-kernel, linux-api, mingo
  Cc: containers, Aditya Kali

Introduce the ability to create a new cgroup namespace. A newly created
cgroup namespace remembers the 'struct cgroup *root_cgrp' of the creating
task at the point of creation. The task that creates the namespace and
all its future children are then restricted to the cgroup hierarchy
under this root_cgrp. In this first version, setns() is not supported
for cgroup namespaces.

The main purpose of a cgroup namespace is to virtualize the contents of
the /proc/self/cgroup file: processes inside a cgroup namespace see only
paths relative to their namespace root. This allows container tools
(libcontainer, lxc, lmctfy, etc.) to create fully virtualized containers
without leaking the system-level cgroup hierarchy to the task.

Signed-off-by: Aditya Kali <adityakali@google.com>
---
 fs/proc/namespaces.c             |   3 +
 include/linux/cgroup.h           |  18 +++++-
 include/linux/cgroup_namespace.h |  62 +++++++++++++++++++
 include/linux/nsproxy.h          |   2 +
 include/linux/proc_ns.h          |   4 ++
 init/Kconfig                     |   9 +++
 kernel/Makefile                  |   1 +
 kernel/cgroup.c                  |  32 ++++++++++
 kernel/cgroup_namespace.c        | 128 +++++++++++++++++++++++++++++++++++++++
 kernel/fork.c                    |   2 +-
 kernel/nsproxy.c                 |  19 +++++-
 11 files changed, 276 insertions(+), 4 deletions(-)
 create mode 100644 include/linux/cgroup_namespace.h
 create mode 100644 kernel/cgroup_namespace.c

diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index 8902609..e04ed4b 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -32,6 +32,9 @@ static const struct proc_ns_operations *ns_entries[] = {
 	&userns_operations,
 #endif
 	&mntns_operations,
+#ifdef CONFIG_CGROUP_NS
+	&cgroupns_operations,
+#endif
 };
 
 static const struct file_operations ns_file_operations = {
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 4ea477f..d3c6070 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -22,6 +22,8 @@
 #include <linux/seq_file.h>
 #include <linux/kernfs.h>
 #include <linux/wait.h>
+#include <linux/nsproxy.h>
+#include <linux/types.h>
 
 #ifdef CONFIG_CGROUPS
 
@@ -469,6 +471,13 @@ struct cftype {
 #endif
 };
 
+struct cgroup_namespace {
+	atomic_t		count;
+	unsigned int		proc_inum;
+	struct user_namespace	*user_ns;
+	struct cgroup		*root_cgrp;
+};
+
 extern struct cgroup_root cgrp_dfl_root;
 extern struct css_set init_css_set;
 
@@ -591,10 +600,17 @@ static inline int cgroup_name(struct cgroup *cgrp, char *buf, size_t buflen)
 	return kernfs_name(cgrp->kn, buf, buflen);
 }
 
+static inline char * __must_check cgroup_path_ns(struct cgroup_namespace *ns,
+						 struct cgroup *cgrp, char *buf,
+						 size_t buflen)
+{
+	return kernfs_path_from_node(ns->root_cgrp->kn, cgrp->kn, buf, buflen);
+}
+
 static inline char * __must_check cgroup_path(struct cgroup *cgrp, char *buf,
 					      size_t buflen)
 {
-	return kernfs_path(cgrp->kn, buf, buflen);
+	return cgroup_path_ns(current->nsproxy->cgroup_ns, cgrp, buf, buflen);
 }
 
 static inline void pr_cont_cgroup_name(struct cgroup *cgrp)
diff --git a/include/linux/cgroup_namespace.h b/include/linux/cgroup_namespace.h
new file mode 100644
index 0000000..9f637fe
--- /dev/null
+++ b/include/linux/cgroup_namespace.h
@@ -0,0 +1,62 @@
+#ifndef _LINUX_CGROUP_NAMESPACE_H
+#define _LINUX_CGROUP_NAMESPACE_H
+
+#include <linux/nsproxy.h>
+#include <linux/cgroup.h>
+#include <linux/types.h>
+#include <linux/user_namespace.h>
+
+extern struct cgroup_namespace init_cgroup_ns;
+
+static inline struct cgroup *task_cgroupns_root(struct task_struct *tsk)
+{
+	return tsk->nsproxy->cgroup_ns->root_cgrp;
+}
+
+#ifdef CONFIG_CGROUP_NS
+
+extern void free_cgroup_ns(struct cgroup_namespace *ns);
+
+static inline struct cgroup_namespace *get_cgroup_ns(
+		struct cgroup_namespace *ns)
+{
+	if (ns)
+		atomic_inc(&ns->count);
+	return ns;
+}
+
+static inline void put_cgroup_ns(struct cgroup_namespace *ns)
+{
+	if (ns && atomic_dec_and_test(&ns->count))
+		free_cgroup_ns(ns);
+}
+
+extern struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
+					       struct user_namespace *user_ns,
+					       struct cgroup_namespace *old_ns);
+
+#else  /* CONFIG_CGROUP_NS */
+
+static inline struct cgroup_namespace *get_cgroup_ns(
+		struct cgroup_namespace *ns)
+{
+	return &init_cgroup_ns;
+}
+
+static inline void put_cgroup_ns(struct cgroup_namespace *ns)
+{
+}
+
+static inline struct cgroup_namespace *copy_cgroup_ns(
+		unsigned long flags,
+		struct user_namespace *user_ns,
+		struct cgroup_namespace *old_ns) {
+	if (flags & CLONE_NEWCGROUP)
+		return ERR_PTR(-EINVAL);
+
+	return old_ns;
+}
+
+#endif  /* CONFIG_CGROUP_NS */
+
+#endif  /* _LINUX_CGROUP_NAMESPACE_H */
diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
index b4ec59d..44f388c 100644
--- a/include/linux/nsproxy.h
+++ b/include/linux/nsproxy.h
@@ -8,6 +8,7 @@ struct mnt_namespace;
 struct uts_namespace;
 struct ipc_namespace;
 struct pid_namespace;
+struct cgroup_namespace;
 struct fs_struct;
 
 /*
@@ -33,6 +34,7 @@ struct nsproxy {
 	struct mnt_namespace *mnt_ns;
 	struct pid_namespace *pid_ns_for_children;
 	struct net 	     *net_ns;
+	struct cgroup_namespace *cgroup_ns;
 };
 extern struct nsproxy init_nsproxy;
 
diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
index 34a1e10..e56dd73 100644
--- a/include/linux/proc_ns.h
+++ b/include/linux/proc_ns.h
@@ -6,6 +6,8 @@
 
 struct pid_namespace;
 struct nsproxy;
+struct task_struct;
+struct inode;
 
 struct proc_ns_operations {
 	const char *name;
@@ -27,6 +29,7 @@ extern const struct proc_ns_operations ipcns_operations;
 extern const struct proc_ns_operations pidns_operations;
 extern const struct proc_ns_operations userns_operations;
 extern const struct proc_ns_operations mntns_operations;
+extern const struct proc_ns_operations cgroupns_operations;
 
 /*
  * We always define these enumerators
@@ -37,6 +40,7 @@ enum {
 	PROC_UTS_INIT_INO	= 0xEFFFFFFEU,
 	PROC_USER_INIT_INO	= 0xEFFFFFFDU,
 	PROC_PID_INIT_INO	= 0xEFFFFFFCU,
+	PROC_CGROUP_INIT_INO	= 0xEFFFFFFBU,
 };
 
 #ifdef CONFIG_PROC_FS
diff --git a/init/Kconfig b/init/Kconfig
index 9d76b99..2f43ec9 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1101,6 +1101,15 @@ config DEBUG_BLK_CGROUP
 	Enable some debugging help. Currently it exports additional stat
 	files in a cgroup which can be useful for debugging.
 
+config CGROUP_NS
+	bool "CGroup Namespaces"
+	default n
+	help
+	  This option enables CGroup Namespaces, which can be used to isolate
+	  cgroup paths. This feature is only useful when unified cgroup
+	  hierarchy is in use (i.e. cgroups are mounted with sane_behavior
+	  option).
+
 endif # CGROUPS
 
 config CHECKPOINT_RESTORE
diff --git a/kernel/Makefile b/kernel/Makefile
index f2a8b62..61c5791 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -52,6 +52,7 @@ obj-$(CONFIG_KEXEC) += kexec.o
 obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
 obj-$(CONFIG_COMPAT) += compat.o
 obj-$(CONFIG_CGROUPS) += cgroup.o
+obj-$(CONFIG_CGROUP_NS) += cgroup_namespace.o
 obj-$(CONFIG_CGROUP_FREEZER) += cgroup_freezer.o
 obj-$(CONFIG_CPUSETS) += cpuset.o
 obj-$(CONFIG_UTS_NS) += utsname.o
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 8552513..c04e971 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -57,6 +57,8 @@
 #include <linux/vmalloc.h> /* TODO: replace with more sophisticated array */
 #include <linux/kthread.h>
 #include <linux/delay.h>
+#include <linux/proc_ns.h>
+#include <linux/cgroup_namespace.h>
 
 #include <linux/atomic.h>
 
@@ -196,6 +198,15 @@ static int cgroup_addrm_files(struct cgroup *cgrp, struct cftype cfts[],
 			      bool is_add);
 static void cgroup_pidlist_destroy_all(struct cgroup *cgrp);
 
+struct cgroup_namespace init_cgroup_ns = {
+	.count = {
+		.counter = 1,
+	},
+	.proc_inum = PROC_CGROUP_INIT_INO,
+	.user_ns = &init_user_ns,
+	.root_cgrp = &cgrp_dfl_root.cgrp,
+};
+
 /* IDR wrappers which synchronize using cgroup_idr_lock */
 static int cgroup_idr_alloc(struct idr *idr, void *ptr, int start, int end,
 			    gfp_t gfp_mask)
@@ -2333,6 +2344,12 @@ static int cgroup_attach_task(struct cgroup *dst_cgrp,
 	struct task_struct *task;
 	int ret;
 
+	/* Only allow changing cgroups accessible within task's cgroup
+	 * namespace. i.e. 'dst_cgrp' should be a descendant of task's
+	 * cgroupns->root_cgrp. */
+	if (!cgroup_is_descendant(dst_cgrp, task_cgroupns_root(leader)))
+		return -EPERM;
+
 	/* look up all src csets */
 	down_read(&css_set_rwsem);
 	rcu_read_lock();
@@ -4551,6 +4568,13 @@ static int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
 	parent = cgroup_kn_lock_live(parent_kn);
 	if (!parent)
 		return -ENODEV;
+
+	/* Allow mkdir only within process's cgroup namespace root. */
+	if (!cgroup_is_descendant(parent, task_cgroupns_root(current))) {
+		ret = -EPERM;
+		goto out_unlock;
+	}
+
 	root = parent->root;
 
 	/* allocate the cgroup and its ID, 0 is reserved for the root */
@@ -4819,6 +4843,14 @@ static int cgroup_rmdir(struct kernfs_node *kn)
 	cgrp = cgroup_kn_lock_live(kn);
 	if (!cgrp)
 		return 0;
+
+	/* Allow rmdir only within process's cgroup namespace root.
+	 * The process can't delete its own root anyways. */
+	if (!cgroup_is_descendant(cgrp, task_cgroupns_root(current))) {
+		cgroup_kn_unlock(kn);
+		return -EPERM;
+	}
+
 	cgroup_get(cgrp);	/* for @kn->priv clearing */
 
 	ret = cgroup_destroy_locked(cgrp);
diff --git a/kernel/cgroup_namespace.c b/kernel/cgroup_namespace.c
new file mode 100644
index 0000000..a2e6804
--- /dev/null
+++ b/kernel/cgroup_namespace.c
@@ -0,0 +1,128 @@
+
+#include <linux/cgroup.h>
+#include <linux/cgroup_namespace.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/nsproxy.h>
+#include <linux/proc_ns.h>
+
+static struct cgroup_namespace *alloc_cgroup_ns(void)
+{
+	struct cgroup_namespace *new_ns;
+
+	new_ns = kmalloc(sizeof(struct cgroup_namespace), GFP_KERNEL);
+	if (new_ns)
+		atomic_set(&new_ns->count, 1);
+	return new_ns;
+}
+
+void free_cgroup_ns(struct cgroup_namespace *ns)
+{
+	cgroup_put(ns->root_cgrp);
+	put_user_ns(ns->user_ns);
+	proc_free_inum(ns->proc_inum);
+}
+EXPORT_SYMBOL(free_cgroup_ns);
+
+struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
+					struct user_namespace *user_ns,
+					struct cgroup_namespace *old_ns)
+{
+	struct cgroup_namespace *new_ns = NULL;
+	struct cgroup *cgrp = NULL;
+	int err;
+
+	BUG_ON(!old_ns);
+
+	if (!(flags & CLONE_NEWCGROUP))
+		return get_cgroup_ns(old_ns);
+
+	/* Allow only sysadmin to create cgroup namespace. */
+	err = -EPERM;
+	if (!capable(CAP_SYS_ADMIN))
+		goto err_out;
+
+	/* Prevent cgroup changes for this task. */
+	threadgroup_lock(current);
+
+	cgrp = get_task_cgroup(current);
+
+	/* Creating new CGROUPNS is supported only when unified hierarchy is in
+	 * use. */
+	err = -EINVAL;
+	if (!cgroup_on_dfl(cgrp))
+		goto err_out_unlock;
+
+	err = -ENOMEM;
+	new_ns = alloc_cgroup_ns();
+	if (!new_ns)
+		goto err_out_unlock;
+
+	err = proc_alloc_inum(&new_ns->proc_inum);
+	if (err)
+		goto err_out_unlock;
+
+	new_ns->user_ns = get_user_ns(user_ns);
+	new_ns->root_cgrp = cgrp;
+
+	threadgroup_unlock(current);
+
+	return new_ns;
+
+err_out_unlock:
+	threadgroup_unlock(current);
+err_out:
+	if (cgrp)
+		cgroup_put(cgrp);
+	kfree(new_ns);
+	return ERR_PTR(err);
+}
+
+static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
+{
+	pr_info("setns not supported for cgroup namespace\n");
+	return -EINVAL;
+}
+
+static void *cgroupns_get(struct task_struct *task)
+{
+	struct cgroup_namespace *ns = NULL;
+	struct nsproxy *nsproxy;
+
+	rcu_read_lock();
+	nsproxy = task_nsproxy(task);
+	if (nsproxy) {
+		ns = nsproxy->cgroup_ns;
+		get_cgroup_ns(ns);
+	}
+	rcu_read_unlock();
+
+	return ns;
+}
+
+static void cgroupns_put(void *ns)
+{
+	put_cgroup_ns(ns);
+}
+
+static unsigned int cgroupns_inum(void *ns)
+{
+	struct cgroup_namespace *cgroup_ns = ns;
+
+	return cgroup_ns->proc_inum;
+}
+
+const struct proc_ns_operations cgroupns_operations = {
+	.name		= "cgroup",
+	.type		= CLONE_NEWCGROUP,
+	.get		= cgroupns_get,
+	.put		= cgroupns_put,
+	.install	= cgroupns_install,
+	.inum		= cgroupns_inum,
+};
+
+static __init int cgroup_namespaces_init(void)
+{
+	return 0;
+}
+subsys_initcall(cgroup_namespaces_init);
diff --git a/kernel/fork.c b/kernel/fork.c
index d2799d1..95981a1 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1747,7 +1747,7 @@ static int check_unshare_flags(unsigned long unshare_flags)
 	if (unshare_flags & ~(CLONE_THREAD|CLONE_FS|CLONE_NEWNS|CLONE_SIGHAND|
 				CLONE_VM|CLONE_FILES|CLONE_SYSVSEM|
 				CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWNET|
-				CLONE_NEWUSER|CLONE_NEWPID))
+				CLONE_NEWUSER|CLONE_NEWPID|CLONE_NEWCGROUP))
 		return -EINVAL;
 	/*
 	 * Not implemented, but pretend it works if there is nothing to
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index 8e78110..e20298c 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -25,6 +25,7 @@
 #include <linux/proc_ns.h>
 #include <linux/file.h>
 #include <linux/syscalls.h>
+#include <linux/cgroup_namespace.h>
 
 static struct kmem_cache *nsproxy_cachep;
 
@@ -39,6 +40,7 @@ struct nsproxy init_nsproxy = {
 #ifdef CONFIG_NET
 	.net_ns			= &init_net,
 #endif
+	.cgroup_ns		= &init_cgroup_ns,
 };
 
 static inline struct nsproxy *create_nsproxy(void)
@@ -92,6 +94,13 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
 		goto out_pid;
 	}
 
+	new_nsp->cgroup_ns = copy_cgroup_ns(flags, user_ns,
+					    tsk->nsproxy->cgroup_ns);
+	if (IS_ERR(new_nsp->cgroup_ns)) {
+		err = PTR_ERR(new_nsp->cgroup_ns);
+		goto out_cgroup;
+	}
+
 	new_nsp->net_ns = copy_net_ns(flags, user_ns, tsk->nsproxy->net_ns);
 	if (IS_ERR(new_nsp->net_ns)) {
 		err = PTR_ERR(new_nsp->net_ns);
@@ -101,6 +110,9 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
 	return new_nsp;
 
 out_net:
+	if (new_nsp->cgroup_ns)
+		put_cgroup_ns(new_nsp->cgroup_ns);
+out_cgroup:
 	if (new_nsp->pid_ns_for_children)
 		put_pid_ns(new_nsp->pid_ns_for_children);
 out_pid:
@@ -128,7 +140,8 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
 	struct nsproxy *new_ns;
 
 	if (likely(!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
-			      CLONE_NEWPID | CLONE_NEWNET)))) {
+			      CLONE_NEWPID | CLONE_NEWNET |
+			      CLONE_NEWCGROUP)))) {
 		get_nsproxy(old_ns);
 		return 0;
 	}
@@ -165,6 +178,8 @@ void free_nsproxy(struct nsproxy *ns)
 		put_ipc_ns(ns->ipc_ns);
 	if (ns->pid_ns_for_children)
 		put_pid_ns(ns->pid_ns_for_children);
+	if (ns->cgroup_ns)
+		put_cgroup_ns(ns->cgroup_ns);
 	put_net(ns->net_ns);
 	kmem_cache_free(nsproxy_cachep, ns);
 }
@@ -180,7 +195,7 @@ int unshare_nsproxy_namespaces(unsigned long unshare_flags,
 	int err = 0;
 
 	if (!(unshare_flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
-			       CLONE_NEWNET | CLONE_NEWPID)))
+			       CLONE_NEWNET | CLONE_NEWPID | CLONE_NEWCGROUP)))
 		return 0;
 
 	user_ns = new_cred ? new_cred->user_ns : current_user_ns();
-- 
2.0.0.526.g5318336
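
The get/put helpers in the patch above follow the usual reference-counting
pattern: the namespace starts with one reference, pins its root cgroup for
as long as it lives, and releases that pin when the last reference is
dropped. A minimal userspace sketch of that lifetime, assuming C11 atomics
in place of the kernel's atomic_t; the names root_model, ns_model,
ns_create, ns_get and ns_put are hypothetical stand-ins, not kernel API.
Note that the sketch frees the namespace object on the final put, whereas
free_cgroup_ns() as posted drops the cgroup and user namespace references
and the proc inode number but has no matching kfree() for 'ns'; whether
that is intended is a question for the author.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct cgroup: only its refcount matters here. */
struct root_model {
	atomic_int refs;
};

/* Hypothetical stand-in for struct cgroup_namespace. */
struct ns_model {
	atomic_int count;
	struct root_model *root;
};

/* Create a namespace holding one reference and pin its root. */
struct ns_model *ns_create(struct root_model *root)
{
	struct ns_model *ns = malloc(sizeof(*ns));

	if (!ns)
		return NULL;
	atomic_init(&ns->count, 1);
	atomic_fetch_add(&root->refs, 1);	/* the namespace pins its root */
	ns->root = root;
	return ns;
}

/* Take an additional reference, as get_cgroup_ns() does. */
struct ns_model *ns_get(struct ns_model *ns)
{
	if (ns)
		atomic_fetch_add(&ns->count, 1);
	return ns;
}

/* Drop a reference; returns 1 if this call released the last one. */
int ns_put(struct ns_model *ns)
{
	/* atomic_fetch_sub() returns the value before the decrement. */
	if (!ns || atomic_fetch_sub(&ns->count, 1) != 1)
		return 0;
	atomic_fetch_sub(&ns->root->refs, 1);	/* unpin the root */
	free(ns);				/* release the object itself */
	return 1;
}

/* Usage: two holders share one namespace; the root stays pinned until
 * the second put. */
void refcount_example(void)
{
	struct root_model root;
	struct ns_model *ns;

	atomic_init(&root.refs, 1);
	ns = ns_create(&root);
	assert(ns != NULL);
	assert(atomic_load(&root.refs) == 2);
	ns_get(ns);
	assert(ns_put(ns) == 0);
	assert(atomic_load(&root.refs) == 2);
	assert(ns_put(ns) == 1);
	assert(atomic_load(&root.refs) == 1);
}
```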


^ permalink raw reply related	[flat|nested] 384+ messages in thread

* Re: [PATCH 5/5] cgroup: introduce cgroup namespaces
  2014-07-17 19:52       ` Aditya Kali
@ 2014-07-17 19:57           ` Andy Lutomirski
  -1 siblings, 0 replies; 384+ messages in thread
From: Andy Lutomirski @ 2014-07-17 19:57 UTC (permalink / raw)
  To: Aditya Kali
  Cc: Linux API, Linux Containers, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Ingo Molnar

On Thu, Jul 17, 2014 at 12:52 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
> Introduce the ability to create new cgroup namespace. The newly created
> cgroup namespace remembers the 'struct cgroup *root_cgrp' at the point
> of creation of the cgroup namespace. The task that creates the new
> cgroup namespace and all its future children will now be restricted only
> to the cgroup hierarchy under this root_cgrp. In the first version,
> setns() is not supported for cgroup namespaces.
> The main purpose of cgroup namespace is to virtualize the contents
> of /proc/self/cgroup file. Processes inside a cgroup namespace
> are only able to see paths relative to their namespace root.
> This allows container-tools (like libcontainer, lxc, lmctfy, etc.)
> to create completely virtualized containers without leaking system
> level cgroup hierarchy to the task.

What happens if someone moves a task in a cgroup namespace outside of
the namespace root cgroup?

--Andy

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCH 5/5] cgroup: introduce cgroup namespaces
  2014-07-17 19:57           ` Andy Lutomirski
@ 2014-07-17 20:55               ` Aditya Kali
  -1 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-07-17 20:55 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linux API, Linux Containers, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Ingo Molnar

On Thu, Jul 17, 2014 at 12:57 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
> On Thu, Jul 17, 2014 at 12:52 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>> Introduce the ability to create new cgroup namespace. The newly created
>> cgroup namespace remembers the 'struct cgroup *root_cgrp' at the point
>> of creation of the cgroup namespace. The task that creates the new
>> cgroup namespace and all its future children will now be restricted only
>> to the cgroup hierarchy under this root_cgrp. In the first version,
>> setns() is not supported for cgroup namespaces.
>> The main purpose of cgroup namespace is to virtualize the contents
>> of /proc/self/cgroup file. Processes inside a cgroup namespace
>> are only able to see paths relative to their namespace root.
>> This allows container-tools (like libcontainer, lxc, lmctfy, etc.)
>> to create completely virtualized containers without leaking system
>> level cgroup hierarchy to the task.
>
> What happens if someone moves a task in a cgroup namespace outside of
> the namespace root cgroup?
>

An attempt to move a task outside of its cgroupns root fails with EPERM,
regardless of the privileges of the process attempting it. Once a
cgroupns is created, the task is confined to the cgroup hierarchy under
its cgroupns root until it dies.
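
The confinement described above comes down to an ancestry test: the patch
rejects the move unless the destination cgroup is the namespace root
itself or lies beneath it. A minimal userspace sketch of that test,
assuming a simplified 'struct node' with a parent pointer in place of the
kernel's 'struct cgroup'; the names node, is_descendant and try_attach are
hypothetical and only model the guard added to cgroup_attach_task().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for struct cgroup: only the parent link matters. */
struct node {
	const char *name;
	struct node *parent;
};

/* Returns 1 if @n is @ancestor itself or lies below it, which is how
 * cgroup_is_descendant() treats its arguments. */
int is_descendant(const struct node *n, const struct node *ancestor)
{
	for (; n; n = n->parent)
		if (n == ancestor)
			return 1;
	return 0;
}

/* Models the guard: refuse destinations outside the namespace root. */
int try_attach(const struct node *dst, const struct node *ns_root)
{
	if (!is_descendant(dst, ns_root))
		return -EPERM;
	return 0;
}

/* Usage: hierarchy / -> batchjobs -> c_job_id1 -> worker, with the
 * namespace rooted at c_job_id1. */
void attach_example(void)
{
	struct node root = { "/", NULL };
	struct node batch = { "batchjobs", &root };
	struct node job = { "c_job_id1", &batch };
	struct node worker = { "worker", &job };

	assert(try_attach(&worker, &job) == 0);		/* below the root */
	assert(try_attach(&job, &job) == 0);		/* the root itself */
	assert(try_attach(&batch, &job) == -EPERM);	/* parent of the root */
	assert(try_attach(&root, &job) == -EPERM);	/* global root */
}
```

The check depends only on the task's namespace root, not on credentials,
which matches the statement that the move fails irrespective of privilege.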

> --Andy

-- 
Aditya

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCH 0/5] RFC: CGroup Namespaces
       [not found]   ` <1405626731-12220-1-git-send-email-adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
                       ` (4 preceding siblings ...)
  2014-07-17 19:52       ` Aditya Kali
@ 2014-07-18 16:00     ` Serge Hallyn
  2014-07-24 16:10     ` Serge Hallyn
  2014-07-24 16:36     ` Serge Hallyn
  7 siblings, 0 replies; 384+ messages in thread
From: Serge Hallyn @ 2014-07-18 16:00 UTC (permalink / raw)
  To: Aditya Kali
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, tj-DgEjT+Ai2ygdnm+yROfE0A,
	cgroups-u79uwXL29TY76Z2rM5mHXA, mingo-H+wXaHxf7aLQT0dZR+AlfA

Quoting Aditya Kali (adityakali@google.com):
> Background
>   Cgroups and Namespaces are used together to create “virtual”
>   containers that isolates the host environment from the processes
>   running in container. But since cgroups themselves are not
>   “virtualized”, the task is always able to see global cgroups view
>   through cgroupfs mount and via /proc/self/cgroup file.
> 
>   $ cat /proc/self/cgroup 
>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
> 
>   This exposure of cgroup names to the processes running inside a
>   container results in some problems:
>   (1) The container names are typically host-container-management-agent
>       (systemd, docker/libcontainer, etc.) data and leaking its name (or
>       leaking the hierarchy) reveals too much information about the host
>       system.
>   (2) It makes the container migration across machines (CRIU) more
>       difficult as the container names need to be unique across the
>       machines in the migration domain.
>   (3) It makes it difficult to run container management tools (like
>       docker/libcontainer, lmctfy, etc.) within virtual containers
>       without adding dependency on some state/agent present outside the
>       container.
> 
>   Note that the feature proposed here is completely different than the
>   “ns cgroup” feature which existed in the linux kernel until recently.
>   The ns cgroup also attempted to connect cgroups and namespaces by
>   creating a new cgroup every time a new namespace was created. It did
>   not solve any of the above mentioned problems and was later dropped
>   from the kernel.
> 
> Introducing CGroup Namespaces
>   With unified cgroup hierarchy
>   (Documentation/cgroups/unified-hierarchy.txt), the containers can now
>   have a much more coherent cgroup view and its easy to associate a
>   container with a single cgroup. This also allows us to virtualize the
>   cgroup view for tasks inside the container.

Hi,

So right now we basically do this in userspace using cgmanager.  Each
container/chroot/whatever that has a cgproxy is 'locked' under that
proxy's cgroup.  So if root in a container asks the cgproxy for the
cgroup of pid 2000, and cgproxy is in /lxc/u1 while pid 2000 in the
container is in /lxc/u1/service1, then the response will be '/service1'.
Same happens with creating cgroups, moving pids into cgroups, etc.
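
The translation in the example above (a proxy in /lxc/u1 reporting
/lxc/u1/service1 as '/service1') can be sketched as a prefix strip with a
boundary check. This is a simplified string-based model under stated
assumptions: the function name ns_relative_path is hypothetical, paths are
absolute with no trailing slash, and the real implementations work on
cgroup objects rather than strings.

```c
#include <assert.h>
#include <string.h>

/* Writes @path relative to @ns_root into @buf. Returns 0 on success, or
 * -1 if @path lies outside @ns_root or @buf is too small. */
int ns_relative_path(const char *ns_root, const char *path,
		     char *buf, size_t buflen)
{
	size_t rlen = strlen(ns_root);
	const char *rel;

	if (rlen == 1 && ns_root[0] == '/')
		rel = path;		/* a root of "/" hides nothing */
	else if (strncmp(path, ns_root, rlen) != 0)
		return -1;		/* different subtree */
	else if (path[rlen] == '\0')
		rel = "/";		/* the namespace root itself */
	else if (path[rlen] == '/')
		rel = path + rlen;	/* strict descendant */
	else
		return -1;		/* "/lxc/u10" is not under "/lxc/u1" */

	if (strlen(rel) >= buflen)
		return -1;
	strcpy(buf, rel);
	return 0;
}

/* Usage with the paths from the example above. */
void path_example(void)
{
	char buf[64];

	assert(ns_relative_path("/lxc/u1", "/lxc/u1/service1",
				buf, sizeof(buf)) == 0);
	assert(strcmp(buf, "/service1") == 0);
	assert(ns_relative_path("/lxc/u1", "/lxc/u1", buf, sizeof(buf)) == 0);
	assert(strcmp(buf, "/") == 0);
	assert(ns_relative_path("/lxc/u1", "/lxc/u10", buf, sizeof(buf)) == -1);
}
```

The boundary check on the character after the prefix is what keeps a
sibling such as /lxc/u10 from being mistaken for a child of /lxc/u1.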

(Hoping to take a close look at this set early next week)

-serge

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCH 5/5] cgroup: introduce cgroup namespaces
       [not found]               ` <CAGr1F2Ht1q_nYGJwmQvEEyj8r3R1stgD=g3s8_5zYOTogjz-UQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2014-07-18 16:51                 ` Andy Lutomirski
  0 siblings, 0 replies; 384+ messages in thread
From: Andy Lutomirski @ 2014-07-18 16:51 UTC (permalink / raw)
  To: Aditya Kali
  Cc: Linux API, Linux Containers, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Ingo Molnar

On Jul 17, 2014 1:56 PM, "Aditya Kali" <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>
> On Thu, Jul 17, 2014 at 12:57 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
> > On Thu, Jul 17, 2014 at 12:52 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
> >> Introduce the ability to create new cgroup namespace. The newly created
> >> cgroup namespace remembers the 'struct cgroup *root_cgrp' at the point
> >> of creation of the cgroup namespace. The task that creates the new
> >> cgroup namespace and all its future children will now be restricted only
> >> to the cgroup hierarchy under this root_cgrp. In the first version,
> >> setns() is not supported for cgroup namespaces.
> >> The main purpose of cgroup namespace is to virtualize the contents
> >> of /proc/self/cgroup file. Processes inside a cgroup namespace
> >> are only able to see paths relative to their namespace root.
> >> This allows container-tools (like libcontainer, lxc, lmctfy, etc.)
> >> to create completely virtualized containers without leaking system
> >> level cgroup hierarchy to the task.
> >
> > What happens if someone moves a task in a cgroup namespace outside of
> > the namespace root cgroup?
> >
>
> Attempt to move a task outside of cgroupns root will fail with EPERM.
> This is true irrespective of the privileges of the process attempting
> this. Once cgroupns is created, the task will be confined to the
> cgroup hierarchy under its cgroupns root until it dies.

Can a task in a non-init userns create a cgroupns?  If not, that's
unusual.  If so, is it problematic if they can prevent themselves from
being moved?

I hate to say it, but it might be worth requiring explicit permission
from the cgroup manager for this.  For example, there could be a new
cgroup attribute may_unshare, and any attempt to unshare the cgroup ns
will fail with -EPERM unless the caller is in a may_unshare=1 cgroup.
may_unshare in a parent cgroup would not give child cgroups the
ability to unshare.
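
A minimal sketch of the rule proposed above, as a userspace model only.
The Cgroup class, the may_unshare field and unshare_cgroupns() are all
hypothetical names for illustration; none of them is a kernel interface:

```python
import errno
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cgroup:
    name: str
    parent: Optional["Cgroup"] = None
    may_unshare: bool = False   # hypothetical attribute from the proposal


def unshare_cgroupns(caller_cgroup: Cgroup) -> int:
    """Return 0 on success or -EPERM.

    Only the caller's own cgroup is consulted; the parent chain is
    deliberately ignored, so the permission is not inherited.
    """
    if not caller_cgroup.may_unshare:
        return -errno.EPERM
    return 0
```

Under this model, setting may_unshare on a parent does nothing for a
child: the manager has to opt each cgroup in explicitly.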

--Andy

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCH 5/5] cgroup: introduce cgroup namespaces
  2014-07-18 16:51                 ` Andy Lutomirski
@ 2014-07-18 18:51                     ` Aditya Kali
  -1 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-07-18 18:51 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linux API, Linux Containers, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Ingo Molnar

On Fri, Jul 18, 2014 at 9:51 AM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
> On Jul 17, 2014 1:56 PM, "Aditya Kali" <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>>
>> On Thu, Jul 17, 2014 at 12:57 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>> > What happens if someone moves a task in a cgroup namespace outside of
>> > the namespace root cgroup?
>> >
>>
>> Attempt to move a task outside of cgroupns root will fail with EPERM.
>> This is true irrespective of the privileges of the process attempting
>> this. Once cgroupns is created, the task will be confined to the
>> cgroup hierarchy under its cgroupns root until it dies.
>
> Can a task in a non-init userns create a cgroupns?  If not, that's
> unusual.  If so, is it problematic if they can prevent themselves from
> being moved?
>

Currently, only a task with CAP_SYS_ADMIN in the init-userns can
create cgroupns. It is stricter than for other namespaces, yes.

> I hate to say it, but it might be worth requiring explicit permission
> from the cgroup manager for this.  For example, there could be a new
> cgroup attribute may_unshare, and any attempt to unshare the cgroup ns
> will fail with -EPERM unless the caller is in a may_unshare=1 cgroup.
> may_unshare in a parent cgroup would not give child cgroups the
> ability to unshare.
>

What you suggest can be done. The current patch-set punts on the
problem of permission checking by only allowing unshare from a
capable(CAP_SYS_ADMIN) process. Your suggestion can be implemented as a
follow-up improvement to the cgroupns feature if we want to open it to
non-init userns.

That being said, I would argue that even if we don't have this
explicit permission and relax the check to non-init userns, it should
be 'OK' to let ns_capable(current_user_ns(), CAP_SYS_ADMIN) tasks
unshare cgroupns (basically, if you can "create" a cgroup hierarchy,
you should probably be allowed to unshare() it). By unsharing
cgroupns, tasks can only confine themselves further under their
cgroupns-root. As long as they cannot escape that hierarchy, it should
be fine.
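
To make the difference between the two checks concrete, here is a
simplified userspace model in Python. It is only a sketch of how
capability checks walk the user-namespace tree: the class names and the
string capability are assumptions for illustration, and real kernel
details (securebits, LSM hooks and so on) are left out.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

CAP_SYS_ADMIN = "CAP_SYS_ADMIN"


@dataclass(eq=False)
class UserNs:
    owner_uid: int                      # uid (in the parent ns) that created it
    parent: Optional["UserNs"] = None


@dataclass(eq=False)
class Task:
    euid: int
    user_ns: UserNs
    caps: FrozenSet[str] = frozenset()


INIT_USER_NS = UserNs(owner_uid=0)


def ns_capable(task: Task, target: UserNs, cap: str) -> bool:
    """Does task hold cap with respect to the target user namespace?"""
    ns = target
    while True:
        if ns is task.user_ns:
            return cap in task.caps     # same namespace: use the task's own set
        if ns.parent is None:
            return False                # reached init ns without finding the task
        if ns.parent is task.user_ns and ns.owner_uid == task.euid:
            return True                 # the creator owns all caps in the child ns
        ns = ns.parent


def capable(task: Task, cap: str) -> bool:
    """The stricter check: capability relative to the init user namespace."""
    return ns_capable(task, INIT_USER_NS, cap)
```

In this model a "root in a container" task passes
ns_capable(task, task.user_ns, CAP_SYS_ADMIN) but fails
capable(task, CAP_SYS_ADMIN), which is the gap being discussed.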

In my experience, there is seldom a need to move tasks out of their
cgroup. At most, we create a sub-cgroup and move the task there (which
is allowed in their cgroupns). Even for a cgroup manager, I can't
think of a case where it will be useful to move a task from one cgroup
hierarchy to another. Such a move seems overly complicated (even without
cgroup namespaces). The cgroup manager can just modify the settings of
the task's cgroup as needed or simply kill & restart the task in a new
container.


> --Andy


Thanks,
-- 
Aditya

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCH 5/5] cgroup: introduce cgroup namespaces
  2014-07-18 18:51                     ` Aditya Kali
@ 2014-07-18 18:57                         ` Andy Lutomirski
  -1 siblings, 0 replies; 384+ messages in thread
From: Andy Lutomirski @ 2014-07-18 18:57 UTC (permalink / raw)
  To: Aditya Kali
  Cc: Linux API, Linux Containers, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Ingo Molnar

On Fri, Jul 18, 2014 at 11:51 AM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
> On Fri, Jul 18, 2014 at 9:51 AM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>> On Jul 17, 2014 1:56 PM, "Aditya Kali" <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>>>
>>> On Thu, Jul 17, 2014 at 12:57 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>>> > What happens if someone moves a task in a cgroup namespace outside of
>>> > the namespace root cgroup?
>>> >
>>>
>>> Attempt to move a task outside of cgroupns root will fail with EPERM.
>>> This is true irrespective of the privileges of the process attempting
>>> this. Once cgroupns is created, the task will be confined to the
>>> cgroup hierarchy under its cgroupns root until it dies.
>>
>> Can a task in a non-init userns create a cgroupns?  If not, that's
>> unusual.  If so, is it problematic if they can prevent themselves from
>> being moved?
>>
>
> Currently, only a task with CAP_SYS_ADMIN in the init-userns can
> create cgroupns. It is stricter than for other namespaces, yes.

I'm slightly hesitant to have unshare(CLONE_NEWUSER |
CLONE_NEWCGROUPNS | ...) start having weird side effects that are
visible outside the namespace, especially when those side effects
don't happen (because the call fails entirely) if
unshare(CLONE_NEWUSER) happens first.  I don't see a real problem with
it, but it's weird.

>
>> I hate to say it, but it might be worth requiring explicit permission
>> from the cgroup manager for this.  For example, there could be a new
>> cgroup attribute may_unshare, and any attempt to unshare the cgroup ns
>> will fail with -EPERM unless the caller is in a may_unshare=1 cgroup.
>> may_unshare in a parent cgroup would not give child cgroups the
>> ability to unshare.
>>
>
> What you suggest can be done. The current patch-set punts the problem
> of permission checking by only allowing unshare from a
> capable(CAP_SYS_ADMIN) process. This can be implemented as a follow-up
> improvement to cgroupns feature if we want to open it to non-init
> userns.
>
> Being said that, I would argue that even if we don't have this
> explicit permission and relax the check to non-init userns, it should
> be 'OK' to let ns_capable(current_user_ns(), CAP_SYS_ADMIN) tasks to
> unshare cgroupns (basically, if you can "create" a cgroup hierarchy,
> you should probably be allowed to unshare() it).

But non-init-userns tasks can't create cgroup hierarchies, unless I
misunderstand the current code.  And, if they can, I bet I can find
three or four serious security issues in an hour or two. :)

> By unsharing
> cgroupns, the tasks can only confine themselves further under its
> cgroupns-root. As long as it cannot escape that hierarchy, it should
> be fine.

But they can also *lock* their hierarchy.

> In my experience, there is seldom a need to move tasks out of their
> cgroup. At most, we create a sub-cgroup and move the task there (which
> is allowed in their cgroupns). Even for a cgroup manager, I can't
> think of a case where it will be useful to move a task from one cgroup
> hierarchy to another. Such move seems overly complicated (even without
> cgroup namespaces). The cgroup manager can just modify the settings of
> the task's cgroup as needed or simply kill & restart the task in a new
> container.
>

I do this all the time.  Maybe my new systemd overlords will make me
stop doing it, at which point my current production setup will blow
up.

--Andy

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCH 5/5] cgroup: introduce cgroup namespaces
       [not found]                         ` <CALCETrVeeL71sfVdbzRx0FpGrvQKbviEmUcMEosbUU+UJNQu9w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2014-07-21 22:11                           ` Aditya Kali
  0 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-07-21 22:11 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linux API, Linux Containers, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Ingo Molnar

On Fri, Jul 18, 2014 at 11:57 AM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
> On Fri, Jul 18, 2014 at 11:51 AM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>> On Fri, Jul 18, 2014 at 9:51 AM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>>> On Jul 17, 2014 1:56 PM, "Aditya Kali" <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>>>>
>>>> On Thu, Jul 17, 2014 at 12:57 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>>>> > What happens if someone moves a task in a cgroup namespace outside of
>>>> > the namespace root cgroup?
>>>> >
>>>>
>>>> Attempt to move a task outside of cgroupns root will fail with EPERM.
>>>> This is true irrespective of the privileges of the process attempting
>>>> this. Once cgroupns is created, the task will be confined to the
>>>> cgroup hierarchy under its cgroupns root until it dies.
>>>
>>> Can a task in a non-init userns create a cgroupns?  If not, that's
>>> unusual.  If so, is it problematic if they can prevent themselves from
>>> being moved?
>>>
>>
>> Currently, only a task with CAP_SYS_ADMIN in the init-userns can
>> create cgroupns. It is stricter than for other namespaces, yes.
>
> I'm slightly hesitant to have unshare(CLONE_NEWUSER |
> CLONE_NEWCGROUPNS | ...) start having weird side effects that are
> visible outside the namespace, especially when those side effects
> don't happen (because the call fails entirely) if
> unshare(CLONE_NEWUSER) happens first.  I don't see a real problem with
> it, but it's weird.
>

I expect this to be only in the initial version of the patch. We can
make this consistent with other namespaces once we figure out how
cgroupns can be safely enabled for non-init-userns.

>>
>>> I hate to say it, but it might be worth requiring explicit permission
>>> from the cgroup manager for this.  For example, there could be a new
>>> cgroup attribute may_unshare, and any attempt to unshare the cgroup ns
>>> will fail with -EPERM unless the caller is in a may_unshare=1 cgroup.
>>> may_unshare in a parent cgroup would not give child cgroups the
>>> ability to unshare.
>>>
>>
>> What you suggest can be done. The current patch-set punts the problem
>> of permission checking by only allowing unshare from a
>> capable(CAP_SYS_ADMIN) process. This can be implemented as a follow-up
>> improvement to cgroupns feature if we want to open it to non-init
>> userns.
>>
>> Being said that, I would argue that even if we don't have this
>> explicit permission and relax the check to non-init userns, it should
>> be 'OK' to let ns_capable(current_user_ns(), CAP_SYS_ADMIN) tasks to
>> unshare cgroupns (basically, if you can "create" a cgroup hierarchy,
>> you should probably be allowed to unshare() it).
>
> But non-init-userns tasks can't create cgroup hierarchies, unless I
> misunderstand the current code.  And, if they can, I bet I can find
> three or four serious security issues in an hour or two. :)
>

A task running in a non-init userns can create cgroup hierarchies if you
chown/chgrp its cgroup root to the task's user:

# while running as 'root' (uid=0)
$ cd  $CGROUP_MOUNT
$ mkdir -p batchjobs/c_job_id1/

# transfer ownership to the user (in this case 'nobody' (uid=99)).
$ chown nobody batchjobs/c_job_id1/
$ chgrp nobody batchjobs/c_job_id1/
$ ls -ld batchjobs/c_job_id1/
drwxr-xr-x 2 nobody nobody 0 2014-07-21 12:47 batchjobs/c_job_id1/

# enter container cgroup
$ echo 0 > batchjobs/c_job_id1/cgroup.procs

# unshare both userns and cgroupns
$ unshare -u -c
# setup uid_map and gid_map and export user '99' in the userns
#    $ cat /proc/<pid>/uid_map
#         0          0          1
#        99         99          1
#    $ cat /proc/<pid>/gid_map
#         0          0          1
#        99         99          1
# switch to user 'nobody'
$ su nobody
$ id
uid=99(nobody) gid=99(nobody) groups=99(nobody)

# Now user nobody running under non-init userns can create sub-cgroups
# under "batchjobs/c_job_id1/".
# PWD=$CGROUP_MOUNT/batchjobs/c_job_id1
$ mkdir sub_cgroup1
$ ls -ld sub_cgroup1/
drwxr-xr-x 2 nobody nobody 0 2014-07-21 13:11 sub_cgroup1/
$ echo 0 > sub_cgroup1/cgroup.procs
$ cat /proc/self/cgroup
0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgroup1
$ ls -l sub_cgroup1/
total 0
-r--r--r-- 1 nobody nobody 0 2014-07-21 13:11 cgroup.controllers
-r--r--r-- 1 nobody nobody 0 2014-07-21 13:11 cgroup.populated
-rw-r--r-- 1 nobody nobody 0 2014-07-21 13:12 cgroup.procs
-rw-r--r-- 1 nobody nobody 0 2014-07-21 13:11 cgroup.subtree_control
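
For readers less familiar with the uid_map/gid_map files shown in the
comments above: each line reads "first ID inside the namespace, first
ID outside it, length of the range". A small Python sketch of the
lookup (the helper name is hypothetical) could look like this:

```python
from typing import Optional


def to_outside_id(id_map: str, inside_id: int) -> Optional[int]:
    """Translate an ID seen inside a userns to the ID outside it.

    id_map is the text of a uid_map or gid_map file. Returns None when
    inside_id falls in no mapped range.
    """
    for line in id_map.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue                     # skip blank or malformed lines
        inside_start, outside_start, length = map(int, fields)
        if inside_start <= inside_id < inside_start + length:
            return outside_start + (inside_id - inside_start)
    return None


# The two-line map used in the session above: uid 0 and uid 99 only.
EMAIL_MAP = """\
         0          0          1
        99         99          1
"""
```

With that map, only uids 0 and 99 exist inside the namespace, which is
why the session can "su nobody" but nothing else.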


This is a powerful feature as it allows non-root tasks to run
container-management tools and provision their resources properly. But
this makes implementing your suggestion of having a 'cgroup.may_unshare'
file tricky, as the cgroup owner (task) will be able to set it and
still unshare cgroupns. Instead, maybe we could just check if the
task has appropriate (write?) permissions on the cgroup directory
before allowing nested cgroupns creation.
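
A userspace approximation of such a check, as a sketch only. It assumes
plain POSIX mode bits decide the outcome and ignores root/capability
overrides and ACLs; the function name is invented for this example, and
the real check would of course live in the kernel:

```python
import os
import stat
from typing import Iterable


def has_write_permission(cgroup_dir: str, uid: int, gids: Iterable[int]) -> bool:
    """Would uid (with supplementary gids) be allowed to write cgroup_dir?

    Follows the usual owner / group / other precedence of mode bits.
    """
    st = os.stat(cgroup_dir)
    if uid == st.st_uid:
        return bool(st.st_mode & stat.S_IWUSR)
    if st.st_gid in set(gids):
        return bool(st.st_mode & stat.S_IWGRP)
    return bool(st.st_mode & stat.S_IWOTH)
```

For the drwxr-xr-x directory owned by nobody:nobody in the session
above, this returns True for uid 99 and False for any other
unprivileged uid.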

>> By unsharing
>> cgroupns, the tasks can only confine themselves further under its
>> cgroupns-root. As long as it cannot escape that hierarchy, it should
>> be fine.
>
> But they can also *lock* their hierarchy.
>

But locking the tasks inside the hierarchy is really what the cgroupns
feature is trying to provide. I understand that this is a change in
expectation, but with unified hierarchy, there are already
restrictions on where tasks can be moved (only to leaf cgroups). With
cgroup namespaces, this becomes: "only to leaf cgroups within the task's
cgroupns".
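
The combined rule can be written down as a short Python model. This is
an illustration under simplifying assumptions, not the patch's logic:
the hierarchy is a plain dict from cgroup path to child paths, and the
returned strings are made-up labels rather than kernel error codes.

```python
from typing import Dict, List


def check_move(children: Dict[str, List[str]], ns_root: str, dest: str) -> str:
    """Classify an attempt to move a task into cgroup dest.

    children maps each cgroup path to its direct children; ns_root is
    the root cgroup of the task's cgroup namespace.
    """
    inside = dest == ns_root or dest.startswith(ns_root.rstrip("/") + "/")
    if not inside:
        return "denied: outside cgroupns root"
    if children.get(dest):
        return "denied: not a leaf cgroup"
    return "ok"


# Hierarchy from the session above.
TREE = {
    "/batchjobs": ["/batchjobs/c_job_id1"],
    "/batchjobs/c_job_id1": ["/batchjobs/c_job_id1/sub_cgroup1"],
    "/batchjobs/c_job_id1/sub_cgroup1": [],
}
```

With ns_root set to /batchjobs/c_job_id1, moving into sub_cgroup1 is
accepted, moving into /batchjobs is refused by the namespace rule, and
moving into /batchjobs/c_job_id1 itself is refused by the leaf rule.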

>> In my experience, there is seldom a need to move tasks out of their
>> cgroup. At most, we create a sub-cgroup and move the task there (which
>> is allowed in their cgroupns). Even for a cgroup manager, I can't
>> think of a case where it will be useful to move a task from one cgroup
>> hierarchy to another. Such move seems overly complicated (even without
>> cgroup namespaces). The cgroup manager can just modify the settings of
>> the task's cgroup as needed or simply kill & restart the task in a new
>> container.
>>
>
> I do this all the time.  Maybe my new systemd overlords will make me
> stop doing it, at which point my current production setup will blow
> up.
>

[shudder]
I am surprised that this even works correctly.

Either way, maybe checking cgroup directory permissions will work for
you? I.e., if you "chown" a cgroup directory to the user, it should be
OK if the user's task unshares cgroupns under that cgroup and you
don't care about moving tasks from under that cgroup. Without
ownership of the cgroup directory, creation of cgroupns will be
disallowed. What do you think?


> --Andy



-- 
Aditya

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCH 5/5] cgroup: introduce cgroup namespaces
       [not found]                         ` <CALCETrVeeL71sfVdbzRx0FpGrvQKbviEmUcMEosbUU+UJNQu9w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2014-07-21 22:11                           ` Aditya Kali
  0 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-07-21 22:11 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linux Containers, linux-kernel, cgroups, Li Zefan, Linux API,
	Tejun Heo, Ingo Molnar

On Fri, Jul 18, 2014 at 11:57 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Fri, Jul 18, 2014 at 11:51 AM, Aditya Kali <adityakali@google.com> wrote:
>> On Fri, Jul 18, 2014 at 9:51 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>>> On Jul 17, 2014 1:56 PM, "Aditya Kali" <adityakali@google.com> wrote:
>>>>
>>>> On Thu, Jul 17, 2014 at 12:57 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>> > What happens if someone moves a task in a cgroup namespace outside of
>>>> > the namespace root cgroup?
>>>> >
>>>>
>>>> Attempt to move a task outside of cgroupns root will fail with EPERM.
>>>> This is true irrespective of the privileges of the process attempting
>>>> this. Once cgroupns is created, the task will be confined to the
>>>> cgroup hierarchy under its cgroupns root until it dies.
>>>
>>> Can a task in a non-init userns create a cgroupns?  If not, that's
>>> unusual.  If so, is it problematic if they can prevent themselves from
>>> being moved?
>>>
>>
>> Currently, only a task with CAP_SYS_ADMIN in the init-userns can
>> create cgroupns. It is stricter than for other namespaces, yes.
>
> I'm slightly hesitant to have unshare(CLONE_NEWUSER |
> CLONE_NEWCGROUPNS | ...) start having weird side effects that are
> visible outside the namespace, especially when those side effects
> don't happen (because the call fails entirely) if
> unshare(CLONE_NEWUSER) happens first.  I don't see a real problem with
> it, but it's weird.
>

I expect this to be only in the initial version of the patch. We can
make this consistent with other namespaces once we figure out how
cgroupns can be safely enabled for non-init-userns.

>>
>>> I hate to say it, but it might be worth requiring explicit permission
>>> from the cgroup manager for this.  For example, there could be a new
>>> cgroup attribute may_unshare, and any attempt to unshare the cgroup ns
>>> will fail with -EPERM unless the caller is in a may_unshare=1 cgroup.
>>> may_unshare in a parent cgroup would not give child cgroups the
>>> ability to unshare.
>>>
>>
>> What you suggest can be done. The current patch-set punts the problem
>> of permission checking by only allowing unshare from a
>> capable(CAP_SYS_ADMIN) process. This can be implemented as a follow-up
>> improvement to cgroupns feature if we want to open it to non-init
>> userns.
>>
>> That being said, I would argue that even if we don't have this
>> explicit permission and relax the check to non-init userns, it should
>> be 'OK' to let ns_capable(current_user_ns(), CAP_SYS_ADMIN) tasks
>> unshare cgroupns (basically, if you can "create" a cgroup hierarchy,
>> you should probably be allowed to unshare() it).
>
> But non-init-userns tasks can't create cgroup hierarchies, unless I
> misunderstand the current code.  And, if they can, I bet I can find
> three or four serious security issues in an hour or two. :)
>

A task running in a non-init userns can create cgroup hierarchies if
you chown/chgrp its cgroup root to the task's user:

# while running as 'root' (uid=0)
$ cd  $CGROUP_MOUNT
$ mkdir -p batchjobs/c_job_id1/

# transfer ownership to the user (in this case 'nobody' (uid=99)).
$ chown nobody batchjobs/c_job_id1/
$ chgrp nobody batchjobs/c_job_id1/
$ ls -ld batchjobs/c_job_id1/
drwxr-xr-x 2 nobody nobody 0 2014-07-21 12:47 batchjobs/c_job_id1/

# enter container cgroup
$ echo 0 > batchjobs/c_job_id1/cgroup.procs

# unshare both userns and cgroupns
$ unshare -u -c
# set up uid_map and gid_map and export user '99' in the userns
#    $ cat /proc/<pid>/uid_map
#         0          0          1
#        99         99          1
#    $ cat /proc/<pid>/gid_map
#         0          0          1
#        99         99          1
# switch to user 'nobody'
$ su nobody
$ id
uid=99(nobody) gid=99(nobody) groups=99(nobody)

# Now user nobody running under non-init userns can create sub-cgroups
# under "batchjobs/c_job_id1/".
# PWD=$CGROUP_MOUNT/batchjobs/c_job_id1
$ mkdir sub_cgroup1
$ ls -ld sub_cgroup1/
drwxr-xr-x 2 nobody nobody 0 2014-07-21 13:11 sub_cgroup1/
$ echo 0 > sub_cgroup1/cgroup.procs
$ cat /proc/self/cgroup
0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgroup1
$ ls -l sub_cgroup1/
total 0
-r--r--r-- 1 nobody nobody 0 2014-07-21 13:11 cgroup.controllers
-r--r--r-- 1 nobody nobody 0 2014-07-21 13:11 cgroup.populated
-rw-r--r-- 1 nobody nobody 0 2014-07-21 13:12 cgroup.procs
-rw-r--r-- 1 nobody nobody 0 2014-07-21 13:11 cgroup.subtree_control
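
To make the /proc/self/cgroup output above concrete: the task sees
"/sub_cgroup1" rather than "/batchjobs/c_job_id1/sub_cgroup1" because
the path is reported relative to its cgroupns root. The following is
only a rough userspace sketch of that translation, not the kernel code
from the patch; the function names are made up for illustration, and
rejecting paths outside the root is a choice of this sketch rather than
something the patch is known to do.

```python
import posixpath


def parse_proc_cgroup_line(line):
    """Split one /proc/<pid>/cgroup line into (hierarchy id, controllers, path)."""
    hier_id, controllers, path = line.rstrip("\n").split(":", 2)
    names = controllers.split(",") if controllers else []
    return int(hier_id), names, path


def cgroup_path_in_ns(abs_path, ns_root):
    """Render an absolute cgroup path as seen from a cgroupns rooted at ns_root.

    Hypothetical helper: both arguments are absolute paths inside the
    cgroup hierarchy (not filesystem paths under the cgroupfs mount).
    """
    rel = posixpath.relpath(posixpath.normpath(abs_path),
                            posixpath.normpath(ns_root))
    if rel == ".":
        return "/"  # the task sits exactly at its namespace root
    if rel == ".." or rel.startswith("../"):
        # Outside the namespace root; this sketch simply refuses.
        raise ValueError("cgroup is outside the cgroupns root")
    return "/" + rel


if __name__ == "__main__":
    # Host view of the task from the transcript above.
    host_view = "/batchjobs/c_job_id1/sub_cgroup1"
    print(cgroup_path_in_ns(host_view, "/batchjobs/c_job_id1"))  # /sub_cgroup1
```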


This is a powerful feature, as it allows non-root tasks to run
container-management tools and provision their resources properly. But
it makes your suggestion of a 'cgroup.may_unshare' file tricky to
implement: the cgroup owner (the task) would be able to set the file
itself and then unshare cgroupns anyway. Instead, maybe we could just
check whether the task has appropriate (write?) permissions on the
cgroup directory before allowing nested cgroupns creation.
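
As a rough sketch of what such a directory-permission check could look
like (this is not part of the patch-set; the function names are
hypothetical, and it models only the classic owner/group/other write
bits, ignoring capabilities, ACLs and userns id mappings):

```python
import os
import stat


def has_write_permission(st_mode, st_uid, st_gid, uid, gids):
    """Classic UNIX write check: owner bits, then group bits, then other bits."""
    if uid == st_uid:
        return bool(st_mode & stat.S_IWUSR)
    if st_gid in gids:
        return bool(st_mode & stat.S_IWGRP)
    return bool(st_mode & stat.S_IWOTH)


def may_create_cgroupns(cgroup_dir, uid, gids):
    """Hypothetical policy: allow unshare only with write access to the cgroup dir."""
    st = os.stat(cgroup_dir)
    return stat.S_ISDIR(st.st_mode) and has_write_permission(
        st.st_mode, st.st_uid, st.st_gid, uid, gids)
```

With the directory from the example above (drwxr-xr-x, owned by
nobody:nobody, i.e. uid=99 and gid=99), this check would pass for user
'nobody' and fail for any other unprivileged user, including other
members of group 'nobody', since the group write bit is not set.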

>> By unsharing
>> cgroupns, the tasks can only confine themselves further under its
>> cgroupns-root. As long as it cannot escape that hierarchy, it should
>> be fine.
>
> But they can also *lock* their hierarchy.
>

But locking the tasks inside the hierarchy is really what the cgroupns
feature is trying to provide. I understand that this is a change in
expectations, but with the unified hierarchy there are already
restrictions on where tasks can be moved (only to leaf cgroups). With
cgroup namespaces, this becomes: "only to leaf cgroups within the
task's cgroupns".
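
The confinement rule can be pictured as a simple containment test on
cgroup paths. This is only an illustrative sketch under my own naming
(check_move is not a function from the patch), and it deliberately
leaves out the separate "leaf cgroups only" restriction of the unified
hierarchy:

```python
import errno
import posixpath


def check_move(ns_root, target):
    """Return 0 if target is at or below ns_root, else -EPERM.

    Both arguments are absolute cgroup paths as seen from the host.
    The comparison is done per path component, so "/a/b10" is not
    treated as being under "/a/b1".
    """
    ns_root = posixpath.normpath(ns_root)
    target = posixpath.normpath(target)
    if posixpath.commonpath([ns_root, target]) == ns_root:
        return 0
    return -errno.EPERM
```

For the example above, moving the task to
/batchjobs/c_job_id1/sub_cgroup1 is allowed, while moving it to a
sibling such as /batchjobs/c_job_id2 is rejected with -EPERM.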

>> In my experience, there is seldom a need to move tasks out of their
>> cgroup. At most, we create a sub-cgroup and move the task there (which
>> is allowed in their cgroupns). Even for a cgroup manager, I can't
>> think of a case where it will be useful to move a task from one cgroup
>> hierarchy to another. Such move seems overly complicated (even without
>> cgroup namespaces). The cgroup manager can just modify the settings of
>> the task's cgroup as needed or simply kill & restart the task in a new
>> container.
>>
>
> I do this all the time.  Maybe my new systemd overlords will make me
> stop doing it, at which point my current production setup will blow
> up.
>

[shudder]
I am surprised that this even works correctly.

Either way, maybe checking cgroup directory permissions will work for
you? That is, if you "chown" a cgroup directory to the user, it should
be OK for the user's task to unshare cgroupns under that cgroup,
provided you don't care about moving tasks out from under that cgroup.
Without ownership of the cgroup directory, creation of a cgroupns
would be disallowed. What do you think?


> --Andy



-- 
Aditya

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCH 5/5] cgroup: introduce cgroup namespaces
       [not found]                           ` <CAGr1F2Fd_4=WUm4STPd4kdd5tNLO6aQ1OOQMKnRqyOKZSGvCpg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2014-07-21 22:16                             ` Andy Lutomirski
  0 siblings, 0 replies; 384+ messages in thread
From: Andy Lutomirski @ 2014-07-21 22:16 UTC (permalink / raw)
  To: Aditya Kali
  Cc: Linux API, Linux Containers, linux-kernel,
	Tejun Heo, cgroups, Ingo Molnar

On Mon, Jul 21, 2014 at 3:11 PM, Aditya Kali <adityakali@google.com> wrote:
> On Fri, Jul 18, 2014 at 11:57 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Fri, Jul 18, 2014 at 11:51 AM, Aditya Kali <adityakali@google.com> wrote:
>>> On Fri, Jul 18, 2014 at 9:51 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>> On Jul 17, 2014 1:56 PM, "Aditya Kali" <adityakali@google.com> wrote:
>>>>>
>>>>> On Thu, Jul 17, 2014 at 12:57 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>>> > What happens if someone moves a task in a cgroup namespace outside of
>>>>> > the namespace root cgroup?
>>>>> >
>>>>>
>>>>> Attempt to move a task outside of cgroupns root will fail with EPERM.
>>>>> This is true irrespective of the privileges of the process attempting
>>>>> this. Once cgroupns is created, the task will be confined to the
>>>>> cgroup hierarchy under its cgroupns root until it dies.
>>>>
>>>> Can a task in a non-init userns create a cgroupns?  If not, that's
>>>> unusual.  If so, is it problematic if they can prevent themselves from
>>>> being moved?
>>>>
>>>
>>> Currently, only a task with CAP_SYS_ADMIN in the init-userns can
>>> create cgroupns. It is stricter than for other namespaces, yes.
>>
>> I'm slightly hesitant to have unshare(CLONE_NEWUSER |
>> CLONE_NEWCGROUPNS | ...) start having weird side effects that are
>> visible outside the namespace, especially when those side effects
>> don't happen (because the call fails entirely) if
>> unshare(CLONE_NEWUSER) happens first.  I don't see a real problem with
>> it, but it's weird.
>>
>
> I expect this to be only in the initial version of the patch. We can
> make this consistent with other namespaces once we figure out how
> cgroupns can be safely enabled for non-init-userns.
>
>>>
>>>> I hate to say it, but it might be worth requiring explicit permission
>>>> from the cgroup manager for this.  For example, there could be a new
>>>> cgroup attribute may_unshare, and any attempt to unshare the cgroup ns
>>>> will fail with -EPERM unless the caller is in a may_unshare=1 cgroup.
>>>> may_unshare in a parent cgroup would not give child cgroups the
>>>> ability to unshare.
>>>>
>>>
>>> What you suggest can be done. The current patch-set punts the problem
>>> of permission checking by only allowing unshare from a
>>> capable(CAP_SYS_ADMIN) process. This can be implemented as a follow-up
>>> improvement to cgroupns feature if we want to open it to non-init
>>> userns.
>>>
>>> That being said, I would argue that even if we don't have this
>>> explicit permission and relax the check to non-init userns, it should
>>> be 'OK' to let ns_capable(current_user_ns(), CAP_SYS_ADMIN) tasks
>>> unshare cgroupns (basically, if you can "create" a cgroup hierarchy,
>>> you should probably be allowed to unshare() it).
>>
>> But non-init-userns tasks can't create cgroup hierarchies, unless I
>> misunderstand the current code.  And, if they can, I bet I can find
>> three or four serious security issues in an hour or two. :)
>>
>
> A task running in a non-init userns can create cgroup hierarchies if
> you chown/chgrp its cgroup root to the task's user:

Won't the systemd people hate you forever for this suggestion?  (I do
exactly this myself...)


> This is a powerful feature as it allows non-root tasks to run
> container-management tools and provision their resources properly. But
> this makes implementing your suggestion of having 'cgroup.may_unshare'
> file tricky as the cgroup owner (task) will be able to set it and
> still unshare cgroupns. Instead, maybe we could just check if the
> task has appropriate (write?) permissions on the cgroup directory
> before allowing nested cgroupns creation.

I bet that systemd will want to set may_unshare but not give write
access.  Who knows?

> [shudder]
> I am surprised that this even works correctly.
>
> Either way, maybe checking cgroup directory permissions will work for
> you? i.e., if you "chown" a cgroup directory to the user, it should be
> OK if the user's task unshares cgroupns under that cgroup and you
> don't care about moving tasks from under that cgroup. Without
> ownership of the cgroup directory, creation of cgroupns will be
> disallowed. What do you think?

I think this is *safe* but may not be useful for eventual systemd stuff.
Not really sure.

--Andy

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCH 5/5] cgroup: introduce cgroup namespaces
  2014-07-21 22:16                             ` Andy Lutomirski
@ 2014-07-23 19:52                                 ` Aditya Kali
  -1 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-07-23 19:52 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linux API, Linux Containers, linux-kernel,
	Tejun Heo, cgroups, Ingo Molnar

On Mon, Jul 21, 2014 at 3:16 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Mon, Jul 21, 2014 at 3:11 PM, Aditya Kali <adityakali@google.com> wrote:
>> On Fri, Jul 18, 2014 at 11:57 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>>> On Fri, Jul 18, 2014 at 11:51 AM, Aditya Kali <adityakali@google.com> wrote:
>>>> On Fri, Jul 18, 2014 at 9:51 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>>> On Jul 17, 2014 1:56 PM, "Aditya Kali" <adityakali@google.com> wrote:
>>>>>>
>>>>>> On Thu, Jul 17, 2014 at 12:57 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>>>> > What happens if someone moves a task in a cgroup namespace outside of
>>>>>> > the namespace root cgroup?
>>>>>> >
>>>>>>
>>>>>> Attempt to move a task outside of cgroupns root will fail with EPERM.
>>>>>> This is true irrespective of the privileges of the process attempting
>>>>>> this. Once cgroupns is created, the task will be confined to the
>>>>>> cgroup hierarchy under its cgroupns root until it dies.
>>>>>
>>>>> Can a task in a non-init userns create a cgroupns?  If not, that's
>>>>> unusual.  If so, is it problematic if they can prevent themselves from
>>>>> being moved?
>>>>>
>>>>
>>>> Currently, only a task with CAP_SYS_ADMIN in the init-userns can
>>>> create cgroupns. It is stricter than for other namespaces, yes.
>>>
>>> I'm slightly hesitant to have unshare(CLONE_NEWUSER |
>>> CLONE_NEWCGROUPNS | ...) start having weird side effects that are
>>> visible outside the namespace, especially when those side effects
>>> don't happen (because the call fails entirely) if
>>> unshare(CLONE_NEWUSER) happens first.  I don't see a real problem with
>>> it, but it's weird.
>>>
>>
>> I expect this to be only in the initial version of the patch. We can
>> make this consistent with other namespaces once we figure out how
>> cgroupns can be safely enabled for non-init-userns.
>>
>>>>
>>>>> I hate to say it, but it might be worth requiring explicit permission
>>>>> from the cgroup manager for this.  For example, there could be a new
>>>>> cgroup attribute may_unshare, and any attempt to unshare the cgroup ns
>>>>> will fail with -EPERM unless the caller is in a may_unshare=1 cgroup.
>>>>> may_unshare in a parent cgroup would not give child cgroups the
>>>>> ability to unshare.
>>>>>
>>>>
>>>> What you suggest can be done. The current patch-set punts the problem
>>>> of permission checking by only allowing unshare from a
>>>> capable(CAP_SYS_ADMIN) process. This can be implemented as a follow-up
>>>> improvement to cgroupns feature if we want to open it to non-init
>>>> userns.
>>>>
>>>> That being said, I would argue that even if we don't have this
>>>> explicit permission and relax the check to non-init userns, it should
>>>> be 'OK' to let ns_capable(current_user_ns(), CAP_SYS_ADMIN) tasks
>>>> unshare cgroupns (basically, if you can "create" a cgroup hierarchy,
>>>> you should probably be allowed to unshare() it).
>>>
>>> But non-init-userns tasks can't create cgroup hierarchies, unless I
>>> misunderstand the current code.  And, if they can, I bet I can find
>>> three or four serious security issues in an hour or two. :)
>>>
>>
>> A task running in a non-init userns can create cgroup hierarchies if you
>> chown/chgrp their cgroup root to the task user:
>
> Won't the systemd people hate you forever for this suggestion?  (I do
> exactly this myself...)
>

I was actually thinking this feature will really simplify container
management tools (since cgroupns allows you to recursively run them
inside containers without any hacks). I would appreciate any feedback
from them on how we can improve this to help their usecase.

Thanks for your comments!

>
>> This is a powerful feature as it allows non-root tasks to run
>> container-management tools and provision their resources properly. But
>> this makes implementing your suggestion of having 'cgroup.may_unshare'
>> file tricky as the cgroup owner (task) will be able to set it and
>> still unshare cgroupns. Instead, maybe we could just check if the
>> task has appropriate (write?) permissions on the cgroup directory
>> before allowing nested cgroupns creation.
>
> I bet that systemd will want to set may_unshare but not give write
> access.  Who knows?
>
>> [shudder]
>> I am surprised that this even works correctly.
>>
>> Either way, maybe checking cgroup directory permissions will work for
>> you? i.e., if you "chown" a cgroup directory to the user, it should be
>> OK if the user's task unshares cgroupns under that cgroup and you
>> don't care about moving tasks from under that cgroup. Without
>> ownership of the cgroup directory, creation of cgroupns will be
>> disallowed. What do you think?
>
> I think this is *safe* but may not be useful for eventual systemd stuff.
> Not really sure.
>
> --Andy
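
For illustration only: the directory-ownership rule discussed above can be
approximated from userspace as below. This is a minimal sketch, not the
kernel-side check and not part of the posted patches. The helper name
owns_cgroup_dir() is hypothetical, and comparing st_uid against the
effective uid is an assumption about what "ownership" would mean here.

```c
#include <assert.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not a kernel API): returns 1 if @cgroup_dir is a
 * directory owned by the caller's effective uid, 0 if it is not, and -1
 * if it cannot be examined at all.
 */
static int owns_cgroup_dir(const char *cgroup_dir)
{
	struct stat st;

	assert(cgroup_dir != NULL);

	if (stat(cgroup_dir, &st) != 0)
		return -1;		/* missing or inaccessible */
	if (!S_ISDIR(st.st_mode))
		return 0;		/* not a directory, so not a cgroup */

	/* Assumed rule: only the owner of the directory may unshare. */
	return st.st_uid == geteuid() ? 1 : 0;
}
```

Under this rule a manager would chown a cgroup directory to a user to
delegate it, and an unshare attempt from a task whose cgroup directory
yields 0 or -1 here would be refused.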



-- 
Aditya

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCH 1/5] kernfs: Add API to get generate relative kernfs path
       [not found]       ` <1405626731-12220-2-git-send-email-adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2014-07-24 15:10         ` Serge Hallyn
  0 siblings, 0 replies; 384+ messages in thread
From: Serge Hallyn @ 2014-07-24 15:10 UTC (permalink / raw)
  To: Aditya Kali
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, tj-DgEjT+Ai2ygdnm+yROfE0A,
	cgroups-u79uwXL29TY76Z2rM5mHXA, mingo-H+wXaHxf7aLQT0dZR+AlfA

Quoting Aditya Kali (adityakali@google.com):
> The new function kernfs_path_from_node() generates and returns
> kernfs path of a given kernfs_node relative to a given parent
> kernfs_node.
> 
> Signed-off-by: Aditya Kali <adityakali@google.com>
> ---
>  fs/kernfs/dir.c        | 51 ++++++++++++++++++++++++++++++++++++++++----------
>  include/linux/kernfs.h |  3 +++
>  2 files changed, 44 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> index a693f5b..2224f08 100644
> --- a/fs/kernfs/dir.c
> +++ b/fs/kernfs/dir.c
> @@ -44,14 +44,22 @@ static int kernfs_name_locked(struct kernfs_node *kn, char *buf, size_t buflen)
>  	return strlcpy(buf, kn->parent ? kn->name : "/", buflen);
>  }
>  
> -static char * __must_check kernfs_path_locked(struct kernfs_node *kn, char *buf,
> -					      size_t buflen)
> +static char * __must_check kernfs_path_from_node_locked(
> +	struct kernfs_node *kn_root,
> +	struct kernfs_node *kn,
> +	char *buf,
> +	size_t buflen)
>  {
>  	char *p = buf + buflen;
>  	int len;
>  
>  	*--p = '\0';

I realize this currently couldn't happen (hm, well, through the
EXPORT_SYMBOL_GPL(kernfs_path) it actually could), and it's the same in the
current code, but could you add a BUG_ON(!buflen) here?

Otherwise looks good to me.
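
For illustration only: a minimal userspace sketch of the same
build-from-the-end loop, with assert() standing in for the requested
BUG_ON(!buflen). struct node and path_from_node() are hypothetical
stand-ins for struct kernfs_node and kernfs_path_from_node_locked(). There
is no locking, and the extra space check in the kn == kn_root branch is an
addition of this sketch, not something the patch contains.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct kernfs_node: a name and a parent. */
struct node {
	const char *name;
	struct node *parent;
};

/*
 * Build the path of @n relative to @root, writing from the end of @buf.
 * A NULL @root yields the full path. Returns NULL if @buf is too small.
 */
static char *path_from_node(struct node *root, struct node *n,
			    char *buf, size_t buflen)
{
	char *p;
	size_t len;

	/* With buflen == 0 the "*--p" below would write before buf. */
	assert(buflen > 0);

	p = buf + buflen;
	*--p = '\0';

	if (n == root) {
		if (p == buf)		/* no room left for the '/' */
			return NULL;
		*--p = '/';
		return p;
	}

	do {
		len = strlen(n->name);
		if ((size_t)(p - buf) < len + 1) {
			buf[0] = '\0';
			return NULL;
		}
		p -= len;
		memcpy(p, n->name, len);
		*--p = '/';
		n = n->parent;
		if (n == root)		/* reached the requested root */
			break;
	} while (n && n->parent);

	return p;
}
```

With nodes for /batchjobs/c_job_id1/sub_cgrp_1, passing the c_job_id1 node
as @root gives "/sub_cgrp_1", and passing NULL gives the full path.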

Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>


>  
> +	if (kn == kn_root) {
> +		*--p = '/';
> +		return p;
> +	}
> +
>  	do {
>  		len = strlen(kn->name);
>  		if (p - buf < len + 1) {
> @@ -63,6 +71,8 @@ static char * __must_check kernfs_path_locked(struct kernfs_node *kn, char *buf,
>  		memcpy(p, kn->name, len);
>  		*--p = '/';
>  		kn = kn->parent;
> +		if (kn == kn_root)
> +			break;
>  	} while (kn && kn->parent);
>  
>  	return p;
> @@ -92,26 +102,47 @@ int kernfs_name(struct kernfs_node *kn, char *buf, size_t buflen)
>  }
>  
>  /**
> - * kernfs_path - build full path of a given node
> + * kernfs_path_from_node - build path of node @kn relative to @kn_root.
> + * @kn_root: parent kernfs_node relative to which we need to build the path
>   * @kn: kernfs_node of interest
> - * @buf: buffer to copy @kn's name into
> + * @buf: buffer to copy @kn's path into
>   * @buflen: size of @buf
>   *
> - * Builds and returns the full path of @kn in @buf of @buflen bytes.  The
> - * path is built from the end of @buf so the returned pointer usually
> + * Builds and returns @kn's path relative to @kn_root. @kn_root is expected to
> + * be parent of @kn at some level. If this is not true or if @kn_root is NULL,
> + * then full path of @kn is returned.
> + * The path is built from the end of @buf so the returned pointer usually
>   * doesn't match @buf.  If @buf isn't long enough, @buf is nul terminated
>   * and %NULL is returned.
>   */
> -char *kernfs_path(struct kernfs_node *kn, char *buf, size_t buflen)
> +char *kernfs_path_from_node(struct kernfs_node *kn_root, struct kernfs_node *kn,
> +			    char *buf, size_t buflen)
>  {
>  	unsigned long flags;
>  	char *p;
>  
>  	spin_lock_irqsave(&kernfs_rename_lock, flags);
> -	p = kernfs_path_locked(kn, buf, buflen);
> +	p = kernfs_path_from_node_locked(kn_root, kn, buf, buflen);
>  	spin_unlock_irqrestore(&kernfs_rename_lock, flags);
>  	return p;
>  }
> +EXPORT_SYMBOL_GPL(kernfs_path_from_node);
> +
> +/**
> + * kernfs_path - build full path of a given node
> + * @kn: kernfs_node of interest
> + * @buf: buffer to copy @kn's name into
> + * @buflen: size of @buf
> + *
> + * Builds and returns the full path of @kn in @buf of @buflen bytes.  The
> + * path is built from the end of @buf so the returned pointer usually
> + * doesn't match @buf.  If @buf isn't long enough, @buf is nul terminated
> + * and %NULL is returned.
> + */
> +char *kernfs_path(struct kernfs_node *kn, char *buf, size_t buflen)
> +{
> +	return kernfs_path_from_node(NULL, kn, buf, buflen);
> +}
>  EXPORT_SYMBOL_GPL(kernfs_path);
>  
>  /**
> @@ -145,8 +176,8 @@ void pr_cont_kernfs_path(struct kernfs_node *kn)
>  
>  	spin_lock_irqsave(&kernfs_rename_lock, flags);
>  
> -	p = kernfs_path_locked(kn, kernfs_pr_cont_buf,
> -			       sizeof(kernfs_pr_cont_buf));
> +	p = kernfs_path_from_node_locked(NULL, kn, kernfs_pr_cont_buf,
> +					 sizeof(kernfs_pr_cont_buf));
>  	if (p)
>  		pr_cont("%s", p);
>  	else
> diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
> index 20f4935..1627341 100644
> --- a/include/linux/kernfs.h
> +++ b/include/linux/kernfs.h
> @@ -257,6 +257,9 @@ static inline bool kernfs_ns_enabled(struct kernfs_node *kn)
>  }
>  
>  int kernfs_name(struct kernfs_node *kn, char *buf, size_t buflen);
> +char * __must_check kernfs_path_from_node(struct kernfs_node *root_kn,
> +					  struct kernfs_node *kn, char *buf,
> +					  size_t buflen);
>  char * __must_check kernfs_path(struct kernfs_node *kn, char *buf,
>  				size_t buflen);
>  void pr_cont_kernfs_name(struct kernfs_node *kn);
> -- 
> 2.0.0.526.g5318336
> 
> _______________________________________________
> Containers mailing list
> Containers@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCH 0/5] RFC: CGroup Namespaces
       [not found]   ` <1405626731-12220-1-git-send-email-adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
                       ` (5 preceding siblings ...)
  2014-07-18 16:00     ` [PATCH 0/5] RFC: CGroup Namespaces Serge Hallyn
@ 2014-07-24 16:10     ` Serge Hallyn
  2014-07-24 16:36     ` Serge Hallyn
  7 siblings, 0 replies; 384+ messages in thread
From: Serge Hallyn @ 2014-07-24 16:10 UTC (permalink / raw)
  To: Aditya Kali
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, tj-DgEjT+Ai2ygdnm+yROfE0A,
	cgroups-u79uwXL29TY76Z2rM5mHXA, mingo-H+wXaHxf7aLQT0dZR+AlfA

Quoting Aditya Kali (adityakali@google.com):
> Background
>   Cgroups and Namespaces are used together to create “virtual”
>   containers that isolate the host environment from the processes
>   running in container. But since cgroups themselves are not
>   “virtualized”, the task is always able to see global cgroups view
>   through cgroupfs mount and via /proc/self/cgroup file.

Hi,

A few questions/comments:

1. Based on this description, am I to understand that after doing a
   cgroupns unshare, 'mount -t cgroup cgroup /mnt' by default will
   still mount the global root cgroup?  Any plans on "changing" that?
   Will attempts to change settings of a cgroup which is not under
   our current ns be rejected?  (That should be easy to do given your
   patch 1/5).  Sorry if it's done in the set, I'm jumping around...

2. What would be the repercussions of allowing cgroupns unshare so
   long as you have ns_capable(CAP_SYS_ADMIN) to the user_ns which
   created your current ns cgroup?  It'd be a shame if that wasn't
   on the roadmap.

3. The un-namespaced view of /proc/self/cgroup from a sibling cgroupns
   makes me wonder whether it wouldn't be more appropriate to leave
   /proc/self/cgroup always un-filtered, and use /proc/self/nscgroup
   (or somesuch) to provide the namespaced view.  /proc/self/nscgroup
   would simply be empty (or say (invalid) or (unreachable)) from a
   sibling ns.  That will give criu and admin tools like lxc/docker all
   they need to do simple cgroup setup.
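
   For illustration only, the namespaced view in (3) amounts to stripping
   the cgroupns-root prefix at a path-component boundary. A minimal
   userspace sketch follows. cgroup_path_in_ns() is a hypothetical name,
   and returning "(unreachable)" for a cgroup outside the namespace root is
   just one of the options floated above, not something the patch set does.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper: map the real cgroup path @path to the view seen
 * from a cgroup namespace rooted at @ns_root. Both are absolute paths
 * without a trailing slash (except the root itself, "/").
 */
static const char *cgroup_path_in_ns(const char *ns_root, const char *path)
{
	size_t n;

	assert(ns_root != NULL && path != NULL);
	n = strlen(ns_root);

	/* The initial namespace is rooted at "/": nothing to strip. */
	if (strcmp(ns_root, "/") == 0)
		return path;

	if (strncmp(path, ns_root, n) != 0)
		return "(unreachable)";	/* e.g. a sibling namespace */

	if (path[n] == '\0')
		return "/";		/* the namespace root itself */
	if (path[n] == '/')
		return path + n;	/* a descendant of the root */

	/* Prefix matched mid-component, e.g. c_job_id1 vs c_job_id10. */
	return "(unreachable)";
}
```

   For a namespace rooted at /batchjobs/c_job_id1, the real path
   /batchjobs/c_job_id1/sub_cgrp_1 maps to /sub_cgrp_1, while
   /batchjobs/c_job_id2 maps to "(unreachable)".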

> 
>   $ cat /proc/self/cgroup 
>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
> 
>   This exposure of cgroup names to the processes running inside a
>   container results in some problems:
>   (1) The container names are typically host-container-management-agent
>       (systemd, docker/libcontainer, etc.) data and leaking its name (or
>       leaking the hierarchy) reveals too much information about the host
>       system.
>   (2) It makes the container migration across machines (CRIU) more
>       difficult as the container names need to be unique across the
>       machines in the migration domain.
>   (3) It makes it difficult to run container management tools (like
>       docker/libcontainer, lmctfy, etc.) within virtual containers
>       without adding dependency on some state/agent present outside the
>       container.
> 
>   Note that the feature proposed here is completely different than the
>   “ns cgroup” feature which existed in the linux kernel until recently.
>   The ns cgroup also attempted to connect cgroups and namespaces by
>   creating a new cgroup every time a new namespace was created. It did
>   not solve any of the above mentioned problems and was later dropped
>   from the kernel.
> 
> Introducing CGroup Namespaces
>   With unified cgroup hierarchy
>   (Documentation/cgroups/unified-hierarchy.txt), the containers can now
>   have a much more coherent cgroup view and it's easy to associate a
>   container with a single cgroup. This also allows us to virtualize the
>   cgroup view for tasks inside the container.
> 
>   The new CGroup Namespace allows a process to “unshare” its cgroup
>   hierarchy starting from the cgroup it's currently in.
>   For Ex:
>   $ cat /proc/self/cgroup
>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>   $ ls -l /proc/self/ns/cgroup
>   lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
>   $ ~/unshare -c  # calls unshare(CLONE_NEWCGROUP) and exec’s /bin/bash
>   [ns]$ ls -l /proc/self/ns/cgroup
>   lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
>   # From within new cgroupns, process sees that it's in the root cgroup
>   [ns]$ cat /proc/self/cgroup
>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
> 
>   # From global cgroupns:
>   $ cat /proc/<pid>/cgroup
>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
> 
>   The virtualization of /proc/self/cgroup file, combined with restricting
>   the view of cgroup hierarchy (by bind-mounting the
>   $CGROUP_MOUNT/batchjobs/c_job_id1/ directory to
>   $CONTAINER_CHROOT/sys/fs/cgroup/), should provide a completely isolated
>   cgroup view inside the container.
> 
>   In its current simplistic form, the cgroup namespaces provide
>   following behavior:
> 
>   (1) The “root” cgroup for a cgroup namespace is the cgroup in which
>       the process calling unshare is running.
>       For ex. if a process in /batchjobs/c_job_id1 cgroup calls unshare,
>       cgroup /batchjobs/c_job_id1 becomes the cgroupns-root.
>       For the init_cgroup_ns, this is the real root (“/”) cgroup
>       (identified in code as cgrp_dfl_root.cgrp).
> 
>   (2) The cgroupns-root cgroup does not change even if the namespace
>       creator process later moves to a different cgroup.
>       $ ~/unshare -c # unshare cgroupns in some cgroup
>       [ns]$ cat /proc/self/cgroup 
>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/ 
>       [ns]$ mkdir sub_cgrp_1
>       [ns]$ echo 0 > sub_cgrp_1/cgroup.procs
>       [ns]$ cat /proc/self/cgroup 
>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
> 
>   (3) Each process gets its CGROUPNS specific view of
>       /proc/<pid>/cgroup.
>   (a) Processes running inside the cgroup namespace will be able to see
>       cgroup paths (in /proc/self/cgroup) only inside their root cgroup
>       [ns]$ sleep 100000 &  # From within unshared cgroupns
>       [1] 7353
>       [ns]$ echo 7353 > sub_cgrp_1/cgroup.procs
>       [ns]$ cat /proc/7353/cgroup
>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
> 
>   (b) From global cgroupns, the real cgroup path will be visible:
>       $ cat /proc/7353/cgroup
>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
> 
>   (c) From a sibling cgroupns, the real path will be visible:
>       [ns2]$ cat /proc/7353/cgroup
>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
>       (In correct container setup though, it should not be possible to
>        access PIDs in another container in the first place. This can be
>        changed if desired.)
> 
>   (4) Processes inside a cgroupns are not allowed to move out of the
>       cgroupns-root. This is true even if a privileged process in global
>       cgroupns tries to move the process out of its cgroupns-root.
> 
>       # From global cgroupns
>       $ cat /proc/7353/cgroup
>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
>       # cgroupns-root for 7353 is /batchjobs/c_job_id1
>       $ echo 7353 > batchjobs/c_job_id2/cgroup.procs
>       -bash: echo: write error: Operation not permitted
> 
>   (5) setns() is not supported for cgroup namespace in the initial
>       version.
> 
>   (6) When some thread from a multi-threaded process unshares its
>       cgroup-namespace, the new cgroupns gets applied to the entire
>       process (all the threads). This should be OK since
>       unified-hierarchy only allows process-level containerization. So
>       all the threads in the process will have the same cgroup. And both
>       - changing cgroups and unsharing namespaces - are protected under
>       threadgroup_lock(task).
> 
>   (7) The cgroup namespace is alive as long as there is atleast 1
>       process inside it. When the last process exits, the cgroup
>       namespace is destroyed. The cgroupns-root and the actual cgroups
>       remain though.
> 
> Implementation
>   The current patch-set is based on top of Tejun's cgroup tree (for-next
>   branch). Its fairly non-intrusive and provides above mentioned
>   features.
> 
> Possible extensions of CGROUPNS:
>   (1) The Documentation/cgroups/unified-hierarchy.txt mentions use of
>       capabilities to restrict cgroups to administrative users. CGroup
>       namespaces could be of help here. With cgroup namespaces, it might
>       be possible to delegate administration of sub-cgroups under a
>       cgroupns-root to the cgroupns owner.
> 
>   (2) Provide a cgroupns specific cgroupfs mount. i.e., the following
>       command when ran from inside a cgroupns should only mount the
>       hierarchy from cgroupns-root cgroup:
>       $ mount -t cgroup cgroup <cgroup-mountpoint>
>       # -o __DEVEL__sane_behavior should be implicit
> 
>       This is similar to how procfs can be mounted for every PIDNS. This
>       may have some usecases.
> 
> ---
>  fs/kernfs/dir.c                  |  51 +++++++++++++---
>  fs/proc/namespaces.c             |   3 +
>  include/linux/cgroup.h           |  36 ++++++++++-
>  include/linux/cgroup_namespace.h |  62 +++++++++++++++++++
>  include/linux/kernfs.h           |   3 +
>  include/linux/nsproxy.h          |   2 +
>  include/linux/proc_ns.h          |   4 ++
>  include/uapi/linux/sched.h       |   3 +-
>  init/Kconfig                     |   9 +++
>  kernel/Makefile                  |   1 +
>  kernel/cgroup.c                  |  75 +++++++++++++++++------
>  kernel/cgroup_namespace.c        | 128 +++++++++++++++++++++++++++++++++++++++
>  kernel/fork.c                    |   2 +-
>  kernel/nsproxy.c                 |  19 +++++-
>  14 files changed, 364 insertions(+), 34 deletions(-)
>  create mode 100644 include/linux/cgroup_namespace.h
>  create mode 100644 kernel/cgroup_namespace.c
> 
> [PATCH 1/5] kernfs: Add API to get generate relative kernfs path
> [PATCH 2/5] sched: new clone flag CLONE_NEWCGROUP for cgroup
> [PATCH 3/5] cgroup: add function to get task's cgroup on default
> [PATCH 4/5] cgroup: export cgroup_get() and cgroup_put()
> [PATCH 5/5] cgroup: introduce cgroup namespaces
> _______________________________________________
> Containers mailing list
> Containers@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCH 0/5] RFC: CGroup Namespaces
       [not found]   ` <1405626731-12220-1-git-send-email-adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2014-07-24 16:10     ` Serge Hallyn
  2014-07-17 19:52     ` [PATCH 2/5] sched: new clone flag CLONE_NEWCGROUP for cgroup namespace Aditya Kali
                       ` (6 subsequent siblings)
  7 siblings, 0 replies; 384+ messages in thread
From: Serge Hallyn @ 2014-07-24 16:10 UTC (permalink / raw)
  To: Aditya Kali
  Cc: tj, lizefan, cgroups, linux-kernel, linux-api, mingo, containers

Quoting Aditya Kali (adityakali@google.com):
> Background
>   Cgroups and Namespaces are used together to create “virtual”
>   containers that isolates the host environment from the processes
>   running in container. But since cgroups themselves are not
>   “virtualized”, the task is always able to see global cgroups view
>   through cgroupfs mount and via /proc/self/cgroup file.

Hi,

A few questions/comments:

1. Based on this description, am I to understand that after doing a
   cgroupns unshare, 'mount -t cgroup cgroup /mnt' by default will
   still mount the global root cgroup?  Any plans on "changing" that?
   Will attempts to change settings of a cgroup which is not under
   our current ns be rejected?  (That should be easy to do given your
   patch 1/5).  Sorry if it's done in the set, I'm jumping around...

2. What would be the repercussions of allowing cgroupns unshare so
   long as you have ns_capable(CAP_SYS_ADMIN) to the user_ns which
   created your current ns cgroup?  It'd be a shame if that wasn't
   on the roadmap.

3. The un-namespaced view of /proc/self/cgroup from a sibling cgroupns
   makes me wonder whether it wouldn't be more appropriate to leave
   /proc/self/cgroup always un-filtered, and use /proc/self/nscgroup
   (or somesuch) to provide the namespaced view.  /proc/self/nscgroup
   would simply be empty (or say (invalid) or (unreachable)) from a
   sibling ns.  That will give criu and admin tools like lxc/docker all
   they need to do simple cgroup setup.
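
   To make point 3 concrete, here is a rough userspace sketch of the
   view such a file could report. Nothing in it comes from the patch
   set: the function name nscgroup_view and the "(unreachable)" marker
   are placeholders. It only shows the path arithmetic, taking the
   reader's cgroupns-root and the task's real cgroup path as inputs.

```shell
# Hypothetical helper, not part of the patch set: print the path that a
# reader whose cgroupns-root is $1 would see for a task whose real
# cgroup path is $2.
nscgroup_view() {
    ns_root=${1%/}   # drop a trailing slash; the global root "/" becomes ""
    real=$2
    case "$real" in
        "$ns_root")   echo "/" ;;                   # task sits exactly at the ns root
        "$ns_root"/*) echo "${real#"$ns_root"}" ;;  # task is below the ns root
        *)            echo "(unreachable)" ;;       # sibling or ancestor cgroup
    esac
}

nscgroup_view /batchjobs/c_job_id1 /batchjobs/c_job_id1/sub_cgrp_1  # /sub_cgrp_1
nscgroup_view /batchjobs/c_job_id2 /batchjobs/c_job_id1/sub_cgrp_1  # (unreachable)
nscgroup_view / /batchjobs/c_job_id1                                # /batchjobs/c_job_id1
```

   The match is on whole path components, so a root of /a/b does not
   accidentally claim a task in /a/bc.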

> 
>   $ cat /proc/self/cgroup 
>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
> 
>   This exposure of cgroup names to the processes running inside a
>   container results in some problems:
>   (1) The container names are typically host-container-management-agent
>       (systemd, docker/libcontainer, etc.) data and leaking its name (or
>       leaking the hierarchy) reveals too much information about the host
>       system.
>   (2) It makes the container migration across machines (CRIU) more
>       difficult as the container names need to be unique across the
>       machines in the migration domain.
>   (3) It makes it difficult to run container management tools (like
>       docker/libcontainer, lmctfy, etc.) within virtual containers
>       without adding dependency on some state/agent present outside the
>       container.
> 
>   Note that the feature proposed here is completely different than the
>   “ns cgroup” feature which existed in the linux kernel until recently.
>   The ns cgroup also attempted to connect cgroups and namespaces by
>   creating a new cgroup every time a new namespace was created. It did
>   not solve any of the above mentioned problems and was later dropped
>   from the kernel.
> 
> Introducing CGroup Namespaces
>   With unified cgroup hierarchy
>   (Documentation/cgroups/unified-hierarchy.txt), the containers can now
>   have a much more coherent cgroup view and its easy to associate a
>   container with a single cgroup. This also allows us to virtualize the
>   cgroup view for tasks inside the container.
> 
>   The new CGroup Namespace allows a process to “unshare” its cgroup
>   hierarchy starting from the cgroup its currently in.
>   For Ex:
>   $ cat /proc/self/cgroup
>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>   $ ls -l /proc/self/ns/cgroup
>   lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
>   $ ~/unshare -c  # calls unshare(CLONE_NEWCGROUP) and exec’s /bin/bash
>   [ns]$ ls -l /proc/self/ns/cgroup
>   lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
>   # From within new cgroupns, process sees that its in the root cgroup
>   [ns]$ cat /proc/self/cgroup
>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
> 
>   # From global cgroupns:
>   $ cat /proc/<pid>/cgroup
>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
> 
>   The virtualization of /proc/self/cgroup file combined with restricting
>   the view of cgroup hierarchy by bind-mounting for the
>   $CGROUP_MOUNT/batchjobs/c_job_id1/ directory to
>   $CONTAINER_CHROOT/sys/fs/cgroup/) should provide a completely isolated
>   cgroup view inside the container.
> 
>   In its current simplistic form, the cgroup namespaces provide
>   following behavior:
> 
>   (1) The “root” cgroup for a cgroup namespace is the cgroup in which
>       the process calling unshare is running.
>       For ex. if a process in /batchjobs/c_job_id1 cgroup calls unshare,
>       cgroup /batchjobs/c_job_id1 becomes the cgroupns-root.
>       For the init_cgroup_ns, this is the real root (“/”) cgroup
>       (identified in code as cgrp_dfl_root.cgrp).
> 
>   (2) The cgroupns-root cgroup does not change even if the namespace
>       creator process later moves to a different cgroup.
>       $ ~/unshare -c # unshare cgroupns in some cgroup
>       [ns]$ cat /proc/self/cgroup 
>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/ 
>       [ns]$ mkdir sub_cgrp_1
>       [ns]$ echo 0 > sub_cgrp_1/cgroup.procs
>       [ns]$ cat /proc/self/cgroup 
>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
> 
>   (3) Each process gets its CGROUPNS specific view of
>       /proc/<pid>/cgroup.
>   (a) Processes running inside the cgroup namespace will be able to see
>       cgroup paths (in /proc/self/cgroup) only inside their root cgroup
>       [ns]$ sleep 100000 &  # From within unshared cgroupns
>       [1] 7353
>       [ns]$ echo 7353 > sub_cgrp_1/cgroup.procs
>       [ns]$ cat /proc/7353/cgroup
>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
> 
>   (b) From global cgroupns, the real cgroup path will be visible:
>       $ cat /proc/7353/cgroup
>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
> 
>   (c) From a sibling cgroupns, the real path will be visible:
>       [ns2]$ cat /proc/7353/cgroup
>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
>       (In correct container setup though, it should not be possible to
>        access PIDs in another container in the first place. This can be
>        detected changed if desired.)
> 
>   (4) Processes inside a cgroupns are not allowed to move out of the
>       cgroupns-root. This is true even if a privileged process in global
>       cgroupns tries to move the process out of its cgroupns-root.
> 
>       # From global cgroupns
>       $ cat /proc/7353/cgroup
>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
>       # cgroupns-root for 7353 is /batchjobs/c_job_id1
>       $ echo 7353 > batchjobs/c_job_id2/cgroup.procs
>       -bash: echo: write error: Operation not permitted
> 
>   (5) setns() is not supported for cgroup namespace in the initial
>       version.
> 
>   (6) When some thread from a multi-threaded process unshares its
>       cgroup-namespace, the new cgroupns gets applied to the entire
>       process (all the threads). This should be OK since
>       unified-hierarchy only allows process-level containerization. So
>       all the threads in the process will have the same cgroup. And both
>       - changing cgroups and unsharing namespaces - are protected under
>       threadgroup_lock(task).
> 
>   (7) The cgroup namespace is alive as long as there is atleast 1
>       process inside it. When the last process exits, the cgroup
>       namespace is destroyed. The cgroupns-root and the actual cgroups
>       remain though.
> 
> Implementation
>   The current patch-set is based on top of Tejun's cgroup tree (for-next
>   branch). Its fairly non-intrusive and provides above mentioned
>   features.
> 
> Possible extensions of CGROUPNS:
>   (1) The Documentation/cgroups/unified-hierarchy.txt mentions use of
>       capabilities to restrict cgroups to administrative users. CGroup
>       namespaces could be of help here. With cgroup namespaces, it might
>       be possible to delegate administration of sub-cgroups under a
>       cgroupns-root to the cgroupns owner.
> 
>   (2) Provide a cgroupns specific cgroupfs mount. i.e., the following
>       command when ran from inside a cgroupns should only mount the
>       hierarchy from cgroupns-root cgroup:
>       $ mount -t cgroup cgroup <cgroup-mountpoint>
>       # -o __DEVEL__sane_behavior should be implicit
> 
>       This is similar to how procfs can be mounted for every PIDNS. This
>       may have some usecases.
> 
> ---
>  fs/kernfs/dir.c                  |  51 +++++++++++++---
>  fs/proc/namespaces.c             |   3 +
>  include/linux/cgroup.h           |  36 ++++++++++-
>  include/linux/cgroup_namespace.h |  62 +++++++++++++++++++
>  include/linux/kernfs.h           |   3 +
>  include/linux/nsproxy.h          |   2 +
>  include/linux/proc_ns.h          |   4 ++
>  include/uapi/linux/sched.h       |   3 +-
>  init/Kconfig                     |   9 +++
>  kernel/Makefile                  |   1 +
>  kernel/cgroup.c                  |  75 +++++++++++++++++------
>  kernel/cgroup_namespace.c        | 128 +++++++++++++++++++++++++++++++++++++++
>  kernel/fork.c                    |   2 +-
>  kernel/nsproxy.c                 |  19 +++++-
>  14 files changed, 364 insertions(+), 34 deletions(-)
>  create mode 100644 include/linux/cgroup_namespace.h
>  create mode 100644 kernel/cgroup_namespace.c
> 
> [PATCH 1/5] kernfs: Add API to get generate relative kernfs path
> [PATCH 2/5] sched: new clone flag CLONE_NEWCGROUP for cgroup
> [PATCH 3/5] cgroup: add function to get task's cgroup on default
> [PATCH 4/5] cgroup: export cgroup_get() and cgroup_put()
> [PATCH 5/5] cgroup: introduce cgroup namespaces

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCH 0/5] RFC: CGroup Namespaces
       [not found]   ` <1405626731-12220-1-git-send-email-adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
                       ` (6 preceding siblings ...)
  2014-07-24 16:10     ` Serge Hallyn
@ 2014-07-24 16:36     ` Serge Hallyn
  7 siblings, 0 replies; 384+ messages in thread
From: Serge Hallyn @ 2014-07-24 16:36 UTC (permalink / raw)
  To: Aditya Kali
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, tj-DgEjT+Ai2ygdnm+yROfE0A,
	cgroups-u79uwXL29TY76Z2rM5mHXA, mingo-H+wXaHxf7aLQT0dZR+AlfA

Quoting Aditya Kali (adityakali@google.com):
> Background
>   Cgroups and Namespaces are used together to create “virtual”
>   containers that isolates the host environment from the processes
>   running in container. But since cgroups themselves are not
>   “virtualized”, the task is always able to see global cgroups view
>   through cgroupfs mount and via /proc/self/cgroup file.
> 
>   $ cat /proc/self/cgroup 
>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
> 
>   This exposure of cgroup names to the processes running inside a
>   container results in some problems:
>   (1) The container names are typically host-container-management-agent
>       (systemd, docker/libcontainer, etc.) data and leaking its name (or
>       leaking the hierarchy) reveals too much information about the host
>       system.
>   (2) It makes the container migration across machines (CRIU) more
>       difficult as the container names need to be unique across the
>       machines in the migration domain.
>   (3) It makes it difficult to run container management tools (like
>       docker/libcontainer, lmctfy, etc.) within virtual containers
>       without adding dependency on some state/agent present outside the
>       container.
> 
>   Note that the feature proposed here is completely different than the
>   “ns cgroup” feature which existed in the linux kernel until recently.
>   The ns cgroup also attempted to connect cgroups and namespaces by
>   creating a new cgroup every time a new namespace was created. It did
>   not solve any of the above mentioned problems and was later dropped
>   from the kernel.
> 
> Introducing CGroup Namespaces
>   With unified cgroup hierarchy
>   (Documentation/cgroups/unified-hierarchy.txt), the containers can now
>   have a much more coherent cgroup view and its easy to associate a
>   container with a single cgroup. This also allows us to virtualize the
>   cgroup view for tasks inside the container.
> 
>   The new CGroup Namespace allows a process to “unshare” its cgroup
>   hierarchy starting from the cgroup its currently in.
>   For Ex:
>   $ cat /proc/self/cgroup
>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>   $ ls -l /proc/self/ns/cgroup
>   lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
>   $ ~/unshare -c  # calls unshare(CLONE_NEWCGROUP) and exec’s /bin/bash
>   [ns]$ ls -l /proc/self/ns/cgroup
>   lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
>   # From within new cgroupns, process sees that its in the root cgroup
>   [ns]$ cat /proc/self/cgroup
>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
> 
>   # From global cgroupns:
>   $ cat /proc/<pid>/cgroup
>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
> 
>   The virtualization of /proc/self/cgroup file combined with restricting
>   the view of cgroup hierarchy by bind-mounting for the
>   $CGROUP_MOUNT/batchjobs/c_job_id1/ directory to
>   $CONTAINER_CHROOT/sys/fs/cgroup/) should provide a completely isolated
>   cgroup view inside the container.
> 
>   In its current simplistic form, the cgroup namespaces provide
>   following behavior:
> 
>   (1) The “root” cgroup for a cgroup namespace is the cgroup in which
>       the process calling unshare is running.
>       For ex. if a process in /batchjobs/c_job_id1 cgroup calls unshare,
>       cgroup /batchjobs/c_job_id1 becomes the cgroupns-root.
>       For the init_cgroup_ns, this is the real root (“/”) cgroup
>       (identified in code as cgrp_dfl_root.cgrp).
> 
>   (2) The cgroupns-root cgroup does not change even if the namespace
>       creator process later moves to a different cgroup.
>       $ ~/unshare -c # unshare cgroupns in some cgroup
>       [ns]$ cat /proc/self/cgroup 
>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/ 
>       [ns]$ mkdir sub_cgrp_1
>       [ns]$ echo 0 > sub_cgrp_1/cgroup.procs
>       [ns]$ cat /proc/self/cgroup 
>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
> 
>   (3) Each process gets its CGROUPNS specific view of
>       /proc/<pid>/cgroup.
>   (a) Processes running inside the cgroup namespace will be able to see
>       cgroup paths (in /proc/self/cgroup) only inside their root cgroup
>       [ns]$ sleep 100000 &  # From within unshared cgroupns
>       [1] 7353
>       [ns]$ echo 7353 > sub_cgrp_1/cgroup.procs
>       [ns]$ cat /proc/7353/cgroup
>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
> 
>   (b) From global cgroupns, the real cgroup path will be visible:
>       $ cat /proc/7353/cgroup
>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
> 
>   (c) From a sibling cgroupns, the real path will be visible:
>       [ns2]$ cat /proc/7353/cgroup
>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
>       (In correct container setup though, it should not be possible to
>        access PIDs in another container in the first place. This can be
>        detected changed if desired.)
> 
>   (4) Processes inside a cgroupns are not allowed to move out of the
>       cgroupns-root. This is true even if a privileged process in global
>       cgroupns tries to move the process out of its cgroupns-root.
> 
>       # From global cgroupns
>       $ cat /proc/7353/cgroup
>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
>       # cgroupns-root for 7353 is /batchjobs/c_job_id1
>       $ echo 7353 > batchjobs/c_job_id2/cgroup.procs
>       -bash: echo: write error: Operation not permitted
> 
>   (5) setns() is not supported for cgroup namespace in the initial
>       version.

This combined with the full-path reporting for peer ns cgroups could make
for fun antics when attaching to an existing container (since we'd have
to unshare into a new ns cgroup with the same root as the container).
I understand you are implying this will be fixed soon though.

>   (6) When some thread from a multi-threaded process unshares its
>       cgroup-namespace, the new cgroupns gets applied to the entire
>       process (all the threads). This should be OK since
>       unified-hierarchy only allows process-level containerization. So
>       all the threads in the process will have the same cgroup. And both
>       - changing cgroups and unsharing namespaces - are protected under
>       threadgroup_lock(task).
> 
>   (7) The cgroup namespace is alive as long as there is atleast 1
>       process inside it. When the last process exits, the cgroup
>       namespace is destroyed. The cgroupns-root and the actual cgroups
>       remain though.
> 
> Implementation
>   The current patch-set is based on top of Tejun's cgroup tree (for-next
>   branch). Its fairly non-intrusive and provides above mentioned
>   features.
> 
> Possible extensions of CGROUPNS:
>   (1) The Documentation/cgroups/unified-hierarchy.txt mentions use of
>       capabilities to restrict cgroups to administrative users. CGroup
>       namespaces could be of help here. With cgroup namespaces, it might
>       be possible to delegate administration of sub-cgroups under a
>       cgroupns-root to the cgroupns owner.

That would be nice.

>   (2) Provide a cgroupns-specific cgroupfs mount, i.e., the following
>       command when run from inside a cgroupns should only mount the
>       hierarchy from the cgroupns-root cgroup:
>       $ mount -t cgroup cgroup <cgroup-mountpoint>
>       # -o __DEVEL__sane_behavior should be implicit
> 
>       This is similar to how procfs can be mounted for every PIDNS. This
>       may have some use cases.

Sorry - I see this answers the first part of a question in my previous email.
However, the question remains of whether changes to limits in cgroups which
are not under our cgroup-ns-root are allowed.

Admittedly the current case with cgmanager is the same - in that it depends
on proper setup of the container - but cgmanager is geared to recommend
not mounting the cgroups in the container at all (and we can reject such
mounts in the container altogether with no loss in functionality) whereas
you are here encouraging such mounts.  Which is fine - so long as you then
fully address the potential issues.

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCH 3/5] cgroup: add function to get task's cgroup on default hierarchy
       [not found]       ` <1405626731-12220-4-git-send-email-adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2014-07-24 16:59         ` Serge Hallyn
  0 siblings, 0 replies; 384+ messages in thread
From: Serge Hallyn @ 2014-07-24 16:59 UTC (permalink / raw)
  To: Aditya Kali
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, tj-DgEjT+Ai2ygdnm+yROfE0A,
	cgroups-u79uwXL29TY76Z2rM5mHXA, mingo-H+wXaHxf7aLQT0dZR+AlfA

Quoting Aditya Kali (adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org):
> get_task_cgroup() returns the (reference counted) cgroup of the
> given task on the default hierarchy.
> 
> Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>

Acked-by: Serge E. Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>

> ---
>  include/linux/cgroup.h |  1 +
>  kernel/cgroup.c        | 25 +++++++++++++++++++++++++
>  2 files changed, 26 insertions(+)
> 
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index b5223c5..707c302 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -591,6 +591,7 @@ static inline void pr_cont_cgroup_path(struct cgroup *cgrp)
>  }
>  
>  char *task_cgroup_path(struct task_struct *task, char *buf, size_t buflen);
> +struct cgroup *get_task_cgroup(struct task_struct *task);
>  
>  int cgroup_add_dfl_cftypes(struct cgroup_subsys *ss, struct cftype *cfts);
>  int cgroup_add_legacy_cftypes(struct cgroup_subsys *ss, struct cftype *cfts);
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 1e94b71..1671345 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -1937,6 +1937,31 @@ char *task_cgroup_path(struct task_struct *task, char *buf, size_t buflen)
>  }
>  EXPORT_SYMBOL_GPL(task_cgroup_path);
>  
> +/*
> + * get_task_cgroup - returns the cgroup of the task in the default cgroup
> + * hierarchy.
> + *
> + * @task: target task
> + * This function returns the @task's cgroup on the default cgroup hierarchy. The
> + * returned cgroup has its reference incremented (by calling cgroup_get()). So
> + * the caller must cgroup_put() the obtained reference once it is done with it.
> + */
> +struct cgroup *get_task_cgroup(struct task_struct *task)
> +{
> +	struct cgroup *cgrp;
> +
> +	mutex_lock(&cgroup_mutex);
> +	down_read(&css_set_rwsem);
> +
> +	cgrp = task_cgroup_from_root(task, &cgrp_dfl_root);
> +	cgroup_get(cgrp);
> +
> +	up_read(&css_set_rwsem);
> +	mutex_unlock(&cgroup_mutex);
> +	return cgrp;
> +}
> +EXPORT_SYMBOL_GPL(get_task_cgroup);
> +
>  /* used to track tasks and other necessary states during migration */
>  struct cgroup_taskset {
>  	/* the src and dst cset list running through cset->mg_node */
> -- 
> 2.0.0.526.g5318336
> 
> _______________________________________________
> Containers mailing list
> Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers
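
The calling convention documented in the quoted patch (the returned cgroup
is pinned with cgroup_get() and the caller must cgroup_put() it) can be
illustrated with a small user-space model. Everything below is a toy
stand-in with hypothetical toy_* names; it is not kernel code, and it
leaves out the cgroup_mutex and css_set_rwsem locking that the real
function takes around the lookup.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-ins for the kernel objects; all names here are hypothetical. */
struct toy_cgroup {
	const char *path;
	int refcnt;
};

struct toy_task {
	struct toy_cgroup *dfl_cgrp;	/* cgroup on the default hierarchy */
};

static void toy_cgroup_get(struct toy_cgroup *cgrp)
{
	cgrp->refcnt++;
}

static void toy_cgroup_put(struct toy_cgroup *cgrp)
{
	assert(cgrp->refcnt > 0);	/* a put without a matching get is a bug */
	cgrp->refcnt--;
}

/* Same shape as get_task_cgroup(): look up, pin, return. */
static struct toy_cgroup *toy_get_task_cgroup(struct toy_task *task)
{
	struct toy_cgroup *cgrp = task->dfl_cgrp;

	toy_cgroup_get(cgrp);
	return cgrp;
}

/* Example caller: borrows the cgroup, reads from it, drops the reference. */
static size_t toy_task_cgroup_path_len(struct toy_task *task)
{
	struct toy_cgroup *cgrp = toy_get_task_cgroup(task);
	size_t len = strlen(cgrp->path);

	toy_cgroup_put(cgrp);
	return len;
}
```

The point of the pattern is that the reference count is the same before
and after a well-behaved caller runs; a caller that forgets the put leaves
the cgroup pinned.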

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCH 2/5] sched: new clone flag CLONE_NEWCGROUP for cgroup namespace
       [not found]     ` <1405626731-12220-3-git-send-email-adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2014-07-24 17:01       ` Serge Hallyn
  0 siblings, 0 replies; 384+ messages in thread
From: Serge Hallyn @ 2014-07-24 17:01 UTC (permalink / raw)
  To: Aditya Kali
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, tj-DgEjT+Ai2ygdnm+yROfE0A,
	cgroups-u79uwXL29TY76Z2rM5mHXA, mingo-H+wXaHxf7aLQT0dZR+AlfA

Quoting Aditya Kali (adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org):
> CLONE_NEWCGROUP will be used to create new cgroup namespace.
> 

This is fine and I'm not looking to bikeshed, but am wondering - did
you consider any other ways besides unshare (e.g. a new mount option
to cgroupfs)?  If so, do you have a list of the downsides of those?
(I mainly ask because clone flags are still a scarce commodity)

> Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>

Acked-by: Serge E. Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>

> ---
>  include/uapi/linux/sched.h | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
> index 34f9d73..2f90d00 100644
> --- a/include/uapi/linux/sched.h
> +++ b/include/uapi/linux/sched.h
> @@ -21,8 +21,7 @@
>  #define CLONE_DETACHED		0x00400000	/* Unused, ignored */
>  #define CLONE_UNTRACED		0x00800000	/* set if the tracing process can't force CLONE_PTRACE on this clone */
>  #define CLONE_CHILD_SETTID	0x01000000	/* set the TID in the child */
> -/* 0x02000000 was previously the unused CLONE_STOPPED (Start in stopped state)
> -   and is now available for re-use. */
> +#define CLONE_NEWCGROUP		0x02000000	/* New cgroup namespace */
>  #define CLONE_NEWUTS		0x04000000	/* New utsname group? */
>  #define CLONE_NEWIPC		0x08000000	/* New ipcs */
>  #define CLONE_NEWUSER		0x10000000	/* New user namespace */
> -- 
> 2.0.0.526.g5318336
> 
> _______________________________________________
> Containers mailing list
> Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers
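
As a rough illustration of how user space would consume the new flag, the
sketch below calls unshare(2) with CLONE_NEWCGROUP and reads back the
namespace link that the cover letter shows under /proc/self/ns/cgroup.
The helper names enter_new_cgroupns() and cgroupns_id() are made up for
this example, the fallback #define uses the value from the quoted patch,
and the call is expected to fail (for instance with EPERM or EINVAL) on
systems without the privilege or kernel support.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <sys/types.h>
#include <unistd.h>

/* Value taken from the quoted patch, for headers that predate the flag. */
#ifndef CLONE_NEWCGROUP
#define CLONE_NEWCGROUP 0x02000000
#endif

/*
 * Hypothetical helper: moves the calling process into a new cgroup
 * namespace. Returns 0 on success or a negative errno value.
 */
static int enter_new_cgroupns(void)
{
	if (unshare(CLONE_NEWCGROUP) == -1)
		return -errno;
	return 0;
}

/*
 * Hypothetical helper: copies the target of /proc/self/ns/cgroup
 * (for example "cgroup:[4026531835]") into buf as a NUL-terminated
 * string. Returns the string length or a negative errno value.
 */
static ssize_t cgroupns_id(char *buf, size_t buflen)
{
	ssize_t n;

	assert(buflen > 0);
	n = readlink("/proc/self/ns/cgroup", buf, buflen - 1);
	if (n == -1)
		return -errno;
	buf[n] = '\0';
	return n;
}
```

Comparing the cgroupns_id() result before and after a successful
enter_new_cgroupns() call should show a different inode number, which is
what the "ls -l /proc/self/ns/cgroup" lines in the cover letter
demonstrate.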

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCH 4/5] cgroup: export cgroup_get() and cgroup_put()
       [not found]       ` <1405626731-12220-5-git-send-email-adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2014-07-24 17:03         ` Serge Hallyn
  0 siblings, 0 replies; 384+ messages in thread
From: Serge Hallyn @ 2014-07-24 17:03 UTC (permalink / raw)
  To: Aditya Kali
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, tj-DgEjT+Ai2ygdnm+yROfE0A,
	cgroups-u79uwXL29TY76Z2rM5mHXA, mingo-H+wXaHxf7aLQT0dZR+AlfA

Quoting Aditya Kali (adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org):
> move cgroup_get() and cgroup_put() into cgroup.h so that
> they can be called from other places.
> 
> Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>

Acked-by: Serge E. Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>

> ---
>  include/linux/cgroup.h | 17 +++++++++++++++++
>  kernel/cgroup.c        | 18 ------------------
>  2 files changed, 17 insertions(+), 18 deletions(-)
> 
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 707c302..4ea477f 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -530,6 +530,23 @@ static inline bool cgroup_on_dfl(const struct cgroup *cgrp)
>  	return cgrp->root == &cgrp_dfl_root;
>  }
>  
> +/* convenient tests for these bits */
> +static inline bool cgroup_is_dead(const struct cgroup *cgrp)
> +{
> +	return !(cgrp->self.flags & CSS_ONLINE);
> +}
> +
> +static inline void cgroup_get(struct cgroup *cgrp)
> +{
> +	WARN_ON_ONCE(cgroup_is_dead(cgrp));
> +	css_get(&cgrp->self);
> +}
> +
> +static inline void cgroup_put(struct cgroup *cgrp)
> +{
> +	css_put(&cgrp->self);
> +}
> +
>  /* no synchronization, the result can only be used as a hint */
>  static inline bool cgroup_has_tasks(struct cgroup *cgrp)
>  {
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 1671345..8552513 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -185,7 +185,6 @@ static int need_forkexit_callback __read_mostly;
>  static struct cftype cgroup_dfl_base_files[];
>  static struct cftype cgroup_legacy_base_files[];
>  
> -static void cgroup_put(struct cgroup *cgrp);
>  static int rebind_subsystems(struct cgroup_root *dst_root,
>  			     unsigned int ss_mask);
>  static int cgroup_destroy_locked(struct cgroup *cgrp);
> @@ -286,12 +285,6 @@ static struct cgroup_subsys_state *cgroup_e_css(struct cgroup *cgrp,
>  	return cgroup_css(cgrp, ss);
>  }
>  
> -/* convenient tests for these bits */
> -static inline bool cgroup_is_dead(const struct cgroup *cgrp)
> -{
> -	return !(cgrp->self.flags & CSS_ONLINE);
> -}
> -
>  struct cgroup_subsys_state *of_css(struct kernfs_open_file *of)
>  {
>  	struct cgroup *cgrp = of->kn->parent->priv;
> @@ -1029,17 +1022,6 @@ static umode_t cgroup_file_mode(const struct cftype *cft)
>  	return mode;
>  }
>  
> -static void cgroup_get(struct cgroup *cgrp)
> -{
> -	WARN_ON_ONCE(cgroup_is_dead(cgrp));
> -	css_get(&cgrp->self);
> -}
> -
> -static void cgroup_put(struct cgroup *cgrp)
> -{
> -	css_put(&cgrp->self);
> -}
> -
>  /**
>   * cgroup_refresh_child_subsys_mask - update child_subsys_mask
>   * @cgrp: the target cgroup
> -- 
> 2.0.0.526.g5318336
> 
> _______________________________________________
> Containers mailing list
> Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCH 0/5] RFC: CGroup Namespaces
  2014-07-24 16:36     ` Serge Hallyn
  (?)
  (?)
@ 2014-07-25 19:29     ` Aditya Kali
  -1 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-07-25 19:29 UTC (permalink / raw)
  To: Serge Hallyn
  Cc: Linux API, Linux Containers, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Andy Lutomirski, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA,
	Ingo Molnar

Thank you for your review. I have tried to respond to both your emails here.

On Thu, Jul 24, 2014 at 9:36 AM, Serge Hallyn <serge.hallyn@ubuntu.com> wrote:
> Quoting Aditya Kali (adityakali@google.com):
>> Background
>>   Cgroups and Namespaces are used together to create “virtual”
>>   containers that isolates the host environment from the processes
>>   running in container. But since cgroups themselves are not
>>   “virtualized”, the task is always able to see global cgroups view
>>   through cgroupfs mount and via /proc/self/cgroup file.
>>
> Hi,
>
> A few questions/comments:
>
> 1. Based on this description, am I to understand that after doing a
>    cgroupns unshare, 'mount -t cgroup cgroup /mnt' by default will
>    still mount the global root cgroup?  Any plans on "changing" that?

This is suggested in the "Possible Extensions of CGROUPNS" section.
More details below.

>    Will attempts to change settings of a cgroup which is not under
>    our current ns be rejected?  (That should be easy to do given your
>    patch 1/5).  Sorry if it's done in the set, I'm jumping around...
>

Currently, only 'cgroup_attach_task', 'cgroup_mkdir' and
'cgroup_rmdir' on cgroups outside of the cgroupns-root are prevented.
Reads and writes of the actual cgroup properties are not prevented;
the usual permission checks continue to apply to those. I was hoping
that would be enough, but see more comments towards the end.

> 2. What would be the repercussions of allowing cgroupns unshare so
>    long as you have ns_capable(CAP_SYS_ADMIN) to the user_ns which
>    created your current ns cgroup?  It'd be a shame if that wasn't
>    on the roadmap.
>

It's certainly on the roadmap; it's just that some logistics were not
clear at this time. As pointed out by Andy Lutomirski on [PATCH 5/5] of
this series, if we allow cgroupns creation to
ns_capable(CAP_SYS_ADMIN) processes, we may need some kind of explicit
permission from the cgroup subsystem to allow this. One approach could
be an explicit cgroup.may_unshare setting. Alternatively, ownership of
the cgroup directory (which is going to become the cgroupns-root)
could also be used here, i.e., the process is
ns_capable(CAP_SYS_ADMIN) && it owns the cgroup directory. There
already seems to be a function that allows a similar thing and might
be sufficient:

/**
 * capable_wrt_inode_uidgid - Check nsown_capable and uid and gid mapped
 * @inode: The inode in question
 * @cap: The capability in question
 *
 * Return true if the current task has the given capability targeted at
 * its own user namespace and that the given inode's uid and gid are
 * mapped into the current user namespace.
 */
bool capable_wrt_inode_uidgid(const struct inode *inode, int cap)

What do you think? We can enable this for non-init userns once this is
decided on.


> 3. The un-namespaced view of /proc/self/cgroup from a sibling cgroupns
>    makes me wonder whether it wouldn't be more appropriate to leave
>    /proc/self/cgroup always un-filtered, and use /proc/self/nscgroup
>    (or somesuch) to provide the namespaced view.  /proc/self/nscgroup
>    would simply be empty (or say (invalid) or (unreachable)) from a
>    sibling ns.  That will give criu and admin tools like lxc/docker all
>    they need to do simple cgroup setup.
>

It may work for lxc/docker and new applications that use the new
interface. But it's difficult to change the numerous existing user
applications and libraries that depend on /proc/self/cgroup. Moreover,
even with the new interface, /proc/self/cgroup will continue to leak
system-level cgroup information. And fixing this leak is critical to
making the container migratable.

It's easy to correctly handle the read of /proc/<pid>/cgroup from a
sibling cgroupns. Instead of showing the unfiltered view, we could just
not show anything (the same behavior as when the cgroup hierarchy is
not mounted). Would that be more acceptable? I can make that change in
the next version of this series.
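
The filtering described above can be sketched as a userspace model. This
is not the kernel code from the series: cgroupns_view() and its
parameters are hypothetical names, and cgroup paths are treated as plain
strings. A task whose real cgroup is at or below the reader's
cgroupns-root is shown relative to that root; anything else (the sibling
cgroupns case) yields an empty result.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical userspace model, not kernel code: compute the path that
 * a reader whose cgroupns is rooted at ns_root would see for a task
 * whose real cgroup path is real_path.
 *
 * Returns 1 and stores the namespace-relative path in out when
 * real_path is ns_root itself or lies below it. Returns 0 and stores an
 * empty string when it lies outside ns_root (the sibling-cgroupns case)
 * or when out is too small to hold the result.
 */
static int cgroupns_view(const char *ns_root, const char *real_path,
			 char *out, size_t outlen)
{
	size_t rlen = strlen(ns_root);
	const char *rest;

	if (outlen == 0)
		return 0;
	out[0] = '\0';

	/* Drop a trailing slash so that the real root "/" has length 0. */
	if (rlen > 0 && ns_root[rlen - 1] == '/')
		rlen--;

	if (strncmp(real_path, ns_root, rlen) != 0)
		return 0;

	rest = real_path + rlen;

	/* Require a component boundary: "/a/bc" is not under "/a/b". */
	if (*rest != '\0' && *rest != '/')
		return 0;

	/* The cgroupns-root itself is shown as "/". */
	if (*rest == '\0')
		rest = "/";

	if (strlen(rest) >= outlen)
		return 0;
	memcpy(out, rest, strlen(rest) + 1);
	return 1;
}
```

With ns_root set to /batchjobs/c_job_id1, the real path
/batchjobs/c_job_id1/sub_cgrp_1 is reported as /sub_cgrp_1, while a
reader rooted at /batchjobs/c_job_id2 gets an empty string for the same
task.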


>>   $ cat /proc/self/cgroup
>>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>>
>>   This exposure of cgroup names to the processes running inside a
>>   container results in some problems:
>>   (1) The container names are typically host-container-management-agent
>>       (systemd, docker/libcontainer, etc.) data and leaking its name (or
>>       leaking the hierarchy) reveals too much information about the host
>>       system.
>>   (2) It makes the container migration across machines (CRIU) more
>>       difficult as the container names need to be unique across the
>>       machines in the migration domain.
>>   (3) It makes it difficult to run container management tools (like
>>       docker/libcontainer, lmctfy, etc.) within virtual containers
>>       without adding dependency on some state/agent present outside the
>>       container.
>>
>>   Note that the feature proposed here is completely different than the
>>   “ns cgroup” feature which existed in the linux kernel until recently.
>>   The ns cgroup also attempted to connect cgroups and namespaces by
>>   creating a new cgroup every time a new namespace was created. It did
>>   not solve any of the above mentioned problems and was later dropped
>>   from the kernel.
>>
>> Introducing CGroup Namespaces
>>   With unified cgroup hierarchy
>>   (Documentation/cgroups/unified-hierarchy.txt), the containers can now
>>   have a much more coherent cgroup view and it's easy to associate a
>>   container with a single cgroup. This also allows us to virtualize the
>>   cgroup view for tasks inside the container.
>>
>>   The new CGroup Namespace allows a process to “unshare” its cgroup
>>   hierarchy starting from the cgroup it's currently in.
>>   For Ex:
>>   $ cat /proc/self/cgroup
>>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>>   $ ls -l /proc/self/ns/cgroup
>>   lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
>>   $ ~/unshare -c  # calls unshare(CLONE_NEWCGROUP) and exec’s /bin/bash
>>   [ns]$ ls -l /proc/self/ns/cgroup
>>   lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
>>   # From within new cgroupns, process sees that it's in the root cgroup
>>   [ns]$ cat /proc/self/cgroup
>>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
>>
>>   # From global cgroupns:
>>   $ cat /proc/<pid>/cgroup
>>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>>
>>   The virtualization of /proc/self/cgroup file combined with restricting
>>   the view of cgroup hierarchy by bind-mounting for the
>>   $CGROUP_MOUNT/batchjobs/c_job_id1/ directory to
>>   $CONTAINER_CHROOT/sys/fs/cgroup/) should provide a completely isolated
>>   cgroup view inside the container.
>>
>>   In its current simplistic form, the cgroup namespaces provide
>>   following behavior:
>>
>>   (1) The “root” cgroup for a cgroup namespace is the cgroup in which
>>       the process calling unshare is running.
>>       For ex. if a process in /batchjobs/c_job_id1 cgroup calls unshare,
>>       cgroup /batchjobs/c_job_id1 becomes the cgroupns-root.
>>       For the init_cgroup_ns, this is the real root (“/”) cgroup
>>       (identified in code as cgrp_dfl_root.cgrp).
>>
>>   (2) The cgroupns-root cgroup does not change even if the namespace
>>       creator process later moves to a different cgroup.
>>       $ ~/unshare -c # unshare cgroupns in some cgroup
>>       [ns]$ cat /proc/self/cgroup
>>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
>>       [ns]$ mkdir sub_cgrp_1
>>       [ns]$ echo 0 > sub_cgrp_1/cgroup.procs
>>       [ns]$ cat /proc/self/cgroup
>>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
>>
>>   (3) Each process gets its CGROUPNS specific view of
>>       /proc/<pid>/cgroup.
>>   (a) Processes running inside the cgroup namespace will be able to see
>>       cgroup paths (in /proc/self/cgroup) only inside their root cgroup
>>       [ns]$ sleep 100000 &  # From within unshared cgroupns
>>       [1] 7353
>>       [ns]$ echo 7353 > sub_cgrp_1/cgroup.procs
>>       [ns]$ cat /proc/7353/cgroup
>>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
>>
>>   (b) From global cgroupns, the real cgroup path will be visible:
>>       $ cat /proc/7353/cgroup
>>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
>>
>>   (c) From a sibling cgroupns, the real path will be visible:
>>       [ns2]$ cat /proc/7353/cgroup
>>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
>>       (In correct container setup though, it should not be possible to
>>        access PIDs in another container in the first place. This can be
>>        detected changed if desired.)
>>
>>   (4) Processes inside a cgroupns are not allowed to move out of the
>>       cgroupns-root. This is true even if a privileged process in global
>>       cgroupns tries to move the process out of its cgroupns-root.
>>
>>       # From global cgroupns
>>       $ cat /proc/7353/cgroup
>>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
>>       # cgroupns-root for 7353 is /batchjobs/c_job_id1
>>       $ echo 7353 > batchjobs/c_job_id2/cgroup.procs
>>       -bash: echo: write error: Operation not permitted
>>
>>   (5) setns() is not supported for cgroup namespace in the initial
>>       version.
>
> This combined with the full-path reporting for peer ns cgroups could make
> for fun antics when attaching to an existing container (since we'd have
> to unshare into a new ns cgroup with the same root as the container).
> I understand you are implying this will be fixed soon though.
>

I am thinking that setns() will only be allowed if
target_cgrpns->cgroupns_root is_descendant_of
current_cgrpns->cgroupns_root, i.e., you will only be able to setns()
to a cgroup namespace which is rooted deeper in the hierarchy than
your own (in addition to checking
capable_wrt_inode_uidgid(target_cgrpns_inode)).

In addition to this, we need to decide whether it's OK for setns() to
also change the cgroup of the task. Consider the following example:

[A] ----> [B] ----> C
    ----> D

[A] and [B] are cgroupns-roots. Now, if a task in Cgroup D (which is
under cgroupns [A]) attempts to setns() to cgroupns [B], then its
cgroup should change from /A/D to /A/B. I am concerned about the
side-effects this might cause. Though otherwise, this is a very useful
feature for containers. One could argue that this is similar to
setns() to a mount-namespace which is pivot_root'd somewhere else (in
which case, the attaching task's root "/" moves implicitly with
setns).

Alternatively, we could only allow setns() if
target_cgrpns->cgroupns_root == current->cgroup . I.e., taking above
example again, if process in Cgroup D wants to setns() to cgroupns
[B], then it will first need to move to Cgroup B, and only then the
setns() will succeed. This makes sure that there is no implicit cgroup
move.
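
To make the two candidate rules concrete, here is a small userspace
sketch that compares them on the [A]/[B]/C/D example. It is only a model
under stated assumptions: the function names are hypothetical, cgroups
are represented by their path strings, the capability check is left out,
and the descendant rule is assumed to also accept a target root equal to
the caller's own root.

```c
#include <string.h>

/* Hypothetical helper: 1 if path is root itself or lies below root. */
static int cgroup_path_is_under(const char *root, const char *path)
{
	size_t rlen = strlen(root);

	/* Drop a trailing slash so that the real root "/" has length 0. */
	if (rlen > 0 && root[rlen - 1] == '/')
		rlen--;
	if (strncmp(path, root, rlen) != 0)
		return 0;
	/* Require a component boundary: "/A/BC" is not under "/A/B". */
	return path[rlen] == '\0' || path[rlen] == '/';
}

/*
 * Rule 1 (descendant rule): the target cgroupns-root must be at or
 * below the caller's own cgroupns-root. The caller's cgroup may then
 * change implicitly as part of setns().
 */
static int setns_allowed_descendant(const char *current_ns_root,
				    const char *target_ns_root)
{
	return cgroup_path_is_under(current_ns_root, target_ns_root);
}

/*
 * Rule 2 (no implicit move): the caller must already be in the cgroup
 * that is the target cgroupns-root.
 */
static int setns_allowed_same_cgroup(const char *current_cgroup,
				     const char *target_ns_root)
{
	return strcmp(current_cgroup, target_ns_root) == 0;
}
```

For a task in /A/D whose cgroupns is rooted at /A, rule 1 accepts a
setns() to the namespace rooted at /A/B, while rule 2 refuses it until
the task has first moved itself into /A/B.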

WDYT? I haven't prototyped this yet, but will send out a patch after
this series is accepted.

>>   (6) When some thread from a multi-threaded process unshares its
>>       cgroup-namespace, the new cgroupns gets applied to the entire
>>       process (all the threads). This should be OK since
>>       unified-hierarchy only allows process-level containerization. So
>>       all the threads in the process will have the same cgroup. And both
>>       - changing cgroups and unsharing namespaces - are protected under
>>       threadgroup_lock(task).
>>
>>   (7) The cgroup namespace is alive as long as there is at least 1
>>       process inside it. When the last process exits, the cgroup
>>       namespace is destroyed. The cgroupns-root and the actual cgroups
>>       remain though.
>>
>> Implementation
>>   The current patch-set is based on top of Tejun's cgroup tree (for-next
>>   branch). It's fairly non-intrusive and provides the above-mentioned
>>   features.
>>
>> Possible extensions of CGROUPNS:
>>   (1) The Documentation/cgroups/unified-hierarchy.txt mentions use of
>>       capabilities to restrict cgroups to administrative users. CGroup
>>       namespaces could be of help here. With cgroup namespaces, it might
>>       be possible to delegate administration of sub-cgroups under a
>>       cgroupns-root to the cgroupns owner.
>
> That would be nice.
>
>>   (2) Provide a cgroupns specific cgroupfs mount. i.e., the following
>>       command when ran from inside a cgroupns should only mount the
>>       hierarchy from cgroupns-root cgroup:
>>       $ mount -t cgroup cgroup <cgroup-mountpoint>
>>       # -o __DEVEL__sane_behavior should be implicit
>>
>>       This is similar to how procfs can be mounted for every PIDNS. This
>>       may have some usecases.
>
> Sorry - I see this answers the first part of a question in my previous email.
> However, the question remains of whether changes to limits in cgroups which
> are not under our cgroup-ns-root are allowed.
>
> Admittedly the current case with cgmanager is the same - in that it depends
> on proper setup of the container - but cgmanager is geared to recommend
> not mounting the cgroups in the container at all (and we can reject such
> mounts in the container altogether with no loss in functionality) whereas
> you are here encouraging such mounts.  Which is fine - so long as you then
> fully address the potential issues.

It would be nice to have this, but frankly, it may add a bit of
complexity in the cgroup/kernfs code (I will have to prototype and
see). Also, the same behavior can be obtained simply by bind-mounting
the cgroupns-root inside the container. So I am currently leaning
towards rejecting such mounts in favor of simplicity.

Regarding disallowing writes to cgroup files outside of your
cgroupns-root, I think it should be possible to implement it easily. I
will include it in the next revision of this series.

Thanks,
-- 
Aditya

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCH 0/5] RFC: CGroup Namespaces
  2014-07-24 16:36     ` Serge Hallyn
@ 2014-07-25 19:29       ` Aditya Kali
  -1 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-07-25 19:29 UTC (permalink / raw)
  To: Serge Hallyn
  Cc: Tejun Heo, Li Zefan, cgroups, linux-kernel, Linux API,
	Ingo Molnar, Linux Containers, Andy Lutomirski

Thank you for your review. I have tried to respond to both your emails here.

On Thu, Jul 24, 2014 at 9:36 AM, Serge Hallyn <serge.hallyn@ubuntu.com> wrote:
> Quoting Aditya Kali (adityakali@google.com):
>> Background
>>   Cgroups and Namespaces are used together to create “virtual”
>>   containers that isolates the host environment from the processes
>>   running in container. But since cgroups themselves are not
>>   “virtualized”, the task is always able to see global cgroups view
>>   through cgroupfs mount and via /proc/self/cgroup file.
>>
> Hi,
>
> A few questions/comments:
>
> 1. Based on this description, am I to understand that after doing a
>    cgroupns unshare, 'mount -t cgroup cgroup /mnt' by default will
>    still mount the global root cgroup?  Any plans on "changing" that?

This is suggested in the "Possible Extensions of CGROUPNS" section.
More details below.

>    Will attempts to change settings of a cgroup which is not under
>    our current ns be rejected?  (That should be easy to do given your
>    patch 1/5).  Sorry if it's done in the set, I'm jumping around...
>

Currently, only 'cgroup_attach_task', 'cgroup_mkdir' and
'cgroup_rmdir' on cgroups outside of the cgroupns-root are prevented.
Reads and writes of the actual cgroup properties are not prevented;
the usual permission checks continue to apply to those. I was hoping
that would be enough, but see more comments towards the end.

> 2. What would be the repercussions of allowing cgroupns unshare so
>    long as you have ns_capable(CAP_SYS_ADMIN) to the user_ns which
>    created your current ns cgroup?  It'd be a shame if that wasn't
>    on the roadmap.
>

It's certainly on the roadmap; it's just that some logistics were not
clear at this time. As pointed out by Andy Lutomirski on [PATCH 5/5] of
this series, if we allow cgroupns creation to
ns_capable(CAP_SYS_ADMIN) processes, we may need some kind of explicit
permission from the cgroup subsystem to allow this. One approach could
be an explicit cgroup.may_unshare setting. Alternatively, ownership of
the cgroup directory (which is going to become the cgroupns-root)
could also be used here, i.e., the process is
ns_capable(CAP_SYS_ADMIN) && it owns the cgroup directory. There
already seems to be a function that allows a similar thing and might
be sufficient:

/**
 * capable_wrt_inode_uidgid - Check nsown_capable and uid and gid mapped
 * @inode: The inode in question
 * @cap: The capability in question
 *
 * Return true if the current task has the given capability targeted at
 * its own user namespace and that the given inode's uid and gid are
 * mapped into the current user namespace.
 */
bool capable_wrt_inode_uidgid(const struct inode *inode, int cap)

What do you think? We can enable this for non-init userns once this is
decided on.


> 3. The un-namespaced view of /proc/self/cgroup from a sibling cgroupns
>    makes me wonder whether it wouldn't be more appropriate to leave
>    /proc/self/cgroup always un-filtered, and use /proc/self/nscgroup
>    (or somesuch) to provide the namespaced view.  /proc/self/nscgroup
>    would simply be empty (or say (invalid) or (unreachable)) from a
>    sibling ns.  That will give criu and admin tools like lxc/docker all
>    they need to do simple cgroup setup.
>

It may work for lxc/docker and new applications that use the new
interface. But it's difficult to change the numerous existing user
applications and libraries that depend on /proc/self/cgroup. Moreover,
even with the new interface, /proc/self/cgroup will continue to leak
system-level cgroup information. And fixing this leak is critical to
making the container migratable.

It's easy to correctly handle the read of /proc/<pid>/cgroup from a
sibling cgroupns. Instead of showing the unfiltered view, we could just
not show anything (the same behavior as when the cgroup hierarchy is
not mounted). Would that be more acceptable? I can make that change in
the next version of this series.
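
As a rough sketch of that "show nothing" behavior, the path seen by a
reader could be derived from the task's real cgroup path and the
reader's cgroupns-root as below. This is plain user-space string
handling with hypothetical names (cgroup_path_under(),
cgroupns_view()); the kernel would work on cgroup structures rather
than strings. Paths are assumed to be absolute with no trailing slash.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: 1 if cgroup path 'path' equals 'root' or lies
 * below it, 0 otherwise.
 */
int cgroup_path_under(const char *path, const char *root)
{
	size_t n = strlen(root);

	if (strcmp(root, "/") == 0)
		return path[0] == '/';
	if (strncmp(path, root, n) != 0)
		return 0;
	/* Reject mere string prefixes such as /a/b vs. /a/bc. */
	return path[n] == '\0' || path[n] == '/';
}

/*
 * Fill 'out' with the path a reader in a cgroupns rooted at 'ns_root'
 * would see for a task whose real cgroup is 'real'. Returns 1 if the
 * cgroup is visible, or 0 with an empty 'out' if it lies outside the
 * reader's cgroupns-root (the sibling-cgroupns case).
 */
int cgroupns_view(const char *real, const char *ns_root,
		  char *out, size_t len)
{
	const char *rel;

	if (len == 0)
		return 0;
	out[0] = '\0';
	if (!cgroup_path_under(real, ns_root))
		return 0;
	rel = strcmp(ns_root, "/") == 0 ? real : real + strlen(ns_root);
	snprintf(out, len, "%s", *rel ? rel : "/");
	return 1;
}
```

With the paths from the original posting, a reader rooted at
/batchjobs/c_job_id1 would get /sub_cgrp_1 for PID 7353, while a
reader rooted at a sibling such as /batchjobs/c_job_id2 would get an
empty result.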


>>   $ cat /proc/self/cgroup
>>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>>
>>   This exposure of cgroup names to the processes running inside a
>>   container results in some problems:
>>   (1) The container names are typically host-container-management-agent
>>       (systemd, docker/libcontainer, etc.) data and leaking its name (or
>>       leaking the hierarchy) reveals too much information about the host
>>       system.
>>   (2) It makes the container migration across machines (CRIU) more
>>       difficult as the container names need to be unique across the
>>       machines in the migration domain.
>>   (3) It makes it difficult to run container management tools (like
>>       docker/libcontainer, lmctfy, etc.) within virtual containers
>>       without adding dependency on some state/agent present outside the
>>       container.
>>
>>   Note that the feature proposed here is completely different from the
>>   “ns cgroup” feature which existed in the Linux kernel until recently.
>>   The ns cgroup also attempted to connect cgroups and namespaces by
>>   creating a new cgroup every time a new namespace was created. It did
>>   not solve any of the above-mentioned problems and was later dropped
>>   from the kernel.
>>
>> Introducing CGroup Namespaces
>>   With the unified cgroup hierarchy
>>   (Documentation/cgroups/unified-hierarchy.txt), the containers can now
>>   have a much more coherent cgroup view and it's easy to associate a
>>   container with a single cgroup. This also allows us to virtualize the
>>   cgroup view for tasks inside the container.
>>
>>   The new CGroup Namespace allows a process to “unshare” its cgroup
>>   hierarchy starting from the cgroup it's currently in.
>>   For example:
>>   $ cat /proc/self/cgroup
>>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>>   $ ls -l /proc/self/ns/cgroup
>>   lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
>>   $ ~/unshare -c  # calls unshare(CLONE_NEWCGROUP) and exec’s /bin/bash
>>   [ns]$ ls -l /proc/self/ns/cgroup
>>   lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
>>   # From within new cgroupns, process sees that it's in the root cgroup
>>   [ns]$ cat /proc/self/cgroup
>>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
>>
>>   # From global cgroupns:
>>   $ cat /proc/<pid>/cgroup
>>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>>
>>   The virtualization of the /proc/self/cgroup file, combined with
>>   restricting the view of the cgroup hierarchy (by bind-mounting the
>>   $CGROUP_MOUNT/batchjobs/c_job_id1/ directory to
>>   $CONTAINER_CHROOT/sys/fs/cgroup/), should provide a completely
>>   isolated cgroup view inside the container.
>>
>>   In its current simplistic form, the cgroup namespaces provide the
>>   following behavior:
>>
>>   (1) The “root” cgroup for a cgroup namespace is the cgroup in which
>>       the process calling unshare is running.
>>       For example, if a process in the /batchjobs/c_job_id1 cgroup
>>       calls unshare, cgroup /batchjobs/c_job_id1 becomes the
>>       cgroupns-root.
>>       For the init_cgroup_ns, this is the real root (“/”) cgroup
>>       (identified in code as cgrp_dfl_root.cgrp).
>>
>>   (2) The cgroupns-root cgroup does not change even if the namespace
>>       creator process later moves to a different cgroup.
>>       $ ~/unshare -c # unshare cgroupns in some cgroup
>>       [ns]$ cat /proc/self/cgroup
>>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
>>       [ns]$ mkdir sub_cgrp_1
>>       [ns]$ echo 0 > sub_cgrp_1/cgroup.procs
>>       [ns]$ cat /proc/self/cgroup
>>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
>>
>>   (3) Each process gets its CGROUPNS-specific view of
>>       /proc/<pid>/cgroup.
>>   (a) Processes running inside the cgroup namespace will be able to see
>>       cgroup paths (in /proc/self/cgroup) only inside their root cgroup:
>>       [ns]$ sleep 100000 &  # From within unshared cgroupns
>>       [1] 7353
>>       [ns]$ echo 7353 > sub_cgrp_1/cgroup.procs
>>       [ns]$ cat /proc/7353/cgroup
>>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
>>
>>   (b) From global cgroupns, the real cgroup path will be visible:
>>       $ cat /proc/7353/cgroup
>>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
>>
>>   (c) From a sibling cgroupns, the real path will be visible:
>>       [ns2]$ cat /proc/7353/cgroup
>>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
>>       (In a correct container setup though, it should not be possible
>>        to access PIDs in another container in the first place. This
>>        behavior can be changed if desired.)
>>
>>   (4) Processes inside a cgroupns are not allowed to move out of the
>>       cgroupns-root. This is true even if a privileged process in global
>>       cgroupns tries to move the process out of its cgroupns-root.
>>
>>       # From global cgroupns
>>       $ cat /proc/7353/cgroup
>>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
>>       # cgroupns-root for 7353 is /batchjobs/c_job_id1
>>       $ echo 7353 > batchjobs/c_job_id2/cgroup.procs
>>       -bash: echo: write error: Operation not permitted
>>
>>   (5) setns() is not supported for cgroup namespace in the initial
>>       version.
>
> This combined with the full-path reporting for peer ns cgroups could make
> for fun antics when attaching to an existing container (since we'd have
> to unshare into a new ns cgroup with the same root as the container).
> I understand you are implying this will be fixed soon though.
>

I am thinking that setns() will only be allowed if
target_cgrpns->cgroupns_root is_descendant_of
current_cgrpns->cgroupns_root, i.e., you will only be able to setns()
to a cgroup namespace which is rooted deeper in the hierarchy than your
own (in addition to checking
capable_wrt_inode_uidgid(target_cgrpns_inode)).

In addition to this, we need to decide whether it's OK for setns() to
also change the cgroup of the task. Consider the following example:

[A] ----> [B] ----> C
    ----> D

[A] and [B] are cgroupns-roots. Now, if a task in Cgroup D (which is
under cgroupns [A]) attempts to setns() to cgroupns [B], then its
cgroup should change from /A/D to /A/B. I am concerned about the
side-effects this might cause. Though otherwise, this is a very useful
feature for containers. One could argue that this is similar to
setns() to a mount-namespace which is pivot_root'd somewhere else (in
which case, the attaching task's root "/" moves implicitly with
setns).

Alternatively, we could only allow setns() if
target_cgrpns->cgroupns_root == current->cgroup, i.e., taking the above
example again, if a process in Cgroup D wants to setns() to cgroupns
[B], then it will first need to move to Cgroup B, and only then will
the setns() succeed. This makes sure that there is no implicit cgroup
move.

WDYT? I haven't prototyped this yet, but will send out a patch after
this series is accepted.
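
To compare the two candidate rules side by side, here is a small
user-space sketch over cgroup path strings. The function names are
hypothetical and nothing here is prototyped in the kernel; it also
assumes that "descendant" includes the cgroup itself, which the text
above leaves open.

```c
#include <string.h>

/*
 * Hypothetical helper: 1 if 'path' equals 'ancestor' or lies below it.
 * Paths are absolute with no trailing slash (except "/").
 */
int cgroup_is_descendant(const char *path, const char *ancestor)
{
	size_t n = strlen(ancestor);

	if (strcmp(ancestor, "/") == 0)
		return path[0] == '/';
	return strncmp(path, ancestor, n) == 0 &&
	       (path[n] == '\0' || path[n] == '/');
}

/*
 * Rule 1: allow setns() when the target cgroupns-root is at or below
 * the caller's own cgroupns-root. The caller's cgroup may then have
 * to change implicitly.
 */
int setns_ok_by_ns_root(const char *cur_ns_root,
			const char *target_ns_root)
{
	return cgroup_is_descendant(target_ns_root, cur_ns_root);
}

/*
 * Rule 2: allow setns() only when the caller already sits exactly in
 * the target cgroupns-root, so no implicit cgroup move is needed.
 */
int setns_ok_by_current_cgroup(const char *cur_cgroup,
			       const char *target_ns_root)
{
	return strcmp(cur_cgroup, target_ns_root) == 0;
}
```

For the example above, a task in /A/D whose cgroupns is rooted at /A
passes rule 1 for target root /A/B, but fails rule 2 until it has
moved itself into /A/B.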

>>   (6) When some thread from a multi-threaded process unshares its
>>       cgroup-namespace, the new cgroupns gets applied to the entire
>>       process (all the threads). This should be OK since
>>       unified-hierarchy only allows process-level containerization. So
>>       all the threads in the process will have the same cgroup. And both
>>       - changing cgroups and unsharing namespaces - are protected under
>>       threadgroup_lock(task).
>>
>>   (7) The cgroup namespace is alive as long as there is at least one
>>       process inside it. When the last process exits, the cgroup
>>       namespace is destroyed. The cgroupns-root and the actual cgroups
>>       remain though.
>>
>> Implementation
>>   The current patch-set is based on top of Tejun's cgroup tree (for-next
>>   branch). It's fairly non-intrusive and provides the above-mentioned
>>   features.
>>
>> Possible extensions of CGROUPNS:
>>   (1) The Documentation/cgroups/unified-hierarchy.txt mentions use of
>>       capabilities to restrict cgroups to administrative users. CGroup
>>       namespaces could be of help here. With cgroup namespaces, it might
>>       be possible to delegate administration of sub-cgroups under a
>>       cgroupns-root to the cgroupns owner.
>
> That would be nice.
>
>>   (2) Provide a cgroupns-specific cgroupfs mount, i.e., the following
>>       command when run from inside a cgroupns should only mount the
>>       hierarchy from the cgroupns-root cgroup:
>>       $ mount -t cgroup cgroup <cgroup-mountpoint>
>>       # -o __DEVEL__sane_behavior should be implicit
>>
>>       This is similar to how procfs can be mounted for every PIDNS. This
>>       may have some use cases.
>
> Sorry - I see this answers the first part of a question in my previous email.
> However, the question remains of whether changes to limits in cgroups which
> are not under our cgroup-ns-root are allowed.
>
> Admittedly the current case with cgmanager is the same - in that it depends
> on proper setup of the container - but cgmanager is geared to recommend
> not mounting the cgroups in the container at all (and we can reject such
> mounts in the container altogether with no loss in functionality) whereas
> you are here encouraging such mounts.  Which is fine - so long as you then
> fully address the potential issues.

It would be nice to have this, but frankly, it may add a bit of
complexity in the cgroup/kernfs code (I will have to prototype and
see). Also, the same behavior can be obtained simply by bind-mounting
the cgroupns-root inside the container. So I am currently leaning
towards rejecting such mounts in favor of simplicity.

Regarding disallowing writes to cgroup files outside of your
cgroupns-root, I think it should be possible to implement it easily. I
will include it in the next revision of this series.

Thanks,
-- 
Aditya

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCH 0/5] RFC: CGroup Namespaces
       [not found]       ` <CAGr1F2GcAema-E2q6PFj=R0Z505iD7JshrMuMdfPTJ95wMiQMA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2014-07-25 20:27         ` Andy Lutomirski
  2014-07-29  4:51         ` Serge E. Hallyn
  1 sibling, 0 replies; 384+ messages in thread
From: Andy Lutomirski @ 2014-07-25 20:27 UTC (permalink / raw)
  To: Aditya Kali
  Cc: Serge Hallyn, Tejun Heo, Li Zefan, cgroups, linux-kernel,
	Linux API, Ingo Molnar, Linux Containers

On Fri, Jul 25, 2014 at 12:29 PM, Aditya Kali <adityakali@google.com> wrote:
> Thank you for your review. I have tried to respond to both your emails here.
>
> On Thu, Jul 24, 2014 at 9:36 AM, Serge Hallyn <serge.hallyn@ubuntu.com> wrote:
>> 2. What would be the repercussions of allowing cgroupns unshare so
>>    long as you have ns_capable(CAP_SYS_ADMIN) to the user_ns which
>>    created your current ns cgroup?  It'd be a shame if that wasn't
>>    on the roadmap.
>>
>
> It's certainly on the roadmap, just that some logistics were not clear
> at this time. As pointed out by Andy Lutomirski on [PATCH 5/5] of this
> series, if we allow cgroupns creation to ns_capable(CAP_SYS_ADMIN)
> processes, we may need some kind of explicit permission from the
> cgroup subsystem to allow this. One approach could be an explicit
> cgroup.may_unshare setting. Alternatively, ownership of the cgroup
> directory (which is going to become the cgroupns-root) could also be
> used here, i.e., the process is ns_capable(CAP_SYS_ADMIN) && it owns
> the cgroup directory. There already seems to be a function that does
> something similar and might be sufficient:
>
> /**
>  * capable_wrt_inode_uidgid - Check nsown_capable and uid and gid mapped
>  * @inode: The inode in question
>  * @cap: The capability in question
>  *
>  * Return true if the current task has the given capability targeted at
>  * its own user namespace and that the given inode's uid and gid are
>  * mapped into the current user namespace.
>  */
> bool capable_wrt_inode_uidgid(const struct inode *inode, int cap)
>
> What do you think? We can enable this for non-init userns once this is
> decided on.
>

I think I'd rather it just check that it's owned by the userns owner
if we were going down that route.  But maybe there's a good reason to
do it this way.

>
>> 3. The un-namespaced view of /proc/self/cgroup from a sibling cgroupns
>>    makes me wonder whether it wouldn't be more appropriate to leave
>>    /proc/self/cgroup always un-filtered, and use /proc/self/nscgroup
>>    (or somesuch) to provide the namespaced view.  /proc/self/nscgroup
>>    would simply be empty (or say (invalid) or (unreachable)) from a
>>    sibling ns.  That will give criu and admin tools like lxc/docker all
>>    they need to do simple cgroup setup.
>>
>
> It may work for lxc/docker and new applications that use the new
> interface. But it's difficult to change numerous existing user
> applications and libraries that depend on /proc/self/cgroup. Moreover,
> even with the new interface, /proc/self/cgroup will continue to leak
> system level cgroup information. And fixing this leak is critical to
> make the container migratable.
>
> It's easy to correctly handle the read of /proc/<pid>/cgroup from a
> sibling cgroupns. Instead of showing unfiltered view, we could just
> not show anything (same behavior when the cgroup hierarchy is not
> mounted). Will that be more acceptable? I can make that change in the
> next version of this series.
>
>


>>>   (5) setns() is not supported for cgroup namespace in the initial
>>>       version.
>>
>> This combined with the full-path reporting for peer ns cgroups could make
>> for fun antics when attaching to an existing container (since we'd have
>> to unshare into a new ns cgroup with the same root as the container).
>> I understand you are implying this will be fixed soon though.
>>
>
> I am thinking that setns() will only be allowed if
> target_cgrpns->cgroupns_root is_descendant_of
> current_cgrpns->cgroupns_root. i.e., you will only be able to setns() to a
> cgroup namespace which is rooted deeper in hierarchy than your own (in
> addition to checking capable_wrt_inode_uidgid(target_cgrpns_inode)).

I'm not sure why the capable_wrt_inode_uidgid is needed here -- I
imagine that the hierarchy check and the usual CAP_SYS_ADMIN check on
the cgroupns's userns would be sufficient.

>
> In addition to this, we need to decide whether it's OK for setns() to
> also change the cgroup of the task. Consider the following example:
>
> [A] ----> [B] ----> C
>     ----> D
>
> [A] and [B] are cgroupns-roots. Now, if a task in Cgroup D (which is
> under cgroupns [A]) attempts to setns() to cgroupns [B], then its
> cgroup should change from /A/D to /A/B. I am concerned about the
> side-effects this might cause. Though otherwise, this is a very useful
> feature for containers. One could argue that this is similar to
> setns() to a mount-namespace which is pivot_root'd somewhere else (in
> which case, the attaching task's root "/" moves implicitly with
> setns).

Off the top of my head, I think that making setns do this would be too
magical.  How about just requiring that you already be in (a
descendant of) the requested cgroupns's root cgroup if you try to
setns?

>
> Alternatively, we could only allow setns() if
> target_cgrpns->cgroupns_root == current->cgroup . I.e., taking above
> example again, if process in Cgroup D wants to setns() to cgroupns
> [B], then it will first need to move to Cgroup B, and only then the
> setns() will succeed. This makes sure that there is no implicit cgroup
> move.

I like this one, but I think that descendant cgroups should probably
be allowed, too.
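
To spell out the check I have in mind, here is a rough sketch in Python
rather than kernel C. It only models cgroup paths as strings, and
is_descendant_or_self() and setns_allowed() are hypothetical names used for
illustration, not functions from the patch set.

```python
def is_descendant_or_self(path: str, root: str) -> bool:
    """Return True if cgroup `path` equals `root` or lies somewhere below it."""
    root = root.rstrip("/")  # "/" becomes "", so every absolute path matches
    path = path.rstrip("/")
    return path == root or path.startswith(root + "/")


def setns_allowed(task_cgroup: str, target_ns_root: str) -> bool:
    """Model of the 'no implicit cgroup move' rule.

    The caller must already sit in the target namespace's root cgroup,
    or in one of its descendants, before setns() is permitted.
    """
    return is_descendant_or_self(task_cgroup, target_ns_root)
```

With the example from this thread, a task in /A/D is refused for cgroupns
[B] until it has moved itself to /A/B, while a task already in /A/B/C would
be accepted.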

--Andy

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCH 0/5] RFC: CGroup Namespaces
       [not found]       ` <CAGr1F2GcAema-E2q6PFj=R0Z505iD7JshrMuMdfPTJ95wMiQMA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2014-07-25 20:27         ` Andy Lutomirski
@ 2014-07-29  4:51         ` Serge E. Hallyn
  1 sibling, 0 replies; 384+ messages in thread
From: Serge E. Hallyn @ 2014-07-29  4:51 UTC (permalink / raw)
  To: Aditya Kali
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andy Lutomirski,
	Ingo Molnar, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA

Quoting Aditya Kali (adityakali@google.com):
> Thank you for your review. I have tried to respond to both your emails here.
> 
> On Thu, Jul 24, 2014 at 9:36 AM, Serge Hallyn <serge.hallyn@ubuntu.com> wrote:
> > Quoting Aditya Kali (adityakali@google.com):
> >> Background
> >>   Cgroups and Namespaces are used together to create “virtual”
> >>   containers that isolate the host environment from the processes
> >>   running in the container. But since cgroups themselves are not
> >>   “virtualized”, the task is always able to see the global cgroup view
> >>   through the cgroupfs mount and via the /proc/self/cgroup file.
> >>
> > Hi,
> >
> > A few questions/comments:
> >
> > 1. Based on this description, am I to understand that after doing a
> >    cgroupns unshare, 'mount -t cgroup cgroup /mnt' by default will
> >    still mount the global root cgroup?  Any plans on "changing" that?
> 
> This is suggested in the "Possible Extensions of CGROUPNS" section.
> More details below.
> 
> >    Will attempts to change settings of a cgroup which is not under
> >    our current ns be rejected?  (That should be easy to do given your
> >    patch 1/5).  Sorry if it's done in the set, I'm jumping around...
> >
> 
> Currently, only 'cgroup_attach_task', 'cgroup_mkdir' and
> 'cgroup_rmdir' of cgroups outside of cgroupns-root are prevented. The
> read/write of actual cgroup properties is not prevented. Usual
> permission checks continue to apply for those. I was hoping that
> should be enough, but see more comments towards the end.
> 
> > 2. What would be the repercussions of allowing cgroupns unshare so
> >    long as you have ns_capable(CAP_SYS_ADMIN) to the user_ns which
> >    created your current ns cgroup?  It'd be a shame if that wasn't
> >    on the roadmap.
> >
> 
> It's certainly on the roadmap, just that some logistics were not clear
> at this time. As pointed out by Andy Lutomirski on [PATCH 5/5] of this
> series, if we allow cgroupns creation to ns_capable(CAP_SYS_ADMIN)
> processes, we may need some kind of explicit permission from the
> cgroup subsystem to allow this. One approach could be an explicit

So long as you do ns_capable(cgroup_ns->user_ns, CAP_SYS_ADMIN) I think
you're fine.

The only real problem I can think of with unsharing a cgroup_ns is that
you could lock a setuid-root application someplace it wasn't expecting.
The above check guarantees that you were privileged enough that you'd
better be trusted in this user namespace.

(Unless there is some possible interaction I'm overlooking)
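
For anyone following along, this is roughly how I understand that check to
behave, written as a small Python model rather than kernel code. UserNS,
Cred and this ns_capable() are stand-ins invented for the sketch; only the
walk up the user namespace parents is meant to mirror the kernel's
capability logic, and I am assuming the usual rule that the owner of a user
namespace holds all capabilities in it.

```python
from dataclasses import dataclass
from typing import Optional, Set

CAP_SYS_ADMIN = 21  # value from <linux/capability.h>


@dataclass(eq=False)
class UserNS:
    parent: Optional["UserNS"]  # None for the initial user namespace
    owner_uid: int              # uid (in the parent) that created this ns


@dataclass(eq=False)
class Cred:
    user_ns: UserNS
    euid: int
    effective: Set[int]


def ns_capable(cred: Cred, target: UserNS, cap: int) -> bool:
    """Does `cred` hold `cap` with respect to the user namespace `target`?"""
    ns = target
    while ns is not None:
        if ns is cred.user_ns:
            # Same namespace: the effective capability set decides.
            return cap in cred.effective
        if ns.parent is cred.user_ns and ns.owner_uid == cred.euid:
            # The creator of a child user namespace is fully capable in it.
            return True
        ns = ns.parent  # otherwise try again one level up
    return False
```

The point is that a task which is only root inside a child user namespace
never passes the check against a parent namespace, so it could not unshare
a cgroupns owned by the host.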

> cgroup.may_unshare setting. Alternatively, the cgroup directory (which
> is going to become the cgroupns-root) ownership could also be used
> here. i.e., the process is ns_capable(CAP_SYS_ADMIN) && it owns the
> cgroup directory. There already seems to be a function that does
> something similar and might be sufficient:
> 
> /**
>  * capable_wrt_inode_uidgid - Check nsown_capable and uid and gid mapped
>  * @inode: The inode in question
>  * @cap: The capability in question
>  *
>  * Return true if the current task has the given capability targeted at
>  * its own user namespace and that the given inode's uid and gid are
>  * mapped into the current user namespace.
>  */
> bool capable_wrt_inode_uidgid(const struct inode *inode, int cap)
> 
> What do you think? We can enable this for non-init userns once this is
> decided on.

I don't think it's needed... (until you show how wrong I am above :)

> > 3. The un-namespaced view of /proc/self/cgroup from a sibling cgroupns
> >    makes me wonder whether it wouldn't be more appropriate to leave
> >    /proc/self/cgroup always un-filtered, and use /proc/self/nscgroup
> >    (or somesuch) to provide the namespaced view.  /proc/self/nscgroup
> >    would simply be empty (or say (invalid) or (unreachable)) from a
> >    sibling ns.  That will give criu and admin tools like lxc/docker all
> >    they need to do simple cgroup setup.
> >
> 
> It may work for lxc/docker and new applications that use the new
> interface. But it's difficult to change numerous existing user
> applications and libraries that depend on /proc/self/cgroup. Moreover,
> even with the new interface, /proc/self/cgroup will continue to leak
> system level cgroup information. And fixing this leak is critical to
> make the container migratable.
> 
> It's easy to correctly handle the read of /proc/<pid>/cgroup from a
> sibling cgroupns. Instead of showing unfiltered view, we could just
> not show anything (same behavior when the cgroup hierarchy is not
> mounted). Will that be more acceptable? I can make that change in the
> next version of this series.

It'll be acceptable so long as setns(CLONE_NEWCGROUP) is supported.
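
Just to pin down the behaviour we are talking about, here is a small Python
sketch of the path translation, working on plain strings. The function name
cgroup_path_in_ns() is made up for this mail, and returning None stands for
the "show nothing" case you propose for a sibling cgroupns (the posted
patches show the full path there instead).

```python
from typing import Optional


def cgroup_path_in_ns(real_path: str, ns_root: str) -> Optional[str]:
    """Return `real_path` as seen from a cgroupns rooted at `ns_root`.

    None means the cgroup is outside the reader's cgroupns-root, which is
    what a reader in a sibling namespace would hit.
    """
    root = ns_root.rstrip("/")  # "/" becomes "" so the global ns sees everything
    if real_path == ns_root or real_path == root:
        return "/"
    if real_path.startswith(root + "/"):
        return real_path[len(root):]
    return None
```

From the global cgroupns (root "/") the real path comes back unchanged; from
inside the container's cgroupns, /batchjobs/c_job_id1/sub_cgrp_1 becomes
/sub_cgrp_1.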

> >>   $ cat /proc/self/cgroup
> >>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
> >>
> >>   This exposure of cgroup names to the processes running inside a
> >>   container results in some problems:
> >>   (1) The container names are typically host-container-management-agent
> >>       (systemd, docker/libcontainer, etc.) data and leaking its name (or
> >>       leaking the hierarchy) reveals too much information about the host
> >>       system.
> >>   (2) It makes the container migration across machines (CRIU) more
> >>       difficult as the container names need to be unique across the
> >>       machines in the migration domain.
> >>   (3) It makes it difficult to run container management tools (like
> >>       docker/libcontainer, lmctfy, etc.) within virtual containers
> >>       without adding dependency on some state/agent present outside the
> >>       container.
> >>
> >>   Note that the feature proposed here is completely different than the
> >>   “ns cgroup” feature which existed in the linux kernel until recently.
> >>   The ns cgroup also attempted to connect cgroups and namespaces by
> >>   creating a new cgroup every time a new namespace was created. It did
> >>   not solve any of the above mentioned problems and was later dropped
> >>   from the kernel.
> >>
> >> Introducing CGroup Namespaces
> >>   With unified cgroup hierarchy
> >>   (Documentation/cgroups/unified-hierarchy.txt), the containers can now
> >>   have a much more coherent cgroup view and it's easy to associate a
> >>   container with a single cgroup. This also allows us to virtualize the
> >>   cgroup view for tasks inside the container.
> >>
> >>   The new CGroup Namespace allows a process to “unshare” its cgroup
> >>   hierarchy starting from the cgroup it's currently in.
> >>   For example:
> >>   $ cat /proc/self/cgroup
> >>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
> >>   $ ls -l /proc/self/ns/cgroup
> >>   lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
> >>   $ ~/unshare -c  # calls unshare(CLONE_NEWCGROUP) and exec’s /bin/bash
> >>   [ns]$ ls -l /proc/self/ns/cgroup
> >>   lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
> >>   # From within new cgroupns, process sees that it's in the root cgroup
> >>   [ns]$ cat /proc/self/cgroup
> >>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
> >>
> >>   # From global cgroupns:
> >>   $ cat /proc/<pid>/cgroup
> >>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
> >>
> >>   The virtualization of /proc/self/cgroup file combined with restricting
> >>   the view of cgroup hierarchy (by bind-mounting the
> >>   $CGROUP_MOUNT/batchjobs/c_job_id1/ directory to
> >>   $CONTAINER_CHROOT/sys/fs/cgroup/) should provide a completely isolated
> >>   cgroup view inside the container.
> >>
> >>   In its current simplistic form, the cgroup namespaces provide
> >>   the following behavior:
> >>
> >>   (1) The “root” cgroup for a cgroup namespace is the cgroup in which
> >>       the process calling unshare is running.
> >>       For example, if a process in /batchjobs/c_job_id1 cgroup calls unshare,
> >>       cgroup /batchjobs/c_job_id1 becomes the cgroupns-root.
> >>       For the init_cgroup_ns, this is the real root (“/”) cgroup
> >>       (identified in code as cgrp_dfl_root.cgrp).
> >>
> >>   (2) The cgroupns-root cgroup does not change even if the namespace
> >>       creator process later moves to a different cgroup.
> >>       $ ~/unshare -c # unshare cgroupns in some cgroup
> >>       [ns]$ cat /proc/self/cgroup
> >>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
> >>       [ns]$ mkdir sub_cgrp_1
> >>       [ns]$ echo 0 > sub_cgrp_1/cgroup.procs
> >>       [ns]$ cat /proc/self/cgroup
> >>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
> >>
> >>   (3) Each process gets its CGROUPNS specific view of
> >>       /proc/<pid>/cgroup.
> >>   (a) Processes running inside the cgroup namespace will be able to see
> >>       cgroup paths (in /proc/self/cgroup) only inside their root cgroup
> >>       [ns]$ sleep 100000 &  # From within unshared cgroupns
> >>       [1] 7353
> >>       [ns]$ echo 7353 > sub_cgrp_1/cgroup.procs
> >>       [ns]$ cat /proc/7353/cgroup
> >>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
> >>
> >>   (b) From global cgroupns, the real cgroup path will be visible:
> >>       $ cat /proc/7353/cgroup
> >>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
> >>
> >>   (c) From a sibling cgroupns, the real path will be visible:
> >>       [ns2]$ cat /proc/7353/cgroup
> >>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
> >>       (In correct container setup though, it should not be possible to
> >>        access PIDs in another container in the first place. This can be
> >>        changed if desired.)
> >>
> >>   (4) Processes inside a cgroupns are not allowed to move out of the
> >>       cgroupns-root. This is true even if a privileged process in global
> >>       cgroupns tries to move the process out of its cgroupns-root.
> >>
> >>       # From global cgroupns
> >>       $ cat /proc/7353/cgroup
> >>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
> >>       # cgroupns-root for 7353 is /batchjobs/c_job_id1
> >>       $ echo 7353 > batchjobs/c_job_id2/cgroup.procs
> >>       -bash: echo: write error: Operation not permitted
> >>
> >>   (5) setns() is not supported for cgroup namespace in the initial
> >>       version.
> >
> > This combined with the full-path reporting for peer ns cgroups could make
> > for fun antics when attaching to an existing container (since we'd have
> > to unshare into a new ns cgroup with the same root as the container).
> > I understand you are implying this will be fixed soon though.
> >
> 
> I am thinking that setns() will only be allowed if
> target_cgrpns->cgroupns_root is_descendant_of
> current_cgrpns->cgroupns_root. i.e., you will only be able to setns() to a
> cgroup namespace which is rooted deeper in hierarchy than your own (in
> addition to checking capable_wrt_inode_uidgid(target_cgrpns_inode)).

Certainly.

> In addition to this, we need to decide whether it's OK for setns() to
> also change the cgroup of the task. Consider the following example:
> 
> [A] ----> [B] ----> C
>     ----> D
> 
> [A] and [B] are cgroupns-roots. Now, if a task in Cgroup D (which is
> under cgroupns [A]) attempts to setns() to cgroupns [B], then its
> cgroup should change from /A/D to /A/B. I am concerned about the
> side-effects this might cause. Though otherwise, this is a very useful
> feature for containers. One could argue that this is similar to
> setns() to a mount-namespace which is pivot_root'd somewhere else (in
> which case, the attaching task's root "/" moves implicitly with
> setns).

This is what I'd expect.

> Alternatively, we could only allow setns() if
> target_cgrpns->cgroupns_root == current->cgroup . I.e., taking above
> example again, if process in Cgroup D wants to setns() to cgroupns
> [B], then it will first need to move to Cgroup B, and only then the
> setns() will succeed. This makes sure that there is no implicit cgroup
> move.

I'm ok with the restriction if it makes the patchset easier for you -
i.e. you not having to man-handle me into another cgroup.  Though I
wouldn't expect the locking for that to be an obstacle...

> WDYT? I haven't prototyped this yet, but will send out a patch after
> this series is accepted.

Either one is fine with me.

> >>   (6) When some thread from a multi-threaded process unshares its
> >>       cgroup-namespace, the new cgroupns gets applied to the entire
> >>       process (all the threads). This should be OK since
> >>       unified-hierarchy only allows process-level containerization. So
> >>       all the threads in the process will have the same cgroup. And both
> >>       - changing cgroups and unsharing namespaces - are protected under
> >>       threadgroup_lock(task).
> >>
> >>   (7) The cgroup namespace is alive as long as there is at least 1
> >>       process inside it. When the last process exits, the cgroup
> >>       namespace is destroyed. The cgroupns-root and the actual cgroups
> >>       remain though.
> >>
> >> Implementation
> >>   The current patch-set is based on top of Tejun's cgroup tree (for-next
> >>   branch). It's fairly non-intrusive and provides the above-mentioned
> >>   features.
> >>
> >> Possible extensions of CGROUPNS:
> >>   (1) The Documentation/cgroups/unified-hierarchy.txt mentions use of
> >>       capabilities to restrict cgroups to administrative users. CGroup
> >>       namespaces could be of help here. With cgroup namespaces, it might
> >>       be possible to delegate administration of sub-cgroups under a
> >>       cgroupns-root to the cgroupns owner.
> >
> > That would be nice.
> >
> >>   (2) Provide a cgroupns specific cgroupfs mount. i.e., the following
> >>       command when run from inside a cgroupns should only mount the
> >>       hierarchy from cgroupns-root cgroup:
> >>       $ mount -t cgroup cgroup <cgroup-mountpoint>
> >>       # -o __DEVEL__sane_behavior should be implicit
> >>
> >>       This is similar to how procfs can be mounted for every PIDNS. This
> >>       may have some usecases.
> >
> > Sorry - I see this answers the first part of a question in my previous email.
> > However, the question remains of whether changes to limits in cgroups which
> > are not under our cgroup-ns-root are allowed.
> >
> > Admittedly the current case with cgmanager is the same - in that it depends
> > on proper setup of the container - but cgmanager is geared to recommend
> > not mounting the cgroups in the container at all (and we can reject such
> > mounts in the container altogether with no loss in functionality) whereas
> > you are here encouraging such mounts.  Which is fine - so long as you then
> > fully address the potential issues.
> 
> It will be nice to have this, but frankly, it may add a bit of
> complexity in the cgroup/kernfs code (I will have to prototype and
> see). Also same behavior can be obtained simply by bind-mounting
> cgroupns-root inside the container. So I am currently leaning
> towards rejecting such mounts in favor of simplicity.

Not having to track what to bind-mount where is a very nice
simplification though.  In lxc with cgmanager, we are now able to always
simply bind-mount /sys/fs/cgroup/cgmanager from the host into the
container.  Nothing more needed for the container to be able to manage
its own cgroup and start its own containers.  Likewise, if mount -t
cgroup were filtered to cgroupns, then lxc could simply not mount
anything into the container at all.  If mount -t cgroup is not
filtered wrt cgroupns, then we'd have to go back to, at container start,
finding the mountpoint for every subsystem, calculating the container's
cgroup there, and bind-mounting them into the container.
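
To give an idea of the bookkeeping, the first step of that older approach
is parsing /proc/self/cgroup. A minimal Python sketch follows, assuming only
the "id:controllers:path" line format shown earlier in this thread;
CgroupEntry and parse_proc_cgroup() are illustrative names, and the sketch
does not attempt the mountpoint lookup or the bind mounts themselves.

```python
from typing import List, NamedTuple


class CgroupEntry(NamedTuple):
    hierarchy_id: int
    controllers: List[str]
    path: str


def parse_proc_cgroup(text: str) -> List[CgroupEntry]:
    """Parse the contents of a /proc/<pid>/cgroup file into entries."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split at most twice so a ':' inside the cgroup path is preserved.
        hier, ctrls, path = line.split(":", 2)
        controllers = ctrls.split(",") if ctrls else []
        entries.append(CgroupEntry(int(hier), controllers, path))
    return entries
```

Every entry then still has to be matched against a mountpoint before
anything can be bind-mounted, which is the part a cgroupns-aware
mount -t cgroup would make unnecessary.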

> Regarding disallowing writes to cgroup files outside of your
> cgroupns-root, I think it should be possible to implement it easily. I
> will include it in the next revision of this series.

Great - thanks.

-serge
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCH 0/5] RFC: CGroup Namespaces
       [not found]       ` <CAGr1F2GcAema-E2q6PFj=R0Z505iD7JshrMuMdfPTJ95wMiQMA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2014-07-29  4:51         ` Serge E. Hallyn
  2014-07-29  4:51         ` Serge E. Hallyn
  1 sibling, 0 replies; 384+ messages in thread
From: Serge E. Hallyn @ 2014-07-29  4:51 UTC (permalink / raw)
  To: Aditya Kali
  Cc: Serge Hallyn, Linux API, Linux Containers, linux-kernel,
	Andy Lutomirski, Tejun Heo, cgroups, Ingo Molnar

Quoting Aditya Kali (adityakali@google.com):
> Thank you for your review. I have tried to respond to both your emails here.
> 
> On Thu, Jul 24, 2014 at 9:36 AM, Serge Hallyn <serge.hallyn@ubuntu.com> wrote:
> > Quoting Aditya Kali (adityakali@google.com):
> >> Background
> >>   Cgroups and Namespaces are used together to create “virtual”
> >>   containers that isolate the host environment from the processes
> >>   running in the container. But since cgroups themselves are not
> >>   “virtualized”, the task is always able to see the global cgroup view
> >>   through the cgroupfs mount and via the /proc/self/cgroup file.
> >>
> > Hi,
> >
> > A few questions/comments:
> >
> > 1. Based on this description, am I to understand that after doing a
> >    cgroupns unshare, 'mount -t cgroup cgroup /mnt' by default will
> >    still mount the global root cgroup?  Any plans on "changing" that?
> 
> This is suggested in the "Possible Extensions of CGROUPNS" section.
> More details below.
> 
> >    Will attempts to change settings of a cgroup which is not under
> >    our current ns be rejected?  (That should be easy to do given your
> >    patch 1/5).  Sorry if it's done in the set, I'm jumping around...
> >
> 
> Currently, only 'cgroup_attach_task', 'cgroup_mkdir' and
> 'cgroup_rmdir' of cgroups outside of cgroupns-root are prevented. The
> read/write of actual cgroup properties is not prevented. Usual
> permission checks continue to apply for those. I was hoping that
> should be enough, but see more comments towards the end.
> 
> > 2. What would be the repercussions of allowing cgroupns unshare so
> >    long as you have ns_capable(CAP_SYS_ADMIN) to the user_ns which
> >    created your current ns cgroup?  It'd be a shame if that wasn't
> >    on the roadmap.
> >
> 
> It's certainly on the roadmap, just that some logistics were not clear
> at this time. As pointed out by Andy Lutomirski on [PATCH 5/5] of this
> series, if we allow cgroupns creation to ns_capable(CAP_SYS_ADMIN)
> processes, we may need some kind of explicit permission from the
> cgroup subsystem to allow this. One approach could be an explicit

So long as you do ns_capable(cgroup_ns->user_ns, CAP_SYS_ADMIN) I think
you're fine.

The only real problem I can think of with unsharing a cgroup_ns is that
you could lock a setuid-root application someplace it wasn't expecting.
The above check guarantees that you were privileged enough that you'd
better be trusted in this user namespace.

(Unless there is some possible interaction I'm overlooking)

> cgroup.may_unshare setting. Alternatively, the cgroup directory (which
> is going to become the cgroupns-root) ownership could also be used
> here. i.e., the process is ns_capable(CAP_SYS_ADMIN) && it owns the
> cgroup directory. There already seems to be a function that does
> something similar and might be sufficient:
> 
> /**
>  * capable_wrt_inode_uidgid - Check nsown_capable and uid and gid mapped
>  * @inode: The inode in question
>  * @cap: The capability in question
>  *
>  * Return true if the current task has the given capability targeted at
>  * its own user namespace and that the given inode's uid and gid are
>  * mapped into the current user namespace.
>  */
> bool capable_wrt_inode_uidgid(const struct inode *inode, int cap)
> 
> What do you think? We can enable this for non-init userns once this is
> decided on.

I don't think it's needed... (until you show how wrong I am above :)

> > 3. The un-namespaced view of /proc/self/cgroup from a sibling cgroupns
> >    makes me wonder whether it wouldn't be more appropriate to leave
> >    /proc/self/cgroup always un-filtered, and use /proc/self/nscgroup
> >    (or somesuch) to provide the namespaced view.  /proc/self/nscgroup
> >    would simply be empty (or say (invalid) or (unreachable)) from a
> >    sibling ns.  That will give criu and admin tools like lxc/docker all
> >    they need to do simple cgroup setup.
> >
> 
> It may work for lxc/docker and new applications that use the new
> interface. But it's difficult to change numerous existing user
> applications and libraries that depend on /proc/self/cgroup. Moreover,
> even with the new interface, /proc/self/cgroup will continue to leak
> system level cgroup information. And fixing this leak is critical to
> make the container migratable.
> 
> It's easy to correctly handle the read of /proc/<pid>/cgroup from a
> sibling cgroupns. Instead of showing unfiltered view, we could just
> not show anything (same behavior when the cgroup hierarchy is not
> mounted). Will that be more acceptable? I can make that change in the
> next version of this series.

It'll be acceptable so long as setns(CLONE_NEWCGROUP) is supported.

> >>   $ cat /proc/self/cgroup
> >>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
> >>
> >>   This exposure of cgroup names to the processes running inside a
> >>   container results in some problems:
> >>   (1) The container names are typically host-container-management-agent
> >>       (systemd, docker/libcontainer, etc.) data and leaking its name (or
> >>       leaking the hierarchy) reveals too much information about the host
> >>       system.
> >>   (2) It makes the container migration across machines (CRIU) more
> >>       difficult as the container names need to be unique across the
> >>       machines in the migration domain.
> >>   (3) It makes it difficult to run container management tools (like
> >>       docker/libcontainer, lmctfy, etc.) within virtual containers
> >>       without adding dependency on some state/agent present outside the
> >>       container.
> >>
> >>   Note that the feature proposed here is completely different than the
> >>   “ns cgroup” feature which existed in the linux kernel until recently.
> >>   The ns cgroup also attempted to connect cgroups and namespaces by
> >>   creating a new cgroup every time a new namespace was created. It did
> >>   not solve any of the above mentioned problems and was later dropped
> >>   from the kernel.
> >>
> >> Introducing CGroup Namespaces
> >>   With unified cgroup hierarchy
> >>   (Documentation/cgroups/unified-hierarchy.txt), the containers can now
> >>   have a much more coherent cgroup view and it's easy to associate a
> >>   container with a single cgroup. This also allows us to virtualize the
> >>   cgroup view for tasks inside the container.
> >>
> >>   The new CGroup Namespace allows a process to “unshare” its cgroup
> >>   hierarchy starting from the cgroup it's currently in.
> >>   For Ex:
> >>   $ cat /proc/self/cgroup
> >>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
> >>   $ ls -l /proc/self/ns/cgroup
> >>   lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
> >>   $ ~/unshare -c  # calls unshare(CLONE_NEWCGROUP) and exec’s /bin/bash
> >>   [ns]$ ls -l /proc/self/ns/cgroup
> >>   lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
> >>   # From within new cgroupns, process sees that it's in the root cgroup
> >>   [ns]$ cat /proc/self/cgroup
> >>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
> >>
> >>   # From global cgroupns:
> >>   $ cat /proc/<pid>/cgroup
> >>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
> >>
> >>   The virtualization of /proc/self/cgroup file combined with restricting
> >>   the view of cgroup hierarchy (by bind-mounting the
> >>   $CGROUP_MOUNT/batchjobs/c_job_id1/ directory to
> >>   $CONTAINER_CHROOT/sys/fs/cgroup/) should provide a completely isolated
> >>   cgroup view inside the container.
> >>
> >>   In its current simplistic form, the cgroup namespaces provide
> >>   the following behavior:
> >>
> >>   (1) The “root” cgroup for a cgroup namespace is the cgroup in which
> >>       the process calling unshare is running.
> >>       For ex. if a process in /batchjobs/c_job_id1 cgroup calls unshare,
> >>       cgroup /batchjobs/c_job_id1 becomes the cgroupns-root.
> >>       For the init_cgroup_ns, this is the real root (“/”) cgroup
> >>       (identified in code as cgrp_dfl_root.cgrp).
> >>
> >>   (2) The cgroupns-root cgroup does not change even if the namespace
> >>       creator process later moves to a different cgroup.
> >>       $ ~/unshare -c # unshare cgroupns in some cgroup
> >>       [ns]$ cat /proc/self/cgroup
> >>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
> >>       [ns]$ mkdir sub_cgrp_1
> >>       [ns]$ echo 0 > sub_cgrp_1/cgroup.procs
> >>       [ns]$ cat /proc/self/cgroup
> >>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
> >>
> >>   (3) Each process gets its CGROUPNS specific view of
> >>       /proc/<pid>/cgroup.
> >>   (a) Processes running inside the cgroup namespace will be able to see
> >>       cgroup paths (in /proc/self/cgroup) only inside their root cgroup
> >>       [ns]$ sleep 100000 &  # From within unshared cgroupns
> >>       [1] 7353
> >>       [ns]$ echo 7353 > sub_cgrp_1/cgroup.procs
> >>       [ns]$ cat /proc/7353/cgroup
> >>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
> >>
> >>   (b) From global cgroupns, the real cgroup path will be visible:
> >>       $ cat /proc/7353/cgroup
> >>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
> >>
> >>   (c) From a sibling cgroupns, the real path will be visible:
> >>       [ns2]$ cat /proc/7353/cgroup
> >>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
> >>       (In correct container setup though, it should not be possible to
> >>        access PIDs in another container in the first place. This can be
> >>        changed if desired.)
> >>
> >>   (4) Processes inside a cgroupns are not allowed to move out of the
> >>       cgroupns-root. This is true even if a privileged process in global
> >>       cgroupns tries to move the process out of its cgroupns-root.
> >>
> >>       # From global cgroupns
> >>       $ cat /proc/7353/cgroup
> >>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
> >>       # cgroupns-root for 7353 is /batchjobs/c_job_id1
> >>       $ echo 7353 > batchjobs/c_job_id2/cgroup.procs
> >>       -bash: echo: write error: Operation not permitted
> >>
> >>   (5) setns() is not supported for cgroup namespace in the initial
> >>       version.
> >
> > This combined with the full-path reporting for peer ns cgroups could make
> > for fun antics when attaching to an existing container (since we'd have
> > to unshare into a new ns cgroup with the same root as the container).
> > I understand you are implying this will be fixed soon though.
> >
> 
> I am thinking the setns() will be only allowed if
> target_cgrpns->cgroupns_root is_descendant_of
> current_cgrpns->cgroupns_root. i.e., you will only be able to setns to a
> cgroup namespace which is rooted deeper in hierarchy than your own (in
> addition to checking capable_wrt_inode_uidgid(target_cgrpns_inode)).

Certainly.
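
The rule quoted above can be sketched in user space over plain path strings. The names cgroup_is_descendant() and cgroupns_may_setns() are hypothetical here, and the capability check is reduced to a flag. Whether a target rooted at exactly the caller's own cgroupns-root should be accepted is not spelled out in the thread; this sketch assumes it is.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: 1 if cgroup path 'child' is 'ancestor' or below it. */
static int cgroup_is_descendant(const char *child, const char *ancestor)
{
    size_t n = strlen(ancestor);

    assert(child[0] == '/' && ancestor[0] == '/');
    if (n == 1)
        return 1;                 /* "/" is an ancestor of every cgroup */
    return strncmp(child, ancestor, n) == 0 &&
           (child[n] == '\0' || child[n] == '/');
}

/*
 * Hypothetical model of the proposed setns() check: the target namespace
 * must be rooted at or below the caller's own cgroupns-root, and the caller
 * must also pass the capability check (modelled here as a plain flag).
 */
static int cgroupns_may_setns(const char *current_root,
                              const char *target_root, int capable)
{
    return capable && cgroup_is_descendant(target_root, current_root);
}
```

A task whose namespace is rooted at /A could enter one rooted at /A/B, but not the reverse. The component-boundary test keeps /AB from being mistaken for a child of /A.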

> In addition to this, we need to decide whether it's OK for setns() to
> also change the cgroup of the task. Consider the following example:
> 
> [A] ----> [B] ----> C
>     ----> D
> 
> [A] and [B] are cgroupns-roots. Now, if a task in Cgroup D (which is
> under cgroupns [A]) attempts to setns() to cgroupns [B], then its
> cgroup should change from /A/D to /A/B. I am concerned about the
> side-effects this might cause. Though otherwise, this is a very useful
> feature for containers. One could argue that this is similar to
> setns() to a mount-namespace which is pivot_root'd somewhere else (in
> which case, the attaching task's root "/" moves implicitly with
> setns).

This is what I'd expect.

> Alternatively, we could only allow setns() if
> target_cgrpns->cgroupns_root == current->cgroup . I.e., taking above
> example again, if process in Cgroup D wants to setns() to cgroupns
> [B], then it will first need to move to Cgroup B, and only then the
> setns() will succeed. This makes sure that there is no implicit cgroup
> move.

I'm ok with the restriction if it makes the patchset easier for you -
i.e. you not having to man-handle me into another cgroup.  Though I
wouldn't expect the locking for that to be an obstacle...

> WDYT? I haven't prototyped this yet, but will send out a patch after
> this series is accepted.

Either one is fine with me.
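
The two alternatives discussed above differ only in what happens when the caller is not already in the target cgroupns-root. A minimal user-space model follows, with hypothetical names throughout (enum setns_policy, cgroupns_setns_cgroup()). The -EPERM return is an assumption, and the implicit-move variant here always lands the task on the target root, which the thread settles only for the /A/D to /A/B case.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

enum setns_policy {
    SETNS_IMPLICIT_MOVE,   /* setns() also moves the task to the target root */
    SETNS_REQUIRE_AT_ROOT, /* task must already be in the target root cgroup */
};

/*
 * Hypothetical model: on success returns 0 and stores the task's cgroup
 * after setns() in *new_cgrp; on failure returns -EPERM (assumed errno)
 * and leaves *new_cgrp untouched.
 */
static int cgroupns_setns_cgroup(enum setns_policy policy,
                                 const char *task_cgrp,
                                 const char *target_root,
                                 const char **new_cgrp)
{
    assert(task_cgrp && target_root && new_cgrp);
    if (policy == SETNS_REQUIRE_AT_ROOT &&
        strcmp(task_cgrp, target_root) != 0)
        return -EPERM;
    *new_cgrp = target_root;
    return 0;
}
```

For a task in /A/D entering the namespace rooted at /A/B, the first policy succeeds and reports /A/B as the new cgroup. The second fails until the task has been moved to /A/B by hand.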

> >>   (6) When some thread from a multi-threaded process unshares its
> >>       cgroup-namespace, the new cgroupns gets applied to the entire
> >>       process (all the threads). This should be OK since
> >>       unified-hierarchy only allows process-level containerization. So
> >>       all the threads in the process will have the same cgroup. And both
> >>       - changing cgroups and unsharing namespaces - are protected under
> >>       threadgroup_lock(task).
> >>
> >>   (7) The cgroup namespace is alive as long as there is at least 1
> >>       process inside it. When the last process exits, the cgroup
> >>       namespace is destroyed. The cgroupns-root and the actual cgroups
> >>       remain though.
> >>
> >> Implementation
> >>   The current patch-set is based on top of Tejun's cgroup tree (for-next
> >>   branch). It's fairly non-intrusive and provides above mentioned
> >>   features.
> >>
> >> Possible extensions of CGROUPNS:
> >>   (1) The Documentation/cgroups/unified-hierarchy.txt mentions use of
> >>       capabilities to restrict cgroups to administrative users. CGroup
> >>       namespaces could be of help here. With cgroup namespaces, it might
> >>       be possible to delegate administration of sub-cgroups under a
> >>       cgroupns-root to the cgroupns owner.
> >
> > That would be nice.
> >
> >>   (2) Provide a cgroupns specific cgroupfs mount. i.e., the following
> >>       command when run from inside a cgroupns should only mount the
> >>       hierarchy from cgroupns-root cgroup:
> >>       $ mount -t cgroup cgroup <cgroup-mountpoint>
> >>       # -o __DEVEL__sane_behavior should be implicit
> >>
> >>       This is similar to how procfs can be mounted for every PIDNS. This
> >>       may have some usecases.
> >
> > Sorry - I see this answers the first part of a question in my previous email.
> > However, the question remains whether changes to limits in cgroups which
> > are not under our cgroup-ns-root are allowed.
> >
> > Admittedly the current case with cgmanager is the same - in that it depends
> > on proper setup of the container - but cgmanager is geared to recommend
> > not mounting the cgroups in the container at all (and we can reject such
> > mounts in the container altogether with no loss in functionality) whereas
> > you are here encouraging such mounts.  Which is fine - so long as you then
> > fully address the potential issues.
> 
> It will be nice to have this, but frankly, it may add a bit of
> complexity in the cgroup/kernfs code (I will have to prototype and
> see). Also same behavior can be obtained simply by bind-mounting
> cgroupns-root inside the container. So I am currently inclining
> towards rejecting such mounts in favor of simplicity.

Not having to track what to bind-mount where is a very nice
simplification though.  In lxc with cgmanager, we are now able to always
simply bind-mount /sys/fs/cgroup/cgmanager from the host into the
container.  Nothing more needed for the container to be able to manage
its own cgroup and start its own containers.  Likewise, if mount -t
cgroup were filtered to cgroupns, then lxc could simply not mount
anything into the container at all.  If mount -t cgroup is not
filtered wrt cgroupns, then we'd have to go back to, at container start,
finding the mountpoint for every subsystem, calculating the container's
cgroup there, and bind-mounting them into the container.
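
The "calculating the container's cgroup" step amounts to taking the path field of a /proc/<pid>/cgroup line and appending it to the hierarchy's mountpoint. A rough sketch, assuming the three-field "id:controllers:path" format shown in the examples in this thread and a line with its trailing newline already stripped. cgroup_line_path() and bind_mount_source() are hypothetical helpers, not lxc or kernel functions.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Splits one /proc/<pid>/cgroup line of the form
 * "hierarchy-id:controller-list:cgroup-path" and returns a pointer to the
 * cgroup path inside 'line', or NULL if the line is malformed.
 */
static const char *cgroup_line_path(const char *line)
{
    const char *p;

    assert(line != NULL);
    p = strchr(line, ':');          /* end of hierarchy id */
    if (p == NULL)
        return NULL;
    p = strchr(p + 1, ':');         /* end of controller list */
    if (p == NULL || p[1] != '/')
        return NULL;
    return p + 1;
}

/*
 * Builds the bind-mount source "<mountpoint><cgroup-path>" for a container.
 * Returns 0 on success, -1 on a malformed line or a short buffer.
 */
static int bind_mount_source(const char *mountpoint, const char *line,
                             char *out, size_t len)
{
    const char *path = cgroup_line_path(line);
    int n;

    if (path == NULL)
        return -1;
    n = snprintf(out, len, "%s%s", mountpoint, path);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

With a mountpoint of /sys/fs/cgroup (an assumed location) and the example line from the cover letter, the computed source is /sys/fs/cgroup/batchjobs/c_job_id1. This would have to be repeated per mounted hierarchy, which is the bookkeeping a cgroupns-aware mount would avoid.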

> Regarding disallowing writes to cgroup files outside of your
> cgroupns-root, I think it should be possible to implement it easily. I will
> include it in the next revision of this series.

Great - thanks.

-serge

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCH 0/5] RFC: CGroup Namespaces
  2014-07-29  4:51         ` Serge E. Hallyn
@ 2014-07-29 15:08             ` Andy Lutomirski
  -1 siblings, 0 replies; 384+ messages in thread
From: Andy Lutomirski @ 2014-07-29 15:08 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Ingo Molnar, Tejun Heo,
	cgroups-u79uwXL29TY76Z2rM5mHXA

On Mon, Jul 28, 2014 at 9:51 PM, Serge E. Hallyn <serge@hallyn.com> wrote:
> Quoting Aditya Kali (adityakali@google.com):
>> Thank you for your review. I have tried to respond to both your emails here.
>>
>> On Thu, Jul 24, 2014 at 9:36 AM, Serge Hallyn <serge.hallyn@ubuntu.com> wrote:
>> > Quoting Aditya Kali (adityakali@google.com):
>> >> Background
>> >>   Cgroups and Namespaces are used together to create “virtual”
>> >>   containers that isolates the host environment from the processes
>> >>   running in container. But since cgroups themselves are not
>> >>   “virtualized”, the task is always able to see global cgroups view
>> >>   through cgroupfs mount and via /proc/self/cgroup file.
>> >>
>> > Hi,
>> >
>> > A few questions/comments:
>> >
>> > 1. Based on this description, am I to understand that after doing a
>> >    cgroupns unshare, 'mount -t cgroup cgroup /mnt' by default will
>> >    still mount the global root cgroup?  Any plans on "changing" that?
>>
>> This is suggested in the "Possible Extensions of CGROUPNS" section.
>> More details below.
>>
>> >    Will attempts to change settings of a cgroup which is not under
>> >    our current ns be rejected?  (That should be easy to do given your
>> >    patch 1/5).  Sorry if it's done in the set, I'm jumping around...
>> >
>>
>> Currently, only 'cgroup_attach_task', 'cgroup_mkdir' and
>> 'cgroup_rmdir' of cgroups outside of cgroupns-root are prevented. The
>> read/write of actual cgroup properties are not prevented. Usual
>> permission checks continue to apply for those. I was hoping that
>> should be enough, but see more comments towards the end.
>>
>> > 2. What would be the repercussions of allowing cgroupns unshare so
>> >    long as you have ns_capable(CAP_SYS_ADMIN) to the user_ns which
>> >    created your current ns cgroup?  It'd be a shame if that wasn't
>> >    on the roadmap.
>> >
>>
>> It's certainly on the roadmap, just that some logistics were not clear
>> at this time. As pointed out by Andy Lutomirski on [PATCH 5/5] of this
>> series, if we allow cgroupns creation to ns_capable(CAP_SYS_ADMIN)
>> processes, we may need some kind of explicit permission from the
>> cgroup subsystem to allow this. One approach could be an explicit
>
> So long as you do ns_capable(cgroup_ns->user_ns, CAP_SYS_ADMIN) I think
> you're fine.
>
> The only real problem I can think of with unsharing a cgroup_ns is that
> you could lock a setuid-root application someplace it wasn't expecting.
> The above check guarantees that you were privileged enough that you'd
> better be trusted in this user namespace.
>
> (Unless there is some possible interaction I'm overlooking)

I think that, if it's done this way, you'd have to unshare cgroupns
before unsharing userns, since you forfeit that capability when you
unshare your userns.  That means that the new cgroupns ends up being
associated w/ the root userns, which may not be what you want.

You could unshare both namespaces in one syscall and give that some
magic semantics, but that's kind of weird.  It would be nice if you
could unshare your userns and temporarily retain caps in the parent,
but there is no such mechanism right now.

--Andy

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCH 0/5] RFC: CGroup Namespaces
  2014-07-29 15:08             ` Andy Lutomirski
@ 2014-07-29 16:06                 ` Serge E. Hallyn
  -1 siblings, 0 replies; 384+ messages in thread
From: Serge E. Hallyn @ 2014-07-29 16:06 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel, Ingo Molnar, Tejun Heo, cgroups

Quoting Andy Lutomirski (luto@amacapital.net):
> On Mon, Jul 28, 2014 at 9:51 PM, Serge E. Hallyn <serge@hallyn.com> wrote:
> > Quoting Aditya Kali (adityakali@google.com):
> >> Thank you for your review. I have tried to respond to both your emails here.
> >>
> >> On Thu, Jul 24, 2014 at 9:36 AM, Serge Hallyn <serge.hallyn@ubuntu.com> wrote:
> >> > Quoting Aditya Kali (adityakali@google.com):
> >> >> Background
> >> >>   Cgroups and Namespaces are used together to create “virtual”
> >> >>   containers that isolates the host environment from the processes
> >> >>   running in container. But since cgroups themselves are not
> >> >>   “virtualized”, the task is always able to see global cgroups view
> >> >>   through cgroupfs mount and via /proc/self/cgroup file.
> >> >>
> >> > Hi,
> >> >
> >> > A few questions/comments:
> >> >
> >> > 1. Based on this description, am I to understand that after doing a
> >> >    cgroupns unshare, 'mount -t cgroup cgroup /mnt' by default will
> >> >    still mount the global root cgroup?  Any plans on "changing" that?
> >>
> >> This is suggested in the "Possible Extensions of CGROUPNS" section.
> >> More details below.
> >>
> >> >    Will attempts to change settings of a cgroup which is not under
> >> >    our current ns be rejected?  (That should be easy to do given your
> >> >    patch 1/5).  Sorry if it's done in the set, I'm jumping around...
> >> >
> >>
> >> Currently, only 'cgroup_attach_task', 'cgroup_mkdir' and
> >> 'cgroup_rmdir' of cgroups outside of cgroupns-root are prevented. The
> >> read/write of actual cgroup properties are not prevented. Usual
> >> permission checks continue to apply for those. I was hoping that
> >> should be enough, but see more comments towards the end.
> >>
> >> > 2. What would be the repercussions of allowing cgroupns unshare so
> >> >    long as you have ns_capable(CAP_SYS_ADMIN) to the user_ns which
> >> >    created your current ns cgroup?  It'd be a shame if that wasn't
> >> >    on the roadmap.
> >> >
> >>
> >> It's certainly on the roadmap, just that some logistics were not clear
> >> at this time. As pointed out by Andy Lutomirski on [PATCH 5/5] of this
> >> series, if we allow cgroupns creation to ns_capable(CAP_SYS_ADMIN)
> >> processes, we may need some kind of explicit permission from the
> >> cgroup subsystem to allow this. One approach could be an explicit
> >
> > So long as you do ns_capable(cgroup_ns->user_ns, CAP_SYS_ADMIN) I think
> > you're fine.
> >
> > The only real problem I can think of with unsharing a cgroup_ns is that
> > you could lock a setuid-root application someplace it wasn't expecting.
> > The above check guarantees that you were privileged enough that you'd
> > better be trusted in this user namespace.
> >
> > (Unless there is some possible interaction I'm overlooking)
> 
> I think that, if it's done this way, you'd have to unshare cgroupns
> before unsharing userns, since you forfeit that capability when you
> unshare your userns.  That means that the new cgroupns ends up being
> associated w/ the root userns, which may not be what you want.
> 
> You could unshare both namespaces in one syscall and give that some
> magic semantics, but that's kind of weird.  It would be nice if you
> could unshare your userns and temporarily retain caps in the parent,
> but there is no such mechanism right now.

Hm, good point.

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCH 2/5] sched: new clone flag CLONE_NEWCGROUP for cgroup namespace
  2014-07-24 17:01       ` Serge Hallyn
  (?)
  (?)
@ 2014-07-31 19:48       ` Aditya Kali
  -1 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-07-31 19:48 UTC (permalink / raw)
  To: Serge Hallyn
  Cc: Linux API, Linux Containers, linux-kernel,
	Tejun Heo, cgroups, Ingo Molnar

On Thu, Jul 24, 2014 at 10:01 AM, Serge Hallyn <serge.hallyn@ubuntu.com> wrote:
> Quoting Aditya Kali (adityakali@google.com):
>> CLONE_NEWCGROUP will be used to create new cgroup namespace.
>>
>
> This is fine and I'm not looking to bikeshed, but am wondering - did
> you consider any other ways besides unshare (e.g. a new mount option
> to cgroupfs)?  If so, do you have a list of the downsides of those?
> (I mainly ask bc clone flags are still a scarce commodity)
>

I did consider a couple of other ways:

(1) Having a cgroup.ns_root (or something) cgroup file. If this value
is '1', it would mean that all processes in it and its descendant
cgroups will have their cgroup paths in /proc/self/cgroup terminated
at this cgroup.
For example:
[A] --> [B] --> C
    | --> [D] --> E

[A], [B] and [D] have cgroup.ns_root = 1.
* all processes in cgroups C & E will see their cgroup paths as /C and
/E respectively
* all processes in cgroups B & D will see their own cgroup path as /

In this model, it's easy to know what to show if a process is looking
at its own cgroup path (/proc/self/cgroup). It gets tricky when you
are looking at another process's /proc/<pid>/cgroup. We may be able to
come up with some hacky way to read the correct value, but depending
on the cgroupfs mount, it may not make sense.
One other major drawback of this approach is that "every" process in
the cgroup will now get a restricted view. i.e., you cannot change
cgroups without affecting your view. And this is undesirable for
administrative processes.
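
For what it's worth, the cgroup.ns_root rule from (1) can be modelled
in a few lines of Python. This is only an illustrative sketch: the
function name visible_cgroup_path is made up, and it assumes [A] sits
directly under the hierarchy root, so the full paths in the example
are /A/B/C and /A/D/E.

```python
from typing import Set


def visible_cgroup_path(cgroup_path: str, ns_roots: Set[str]) -> str:
    """Path a task in cgroup_path would see in /proc/self/cgroup
    under the hypothetical cgroup.ns_root scheme."""
    parts = [p for p in cgroup_path.split("/") if p]
    # Walk from the task's own cgroup up towards the hierarchy root
    # and stop at the nearest cgroup that has ns_root = 1.
    for depth in range(len(parts), 0, -1):
        candidate = "/" + "/".join(parts[:depth])
        if candidate in ns_roots:
            return "/" + "/".join(parts[depth:])
    # No ns_root cgroup on the way up: the full path stays visible.
    return cgroup_path


# [A], [B] and [D] have cgroup.ns_root = 1.
NS_ROOTS = {"/A", "/A/B", "/A/D"}
```

With NS_ROOTS as above, visible_cgroup_path("/A/B/C", NS_ROOTS) gives
"/C" and visible_cgroup_path("/A/B", NS_ROOTS) gives "/". Note that the
result depends only on the cgroup, not on the task, which is exactly
why every process in the cgroup gets the restricted view.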

(2) Another idea that I didn't pursue further (and is a bit hacky, as
above) was having cgroup.ns_procs (like cgroup.procs, but all the pids
in cgroup.ns_procs will have their /proc/self/cgroup restricted).
Writing a pid to cgroup.ns_procs implies that you are writing it to
cgroup.procs too, but not vice versa. So, you could move yourself into
another cgroup by writing your pid to cgroup.procs, but not to
cgroup.ns_procs, thus avoiding getting "rooted". This was to solve
the administrative process issue in the above approach. But I think
this is very clunky too, and I find the semantics of this approach to
be non-intuitive. It almost looks like moving towards a separate "ns"
subsystem. But as we already know, it's a path to failure.
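
The cgroup.ns_procs idea from (2) can be sketched the same way. Again
this is a toy model with made-up names (Hierarchy, write_procs,
write_ns_procs), not a proposed interface. What happens to an already
"rooted" pid that later moves outside its root was never specified, so
the model simply shows the full path in that case; treat that as an
assumption.

```python
class Hierarchy:
    """Toy model of a hierarchy with a hypothetical cgroup.ns_procs."""

    def __init__(self):
        self.member_of = {}  # pid -> cgroup path (cgroup.procs)
        self.rooted_at = {}  # pid -> cgroup path (cgroup.ns_procs)

    def write_procs(self, cgroup, pid):
        # Plain move: changes membership, never "roots" the pid.
        self.member_of[pid] = cgroup

    def write_ns_procs(self, cgroup, pid):
        # Rooting a pid implies moving it into the cgroup as well.
        self.rooted_at[pid] = cgroup
        self.write_procs(cgroup, pid)

    def proc_self_cgroup(self, pid):
        path = self.member_of[pid]
        root = self.rooted_at.get(pid)
        if root is not None and (path == root
                                 or path.startswith(root + "/")):
            # Restricted view: path relative to the pid's root.
            return path[len(root):] or "/"
        # Not rooted (or outside its root, by assumption): full path.
        return path
```

An administrative process that writes its pid only to cgroup.procs
keeps the full path, while a pid written to cgroup.ns_procs sees "/"
for that cgroup and relative paths below it.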

I didn't think of using a mount option. I imagine the mount option
(something like -o root=/batchjobs/container_1) could be used to
restrict the visibility of cgroupfs inside the container's mount
namespace, i.e., the value you read from /proc/<pid>/cgroup now
depends on what mount namespace you are in. It's similar to the cgroup
namespace, except that the cgroupns_root is now stored in the
'struct mnt_namespace' instead of a separate 'struct
cgroup_namespace'. But, since a mount namespace on creation inherits
mounts from its parent, the first cgroupfs mount in a mount namespace
is now treated specially. Also, it's no longer possible to restrict
cgroups without a mount namespace. This is interesting and may not be
too bad. I am willing to give this a try. But I feel the cgroup
namespace approach fits well in line with other namespaces in that it
does one thing: virtualize the view of the /proc/<pid>/cgroup file for
processes inside the namespace. The semantics are more intuitive as
they are similar to other namespaces.

Thanks,

>> Signed-off-by: Aditya Kali <adityakali@google.com>
>
> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
>
>> ---
>>  include/uapi/linux/sched.h | 3 +--
>>  1 file changed, 1 insertion(+), 2 deletions(-)
>>
>> diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
>> index 34f9d73..2f90d00 100644
>> --- a/include/uapi/linux/sched.h
>> +++ b/include/uapi/linux/sched.h
>> @@ -21,8 +21,7 @@
>>  #define CLONE_DETACHED               0x00400000      /* Unused, ignored */
>>  #define CLONE_UNTRACED               0x00800000      /* set if the tracing process can't force CLONE_PTRACE on this clone */
>>  #define CLONE_CHILD_SETTID   0x01000000      /* set the TID in the child */
>> -/* 0x02000000 was previously the unused CLONE_STOPPED (Start in stopped state)
>> -   and is now available for re-use. */
>> +#define CLONE_NEWCGROUP              0x02000000      /* New cgroup namespace */
>>  #define CLONE_NEWUTS         0x04000000      /* New utsname group? */
>>  #define CLONE_NEWIPC         0x08000000      /* New ipcs */
>>  #define CLONE_NEWUSER                0x10000000      /* New user namespace */
>> --
>> 2.0.0.526.g5318336
>>



-- 
Aditya

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCH 2/5] sched: new clone flag CLONE_NEWCGROUP for cgroup namespace
       [not found]         ` <CAGr1F2FAiSFR_Y3t1=eBVoAtJvh4m=cNUi+vG146nDkgtBjisQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2014-08-04 23:12           ` Serge Hallyn
  0 siblings, 0 replies; 384+ messages in thread
From: Serge Hallyn @ 2014-08-04 23:12 UTC (permalink / raw)
  To: Aditya Kali
  Cc: Tejun Heo, Li Zefan, cgroups, linux-kernel, Linux API,
	Ingo Molnar, Linux Containers

Quoting Aditya Kali (adityakali@google.com):
> On Thu, Jul 24, 2014 at 10:01 AM, Serge Hallyn <serge.hallyn@ubuntu.com> wrote:
> > Quoting Aditya Kali (adityakali@google.com):
> >> CLONE_NEWCGROUP will be used to create new cgroup namespace.
> >>
> >
> > This is fine and I'm not looking to bikeshed, but am wondering - did
> > you consider any other ways besides unshare (e.g. a new mount option
> > to cgroupfs)?  If so, do you have a list of the downsides of those?
> > (I mainly ask bc clone flags are still a scarce commodity)
> >
> 
> I did consider couple of other ways:
> 
> (1) having a cgroup.ns_root (or something) cgroup file. If this value
> is '1', it would mean that all processes it and its descendant cgroups
> will have their cgroup paths in /proc/self/cgroup terminated at this
> cgroup.
>  For ex:
> [A] --> [B] --> C
>     | --> [D] --> E
> 
> [A], [B] and [D] has cgroup.ns_root = 1.
> * all processes in cgroup C & E will see their cgroup path as /C and
> /E respectively
> * all processes in cgroup B & D will see their own cgroup path as /
> 
> In this model, it's easy to know what to show if a process is looking at
> its own cgroup paths (/proc/self/cgroup). It gets tricky when you are
> looking at another process's /proc/<pid>/cgroup. We may be able to come
> up with some hacky way to read the correct value, but depending on the
> cgroupfs mount, it may not make sense.
> One other major drawback of this approach is that "every" process in
> the cgroup will now get a restricted view. i.e., you cannot change
> cgroups without affecting your view. And this is undesirable for
> administrative processes.
> 
> (2) Another idea that I didn't pursue further (and is a bit hacky as
> above) was having cgroup.ns_procs (like cgroup.procs, but all the pids
> in cgroup.ns_procs will have their /proc/self/cgroup restricted).
> Writing a pid to cgroup.ns_procs implies that you are writing it to
> cgroup.procs too, but not vice versa. So, you could move yourself to
> another cgroup by writing your pid to cgroup.procs, but not to
> cgroup.ns_procs, thus avoiding getting "rooted". This was to solve the
> administrative process issue in the above approach. But I think
> this is very clunky too and I find the semantics of this approach to be
> non-intuitive. It almost looks like moving towards a separate "ns"
> subsystem. But as we already know, it's a path to failure.
> 
> I didn't think of using a mount option. I imagine the mount option
> (something like -o root=/batchjobs/container_1) could be used to
> restrict the visibility of cgroupfs inside the container's mount
> namespace. i.e., the value you read from /proc/<pid>/cgroup now
> depends on what mount namespace you are in. It's similar to cgroup
> namespace, except that the cgroupns_root is now stored in the
> 'struct mnt_namespace' instead of a separate 'struct
> cgroup_namespace'. But, since a mount namespace on creation inherits
> mounts from its parent, the first cgroupfs mount in a mount namespace
> is now treated specially. Also, it's not possible to restrict cgroups
> without a mount namespace now. This is interesting and may not be too
> bad. I am willing to give this a try. But I feel the cgroup namespace
> approach fits well in-line with other namespaces where it does one
> thing - virtualize the view of /proc/<pid>/cgroup file for processes
> inside the namespace. The semantics are more intuitive as they are
> similar to other namespaces.

Yeah, let's stick with what you have :)

thanks,
-serge


^ permalink raw reply	[flat|nested] 384+ messages in thread

* [PATCHv1 0/8] CGroup Namespaces
       [not found] <adityakali-cgroupns>
@ 2014-10-13 21:23   ` Aditya Kali
  2014-07-17 19:52 ` Aditya Kali
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-10-13 21:23 UTC (permalink / raw)
  To: tj-DgEjT+Ai2ygdnm+yROfE0A, lizefan-hv44wF8Li93QT0dZR+AlfA,
	serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA, luto-kltTT9wpgjJwATOyAt5JVQ,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, mingo-H+wXaHxf7aLQT0dZR+AlfA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Second take at the Cgroup Namespace patch-set.

Major changes from RFC (V0):
1. setns support for cgroupns
2. 'mount -t cgroup cgroup <mntpt>' from inside a cgroupns now
   mounts the cgroup hierarchy with cgroupns-root as the filesystem root.
3. writes to cgroup files outside of cgroupns-root are not allowed
4. visibility of /proc/<pid>/cgroup is further restricted by not showing
   anything if the <pid> is in a sibling cgroupns and its cgroup falls outside
   your cgroupns-root.

More details in the writeup below.

Background
  Cgroups and Namespaces are used together to create “virtual”
  containers that isolate the host environment from the processes
  running in the container. But since cgroups themselves are not
  “virtualized”, a task is always able to see the global cgroup view
  through the cgroupfs mount and via the /proc/self/cgroup file.

  $ cat /proc/self/cgroup 
  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1

  This exposure of cgroup names to the processes running inside a
  container results in some problems:
  (1) The container names are typically data private to the host's
      container management agent (systemd, docker/libcontainer, etc.),
      and leaking them (or the hierarchy) reveals too much information
      about the host system.
  (2) It makes the container migration across machines (CRIU) more
      difficult as the container names need to be unique across the
      machines in the migration domain.
  (3) It makes it difficult to run container management tools (like
      docker/libcontainer, lmctfy, etc.) within virtual containers
      without adding dependency on some state/agent present outside the
      container.

  Note that the feature proposed here is completely different from the
  “ns cgroup” feature which existed in the Linux kernel until recently.
  The ns cgroup also attempted to connect cgroups and namespaces by
  creating a new cgroup every time a new namespace was created. It did
  not solve any of the above-mentioned problems and was later dropped
  from the kernel. Incidentally though, it used the same config option
  name CONFIG_CGROUP_NS as used in my prototype!

Introducing CGroup Namespaces
  With the unified cgroup hierarchy
  (Documentation/cgroups/unified-hierarchy.txt), containers can now
  have a much more coherent cgroup view and it's easy to associate a
  container with a single cgroup. This also allows us to virtualize the
  cgroup view for tasks inside the container.

  The new CGroup Namespace allows a process to “unshare” its cgroup
  hierarchy starting from the cgroup it is currently in.
  For example:
  $ cat /proc/self/cgroup
  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
  $ ls -l /proc/self/ns/cgroup
  lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
  $ ~/unshare -c  # calls unshare(CLONE_NEWCGROUP) and exec’s /bin/bash
  [ns]$ ls -l /proc/self/ns/cgroup
  lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
  # From within the new cgroupns, the process sees that it is in the root cgroup
  [ns]$ cat /proc/self/cgroup
  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/

  # From global cgroupns:
  $ cat /proc/<pid>/cgroup
  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1

  # Unshare cgroupns along with userns and mountns
  # Following calls unshare(CLONE_NEWCGROUP|CLONE_NEWUSER|CLONE_NEWNS), then
  # sets up uid/gid map and exec’s /bin/bash
  $ ~/unshare -c -u -m

  # Originally, we were in /batchjobs/c_job_id1 cgroup. Mount our own cgroup
  # hierarchy.
  [ns]$ mount -t cgroup cgroup /tmp/cgroup
  [ns]$ ls -l /tmp/cgroup
  total 0
  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.controllers
  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.populated
  -rw-r--r-- 1 root root 0 2014-10-13 09:25 cgroup.procs
  -rw-r--r-- 1 root root 0 2014-10-13 09:32 cgroup.subtree_control

  The cgroupns-root (/batchjobs/c_job_id1 in above example) becomes the
  filesystem root for the namespace specific cgroupfs mount.

  The virtualization of /proc/self/cgroup file combined with restricting
  the view of cgroup hierarchy by namespace-private cgroupfs mount
  should provide a completely isolated cgroup view inside the container.
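
  The ~/unshare helper used in the session above is not included in this
  posting. As a rough sketch only, a program along the following lines
  would show the same before/after effect on /proc/self/cgroup. The
  function names are made up for illustration, and the fallback value for
  CLONE_NEWCGROUP is an assumption (it is the value later mainline kernels
  use); the authoritative definition for this series is in patch 2/8.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <string.h>

#ifndef CLONE_NEWCGROUP
/* Assumption: value used by later mainline kernels; see patch 2/8. */
#define CLONE_NEWCGROUP 0x02000000
#endif

/*
 * Given one "id:controllers:path" line from /proc/<pid>/cgroup, return a
 * pointer to the path part (everything after the second ':'), or NULL if
 * the line does not have that shape. Hypothetical helper.
 */
const char *cgroup_path_of_line(const char *line)
{
	const char *p = strchr(line, ':');

	if (!p)
		return NULL;
	p = strchr(p + 1, ':');
	return p ? p + 1 : NULL;
}

/* Print the cgroup path(s) of the calling process, prefixed with @tag. */
int print_own_cgroup(const char *tag)
{
	char line[4096];
	FILE *f = fopen("/proc/self/cgroup", "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		const char *path = cgroup_path_of_line(line);

		if (path)
			printf("%s: %s", tag, path); /* path keeps its '\n' */
	}
	fclose(f);
	return 0;
}

/* Show the path, unshare the cgroup namespace, show the path again. */
int show_cgroupns_effect(void)
{
	print_own_cgroup("before");
	if (unshare(CLONE_NEWCGROUP) != 0) {
		perror("unshare(CLONE_NEWCGROUP)");
		return -1;
	}
	print_own_cgroup("after");
	return 0;
}
```

  Calling show_cgroupns_effect() from main() in a suitably privileged
  process on a kernel carrying these patches should print the full path
  first and "/" afterwards. On kernels without cgroup namespace support,
  or without the needed privileges, unshare() is expected to fail
  (typically with EINVAL or EPERM).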

  In its current form, the cgroup namespaces patchset provides the
  following behavior:

  (1) The “root” cgroup for a cgroup namespace is the cgroup in which
      the process calling unshare is running.
      For ex. if a process in /batchjobs/c_job_id1 cgroup calls unshare,
      cgroup /batchjobs/c_job_id1 becomes the cgroupns-root.
      For the init_cgroup_ns, this is the real root (“/”) cgroup
      (identified in code as cgrp_dfl_root.cgrp).

  (2) The cgroupns-root cgroup does not change even if the namespace
      creator process later moves to a different cgroup.
      $ ~/unshare -c # unshare cgroupns in some cgroup
      [ns]$ cat /proc/self/cgroup 
      0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/ 
      [ns]$ mkdir sub_cgrp_1
      [ns]$ echo 0 > sub_cgrp_1/cgroup.procs
      [ns]$ cat /proc/self/cgroup 
      0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1

  (3) Each process gets its CGROUPNS specific view of
      /proc/<pid>/cgroup.
  (a) Processes running inside the cgroup namespace will be able to see
      cgroup paths (in /proc/self/cgroup) only inside their root cgroup
      [ns]$ sleep 100000 &  # From within unshared cgroupns
      [1] 7353
      [ns]$ echo 7353 > sub_cgrp_1/cgroup.procs
      [ns]$ cat /proc/7353/cgroup
      0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1

  (b) From global cgroupns, the real cgroup path will be visible:
      $ cat /proc/7353/cgroup
      0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1

  (c) From a sibling cgroupns (cgroupns root-ed at a sibling cgroup), no cgroup
      path will be visible:
      # ns2's cgroupns-root is at '/batchjobs/c_job_id2'
      [ns2]$ cat /proc/7353/cgroup
      [ns2]$
      This is the same as when the cgroup hierarchy is not mounted at all.
      (In a correct container setup though, it should not be possible to
       access PIDs in another container in the first place.)

  (4) Processes inside a cgroupns are not allowed to move out of the
      cgroupns-root. This is true even if a privileged process in global
      cgroupns tries to move the process out of its cgroupns-root.

      # From global cgroupns
      $ cat /proc/7353/cgroup
      0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
      # cgroupns-root for 7353 is /batchjobs/c_job_id1
      $ echo 7353 > batchjobs/c_job_id2/cgroup.procs
      -bash: echo: write error: Operation not permitted

  (5) Setns to another cgroup namespace is allowed only when:
      (a) the process has CAP_SYS_ADMIN in its current userns
      (b) the process has CAP_SYS_ADMIN in the target cgroupns' userns
      (c) the process's current cgroup is a descendant of the
          cgroupns-root of the target namespace.
      (d) the target cgroupns-root is a descendant of the current
          cgroupns-root.
      The last check (d) prevents processes from escaping their cgroupns-root by
      attaching to a parent cgroupns. Thus, setns is allowed only when the process
      is trying to restrict itself to a deeper cgroup hierarchy.

  (6) When some thread from a multi-threaded process unshares its
      cgroup-namespace, the new cgroupns gets applied to the entire
      process (all the threads). This should be OK since
      unified-hierarchy only allows process-level containerization. So
      all the threads in the process will have the same cgroup. And both
      - changing cgroups and unsharing namespaces - are protected under
      threadgroup_lock(task).

  (7) The cgroup namespace is alive as long as there is at least one
      process inside it. When the last process exits, the cgroup
      namespace is destroyed. The cgroupns-root and the actual cgroups
      remain though.

  (8) 'mount -t cgroup cgroup <mntpt>' when called from within cgroupns mounts
      the unified cgroup hierarchy with cgroupns-root as the filesystem root.
      The process needs CAP_SYS_ADMIN in its userns and mntns. This allows the
      container management tools to be run inside the containers transparently.
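
  Rules (3), (4) and (5) above all reduce to "is this cgroup at or below
  that cgroupns-root?". The following is a userspace model of those
  checks, working on path strings rather than on struct cgroup. It is not
  code from the patch-set: every name in it is hypothetical, the
  capability checks are reduced to booleans supplied by the caller, and
  "descendant" is assumed to include the cgroup itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True if @path is @root itself or lies somewhere below it. */
bool cgroup_path_is_descendant(const char *path, const char *root)
{
	size_t n = strlen(root);

	assert(path[0] == '/' && root[0] == '/');
	if (n == 1)
		return true;	/* "/" is an ancestor of everything */
	if (strncmp(path, root, n) != 0)
		return false;
	/* Reject "/a/b10" when root is "/a/b1". */
	return path[n] == '\0' || path[n] == '/';
}

/*
 * Rule (3): path of @cgrp as shown to a reader whose cgroupns is rooted
 * at @ns_root, or NULL when nothing is shown (case (c)).
 */
const char *cgroupns_visible_path(const char *cgrp, const char *ns_root)
{
	size_t n = strlen(ns_root);

	if (!cgroup_path_is_descendant(cgrp, ns_root))
		return NULL;
	if (n == 1)
		return cgrp;		/* global view: real path */
	if (cgrp[n] == '\0')
		return "/";		/* reader looks at its own root */
	return cgrp + n;		/* strip the cgroupns-root prefix */
}

/* Rule (4): a task may only be moved to a cgroup under its cgroupns-root. */
bool cgroupns_move_allowed(const char *task_ns_root, const char *dst_cgrp)
{
	return cgroup_path_is_descendant(dst_cgrp, task_ns_root);
}

/* Inputs for rule (5); the two booleans stand in for checks (a) and (b). */
struct setns_request {
	bool cap_in_current_userns;
	bool cap_in_target_userns;
	const char *task_cgroup;	/* real path of the caller's cgroup */
	const char *current_ns_root;	/* caller's present cgroupns-root */
	const char *target_ns_root;	/* cgroupns-root of the target ns */
};

/* Rule (5): all four conditions (a)-(d) must hold. */
bool cgroupns_setns_allowed(const struct setns_request *rq)
{
	return rq->cap_in_current_userns &&
	       rq->cap_in_target_userns &&
	       cgroup_path_is_descendant(rq->task_cgroup, rq->target_ns_root) &&
	       cgroup_path_is_descendant(rq->target_ns_root, rq->current_ns_root);
}
```

  With the paths from the examples above, a reader rooted at
  /batchjobs/c_job_id1 gets /sub_cgrp_1 for PID 7353, a reader rooted at
  /batchjobs/c_job_id2 gets nothing, and an attempt to setns into a
  namespace rooted at "/" from inside /batchjobs/c_job_id1 fails check (d).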

Implementation
  The current patch-set is based on top of Tejun Heo's cgroup tree (for-next
  branch). It is fairly non-intrusive and provides the above-mentioned
  features.

Possible extensions of CGROUPNS:
  (1) The Documentation/cgroups/unified-hierarchy.txt mentions use of
      capabilities to restrict cgroups to administrative users. CGroup
      namespaces could be of help here. With cgroup namespaces, it might
      be possible to delegate administration of sub-cgroups under a
      cgroupns-root to the cgroupns owner.


---
 fs/kernfs/dir.c                  |  53 +++++++++---
 fs/kernfs/mount.c                |  48 +++++++++++
 fs/proc/namespaces.c             |   3 +
 include/linux/cgroup.h           |  41 +++++++++-
 include/linux/cgroup_namespace.h |  62 +++++++++++++++
 include/linux/kernfs.h           |   5 ++
 include/linux/nsproxy.h          |   2 +
 include/linux/proc_ns.h          |   4 +
 include/uapi/linux/sched.h       |   3 +-
 init/Kconfig                     |   9 +++
 kernel/Makefile                  |   1 +
 kernel/cgroup.c                  | 139 ++++++++++++++++++++++++++------
 kernel/cgroup_namespace.c        | 168 +++++++++++++++++++++++++++++++++++++++
 kernel/fork.c                    |   2 +-
 kernel/nsproxy.c                 |  19 ++++-
 15 files changed, 518 insertions(+), 41 deletions(-)
 create mode 100644 include/linux/cgroup_namespace.h
 create mode 100644 kernel/cgroup_namespace.c

[PATCHv1 1/8] kernfs: Add API to generate relative kernfs path
[PATCHv1 2/8] sched: new clone flag CLONE_NEWCGROUP for cgroup
[PATCHv1 3/8] cgroup: add function to get task's cgroup on default
[PATCHv1 4/8] cgroup: export cgroup_get() and cgroup_put()
[PATCHv1 5/8] cgroup: introduce cgroup namespaces
[PATCHv1 6/8] cgroup: restrict cgroup operations within task's cgroupns
[PATCHv1 7/8] cgroup: cgroup namespace setns support
[PATCHv1 8/8] cgroup: mount cgroupns-root when inside non-init cgroupns

^ permalink raw reply	[flat|nested] 384+ messages in thread

* [PATCHv1 1/8] kernfs: Add API to generate relative kernfs path
       [not found]   ` <1413235430-22944-1-git-send-email-adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2014-10-13 21:23     ` Aditya Kali
  2014-10-13 21:23       ` Aditya Kali
                       ` (8 subsequent siblings)
  9 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-10-13 21:23 UTC (permalink / raw)
  To: tj-DgEjT+Ai2ygdnm+yROfE0A, lizefan-hv44wF8Li93QT0dZR+AlfA,
	serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA, luto-kltTT9wpgjJwATOyAt5JVQ,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, mingo-H+wXaHxf7aLQT0dZR+AlfA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

The new function kernfs_path_from_node() generates and returns the
kernfs path of a given kernfs_node relative to a given ancestor
kernfs_node.

Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
 fs/kernfs/dir.c        | 53 ++++++++++++++++++++++++++++++++++++++++----------
 include/linux/kernfs.h |  3 +++
 2 files changed, 46 insertions(+), 10 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index a693f5b..8655485 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -44,14 +44,24 @@ static int kernfs_name_locked(struct kernfs_node *kn, char *buf, size_t buflen)
 	return strlcpy(buf, kn->parent ? kn->name : "/", buflen);
 }
 
-static char * __must_check kernfs_path_locked(struct kernfs_node *kn, char *buf,
-					      size_t buflen)
+static char * __must_check kernfs_path_from_node_locked(
+	struct kernfs_node *kn_root,
+	struct kernfs_node *kn,
+	char *buf,
+	size_t buflen)
 {
 	char *p = buf + buflen;
 	int len;
 
+	BUG_ON(!buflen);
+
 	*--p = '\0';
 
+	if (kn == kn_root) {
+		*--p = '/';
+		return p;
+	}
+
 	do {
 		len = strlen(kn->name);
 		if (p - buf < len + 1) {
@@ -63,6 +73,8 @@ static char * __must_check kernfs_path_locked(struct kernfs_node *kn, char *buf,
 		memcpy(p, kn->name, len);
 		*--p = '/';
 		kn = kn->parent;
+		if (kn == kn_root)
+			break;
 	} while (kn && kn->parent);
 
 	return p;
@@ -92,26 +104,47 @@ int kernfs_name(struct kernfs_node *kn, char *buf, size_t buflen)
 }
 
 /**
- * kernfs_path - build full path of a given node
+ * kernfs_path_from_node - build path of node @kn relative to @kn_root.
+ * @kn_root: parent kernfs_node relative to which we need to build the path
  * @kn: kernfs_node of interest
- * @buf: buffer to copy @kn's name into
+ * @buf: buffer to copy @kn's path into
  * @buflen: size of @buf
  *
- * Builds and returns the full path of @kn in @buf of @buflen bytes.  The
- * path is built from the end of @buf so the returned pointer usually
+ * Builds and returns @kn's path relative to @kn_root. @kn_root is expected to
+ * be parent of @kn at some level. If this is not true or if @kn_root is NULL,
+ * then full path of @kn is returned.
+ * The path is built from the end of @buf so the returned pointer usually
  * doesn't match @buf.  If @buf isn't long enough, @buf is nul terminated
  * and %NULL is returned.
  */
-char *kernfs_path(struct kernfs_node *kn, char *buf, size_t buflen)
+char *kernfs_path_from_node(struct kernfs_node *kn_root, struct kernfs_node *kn,
+			    char *buf, size_t buflen)
 {
 	unsigned long flags;
 	char *p;
 
 	spin_lock_irqsave(&kernfs_rename_lock, flags);
-	p = kernfs_path_locked(kn, buf, buflen);
+	p = kernfs_path_from_node_locked(kn_root, kn, buf, buflen);
 	spin_unlock_irqrestore(&kernfs_rename_lock, flags);
 	return p;
 }
+EXPORT_SYMBOL_GPL(kernfs_path_from_node);
+
+/**
+ * kernfs_path - build full path of a given node
+ * @kn: kernfs_node of interest
+ * @buf: buffer to copy @kn's name into
+ * @buflen: size of @buf
+ *
+ * Builds and returns the full path of @kn in @buf of @buflen bytes.  The
+ * path is built from the end of @buf so the returned pointer usually
+ * doesn't match @buf.  If @buf isn't long enough, @buf is nul terminated
+ * and %NULL is returned.
+ */
+char *kernfs_path(struct kernfs_node *kn, char *buf, size_t buflen)
+{
+	return kernfs_path_from_node(NULL, kn, buf, buflen);
+}
 EXPORT_SYMBOL_GPL(kernfs_path);
 
 /**
@@ -145,8 +178,8 @@ void pr_cont_kernfs_path(struct kernfs_node *kn)
 
 	spin_lock_irqsave(&kernfs_rename_lock, flags);
 
-	p = kernfs_path_locked(kn, kernfs_pr_cont_buf,
-			       sizeof(kernfs_pr_cont_buf));
+	p = kernfs_path_from_node_locked(NULL, kn, kernfs_pr_cont_buf,
+					 sizeof(kernfs_pr_cont_buf));
 	if (p)
 		pr_cont("%s", p);
 	else
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 30faf79..3c2be75 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -258,6 +258,9 @@ static inline bool kernfs_ns_enabled(struct kernfs_node *kn)
 }
 
 int kernfs_name(struct kernfs_node *kn, char *buf, size_t buflen);
+char * __must_check kernfs_path_from_node(struct kernfs_node *root_kn,
+					  struct kernfs_node *kn, char *buf,
+					  size_t buflen);
 char * __must_check kernfs_path(struct kernfs_node *kn, char *buf,
 				size_t buflen);
 void pr_cont_kernfs_name(struct kernfs_node *kn);
-- 
2.1.0.rc2.206.gedb03e5


* [PATCHv1 1/8] kernfs: Add API to generate relative kernfs path
       [not found]   ` <1413235430-22944-1-git-send-email-adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2014-10-13 21:23     ` Aditya Kali
  2014-10-13 21:23       ` Aditya Kali
                       ` (8 subsequent siblings)
  9 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-10-13 21:23 UTC (permalink / raw)
  To: tj, lizefan, serge.hallyn, luto, cgroups, linux-kernel, linux-api, mingo
  Cc: containers, jnagal, Aditya Kali

The new function kernfs_path_from_node() generates and returns the
kernfs path of a given kernfs_node relative to a given ancestor
kernfs_node.
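
For readers who want to try the idea outside the kernel, here is a
minimal user-space sketch of the same back-to-front path construction.
The names 'struct node' and path_from_node() are made up for this
illustration and are not part of the patch; unlike the patch, the
sketch returns NULL instead of calling BUG_ON() for an empty buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct kernfs_node: a name and a parent. */
struct node {
	const char *name;
	struct node *parent;
};

/*
 * Build the path of @n relative to @root, filling @buf from the end.
 * Returns a pointer into @buf, or NULL if @buf is too small.
 * A NULL @root (or one that is not an ancestor of @n) yields the
 * full path, matching the behaviour documented in the patch.
 */
static char *path_from_node(struct node *root, struct node *n,
			    char *buf, size_t buflen)
{
	char *p = buf + buflen;
	size_t len;

	assert(n && buf);
	if (!buflen)
		return NULL;
	*--p = '\0';

	/* @n is @root itself, or the top of the tree: the path is "/". */
	if (n == root || !n->parent) {
		if (p == buf)
			return NULL;
		*--p = '/';
		return p;
	}

	do {
		len = strlen(n->name);
		/* Need room for the name plus the leading '/'. */
		if ((size_t)(p - buf) < len + 1) {
			buf[0] = '\0';
			return NULL;
		}
		p -= len;
		memcpy(p, n->name, len);
		*--p = '/';
		n = n->parent;
		if (n == root)
			break;	/* stop at the requested root */
	} while (n && n->parent);

	return p;
}
```

With a chain top -> batchjobs -> c_job_id1 -> sub, passing c_job_id1
as the root gives "/sub", while passing NULL gives the full
"/batchjobs/c_job_id1/sub".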

Signed-off-by: Aditya Kali <adityakali@google.com>
---
 fs/kernfs/dir.c        | 53 ++++++++++++++++++++++++++++++++++++++++----------
 include/linux/kernfs.h |  3 +++
 2 files changed, 46 insertions(+), 10 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index a693f5b..8655485 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -44,14 +44,24 @@ static int kernfs_name_locked(struct kernfs_node *kn, char *buf, size_t buflen)
 	return strlcpy(buf, kn->parent ? kn->name : "/", buflen);
 }
 
-static char * __must_check kernfs_path_locked(struct kernfs_node *kn, char *buf,
-					      size_t buflen)
+static char * __must_check kernfs_path_from_node_locked(
+	struct kernfs_node *kn_root,
+	struct kernfs_node *kn,
+	char *buf,
+	size_t buflen)
 {
 	char *p = buf + buflen;
 	int len;
 
+	BUG_ON(!buflen);
+
 	*--p = '\0';
 
+	if (kn == kn_root) {
+		*--p = '/';
+		return p;
+	}
+
 	do {
 		len = strlen(kn->name);
 		if (p - buf < len + 1) {
@@ -63,6 +73,8 @@ static char * __must_check kernfs_path_locked(struct kernfs_node *kn, char *buf,
 		memcpy(p, kn->name, len);
 		*--p = '/';
 		kn = kn->parent;
+		if (kn == kn_root)
+			break;
 	} while (kn && kn->parent);
 
 	return p;
@@ -92,26 +104,47 @@ int kernfs_name(struct kernfs_node *kn, char *buf, size_t buflen)
 }
 
 /**
- * kernfs_path - build full path of a given node
+ * kernfs_path_from_node - build path of node @kn relative to @kn_root.
+ * @kn_root: parent kernfs_node relative to which we need to build the path
  * @kn: kernfs_node of interest
- * @buf: buffer to copy @kn's name into
+ * @buf: buffer to copy @kn's path into
  * @buflen: size of @buf
  *
- * Builds and returns the full path of @kn in @buf of @buflen bytes.  The
- * path is built from the end of @buf so the returned pointer usually
+ * Builds and returns @kn's path relative to @kn_root. @kn_root is expected to
+ * be parent of @kn at some level. If this is not true or if @kn_root is NULL,
+ * then full path of @kn is returned.
+ * The path is built from the end of @buf so the returned pointer usually
  * doesn't match @buf.  If @buf isn't long enough, @buf is nul terminated
  * and %NULL is returned.
  */
-char *kernfs_path(struct kernfs_node *kn, char *buf, size_t buflen)
+char *kernfs_path_from_node(struct kernfs_node *kn_root, struct kernfs_node *kn,
+			    char *buf, size_t buflen)
 {
 	unsigned long flags;
 	char *p;
 
 	spin_lock_irqsave(&kernfs_rename_lock, flags);
-	p = kernfs_path_locked(kn, buf, buflen);
+	p = kernfs_path_from_node_locked(kn_root, kn, buf, buflen);
 	spin_unlock_irqrestore(&kernfs_rename_lock, flags);
 	return p;
 }
+EXPORT_SYMBOL_GPL(kernfs_path_from_node);
+
+/**
+ * kernfs_path - build full path of a given node
+ * @kn: kernfs_node of interest
+ * @buf: buffer to copy @kn's name into
+ * @buflen: size of @buf
+ *
+ * Builds and returns the full path of @kn in @buf of @buflen bytes.  The
+ * path is built from the end of @buf so the returned pointer usually
+ * doesn't match @buf.  If @buf isn't long enough, @buf is nul terminated
+ * and %NULL is returned.
+ */
+char *kernfs_path(struct kernfs_node *kn, char *buf, size_t buflen)
+{
+	return kernfs_path_from_node(NULL, kn, buf, buflen);
+}
 EXPORT_SYMBOL_GPL(kernfs_path);
 
 /**
@@ -145,8 +178,8 @@ void pr_cont_kernfs_path(struct kernfs_node *kn)
 
 	spin_lock_irqsave(&kernfs_rename_lock, flags);
 
-	p = kernfs_path_locked(kn, kernfs_pr_cont_buf,
-			       sizeof(kernfs_pr_cont_buf));
+	p = kernfs_path_from_node_locked(NULL, kn, kernfs_pr_cont_buf,
+					 sizeof(kernfs_pr_cont_buf));
 	if (p)
 		pr_cont("%s", p);
 	else
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 30faf79..3c2be75 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -258,6 +258,9 @@ static inline bool kernfs_ns_enabled(struct kernfs_node *kn)
 }
 
 int kernfs_name(struct kernfs_node *kn, char *buf, size_t buflen);
+char * __must_check kernfs_path_from_node(struct kernfs_node *root_kn,
+					  struct kernfs_node *kn, char *buf,
+					  size_t buflen);
 char * __must_check kernfs_path(struct kernfs_node *kn, char *buf,
 				size_t buflen);
 void pr_cont_kernfs_name(struct kernfs_node *kn);
-- 
2.1.0.rc2.206.gedb03e5



* [PATCHv1 1/8] kernfs: Add API to generate relative kernfs path
@ 2014-10-13 21:23     ` Aditya Kali
  0 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-10-13 21:23 UTC (permalink / raw)
  To: tj-DgEjT+Ai2ygdnm+yROfE0A, lizefan-hv44wF8Li93QT0dZR+AlfA,
	serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA, luto-kltTT9wpgjJwATOyAt5JVQ,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, mingo-H+wXaHxf7aLQT0dZR+AlfA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	jnagal-hpIqsD4AKlfQT0dZR+AlfA, Aditya Kali

The new function kernfs_path_from_node() generates and returns the
kernfs path of a given kernfs_node relative to a given ancestor
kernfs_node.

Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
 fs/kernfs/dir.c        | 53 ++++++++++++++++++++++++++++++++++++++++----------
 include/linux/kernfs.h |  3 +++
 2 files changed, 46 insertions(+), 10 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index a693f5b..8655485 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -44,14 +44,24 @@ static int kernfs_name_locked(struct kernfs_node *kn, char *buf, size_t buflen)
 	return strlcpy(buf, kn->parent ? kn->name : "/", buflen);
 }
 
-static char * __must_check kernfs_path_locked(struct kernfs_node *kn, char *buf,
-					      size_t buflen)
+static char * __must_check kernfs_path_from_node_locked(
+	struct kernfs_node *kn_root,
+	struct kernfs_node *kn,
+	char *buf,
+	size_t buflen)
 {
 	char *p = buf + buflen;
 	int len;
 
+	BUG_ON(!buflen);
+
 	*--p = '\0';
 
+	if (kn == kn_root) {
+		*--p = '/';
+		return p;
+	}
+
 	do {
 		len = strlen(kn->name);
 		if (p - buf < len + 1) {
@@ -63,6 +73,8 @@ static char * __must_check kernfs_path_locked(struct kernfs_node *kn, char *buf,
 		memcpy(p, kn->name, len);
 		*--p = '/';
 		kn = kn->parent;
+		if (kn == kn_root)
+			break;
 	} while (kn && kn->parent);
 
 	return p;
@@ -92,26 +104,47 @@ int kernfs_name(struct kernfs_node *kn, char *buf, size_t buflen)
 }
 
 /**
- * kernfs_path - build full path of a given node
+ * kernfs_path_from_node - build path of node @kn relative to @kn_root.
+ * @kn_root: parent kernfs_node relative to which we need to build the path
  * @kn: kernfs_node of interest
- * @buf: buffer to copy @kn's name into
+ * @buf: buffer to copy @kn's path into
  * @buflen: size of @buf
  *
- * Builds and returns the full path of @kn in @buf of @buflen bytes.  The
- * path is built from the end of @buf so the returned pointer usually
+ * Builds and returns @kn's path relative to @kn_root. @kn_root is expected to
+ * be parent of @kn at some level. If this is not true or if @kn_root is NULL,
+ * then full path of @kn is returned.
+ * The path is built from the end of @buf so the returned pointer usually
  * doesn't match @buf.  If @buf isn't long enough, @buf is nul terminated
  * and %NULL is returned.
  */
-char *kernfs_path(struct kernfs_node *kn, char *buf, size_t buflen)
+char *kernfs_path_from_node(struct kernfs_node *kn_root, struct kernfs_node *kn,
+			    char *buf, size_t buflen)
 {
 	unsigned long flags;
 	char *p;
 
 	spin_lock_irqsave(&kernfs_rename_lock, flags);
-	p = kernfs_path_locked(kn, buf, buflen);
+	p = kernfs_path_from_node_locked(kn_root, kn, buf, buflen);
 	spin_unlock_irqrestore(&kernfs_rename_lock, flags);
 	return p;
 }
+EXPORT_SYMBOL_GPL(kernfs_path_from_node);
+
+/**
+ * kernfs_path - build full path of a given node
+ * @kn: kernfs_node of interest
+ * @buf: buffer to copy @kn's name into
+ * @buflen: size of @buf
+ *
+ * Builds and returns the full path of @kn in @buf of @buflen bytes.  The
+ * path is built from the end of @buf so the returned pointer usually
+ * doesn't match @buf.  If @buf isn't long enough, @buf is nul terminated
+ * and %NULL is returned.
+ */
+char *kernfs_path(struct kernfs_node *kn, char *buf, size_t buflen)
+{
+	return kernfs_path_from_node(NULL, kn, buf, buflen);
+}
 EXPORT_SYMBOL_GPL(kernfs_path);
 
 /**
@@ -145,8 +178,8 @@ void pr_cont_kernfs_path(struct kernfs_node *kn)
 
 	spin_lock_irqsave(&kernfs_rename_lock, flags);
 
-	p = kernfs_path_locked(kn, kernfs_pr_cont_buf,
-			       sizeof(kernfs_pr_cont_buf));
+	p = kernfs_path_from_node_locked(NULL, kn, kernfs_pr_cont_buf,
+					 sizeof(kernfs_pr_cont_buf));
 	if (p)
 		pr_cont("%s", p);
 	else
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 30faf79..3c2be75 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -258,6 +258,9 @@ static inline bool kernfs_ns_enabled(struct kernfs_node *kn)
 }
 
 int kernfs_name(struct kernfs_node *kn, char *buf, size_t buflen);
+char * __must_check kernfs_path_from_node(struct kernfs_node *root_kn,
+					  struct kernfs_node *kn, char *buf,
+					  size_t buflen);
 char * __must_check kernfs_path(struct kernfs_node *kn, char *buf,
 				size_t buflen);
 void pr_cont_kernfs_name(struct kernfs_node *kn);
-- 
2.1.0.rc2.206.gedb03e5


* [PATCHv1 2/8] sched: new clone flag CLONE_NEWCGROUP for cgroup namespace
  2014-10-13 21:23   ` Aditya Kali
@ 2014-10-13 21:23       ` Aditya Kali
  -1 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-10-13 21:23 UTC (permalink / raw)
  To: tj-DgEjT+Ai2ygdnm+yROfE0A, lizefan-hv44wF8Li93QT0dZR+AlfA,
	serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA, luto-kltTT9wpgjJwATOyAt5JVQ,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, mingo-H+wXaHxf7aLQT0dZR+AlfA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

CLONE_NEWCGROUP will be used to create a new cgroup namespace. It reuses
the flag value 0x02000000, which previously belonged to the unused
CLONE_STOPPED.
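
As a rough illustration of how user space might exercise the flag, the
sketch below calls unshare through the raw syscall interface. The
helper name enter_new_cgroupns() is hypothetical, and the fallback
definition of CLONE_NEWCGROUP assumes the value proposed in this
patch; libc headers may not provide it.

```c
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef CLONE_NEWCGROUP
#define CLONE_NEWCGROUP 0x02000000	/* value proposed in this patch */
#endif

/*
 * Hypothetical helper: move the calling task into a new cgroup
 * namespace. Returns 0 on success or a negative errno, for example
 * -EINVAL on a kernel without support or -EPERM without privilege.
 */
static int enter_new_cgroupns(void)
{
	if (syscall(SYS_unshare, (unsigned long)CLONE_NEWCGROUP) < 0)
		return -errno;
	return 0;
}
```

A caller would typically invoke this before reading /proc/self/cgroup
and should be prepared for the error return on kernels that do not
carry this series.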

Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
 include/uapi/linux/sched.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 34f9d73..2f90d00 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -21,8 +21,7 @@
 #define CLONE_DETACHED		0x00400000	/* Unused, ignored */
 #define CLONE_UNTRACED		0x00800000	/* set if the tracing process can't force CLONE_PTRACE on this clone */
 #define CLONE_CHILD_SETTID	0x01000000	/* set the TID in the child */
-/* 0x02000000 was previously the unused CLONE_STOPPED (Start in stopped state)
-   and is now available for re-use. */
+#define CLONE_NEWCGROUP		0x02000000	/* New cgroup namespace */
 #define CLONE_NEWUTS		0x04000000	/* New utsname group? */
 #define CLONE_NEWIPC		0x08000000	/* New ipcs */
 #define CLONE_NEWUSER		0x10000000	/* New user namespace */
-- 
2.1.0.rc2.206.gedb03e5


* [PATCHv1 2/8] sched: new clone flag CLONE_NEWCGROUP for cgroup namespace
@ 2014-10-13 21:23       ` Aditya Kali
  0 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-10-13 21:23 UTC (permalink / raw)
  To: tj, lizefan, serge.hallyn, luto, cgroups, linux-kernel, linux-api, mingo
  Cc: containers, jnagal, Aditya Kali

CLONE_NEWCGROUP will be used to create a new cgroup namespace. It reuses
the flag value 0x02000000, which previously belonged to the unused
CLONE_STOPPED.

Signed-off-by: Aditya Kali <adityakali@google.com>
---
 include/uapi/linux/sched.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 34f9d73..2f90d00 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -21,8 +21,7 @@
 #define CLONE_DETACHED		0x00400000	/* Unused, ignored */
 #define CLONE_UNTRACED		0x00800000	/* set if the tracing process can't force CLONE_PTRACE on this clone */
 #define CLONE_CHILD_SETTID	0x01000000	/* set the TID in the child */
-/* 0x02000000 was previously the unused CLONE_STOPPED (Start in stopped state)
-   and is now available for re-use. */
+#define CLONE_NEWCGROUP		0x02000000	/* New cgroup namespace */
 #define CLONE_NEWUTS		0x04000000	/* New utsname group? */
 #define CLONE_NEWIPC		0x08000000	/* New ipcs */
 #define CLONE_NEWUSER		0x10000000	/* New user namespace */
-- 
2.1.0.rc2.206.gedb03e5



* [PATCHv1 3/8] cgroup: add function to get task's cgroup on default hierarchy
       [not found]   ` <1413235430-22944-1-git-send-email-adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
  2014-10-13 21:23     ` [PATCHv1 1/8] kernfs: Add API to generate relative kernfs path Aditya Kali
  2014-10-13 21:23       ` Aditya Kali
@ 2014-10-13 21:23     ` Aditya Kali
  2014-10-13 21:23       ` Aditya Kali
                       ` (6 subsequent siblings)
  9 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-10-13 21:23 UTC (permalink / raw)
  To: tj-DgEjT+Ai2ygdnm+yROfE0A, lizefan-hv44wF8Li93QT0dZR+AlfA,
	serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA, luto-kltTT9wpgjJwATOyAt5JVQ,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, mingo-H+wXaHxf7aLQT0dZR+AlfA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

get_task_cgroup() returns the (reference counted) cgroup of the
given task on the default hierarchy.
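
The get/put contract can be pictured with the small user-space model
below. It is only a sketch: 'struct obj', obj_get(), obj_put() and
lookup_and_get() are invented names, and a plain atomic counter stands
in for the css reference counting that cgroup_get() and cgroup_put()
actually use.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a reference-counted kernel object. */
struct obj {
	atomic_int refs;
	bool released;	/* set when the last reference is dropped */
};

static void obj_get(struct obj *o)
{
	atomic_fetch_add(&o->refs, 1);
}

/* Drop one reference; returns true if this was the last one. */
static bool obj_put(struct obj *o)
{
	assert(atomic_load(&o->refs) > 0);
	if (atomic_fetch_sub(&o->refs, 1) == 1) {
		o->released = true;
		return true;
	}
	return false;
}

/*
 * Models the get_task_cgroup() contract: the returned pointer carries
 * an extra reference that the caller must drop with obj_put().
 */
static struct obj *lookup_and_get(struct obj *o)
{
	obj_get(o);
	return o;
}
```

Every successful lookup_and_get() must be paired with exactly one
obj_put(), in the same way that every get_task_cgroup() must be paired
with a cgroup_put().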

Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
 include/linux/cgroup.h |  1 +
 kernel/cgroup.c        | 25 +++++++++++++++++++++++++
 2 files changed, 26 insertions(+)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 1d51968..80ed6e0 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -579,6 +579,7 @@ static inline void pr_cont_cgroup_path(struct cgroup *cgrp)
 }
 
 char *task_cgroup_path(struct task_struct *task, char *buf, size_t buflen);
+struct cgroup *get_task_cgroup(struct task_struct *task);
 
 int cgroup_add_dfl_cftypes(struct cgroup_subsys *ss, struct cftype *cfts);
 int cgroup_add_legacy_cftypes(struct cgroup_subsys *ss, struct cftype *cfts);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index cab7dc4..56d507b 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1916,6 +1916,31 @@ char *task_cgroup_path(struct task_struct *task, char *buf, size_t buflen)
 }
 EXPORT_SYMBOL_GPL(task_cgroup_path);
 
+/*
+ * get_task_cgroup - returns the cgroup of the task in the default cgroup
+ * hierarchy.
+ *
+ * @task: target task
+ * This function returns the @task's cgroup on the default cgroup hierarchy. The
+ * returned cgroup has its reference incremented (by calling cgroup_get()). So
+ * the caller must cgroup_put() the obtained reference once it is done with it.
+ */
+struct cgroup *get_task_cgroup(struct task_struct *task)
+{
+	struct cgroup *cgrp;
+
+	mutex_lock(&cgroup_mutex);
+	down_read(&css_set_rwsem);
+
+	cgrp = task_cgroup_from_root(task, &cgrp_dfl_root);
+	cgroup_get(cgrp);
+
+	up_read(&css_set_rwsem);
+	mutex_unlock(&cgroup_mutex);
+	return cgrp;
+}
+EXPORT_SYMBOL_GPL(get_task_cgroup);
+
 /* used to track tasks and other necessary states during migration */
 struct cgroup_taskset {
 	/* the src and dst cset list running through cset->mg_node */
-- 
2.1.0.rc2.206.gedb03e5


* [PATCHv1 3/8] cgroup: add function to get task's cgroup on default hierarchy
  2014-10-13 21:23   ` Aditya Kali
  (?)
  (?)
@ 2014-10-13 21:23   ` Aditya Kali
  2014-10-16 16:13       ` Serge E. Hallyn
       [not found]     ` <1413235430-22944-4-git-send-email-adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
  -1 siblings, 2 replies; 384+ messages in thread
From: Aditya Kali @ 2014-10-13 21:23 UTC (permalink / raw)
  To: tj, lizefan, serge.hallyn, luto, cgroups, linux-kernel, linux-api, mingo
  Cc: containers, jnagal, Aditya Kali

get_task_cgroup() returns the (reference counted) cgroup of the
given task on the default hierarchy.

Signed-off-by: Aditya Kali <adityakali@google.com>
---
 include/linux/cgroup.h |  1 +
 kernel/cgroup.c        | 25 +++++++++++++++++++++++++
 2 files changed, 26 insertions(+)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 1d51968..80ed6e0 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -579,6 +579,7 @@ static inline void pr_cont_cgroup_path(struct cgroup *cgrp)
 }
 
 char *task_cgroup_path(struct task_struct *task, char *buf, size_t buflen);
+struct cgroup *get_task_cgroup(struct task_struct *task);
 
 int cgroup_add_dfl_cftypes(struct cgroup_subsys *ss, struct cftype *cfts);
 int cgroup_add_legacy_cftypes(struct cgroup_subsys *ss, struct cftype *cfts);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index cab7dc4..56d507b 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1916,6 +1916,31 @@ char *task_cgroup_path(struct task_struct *task, char *buf, size_t buflen)
 }
 EXPORT_SYMBOL_GPL(task_cgroup_path);
 
+/*
+ * get_task_cgroup - returns the cgroup of the task in the default cgroup
+ * hierarchy.
+ *
+ * @task: target task
+ * This function returns the @task's cgroup on the default cgroup hierarchy. The
+ * returned cgroup has its reference incremented (by calling cgroup_get()). So
+ * the caller must cgroup_put() the obtained reference once it is done with it.
+ */
+struct cgroup *get_task_cgroup(struct task_struct *task)
+{
+	struct cgroup *cgrp;
+
+	mutex_lock(&cgroup_mutex);
+	down_read(&css_set_rwsem);
+
+	cgrp = task_cgroup_from_root(task, &cgrp_dfl_root);
+	cgroup_get(cgrp);
+
+	up_read(&css_set_rwsem);
+	mutex_unlock(&cgroup_mutex);
+	return cgrp;
+}
+EXPORT_SYMBOL_GPL(get_task_cgroup);
+
 /* used to track tasks and other necessary states during migration */
 struct cgroup_taskset {
 	/* the src and dst cset list running through cset->mg_node */
-- 
2.1.0.rc2.206.gedb03e5



* [PATCHv1 4/8] cgroup: export cgroup_get() and cgroup_put()
  2014-10-13 21:23   ` Aditya Kali
@ 2014-10-13 21:23       ` Aditya Kali
  -1 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-10-13 21:23 UTC (permalink / raw)
  To: tj-DgEjT+Ai2ygdnm+yROfE0A, lizefan-hv44wF8Li93QT0dZR+AlfA,
	serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA, luto-kltTT9wpgjJwATOyAt5JVQ,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, mingo-H+wXaHxf7aLQT0dZR+AlfA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Move cgroup_get() and cgroup_put(), along with cgroup_tryget() and
cgroup_is_dead(), into cgroup.h as static inline functions so that
they can be called from other files.

Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
 include/linux/cgroup.h | 22 ++++++++++++++++++++++
 kernel/cgroup.c        | 22 ----------------------
 2 files changed, 22 insertions(+), 22 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 80ed6e0..4a0eb2d 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -521,6 +521,28 @@ static inline bool cgroup_on_dfl(const struct cgroup *cgrp)
 	return cgrp->root == &cgrp_dfl_root;
 }
 
+/* convenient tests for these bits */
+static inline bool cgroup_is_dead(const struct cgroup *cgrp)
+{
+	return !(cgrp->self.flags & CSS_ONLINE);
+}
+
+static inline void cgroup_get(struct cgroup *cgrp)
+{
+	WARN_ON_ONCE(cgroup_is_dead(cgrp));
+	css_get(&cgrp->self);
+}
+
+static inline bool cgroup_tryget(struct cgroup *cgrp)
+{
+	return css_tryget(&cgrp->self);
+}
+
+static inline void cgroup_put(struct cgroup *cgrp)
+{
+	css_put(&cgrp->self);
+}
+
 /* no synchronization, the result can only be used as a hint */
 static inline bool cgroup_has_tasks(struct cgroup *cgrp)
 {
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 56d507b..2b3e9f9 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -284,12 +284,6 @@ static struct cgroup_subsys_state *cgroup_e_css(struct cgroup *cgrp,
 	return cgroup_css(cgrp, ss);
 }
 
-/* convenient tests for these bits */
-static inline bool cgroup_is_dead(const struct cgroup *cgrp)
-{
-	return !(cgrp->self.flags & CSS_ONLINE);
-}
-
 struct cgroup_subsys_state *of_css(struct kernfs_open_file *of)
 {
 	struct cgroup *cgrp = of->kn->parent->priv;
@@ -1002,22 +996,6 @@ static umode_t cgroup_file_mode(const struct cftype *cft)
 	return mode;
 }
 
-static void cgroup_get(struct cgroup *cgrp)
-{
-	WARN_ON_ONCE(cgroup_is_dead(cgrp));
-	css_get(&cgrp->self);
-}
-
-static bool cgroup_tryget(struct cgroup *cgrp)
-{
-	return css_tryget(&cgrp->self);
-}
-
-static void cgroup_put(struct cgroup *cgrp)
-{
-	css_put(&cgrp->self);
-}
-
 /**
  * cgroup_refresh_child_subsys_mask - update child_subsys_mask
  * @cgrp: the target cgroup
-- 
2.1.0.rc2.206.gedb03e5


* [PATCHv1 4/8] cgroup: export cgroup_get() and cgroup_put()
@ 2014-10-13 21:23       ` Aditya Kali
  0 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-10-13 21:23 UTC (permalink / raw)
  To: tj, lizefan, serge.hallyn, luto, cgroups, linux-kernel, linux-api, mingo
  Cc: containers, jnagal, Aditya Kali

Move cgroup_get() and cgroup_put(), along with cgroup_tryget() and
cgroup_is_dead(), into cgroup.h as static inline functions so that
they can be called from other files.

Signed-off-by: Aditya Kali <adityakali@google.com>
---
 include/linux/cgroup.h | 22 ++++++++++++++++++++++
 kernel/cgroup.c        | 22 ----------------------
 2 files changed, 22 insertions(+), 22 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 80ed6e0..4a0eb2d 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -521,6 +521,28 @@ static inline bool cgroup_on_dfl(const struct cgroup *cgrp)
 	return cgrp->root == &cgrp_dfl_root;
 }
 
+/* convenient tests for these bits */
+static inline bool cgroup_is_dead(const struct cgroup *cgrp)
+{
+	return !(cgrp->self.flags & CSS_ONLINE);
+}
+
+static inline void cgroup_get(struct cgroup *cgrp)
+{
+	WARN_ON_ONCE(cgroup_is_dead(cgrp));
+	css_get(&cgrp->self);
+}
+
+static inline bool cgroup_tryget(struct cgroup *cgrp)
+{
+	return css_tryget(&cgrp->self);
+}
+
+static inline void cgroup_put(struct cgroup *cgrp)
+{
+	css_put(&cgrp->self);
+}
+
 /* no synchronization, the result can only be used as a hint */
 static inline bool cgroup_has_tasks(struct cgroup *cgrp)
 {
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 56d507b..2b3e9f9 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -284,12 +284,6 @@ static struct cgroup_subsys_state *cgroup_e_css(struct cgroup *cgrp,
 	return cgroup_css(cgrp, ss);
 }
 
-/* convenient tests for these bits */
-static inline bool cgroup_is_dead(const struct cgroup *cgrp)
-{
-	return !(cgrp->self.flags & CSS_ONLINE);
-}
-
 struct cgroup_subsys_state *of_css(struct kernfs_open_file *of)
 {
 	struct cgroup *cgrp = of->kn->parent->priv;
@@ -1002,22 +996,6 @@ static umode_t cgroup_file_mode(const struct cftype *cft)
 	return mode;
 }
 
-static void cgroup_get(struct cgroup *cgrp)
-{
-	WARN_ON_ONCE(cgroup_is_dead(cgrp));
-	css_get(&cgrp->self);
-}
-
-static bool cgroup_tryget(struct cgroup *cgrp)
-{
-	return css_tryget(&cgrp->self);
-}
-
-static void cgroup_put(struct cgroup *cgrp)
-{
-	css_put(&cgrp->self);
-}
-
 /**
  * cgroup_refresh_child_subsys_mask - update child_subsys_mask
  * @cgrp: the target cgroup
-- 
2.1.0.rc2.206.gedb03e5



* [PATCHv1 5/8] cgroup: introduce cgroup namespaces
  2014-10-13 21:23   ` Aditya Kali
@ 2014-10-13 21:23       ` Aditya Kali
  -1 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-10-13 21:23 UTC (permalink / raw)
  To: tj-DgEjT+Ai2ygdnm+yROfE0A, lizefan-hv44wF8Li93QT0dZR+AlfA,
	serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA, luto-kltTT9wpgjJwATOyAt5JVQ,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, mingo-H+wXaHxf7aLQT0dZR+AlfA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Introduce the ability to create a new cgroup namespace. A newly created
cgroup namespace remembers the 'struct cgroup *root_cgrp' at the point
of its creation. The task that creates the new cgroup namespace, and all
its future children, are then restricted to the cgroup hierarchy under
this root_cgrp.

The main purpose of a cgroup namespace is to virtualize the contents
of the /proc/self/cgroup file. Processes inside a cgroup namespace
are only able to see paths relative to their namespace root.
This allows container tools (like libcontainer, lxc, lmctfy, etc.)
to create completely virtualized containers without leaking the
system-level cgroup hierarchy to the task.

This patch only implements the 'unshare' part of the cgroupns.
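
To make the intended view concrete, the sketch below models on plain
strings what a task would see in /proc/self/cgroup. The helper
ns_relative_path() is hypothetical and is not how the patch computes
the path (the patch walks kernfs nodes); returning the full path for a
cgroup outside the namespace root is an assumption taken from the
kernfs_path_from_node() comment in patch 1/8.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper: return the part of the host-visible cgroup
 * @path that a task whose namespace is rooted at @root would see.
 */
static const char *ns_relative_path(const char *root, const char *path)
{
	size_t rlen;

	assert(root && path);
	rlen = strlen(root);

	/* The initial namespace is rooted at "/": nothing to strip. */
	if (strcmp(root, "/") == 0)
		return path;
	if (strncmp(path, root, rlen) != 0)
		return path;		/* not under @root: full path */
	if (path[rlen] == '\0')
		return "/";		/* task sits at the namespace root */
	if (path[rlen] == '/')
		return path + rlen;	/* strip the root prefix */
	return path;			/* e.g. root "/a", path "/ab" */
}
```

For a task that unshared its cgroup namespace while in
/batchjobs/c_job_id1, the model reports "/" for that cgroup itself and
"/sub" for /batchjobs/c_job_id1/sub.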

Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
 fs/proc/namespaces.c             |   3 +
 include/linux/cgroup.h           |  18 +++++-
 include/linux/cgroup_namespace.h |  62 +++++++++++++++++++
 include/linux/nsproxy.h          |   2 +
 include/linux/proc_ns.h          |   4 ++
 init/Kconfig                     |   9 +++
 kernel/Makefile                  |   1 +
 kernel/cgroup.c                  |  11 ++++
 kernel/cgroup_namespace.c        | 128 +++++++++++++++++++++++++++++++++++++++
 kernel/fork.c                    |   2 +-
 kernel/nsproxy.c                 |  19 +++++-
 11 files changed, 255 insertions(+), 4 deletions(-)

diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index 8902609..e04ed4b 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -32,6 +32,9 @@ static const struct proc_ns_operations *ns_entries[] = {
 	&userns_operations,
 #endif
 	&mntns_operations,
+#ifdef CONFIG_CGROUP_NS
+	&cgroupns_operations,
+#endif
 };
 
 static const struct file_operations ns_file_operations = {
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 4a0eb2d..aa86495 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -22,6 +22,8 @@
 #include <linux/seq_file.h>
 #include <linux/kernfs.h>
 #include <linux/wait.h>
+#include <linux/nsproxy.h>
+#include <linux/types.h>
 
 #ifdef CONFIG_CGROUPS
 
@@ -460,6 +462,13 @@ struct cftype {
 #endif
 };
 
+struct cgroup_namespace {
+	atomic_t		count;
+	unsigned int		proc_inum;
+	struct user_namespace	*user_ns;
+	struct cgroup		*root_cgrp;
+};
+
 extern struct cgroup_root cgrp_dfl_root;
 extern struct css_set init_css_set;
 
@@ -584,10 +593,17 @@ static inline int cgroup_name(struct cgroup *cgrp, char *buf, size_t buflen)
 	return kernfs_name(cgrp->kn, buf, buflen);
 }
 
+static inline char * __must_check cgroup_path_ns(struct cgroup_namespace *ns,
+						 struct cgroup *cgrp, char *buf,
+						 size_t buflen)
+{
+	return kernfs_path_from_node(ns->root_cgrp->kn, cgrp->kn, buf, buflen);
+}
+
 static inline char * __must_check cgroup_path(struct cgroup *cgrp, char *buf,
 					      size_t buflen)
 {
-	return kernfs_path(cgrp->kn, buf, buflen);
+	return cgroup_path_ns(current->nsproxy->cgroup_ns, cgrp, buf, buflen);
 }
 
 static inline void pr_cont_cgroup_name(struct cgroup *cgrp)
diff --git a/include/linux/cgroup_namespace.h b/include/linux/cgroup_namespace.h
new file mode 100644
index 0000000..9f637fe
--- /dev/null
+++ b/include/linux/cgroup_namespace.h
@@ -0,0 +1,62 @@
+#ifndef _LINUX_CGROUP_NAMESPACE_H
+#define _LINUX_CGROUP_NAMESPACE_H
+
+#include <linux/nsproxy.h>
+#include <linux/cgroup.h>
+#include <linux/types.h>
+#include <linux/user_namespace.h>
+
+extern struct cgroup_namespace init_cgroup_ns;
+
+static inline struct cgroup *task_cgroupns_root(struct task_struct *tsk)
+{
+	return tsk->nsproxy->cgroup_ns->root_cgrp;
+}
+
+#ifdef CONFIG_CGROUP_NS
+
+extern void free_cgroup_ns(struct cgroup_namespace *ns);
+
+static inline struct cgroup_namespace *get_cgroup_ns(
+		struct cgroup_namespace *ns)
+{
+	if (ns)
+		atomic_inc(&ns->count);
+	return ns;
+}
+
+static inline void put_cgroup_ns(struct cgroup_namespace *ns)
+{
+	if (ns && atomic_dec_and_test(&ns->count))
+		free_cgroup_ns(ns);
+}
+
+extern struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
+					       struct user_namespace *user_ns,
+					       struct cgroup_namespace *old_ns);
+
+#else  /* CONFIG_CGROUP_NS */
+
+static inline struct cgroup_namespace *get_cgroup_ns(
+		struct cgroup_namespace *ns)
+{
+	return &init_cgroup_ns;
+}
+
+static inline void put_cgroup_ns(struct cgroup_namespace *ns)
+{
+}
+
+static inline struct cgroup_namespace *copy_cgroup_ns(
+		unsigned long flags,
+		struct user_namespace *user_ns,
+		struct cgroup_namespace *old_ns) {
+	if (flags & CLONE_NEWCGROUP)
+		return ERR_PTR(-EINVAL);
+
+	return old_ns;
+}
+
+#endif  /* CONFIG_CGROUP_NS */
+
+#endif  /* _LINUX_CGROUP_NAMESPACE_H */
diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
index 35fa08f..ac0d65b 100644
--- a/include/linux/nsproxy.h
+++ b/include/linux/nsproxy.h
@@ -8,6 +8,7 @@ struct mnt_namespace;
 struct uts_namespace;
 struct ipc_namespace;
 struct pid_namespace;
+struct cgroup_namespace;
 struct fs_struct;
 
 /*
@@ -33,6 +34,7 @@ struct nsproxy {
 	struct mnt_namespace *mnt_ns;
 	struct pid_namespace *pid_ns_for_children;
 	struct net 	     *net_ns;
+	struct cgroup_namespace *cgroup_ns;
 };
 extern struct nsproxy init_nsproxy;
 
diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
index 34a1e10..e56dd73 100644
--- a/include/linux/proc_ns.h
+++ b/include/linux/proc_ns.h
@@ -6,6 +6,8 @@
 
 struct pid_namespace;
 struct nsproxy;
+struct task_struct;
+struct inode;
 
 struct proc_ns_operations {
 	const char *name;
@@ -27,6 +29,7 @@ extern const struct proc_ns_operations ipcns_operations;
 extern const struct proc_ns_operations pidns_operations;
 extern const struct proc_ns_operations userns_operations;
 extern const struct proc_ns_operations mntns_operations;
+extern const struct proc_ns_operations cgroupns_operations;
 
 /*
  * We always define these enumerators
@@ -37,6 +40,7 @@ enum {
 	PROC_UTS_INIT_INO	= 0xEFFFFFFEU,
 	PROC_USER_INIT_INO	= 0xEFFFFFFDU,
 	PROC_PID_INIT_INO	= 0xEFFFFFFCU,
+	PROC_CGROUP_INIT_INO	= 0xEFFFFFFBU,
 };
 
 #ifdef CONFIG_PROC_FS
diff --git a/init/Kconfig b/init/Kconfig
index e84c642..c3be001 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1144,6 +1144,15 @@ config DEBUG_BLK_CGROUP
 	Enable some debugging help. Currently it exports additional stat
 	files in a cgroup which can be useful for debugging.
 
+config CGROUP_NS
+	bool "CGroup Namespaces"
+	default n
+	help
+	  This option enables CGroup Namespaces, which can be used to isolate
+	  cgroup paths. This feature is only useful when the unified cgroup
+	  hierarchy is in use (i.e. cgroups are mounted with the sane_behavior
+	  option).
+
 endif # CGROUPS
 
 config CHECKPOINT_RESTORE
diff --git a/kernel/Makefile b/kernel/Makefile
index dc5c775..75334f8 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -51,6 +51,7 @@ obj-$(CONFIG_KEXEC) += kexec.o
 obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
 obj-$(CONFIG_COMPAT) += compat.o
 obj-$(CONFIG_CGROUPS) += cgroup.o
+obj-$(CONFIG_CGROUP_NS) += cgroup_namespace.o
 obj-$(CONFIG_CGROUP_FREEZER) += cgroup_freezer.o
 obj-$(CONFIG_CPUSETS) += cpuset.o
 obj-$(CONFIG_UTS_NS) += utsname.o
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 2b3e9f9..f8099b4 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -57,6 +57,8 @@
 #include <linux/vmalloc.h> /* TODO: replace with more sophisticated array */
 #include <linux/kthread.h>
 #include <linux/delay.h>
+#include <linux/proc_ns.h>
+#include <linux/cgroup_namespace.h>
 
 #include <linux/atomic.h>
 
@@ -195,6 +197,15 @@ static void kill_css(struct cgroup_subsys_state *css);
 static int cgroup_addrm_files(struct cgroup *cgrp, struct cftype cfts[],
 			      bool is_add);
 
+struct cgroup_namespace init_cgroup_ns = {
+	.count = {
+		.counter = 1,
+	},
+	.proc_inum = PROC_CGROUP_INIT_INO,
+	.user_ns = &init_user_ns,
+	.root_cgrp = &cgrp_dfl_root.cgrp,
+};
+
 /* IDR wrappers which synchronize using cgroup_idr_lock */
 static int cgroup_idr_alloc(struct idr *idr, void *ptr, int start, int end,
 			    gfp_t gfp_mask)
diff --git a/kernel/cgroup_namespace.c b/kernel/cgroup_namespace.c
new file mode 100644
index 0000000..c16604f
--- /dev/null
+++ b/kernel/cgroup_namespace.c
@@ -0,0 +1,128 @@
+
+#include <linux/cgroup.h>
+#include <linux/cgroup_namespace.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/nsproxy.h>
+#include <linux/proc_ns.h>
+
+static struct cgroup_namespace *alloc_cgroup_ns(void)
+{
+	struct cgroup_namespace *new_ns;
+
+	new_ns = kmalloc(sizeof(struct cgroup_namespace), GFP_KERNEL);
+	if (new_ns)
+		atomic_set(&new_ns->count, 1);
+	return new_ns;
+}
+
+void free_cgroup_ns(struct cgroup_namespace *ns)
+{
+	cgroup_put(ns->root_cgrp);
+	put_user_ns(ns->user_ns);
+	proc_free_inum(ns->proc_inum);
+}
+EXPORT_SYMBOL(free_cgroup_ns);
+
+struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
+					struct user_namespace *user_ns,
+					struct cgroup_namespace *old_ns)
+{
+	struct cgroup_namespace *new_ns = NULL;
+	struct cgroup *cgrp = NULL;
+	int err;
+
+	BUG_ON(!old_ns);
+
+	if (!(flags & CLONE_NEWCGROUP))
+		return get_cgroup_ns(old_ns);
+
+	/* Allow only sysadmin to create cgroup namespace. */
+	err = -EPERM;
+	if (!ns_capable(user_ns, CAP_SYS_ADMIN))
+		goto err_out;
+
+	/* Prevent cgroup changes for this task. */
+	threadgroup_lock(current);
+
+	cgrp = get_task_cgroup(current);
+
+	/* Creating new CGROUPNS is supported only when unified hierarchy is in
+	 * use. */
+	err = -EINVAL;
+	if (!cgroup_on_dfl(cgrp))
+		goto err_out_unlock;
+
+	err = -ENOMEM;
+	new_ns = alloc_cgroup_ns();
+	if (!new_ns)
+		goto err_out_unlock;
+
+	err = proc_alloc_inum(&new_ns->proc_inum);
+	if (err)
+		goto err_out_unlock;
+
+	new_ns->user_ns = get_user_ns(user_ns);
+	new_ns->root_cgrp = cgrp;
+
+	threadgroup_unlock(current);
+
+	return new_ns;
+
+err_out_unlock:
+	threadgroup_unlock(current);
+err_out:
+	if (cgrp)
+		cgroup_put(cgrp);
+	kfree(new_ns);
+	return ERR_PTR(err);
+}
+
+static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
+{
+	pr_info("setns not supported for cgroup namespace");
+	return -EINVAL;
+}
+
+static void *cgroupns_get(struct task_struct *task)
+{
+	struct cgroup_namespace *ns = NULL;
+	struct nsproxy *nsproxy;
+
+	rcu_read_lock();
+	nsproxy = task->nsproxy;
+	if (nsproxy) {
+		ns = nsproxy->cgroup_ns;
+		get_cgroup_ns(ns);
+	}
+	rcu_read_unlock();
+
+	return ns;
+}
+
+static void cgroupns_put(void *ns)
+{
+	put_cgroup_ns(ns);
+}
+
+static unsigned int cgroupns_inum(void *ns)
+{
+	struct cgroup_namespace *cgroup_ns = ns;
+
+	return cgroup_ns->proc_inum;
+}
+
+const struct proc_ns_operations cgroupns_operations = {
+	.name		= "cgroup",
+	.type		= CLONE_NEWCGROUP,
+	.get		= cgroupns_get,
+	.put		= cgroupns_put,
+	.install	= cgroupns_install,
+	.inum		= cgroupns_inum,
+};
+
+static __init int cgroup_namespaces_init(void)
+{
+	return 0;
+}
+subsys_initcall(cgroup_namespaces_init);
diff --git a/kernel/fork.c b/kernel/fork.c
index 0cf9cdb..cc06851 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1790,7 +1790,7 @@ static int check_unshare_flags(unsigned long unshare_flags)
 	if (unshare_flags & ~(CLONE_THREAD|CLONE_FS|CLONE_NEWNS|CLONE_SIGHAND|
 				CLONE_VM|CLONE_FILES|CLONE_SYSVSEM|
 				CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWNET|
-				CLONE_NEWUSER|CLONE_NEWPID))
+				CLONE_NEWUSER|CLONE_NEWPID|CLONE_NEWCGROUP))
 		return -EINVAL;
 	/*
 	 * Not implemented, but pretend it works if there is nothing to
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index ef42d0a..a8b1970 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -25,6 +25,7 @@
 #include <linux/proc_ns.h>
 #include <linux/file.h>
 #include <linux/syscalls.h>
+#include <linux/cgroup_namespace.h>
 
 static struct kmem_cache *nsproxy_cachep;
 
@@ -39,6 +40,7 @@ struct nsproxy init_nsproxy = {
 #ifdef CONFIG_NET
 	.net_ns			= &init_net,
 #endif
+	.cgroup_ns		= &init_cgroup_ns,
 };
 
 static inline struct nsproxy *create_nsproxy(void)
@@ -92,6 +94,13 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
 		goto out_pid;
 	}
 
+	new_nsp->cgroup_ns = copy_cgroup_ns(flags, user_ns,
+					    tsk->nsproxy->cgroup_ns);
+	if (IS_ERR(new_nsp->cgroup_ns)) {
+		err = PTR_ERR(new_nsp->cgroup_ns);
+		goto out_cgroup;
+	}
+
 	new_nsp->net_ns = copy_net_ns(flags, user_ns, tsk->nsproxy->net_ns);
 	if (IS_ERR(new_nsp->net_ns)) {
 		err = PTR_ERR(new_nsp->net_ns);
@@ -101,6 +110,9 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
 	return new_nsp;
 
 out_net:
+	if (new_nsp->cgroup_ns)
+		put_cgroup_ns(new_nsp->cgroup_ns);
+out_cgroup:
 	if (new_nsp->pid_ns_for_children)
 		put_pid_ns(new_nsp->pid_ns_for_children);
 out_pid:
@@ -128,7 +140,8 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
 	struct nsproxy *new_ns;
 
 	if (likely(!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
-			      CLONE_NEWPID | CLONE_NEWNET)))) {
+			      CLONE_NEWPID | CLONE_NEWNET |
+			      CLONE_NEWCGROUP)))) {
 		get_nsproxy(old_ns);
 		return 0;
 	}
@@ -165,6 +178,8 @@ void free_nsproxy(struct nsproxy *ns)
 		put_ipc_ns(ns->ipc_ns);
 	if (ns->pid_ns_for_children)
 		put_pid_ns(ns->pid_ns_for_children);
+	if (ns->cgroup_ns)
+		put_cgroup_ns(ns->cgroup_ns);
 	put_net(ns->net_ns);
 	kmem_cache_free(nsproxy_cachep, ns);
 }
@@ -180,7 +195,7 @@ int unshare_nsproxy_namespaces(unsigned long unshare_flags,
 	int err = 0;
 
 	if (!(unshare_flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
-			       CLONE_NEWNET | CLONE_NEWPID)))
+			       CLONE_NEWNET | CLONE_NEWPID | CLONE_NEWCGROUP)))
 		return 0;
 
 	user_ns = new_cred ? new_cred->user_ns : current_user_ns();
-- 
2.1.0.rc2.206.gedb03e5

* [PATCHv1 5/8] cgroup: introduce cgroup namespaces
@ 2014-10-13 21:23       ` Aditya Kali
  0 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-10-13 21:23 UTC (permalink / raw)
  To: tj, lizefan, serge.hallyn, luto, cgroups, linux-kernel, linux-api, mingo
  Cc: containers, jnagal, Aditya Kali

Introduce the ability to create a new cgroup namespace. A newly created
cgroup namespace records the creating task's cgroup at the time of
creation as its 'struct cgroup *root_cgrp'. The task that creates the
new cgroup namespace, and all of its future children, are then
restricted to the cgroup hierarchy under this root_cgrp.

The main purpose of a cgroup namespace is to virtualize the contents
of the /proc/self/cgroup file: processes inside a cgroup namespace
see only paths relative to their namespace root.
This allows container tools (like libcontainer, lxc, lmctfy, etc.)
to create completely virtualized containers without leaking the
system-level cgroup hierarchy to the task.

This patch implements only the 'unshare' part of cgroupns; setns()
on a cgroup namespace still fails with -EINVAL.

Signed-off-by: Aditya Kali <adityakali@google.com>
---
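The path virtualization described in the commit message can be modeled
in userspace as a prefix strip. This is only a sketch under stated
assumptions: cgroups are represented by absolute path strings without a
trailing slash, ns_relative_path() is a hypothetical helper that is not
part of this patch, and returning NULL for a cgroup outside the
namespace root is a choice made for the illustration. What
kernfs_path_from_node() does in that case is not shown in this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical userspace model, not a kernel API.
 *
 * Writes the path of 'cgrp' as seen from a namespace rooted at 'ns_root'
 * into 'buf' and returns buf. Returns NULL if 'cgrp' is not under
 * 'ns_root' or if the result does not fit into 'buflen' bytes.
 */
const char *ns_relative_path(const char *ns_root, const char *cgrp,
			     char *buf, size_t buflen)
{
	size_t root_len;
	const char *rest;

	assert(ns_root && cgrp && buf);
	assert(ns_root[0] == '/' && cgrp[0] == '/');

	root_len = strlen(ns_root);
	if (root_len == 1)
		rest = cgrp;		/* global root: nothing to strip */
	else if (strncmp(cgrp, ns_root, root_len) != 0)
		return NULL;		/* not under the namespace root */
	else
		rest = cgrp + root_len;

	/* Reject partial component matches, e.g. "/a/bc" under "/a/b". */
	if (rest[0] != '\0' && rest[0] != '/')
		return NULL;

	/* The namespace root itself is shown as "/". */
	if (rest[0] == '\0')
		rest = "/";

	if (strlen(rest) + 1 > buflen)
		return NULL;
	strcpy(buf, rest);
	return buf;
}
```

With ns_root set to "/batchjobs/c_job_id1", the cgroup
"/batchjobs/c_job_id1/sub" is reported as "/sub" and the namespace root
itself as "/".
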
 fs/proc/namespaces.c             |   3 +
 include/linux/cgroup.h           |  18 +++++-
 include/linux/cgroup_namespace.h |  62 +++++++++++++++++++
 include/linux/nsproxy.h          |   2 +
 include/linux/proc_ns.h          |   4 ++
 init/Kconfig                     |   9 +++
 kernel/Makefile                  |   1 +
 kernel/cgroup.c                  |  11 ++++
 kernel/cgroup_namespace.c        | 128 +++++++++++++++++++++++++++++++++++++++
 kernel/fork.c                    |   2 +-
 kernel/nsproxy.c                 |  19 +++++-
 11 files changed, 255 insertions(+), 4 deletions(-)

diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index 8902609..e04ed4b 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -32,6 +32,9 @@ static const struct proc_ns_operations *ns_entries[] = {
 	&userns_operations,
 #endif
 	&mntns_operations,
+#ifdef CONFIG_CGROUP_NS
+	&cgroupns_operations,
+#endif
 };
 
 static const struct file_operations ns_file_operations = {
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 4a0eb2d..aa86495 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -22,6 +22,8 @@
 #include <linux/seq_file.h>
 #include <linux/kernfs.h>
 #include <linux/wait.h>
+#include <linux/nsproxy.h>
+#include <linux/types.h>
 
 #ifdef CONFIG_CGROUPS
 
@@ -460,6 +462,13 @@ struct cftype {
 #endif
 };
 
+struct cgroup_namespace {
+	atomic_t		count;
+	unsigned int		proc_inum;
+	struct user_namespace	*user_ns;
+	struct cgroup		*root_cgrp;
+};
+
 extern struct cgroup_root cgrp_dfl_root;
 extern struct css_set init_css_set;
 
@@ -584,10 +593,17 @@ static inline int cgroup_name(struct cgroup *cgrp, char *buf, size_t buflen)
 	return kernfs_name(cgrp->kn, buf, buflen);
 }
 
+static inline char * __must_check cgroup_path_ns(struct cgroup_namespace *ns,
+						 struct cgroup *cgrp, char *buf,
+						 size_t buflen)
+{
+	return kernfs_path_from_node(ns->root_cgrp->kn, cgrp->kn, buf, buflen);
+}
+
 static inline char * __must_check cgroup_path(struct cgroup *cgrp, char *buf,
 					      size_t buflen)
 {
-	return kernfs_path(cgrp->kn, buf, buflen);
+	return cgroup_path_ns(current->nsproxy->cgroup_ns, cgrp, buf, buflen);
 }
 
 static inline void pr_cont_cgroup_name(struct cgroup *cgrp)
diff --git a/include/linux/cgroup_namespace.h b/include/linux/cgroup_namespace.h
new file mode 100644
index 0000000..9f637fe
--- /dev/null
+++ b/include/linux/cgroup_namespace.h
@@ -0,0 +1,62 @@
+#ifndef _LINUX_CGROUP_NAMESPACE_H
+#define _LINUX_CGROUP_NAMESPACE_H
+
+#include <linux/nsproxy.h>
+#include <linux/cgroup.h>
+#include <linux/types.h>
+#include <linux/user_namespace.h>
+
+extern struct cgroup_namespace init_cgroup_ns;
+
+static inline struct cgroup *task_cgroupns_root(struct task_struct *tsk)
+{
+	return tsk->nsproxy->cgroup_ns->root_cgrp;
+}
+
+#ifdef CONFIG_CGROUP_NS
+
+extern void free_cgroup_ns(struct cgroup_namespace *ns);
+
+static inline struct cgroup_namespace *get_cgroup_ns(
+		struct cgroup_namespace *ns)
+{
+	if (ns)
+		atomic_inc(&ns->count);
+	return ns;
+}
+
+static inline void put_cgroup_ns(struct cgroup_namespace *ns)
+{
+	if (ns && atomic_dec_and_test(&ns->count))
+		free_cgroup_ns(ns);
+}
+
+extern struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
+					       struct user_namespace *user_ns,
+					       struct cgroup_namespace *old_ns);
+
+#else  /* CONFIG_CGROUP_NS */
+
+static inline struct cgroup_namespace *get_cgroup_ns(
+		struct cgroup_namespace *ns)
+{
+	return &init_cgroup_ns;
+}
+
+static inline void put_cgroup_ns(struct cgroup_namespace *ns)
+{
+}
+
+static inline struct cgroup_namespace *copy_cgroup_ns(
+		unsigned long flags,
+		struct user_namespace *user_ns,
+		struct cgroup_namespace *old_ns) {
+	if (flags & CLONE_NEWCGROUP)
+		return ERR_PTR(-EINVAL);
+
+	return old_ns;
+}
+
+#endif  /* CONFIG_CGROUP_NS */
+
+#endif  /* _LINUX_CGROUP_NAMESPACE_H */
diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
index 35fa08f..ac0d65b 100644
--- a/include/linux/nsproxy.h
+++ b/include/linux/nsproxy.h
@@ -8,6 +8,7 @@ struct mnt_namespace;
 struct uts_namespace;
 struct ipc_namespace;
 struct pid_namespace;
+struct cgroup_namespace;
 struct fs_struct;
 
 /*
@@ -33,6 +34,7 @@ struct nsproxy {
 	struct mnt_namespace *mnt_ns;
 	struct pid_namespace *pid_ns_for_children;
 	struct net 	     *net_ns;
+	struct cgroup_namespace *cgroup_ns;
 };
 extern struct nsproxy init_nsproxy;
 
diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
index 34a1e10..e56dd73 100644
--- a/include/linux/proc_ns.h
+++ b/include/linux/proc_ns.h
@@ -6,6 +6,8 @@
 
 struct pid_namespace;
 struct nsproxy;
+struct task_struct;
+struct inode;
 
 struct proc_ns_operations {
 	const char *name;
@@ -27,6 +29,7 @@ extern const struct proc_ns_operations ipcns_operations;
 extern const struct proc_ns_operations pidns_operations;
 extern const struct proc_ns_operations userns_operations;
 extern const struct proc_ns_operations mntns_operations;
+extern const struct proc_ns_operations cgroupns_operations;
 
 /*
  * We always define these enumerators
@@ -37,6 +40,7 @@ enum {
 	PROC_UTS_INIT_INO	= 0xEFFFFFFEU,
 	PROC_USER_INIT_INO	= 0xEFFFFFFDU,
 	PROC_PID_INIT_INO	= 0xEFFFFFFCU,
+	PROC_CGROUP_INIT_INO	= 0xEFFFFFFBU,
 };
 
 #ifdef CONFIG_PROC_FS
diff --git a/init/Kconfig b/init/Kconfig
index e84c642..c3be001 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1144,6 +1144,15 @@ config DEBUG_BLK_CGROUP
 	Enable some debugging help. Currently it exports additional stat
 	files in a cgroup which can be useful for debugging.
 
+config CGROUP_NS
+	bool "CGroup Namespaces"
+	default n
+	help
+	  This option enables CGroup Namespaces, which can be used to isolate
+	  cgroup paths. This feature is only useful when the unified cgroup
+	  hierarchy is in use (i.e. cgroups are mounted with the sane_behavior
+	  option).
+
 endif # CGROUPS
 
 config CHECKPOINT_RESTORE
diff --git a/kernel/Makefile b/kernel/Makefile
index dc5c775..75334f8 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -51,6 +51,7 @@ obj-$(CONFIG_KEXEC) += kexec.o
 obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
 obj-$(CONFIG_COMPAT) += compat.o
 obj-$(CONFIG_CGROUPS) += cgroup.o
+obj-$(CONFIG_CGROUP_NS) += cgroup_namespace.o
 obj-$(CONFIG_CGROUP_FREEZER) += cgroup_freezer.o
 obj-$(CONFIG_CPUSETS) += cpuset.o
 obj-$(CONFIG_UTS_NS) += utsname.o
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 2b3e9f9..f8099b4 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -57,6 +57,8 @@
 #include <linux/vmalloc.h> /* TODO: replace with more sophisticated array */
 #include <linux/kthread.h>
 #include <linux/delay.h>
+#include <linux/proc_ns.h>
+#include <linux/cgroup_namespace.h>
 
 #include <linux/atomic.h>
 
@@ -195,6 +197,15 @@ static void kill_css(struct cgroup_subsys_state *css);
 static int cgroup_addrm_files(struct cgroup *cgrp, struct cftype cfts[],
 			      bool is_add);
 
+struct cgroup_namespace init_cgroup_ns = {
+	.count = {
+		.counter = 1,
+	},
+	.proc_inum = PROC_CGROUP_INIT_INO,
+	.user_ns = &init_user_ns,
+	.root_cgrp = &cgrp_dfl_root.cgrp,
+};
+
 /* IDR wrappers which synchronize using cgroup_idr_lock */
 static int cgroup_idr_alloc(struct idr *idr, void *ptr, int start, int end,
 			    gfp_t gfp_mask)
diff --git a/kernel/cgroup_namespace.c b/kernel/cgroup_namespace.c
new file mode 100644
index 0000000..c16604f
--- /dev/null
+++ b/kernel/cgroup_namespace.c
@@ -0,0 +1,128 @@
+
+#include <linux/cgroup.h>
+#include <linux/cgroup_namespace.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/nsproxy.h>
+#include <linux/proc_ns.h>
+
+static struct cgroup_namespace *alloc_cgroup_ns(void)
+{
+	struct cgroup_namespace *new_ns;
+
+	new_ns = kmalloc(sizeof(struct cgroup_namespace), GFP_KERNEL);
+	if (new_ns)
+		atomic_set(&new_ns->count, 1);
+	return new_ns;
+}
+
+void free_cgroup_ns(struct cgroup_namespace *ns)
+{
+	cgroup_put(ns->root_cgrp);
+	put_user_ns(ns->user_ns);
+	proc_free_inum(ns->proc_inum);
+}
+EXPORT_SYMBOL(free_cgroup_ns);
+
+struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
+					struct user_namespace *user_ns,
+					struct cgroup_namespace *old_ns)
+{
+	struct cgroup_namespace *new_ns = NULL;
+	struct cgroup *cgrp = NULL;
+	int err;
+
+	BUG_ON(!old_ns);
+
+	if (!(flags & CLONE_NEWCGROUP))
+		return get_cgroup_ns(old_ns);
+
+	/* Allow only sysadmin to create cgroup namespace. */
+	err = -EPERM;
+	if (!ns_capable(user_ns, CAP_SYS_ADMIN))
+		goto err_out;
+
+	/* Prevent cgroup changes for this task. */
+	threadgroup_lock(current);
+
+	cgrp = get_task_cgroup(current);
+
+	/* Creating new CGROUPNS is supported only when unified hierarchy is in
+	 * use. */
+	err = -EINVAL;
+	if (!cgroup_on_dfl(cgrp))
+		goto err_out_unlock;
+
+	err = -ENOMEM;
+	new_ns = alloc_cgroup_ns();
+	if (!new_ns)
+		goto err_out_unlock;
+
+	err = proc_alloc_inum(&new_ns->proc_inum);
+	if (err)
+		goto err_out_unlock;
+
+	new_ns->user_ns = get_user_ns(user_ns);
+	new_ns->root_cgrp = cgrp;
+
+	threadgroup_unlock(current);
+
+	return new_ns;
+
+err_out_unlock:
+	threadgroup_unlock(current);
+err_out:
+	if (cgrp)
+		cgroup_put(cgrp);
+	kfree(new_ns);
+	return ERR_PTR(err);
+}
+
+static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
+{
+	pr_info("setns not supported for cgroup namespace");
+	return -EINVAL;
+}
+
+static void *cgroupns_get(struct task_struct *task)
+{
+	struct cgroup_namespace *ns = NULL;
+	struct nsproxy *nsproxy;
+
+	rcu_read_lock();
+	nsproxy = task->nsproxy;
+	if (nsproxy) {
+		ns = nsproxy->cgroup_ns;
+		get_cgroup_ns(ns);
+	}
+	rcu_read_unlock();
+
+	return ns;
+}
+
+static void cgroupns_put(void *ns)
+{
+	put_cgroup_ns(ns);
+}
+
+static unsigned int cgroupns_inum(void *ns)
+{
+	struct cgroup_namespace *cgroup_ns = ns;
+
+	return cgroup_ns->proc_inum;
+}
+
+const struct proc_ns_operations cgroupns_operations = {
+	.name		= "cgroup",
+	.type		= CLONE_NEWCGROUP,
+	.get		= cgroupns_get,
+	.put		= cgroupns_put,
+	.install	= cgroupns_install,
+	.inum		= cgroupns_inum,
+};
+
+static __init int cgroup_namespaces_init(void)
+{
+	return 0;
+}
+subsys_initcall(cgroup_namespaces_init);
diff --git a/kernel/fork.c b/kernel/fork.c
index 0cf9cdb..cc06851 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1790,7 +1790,7 @@ static int check_unshare_flags(unsigned long unshare_flags)
 	if (unshare_flags & ~(CLONE_THREAD|CLONE_FS|CLONE_NEWNS|CLONE_SIGHAND|
 				CLONE_VM|CLONE_FILES|CLONE_SYSVSEM|
 				CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWNET|
-				CLONE_NEWUSER|CLONE_NEWPID))
+				CLONE_NEWUSER|CLONE_NEWPID|CLONE_NEWCGROUP))
 		return -EINVAL;
 	/*
 	 * Not implemented, but pretend it works if there is nothing to
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index ef42d0a..a8b1970 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -25,6 +25,7 @@
 #include <linux/proc_ns.h>
 #include <linux/file.h>
 #include <linux/syscalls.h>
+#include <linux/cgroup_namespace.h>
 
 static struct kmem_cache *nsproxy_cachep;
 
@@ -39,6 +40,7 @@ struct nsproxy init_nsproxy = {
 #ifdef CONFIG_NET
 	.net_ns			= &init_net,
 #endif
+	.cgroup_ns		= &init_cgroup_ns,
 };
 
 static inline struct nsproxy *create_nsproxy(void)
@@ -92,6 +94,13 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
 		goto out_pid;
 	}
 
+	new_nsp->cgroup_ns = copy_cgroup_ns(flags, user_ns,
+					    tsk->nsproxy->cgroup_ns);
+	if (IS_ERR(new_nsp->cgroup_ns)) {
+		err = PTR_ERR(new_nsp->cgroup_ns);
+		goto out_cgroup;
+	}
+
 	new_nsp->net_ns = copy_net_ns(flags, user_ns, tsk->nsproxy->net_ns);
 	if (IS_ERR(new_nsp->net_ns)) {
 		err = PTR_ERR(new_nsp->net_ns);
@@ -101,6 +110,9 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
 	return new_nsp;
 
 out_net:
+	if (new_nsp->cgroup_ns)
+		put_cgroup_ns(new_nsp->cgroup_ns);
+out_cgroup:
 	if (new_nsp->pid_ns_for_children)
 		put_pid_ns(new_nsp->pid_ns_for_children);
 out_pid:
@@ -128,7 +140,8 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
 	struct nsproxy *new_ns;
 
 	if (likely(!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
-			      CLONE_NEWPID | CLONE_NEWNET)))) {
+			      CLONE_NEWPID | CLONE_NEWNET |
+			      CLONE_NEWCGROUP)))) {
 		get_nsproxy(old_ns);
 		return 0;
 	}
@@ -165,6 +178,8 @@ void free_nsproxy(struct nsproxy *ns)
 		put_ipc_ns(ns->ipc_ns);
 	if (ns->pid_ns_for_children)
 		put_pid_ns(ns->pid_ns_for_children);
+	if (ns->cgroup_ns)
+		put_cgroup_ns(ns->cgroup_ns);
 	put_net(ns->net_ns);
 	kmem_cache_free(nsproxy_cachep, ns);
 }
@@ -180,7 +195,7 @@ int unshare_nsproxy_namespaces(unsigned long unshare_flags,
 	int err = 0;
 
 	if (!(unshare_flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
-			       CLONE_NEWNET | CLONE_NEWPID)))
+			       CLONE_NEWNET | CLONE_NEWPID | CLONE_NEWCGROUP)))
 		return 0;
 
 	user_ns = new_cred ? new_cred->user_ns : current_user_ns();
-- 
2.1.0.rc2.206.gedb03e5


* [PATCHv1 6/8] cgroup: restrict cgroup operations within task's cgroupns
  2014-10-13 21:23   ` Aditya Kali
@ 2014-10-13 21:23       ` Aditya Kali
  -1 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-10-13 21:23 UTC (permalink / raw)
  To: tj, lizefan, serge.hallyn, luto, cgroups, linux-kernel, linux-api, mingo
  Cc: containers

Restrict the following operations to within the task's cgroupns-root:
* cgroup_mkdir & cgroup_rmdir
* cgroup_attach_task
* writes to cgroup files (rejected outside of the task's cgroupns-root)

Also, reading the /proc/<pid>/cgroup file is now restricted to
tasks under the same cgroupns-root. If a task reads the cgroup of
another task that lies outside of its own cgroupns-root, it sees
no entry for the default hierarchy, the same as if the cgroups
were not mounted.

Signed-off-by: Aditya Kali <adityakali@google.com>
---
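All of the checks listed above reduce to one test: is the target cgroup
the task's cgroupns-root or something below it? A minimal userspace
sketch of that test follows. struct toy_cgroup, toy_is_descendant() and
toy_check_op() are hypothetical names for this illustration only, and
the sketch assumes, as the patch's use of cgroup_is_descendant()
suggests, that a cgroup counts as a descendant of itself.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy node; the real struct cgroup has many more fields. */
struct toy_cgroup {
	struct toy_cgroup *parent;	/* NULL for the hierarchy root */
};

/* True if 'cgrp' is 'ancestor' itself or lies somewhere below it. */
bool toy_is_descendant(const struct toy_cgroup *cgrp,
		       const struct toy_cgroup *ancestor)
{
	while (cgrp) {
		if (cgrp == ancestor)
			return true;
		cgrp = cgrp->parent;
	}
	return false;
}

/*
 * Gate an operation (mkdir, rmdir, attach) on 'target': returns 0 if it
 * lies within the namespace rooted at 'ns_root', -EPERM otherwise.
 */
int toy_check_op(const struct toy_cgroup *target,
		 const struct toy_cgroup *ns_root)
{
	assert(target && ns_root);
	return toy_is_descendant(target, ns_root) ? 0 : -EPERM;
}
```

The walk is linear in the depth of the target cgroup, so the cost of
each gated operation grows with nesting depth rather than with the
total number of cgroups.
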
 kernel/cgroup.c | 34 +++++++++++++++++++++++++++++++++-
 1 file changed, 33 insertions(+), 1 deletion(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index f8099b4..2fc0dfa 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2318,6 +2318,12 @@ static int cgroup_attach_task(struct cgroup *dst_cgrp,
 	struct task_struct *task;
 	int ret;
 
+	/* Only allow changing cgroups accessible within task's cgroup
+	 * namespace. i.e. 'dst_cgrp' should be a descendant of task's
+	 * cgroupns->root_cgrp. */
+	if (!cgroup_is_descendant(dst_cgrp, task_cgroupns_root(leader)))
+		return -EPERM;
+
 	/* look up all src csets */
 	down_read(&css_set_rwsem);
 	rcu_read_lock();
@@ -2882,6 +2888,10 @@ static ssize_t cgroup_file_write(struct kernfs_open_file *of, char *buf,
 	struct cgroup_subsys_state *css;
 	int ret;
 
+	/* Reject writes to cgroup files outside of task's cgroupns-root. */
+	if (!cgroup_is_descendant(cgrp, task_cgroupns_root(current)))
+		return -EINVAL;
+
 	if (cft->write)
 		return cft->write(of, buf, nbytes, off);
 
@@ -4560,6 +4570,13 @@ static int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
 	parent = cgroup_kn_lock_live(parent_kn);
 	if (!parent)
 		return -ENODEV;
+
+	/* Allow mkdir only within process's cgroup namespace root. */
+	if (!cgroup_is_descendant(parent, task_cgroupns_root(current))) {
+		ret = -EPERM;
+		goto out_unlock;
+	}
+
 	root = parent->root;
 
 	/* allocate the cgroup and its ID, 0 is reserved for the root */
@@ -4822,6 +4839,13 @@ static int cgroup_rmdir(struct kernfs_node *kn)
 	if (!cgrp)
 		return 0;
 
+	/* Allow rmdir only within process's cgroup namespace root.
+	 * The process can't delete its own root anyways. */
+	if (!cgroup_is_descendant(cgrp, task_cgroupns_root(current))) {
+		cgroup_kn_unlock(kn);
+		return -EPERM;
+	}
+
 	ret = cgroup_destroy_locked(cgrp);
 
 	cgroup_kn_unlock(kn);
@@ -5051,6 +5075,15 @@ int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns,
 		if (root == &cgrp_dfl_root && !cgrp_dfl_root_visible)
 			continue;
 
+		cgrp = task_cgroup_from_root(tsk, root);
+
+		/* The cgroup path on default hierarchy is shown only if it
+		 * falls under current task's cgroupns-root.
+		 */
+		if (root == &cgrp_dfl_root &&
+		    !cgroup_is_descendant(cgrp, task_cgroupns_root(current)))
+			continue;
+
 		seq_printf(m, "%d:", root->hierarchy_id);
 		for_each_subsys(ss, ssid)
 			if (root->subsys_mask & (1 << ssid))
@@ -5059,7 +5092,6 @@ int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns,
 			seq_printf(m, "%sname=%s", count ? "," : "",
 				   root->name);
 		seq_putc(m, ':');
-		cgrp = task_cgroup_from_root(tsk, root);
 		path = cgroup_path(cgrp, buf, PATH_MAX);
 		if (!path) {
 			retval = -ENAMETOOLONG;
-- 
2.1.0.rc2.206.gedb03e5

* [PATCHv1 7/8] cgroup: cgroup namespace setns support
       [not found]   ` <1413235430-22944-1-git-send-email-adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
                       ` (5 preceding siblings ...)
  2014-10-13 21:23       ` Aditya Kali
@ 2014-10-13 21:23     ` Aditya Kali
  2014-10-13 21:23     ` [PATCHv1 8/8] cgroup: mount cgroupns-root when inside non-init cgroupns Aditya Kali
                       ` (2 subsequent siblings)
  9 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-10-13 21:23 UTC (permalink / raw)
  To: tj, lizefan, serge.hallyn, luto, cgroups, linux-kernel, linux-api, mingo
  Cc: containers

setns() on a cgroup namespace is allowed only if:
* the task has CAP_SYS_ADMIN in its current user namespace and
  over the user namespace associated with the target cgroupns;
* the task's current cgroup is a descendant of the target
  cgroupns-root cgroup;
* the target cgroupns-root is the same as or deeper than the task's
  current cgroupns-root. This keeps the task from escaping its
  cgroupns-root and ensures that setns() can only restrict the task
  to a deeper part of the cgroup hierarchy.

Signed-off-by: Aditya Kali <adityakali@google.com>
---
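The three conditions above can be sketched in userspace as follows.
This is an illustration under stated assumptions, not the patch itself:
struct toy_cgroup and toy_setns_allowed() are hypothetical names, the
two boolean parameters stand in for the two ns_capable() checks, and
the default-hierarchy check and all locking are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy node; the real struct cgroup has many more fields. */
struct toy_cgroup {
	struct toy_cgroup *parent;	/* NULL for the hierarchy root */
};

/* True if 'cgrp' is 'ancestor' itself or lies somewhere below it. */
bool toy_is_descendant(const struct toy_cgroup *cgrp,
		       const struct toy_cgroup *ancestor)
{
	while (cgrp) {
		if (cgrp == ancestor)
			return true;
		cgrp = cgrp->parent;
	}
	return false;
}

/*
 * Returns 0 if the switch would be allowed, -EPERM on a missing
 * capability and -EINVAL if either topology condition fails.
 */
int toy_setns_allowed(bool has_cap_current, bool has_cap_target,
		      const struct toy_cgroup *task_cgrp,
		      const struct toy_cgroup *cur_ns_root,
		      const struct toy_cgroup *target_ns_root)
{
	assert(task_cgrp && cur_ns_root && target_ns_root);

	if (!has_cap_current || !has_cap_target)
		return -EPERM;

	/* The task must already live under the target namespace root. */
	if (!toy_is_descendant(task_cgrp, target_ns_root))
		return -EINVAL;

	/* The target root must be the same as or deeper than the current one. */
	if (!toy_is_descendant(target_ns_root, cur_ns_root))
		return -EINVAL;

	return 0;
}
```

Taken together, the two topology checks place the target root on the
path between the task's current cgroupns-root and the task's own
cgroup, which is why a successful switch can only narrow the task's
view.
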
 kernel/cgroup_namespace.c | 44 ++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 42 insertions(+), 2 deletions(-)

diff --git a/kernel/cgroup_namespace.c b/kernel/cgroup_namespace.c
index c16604f..c612946 100644
--- a/kernel/cgroup_namespace.c
+++ b/kernel/cgroup_namespace.c
@@ -80,8 +80,48 @@ err_out:
 
 static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
 {
-	pr_info("setns not supported for cgroup namespace");
-	return -EINVAL;
+	struct cgroup_namespace *cgroup_ns = ns;
+	struct task_struct *task = current;
+	struct cgroup *cgrp = NULL;
+	int err = 0;
+
+	if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN) ||
+	    !ns_capable(cgroup_ns->user_ns, CAP_SYS_ADMIN))
+		return -EPERM;
+
+	/* Prevent cgroup changes for this task. */
+	threadgroup_lock(task);
+
+	cgrp = get_task_cgroup(task);
+
+	err = -EINVAL;
+	if (!cgroup_on_dfl(cgrp))
+		goto out_unlock;
+
+	/* Allow switch only if the task's current cgroup is descendant of the
+	 * target cgroup_ns->root_cgrp.
+	 */
+	if (!cgroup_is_descendant(cgrp, cgroup_ns->root_cgrp))
+		goto out_unlock;
+
+	/* Only allow setns to a cgroupns root-ed deeper than task's current
+	 * cgroupns-root. This will make sure that tasks cannot escape their
+	 * cgroupns by attaching to parent cgroupns.
+	 */
+	if (!cgroup_is_descendant(cgroup_ns->root_cgrp,
+				  task_cgroupns_root(task)))
+		goto out_unlock;
+
+	err = 0;
+	get_cgroup_ns(cgroup_ns);
+	put_cgroup_ns(nsproxy->cgroup_ns);
+	nsproxy->cgroup_ns = cgroup_ns;
+
+out_unlock:
+	threadgroup_unlock(task);
+	if (cgrp)
+		cgroup_put(cgrp);
+	return err;
 }
 
 static void *cgroupns_get(struct task_struct *task)
-- 
2.1.0.rc2.206.gedb03e5

^ permalink raw reply related	[flat|nested] 384+ messages in thread

* [PATCHv1 8/8] cgroup: mount cgroupns-root when inside non-init cgroupns
       [not found]   ` <1413235430-22944-1-git-send-email-adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
                       ` (6 preceding siblings ...)
  2014-10-13 21:23     ` [PATCHv1 7/8] cgroup: cgroup namespace setns support Aditya Kali
@ 2014-10-13 21:23     ` Aditya Kali
  2014-10-14 22:42       ` Andy Lutomirski
  2014-10-19  4:54       ` Eric W. Biederman
  9 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-10-13 21:23 UTC (permalink / raw)
  To: tj-DgEjT+Ai2ygdnm+yROfE0A, lizefan-hv44wF8Li93QT0dZR+AlfA,
	serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA, luto-kltTT9wpgjJwATOyAt5JVQ,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, mingo-H+wXaHxf7aLQT0dZR+AlfA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

This patch enables cgroup mounting inside a userns when a process
has appropriate privileges. The cgroup filesystem mounted is
rooted at the cgroupns-root. Thus, in a container setup, only
the hierarchy under the cgroupns-root is exposed inside the container.
This allows container management tools to run inside the containers
without depending on any global state.
In order to support this, a new kernfs API is added to look up the
dentry for the cgroupns-root.
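
The effect of rooting the view at the cgroupns-root can be modeled in
user space as a path translation. The sketch below is an illustration
under stated assumptions, not part of the patch: paths are normalized
strings, and cgroupns_view() is a hypothetical name. It mirrors the
behavior described in the cover letter, where a cgroup under the
cgroupns-root is shown relative to that root and a cgroup outside it
is not shown at all.

```c
#include <string.h>

/*
 * Model of the namespaced view: translate a real cgroup path into the
 * path seen from a cgroupns rooted at ns_root. Paths are assumed to be
 * normalized ("/" or "/a/b", no trailing slash).
 * Returns 0 and fills out on success, -1 if the cgroup lies outside
 * ns_root (nothing is shown in that case) or if out is too small.
 */
int cgroupns_view(const char *real_path, const char *ns_root,
		  char *out, size_t len)
{
	size_t n = strlen(ns_root);
	const char *rest;

	if (strcmp(ns_root, "/") == 0) {
		/* The init namespace sees real paths unchanged. */
		rest = real_path;
	} else {
		if (strncmp(real_path, ns_root, n) != 0)
			return -1;
		/* Reject prefix matches that split a path component. */
		if (real_path[n] != '\0' && real_path[n] != '/')
			return -1;
		rest = real_path + n;
	}
	/* The cgroupns-root itself appears as "/". */
	if (*rest == '\0')
		rest = "/";
	if (strlen(rest) + 1 > len)
		return -1;
	strcpy(out, rest);
	return 0;
}
```

With ns_root set to /batchjobs/c_job_id1, the real path
/batchjobs/c_job_id1/sub_cgrp_1 translates to /sub_cgrp_1, and the same
path viewed from a namespace rooted at /batchjobs/c_job_id2 yields no
result.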

Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
 fs/kernfs/mount.c      | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/kernfs.h |  2 ++
 kernel/cgroup.c        | 47 +++++++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 95 insertions(+), 2 deletions(-)

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index f973ae9..e334f45 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -62,6 +62,54 @@ struct kernfs_root *kernfs_root_from_sb(struct super_block *sb)
 	return NULL;
 }
 
+/**
+ * kernfs_obtain_root - create new root dentry for the given kernfs_node.
+ * @sb: the kernfs super_block
+ * @kn: kernfs_node for which a dentry is needed
+ *
+ * This can be used by callers which want to mount only a part of the kernfs
+ * as root of the filesystem.
+ */
+struct dentry *kernfs_obtain_root(struct super_block *sb,
+				  struct kernfs_node *kn)
+{
+	struct dentry *dentry;
+	struct inode *inode;
+
+	BUG_ON(sb->s_op != &kernfs_sops);
+
+	/* inode for the given kernfs_node should already exist. */
+	inode = ilookup(sb, kn->ino);
+	if (!inode) {
+		pr_debug("kernfs: could not get inode for '");
+		pr_cont_kernfs_path(kn);
+		pr_cont("'.\n");
+		return ERR_PTR(-EINVAL);
+	}
+
+	/* instantiate and link root dentry */
+	dentry = d_obtain_root(inode);
+	if (IS_ERR(dentry)) {
+		pr_debug("kernfs: could not get dentry for '");
+		pr_cont_kernfs_path(kn);
+		pr_cont("'.\n");
+		return dentry;
+	}
+
+	/* If this is a new dentry, set it up. We need kernfs_mutex because this
+	 * may be called by callers other than kernfs_fill_super. */
+	mutex_lock(&kernfs_mutex);
+	if (!dentry->d_fsdata) {
+		kernfs_get(kn);
+		dentry->d_fsdata = kn;
+	} else {
+		WARN_ON(dentry->d_fsdata != kn);
+	}
+	mutex_unlock(&kernfs_mutex);
+
+	return dentry;
+}
+
 static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
 {
 	struct kernfs_super_info *info = kernfs_info(sb);
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 3c2be75..b9538e0 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -274,6 +274,8 @@ void kernfs_put(struct kernfs_node *kn);
 struct kernfs_node *kernfs_node_from_dentry(struct dentry *dentry);
 struct kernfs_root *kernfs_root_from_sb(struct super_block *sb);
 
+struct dentry *kernfs_obtain_root(struct super_block *sb,
+				  struct kernfs_node *kn);
 struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
 				       unsigned int flags, void *priv);
 void kernfs_destroy_root(struct kernfs_root *root);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 2fc0dfa..ef27dc4 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1302,6 +1302,13 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
 
 	memset(opts, 0, sizeof(*opts));
 
+	/* Implicitly add CGRP_ROOT_SANE_BEHAVIOR if inside a non-init cgroup
+	 * namespace.
+	 */
+	if (current->nsproxy->cgroup_ns != &init_cgroup_ns) {
+		opts->flags |= CGRP_ROOT_SANE_BEHAVIOR;
+	}
+
 	while ((token = strsep(&o, ",")) != NULL) {
 		nr_opts++;
 
@@ -1391,7 +1398,7 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
 
 	if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
 		pr_warn("sane_behavior: this is still under development and its behaviors will change, proceed at your own risk\n");
-		if (nr_opts != 1) {
+		if (nr_opts > 1) {
 			pr_err("sane_behavior: no other mount options allowed\n");
 			return -EINVAL;
 		}
@@ -1581,6 +1588,15 @@ static void init_cgroup_root(struct cgroup_root *root,
 		set_bit(CGRP_CPUSET_CLONE_CHILDREN, &root->cgrp.flags);
 }
 
+struct dentry *cgroupns_get_root(struct super_block *sb,
+				 struct cgroup_namespace *ns)
+{
+	struct dentry *nsdentry;
+
+	nsdentry = kernfs_obtain_root(sb, ns->root_cgrp->kn);
+	return nsdentry;
+}
+
 static int cgroup_setup_root(struct cgroup_root *root, unsigned int ss_mask)
 {
 	LIST_HEAD(tmp_links);
@@ -1684,6 +1700,14 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
 	int ret;
 	int i;
 	bool new_sb;
+	struct cgroup_namespace *ns =
+		get_cgroup_ns(current->nsproxy->cgroup_ns);
+
+	/* Check if the caller has permission to mount. */
+	if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) {
+		put_cgroup_ns(ns);
+		return ERR_PTR(-EPERM);
+	}
 
 	/*
 	 * The first time anyone tries to mount a cgroup, enable the list
@@ -1816,11 +1840,28 @@ out_free:
 	kfree(opts.release_agent);
 	kfree(opts.name);
 
-	if (ret)
+	if (ret) {
+		put_cgroup_ns(ns);
 		return ERR_PTR(ret);
+	}
 
 	dentry = kernfs_mount(fs_type, flags, root->kf_root,
 				CGROUP_SUPER_MAGIC, &new_sb);
+
+	if (!IS_ERR(dentry)) {
+		/* If this mount is for a non-init cgroup namespace, then
+		 * instead of root's dentry, we return the dentry specific to
+		 * the cgroupns->root_cgrp.
+		 */
+		if (ns != &init_cgroup_ns) {
+			struct dentry *nsdentry;
+
+			nsdentry = cgroupns_get_root(dentry->d_sb, ns);
+			dput(dentry);
+			dentry = nsdentry;
+		}
+	}
+
 	if (IS_ERR(dentry) || !new_sb)
 		cgroup_put(&root->cgrp);
 
@@ -1833,6 +1874,7 @@ out_free:
 		deactivate_super(pinned_sb);
 	}
 
+	put_cgroup_ns(ns);
 	return dentry;
 }
 
@@ -1861,6 +1903,7 @@ static struct file_system_type cgroup_fs_type = {
 	.name = "cgroup",
 	.mount = cgroup_mount,
 	.kill_sb = cgroup_kill_sb,
+	.fs_flags = FS_USERNS_MOUNT,
 };
 
 static struct kobject *cgroup_kobj;
-- 
2.1.0.rc2.206.gedb03e5

^ permalink raw reply related	[flat|nested] 384+ messages in thread

* Re: [PATCHv1 0/8] CGroup Namespaces
  2014-10-13 21:23   ` Aditya Kali
@ 2014-10-14 22:42       ` Andy Lutomirski
  -1 siblings, 0 replies; 384+ messages in thread
From: Andy Lutomirski @ 2014-10-14 22:42 UTC (permalink / raw)
  To: Aditya Kali
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Ingo Molnar

On Mon, Oct 13, 2014 at 2:23 PM, Aditya Kali <adityakali@google.com> wrote:
> Second take at the Cgroup Namespace patch-set.
>
> Major changes from RFC (V0):
> 1. setns support for cgroupns
> 2. 'mount -t cgroup cgroup <mntpt>' from inside a cgroupns now
>    mounts the cgroup hierarchy with cgroupns-root as the filesystem root.
> 3. writes to cgroup files outside of cgroupns-root are not allowed
> 4. visibility of /proc/<pid>/cgroup is further restricted by not showing
>    anything if the <pid> is in a sibling cgroupns and its cgroup falls outside
>    your cgroupns-root.
>
> More details in the writeup below.
>
> Background
>   Cgroups and Namespaces are used together to create “virtual”
>   containers that isolate the host environment from the processes
>   running in the container. But since cgroups themselves are not
>   “virtualized”, a task is always able to see the global cgroup view
>   through the cgroupfs mount and via the /proc/self/cgroup file.
>
>   $ cat /proc/self/cgroup
>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>
>   This exposure of cgroup names to the processes running inside a
>   container results in some problems:
>   (1) The container names are typically host-container-management-agent
>       (systemd, docker/libcontainer, etc.) data and leaking its name (or
>       leaking the hierarchy) reveals too much information about the host
>       system.
>   (2) It makes the container migration across machines (CRIU) more
>       difficult as the container names need to be unique across the
>       machines in the migration domain.
>   (3) It makes it difficult to run container management tools (like
>       docker/libcontainer, lmctfy, etc.) within virtual containers
>       without adding dependency on some state/agent present outside the
>       container.
>
>   Note that the feature proposed here is completely different than the
>   “ns cgroup” feature which existed in the linux kernel until recently.
>   The ns cgroup also attempted to connect cgroups and namespaces by
>   creating a new cgroup every time a new namespace was created. It did
>   not solve any of the above mentioned problems and was later dropped
>   from the kernel. Incidentally though, it used the same config option
>   name CONFIG_CGROUP_NS as used in my prototype!
>
> Introducing CGroup Namespaces
>   With the unified cgroup hierarchy
>   (Documentation/cgroups/unified-hierarchy.txt), the containers can now
>   have a much more coherent cgroup view and it's easy to associate a
>   container with a single cgroup. This also allows us to virtualize the
>   cgroup view for tasks inside the container.
>
>   The new CGroup Namespace allows a process to “unshare” its cgroup
>   hierarchy starting from the cgroup it's currently in.
>   For Ex:
>   $ cat /proc/self/cgroup
>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>   $ ls -l /proc/self/ns/cgroup
>   lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
>   $ ~/unshare -c  # calls unshare(CLONE_NEWCGROUP) and exec’s /bin/bash
>   [ns]$ ls -l /proc/self/ns/cgroup
>   lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup ->
>   cgroup:[4026532183]
>   # From within new cgroupns, process sees that it's in the root cgroup
>   [ns]$ cat /proc/self/cgroup
>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
>
>   # From global cgroupns:
>   $ cat /proc/<pid>/cgroup
>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>
>   # Unshare cgroupns along with userns and mountns
>   # Following calls unshare(CLONE_NEWCGROUP|CLONE_NEWUSER|CLONE_NEWNS), then
>   # sets up uid/gid map and exec’s /bin/bash
>   $ ~/unshare -c -u -m
>
>   # Originally, we were in /batchjobs/c_job_id1 cgroup. Mount our own cgroup
>   # hierarchy.
>   [ns]$ mount -t cgroup cgroup /tmp/cgroup
>   [ns]$ ls -l /tmp/cgroup
>   total 0
>   -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.controllers
>   -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.populated
>   -rw-r--r-- 1 root root 0 2014-10-13 09:25 cgroup.procs
>   -rw-r--r-- 1 root root 0 2014-10-13 09:32 cgroup.subtree_control
>
>   The cgroupns-root (/batchjobs/c_job_id1 in above example) becomes the
>   filesystem root for the namespace specific cgroupfs mount.
>
>   The virtualization of /proc/self/cgroup file combined with restricting
>   the view of cgroup hierarchy by namespace-private cgroupfs mount
>   should provide a completely isolated cgroup view inside the container.
>
>   In its current form, the cgroup namespaces patchset provides the following
>   behavior:
>
>   (1) The “root” cgroup for a cgroup namespace is the cgroup in which
>       the process calling unshare is running.
>       For ex. if a process in /batchjobs/c_job_id1 cgroup calls unshare,
>       cgroup /batchjobs/c_job_id1 becomes the cgroupns-root.
>       For the init_cgroup_ns, this is the real root (“/”) cgroup
>       (identified in code as cgrp_dfl_root.cgrp).
>
>   (2) The cgroupns-root cgroup does not change even if the namespace
>       creator process later moves to a different cgroup.
>       $ ~/unshare -c # unshare cgroupns in some cgroup
>       [ns]$ cat /proc/self/cgroup
>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
>       [ns]$ mkdir sub_cgrp_1
>       [ns]$ echo 0 > sub_cgrp_1/cgroup.procs
>       [ns]$ cat /proc/self/cgroup
>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
>
>   (3) Each process gets its CGROUPNS specific view of
>       /proc/<pid>/cgroup.
>   (a) Processes running inside the cgroup namespace will be able to see
>       cgroup paths (in /proc/self/cgroup) only inside their root cgroup
>       [ns]$ sleep 100000 &  # From within unshared cgroupns
>       [1] 7353
>       [ns]$ echo 7353 > sub_cgrp_1/cgroup.procs
>       [ns]$ cat /proc/7353/cgroup
>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
>
>   (b) From global cgroupns, the real cgroup path will be visible:
>       $ cat /proc/7353/cgroup
>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1

This is a little weird.  Not sure it's a problem.

>
>   (c) From a sibling cgroupns (cgroupns root-ed at a sibling cgroup), no cgroup
>       path will be visible:
>       # ns2's cgroupns-root is at '/batchjobs/c_job_id2'
>       [ns2]$ cat /proc/7353/cgroup
>       [ns2]$
>       This is same as when cgroup hierarchy is not mounted at all.
>       (In correct container setup though, it should not be possible to
>        access PIDs in another container in the first place.)
>
>   (4) Processes inside a cgroupns are not allowed to move out of the
>       cgroupns-root. This is true even if a privileged process in global
>       cgroupns tries to move the process out of its cgroupns-root.
>
>       # From global cgroupns
>       $ cat /proc/7353/cgroup
>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
>       # cgroupns-root for 7353 is /batchjobs/c_job_id1
>       $ echo 7353 > batchjobs/c_job_id2/cgroup.procs
>       -bash: echo: write error: Operation not permitted
>

>
>   (6) When some thread from a multi-threaded process unshares its
>       cgroup-namespace, the new cgroupns gets applied to the entire
>       process (all the threads). This should be OK since
>       unified-hierarchy only allows process-level containerization. So
>       all the threads in the process will have the same cgroup. And both
>       - changing cgroups and unsharing namespaces - are protected under
>       threadgroup_lock(task).

This seems odd to me.  Does unsharing the cgroupns unshare for all
tasks in the process?  If not, then I think that it shouldn't change
the cgroup either.

What did you end up doing to grant permission to unshare the cgroup ns?

--Andy

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv1 0/8] CGroup Namespaces
@ 2014-10-14 22:42       ` Andy Lutomirski
  0 siblings, 0 replies; 384+ messages in thread
From: Andy Lutomirski @ 2014-10-14 22:42 UTC (permalink / raw)
  To: Aditya Kali
  Cc: Tejun Heo, Li Zefan, Serge Hallyn, cgroups, linux-kernel,
	Linux API, Ingo Molnar, Linux Containers, jnagal

On Mon, Oct 13, 2014 at 2:23 PM, Aditya Kali <adityakali@google.com> wrote:
> Second take at the Cgroup Namespace patch-set.
>
> Major changes from RFC (V0):
> 1. setns support for cgroupns
> 2. 'mount -t cgroup cgroup <mntpt>' from inside a cgroupns now
>    mounts the cgroup hierarchy with cgroupns-root as the filesystem root.
> 3. writes to cgroup files outside of cgroupns-root are not allowed
> 4. visibility of /proc/<pid>/cgroup is further restricted by not showing
>    anything if the <pid> is in a sibling cgroupns and its cgroup falls outside
>    your cgroupns-root.
>
> More details in the writeup below.
>
> Background
>   Cgroups and Namespaces are used together to create “virtual”
>   containers that isolate the host environment from the processes
>   running in the container. But since cgroups themselves are not
>   “virtualized”, a task is always able to see the global cgroup view
>   through the cgroupfs mount and via the /proc/self/cgroup file.
>
>   $ cat /proc/self/cgroup
>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>
>   This exposure of cgroup names to the processes running inside a
>   container results in some problems:
>   (1) The container names are typically host-container-management-agent
>       (systemd, docker/libcontainer, etc.) data and leaking its name (or
>       leaking the hierarchy) reveals too much information about the host
>       system.
>   (2) It makes the container migration across machines (CRIU) more
>       difficult as the container names need to be unique across the
>       machines in the migration domain.
>   (3) It makes it difficult to run container management tools (like
>       docker/libcontainer, lmctfy, etc.) within virtual containers
>       without adding dependency on some state/agent present outside the
>       container.
>
>   Note that the feature proposed here is completely different than the
>   “ns cgroup” feature which existed in the linux kernel until recently.
>   The ns cgroup also attempted to connect cgroups and namespaces by
>   creating a new cgroup every time a new namespace was created. It did
>   not solve any of the above mentioned problems and was later dropped
>   from the kernel. Incidentally though, it used the same config option
>   name CONFIG_CGROUP_NS as used in my prototype!
>
> Introducing CGroup Namespaces
>   With the unified cgroup hierarchy
>   (Documentation/cgroups/unified-hierarchy.txt), the containers can now
>   have a much more coherent cgroup view and it's easy to associate a
>   container with a single cgroup. This also allows us to virtualize the
>   cgroup view for tasks inside the container.
>
>   The new CGroup Namespace allows a process to “unshare” its cgroup
>   hierarchy starting from the cgroup it's currently in.
>   For Ex:
>   $ cat /proc/self/cgroup
>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>   $ ls -l /proc/self/ns/cgroup
>   lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
>   $ ~/unshare -c  # calls unshare(CLONE_NEWCGROUP) and exec’s /bin/bash
>   [ns]$ ls -l /proc/self/ns/cgroup
>   lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup ->
>   cgroup:[4026532183]
>   # From within new cgroupns, process sees that it's in the root cgroup
>   [ns]$ cat /proc/self/cgroup
>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
>
>   # From global cgroupns:
>   $ cat /proc/<pid>/cgroup
>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>
>   # Unshare cgroupns along with userns and mountns
>   # Following calls unshare(CLONE_NEWCGROUP|CLONE_NEWUSER|CLONE_NEWNS), then
>   # sets up uid/gid map and exec’s /bin/bash
>   $ ~/unshare -c -u -m
>
>   # Originally, we were in /batchjobs/c_job_id1 cgroup. Mount our own cgroup
>   # hierarchy.
>   [ns]$ mount -t cgroup cgroup /tmp/cgroup
>   [ns]$ ls -l /tmp/cgroup
>   total 0
>   -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.controllers
>   -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.populated
>   -rw-r--r-- 1 root root 0 2014-10-13 09:25 cgroup.procs
>   -rw-r--r-- 1 root root 0 2014-10-13 09:32 cgroup.subtree_control
>
>   The cgroupns-root (/batchjobs/c_job_id1 in above example) becomes the
>   filesystem root for the namespace specific cgroupfs mount.
>
>   The virtualization of /proc/self/cgroup file combined with restricting
>   the view of cgroup hierarchy by namespace-private cgroupfs mount
>   should provide a completely isolated cgroup view inside the container.
>
>   In its current form, the cgroup namespaces patchset provides the following
>   behavior:
>
>   (1) The “root” cgroup for a cgroup namespace is the cgroup in which
>       the process calling unshare is running.
>       For ex. if a process in /batchjobs/c_job_id1 cgroup calls unshare,
>       cgroup /batchjobs/c_job_id1 becomes the cgroupns-root.
>       For the init_cgroup_ns, this is the real root (“/”) cgroup
>       (identified in code as cgrp_dfl_root.cgrp).
>
>   (2) The cgroupns-root cgroup does not change even if the namespace
>       creator process later moves to a different cgroup.
>       $ ~/unshare -c # unshare cgroupns in some cgroup
>       [ns]$ cat /proc/self/cgroup
>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
>       [ns]$ mkdir sub_cgrp_1
>       [ns]$ echo 0 > sub_cgrp_1/cgroup.procs
>       [ns]$ cat /proc/self/cgroup
>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
>
>   (3) Each process gets its CGROUPNS specific view of
>       /proc/<pid>/cgroup.
>   (a) Processes running inside the cgroup namespace will be able to see
>       cgroup paths (in /proc/self/cgroup) only inside their root cgroup
>       [ns]$ sleep 100000 &  # From within unshared cgroupns
>       [1] 7353
>       [ns]$ echo 7353 > sub_cgrp_1/cgroup.procs
>       [ns]$ cat /proc/7353/cgroup
>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
>
>   (b) From global cgroupns, the real cgroup path will be visible:
>       $ cat /proc/7353/cgroup
>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1

This is a little weird.  Not sure it's a problem.

>
>   (c) From a sibling cgroupns (cgroupns root-ed at a sibling cgroup), no cgroup
>       path will be visible:
>       # ns2's cgroupns-root is at '/batchjobs/c_job_id2'
>       [ns2]$ cat /proc/7353/cgroup
>       [ns2]$
>       This is same as when cgroup hierarchy is not mounted at all.
>       (In correct container setup though, it should not be possible to
>        access PIDs in another container in the first place.)
>
>   (4) Processes inside a cgroupns are not allowed to move out of the
>       cgroupns-root. This is true even if a privileged process in global
>       cgroupns tries to move the process out of its cgroupns-root.
>
>       # From global cgroupns
>       $ cat /proc/7353/cgroup
>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
>       # cgroupns-root for 7353 is /batchjobs/c_job_id1
>       $ echo 7353 > batchjobs/c_job_id2/cgroup.procs
>       -bash: echo: write error: Operation not permitted
>

>
>   (6) When some thread from a multi-threaded process unshares its
>       cgroup-namespace, the new cgroupns gets applied to the entire
>       process (all the threads). This should be OK since
>       unified-hierarchy only allows process-level containerization. So
>       all the threads in the process will have the same cgroup. And both
>       - changing cgroups and unsharing namespaces - are protected under
>       threadgroup_lock(task).

This seems odd to me.  Does unsharing the cgroupns unshare for all
tasks in the process?  If not, then I think that it shouldn't change
the cgroup either.

What did you end up doing to grant permission to unshare the cgroup ns?

--Andy

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv1 0/8] CGroup Namespaces
  2014-10-14 22:42       ` Andy Lutomirski
@ 2014-10-14 23:33           ` Aditya Kali
  -1 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-10-14 23:33 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Ingo Molnar

On Tue, Oct 14, 2014 at 3:42 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Mon, Oct 13, 2014 at 2:23 PM, Aditya Kali <adityakali@google.com> wrote:
>> Second take at the Cgroup Namespace patch-set.
>>
>> Major changes from RFC (V0):
>> 1. setns support for cgroupns
>> 2. 'mount -t cgroup cgroup <mntpt>' from inside a cgroupns now
>>    mounts the cgroup hierarchy with cgroupns-root as the filesystem root.
>> 3. writes to cgroup files outside of cgroupns-root are not allowed
>> 4. visibility of /proc/<pid>/cgroup is further restricted by not showing
>>    anything if the <pid> is in a sibling cgroupns and its cgroup falls outside
>>    your cgroupns-root.
>>
>> More details in the writeup below.
>>
>> Background
>>   Cgroups and Namespaces are used together to create “virtual”
>>   containers that isolates the host environment from the processes
>>   running in container. But since cgroups themselves are not
>>   “virtualized”, the task is always able to see global cgroups view
>>   through cgroupfs mount and via /proc/self/cgroup file.
>>
>>   $ cat /proc/self/cgroup
>>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>>
>>   This exposure of cgroup names to the processes running inside a
>>   container results in some problems:
>>   (1) The container names are typically host-container-management-agent
>>       (systemd, docker/libcontainer, etc.) data and leaking its name (or
>>       leaking the hierarchy) reveals too much information about the host
>>       system.
>>   (2) It makes the container migration across machines (CRIU) more
>>       difficult as the container names need to be unique across the
>>       machines in the migration domain.
>>   (3) It makes it difficult to run container management tools (like
>>       docker/libcontainer, lmctfy, etc.) within virtual containers
>>       without adding dependency on some state/agent present outside the
>>       container.
>>
>>   Note that the feature proposed here is completely different than the
>>   “ns cgroup” feature which existed in the linux kernel until recently.
>>   The ns cgroup also attempted to connect cgroups and namespaces by
>>   creating a new cgroup every time a new namespace was created. It did
>>   not solve any of the above mentioned problems and was later dropped
>>   from the kernel. Incidentally though, it used the same config option
>>   name CONFIG_CGROUP_NS as used in my prototype!
>>
>> Introducing CGroup Namespaces
>>   With unified cgroup hierarchy
>>   (Documentation/cgroups/unified-hierarchy.txt), the containers can now
>>   have a much more coherent cgroup view and it's easy to associate a
>>   container with a single cgroup. This also allows us to virtualize the
>>   cgroup view for tasks inside the container.
>>
>>   The new CGroup Namespace allows a process to “unshare” its cgroup
>>   hierarchy starting from the cgroup it's currently in.
>>   For example:
>>   $ cat /proc/self/cgroup
>>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>>   $ ls -l /proc/self/ns/cgroup
>>   lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
>>   $ ~/unshare -c  # calls unshare(CLONE_NEWCGROUP) and exec’s /bin/bash
>>   [ns]$ ls -l /proc/self/ns/cgroup
>>   lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
>>   # From within new cgroupns, process sees that it's in the root cgroup
>>   [ns]$ cat /proc/self/cgroup
>>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
>>
>>   # From global cgroupns:
>>   $ cat /proc/<pid>/cgroup
>>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>>
>>   # Unshare cgroupns along with userns and mountns
>>   # Following calls unshare(CLONE_NEWCGROUP|CLONE_NEWUSER|CLONE_NEWNS), then
>>   # sets up uid/gid map and exec’s /bin/bash
>>   $ ~/unshare -c -u -m
>>
>>   # Originally, we were in /batchjobs/c_job_id1 cgroup. Mount our own cgroup
>>   # hierarchy.
>>   [ns]$ mount -t cgroup cgroup /tmp/cgroup
>>   [ns]$ ls -l /tmp/cgroup
>>   total 0
>>   -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.controllers
>>   -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.populated
>>   -rw-r--r-- 1 root root 0 2014-10-13 09:25 cgroup.procs
>>   -rw-r--r-- 1 root root 0 2014-10-13 09:32 cgroup.subtree_control
>>
>>   The cgroupns-root (/batchjobs/c_job_id1 in above example) becomes the
>>   filesystem root for the namespace specific cgroupfs mount.
>>
>>   The virtualization of /proc/self/cgroup file combined with restricting
>>   the view of cgroup hierarchy by namespace-private cgroupfs mount
>>   should provide a completely isolated cgroup view inside the container.
>>
>>   In its current form, the cgroup namespaces patchset provides the following
>>   behavior:
>>
>>   (1) The “root” cgroup for a cgroup namespace is the cgroup in which
>>       the process calling unshare is running.
>>       For ex. if a process in /batchjobs/c_job_id1 cgroup calls unshare,
>>       cgroup /batchjobs/c_job_id1 becomes the cgroupns-root.
>>       For the init_cgroup_ns, this is the real root (“/”) cgroup
>>       (identified in code as cgrp_dfl_root.cgrp).
>>
>>   (2) The cgroupns-root cgroup does not change even if the namespace
>>       creator process later moves to a different cgroup.
>>       $ ~/unshare -c # unshare cgroupns in some cgroup
>>       [ns]$ cat /proc/self/cgroup
>>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
>>       [ns]$ mkdir sub_cgrp_1
>>       [ns]$ echo 0 > sub_cgrp_1/cgroup.procs
>>       [ns]$ cat /proc/self/cgroup
>>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
>>
>>   (3) Each process gets its CGROUPNS specific view of
>>       /proc/<pid>/cgroup.
>>   (a) Processes running inside the cgroup namespace will be able to see
>>       cgroup paths (in /proc/self/cgroup) only inside their root cgroup
>>       [ns]$ sleep 100000 &  # From within unshared cgroupns
>>       [1] 7353
>>       [ns]$ echo 7353 > sub_cgrp_1/cgroup.procs
>>       [ns]$ cat /proc/7353/cgroup
>>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
>>
>>   (b) From global cgroupns, the real cgroup path will be visible:
>>       $ cat /proc/7353/cgroup
>>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
>
> This is a little weird.  Not sure it's a problem.
>
>>
>>   (c) From a sibling cgroupns (cgroupns root-ed at a sibling cgroup), no cgroup
>>       path will be visible:
>>       # ns2's cgroupns-root is at '/batchjobs/c_job_id2'
>>       [ns2]$ cat /proc/7353/cgroup
>>       [ns2]$
>>       This is same as when cgroup hierarchy is not mounted at all.
>>       (In correct container setup though, it should not be possible to
>>        access PIDs in another container in the first place.)
>>
>>   (4) Processes inside a cgroupns are not allowed to move out of the
>>       cgroupns-root. This is true even if a privileged process in global
>>       cgroupns tries to move the process out of its cgroupns-root.
>>
>>       # From global cgroupns
>>       $ cat /proc/7353/cgroup
>>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
>>       # cgroupns-root for 7353 is /batchjobs/c_job_id1
>>       $ echo 7353 > batchjobs/c_job_id2/cgroup.procs
>>       -bash: echo: write error: Operation not permitted
>>
>
>>
>>   (6) When some thread from a multi-threaded process unshares its
>>       cgroup-namespace, the new cgroupns gets applied to the entire
>>       process (all the threads). This should be OK since
>>       unified-hierarchy only allows process-level containerization. So
>>       all the threads in the process will have the same cgroup. And both
>>       - changing cgroups and unsharing namespaces - are protected under
>>       threadgroup_lock(task).
>
> This seems odd to me.  Does unsharing the cgroupns unshare for all
> tasks in the process?  If not, then I think that it shouldn't change
> the cgroup either.
>

Unsharing cgroupns unshares for all tasks in the process, yes.

The cgroup changes are protected by threadgroup_lock. So it made sense
to protect cgroupns changes (unshare or setns) by the same lock, as we
don't want a task's cgroup to change underneath us while we are changing
its cgroup-namespace. No cgroup change happens during the
unshare/setns call.
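
To illustrate that last point, here is a small userspace model. It is
only a sketch: the struct and function names (model_task, model_unshare,
model_view) are made up for illustration and do not appear in the patch
set. It assumes that unshare merely pins the caller's current cgroup as
the cgroupns-root and that /proc/<pid>/cgroup then shows paths relative
to that root, or nothing for cgroups outside it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of a task; not a kernel structure. */
struct model_task {
	const char *cgroup;	/* real (global) cgroup path */
	const char *ns_root;	/* cgroupns-root; "/" for the init namespace */
};

/*
 * Model of unshare(CLONE_NEWCGROUP): pin the current cgroup as the
 * namespace root. The task's cgroup membership is left untouched.
 */
void model_unshare(struct model_task *t)
{
	t->ns_root = t->cgroup;
}

/*
 * Path of @cgroup as seen from a namespace rooted at @ns_root.
 * Returns "/" for the root itself, the suffix below the root for a
 * descendant, and NULL when @cgroup lies outside the root.
 */
const char *model_view(const char *ns_root, const char *cgroup)
{
	size_t n;

	assert(ns_root && cgroup);
	n = strlen(ns_root);

	if (strcmp(ns_root, "/") == 0)
		return cgroup;		/* init namespace: full path */
	if (strncmp(cgroup, ns_root, n) != 0)
		return NULL;		/* not under this root */
	if (cgroup[n] == '\0')
		return "/";		/* exactly the root */
	if (cgroup[n] == '/')
		return cgroup + n;	/* descendant: relative path */
	return NULL;			/* shares a name prefix only */
}
```

With a task in /batchjobs/c_job_id1, model_unshare() changes only
ns_root; moving the task to .../sub_cgrp_1 afterwards makes model_view()
return "/sub_cgrp_1", while a namespace rooted at /batchjobs/c_job_id2
gets NULL for the same task.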

> What did you end up doing to grant permission to unshare the cgroup ns?
>

Currently the only requirement is ns_capable(cgroupns->user_ns,
CAP_SYS_ADMIN). It's possible to refine this further, but for now I
just kept it simple. I am looking into the explicit permission check
discussed previously (https://lkml.org/lkml/2014/7/29/402), but wanted
to get this out sooner.
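
For anyone who wants to poke at the permission check from userspace,
something like the following should do. It is a sketch under
assumptions: it presumes a kernel that knows CLONE_NEWCGROUP and falls
back to the value 0x02000000 used by mainline since Linux 4.6, which may
not match the flag value in this prototype. The helper names
(try_unshare_cgroupns, print_self_cgroup) are made up here.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <stdio.h>

#ifndef CLONE_NEWCGROUP
#define CLONE_NEWCGROUP 0x02000000	/* assumption: mainline value */
#endif

/*
 * Try to move the caller into a new cgroup namespace.
 * Returns 0 on success, otherwise the errno value, e.g. EPERM when
 * CAP_SYS_ADMIN is missing or EINVAL when the kernel lacks cgroupns.
 */
int try_unshare_cgroupns(void)
{
	if (unshare(CLONE_NEWCGROUP) == 0)
		return 0;
	assert(errno != 0);	/* a failed syscall always sets errno */
	return errno;
}

/*
 * Print this process's own view of its cgroup membership.
 * Returns 0 on success, otherwise the errno value from fopen().
 */
int print_self_cgroup(void)
{
	char line[4096];
	FILE *f = fopen("/proc/self/cgroup", "r");

	if (!f)
		return errno;
	while (fgets(line, sizeof(line), f))
		fputs(line, stdout);
	fclose(f);
	return 0;
}
```

Calling print_self_cgroup() before and after a successful
try_unshare_cgroupns() should show the path collapsing to "/", as in the
transcript quoted above. Run unprivileged, the unshare is expected to
fail with EPERM.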

> --Andy

Thanks,
-- 
Aditya

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv1 1/8] kernfs: Add API to generate relative kernfs path
       [not found]     ` <1413235430-22944-2-git-send-email-adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2014-10-16 16:07       ` Serge E. Hallyn
  0 siblings, 0 replies; 384+ messages in thread
From: Serge E. Hallyn @ 2014-10-16 16:07 UTC (permalink / raw)
  To: Aditya Kali
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, luto-kltTT9wpgjJwATOyAt5JVQ,
	tj-DgEjT+Ai2ygdnm+yROfE0A, cgroups-u79uwXL29TY76Z2rM5mHXA,
	mingo-H+wXaHxf7aLQT0dZR+AlfA

Quoting Aditya Kali (adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org):
> The new function kernfs_path_from_node() generates and returns
> kernfs path of a given kernfs_node relative to a given parent
> kernfs_node.
> 
> Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>

Acked-by: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>

(with or without my comment below taken)

> ---
>  fs/kernfs/dir.c        | 53 ++++++++++++++++++++++++++++++++++++++++----------
>  include/linux/kernfs.h |  3 +++
>  2 files changed, 46 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> index a693f5b..8655485 100644
> --- a/fs/kernfs/dir.c
> +++ b/fs/kernfs/dir.c
> @@ -44,14 +44,24 @@ static int kernfs_name_locked(struct kernfs_node *kn, char *buf, size_t buflen)
>  	return strlcpy(buf, kn->parent ? kn->name : "/", buflen);
>  }
>  
> -static char * __must_check kernfs_path_locked(struct kernfs_node *kn, char *buf,
> -					      size_t buflen)
> +static char * __must_check kernfs_path_from_node_locked(
> +	struct kernfs_node *kn_root,
> +	struct kernfs_node *kn,
> +	char *buf,
> +	size_t buflen)
>  {
>  	char *p = buf + buflen;
>  	int len;
>  
> +	BUG_ON(!buflen);
> +
>  	*--p = '\0';
>  
> +	if (kn == kn_root) {
> +		*--p = '/';
> +		return p;
> +	}
> +
>  	do {
>  		len = strlen(kn->name);
>  		if (p - buf < len + 1) {
> @@ -63,6 +73,8 @@ static char * __must_check kernfs_path_locked(struct kernfs_node *kn, char *buf,
>  		memcpy(p, kn->name, len);
>  		*--p = '/';
>  		kn = kn->parent;
> +		if (kn == kn_root)
> +			break;

I wonder if it would be clearer if you instead changed the while condition, i.e.

	} while (kn && kn != kn_root && kn->parent);

i.e. it's not a special condition, just a part of the expected flow.
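
To make the comparison concrete, here is a userspace sketch of the walk
with the root test folded into the loop condition. It is not the kernel
code: struct node is a made-up stand-in for struct kernfs_node, there is
no locking, and the explicit space check before writing the lone "/" is
my own addition.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct kernfs_node. */
struct node {
	const char *name;
	struct node *parent;	/* NULL for the filesystem root */
};

/*
 * Build the path of @kn relative to @root at the END of @buf.
 * Returns a pointer into @buf, or NULL if @buf is too small.
 * A NULL @root, or one that is not an ancestor of @kn, yields the
 * full path.
 */
char *path_from_node(const struct node *root, const struct node *kn,
		     char *buf, size_t buflen)
{
	char *p = buf + buflen;
	size_t len;

	assert(buflen > 0);
	*--p = '\0';

	/* The relative root (or the real root) is just "/". */
	if (kn == root || !kn->parent) {
		if (p == buf)
			return NULL;	/* no room for the slash */
		*--p = '/';
		return p;
	}

	do {
		len = strlen(kn->name);
		if ((size_t)(p - buf) < len + 1) {
			buf[0] = '\0';
			return NULL;
		}
		p -= len;
		memcpy(p, kn->name, len);
		*--p = '/';
		kn = kn->parent;
	} while (kn && kn != root && kn->parent);

	return p;
}
```

For a chain root -> batchjobs -> c_job_id1 -> sub_cgrp_1, passing
c_job_id1 as @root gives "/sub_cgrp_1" and passing NULL gives the full
path, so the single loop condition covers both cases without a separate
break.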

>  	} while (kn && kn->parent);
>  
>  	return p;
> @@ -92,26 +104,47 @@ int kernfs_name(struct kernfs_node *kn, char *buf, size_t buflen)
>  }
>  
>  /**
> - * kernfs_path - build full path of a given node
> + * kernfs_path_from_node - build path of node @kn relative to @kn_root.
> + * @kn_root: parent kernfs_node relative to which we need to build the path
>   * @kn: kernfs_node of interest
> - * @buf: buffer to copy @kn's name into
> + * @buf: buffer to copy @kn's path into
>   * @buflen: size of @buf
>   *
> - * Builds and returns the full path of @kn in @buf of @buflen bytes.  The
> - * path is built from the end of @buf so the returned pointer usually
> + * Builds and returns @kn's path relative to @kn_root. @kn_root is expected to
> + * be parent of @kn at some level. If this is not true or if @kn_root is NULL,
> + * then full path of @kn is returned.
> + * The path is built from the end of @buf so the returned pointer usually
>   * doesn't match @buf.  If @buf isn't long enough, @buf is nul terminated
>   * and %NULL is returned.
>   */
> -char *kernfs_path(struct kernfs_node *kn, char *buf, size_t buflen)
> +char *kernfs_path_from_node(struct kernfs_node *kn_root, struct kernfs_node *kn,
> +			    char *buf, size_t buflen)
>  {
>  	unsigned long flags;
>  	char *p;
>  
>  	spin_lock_irqsave(&kernfs_rename_lock, flags);
> -	p = kernfs_path_locked(kn, buf, buflen);
> +	p = kernfs_path_from_node_locked(kn_root, kn, buf, buflen);
>  	spin_unlock_irqrestore(&kernfs_rename_lock, flags);
>  	return p;
>  }
> +EXPORT_SYMBOL_GPL(kernfs_path_from_node);
> +
> +/**
> + * kernfs_path - build full path of a given node
> + * @kn: kernfs_node of interest
> + * @buf: buffer to copy @kn's name into
> + * @buflen: size of @buf
> + *
> + * Builds and returns the full path of @kn in @buf of @buflen bytes.  The
> + * path is built from the end of @buf so the returned pointer usually
> + * doesn't match @buf.  If @buf isn't long enough, @buf is nul terminated
> + * and %NULL is returned.
> + */
> +char *kernfs_path(struct kernfs_node *kn, char *buf, size_t buflen)
> +{
> +	return kernfs_path_from_node(NULL, kn, buf, buflen);
> +}
>  EXPORT_SYMBOL_GPL(kernfs_path);
>  
>  /**
> @@ -145,8 +178,8 @@ void pr_cont_kernfs_path(struct kernfs_node *kn)
>  
>  	spin_lock_irqsave(&kernfs_rename_lock, flags);
>  
> -	p = kernfs_path_locked(kn, kernfs_pr_cont_buf,
> -			       sizeof(kernfs_pr_cont_buf));
> +	p = kernfs_path_from_node_locked(NULL, kn, kernfs_pr_cont_buf,
> +					 sizeof(kernfs_pr_cont_buf));
>  	if (p)
>  		pr_cont("%s", p);
>  	else
> diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
> index 30faf79..3c2be75 100644
> --- a/include/linux/kernfs.h
> +++ b/include/linux/kernfs.h
> @@ -258,6 +258,9 @@ static inline bool kernfs_ns_enabled(struct kernfs_node *kn)
>  }
>  
>  int kernfs_name(struct kernfs_node *kn, char *buf, size_t buflen);
> +char * __must_check kernfs_path_from_node(struct kernfs_node *root_kn,
> +					  struct kernfs_node *kn, char *buf,
> +					  size_t buflen);
>  char * __must_check kernfs_path(struct kernfs_node *kn, char *buf,
>  				size_t buflen);
>  void pr_cont_kernfs_name(struct kernfs_node *kn);
> -- 
> 2.1.0.rc2.206.gedb03e5
> 
> _______________________________________________
> Containers mailing list
> Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv1 1/8] kernfs: Add API to generate relative kernfs path
       [not found]     ` <1413235430-22944-2-git-send-email-adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2014-10-16 16:07       ` Serge E. Hallyn
  0 siblings, 0 replies; 384+ messages in thread
From: Serge E. Hallyn @ 2014-10-16 16:07 UTC (permalink / raw)
  To: Aditya Kali
  Cc: tj, lizefan, serge.hallyn, luto, cgroups, linux-kernel,
	linux-api, mingo, containers

Quoting Aditya Kali (adityakali@google.com):
> The new function kernfs_path_from_node() generates and returns
> kernfs path of a given kernfs_node relative to a given parent
> kernfs_node.
> 
> Signed-off-by: Aditya Kali <adityakali@google.com>

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>

(with or without my comment below taken)

> ---
>  fs/kernfs/dir.c        | 53 ++++++++++++++++++++++++++++++++++++++++----------
>  include/linux/kernfs.h |  3 +++
>  2 files changed, 46 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> index a693f5b..8655485 100644
> --- a/fs/kernfs/dir.c
> +++ b/fs/kernfs/dir.c
> @@ -44,14 +44,24 @@ static int kernfs_name_locked(struct kernfs_node *kn, char *buf, size_t buflen)
>  	return strlcpy(buf, kn->parent ? kn->name : "/", buflen);
>  }
>  
> -static char * __must_check kernfs_path_locked(struct kernfs_node *kn, char *buf,
> -					      size_t buflen)
> +static char * __must_check kernfs_path_from_node_locked(
> +	struct kernfs_node *kn_root,
> +	struct kernfs_node *kn,
> +	char *buf,
> +	size_t buflen)
>  {
>  	char *p = buf + buflen;
>  	int len;
>  
> +	BUG_ON(!buflen);
> +
>  	*--p = '\0';
>  
> +	if (kn == kn_root) {
> +		*--p = '/';
> +		return p;
> +	}
> +
>  	do {
>  		len = strlen(kn->name);
>  		if (p - buf < len + 1) {
> @@ -63,6 +73,8 @@ static char * __must_check kernfs_path_locked(struct kernfs_node *kn, char *buf,
>  		memcpy(p, kn->name, len);
>  		*--p = '/';
>  		kn = kn->parent;
> +		if (kn == kn_root)
> +			break;

I wonder if it would be clearer if you instead changed the while condition, i.e.

	} while (kn && kn != kn_root && kn->parent);

i.e. it's not a special condition, just a part of the expected flow.

>  	} while (kn && kn->parent);
>  
>  	return p;
> @@ -92,26 +104,47 @@ int kernfs_name(struct kernfs_node *kn, char *buf, size_t buflen)
>  }
>  
>  /**
> - * kernfs_path - build full path of a given node
> + * kernfs_path_from_node - build path of node @kn relative to @kn_root.
> + * @kn_root: parent kernfs_node relative to which we need to build the path
>   * @kn: kernfs_node of interest
> - * @buf: buffer to copy @kn's name into
> + * @buf: buffer to copy @kn's path into
>   * @buflen: size of @buf
>   *
> - * Builds and returns the full path of @kn in @buf of @buflen bytes.  The
> - * path is built from the end of @buf so the returned pointer usually
> + * Builds and returns @kn's path relative to @kn_root. @kn_root is expected to
> + * be parent of @kn at some level. If this is not true or if @kn_root is NULL,
> + * then full path of @kn is returned.
> + * The path is built from the end of @buf so the returned pointer usually
>   * doesn't match @buf.  If @buf isn't long enough, @buf is nul terminated
>   * and %NULL is returned.
>   */
> -char *kernfs_path(struct kernfs_node *kn, char *buf, size_t buflen)
> +char *kernfs_path_from_node(struct kernfs_node *kn_root, struct kernfs_node *kn,
> +			    char *buf, size_t buflen)
>  {
>  	unsigned long flags;
>  	char *p;
>  
>  	spin_lock_irqsave(&kernfs_rename_lock, flags);
> -	p = kernfs_path_locked(kn, buf, buflen);
> +	p = kernfs_path_from_node_locked(kn_root, kn, buf, buflen);
>  	spin_unlock_irqrestore(&kernfs_rename_lock, flags);
>  	return p;
>  }
> +EXPORT_SYMBOL_GPL(kernfs_path_from_node);
> +
> +/**
> + * kernfs_path - build full path of a given node
> + * @kn: kernfs_node of interest
> + * @buf: buffer to copy @kn's name into
> + * @buflen: size of @buf
> + *
> + * Builds and returns the full path of @kn in @buf of @buflen bytes.  The
> + * path is built from the end of @buf so the returned pointer usually
> + * doesn't match @buf.  If @buf isn't long enough, @buf is nul terminated
> + * and %NULL is returned.
> + */
> +char *kernfs_path(struct kernfs_node *kn, char *buf, size_t buflen)
> +{
> +	return kernfs_path_from_node(NULL, kn, buf, buflen);
> +}
>  EXPORT_SYMBOL_GPL(kernfs_path);
>  
>  /**
> @@ -145,8 +178,8 @@ void pr_cont_kernfs_path(struct kernfs_node *kn)
>  
>  	spin_lock_irqsave(&kernfs_rename_lock, flags);
>  
> -	p = kernfs_path_locked(kn, kernfs_pr_cont_buf,
> -			       sizeof(kernfs_pr_cont_buf));
> +	p = kernfs_path_from_node_locked(NULL, kn, kernfs_pr_cont_buf,
> +					 sizeof(kernfs_pr_cont_buf));
>  	if (p)
>  		pr_cont("%s", p);
>  	else
> diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
> index 30faf79..3c2be75 100644
> --- a/include/linux/kernfs.h
> +++ b/include/linux/kernfs.h
> @@ -258,6 +258,9 @@ static inline bool kernfs_ns_enabled(struct kernfs_node *kn)
>  }
>  
>  int kernfs_name(struct kernfs_node *kn, char *buf, size_t buflen);
> +char * __must_check kernfs_path_from_node(struct kernfs_node *root_kn,
> +					  struct kernfs_node *kn, char *buf,
> +					  size_t buflen);
>  char * __must_check kernfs_path(struct kernfs_node *kn, char *buf,
>  				size_t buflen);
>  void pr_cont_kernfs_name(struct kernfs_node *kn);
> -- 
> 2.1.0.rc2.206.gedb03e5
> 
> _______________________________________________
> Containers mailing list
> Containers@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv1 2/8] sched: new clone flag CLONE_NEWCGROUP for cgroup namespace
       [not found]       ` <1413235430-22944-3-git-send-email-adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2014-10-16 16:08         ` Serge E. Hallyn
  0 siblings, 0 replies; 384+ messages in thread
From: Serge E. Hallyn @ 2014-10-16 16:08 UTC (permalink / raw)
  To: Aditya Kali
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, luto-kltTT9wpgjJwATOyAt5JVQ,
	tj-DgEjT+Ai2ygdnm+yROfE0A, cgroups-u79uwXL29TY76Z2rM5mHXA,
	mingo-H+wXaHxf7aLQT0dZR+AlfA

Quoting Aditya Kali (adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org):
> CLONE_NEWCGROUP will be used to create new cgroup namespace.
> 
> Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>

Acked-by: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
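
For readers wondering how the flag would be consumed from userspace, a
minimal sketch follows. It assumes the flag is accepted by unshare(2) in
the same way as the other CLONE_NEW* flags. The helper name
enter_new_cgroupns() is hypothetical, and the raw syscall is used only so
the sketch does not depend on libc headers that predate the flag. On a
kernel without cgroup namespace support the call is expected to fail
(typically with EINVAL), and an unprivileged caller will normally get
EPERM.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Value proposed in this patch; older libc headers may not define it. */
#ifndef CLONE_NEWCGROUP
#define CLONE_NEWCGROUP 0x02000000
#endif

/* If libc already defines the flag, it should agree with the patch. */
static_assert(CLONE_NEWCGROUP == 0x02000000,
	      "CLONE_NEWCGROUP does not match the value in the patch");

/*
 * Hypothetical helper: try to move the calling process into a new cgroup
 * namespace.  Returns 0 on success or a negative errno value on failure.
 */
static int enter_new_cgroupns(void)
{
	if (syscall(SYS_unshare, CLONE_NEWCGROUP) == -1)
		return -errno;
	return 0;
}
```

A caller would check the return value and fall back to running without a
separate cgroup namespace when the kernel rejects the flag.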

> ---
>  include/uapi/linux/sched.h | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
> index 34f9d73..2f90d00 100644
> --- a/include/uapi/linux/sched.h
> +++ b/include/uapi/linux/sched.h
> @@ -21,8 +21,7 @@
>  #define CLONE_DETACHED		0x00400000	/* Unused, ignored */
>  #define CLONE_UNTRACED		0x00800000	/* set if the tracing process can't force CLONE_PTRACE on this clone */
>  #define CLONE_CHILD_SETTID	0x01000000	/* set the TID in the child */
> -/* 0x02000000 was previously the unused CLONE_STOPPED (Start in stopped state)
> -   and is now available for re-use. */
> +#define CLONE_NEWCGROUP		0x02000000	/* New cgroup namespace */
>  #define CLONE_NEWUTS		0x04000000	/* New utsname group? */
>  #define CLONE_NEWIPC		0x08000000	/* New ipcs */
>  #define CLONE_NEWUSER		0x10000000	/* New user namespace */
> -- 
> 2.1.0.rc2.206.gedb03e5
> 
> _______________________________________________
> Containers mailing list
> Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv1 3/8] cgroup: add function to get task's cgroup on default hierarchy
       [not found]     ` <1413235430-22944-4-git-send-email-adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2014-10-16 16:13       ` Serge E. Hallyn
  0 siblings, 0 replies; 384+ messages in thread
From: Serge E. Hallyn @ 2014-10-16 16:13 UTC (permalink / raw)
  To: Aditya Kali
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, luto-kltTT9wpgjJwATOyAt5JVQ,
	tj-DgEjT+Ai2ygdnm+yROfE0A, cgroups-u79uwXL29TY76Z2rM5mHXA,
	mingo-H+wXaHxf7aLQT0dZR+AlfA

Quoting Aditya Kali (adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org):
> get_task_cgroup() returns the (reference counted) cgroup of the
> given task on the default hierarchy.
> 
> Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>

Acked-by: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
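
The contract described in the comment (look the cgroup up and take the
reference under the lock, caller drops it when done) can be sketched in
userspace as below. This is only an analogy under stated assumptions:
struct group, struct task, get_task_group() and group_put() are
hypothetical names, a single pthread mutex stands in for cgroup_mutex and
css_set_rwsem, and a plain counter stands in for the css reference count.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-ins for struct cgroup and struct task_struct. */
struct group {
	int refcnt;
	const char *name;
};

struct task {
	struct group *grp;
};

/* One lock standing in for cgroup_mutex + css_set_rwsem. */
static pthread_mutex_t membership_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Return @t's group with an extra reference held.  The lookup and the
 * reference bump happen under the same lock, so the task cannot be moved
 * to another group between the two steps.
 */
static struct group *get_task_group(struct task *t)
{
	struct group *g;

	pthread_mutex_lock(&membership_lock);
	g = t->grp;
	g->refcnt++;
	pthread_mutex_unlock(&membership_lock);
	return g;
}

/* Drop a reference obtained from get_task_group(). */
static void group_put(struct group *g)
{
	pthread_mutex_lock(&membership_lock);
	assert(g->refcnt > 0);
	g->refcnt--;
	pthread_mutex_unlock(&membership_lock);
}
```

Every successful get_task_group() call must be paired with exactly one
group_put(), mirroring the cgroup_get()/cgroup_put() pairing the patch
comment asks of callers.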

> ---
>  include/linux/cgroup.h |  1 +
>  kernel/cgroup.c        | 25 +++++++++++++++++++++++++
>  2 files changed, 26 insertions(+)
> 
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 1d51968..80ed6e0 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -579,6 +579,7 @@ static inline void pr_cont_cgroup_path(struct cgroup *cgrp)
>  }
>  
>  char *task_cgroup_path(struct task_struct *task, char *buf, size_t buflen);
> +struct cgroup *get_task_cgroup(struct task_struct *task);
>  
>  int cgroup_add_dfl_cftypes(struct cgroup_subsys *ss, struct cftype *cfts);
>  int cgroup_add_legacy_cftypes(struct cgroup_subsys *ss, struct cftype *cfts);
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index cab7dc4..56d507b 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -1916,6 +1916,31 @@ char *task_cgroup_path(struct task_struct *task, char *buf, size_t buflen)
>  }
>  EXPORT_SYMBOL_GPL(task_cgroup_path);
>  
> +/*
> + * get_task_cgroup - returns the cgroup of the task in the default cgroup
> + * hierarchy.
> + *
> + * @task: target task
> + * This function returns the @task's cgroup on the default cgroup hierarchy. The
> + * returned cgroup has its reference incremented (by calling cgroup_get()). So
> + * the caller must cgroup_put() the obtained reference once it is done with it.
> + */
> +struct cgroup *get_task_cgroup(struct task_struct *task)
> +{
> +	struct cgroup *cgrp;
> +
> +	mutex_lock(&cgroup_mutex);
> +	down_read(&css_set_rwsem);
> +
> +	cgrp = task_cgroup_from_root(task, &cgrp_dfl_root);
> +	cgroup_get(cgrp);
> +
> +	up_read(&css_set_rwsem);
> +	mutex_unlock(&cgroup_mutex);
> +	return cgrp;
> +}
> +EXPORT_SYMBOL_GPL(get_task_cgroup);
> +
>  /* used to track tasks and other necessary states during migration */
>  struct cgroup_taskset {
>  	/* the src and dst cset list running through cset->mg_node */
> -- 
> 2.1.0.rc2.206.gedb03e5
> 
> _______________________________________________
> Containers mailing list
> Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv1 4/8] cgroup: export cgroup_get() and cgroup_put()
       [not found]       ` <1413235430-22944-5-git-send-email-adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2014-10-16 16:14         ` Serge E. Hallyn
  0 siblings, 0 replies; 384+ messages in thread
From: Serge E. Hallyn @ 2014-10-16 16:14 UTC (permalink / raw)
  To: Aditya Kali
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, luto-kltTT9wpgjJwATOyAt5JVQ,
	tj-DgEjT+Ai2ygdnm+yROfE0A, cgroups-u79uwXL29TY76Z2rM5mHXA,
	mingo-H+wXaHxf7aLQT0dZR+AlfA

Quoting Aditya Kali (adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org):
> move cgroup_get() and cgroup_put() into cgroup.h so that
> they can be called from other places.
> 
> Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>

Acked-by: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
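
Since the helpers become visible to other callers, the difference between
the unconditional get and the tryget variant is worth spelling out. The
sketch below is a simplified userspace model using C11 atomics, not the
kernel implementation (css reference counting is more elaborate than a
single counter). struct obj, obj_get(), obj_tryget() and obj_put() are
hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical refcounted object standing in for struct cgroup. */
struct obj {
	atomic_int refcnt;
};

/* Unconditional get: the caller must already know the object is alive. */
static inline void obj_get(struct obj *o)
{
	/* loosely stands in for WARN_ON_ONCE(cgroup_is_dead(cgrp)) */
	assert(atomic_load(&o->refcnt) > 0);
	atomic_fetch_add(&o->refcnt, 1);
}

/* Conditional get: fails instead of reviving an object with no users. */
static inline bool obj_tryget(struct obj *o)
{
	int old = atomic_load(&o->refcnt);

	while (old > 0) {
		/* on failure, old is reloaded with the current value */
		if (atomic_compare_exchange_weak(&o->refcnt, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; returns true when the last reference went away. */
static inline bool obj_put(struct obj *o)
{
	return atomic_fetch_sub(&o->refcnt, 1) == 1;
}
```

Code that cannot prove the object is still alive should use the tryget
form and handle failure, which is why all three helpers are moved
together.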

> ---
>  include/linux/cgroup.h | 22 ++++++++++++++++++++++
>  kernel/cgroup.c        | 22 ----------------------
>  2 files changed, 22 insertions(+), 22 deletions(-)
> 
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 80ed6e0..4a0eb2d 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -521,6 +521,28 @@ static inline bool cgroup_on_dfl(const struct cgroup *cgrp)
>  	return cgrp->root == &cgrp_dfl_root;
>  }
>  
> +/* convenient tests for these bits */
> +static inline bool cgroup_is_dead(const struct cgroup *cgrp)
> +{
> +	return !(cgrp->self.flags & CSS_ONLINE);
> +}
> +
> +static inline void cgroup_get(struct cgroup *cgrp)
> +{
> +	WARN_ON_ONCE(cgroup_is_dead(cgrp));
> +	css_get(&cgrp->self);
> +}
> +
> +static inline bool cgroup_tryget(struct cgroup *cgrp)
> +{
> +	return css_tryget(&cgrp->self);
> +}
> +
> +static inline void cgroup_put(struct cgroup *cgrp)
> +{
> +	css_put(&cgrp->self);
> +}
> +
>  /* no synchronization, the result can only be used as a hint */
>  static inline bool cgroup_has_tasks(struct cgroup *cgrp)
>  {
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 56d507b..2b3e9f9 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -284,12 +284,6 @@ static struct cgroup_subsys_state *cgroup_e_css(struct cgroup *cgrp,
>  	return cgroup_css(cgrp, ss);
>  }
>  
> -/* convenient tests for these bits */
> -static inline bool cgroup_is_dead(const struct cgroup *cgrp)
> -{
> -	return !(cgrp->self.flags & CSS_ONLINE);
> -}
> -
>  struct cgroup_subsys_state *of_css(struct kernfs_open_file *of)
>  {
>  	struct cgroup *cgrp = of->kn->parent->priv;
> @@ -1002,22 +996,6 @@ static umode_t cgroup_file_mode(const struct cftype *cft)
>  	return mode;
>  }
>  
> -static void cgroup_get(struct cgroup *cgrp)
> -{
> -	WARN_ON_ONCE(cgroup_is_dead(cgrp));
> -	css_get(&cgrp->self);
> -}
> -
> -static bool cgroup_tryget(struct cgroup *cgrp)
> -{
> -	return css_tryget(&cgrp->self);
> -}
> -
> -static void cgroup_put(struct cgroup *cgrp)
> -{
> -	css_put(&cgrp->self);
> -}
> -
>  /**
>   * cgroup_refresh_child_subsys_mask - update child_subsys_mask
>   * @cgrp: the target cgroup
> -- 
> 2.1.0.rc2.206.gedb03e5
> 
> _______________________________________________
> Containers mailing list
> Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv1 4/8] cgroup: export cgroup_get() and cgroup_put()
       [not found]       ` <1413235430-22944-5-git-send-email-adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2014-10-16 16:14         ` Serge E. Hallyn
  0 siblings, 0 replies; 384+ messages in thread
From: Serge E. Hallyn @ 2014-10-16 16:14 UTC (permalink / raw)
  To: Aditya Kali
  Cc: tj, lizefan, serge.hallyn, luto, cgroups, linux-kernel,
	linux-api, mingo, containers

Quoting Aditya Kali (adityakali@google.com):
> move cgroup_get() and cgroup_put() into cgroup.h so that
> they can be called from other places.
> 
> Signed-off-by: Aditya Kali <adityakali@google.com>

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>

> ---
>  include/linux/cgroup.h | 22 ++++++++++++++++++++++
>  kernel/cgroup.c        | 22 ----------------------
>  2 files changed, 22 insertions(+), 22 deletions(-)
> 
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 80ed6e0..4a0eb2d 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -521,6 +521,28 @@ static inline bool cgroup_on_dfl(const struct cgroup *cgrp)
>  	return cgrp->root == &cgrp_dfl_root;
>  }
>  
> +/* convenient tests for these bits */
> +static inline bool cgroup_is_dead(const struct cgroup *cgrp)
> +{
> +	return !(cgrp->self.flags & CSS_ONLINE);
> +}
> +
> +static inline void cgroup_get(struct cgroup *cgrp)
> +{
> +	WARN_ON_ONCE(cgroup_is_dead(cgrp));
> +	css_get(&cgrp->self);
> +}
> +
> +static inline bool cgroup_tryget(struct cgroup *cgrp)
> +{
> +	return css_tryget(&cgrp->self);
> +}
> +
> +static inline void cgroup_put(struct cgroup *cgrp)
> +{
> +	css_put(&cgrp->self);
> +}
> +
>  /* no synchronization, the result can only be used as a hint */
>  static inline bool cgroup_has_tasks(struct cgroup *cgrp)
>  {
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 56d507b..2b3e9f9 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -284,12 +284,6 @@ static struct cgroup_subsys_state *cgroup_e_css(struct cgroup *cgrp,
>  	return cgroup_css(cgrp, ss);
>  }
>  
> -/* convenient tests for these bits */
> -static inline bool cgroup_is_dead(const struct cgroup *cgrp)
> -{
> -	return !(cgrp->self.flags & CSS_ONLINE);
> -}
> -
>  struct cgroup_subsys_state *of_css(struct kernfs_open_file *of)
>  {
>  	struct cgroup *cgrp = of->kn->parent->priv;
> @@ -1002,22 +996,6 @@ static umode_t cgroup_file_mode(const struct cftype *cft)
>  	return mode;
>  }
>  
> -static void cgroup_get(struct cgroup *cgrp)
> -{
> -	WARN_ON_ONCE(cgroup_is_dead(cgrp));
> -	css_get(&cgrp->self);
> -}
> -
> -static bool cgroup_tryget(struct cgroup *cgrp)
> -{
> -	return css_tryget(&cgrp->self);
> -}
> -
> -static void cgroup_put(struct cgroup *cgrp)
> -{
> -	css_put(&cgrp->self);
> -}
> -
>  /**
>   * cgroup_refresh_child_subsys_mask - update child_subsys_mask
>   * @cgrp: the target cgroup
> -- 
> 2.1.0.rc2.206.gedb03e5
> 
> _______________________________________________
> Containers mailing list
> Containers@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv1 4/8] cgroup: export cgroup_get() and cgroup_put()
@ 2014-10-16 16:14         ` Serge E. Hallyn
  0 siblings, 0 replies; 384+ messages in thread
From: Serge E. Hallyn @ 2014-10-16 16:14 UTC (permalink / raw)
  To: Aditya Kali
  Cc: tj-DgEjT+Ai2ygdnm+yROfE0A, lizefan-hv44wF8Li93QT0dZR+AlfA,
	serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA, luto-kltTT9wpgjJwATOyAt5JVQ,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, mingo-H+wXaHxf7aLQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Quoting Aditya Kali (adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org):
> move cgroup_get() and cgroup_put() into cgroup.h so that
> they can be called from other places.
> 
> Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>

Acked-by: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>

> ---
>  include/linux/cgroup.h | 22 ++++++++++++++++++++++
>  kernel/cgroup.c        | 22 ----------------------
>  2 files changed, 22 insertions(+), 22 deletions(-)
> 
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 80ed6e0..4a0eb2d 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -521,6 +521,28 @@ static inline bool cgroup_on_dfl(const struct cgroup *cgrp)
>  	return cgrp->root == &cgrp_dfl_root;
>  }
>  
> +/* convenient tests for these bits */
> +static inline bool cgroup_is_dead(const struct cgroup *cgrp)
> +{
> +	return !(cgrp->self.flags & CSS_ONLINE);
> +}
> +
> +static inline void cgroup_get(struct cgroup *cgrp)
> +{
> +	WARN_ON_ONCE(cgroup_is_dead(cgrp));
> +	css_get(&cgrp->self);
> +}
> +
> +static inline bool cgroup_tryget(struct cgroup *cgrp)
> +{
> +	return css_tryget(&cgrp->self);
> +}
> +
> +static inline void cgroup_put(struct cgroup *cgrp)
> +{
> +	css_put(&cgrp->self);
> +}
> +
>  /* no synchronization, the result can only be used as a hint */
>  static inline bool cgroup_has_tasks(struct cgroup *cgrp)
>  {
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 56d507b..2b3e9f9 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -284,12 +284,6 @@ static struct cgroup_subsys_state *cgroup_e_css(struct cgroup *cgrp,
>  	return cgroup_css(cgrp, ss);
>  }
>  
> -/* convenient tests for these bits */
> -static inline bool cgroup_is_dead(const struct cgroup *cgrp)
> -{
> -	return !(cgrp->self.flags & CSS_ONLINE);
> -}
> -
>  struct cgroup_subsys_state *of_css(struct kernfs_open_file *of)
>  {
>  	struct cgroup *cgrp = of->kn->parent->priv;
> @@ -1002,22 +996,6 @@ static umode_t cgroup_file_mode(const struct cftype *cft)
>  	return mode;
>  }
>  
> -static void cgroup_get(struct cgroup *cgrp)
> -{
> -	WARN_ON_ONCE(cgroup_is_dead(cgrp));
> -	css_get(&cgrp->self);
> -}
> -
> -static bool cgroup_tryget(struct cgroup *cgrp)
> -{
> -	return css_tryget(&cgrp->self);
> -}
> -
> -static void cgroup_put(struct cgroup *cgrp)
> -{
> -	css_put(&cgrp->self);
> -}
> -
>  /**
>   * cgroup_refresh_child_subsys_mask - update child_subsys_mask
>   * @cgrp: the target cgroup
> -- 
> 2.1.0.rc2.206.gedb03e5

* Re: [PATCHv1 5/8] cgroup: introduce cgroup namespaces
       [not found]       ` <1413235430-22944-6-git-send-email-adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2014-10-16 16:37         ` Serge E. Hallyn
  0 siblings, 0 replies; 384+ messages in thread
From: Serge E. Hallyn @ 2014-10-16 16:37 UTC (permalink / raw)
  To: Aditya Kali
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, luto-kltTT9wpgjJwATOyAt5JVQ,
	tj-DgEjT+Ai2ygdnm+yROfE0A, cgroups-u79uwXL29TY76Z2rM5mHXA,
	mingo-H+wXaHxf7aLQT0dZR+AlfA

Quoting Aditya Kali (adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org):
> Introduce the ability to create new cgroup namespace. The newly created
> cgroup namespace remembers the 'struct cgroup *root_cgrp' at the point
> of creation of the cgroup namespace. The task that creates the new
> cgroup namespace and all its future children will now be restricted only
> to the cgroup hierarchy under this root_cgrp.
> The main purpose of cgroup namespace is to virtualize the contents
> of /proc/self/cgroup file. Processes inside a cgroup namespace
> are only able to see paths relative to their namespace root.
> This allows container-tools (like libcontainer, lxc, lmctfy, etc.)
> to create completely virtualized containers without leaking system
> level cgroup hierarchy to the task.
> This patch only implements the 'unshare' part of the cgroupns.
> 
> Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>

I'm not sure that the CONFIG_CGROUP_NS is worthwhile.  If you already
have cgroups in the kernel this won't add much in the way of memory
usage, right?  And I think the 'experimental' argument has long since
been squashed.  So I'd argue for simplifying this patch by removing
CONFIG_CGROUP_NS.

(more below)

> ---
>  fs/proc/namespaces.c             |   3 +
>  include/linux/cgroup.h           |  18 +++++-
>  include/linux/cgroup_namespace.h |  62 +++++++++++++++++++
>  include/linux/nsproxy.h          |   2 +
>  include/linux/proc_ns.h          |   4 ++
>  init/Kconfig                     |   9 +++
>  kernel/Makefile                  |   1 +
>  kernel/cgroup.c                  |  11 ++++
>  kernel/cgroup_namespace.c        | 128 +++++++++++++++++++++++++++++++++++++++
>  kernel/fork.c                    |   2 +-
>  kernel/nsproxy.c                 |  19 +++++-
>  11 files changed, 255 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
> index 8902609..e04ed4b 100644
> --- a/fs/proc/namespaces.c
> +++ b/fs/proc/namespaces.c
> @@ -32,6 +32,9 @@ static const struct proc_ns_operations *ns_entries[] = {
>  	&userns_operations,
>  #endif
>  	&mntns_operations,
> +#ifdef CONFIG_CGROUP_NS
> +	&cgroupns_operations,
> +#endif
>  };
>  
>  static const struct file_operations ns_file_operations = {
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 4a0eb2d..aa86495 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -22,6 +22,8 @@
>  #include <linux/seq_file.h>
>  #include <linux/kernfs.h>
>  #include <linux/wait.h>
> +#include <linux/nsproxy.h>
> +#include <linux/types.h>
>  
>  #ifdef CONFIG_CGROUPS
>  
> @@ -460,6 +462,13 @@ struct cftype {
>  #endif
>  };
>  
> +struct cgroup_namespace {
> +	atomic_t		count;
> +	unsigned int		proc_inum;
> +	struct user_namespace	*user_ns;
> +	struct cgroup		*root_cgrp;
> +};
> +
>  extern struct cgroup_root cgrp_dfl_root;
>  extern struct css_set init_css_set;
>  
> @@ -584,10 +593,17 @@ static inline int cgroup_name(struct cgroup *cgrp, char *buf, size_t buflen)
>  	return kernfs_name(cgrp->kn, buf, buflen);
>  }
>  
> +static inline char * __must_check cgroup_path_ns(struct cgroup_namespace *ns,
> +						 struct cgroup *cgrp, char *buf,
> +						 size_t buflen)
> +{
> +	return kernfs_path_from_node(ns->root_cgrp->kn, cgrp->kn, buf, buflen);
> +}
> +
>  static inline char * __must_check cgroup_path(struct cgroup *cgrp, char *buf,
>  					      size_t buflen)
>  {
> -	return kernfs_path(cgrp->kn, buf, buflen);
> +	return cgroup_path_ns(current->nsproxy->cgroup_ns, cgrp, buf, buflen);
>  }
>  
>  static inline void pr_cont_cgroup_name(struct cgroup *cgrp)
> diff --git a/include/linux/cgroup_namespace.h b/include/linux/cgroup_namespace.h
> new file mode 100644
> index 0000000..9f637fe
> --- /dev/null
> +++ b/include/linux/cgroup_namespace.h
> @@ -0,0 +1,62 @@
> +#ifndef _LINUX_CGROUP_NAMESPACE_H
> +#define _LINUX_CGROUP_NAMESPACE_H
> +
> +#include <linux/nsproxy.h>
> +#include <linux/cgroup.h>
> +#include <linux/types.h>
> +#include <linux/user_namespace.h>
> +
> +extern struct cgroup_namespace init_cgroup_ns;
> +
> +static inline struct cgroup *task_cgroupns_root(struct task_struct *tsk)
> +{
> +	return tsk->nsproxy->cgroup_ns->root_cgrp;

Per the rules in nsproxy.h, you should be taking the task_lock here.

(If you are making assumptions about tsk then you need to state them
here - I only looked closely enough to see that you pass in 'leader')
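
To illustrate the locking concern, here is a rough userspace model in which a
pthread mutex plays the role of task_lock().  Everything in it (struct
task_model, struct nsproxy_model, task_model_ns_root()) is hypothetical and
only sketches the shape of the access; it is not the kernel code.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures. */
struct cgroup_model {
	int id;
};

struct nsproxy_model {
	struct cgroup_model *root_cgrp;
};

struct task_model {
	pthread_mutex_t lock;		/* plays the role of task_lock() */
	struct nsproxy_model *nsproxy;	/* NULL once the task has exited */
};

/*
 * Read another task's namespace root while holding its lock, and
 * tolerate a task whose nsproxy has already been torn down.
 * Note: the returned pointer is not pinned after the unlock; real
 * code would take a reference before dropping the lock.
 */
struct cgroup_model *task_model_ns_root(struct task_model *tsk)
{
	struct cgroup_model *root = NULL;

	assert(tsk != NULL);
	pthread_mutex_lock(&tsk->lock);
	if (tsk->nsproxy)
		root = tsk->nsproxy->root_cgrp;
	pthread_mutex_unlock(&tsk->lock);
	return root;
}
```

If the helper is only ever called on 'current', the lock is unnecessary, but
then that assumption should be documented at the definition.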

> +}
> +
> +#ifdef CONFIG_CGROUP_NS
> +
> +extern void free_cgroup_ns(struct cgroup_namespace *ns);
> +
> +static inline struct cgroup_namespace *get_cgroup_ns(
> +		struct cgroup_namespace *ns)
> +{
> +	if (ns)
> +		atomic_inc(&ns->count);
> +	return ns;
> +}
> +
> +static inline void put_cgroup_ns(struct cgroup_namespace *ns)
> +{
> +	if (ns && atomic_dec_and_test(&ns->count))
> +		free_cgroup_ns(ns);
> +}
> +
> +extern struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
> +					       struct user_namespace *user_ns,
> +					       struct cgroup_namespace *old_ns);
> +
> +#else  /* CONFIG_CGROUP_NS */
> +
> +static inline struct cgroup_namespace *get_cgroup_ns(
> +		struct cgroup_namespace *ns)
> +{
> +	return &init_cgroup_ns;
> +}
> +
> +static inline void put_cgroup_ns(struct cgroup_namespace *ns)
> +{
> +}
> +
> +static inline struct cgroup_namespace *copy_cgroup_ns(
> +		unsigned long flags,
> +		struct user_namespace *user_ns,
> +		struct cgroup_namespace *old_ns) {
> +	if (flags & CLONE_NEWCGROUP)
> +		return ERR_PTR(-EINVAL);
> +
> +	return old_ns;
> +}
> +
> +#endif  /* CONFIG_CGROUP_NS */
> +
> +#endif  /* _LINUX_CGROUP_NAMESPACE_H */
> diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
> index 35fa08f..ac0d65b 100644
> --- a/include/linux/nsproxy.h
> +++ b/include/linux/nsproxy.h
> @@ -8,6 +8,7 @@ struct mnt_namespace;
>  struct uts_namespace;
>  struct ipc_namespace;
>  struct pid_namespace;
> +struct cgroup_namespace;
>  struct fs_struct;
>  
>  /*
> @@ -33,6 +34,7 @@ struct nsproxy {
>  	struct mnt_namespace *mnt_ns;
>  	struct pid_namespace *pid_ns_for_children;
>  	struct net 	     *net_ns;
> +	struct cgroup_namespace *cgroup_ns;
>  };
>  extern struct nsproxy init_nsproxy;
>  
> diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
> index 34a1e10..e56dd73 100644
> --- a/include/linux/proc_ns.h
> +++ b/include/linux/proc_ns.h
> @@ -6,6 +6,8 @@
>  
>  struct pid_namespace;
>  struct nsproxy;
> +struct task_struct;
> +struct inode;
>  
>  struct proc_ns_operations {
>  	const char *name;
> @@ -27,6 +29,7 @@ extern const struct proc_ns_operations ipcns_operations;
>  extern const struct proc_ns_operations pidns_operations;
>  extern const struct proc_ns_operations userns_operations;
>  extern const struct proc_ns_operations mntns_operations;
> +extern const struct proc_ns_operations cgroupns_operations;
>  
>  /*
>   * We always define these enumerators
> @@ -37,6 +40,7 @@ enum {
>  	PROC_UTS_INIT_INO	= 0xEFFFFFFEU,
>  	PROC_USER_INIT_INO	= 0xEFFFFFFDU,
>  	PROC_PID_INIT_INO	= 0xEFFFFFFCU,
> +	PROC_CGROUP_INIT_INO	= 0xEFFFFFFBU,
>  };
>  
>  #ifdef CONFIG_PROC_FS
> diff --git a/init/Kconfig b/init/Kconfig
> index e84c642..c3be001 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1144,6 +1144,15 @@ config DEBUG_BLK_CGROUP
>  	Enable some debugging help. Currently it exports additional stat
>  	files in a cgroup which can be useful for debugging.
>  
> +config CGROUP_NS
> +	bool "CGroup Namespaces"
> +	default n
> +	help
> +	  This options enables CGroup Namespaces which can be used to isolate
> +	  cgroup paths. This feature is only useful when unified cgroup
> +	  hierarchy is in use (i.e. cgroups are mounted with sane_behavior
> +	  option).
> +
>  endif # CGROUPS
>  
>  config CHECKPOINT_RESTORE
> diff --git a/kernel/Makefile b/kernel/Makefile
> index dc5c775..75334f8 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -51,6 +51,7 @@ obj-$(CONFIG_KEXEC) += kexec.o
>  obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
>  obj-$(CONFIG_COMPAT) += compat.o
>  obj-$(CONFIG_CGROUPS) += cgroup.o
> +obj-$(CONFIG_CGROUP_NS) += cgroup_namespace.o
>  obj-$(CONFIG_CGROUP_FREEZER) += cgroup_freezer.o
>  obj-$(CONFIG_CPUSETS) += cpuset.o
>  obj-$(CONFIG_UTS_NS) += utsname.o
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 2b3e9f9..f8099b4 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -57,6 +57,8 @@
>  #include <linux/vmalloc.h> /* TODO: replace with more sophisticated array */
>  #include <linux/kthread.h>
>  #include <linux/delay.h>
> +#include <linux/proc_ns.h>
> +#include <linux/cgroup_namespace.h>
>  
>  #include <linux/atomic.h>
>  
> @@ -195,6 +197,15 @@ static void kill_css(struct cgroup_subsys_state *css);
>  static int cgroup_addrm_files(struct cgroup *cgrp, struct cftype cfts[],
>  			      bool is_add);
>  
> +struct cgroup_namespace init_cgroup_ns = {
> +	.count = {
> +		.counter = 1,
> +	},
> +	.proc_inum = PROC_CGROUP_INIT_INO,
> +	.user_ns = &init_user_ns,

This might mean that you should bump the init_user_ns refcount.

> +	.root_cgrp = &cgrp_dfl_root.cgrp,
> +};
> +
>  /* IDR wrappers which synchronize using cgroup_idr_lock */
>  static int cgroup_idr_alloc(struct idr *idr, void *ptr, int start, int end,
>  			    gfp_t gfp_mask)
> diff --git a/kernel/cgroup_namespace.c b/kernel/cgroup_namespace.c
> new file mode 100644
> index 0000000..c16604f
> --- /dev/null
> +++ b/kernel/cgroup_namespace.c
> @@ -0,0 +1,128 @@
> +
> +#include <linux/cgroup.h>
> +#include <linux/cgroup_namespace.h>
> +#include <linux/sched.h>
> +#include <linux/slab.h>
> +#include <linux/nsproxy.h>
> +#include <linux/proc_ns.h>
> +
> +static struct cgroup_namespace *alloc_cgroup_ns(void)
> +{
> +	struct cgroup_namespace *new_ns;
> +
> +	new_ns = kmalloc(sizeof(struct cgroup_namespace), GFP_KERNEL);
> +	if (new_ns)
> +		atomic_set(&new_ns->count, 1);
> +	return new_ns;
> +}
> +
> +void free_cgroup_ns(struct cgroup_namespace *ns)
> +{
> +	cgroup_put(ns->root_cgrp);
> +	put_user_ns(ns->user_ns);

This is a problem on the error path in copy_cgroup_ns.  The
alloc_cgroup_ns() doesn't initialize these values, so if
you should fail in proc_alloc_inum() you'll show up here
with random values in ns->*.
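
One way to make the error path safe is to zero the structure at allocation
time and have the release function tolerate NULL members.  The userspace
sketch below models that idea with calloc(); in the kernel the analogous
allocator would be kzalloc().  The ns_model/ref names and the fail_inum
parameter are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a refcounted object (user_ns, cgroup). */
struct ref {
	int count;
};

struct ns_model {
	int count;
	unsigned int inum;
	struct ref *user_ns;
	struct ref *root_cgrp;
};

static void ref_put(struct ref *r)
{
	if (r)
		r->count--;
}

/* Zeroed allocation: every pointer member starts out NULL. */
struct ns_model *ns_model_alloc(void)
{
	struct ns_model *ns = calloc(1, sizeof(*ns));

	if (ns)
		ns->count = 1;
	return ns;
}

/* Safe on a partially constructed object because members may be NULL. */
void ns_model_free(struct ns_model *ns)
{
	if (!ns)
		return;
	ref_put(ns->root_cgrp);
	ref_put(ns->user_ns);
	free(ns);
}

/* fail_inum simulates the inode-number allocation failing. */
struct ns_model *ns_model_create(struct ref *user_ns, struct ref *cgrp,
				 int fail_inum)
{
	struct ns_model *ns = ns_model_alloc();

	if (!ns)
		return NULL;
	if (fail_inum) {
		ns_model_free(ns);	/* members are still NULL here */
		return NULL;
	}
	ns->inum = 1;
	user_ns->count++;
	cgrp->count++;
	ns->user_ns = user_ns;
	ns->root_cgrp = cgrp;
	return ns;
}
```

With this shape a single release function can serve both the normal teardown
and the failure case without touching uninitialized members.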

> +	proc_free_inum(ns->proc_inum);
> +}
> +EXPORT_SYMBOL(free_cgroup_ns);
> +
> +struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
> +					struct user_namespace *user_ns,
> +					struct cgroup_namespace *old_ns)
> +{
> +	struct cgroup_namespace *new_ns = NULL;
> +	struct cgroup *cgrp = NULL;
> +	int err;
> +
> +	BUG_ON(!old_ns);
> +
> +	if (!(flags & CLONE_NEWCGROUP))
> +		return get_cgroup_ns(old_ns);
> +
> +	/* Allow only sysadmin to create cgroup namespace. */
> +	err = -EPERM;
> +	if (!ns_capable(user_ns, CAP_SYS_ADMIN))
> +		goto err_out;
> +
> +	/* Prevent cgroup changes for this task. */
> +	threadgroup_lock(current);
> +
> +	cgrp = get_task_cgroup(current);
> +
> +	/* Creating new CGROUPNS is supported only when unified hierarchy is in
> +	 * use. */

Oh, drat.  Well, I'll take it, but under protest  :)

> +	err = -EINVAL;
> +	if (!cgroup_on_dfl(cgrp))
> +		goto err_out_unlock;
> +
> +	err = -ENOMEM;
> +	new_ns = alloc_cgroup_ns();
> +	if (!new_ns)
> +		goto err_out_unlock;
> +
> +	err = proc_alloc_inum(&new_ns->proc_inum);
> +	if (err)
> +		goto err_out_unlock;
> +
> +	new_ns->user_ns = get_user_ns(user_ns);
> +	new_ns->root_cgrp = cgrp;
> +
> +	threadgroup_unlock(current);
> +
> +	return new_ns;
> +
> +err_out_unlock:
> +	threadgroup_unlock(current);
> +err_out:
> +	if (cgrp)
> +		cgroup_put(cgrp);
> +	kfree(new_ns);
> +	return ERR_PTR(err);
> +}
> +
> +static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
> +{
> +	pr_info("setns not supported for cgroup namespace");
> +	return -EINVAL;
> +}
> +
> +static void *cgroupns_get(struct task_struct *task)
> +{
> +	struct cgroup_namespace *ns = NULL;
> +	struct nsproxy *nsproxy;
> +
> +	rcu_read_lock();
> +	nsproxy = task->nsproxy;
> +	if (nsproxy) {
> +		ns = nsproxy->cgroup_ns;
> +		get_cgroup_ns(ns);
> +	}
> +	rcu_read_unlock();
> +
> +	return ns;
> +}
> +
> +static void cgroupns_put(void *ns)
> +{
> +	put_cgroup_ns(ns);
> +}
> +
> +static unsigned int cgroupns_inum(void *ns)
> +{
> +	struct cgroup_namespace *cgroup_ns = ns;
> +
> +	return cgroup_ns->proc_inum;
> +}
> +
> +const struct proc_ns_operations cgroupns_operations = {
> +	.name		= "cgroup",
> +	.type		= CLONE_NEWCGROUP,
> +	.get		= cgroupns_get,
> +	.put		= cgroupns_put,
> +	.install	= cgroupns_install,
> +	.inum		= cgroupns_inum,
> +};
> +
> +static __init int cgroup_namespaces_init(void)
> +{
> +	return 0;
> +}
> +subsys_initcall(cgroup_namespaces_init);
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 0cf9cdb..cc06851 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1790,7 +1790,7 @@ static int check_unshare_flags(unsigned long unshare_flags)
>  	if (unshare_flags & ~(CLONE_THREAD|CLONE_FS|CLONE_NEWNS|CLONE_SIGHAND|
>  				CLONE_VM|CLONE_FILES|CLONE_SYSVSEM|
>  				CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWNET|
> -				CLONE_NEWUSER|CLONE_NEWPID))
> +				CLONE_NEWUSER|CLONE_NEWPID|CLONE_NEWCGROUP))
>  		return -EINVAL;
>  	/*
>  	 * Not implemented, but pretend it works if there is nothing to
> diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
> index ef42d0a..a8b1970 100644
> --- a/kernel/nsproxy.c
> +++ b/kernel/nsproxy.c
> @@ -25,6 +25,7 @@
>  #include <linux/proc_ns.h>
>  #include <linux/file.h>
>  #include <linux/syscalls.h>
> +#include <linux/cgroup_namespace.h>
>  
>  static struct kmem_cache *nsproxy_cachep;
>  
> @@ -39,6 +40,7 @@ struct nsproxy init_nsproxy = {
>  #ifdef CONFIG_NET
>  	.net_ns			= &init_net,
>  #endif
> +	.cgroup_ns		= &init_cgroup_ns,
>  };
>  
>  static inline struct nsproxy *create_nsproxy(void)
> @@ -92,6 +94,13 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
>  		goto out_pid;
>  	}
>  
> +	new_nsp->cgroup_ns = copy_cgroup_ns(flags, user_ns,
> +					    tsk->nsproxy->cgroup_ns);
> +	if (IS_ERR(new_nsp->cgroup_ns)) {
> +		err = PTR_ERR(new_nsp->cgroup_ns);
> +		goto out_cgroup;
> +	}
> +
>  	new_nsp->net_ns = copy_net_ns(flags, user_ns, tsk->nsproxy->net_ns);
>  	if (IS_ERR(new_nsp->net_ns)) {
>  		err = PTR_ERR(new_nsp->net_ns);
> @@ -101,6 +110,9 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
>  	return new_nsp;
>  
>  out_net:
> +	if (new_nsp->cgroup_ns)
> +		put_cgroup_ns(new_nsp->cgroup_ns);
> +out_cgroup:
>  	if (new_nsp->pid_ns_for_children)
>  		put_pid_ns(new_nsp->pid_ns_for_children);
>  out_pid:
> @@ -128,7 +140,8 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
>  	struct nsproxy *new_ns;
>  
>  	if (likely(!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
> -			      CLONE_NEWPID | CLONE_NEWNET)))) {
> +			      CLONE_NEWPID | CLONE_NEWNET |
> +			      CLONE_NEWCGROUP)))) {
>  		get_nsproxy(old_ns);
>  		return 0;
>  	}
> @@ -165,6 +178,8 @@ void free_nsproxy(struct nsproxy *ns)
>  		put_ipc_ns(ns->ipc_ns);
>  	if (ns->pid_ns_for_children)
>  		put_pid_ns(ns->pid_ns_for_children);
> +	if (ns->cgroup_ns)
> +		put_cgroup_ns(ns->cgroup_ns);
>  	put_net(ns->net_ns);
>  	kmem_cache_free(nsproxy_cachep, ns);
>  }
> @@ -180,7 +195,7 @@ int unshare_nsproxy_namespaces(unsigned long unshare_flags,
>  	int err = 0;
>  
>  	if (!(unshare_flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
> -			       CLONE_NEWNET | CLONE_NEWPID)))
> +			       CLONE_NEWNET | CLONE_NEWPID | CLONE_NEWCGROUP)))
>  		return 0;
>  
>  	user_ns = new_cred ? new_cred->user_ns : current_user_ns();
> -- 
> 2.1.0.rc2.206.gedb03e5

* Re: [PATCHv1 5/8] cgroup: introduce cgroup namespaces
       [not found]       ` <1413235430-22944-6-git-send-email-adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2014-10-16 16:37         ` Serge E. Hallyn
  0 siblings, 0 replies; 384+ messages in thread
From: Serge E. Hallyn @ 2014-10-16 16:37 UTC (permalink / raw)
  To: Aditya Kali
  Cc: tj, lizefan, serge.hallyn, luto, cgroups, linux-kernel,
	linux-api, mingo, containers

Quoting Aditya Kali (adityakali@google.com):
> Introduce the ability to create new cgroup namespace. The newly created
> cgroup namespace remembers the 'struct cgroup *root_cgrp' at the point
> of creation of the cgroup namespace. The task that creates the new
> cgroup namespace and all its future children will now be restricted only
> to the cgroup hierarchy under this root_cgrp.
> The main purpose of cgroup namespace is to virtualize the contents
> of /proc/self/cgroup file. Processes inside a cgroup namespace
> are only able to see paths relative to their namespace root.
> This allows container-tools (like libcontainer, lxc, lmctfy, etc.)
> to create completely virtualized containers without leaking system
> level cgroup hierarchy to the task.
> This patch only implements the 'unshare' part of the cgroupns.
> 
> Signed-off-by: Aditya Kali <adityakali@google.com>
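
The "paths relative to their namespace root" behaviour described above can be
sketched in userspace as plain string handling.  This is only an illustration
of the intended result; path_relative_to_ns_root() is a hypothetical helper,
not the kernfs function used by the patch, and it assumes absolute paths
without a trailing slash.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Write the path of 'cgrp' as seen from 'ns_root' into buf.
 * Returns buf on success, or NULL if cgrp lies outside ns_root or
 * buf is too small.  Both inputs are absolute paths such as
 * "/batchjobs/c_job_id1".
 */
const char *path_relative_to_ns_root(const char *ns_root, const char *cgrp,
				     char *buf, size_t buflen)
{
	size_t root_len;
	const char *rest;

	assert(ns_root && cgrp && buf);

	root_len = strlen(ns_root);
	/* Treat "/" as an empty prefix so every absolute path is below it. */
	if (root_len == 1 && ns_root[0] == '/')
		root_len = 0;

	if (strncmp(cgrp, ns_root, root_len) != 0)
		return NULL;

	rest = cgrp + root_len;
	/* The prefix must end on a path-component boundary. */
	if (*rest != '\0' && *rest != '/')
		return NULL;
	/* The namespace root itself is shown as "/". */
	if (*rest == '\0')
		rest = "/";

	if (strlen(rest) + 1 > buflen)
		return NULL;
	strcpy(buf, rest);
	return buf;
}
```

So a task whose namespace root is /batchjobs/c_job_id1 and whose cgroup is
/batchjobs/c_job_id1/sub would see just /sub.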

I'm not sure that the CONFIG_CGROUP_NS is worthwhile.  If you already
have cgroups in the kernel this won't add much in the way of memory
usage, right?  And I think the 'experimental' argument has long since
been squashed.  So I'd argue for simplifying this patch by removing
CONFIG_CGROUP_NS.

(more below)

> ---
>  fs/proc/namespaces.c             |   3 +
>  include/linux/cgroup.h           |  18 +++++-
>  include/linux/cgroup_namespace.h |  62 +++++++++++++++++++
>  include/linux/nsproxy.h          |   2 +
>  include/linux/proc_ns.h          |   4 ++
>  init/Kconfig                     |   9 +++
>  kernel/Makefile                  |   1 +
>  kernel/cgroup.c                  |  11 ++++
>  kernel/cgroup_namespace.c        | 128 +++++++++++++++++++++++++++++++++++++++
>  kernel/fork.c                    |   2 +-
>  kernel/nsproxy.c                 |  19 +++++-
>  11 files changed, 255 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
> index 8902609..e04ed4b 100644
> --- a/fs/proc/namespaces.c
> +++ b/fs/proc/namespaces.c
> @@ -32,6 +32,9 @@ static const struct proc_ns_operations *ns_entries[] = {
>  	&userns_operations,
>  #endif
>  	&mntns_operations,
> +#ifdef CONFIG_CGROUP_NS
> +	&cgroupns_operations,
> +#endif
>  };
>  
>  static const struct file_operations ns_file_operations = {
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 4a0eb2d..aa86495 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -22,6 +22,8 @@
>  #include <linux/seq_file.h>
>  #include <linux/kernfs.h>
>  #include <linux/wait.h>
> +#include <linux/nsproxy.h>
> +#include <linux/types.h>
>  
>  #ifdef CONFIG_CGROUPS
>  
> @@ -460,6 +462,13 @@ struct cftype {
>  #endif
>  };
>  
> +struct cgroup_namespace {
> +	atomic_t		count;
> +	unsigned int		proc_inum;
> +	struct user_namespace	*user_ns;
> +	struct cgroup		*root_cgrp;
> +};
> +
>  extern struct cgroup_root cgrp_dfl_root;
>  extern struct css_set init_css_set;
>  
> @@ -584,10 +593,17 @@ static inline int cgroup_name(struct cgroup *cgrp, char *buf, size_t buflen)
>  	return kernfs_name(cgrp->kn, buf, buflen);
>  }
>  
> +static inline char * __must_check cgroup_path_ns(struct cgroup_namespace *ns,
> +						 struct cgroup *cgrp, char *buf,
> +						 size_t buflen)
> +{
> +	return kernfs_path_from_node(ns->root_cgrp->kn, cgrp->kn, buf, buflen);
> +}
> +
>  static inline char * __must_check cgroup_path(struct cgroup *cgrp, char *buf,
>  					      size_t buflen)
>  {
> -	return kernfs_path(cgrp->kn, buf, buflen);
> +	return cgroup_path_ns(current->nsproxy->cgroup_ns, cgrp, buf, buflen);
>  }
>  
>  static inline void pr_cont_cgroup_name(struct cgroup *cgrp)
> diff --git a/include/linux/cgroup_namespace.h b/include/linux/cgroup_namespace.h
> new file mode 100644
> index 0000000..9f637fe
> --- /dev/null
> +++ b/include/linux/cgroup_namespace.h
> @@ -0,0 +1,62 @@
> +#ifndef _LINUX_CGROUP_NAMESPACE_H
> +#define _LINUX_CGROUP_NAMESPACE_H
> +
> +#include <linux/nsproxy.h>
> +#include <linux/cgroup.h>
> +#include <linux/types.h>
> +#include <linux/user_namespace.h>
> +
> +extern struct cgroup_namespace init_cgroup_ns;
> +
> +static inline struct cgroup *task_cgroupns_root(struct task_struct *tsk)
> +{
> +	return tsk->nsproxy->cgroup_ns->root_cgrp;

Per the rules in nsproxy.h, you should be taking the task_lock here.

(If you are making assumptions about tsk then you need to state them
here - I only looked closely enough to see that you pass in 'leader')

> +}
> +
> +#ifdef CONFIG_CGROUP_NS
> +
> +extern void free_cgroup_ns(struct cgroup_namespace *ns);
> +
> +static inline struct cgroup_namespace *get_cgroup_ns(
> +		struct cgroup_namespace *ns)
> +{
> +	if (ns)
> +		atomic_inc(&ns->count);
> +	return ns;
> +}
> +
> +static inline void put_cgroup_ns(struct cgroup_namespace *ns)
> +{
> +	if (ns && atomic_dec_and_test(&ns->count))
> +		free_cgroup_ns(ns);
> +}
> +
> +extern struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
> +					       struct user_namespace *user_ns,
> +					       struct cgroup_namespace *old_ns);
> +
> +#else  /* CONFIG_CGROUP_NS */
> +
> +static inline struct cgroup_namespace *get_cgroup_ns(
> +		struct cgroup_namespace *ns)
> +{
> +	return &init_cgroup_ns;
> +}
> +
> +static inline void put_cgroup_ns(struct cgroup_namespace *ns)
> +{
> +}
> +
> +static inline struct cgroup_namespace *copy_cgroup_ns(
> +		unsigned long flags,
> +		struct user_namespace *user_ns,
> +		struct cgroup_namespace *old_ns) {
> +	if (flags & CLONE_NEWCGROUP)
> +		return ERR_PTR(-EINVAL);
> +
> +	return old_ns;
> +}
> +
> +#endif  /* CONFIG_CGROUP_NS */
> +
> +#endif  /* _LINUX_CGROUP_NAMESPACE_H */
> diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
> index 35fa08f..ac0d65b 100644
> --- a/include/linux/nsproxy.h
> +++ b/include/linux/nsproxy.h
> @@ -8,6 +8,7 @@ struct mnt_namespace;
>  struct uts_namespace;
>  struct ipc_namespace;
>  struct pid_namespace;
> +struct cgroup_namespace;
>  struct fs_struct;
>  
>  /*
> @@ -33,6 +34,7 @@ struct nsproxy {
>  	struct mnt_namespace *mnt_ns;
>  	struct pid_namespace *pid_ns_for_children;
>  	struct net 	     *net_ns;
> +	struct cgroup_namespace *cgroup_ns;
>  };
>  extern struct nsproxy init_nsproxy;
>  
> diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
> index 34a1e10..e56dd73 100644
> --- a/include/linux/proc_ns.h
> +++ b/include/linux/proc_ns.h
> @@ -6,6 +6,8 @@
>  
>  struct pid_namespace;
>  struct nsproxy;
> +struct task_struct;
> +struct inode;
>  
>  struct proc_ns_operations {
>  	const char *name;
> @@ -27,6 +29,7 @@ extern const struct proc_ns_operations ipcns_operations;
>  extern const struct proc_ns_operations pidns_operations;
>  extern const struct proc_ns_operations userns_operations;
>  extern const struct proc_ns_operations mntns_operations;
> +extern const struct proc_ns_operations cgroupns_operations;
>  
>  /*
>   * We always define these enumerators
> @@ -37,6 +40,7 @@ enum {
>  	PROC_UTS_INIT_INO	= 0xEFFFFFFEU,
>  	PROC_USER_INIT_INO	= 0xEFFFFFFDU,
>  	PROC_PID_INIT_INO	= 0xEFFFFFFCU,
> +	PROC_CGROUP_INIT_INO	= 0xEFFFFFFBU,
>  };
>  
>  #ifdef CONFIG_PROC_FS
> diff --git a/init/Kconfig b/init/Kconfig
> index e84c642..c3be001 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1144,6 +1144,15 @@ config DEBUG_BLK_CGROUP
>  	Enable some debugging help. Currently it exports additional stat
>  	files in a cgroup which can be useful for debugging.
>  
> +config CGROUP_NS
> +	bool "CGroup Namespaces"
> +	default n
> +	help
> +	  This options enables CGroup Namespaces which can be used to isolate
> +	  cgroup paths. This feature is only useful when unified cgroup
> +	  hierarchy is in use (i.e. cgroups are mounted with sane_behavior
> +	  option).
> +
>  endif # CGROUPS
>  
>  config CHECKPOINT_RESTORE
> diff --git a/kernel/Makefile b/kernel/Makefile
> index dc5c775..75334f8 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -51,6 +51,7 @@ obj-$(CONFIG_KEXEC) += kexec.o
>  obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
>  obj-$(CONFIG_COMPAT) += compat.o
>  obj-$(CONFIG_CGROUPS) += cgroup.o
> +obj-$(CONFIG_CGROUP_NS) += cgroup_namespace.o
>  obj-$(CONFIG_CGROUP_FREEZER) += cgroup_freezer.o
>  obj-$(CONFIG_CPUSETS) += cpuset.o
>  obj-$(CONFIG_UTS_NS) += utsname.o
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 2b3e9f9..f8099b4 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -57,6 +57,8 @@
>  #include <linux/vmalloc.h> /* TODO: replace with more sophisticated array */
>  #include <linux/kthread.h>
>  #include <linux/delay.h>
> +#include <linux/proc_ns.h>
> +#include <linux/cgroup_namespace.h>
>  
>  #include <linux/atomic.h>
>  
> @@ -195,6 +197,15 @@ static void kill_css(struct cgroup_subsys_state *css);
>  static int cgroup_addrm_files(struct cgroup *cgrp, struct cftype cfts[],
>  			      bool is_add);
>  
> +struct cgroup_namespace init_cgroup_ns = {
> +	.count = {
> +		.counter = 1,
> +	},
> +	.proc_inum = PROC_CGROUP_INIT_INO,
> +	.user_ns = &init_user_ns,

This might mean that you should bump the init_user_ns refcount.

> +	.root_cgrp = &cgrp_dfl_root.cgrp,
> +};
> +
>  /* IDR wrappers which synchronize using cgroup_idr_lock */
>  static int cgroup_idr_alloc(struct idr *idr, void *ptr, int start, int end,
>  			    gfp_t gfp_mask)
> diff --git a/kernel/cgroup_namespace.c b/kernel/cgroup_namespace.c
> new file mode 100644
> index 0000000..c16604f
> --- /dev/null
> +++ b/kernel/cgroup_namespace.c
> @@ -0,0 +1,128 @@
> +
> +#include <linux/cgroup.h>
> +#include <linux/cgroup_namespace.h>
> +#include <linux/sched.h>
> +#include <linux/slab.h>
> +#include <linux/nsproxy.h>
> +#include <linux/proc_ns.h>
> +
> +static struct cgroup_namespace *alloc_cgroup_ns(void)
> +{
> +	struct cgroup_namespace *new_ns;
> +
> +	new_ns = kmalloc(sizeof(struct cgroup_namespace), GFP_KERNEL);
> +	if (new_ns)
> +		atomic_set(&new_ns->count, 1);
> +	return new_ns;
> +}
> +
> +void free_cgroup_ns(struct cgroup_namespace *ns)
> +{
> +	cgroup_put(ns->root_cgrp);
> +	put_user_ns(ns->user_ns);

This is a problem on the error path in copy_cgroup_ns.  The
alloc_cgroup_ns() doesn't initialize these values, so if
you should fail in proc_alloc_inum() you'll show up here
with random values in ns->*.

> +	proc_free_inum(ns->proc_inum);
> +}
> +EXPORT_SYMBOL(free_cgroup_ns);
> +
> +struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
> +					struct user_namespace *user_ns,
> +					struct cgroup_namespace *old_ns)
> +{
> +	struct cgroup_namespace *new_ns = NULL;
> +	struct cgroup *cgrp = NULL;
> +	int err;
> +
> +	BUG_ON(!old_ns);
> +
> +	if (!(flags & CLONE_NEWCGROUP))
> +		return get_cgroup_ns(old_ns);
> +
> +	/* Allow only sysadmin to create cgroup namespace. */
> +	err = -EPERM;
> +	if (!ns_capable(user_ns, CAP_SYS_ADMIN))
> +		goto err_out;
> +
> +	/* Prevent cgroup changes for this task. */
> +	threadgroup_lock(current);
> +
> +	cgrp = get_task_cgroup(current);
> +
> +	/* Creating new CGROUPNS is supported only when unified hierarchy is in
> +	 * use. */

Oh, drat.  Well, I'll take it, but under protest  :)

> +	err = -EINVAL;
> +	if (!cgroup_on_dfl(cgrp))
> +		goto err_out_unlock;
> +
> +	err = -ENOMEM;
> +	new_ns = alloc_cgroup_ns();
> +	if (!new_ns)
> +		goto err_out_unlock;
> +
> +	err = proc_alloc_inum(&new_ns->proc_inum);
> +	if (err)
> +		goto err_out_unlock;
> +
> +	new_ns->user_ns = get_user_ns(user_ns);
> +	new_ns->root_cgrp = cgrp;
> +
> +	threadgroup_unlock(current);
> +
> +	return new_ns;
> +
> +err_out_unlock:
> +	threadgroup_unlock(current);
> +err_out:
> +	if (cgrp)
> +		cgroup_put(cgrp);
> +	kfree(new_ns);
> +	return ERR_PTR(err);
> +}
> +
> +static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
> +{
> +	pr_info("setns not supported for cgroup namespace");
> +	return -EINVAL;
> +}
> +
> +static void *cgroupns_get(struct task_struct *task)
> +{
> +	struct cgroup_namespace *ns = NULL;
> +	struct nsproxy *nsproxy;
> +
> +	rcu_read_lock();
> +	nsproxy = task->nsproxy;
> +	if (nsproxy) {
> +		ns = nsproxy->cgroup_ns;
> +		get_cgroup_ns(ns);
> +	}
> +	rcu_read_unlock();
> +
> +	return ns;
> +}
> +
> +static void cgroupns_put(void *ns)
> +{
> +	put_cgroup_ns(ns);
> +}
> +
> +static unsigned int cgroupns_inum(void *ns)
> +{
> +	struct cgroup_namespace *cgroup_ns = ns;
> +
> +	return cgroup_ns->proc_inum;
> +}
> +
> +const struct proc_ns_operations cgroupns_operations = {
> +	.name		= "cgroup",
> +	.type		= CLONE_NEWCGROUP,
> +	.get		= cgroupns_get,
> +	.put		= cgroupns_put,
> +	.install	= cgroupns_install,
> +	.inum		= cgroupns_inum,
> +};
> +
> +static __init int cgroup_namespaces_init(void)
> +{
> +	return 0;
> +}
> +subsys_initcall(cgroup_namespaces_init);
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 0cf9cdb..cc06851 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1790,7 +1790,7 @@ static int check_unshare_flags(unsigned long unshare_flags)
>  	if (unshare_flags & ~(CLONE_THREAD|CLONE_FS|CLONE_NEWNS|CLONE_SIGHAND|
>  				CLONE_VM|CLONE_FILES|CLONE_SYSVSEM|
>  				CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWNET|
> -				CLONE_NEWUSER|CLONE_NEWPID))
> +				CLONE_NEWUSER|CLONE_NEWPID|CLONE_NEWCGROUP))
>  		return -EINVAL;
>  	/*
>  	 * Not implemented, but pretend it works if there is nothing to
> diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
> index ef42d0a..a8b1970 100644
> --- a/kernel/nsproxy.c
> +++ b/kernel/nsproxy.c
> @@ -25,6 +25,7 @@
>  #include <linux/proc_ns.h>
>  #include <linux/file.h>
>  #include <linux/syscalls.h>
> +#include <linux/cgroup_namespace.h>
>  
>  static struct kmem_cache *nsproxy_cachep;
>  
> @@ -39,6 +40,7 @@ struct nsproxy init_nsproxy = {
>  #ifdef CONFIG_NET
>  	.net_ns			= &init_net,
>  #endif
> +	.cgroup_ns		= &init_cgroup_ns,
>  };
>  
>  static inline struct nsproxy *create_nsproxy(void)
> @@ -92,6 +94,13 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
>  		goto out_pid;
>  	}
>  
> +	new_nsp->cgroup_ns = copy_cgroup_ns(flags, user_ns,
> +					    tsk->nsproxy->cgroup_ns);
> +	if (IS_ERR(new_nsp->cgroup_ns)) {
> +		err = PTR_ERR(new_nsp->cgroup_ns);
> +		goto out_cgroup;
> +	}
> +
>  	new_nsp->net_ns = copy_net_ns(flags, user_ns, tsk->nsproxy->net_ns);
>  	if (IS_ERR(new_nsp->net_ns)) {
>  		err = PTR_ERR(new_nsp->net_ns);
> @@ -101,6 +110,9 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
>  	return new_nsp;
>  
>  out_net:
> +	if (new_nsp->cgroup_ns)
> +		put_cgroup_ns(new_nsp->cgroup_ns);
> +out_cgroup:
>  	if (new_nsp->pid_ns_for_children)
>  		put_pid_ns(new_nsp->pid_ns_for_children);
>  out_pid:
> @@ -128,7 +140,8 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
>  	struct nsproxy *new_ns;
>  
>  	if (likely(!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
> -			      CLONE_NEWPID | CLONE_NEWNET)))) {
> +			      CLONE_NEWPID | CLONE_NEWNET |
> +			      CLONE_NEWCGROUP)))) {
>  		get_nsproxy(old_ns);
>  		return 0;
>  	}
> @@ -165,6 +178,8 @@ void free_nsproxy(struct nsproxy *ns)
>  		put_ipc_ns(ns->ipc_ns);
>  	if (ns->pid_ns_for_children)
>  		put_pid_ns(ns->pid_ns_for_children);
> +	if (ns->cgroup_ns)
> +		put_cgroup_ns(ns->cgroup_ns);
>  	put_net(ns->net_ns);
>  	kmem_cache_free(nsproxy_cachep, ns);
>  }
> @@ -180,7 +195,7 @@ int unshare_nsproxy_namespaces(unsigned long unshare_flags,
>  	int err = 0;
>  
>  	if (!(unshare_flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
> -			       CLONE_NEWNET | CLONE_NEWPID)))
> +			       CLONE_NEWNET | CLONE_NEWPID | CLONE_NEWCGROUP)))
>  		return 0;
>  
>  	user_ns = new_cred ? new_cred->user_ns : current_user_ns();
> -- 
> 2.1.0.rc2.206.gedb03e5
> 
> _______________________________________________
> Containers mailing list
> Containers@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers

* Re: [PATCHv1 5/8] cgroup: introduce cgroup namespaces
@ 2014-10-16 16:37         ` Serge E. Hallyn
  0 siblings, 0 replies; 384+ messages in thread
From: Serge E. Hallyn @ 2014-10-16 16:37 UTC (permalink / raw)
  To: Aditya Kali
  Cc: tj-DgEjT+Ai2ygdnm+yROfE0A, lizefan-hv44wF8Li93QT0dZR+AlfA,
	serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA, luto-kltTT9wpgjJwATOyAt5JVQ,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, mingo-H+wXaHxf7aLQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Quoting Aditya Kali (adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org):
> Introduce the ability to create new cgroup namespace. The newly created
> cgroup namespace remembers the 'struct cgroup *root_cgrp' at the point
> of creation of the cgroup namespace. The task that creates the new
> cgroup namespace and all its future children will now be restricted only
> to the cgroup hierarchy under this root_cgrp.
> The main purpose of cgroup namespace is to virtualize the contents
> of /proc/self/cgroup file. Processes inside a cgroup namespace
> are only able to see paths relative to their namespace root.
> This allows container-tools (like libcontainer, lxc, lmctfy, etc.)
> to create completely virtualized containers without leaking system
> level cgroup hierarchy to the task.
> This patch only implements the 'unshare' part of the cgroupns.
> 
> Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>

I'm not sure that the CONFIG_CGROUP_NS is worthwhile.  If you already
have cgroups in the kernel this won't add much in the way of memory
usage, right?  And I think the 'experimental' argument has long since
been squashed.  So I'd argue for simplifying this patch by removing
CONFIG_CGROUP_NS.

(more below)

> ---
>  fs/proc/namespaces.c             |   3 +
>  include/linux/cgroup.h           |  18 +++++-
>  include/linux/cgroup_namespace.h |  62 +++++++++++++++++++
>  include/linux/nsproxy.h          |   2 +
>  include/linux/proc_ns.h          |   4 ++
>  init/Kconfig                     |   9 +++
>  kernel/Makefile                  |   1 +
>  kernel/cgroup.c                  |  11 ++++
>  kernel/cgroup_namespace.c        | 128 +++++++++++++++++++++++++++++++++++++++
>  kernel/fork.c                    |   2 +-
>  kernel/nsproxy.c                 |  19 +++++-
>  11 files changed, 255 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
> index 8902609..e04ed4b 100644
> --- a/fs/proc/namespaces.c
> +++ b/fs/proc/namespaces.c
> @@ -32,6 +32,9 @@ static const struct proc_ns_operations *ns_entries[] = {
>  	&userns_operations,
>  #endif
>  	&mntns_operations,
> +#ifdef CONFIG_CGROUP_NS
> +	&cgroupns_operations,
> +#endif
>  };
>  
>  static const struct file_operations ns_file_operations = {
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 4a0eb2d..aa86495 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -22,6 +22,8 @@
>  #include <linux/seq_file.h>
>  #include <linux/kernfs.h>
>  #include <linux/wait.h>
> +#include <linux/nsproxy.h>
> +#include <linux/types.h>
>  
>  #ifdef CONFIG_CGROUPS
>  
> @@ -460,6 +462,13 @@ struct cftype {
>  #endif
>  };
>  
> +struct cgroup_namespace {
> +	atomic_t		count;
> +	unsigned int		proc_inum;
> +	struct user_namespace	*user_ns;
> +	struct cgroup		*root_cgrp;
> +};
> +
>  extern struct cgroup_root cgrp_dfl_root;
>  extern struct css_set init_css_set;
>  
> @@ -584,10 +593,17 @@ static inline int cgroup_name(struct cgroup *cgrp, char *buf, size_t buflen)
>  	return kernfs_name(cgrp->kn, buf, buflen);
>  }
>  
> +static inline char * __must_check cgroup_path_ns(struct cgroup_namespace *ns,
> +						 struct cgroup *cgrp, char *buf,
> +						 size_t buflen)
> +{
> +	return kernfs_path_from_node(ns->root_cgrp->kn, cgrp->kn, buf, buflen);
> +}
> +
>  static inline char * __must_check cgroup_path(struct cgroup *cgrp, char *buf,
>  					      size_t buflen)
>  {
> -	return kernfs_path(cgrp->kn, buf, buflen);
> +	return cgroup_path_ns(current->nsproxy->cgroup_ns, cgrp, buf, buflen);
>  }
>  
>  static inline void pr_cont_cgroup_name(struct cgroup *cgrp)
> diff --git a/include/linux/cgroup_namespace.h b/include/linux/cgroup_namespace.h
> new file mode 100644
> index 0000000..9f637fe
> --- /dev/null
> +++ b/include/linux/cgroup_namespace.h
> @@ -0,0 +1,62 @@
> +#ifndef _LINUX_CGROUP_NAMESPACE_H
> +#define _LINUX_CGROUP_NAMESPACE_H
> +
> +#include <linux/nsproxy.h>
> +#include <linux/cgroup.h>
> +#include <linux/types.h>
> +#include <linux/user_namespace.h>
> +
> +extern struct cgroup_namespace init_cgroup_ns;
> +
> +static inline struct cgroup *task_cgroupns_root(struct task_struct *tsk)
> +{
> +	return tsk->nsproxy->cgroup_ns->root_cgrp;

Per the rules in nsproxy.h, you should be taking the task_lock here.

(If you are making assumptions about tsk then you need to state them
here - I only looked closely enough to see that you pass in 'leader')

> +}
> +
> +#ifdef CONFIG_CGROUP_NS
> +
> +extern void free_cgroup_ns(struct cgroup_namespace *ns);
> +
> +static inline struct cgroup_namespace *get_cgroup_ns(
> +		struct cgroup_namespace *ns)
> +{
> +	if (ns)
> +		atomic_inc(&ns->count);
> +	return ns;
> +}
> +
> +static inline void put_cgroup_ns(struct cgroup_namespace *ns)
> +{
> +	if (ns && atomic_dec_and_test(&ns->count))
> +		free_cgroup_ns(ns);
> +}
> +
> +extern struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
> +					       struct user_namespace *user_ns,
> +					       struct cgroup_namespace *old_ns);
> +
> +#else  /* CONFIG_CGROUP_NS */
> +
> +static inline struct cgroup_namespace *get_cgroup_ns(
> +		struct cgroup_namespace *ns)
> +{
> +	return &init_cgroup_ns;
> +}
> +
> +static inline void put_cgroup_ns(struct cgroup_namespace *ns)
> +{
> +}
> +
> +static inline struct cgroup_namespace *copy_cgroup_ns(
> +		unsigned long flags,
> +		struct user_namespace *user_ns,
> +		struct cgroup_namespace *old_ns) {
> +	if (flags & CLONE_NEWCGROUP)
> +		return ERR_PTR(-EINVAL);
> +
> +	return old_ns;
> +}
> +
> +#endif  /* CONFIG_CGROUP_NS */
> +
> +#endif  /* _LINUX_CGROUP_NAMESPACE_H */
> diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
> index 35fa08f..ac0d65b 100644
> --- a/include/linux/nsproxy.h
> +++ b/include/linux/nsproxy.h
> @@ -8,6 +8,7 @@ struct mnt_namespace;
>  struct uts_namespace;
>  struct ipc_namespace;
>  struct pid_namespace;
> +struct cgroup_namespace;
>  struct fs_struct;
>  
>  /*
> @@ -33,6 +34,7 @@ struct nsproxy {
>  	struct mnt_namespace *mnt_ns;
>  	struct pid_namespace *pid_ns_for_children;
>  	struct net 	     *net_ns;
> +	struct cgroup_namespace *cgroup_ns;
>  };
>  extern struct nsproxy init_nsproxy;
>  
> diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
> index 34a1e10..e56dd73 100644
> --- a/include/linux/proc_ns.h
> +++ b/include/linux/proc_ns.h
> @@ -6,6 +6,8 @@
>  
>  struct pid_namespace;
>  struct nsproxy;
> +struct task_struct;
> +struct inode;
>  
>  struct proc_ns_operations {
>  	const char *name;
> @@ -27,6 +29,7 @@ extern const struct proc_ns_operations ipcns_operations;
>  extern const struct proc_ns_operations pidns_operations;
>  extern const struct proc_ns_operations userns_operations;
>  extern const struct proc_ns_operations mntns_operations;
> +extern const struct proc_ns_operations cgroupns_operations;
>  
>  /*
>   * We always define these enumerators
> @@ -37,6 +40,7 @@ enum {
>  	PROC_UTS_INIT_INO	= 0xEFFFFFFEU,
>  	PROC_USER_INIT_INO	= 0xEFFFFFFDU,
>  	PROC_PID_INIT_INO	= 0xEFFFFFFCU,
> +	PROC_CGROUP_INIT_INO	= 0xEFFFFFFBU,
>  };
>  
>  #ifdef CONFIG_PROC_FS
> diff --git a/init/Kconfig b/init/Kconfig
> index e84c642..c3be001 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1144,6 +1144,15 @@ config DEBUG_BLK_CGROUP
>  	Enable some debugging help. Currently it exports additional stat
>  	files in a cgroup which can be useful for debugging.
>  
> +config CGROUP_NS
> +	bool "CGroup Namespaces"
> +	default n
> +	help
> +	  This options enables CGroup Namespaces which can be used to isolate
> +	  cgroup paths. This feature is only useful when unified cgroup
> +	  hierarchy is in use (i.e. cgroups are mounted with sane_behavior
> +	  option).
> +
>  endif # CGROUPS
>  
>  config CHECKPOINT_RESTORE
> diff --git a/kernel/Makefile b/kernel/Makefile
> index dc5c775..75334f8 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -51,6 +51,7 @@ obj-$(CONFIG_KEXEC) += kexec.o
>  obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
>  obj-$(CONFIG_COMPAT) += compat.o
>  obj-$(CONFIG_CGROUPS) += cgroup.o
> +obj-$(CONFIG_CGROUP_NS) += cgroup_namespace.o
>  obj-$(CONFIG_CGROUP_FREEZER) += cgroup_freezer.o
>  obj-$(CONFIG_CPUSETS) += cpuset.o
>  obj-$(CONFIG_UTS_NS) += utsname.o
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 2b3e9f9..f8099b4 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -57,6 +57,8 @@
>  #include <linux/vmalloc.h> /* TODO: replace with more sophisticated array */
>  #include <linux/kthread.h>
>  #include <linux/delay.h>
> +#include <linux/proc_ns.h>
> +#include <linux/cgroup_namespace.h>
>  
>  #include <linux/atomic.h>
>  
> @@ -195,6 +197,15 @@ static void kill_css(struct cgroup_subsys_state *css);
>  static int cgroup_addrm_files(struct cgroup *cgrp, struct cftype cfts[],
>  			      bool is_add);
>  
> +struct cgroup_namespace init_cgroup_ns = {
> +	.count = {
> +		.counter = 1,
> +	},
> +	.proc_inum = PROC_CGROUP_INIT_INO,
> +	.user_ns = &init_user_ns,

This might mean that you should bump the init_user_ns refcount.

> +	.root_cgrp = &cgrp_dfl_root.cgrp,
> +};
> +
>  /* IDR wrappers which synchronize using cgroup_idr_lock */
>  static int cgroup_idr_alloc(struct idr *idr, void *ptr, int start, int end,
>  			    gfp_t gfp_mask)
> diff --git a/kernel/cgroup_namespace.c b/kernel/cgroup_namespace.c
> new file mode 100644
> index 0000000..c16604f
> --- /dev/null
> +++ b/kernel/cgroup_namespace.c
> @@ -0,0 +1,128 @@
> +
> +#include <linux/cgroup.h>
> +#include <linux/cgroup_namespace.h>
> +#include <linux/sched.h>
> +#include <linux/slab.h>
> +#include <linux/nsproxy.h>
> +#include <linux/proc_ns.h>
> +
> +static struct cgroup_namespace *alloc_cgroup_ns(void)
> +{
> +	struct cgroup_namespace *new_ns;
> +
> +	new_ns = kmalloc(sizeof(struct cgroup_namespace), GFP_KERNEL);
> +	if (new_ns)
> +		atomic_set(&new_ns->count, 1);
> +	return new_ns;
> +}
> +
> +void free_cgroup_ns(struct cgroup_namespace *ns)
> +{
> +	cgroup_put(ns->root_cgrp);
> +	put_user_ns(ns->user_ns);

This is a problem on the error path in copy_cgroup_ns.  The
alloc_cgroup_ns() doesn't initialize these values, so if
you should fail in proc_alloc_inum() you'll show up here
with random values in ns->*.
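
A minimal userspace sketch of one way to address this kind of concern,
not kernel code: zero the allocation (kzalloc() in the kernel, calloc()
in this model) so that an early failure reaches the free routine with
NULL pointers rather than uninitialized ones. The names model_ns,
model_alloc_ns, model_free_ns, model_copy_ns and model_puts are invented
for illustration and do not appear in the patch.

```c
#include <stdlib.h>

/* Hypothetical stand-in for struct cgroup_namespace. */
struct model_ns {
	int count;
	unsigned int proc_inum;
	void *user_ns;		/* stands in for struct user_namespace * */
	void *root_cgrp;	/* stands in for struct cgroup * */
};

/* Number of references model_free_ns() has dropped so far. */
static int model_puts;

/* calloc() plays the role of kzalloc(): every field starts as 0/NULL. */
static struct model_ns *model_alloc_ns(void)
{
	struct model_ns *ns = calloc(1, sizeof(*ns));

	if (ns)
		ns->count = 1;
	return ns;
}

/* Only drop references that were actually taken. */
static void model_free_ns(struct model_ns *ns)
{
	if (ns->root_cgrp)
		model_puts++;	/* would be cgroup_put() */
	if (ns->user_ns)
		model_puts++;	/* would be put_user_ns() */
	free(ns);
}

/*
 * Returns NULL on failure. fail_inum simulates proc_alloc_inum() failing
 * before user_ns and root_cgrp have been assigned.
 */
static struct model_ns *model_copy_ns(void *user_ns, void *cgrp, int fail_inum)
{
	struct model_ns *ns = model_alloc_ns();

	if (!ns)
		return NULL;
	if (fail_inum) {
		model_free_ns(ns);	/* safe: both pointers are still NULL */
		return NULL;
	}
	ns->proc_inum = 1;
	ns->user_ns = user_ns;
	ns->root_cgrp = cgrp;
	return ns;
}
```

In this model a failed model_copy_ns() drops no references, while
freeing a fully set up namespace drops exactly two.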

> +	proc_free_inum(ns->proc_inum);
> +}
> +EXPORT_SYMBOL(free_cgroup_ns);
> +
> +struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
> +					struct user_namespace *user_ns,
> +					struct cgroup_namespace *old_ns)
> +{
> +	struct cgroup_namespace *new_ns = NULL;
> +	struct cgroup *cgrp = NULL;
> +	int err;
> +
> +	BUG_ON(!old_ns);
> +
> +	if (!(flags & CLONE_NEWCGROUP))
> +		return get_cgroup_ns(old_ns);
> +
> +	/* Allow only sysadmin to create cgroup namespace. */
> +	err = -EPERM;
> +	if (!ns_capable(user_ns, CAP_SYS_ADMIN))
> +		goto err_out;
> +
> +	/* Prevent cgroup changes for this task. */
> +	threadgroup_lock(current);
> +
> +	cgrp = get_task_cgroup(current);
> +
> +	/* Creating new CGROUPNS is supported only when unified hierarchy is in
> +	 * use. */

Oh, drat.  Well, I'll take it, but under protest  :)

> +	err = -EINVAL;
> +	if (!cgroup_on_dfl(cgrp))
> +		goto err_out_unlock;
> +
> +	err = -ENOMEM;
> +	new_ns = alloc_cgroup_ns();
> +	if (!new_ns)
> +		goto err_out_unlock;
> +
> +	err = proc_alloc_inum(&new_ns->proc_inum);
> +	if (err)
> +		goto err_out_unlock;
> +
> +	new_ns->user_ns = get_user_ns(user_ns);
> +	new_ns->root_cgrp = cgrp;
> +
> +	threadgroup_unlock(current);
> +
> +	return new_ns;
> +
> +err_out_unlock:
> +	threadgroup_unlock(current);
> +err_out:
> +	if (cgrp)
> +		cgroup_put(cgrp);
> +	kfree(new_ns);
> +	return ERR_PTR(err);
> +}
> +
> +static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
> +{
> +	pr_info("setns not supported for cgroup namespace");
> +	return -EINVAL;
> +}
> +
> +static void *cgroupns_get(struct task_struct *task)
> +{
> +	struct cgroup_namespace *ns = NULL;
> +	struct nsproxy *nsproxy;
> +
> +	rcu_read_lock();
> +	nsproxy = task->nsproxy;
> +	if (nsproxy) {
> +		ns = nsproxy->cgroup_ns;
> +		get_cgroup_ns(ns);
> +	}
> +	rcu_read_unlock();
> +
> +	return ns;
> +}
> +
> +static void cgroupns_put(void *ns)
> +{
> +	put_cgroup_ns(ns);
> +}
> +
> +static unsigned int cgroupns_inum(void *ns)
> +{
> +	struct cgroup_namespace *cgroup_ns = ns;
> +
> +	return cgroup_ns->proc_inum;
> +}
> +
> +const struct proc_ns_operations cgroupns_operations = {
> +	.name		= "cgroup",
> +	.type		= CLONE_NEWCGROUP,
> +	.get		= cgroupns_get,
> +	.put		= cgroupns_put,
> +	.install	= cgroupns_install,
> +	.inum		= cgroupns_inum,
> +};
> +
> +static __init int cgroup_namespaces_init(void)
> +{
> +	return 0;
> +}
> +subsys_initcall(cgroup_namespaces_init);
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 0cf9cdb..cc06851 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1790,7 +1790,7 @@ static int check_unshare_flags(unsigned long unshare_flags)
>  	if (unshare_flags & ~(CLONE_THREAD|CLONE_FS|CLONE_NEWNS|CLONE_SIGHAND|
>  				CLONE_VM|CLONE_FILES|CLONE_SYSVSEM|
>  				CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWNET|
> -				CLONE_NEWUSER|CLONE_NEWPID))
> +				CLONE_NEWUSER|CLONE_NEWPID|CLONE_NEWCGROUP))
>  		return -EINVAL;
>  	/*
>  	 * Not implemented, but pretend it works if there is nothing to
> diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
> index ef42d0a..a8b1970 100644
> --- a/kernel/nsproxy.c
> +++ b/kernel/nsproxy.c
> @@ -25,6 +25,7 @@
>  #include <linux/proc_ns.h>
>  #include <linux/file.h>
>  #include <linux/syscalls.h>
> +#include <linux/cgroup_namespace.h>
>  
>  static struct kmem_cache *nsproxy_cachep;
>  
> @@ -39,6 +40,7 @@ struct nsproxy init_nsproxy = {
>  #ifdef CONFIG_NET
>  	.net_ns			= &init_net,
>  #endif
> +	.cgroup_ns		= &init_cgroup_ns,
>  };
>  
>  static inline struct nsproxy *create_nsproxy(void)
> @@ -92,6 +94,13 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
>  		goto out_pid;
>  	}
>  
> +	new_nsp->cgroup_ns = copy_cgroup_ns(flags, user_ns,
> +					    tsk->nsproxy->cgroup_ns);
> +	if (IS_ERR(new_nsp->cgroup_ns)) {
> +		err = PTR_ERR(new_nsp->cgroup_ns);
> +		goto out_cgroup;
> +	}
> +
>  	new_nsp->net_ns = copy_net_ns(flags, user_ns, tsk->nsproxy->net_ns);
>  	if (IS_ERR(new_nsp->net_ns)) {
>  		err = PTR_ERR(new_nsp->net_ns);
> @@ -101,6 +110,9 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
>  	return new_nsp;
>  
>  out_net:
> +	if (new_nsp->cgroup_ns)
> +		put_cgroup_ns(new_nsp->cgroup_ns);
> +out_cgroup:
>  	if (new_nsp->pid_ns_for_children)
>  		put_pid_ns(new_nsp->pid_ns_for_children);
>  out_pid:
> @@ -128,7 +140,8 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
>  	struct nsproxy *new_ns;
>  
>  	if (likely(!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
> -			      CLONE_NEWPID | CLONE_NEWNET)))) {
> +			      CLONE_NEWPID | CLONE_NEWNET |
> +			      CLONE_NEWCGROUP)))) {
>  		get_nsproxy(old_ns);
>  		return 0;
>  	}
> @@ -165,6 +178,8 @@ void free_nsproxy(struct nsproxy *ns)
>  		put_ipc_ns(ns->ipc_ns);
>  	if (ns->pid_ns_for_children)
>  		put_pid_ns(ns->pid_ns_for_children);
> +	if (ns->cgroup_ns)
> +		put_cgroup_ns(ns->cgroup_ns);
>  	put_net(ns->net_ns);
>  	kmem_cache_free(nsproxy_cachep, ns);
>  }
> @@ -180,7 +195,7 @@ int unshare_nsproxy_namespaces(unsigned long unshare_flags,
>  	int err = 0;
>  
>  	if (!(unshare_flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
> -			       CLONE_NEWNET | CLONE_NEWPID)))
> +			       CLONE_NEWNET | CLONE_NEWPID | CLONE_NEWCGROUP)))
>  		return 0;
>  
>  	user_ns = new_cred ? new_cred->user_ns : current_user_ns();
> -- 
> 2.1.0.rc2.206.gedb03e5

* Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support
  2014-10-13 21:23     ` Aditya Kali
@ 2014-10-16 21:12         ` Serge E. Hallyn
  -1 siblings, 0 replies; 384+ messages in thread
From: Serge E. Hallyn @ 2014-10-16 21:12 UTC (permalink / raw)
  To: Aditya Kali
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, luto-kltTT9wpgjJwATOyAt5JVQ,
	tj-DgEjT+Ai2ygdnm+yROfE0A, cgroups-u79uwXL29TY76Z2rM5mHXA,
	mingo-H+wXaHxf7aLQT0dZR+AlfA

Quoting Aditya Kali (adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org):
> setns on a cgroup namespace is allowed only if
> * task has CAP_SYS_ADMIN in its current user-namespace and
>   over the user-namespace associated with target cgroupns.
> * task's current cgroup is descendent of the target cgroupns-root
>   cgroup.

What is the point of this?

If I'm a user logged into
/lxc/c1/user.slice/user-1000.slice/session-c12.scope and I start
a container which is in
/lxc/c1/user.slice/user-1000.slice/session-c12.scope/x1
then I will want to be able to enter the container's cgroup.
The container's cgroup root is under my own (satisfying the
below condition) but my cgroup is not a descendent of the
container's cgroup.
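
To make the scenario concrete, here is a small userspace model of the
two hierarchy checks from the patch description. It is only a sketch:
struct model_cgroup, model_is_descendant() and model_setns_allowed()
are made-up stand-ins for struct cgroup, cgroup_is_descendant() and the
checks in cgroupns_install(), and it assumes my cgroupns root is at (or
above) my session scope.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct cgroup: only the parent link matters. */
struct model_cgroup {
	const char *name;
	struct model_cgroup *parent;
};

/* True if cgrp is ancestor itself or lies somewhere below it. */
static bool model_is_descendant(const struct model_cgroup *cgrp,
				const struct model_cgroup *ancestor)
{
	while (cgrp) {
		if (cgrp == ancestor)
			return true;
		cgrp = cgrp->parent;
	}
	return false;
}

/*
 * The two hierarchy conditions from the patch description:
 *   1. the task's current cgroup is a descendant of the target ns root;
 *   2. the target ns root is the same as or deeper than the task's
 *      current ns root.
 */
static bool model_setns_allowed(const struct model_cgroup *task_cgrp,
				const struct model_cgroup *task_ns_root,
				const struct model_cgroup *target_ns_root)
{
	return model_is_descendant(task_cgrp, target_ns_root) &&
	       model_is_descendant(target_ns_root, task_ns_root);
}
```

With session-c12.scope as the parent of x1, a task sitting in
session-c12.scope fails check 1 when the target root is x1, even though
check 2 holds. A task already sitting in x1 passes both.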


> * target cgroupns-root is same as or deeper than task's current
>   cgroupns-root. This is so that the task cannot escape out of its
>   cgroupns-root. This also ensures that setns() only makes the task
>   get restricted to a deeper cgroup hierarchy.
> 
> Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> ---
>  kernel/cgroup_namespace.c | 44 ++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 42 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/cgroup_namespace.c b/kernel/cgroup_namespace.c
> index c16604f..c612946 100644
> --- a/kernel/cgroup_namespace.c
> +++ b/kernel/cgroup_namespace.c
> @@ -80,8 +80,48 @@ err_out:
>  
>  static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
>  {
> -	pr_info("setns not supported for cgroup namespace");
> -	return -EINVAL;
> +	struct cgroup_namespace *cgroup_ns = ns;
> +	struct task_struct *task = current;
> +	struct cgroup *cgrp = NULL;
> +	int err = 0;
> +
> +	if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN) ||
> +	    !ns_capable(cgroup_ns->user_ns, CAP_SYS_ADMIN))
> +		return -EPERM;
> +
> +	/* Prevent cgroup changes for this task. */
> +	threadgroup_lock(task);
> +
> +	cgrp = get_task_cgroup(task);
> +
> +	err = -EINVAL;
> +	if (!cgroup_on_dfl(cgrp))
> +		goto out_unlock;
> +
> +	/* Allow switch only if the task's current cgroup is descendant of the
> +	 * target cgroup_ns->root_cgrp.
> +	 */
> +	if (!cgroup_is_descendant(cgrp, cgroup_ns->root_cgrp))
> +		goto out_unlock;
> +
> +	/* Only allow setns to a cgroupns root-ed deeper than task's current
> +	 * cgroupns-root. This will make sure that tasks cannot escape their
> +	 * cgroupns by attaching to parent cgroupns.
> +	 */
> +	if (!cgroup_is_descendant(cgroup_ns->root_cgrp,
> +				  task_cgroupns_root(task)))
> +		goto out_unlock;
> +
> +	err = 0;
> +	get_cgroup_ns(cgroup_ns);
> +	put_cgroup_ns(nsproxy->cgroup_ns);
> +	nsproxy->cgroup_ns = cgroup_ns;
> +
> +out_unlock:
> +	threadgroup_unlock(current);
> +	if (cgrp)
> +		cgroup_put(cgrp);
> +	return err;
>  }
>  
>  static void *cgroupns_get(struct task_struct *task)
> -- 
> 2.1.0.rc2.206.gedb03e5
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

* Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support
@ 2014-10-16 21:12         ` Serge E. Hallyn
  0 siblings, 0 replies; 384+ messages in thread
From: Serge E. Hallyn @ 2014-10-16 21:12 UTC (permalink / raw)
  To: Aditya Kali
  Cc: tj, lizefan, serge.hallyn, luto, cgroups, linux-kernel,
	linux-api, mingo, containers, jnagal

Quoting Aditya Kali (adityakali@google.com):
> setns on a cgroup namespace is allowed only if
> * task has CAP_SYS_ADMIN in its current user-namespace and
>   over the user-namespace associated with target cgroupns.
> * task's current cgroup is descendent of the target cgroupns-root
>   cgroup.

What is the point of this?

If I'm a user logged into
/lxc/c1/user.slice/user-1000.slice/session-c12.scope and I start
a container which is in
/lxc/c1/user.slice/user-1000.slice/session-c12.scope/x1
then I will want to be able to enter the container's cgroup.
The container's cgroup root is under my own (satisfying the
below condition) but my cgroup is not a descendent of the
container's cgroup.


> * target cgroupns-root is same as or deeper than task's current
>   cgroupns-root. This is so that the task cannot escape out of its
>   cgroupns-root. This also ensures that setns() only makes the task
>   get restricted to a deeper cgroup hierarchy.
> 
> Signed-off-by: Aditya Kali <adityakali@google.com>
> ---
>  kernel/cgroup_namespace.c | 44 ++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 42 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/cgroup_namespace.c b/kernel/cgroup_namespace.c
> index c16604f..c612946 100644
> --- a/kernel/cgroup_namespace.c
> +++ b/kernel/cgroup_namespace.c
> @@ -80,8 +80,48 @@ err_out:
>  
>  static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
>  {
> -	pr_info("setns not supported for cgroup namespace");
> -	return -EINVAL;
> +	struct cgroup_namespace *cgroup_ns = ns;
> +	struct task_struct *task = current;
> +	struct cgroup *cgrp = NULL;
> +	int err = 0;
> +
> +	if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN) ||
> +	    !ns_capable(cgroup_ns->user_ns, CAP_SYS_ADMIN))
> +		return -EPERM;
> +
> +	/* Prevent cgroup changes for this task. */
> +	threadgroup_lock(task);
> +
> +	cgrp = get_task_cgroup(task);
> +
> +	err = -EINVAL;
> +	if (!cgroup_on_dfl(cgrp))
> +		goto out_unlock;
> +
> +	/* Allow switch only if the task's current cgroup is descendant of the
> +	 * target cgroup_ns->root_cgrp.
> +	 */
> +	if (!cgroup_is_descendant(cgrp, cgroup_ns->root_cgrp))
> +		goto out_unlock;
> +
> +	/* Only allow setns to a cgroupns root-ed deeper than task's current
> +	 * cgroupns-root. This will make sure that tasks cannot escape their
> +	 * cgroupns by attaching to parent cgroupns.
> +	 */
> +	if (!cgroup_is_descendant(cgroup_ns->root_cgrp,
> +				  task_cgroupns_root(task)))
> +		goto out_unlock;
> +
> +	err = 0;
> +	get_cgroup_ns(cgroup_ns);
> +	put_cgroup_ns(nsproxy->cgroup_ns);
> +	nsproxy->cgroup_ns = cgroup_ns;
> +
> +out_unlock:
> +	threadgroup_unlock(current);
> +	if (cgrp)
> +		cgroup_put(cgrp);
> +	return err;
>  }
>  
>  static void *cgroupns_get(struct task_struct *task)
> -- 
> 2.1.0.rc2.206.gedb03e5

* Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support
  2014-10-16 21:12         ` Serge E. Hallyn
@ 2014-10-16 21:17             ` Andy Lutomirski
  -1 siblings, 0 replies; 384+ messages in thread
From: Andy Lutomirski @ 2014-10-16 21:17 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Ingo Molnar

On Thu, Oct 16, 2014 at 2:12 PM, Serge E. Hallyn <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> wrote:
> Quoting Aditya Kali (adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org):
>> setns on a cgroup namespace is allowed only if
>> * task has CAP_SYS_ADMIN in its current user-namespace and
>>   over the user-namespace associated with target cgroupns.
>> * task's current cgroup is descendent of the target cgroupns-root
>>   cgroup.
>
> What is the point of this?
>
> If I'm a user logged into
> /lxc/c1/user.slice/user-1000.slice/session-c12.scope and I start
> a container which is in
> /lxc/c1/user.slice/user-1000.slice/session-c12.scope/x1
> then I will want to be able to enter the container's cgroup.
> The container's cgroup root is under my own (satisfying the
> below condition) but my cgroup is not a descendent of the
> container's cgroup.
>

Presumably you need to ask your friendly cgroup manager to stick you
in that cgroup first.  Or we need to generally allow tasks to move
themselves deeper in the hierarchy, but that seems like a big change.
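
Roughly, the two-step flow could look like the userspace sketch below.
It rests on assumptions: the helper names procs_path(),
move_to_cgroup() and enter_cgroupns_of() are invented here, the cgroup
directory is whatever path the manager uses for its mount, the
namespace file is assumed to show up as /proc/<pid>/ns/cgroup (the
patch registers the name "cgroup"), and the fallback CLONE_NEWCGROUP
value is the one mainline Linux uses, which may differ from this RFC.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for setns() */
#endif
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef CLONE_NEWCGROUP
#define CLONE_NEWCGROUP 0x02000000	/* value used by mainline Linux */
#endif

/* Build "<cgroup_dir>/cgroup.procs"; returns 0, or -1 if it does not fit. */
static int procs_path(char *buf, size_t len, const char *cgroup_dir)
{
	int n = snprintf(buf, len, "%s/cgroup.procs", cgroup_dir);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Step 1 (done by the cgroup manager): move pid into cgroup_dir. */
static int move_to_cgroup(const char *cgroup_dir, pid_t pid)
{
	char path[4096], pidstr[32];
	int fd, n, ret = -1;

	if (procs_path(path, sizeof(path), cgroup_dir) < 0)
		return -1;
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;
	n = snprintf(pidstr, sizeof(pidstr), "%d\n", (int)pid);
	if (write(fd, pidstr, (size_t)n) == n)
		ret = 0;
	close(fd);
	return ret;
}

/* Step 2 (done by the task itself): join the cgroup namespace of target. */
static int enter_cgroupns_of(pid_t target)
{
	char path[64];
	int fd, ret;

	snprintf(path, sizeof(path), "/proc/%d/ns/cgroup", (int)target);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	ret = setns(fd, CLONE_NEWCGROUP);
	close(fd);
	return ret;
}
```

The manager would call move_to_cgroup() with the container's cgroup
directory and the task's pid, after which the task calls
enter_cgroupns_of() with the pid of a process inside the container.
Both helpers return -1 on any failure and leave errno as set by the
failing call.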

--Andy

>
>> * target cgroupns-root is same as or deeper than task's current
>>   cgroupns-root. This is so that the task cannot escape out of its
>>   cgroupns-root. This also ensures that setns() only makes the task
>>   get restricted to a deeper cgroup hierarchy.
>>
>> Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
>> ---
>>  kernel/cgroup_namespace.c | 44 ++++++++++++++++++++++++++++++++++++++++++--
>>  1 file changed, 42 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/cgroup_namespace.c b/kernel/cgroup_namespace.c
>> index c16604f..c612946 100644
>> --- a/kernel/cgroup_namespace.c
>> +++ b/kernel/cgroup_namespace.c
>> @@ -80,8 +80,48 @@ err_out:
>>
>>  static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
>>  {
>> -     pr_info("setns not supported for cgroup namespace");
>> -     return -EINVAL;
>> +     struct cgroup_namespace *cgroup_ns = ns;
>> +     struct task_struct *task = current;
>> +     struct cgroup *cgrp = NULL;
>> +     int err = 0;
>> +
>> +     if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN) ||
>> +         !ns_capable(cgroup_ns->user_ns, CAP_SYS_ADMIN))
>> +             return -EPERM;
>> +
>> +     /* Prevent cgroup changes for this task. */
>> +     threadgroup_lock(task);
>> +
>> +     cgrp = get_task_cgroup(task);
>> +
>> +     err = -EINVAL;
>> +     if (!cgroup_on_dfl(cgrp))
>> +             goto out_unlock;
>> +
>> +     /* Allow switch only if the task's current cgroup is descendant of the
>> +      * target cgroup_ns->root_cgrp.
>> +      */
>> +     if (!cgroup_is_descendant(cgrp, cgroup_ns->root_cgrp))
>> +             goto out_unlock;
>> +
>> +     /* Only allow setns to a cgroupns root-ed deeper than task's current
>> +      * cgroupns-root. This will make sure that tasks cannot escape their
>> +      * cgroupns by attaching to parent cgroupns.
>> +      */
>> +     if (!cgroup_is_descendant(cgroup_ns->root_cgrp,
>> +                               task_cgroupns_root(task)))
>> +             goto out_unlock;
>> +
>> +     err = 0;
>> +     get_cgroup_ns(cgroup_ns);
>> +     put_cgroup_ns(nsproxy->cgroup_ns);
>> +     nsproxy->cgroup_ns = cgroup_ns;
>> +
>> +out_unlock:
>> +     threadgroup_unlock(current);
>> +     if (cgrp)
>> +             cgroup_put(cgrp);
>> +     return err;
>>  }
>>
>>  static void *cgroupns_get(struct task_struct *task)
>> --
>> 2.1.0.rc2.206.gedb03e5



-- 
Andy Lutomirski
AMA Capital Management, LLC

* Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support
@ 2014-10-16 21:17             ` Andy Lutomirski
  0 siblings, 0 replies; 384+ messages in thread
From: Andy Lutomirski @ 2014-10-16 21:17 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Aditya Kali, Tejun Heo, Li Zefan, Serge Hallyn, cgroups,
	linux-kernel, Linux API, Ingo Molnar, Linux Containers,
	Rohit Jnagal

On Thu, Oct 16, 2014 at 2:12 PM, Serge E. Hallyn <serge@hallyn.com> wrote:
> Quoting Aditya Kali (adityakali@google.com):
>> setns on a cgroup namespace is allowed only if
>> * task has CAP_SYS_ADMIN in its current user-namespace and
>>   over the user-namespace associated with target cgroupns.
>> * task's current cgroup is descendent of the target cgroupns-root
>>   cgroup.
>
> What is the point of this?
>
> If I'm a user logged into
> /lxc/c1/user.slice/user-1000.slice/session-c12.scope and I start
> a container which is in
> /lxc/c1/user.slice/user-1000.slice/session-c12.scope/x1
> then I will want to be able to enter the container's cgroup.
> The container's cgroup root is under my own (satisfying the
> below condition) but my cgroup is not a descendent of the
> container's cgroup.
>

Presumably you need to ask your friendly cgroup manager to stick you
in that cgroup first.  Or we need to generally allow tasks to move
themselves deeper in the hierarchy, but that seems like a big change.

--Andy

>
>> * target cgroupns-root is same as or deeper than task's current
>>   cgroupns-root. This is so that the task cannot escape out of its
>>   cgroupns-root. This also ensures that setns() only makes the task
>>   get restricted to a deeper cgroup hierarchy.
>>
>> Signed-off-by: Aditya Kali <adityakali@google.com>
>> ---
>>  kernel/cgroup_namespace.c | 44 ++++++++++++++++++++++++++++++++++++++++++--
>>  1 file changed, 42 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/cgroup_namespace.c b/kernel/cgroup_namespace.c
>> index c16604f..c612946 100644
>> --- a/kernel/cgroup_namespace.c
>> +++ b/kernel/cgroup_namespace.c
>> @@ -80,8 +80,48 @@ err_out:
>>
>>  static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
>>  {
>> -     pr_info("setns not supported for cgroup namespace");
>> -     return -EINVAL;
>> +     struct cgroup_namespace *cgroup_ns = ns;
>> +     struct task_struct *task = current;
>> +     struct cgroup *cgrp = NULL;
>> +     int err = 0;
>> +
>> +     if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN) ||
>> +         !ns_capable(cgroup_ns->user_ns, CAP_SYS_ADMIN))
>> +             return -EPERM;
>> +
>> +     /* Prevent cgroup changes for this task. */
>> +     threadgroup_lock(task);
>> +
>> +     cgrp = get_task_cgroup(task);
>> +
>> +     err = -EINVAL;
>> +     if (!cgroup_on_dfl(cgrp))
>> +             goto out_unlock;
>> +
>> +     /* Allow switch only if the task's current cgroup is descendant of the
>> +      * target cgroup_ns->root_cgrp.
>> +      */
>> +     if (!cgroup_is_descendant(cgrp, cgroup_ns->root_cgrp))
>> +             goto out_unlock;
>> +
>> +     /* Only allow setns to a cgroupns root-ed deeper than task's current
>> +      * cgroupns-root. This will make sure that tasks cannot escape their
>> +      * cgroupns by attaching to parent cgroupns.
>> +      */
>> +     if (!cgroup_is_descendant(cgroup_ns->root_cgrp,
>> +                               task_cgroupns_root(task)))
>> +             goto out_unlock;
>> +
>> +     err = 0;
>> +     get_cgroup_ns(cgroup_ns);
>> +     put_cgroup_ns(nsproxy->cgroup_ns);
>> +     nsproxy->cgroup_ns = cgroup_ns;
>> +
>> +out_unlock:
>> +     threadgroup_unlock(current);
>> +     if (cgrp)
>> +             cgroup_put(cgrp);
>> +     return err;
>>  }
>>
>>  static void *cgroupns_get(struct task_struct *task)
>> --
>> 2.1.0.rc2.206.gedb03e5
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at  http://www.tux.org/lkml/



-- 
Andy Lutomirski
AMA Capital Management, LLC


* Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support
  2014-10-16 21:12         ` Serge E. Hallyn
@ 2014-10-16 21:22             ` Aditya Kali
  -1 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-10-16 21:22 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andy Lutomirski, Tejun Heo,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Ingo Molnar

On Thu, Oct 16, 2014 at 2:12 PM, Serge E. Hallyn <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> wrote:
> Quoting Aditya Kali (adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org):
>> setns on a cgroup namespace is allowed only if
>> * task has CAP_SYS_ADMIN in its current user-namespace and
>>   over the user-namespace associated with target cgroupns.
>> * task's current cgroup is descendent of the target cgroupns-root
>>   cgroup.
>
> What is the point of this?
>
> If I'm a user logged into
> /lxc/c1/user.slice/user-1000.slice/session-c12.scope and I start
> a container which is in
> /lxc/c1/user.slice/user-1000.slice/session-c12.scope/x1
> then I will want to be able to enter the container's cgroup.
> The container's cgroup root is under my own (satisfying the
> below condition) but my cgroup is not a descendent of the
> container's cgroup.
>
This condition is there because we don't want to do implicit cgroup
changes when a process attaches to another cgroupns. cgroupns tries to
preserve the invariant that at any point, your current cgroup is
always under the cgroupns-root of your cgroup namespace. But in your
example, if we allow a process in "session-c12.scope" container to
attach to cgroupns root'ed at "session-c12.scope/x1" container
(without implicitly moving its cgroup), then this invariant won't
hold.
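
The invariant described above can be modelled in userspace. This is a minimal sketch, not kernel code: cgroups are represented as absolute path strings without a trailing slash, and the names path_is_descendant() and setns_would_be_allowed() are hypothetical helpers invented for illustration. It covers only the two hierarchy checks from the patch description and leaves out the capability and default-hierarchy checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Userspace model only: a cgroup is an absolute path such as "/lxc/c1",
 * not the kernel's struct cgroup. */

/* True if 'path' names 'ancestor' itself or a cgroup below it. */
static bool path_is_descendant(const char *path, const char *ancestor)
{
	size_t n;

	assert(path && ancestor);
	if (strcmp(ancestor, "/") == 0)
		return path[0] == '/';
	n = strlen(ancestor);
	if (strncmp(path, ancestor, n) != 0)
		return false;
	/* The match must end on a component boundary:
	 * "/a/bc" is not under "/a/b". */
	return path[n] == '\0' || path[n] == '/';
}

/* Models the two hierarchy conditions listed in the patch description. */
static bool setns_would_be_allowed(const char *task_cgroup,
				   const char *current_ns_root,
				   const char *target_ns_root)
{
	/* The task must already sit under the target cgroupns-root ... */
	if (!path_is_descendant(task_cgroup, target_ns_root))
		return false;
	/* ... and the target root must be the same as or deeper than the
	 * task's current cgroupns-root. */
	return path_is_descendant(target_ns_root, current_ns_root);
}
```

With these helpers, a task in ".../session-c12.scope" asking to join a namespace rooted at ".../session-c12.scope/x1" fails the first check, which is the case raised in the question. Once the task has been moved into ".../session-c12.scope/x1", both checks pass.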

>
>> * target cgroupns-root is same as or deeper than task's current
>>   cgroupns-root. This is so that the task cannot escape out of its
>>   cgroupns-root. This also ensures that setns() only makes the task
>>   get restricted to a deeper cgroup hierarchy.
>>
>> Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
>> ---
>>  kernel/cgroup_namespace.c | 44 ++++++++++++++++++++++++++++++++++++++++++--
>>  1 file changed, 42 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/cgroup_namespace.c b/kernel/cgroup_namespace.c
>> index c16604f..c612946 100644
>> --- a/kernel/cgroup_namespace.c
>> +++ b/kernel/cgroup_namespace.c
>> @@ -80,8 +80,48 @@ err_out:
>>
>>  static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
>>  {
>> -     pr_info("setns not supported for cgroup namespace");
>> -     return -EINVAL;
>> +     struct cgroup_namespace *cgroup_ns = ns;
>> +     struct task_struct *task = current;
>> +     struct cgroup *cgrp = NULL;
>> +     int err = 0;
>> +
>> +     if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN) ||
>> +         !ns_capable(cgroup_ns->user_ns, CAP_SYS_ADMIN))
>> +             return -EPERM;
>> +
>> +     /* Prevent cgroup changes for this task. */
>> +     threadgroup_lock(task);
>> +
>> +     cgrp = get_task_cgroup(task);
>> +
>> +     err = -EINVAL;
>> +     if (!cgroup_on_dfl(cgrp))
>> +             goto out_unlock;
>> +
>> +     /* Allow switch only if the task's current cgroup is descendant of the
>> +      * target cgroup_ns->root_cgrp.
>> +      */
>> +     if (!cgroup_is_descendant(cgrp, cgroup_ns->root_cgrp))
>> +             goto out_unlock;
>> +
>> +     /* Only allow setns to a cgroupns root-ed deeper than task's current
>> +      * cgroupns-root. This will make sure that tasks cannot escape their
>> +      * cgroupns by attaching to parent cgroupns.
>> +      */
>> +     if (!cgroup_is_descendant(cgroup_ns->root_cgrp,
>> +                               task_cgroupns_root(task)))
>> +             goto out_unlock;
>> +
>> +     err = 0;
>> +     get_cgroup_ns(cgroup_ns);
>> +     put_cgroup_ns(nsproxy->cgroup_ns);
>> +     nsproxy->cgroup_ns = cgroup_ns;
>> +
>> +out_unlock:
>> +     threadgroup_unlock(current);
>> +     if (cgrp)
>> +             cgroup_put(cgrp);
>> +     return err;
>>  }
>>
>>  static void *cgroupns_get(struct task_struct *task)
>> --
>> 2.1.0.rc2.206.gedb03e5
>>



-- 
Aditya


* Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support
@ 2014-10-16 21:22             ` Aditya Kali
  0 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-10-16 21:22 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Tejun Heo, Li Zefan, Serge Hallyn, Andy Lutomirski, cgroups,
	linux-kernel, Linux API, Ingo Molnar, Linux Containers,
	Rohit Jnagal

On Thu, Oct 16, 2014 at 2:12 PM, Serge E. Hallyn <serge@hallyn.com> wrote:
> Quoting Aditya Kali (adityakali@google.com):
>> setns on a cgroup namespace is allowed only if
>> * task has CAP_SYS_ADMIN in its current user-namespace and
>>   over the user-namespace associated with target cgroupns.
>> * task's current cgroup is descendent of the target cgroupns-root
>>   cgroup.
>
> What is the point of this?
>
> If I'm a user logged into
> /lxc/c1/user.slice/user-1000.slice/session-c12.scope and I start
> a container which is in
> /lxc/c1/user.slice/user-1000.slice/session-c12.scope/x1
> then I will want to be able to enter the container's cgroup.
> The container's cgroup root is under my own (satisfying the
> below condition) but my cgroup is not a descendent of the
> container's cgroup.
>
This condition is there because we don't want to do implicit cgroup
changes when a process attaches to another cgroupns. cgroupns tries to
preserve the invariant that at any point, your current cgroup is
always under the cgroupns-root of your cgroup namespace. But in your
example, if we allow a process in "session-c12.scope" container to
attach to cgroupns root'ed at "session-c12.scope/x1" container
(without implicitly moving its cgroup), then this invariant won't
hold.

>
>> * target cgroupns-root is same as or deeper than task's current
>>   cgroupns-root. This is so that the task cannot escape out of its
>>   cgroupns-root. This also ensures that setns() only makes the task
>>   get restricted to a deeper cgroup hierarchy.
>>
>> Signed-off-by: Aditya Kali <adityakali@google.com>
>> ---
>>  kernel/cgroup_namespace.c | 44 ++++++++++++++++++++++++++++++++++++++++++--
>>  1 file changed, 42 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/cgroup_namespace.c b/kernel/cgroup_namespace.c
>> index c16604f..c612946 100644
>> --- a/kernel/cgroup_namespace.c
>> +++ b/kernel/cgroup_namespace.c
>> @@ -80,8 +80,48 @@ err_out:
>>
>>  static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
>>  {
>> -     pr_info("setns not supported for cgroup namespace");
>> -     return -EINVAL;
>> +     struct cgroup_namespace *cgroup_ns = ns;
>> +     struct task_struct *task = current;
>> +     struct cgroup *cgrp = NULL;
>> +     int err = 0;
>> +
>> +     if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN) ||
>> +         !ns_capable(cgroup_ns->user_ns, CAP_SYS_ADMIN))
>> +             return -EPERM;
>> +
>> +     /* Prevent cgroup changes for this task. */
>> +     threadgroup_lock(task);
>> +
>> +     cgrp = get_task_cgroup(task);
>> +
>> +     err = -EINVAL;
>> +     if (!cgroup_on_dfl(cgrp))
>> +             goto out_unlock;
>> +
>> +     /* Allow switch only if the task's current cgroup is descendant of the
>> +      * target cgroup_ns->root_cgrp.
>> +      */
>> +     if (!cgroup_is_descendant(cgrp, cgroup_ns->root_cgrp))
>> +             goto out_unlock;
>> +
>> +     /* Only allow setns to a cgroupns root-ed deeper than task's current
>> +      * cgroupns-root. This will make sure that tasks cannot escape their
>> +      * cgroupns by attaching to parent cgroupns.
>> +      */
>> +     if (!cgroup_is_descendant(cgroup_ns->root_cgrp,
>> +                               task_cgroupns_root(task)))
>> +             goto out_unlock;
>> +
>> +     err = 0;
>> +     get_cgroup_ns(cgroup_ns);
>> +     put_cgroup_ns(nsproxy->cgroup_ns);
>> +     nsproxy->cgroup_ns = cgroup_ns;
>> +
>> +out_unlock:
>> +     threadgroup_unlock(current);
>> +     if (cgrp)
>> +             cgroup_put(cgrp);
>> +     return err;
>>  }
>>
>>  static void *cgroupns_get(struct task_struct *task)
>> --
>> 2.1.0.rc2.206.gedb03e5
>>



-- 
Aditya


* Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support
  2014-10-16 21:22             ` Aditya Kali
@ 2014-10-16 21:47                 ` Serge E. Hallyn
  -1 siblings, 0 replies; 384+ messages in thread
From: Serge E. Hallyn @ 2014-10-16 21:47 UTC (permalink / raw)
  To: Aditya Kali
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andy Lutomirski, Tejun Heo,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Ingo Molnar

Quoting Aditya Kali (adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org):
> On Thu, Oct 16, 2014 at 2:12 PM, Serge E. Hallyn <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> wrote:
> > Quoting Aditya Kali (adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org):
> >> setns on a cgroup namespace is allowed only if
> >> * task has CAP_SYS_ADMIN in its current user-namespace and
> >>   over the user-namespace associated with target cgroupns.
> >> * task's current cgroup is descendent of the target cgroupns-root
> >>   cgroup.
> >
> > What is the point of this?
> >
> > If I'm a user logged into
> > /lxc/c1/user.slice/user-1000.slice/session-c12.scope and I start
> > a container which is in
> > /lxc/c1/user.slice/user-1000.slice/session-c12.scope/x1
> > then I will want to be able to enter the container's cgroup.
> > The container's cgroup root is under my own (satisfying the
> > below condition) but my cgroup is not a descendent of the
> > container's cgroup.
> >
> This condition is there because we don't want to do implicit cgroup
> changes when a process attaches to another cgroupns. cgroupns tries to
> preserve the invariant that at any point, your current cgroup is
> always under the cgroupns-root of your cgroup namespace. But in your
> example, if we allow a process in "session-c12.scope" container to
> attach to cgroupns root'ed at "session-c12.scope/x1" container
> (without implicitly moving its cgroup), then this invariant won't
> hold.

Oh, I see.  Guess that should be workable.  Thanks.

-serge


* Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support
@ 2014-10-16 21:47                 ` Serge E. Hallyn
  0 siblings, 0 replies; 384+ messages in thread
From: Serge E. Hallyn @ 2014-10-16 21:47 UTC (permalink / raw)
  To: Aditya Kali
  Cc: Serge E. Hallyn, Tejun Heo, Li Zefan, Serge Hallyn,
	Andy Lutomirski, cgroups, linux-kernel, Linux API, Ingo Molnar,
	Linux Containers, Rohit Jnagal

Quoting Aditya Kali (adityakali@google.com):
> On Thu, Oct 16, 2014 at 2:12 PM, Serge E. Hallyn <serge@hallyn.com> wrote:
> > Quoting Aditya Kali (adityakali@google.com):
> >> setns on a cgroup namespace is allowed only if
> >> * task has CAP_SYS_ADMIN in its current user-namespace and
> >>   over the user-namespace associated with target cgroupns.
> >> * task's current cgroup is descendent of the target cgroupns-root
> >>   cgroup.
> >
> > What is the point of this?
> >
> > If I'm a user logged into
> > /lxc/c1/user.slice/user-1000.slice/session-c12.scope and I start
> > a container which is in
> > /lxc/c1/user.slice/user-1000.slice/session-c12.scope/x1
> > then I will want to be able to enter the container's cgroup.
> > The container's cgroup root is under my own (satisfying the
> > below condition) but my cgroup is not a descendent of the
> > container's cgroup.
> >
> This condition is there because we don't want to do implicit cgroup
> changes when a process attaches to another cgroupns. cgroupns tries to
> preserve the invariant that at any point, your current cgroup is
> always under the cgroupns-root of your cgroup namespace. But in your
> example, if we allow a process in "session-c12.scope" container to
> attach to cgroupns root'ed at "session-c12.scope/x1" container
> (without implicitly moving its cgroup), then this invariant won't
> hold.

Oh, I see.  Guess that should be workable.  Thanks.

-serge


* Re: [PATCHv1 6/8] cgroup: restrict cgroup operations within task's cgroupns
  2014-10-13 21:23       ` Aditya Kali
@ 2014-10-17  9:28           ` Serge E. Hallyn
  -1 siblings, 0 replies; 384+ messages in thread
From: Serge E. Hallyn @ 2014-10-17  9:28 UTC (permalink / raw)
  To: Aditya Kali
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, luto-kltTT9wpgjJwATOyAt5JVQ,
	tj-DgEjT+Ai2ygdnm+yROfE0A, cgroups-u79uwXL29TY76Z2rM5mHXA,
	mingo-H+wXaHxf7aLQT0dZR+AlfA

Quoting Aditya Kali (adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org):
> Restrict following operations within the calling tasks:
> * cgroup_mkdir & cgroup_rmdir
> * cgroup_attach_task
> * writes to cgroup files outside of task's cgroupns-root
> 
> Also, read of /proc/<pid>/cgroup file is now restricted only
> to tasks under same cgroupns-root. If a task tries to look
> at cgroup of another task outside of its cgroupns-root, then
> it won't be able to see anything for the default hierarchy.
> This is same as if the cgroups are not mounted.
> 
> Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>

So this is a bit different from some other namespaces - if I
have an open fd to a file, then setns into a mntns where that
file is not addressable, I can still use the file.

I guess not allowing attach to a cgroup outside our ns is a
good failsafe as we'll otherwise risk falling off a cliff in
some code, but I'm not sure the cgroup_file_write/mkdir/rmdir
restrictions are needed.  (And really I can fchdir to a
directory not in my ns, so the cgroup-attach restriction isn't
any more justified).

Still I'm not strictly opposed to this, so

Acked-by: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>

just wanted to point this out.
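
For reference, the checks added by the quoted patch can be summarised in a small userspace model. This is a sketch under stated assumptions: struct model_cgroup, enum model_op and model_check_op() are hypothetical names, and the parent-pointer walk only imitates what the patch calls cgroup_is_descendant(). The error codes mirror the quoted diff, where the attach, mkdir and rmdir paths return -EPERM and the file-write path returns -EINVAL.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model: each cgroup only knows its parent (NULL for the
 * hierarchy root). */
struct model_cgroup {
	const struct model_cgroup *parent;
};

/* True if 'cg' is 'ancestor' itself or lies anywhere below it. */
static bool model_is_descendant(const struct model_cgroup *cg,
				const struct model_cgroup *ancestor)
{
	for (; cg; cg = cg->parent)
		if (cg == ancestor)
			return true;
	return false;
}

enum model_op { MODEL_ATTACH, MODEL_FILE_WRITE, MODEL_MKDIR, MODEL_RMDIR };

/* Returns 0 if the operation targets a cgroup inside the caller's
 * cgroupns-root, otherwise the errno used in the quoted diff. */
static int model_check_op(enum model_op op,
			  const struct model_cgroup *target,
			  const struct model_cgroup *ns_root)
{
	assert(ns_root != NULL);
	if (model_is_descendant(target, ns_root))
		return 0;
	return op == MODEL_FILE_WRITE ? -EINVAL : -EPERM;
}
```

The model makes one detail of the diff easy to see: the write path reports -EINVAL while the other three paths report -EPERM for the same kind of violation.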

> ---
>  kernel/cgroup.c | 34 +++++++++++++++++++++++++++++++++-
>  1 file changed, 33 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index f8099b4..2fc0dfa 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -2318,6 +2318,12 @@ static int cgroup_attach_task(struct cgroup *dst_cgrp,
>  	struct task_struct *task;
>  	int ret;
>  
> +	/* Only allow changing cgroups accessible within task's cgroup
> +	 * namespace. i.e. 'dst_cgrp' should be a descendant of task's
> +	 * cgroupns->root_cgrp. */
> +	if (!cgroup_is_descendant(dst_cgrp, task_cgroupns_root(leader)))
> +		return -EPERM;
> +
>  	/* look up all src csets */
>  	down_read(&css_set_rwsem);
>  	rcu_read_lock();
> @@ -2882,6 +2888,10 @@ static ssize_t cgroup_file_write(struct kernfs_open_file *of, char *buf,
>  	struct cgroup_subsys_state *css;
>  	int ret;
>  
> +	/* Reject writes to cgroup files outside of task's cgroupns-root. */
> +	if (!cgroup_is_descendant(cgrp, task_cgroupns_root(current)))
> +		return -EINVAL;
> +
>  	if (cft->write)
>  		return cft->write(of, buf, nbytes, off);
>  
> @@ -4560,6 +4570,13 @@ static int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
>  	parent = cgroup_kn_lock_live(parent_kn);
>  	if (!parent)
>  		return -ENODEV;
> +
> +	/* Allow mkdir only within process's cgroup namespace root. */
> +	if (!cgroup_is_descendant(parent, task_cgroupns_root(current))) {
> +		ret = -EPERM;
> +		goto out_unlock;
> +	}
> +
>  	root = parent->root;
>  
>  	/* allocate the cgroup and its ID, 0 is reserved for the root */
> @@ -4822,6 +4839,13 @@ static int cgroup_rmdir(struct kernfs_node *kn)
>  	if (!cgrp)
>  		return 0;
>  
> +	/* Allow rmdir only within process's cgroup namespace root.
> +	 * The process can't delete its own root anyways. */
> +	if (!cgroup_is_descendant(cgrp, task_cgroupns_root(current))) {
> +		cgroup_kn_unlock(kn);
> +		return -EPERM;
> +	}
> +
>  	ret = cgroup_destroy_locked(cgrp);
>  
>  	cgroup_kn_unlock(kn);
> @@ -5051,6 +5075,15 @@ int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns,
>  		if (root == &cgrp_dfl_root && !cgrp_dfl_root_visible)
>  			continue;
>  
> +		cgrp = task_cgroup_from_root(tsk, root);
> +
> +		/* The cgroup path on default hierarchy is shown only if it
> +		 * falls under current task's cgroupns-root.
> +		 */
> +		if (root == &cgrp_dfl_root &&
> +		    !cgroup_is_descendant(cgrp, task_cgroupns_root(current)))
> +			continue;
> +
>  		seq_printf(m, "%d:", root->hierarchy_id);
>  		for_each_subsys(ss, ssid)
>  			if (root->subsys_mask & (1 << ssid))
> @@ -5059,7 +5092,6 @@ int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns,
>  			seq_printf(m, "%sname=%s", count ? "," : "",
>  				   root->name);
>  		seq_putc(m, ':');
> -		cgrp = task_cgroup_from_root(tsk, root);
>  		path = cgroup_path(cgrp, buf, PATH_MAX);
>  		if (!path) {
>  			retval = -ENAMETOOLONG;
> -- 
> 2.1.0.rc2.206.gedb03e5
> 
> _______________________________________________
> Containers mailing list
> Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers


* Re: [PATCHv1 6/8] cgroup: restrict cgroup operations within task's cgroupns
@ 2014-10-17  9:28           ` Serge E. Hallyn
  0 siblings, 0 replies; 384+ messages in thread
From: Serge E. Hallyn @ 2014-10-17  9:28 UTC (permalink / raw)
  To: Aditya Kali
  Cc: tj, lizefan, serge.hallyn, luto, cgroups, linux-kernel,
	linux-api, mingo, containers

Quoting Aditya Kali (adityakali@google.com):
> Restrict following operations within the calling tasks:
> * cgroup_mkdir & cgroup_rmdir
> * cgroup_attach_task
> * writes to cgroup files outside of task's cgroupns-root
> 
> Also, read of /proc/<pid>/cgroup file is now restricted only
> to tasks under same cgroupns-root. If a task tries to look
> at cgroup of another task outside of its cgroupns-root, then
> it won't be able to see anything for the default hierarchy.
> This is same as if the cgroups are not mounted.
> 
> Signed-off-by: Aditya Kali <adityakali@google.com>

So this is a bit different from some other namespaces - if I
have an open fd to a file, then setns into a mntns where that
file is not addressable, I can still use the file.

I guess not allowing attach to a cgroup outside our ns is a
good failsafe as we'll otherwise risk falling off a cliff in
some code, but I'm not sure the cgroup_file_write/mkdir/rmdir
restrictions are needed.  (And really I can fchdir to a
directory not in my ns, so the cgroup-attach restriction isn't
any more justified).

Still I'm not strictly opposed to this, so

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>

just wanted to point this out.

> ---
>  kernel/cgroup.c | 34 +++++++++++++++++++++++++++++++++-
>  1 file changed, 33 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index f8099b4..2fc0dfa 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -2318,6 +2318,12 @@ static int cgroup_attach_task(struct cgroup *dst_cgrp,
>  	struct task_struct *task;
>  	int ret;
>  
> +	/* Only allow changing cgroups accessible within task's cgroup
> +	 * namespace. i.e. 'dst_cgrp' should be a descendant of task's
> +	 * cgroupns->root_cgrp. */
> +	if (!cgroup_is_descendant(dst_cgrp, task_cgroupns_root(leader)))
> +		return -EPERM;
> +
>  	/* look up all src csets */
>  	down_read(&css_set_rwsem);
>  	rcu_read_lock();
> @@ -2882,6 +2888,10 @@ static ssize_t cgroup_file_write(struct kernfs_open_file *of, char *buf,
>  	struct cgroup_subsys_state *css;
>  	int ret;
>  
> +	/* Reject writes to cgroup files outside of task's cgroupns-root. */
> +	if (!cgroup_is_descendant(cgrp, task_cgroupns_root(current)))
> +		return -EINVAL;
> +
>  	if (cft->write)
>  		return cft->write(of, buf, nbytes, off);
>  
> @@ -4560,6 +4570,13 @@ static int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
>  	parent = cgroup_kn_lock_live(parent_kn);
>  	if (!parent)
>  		return -ENODEV;
> +
> +	/* Allow mkdir only within process's cgroup namespace root. */
> +	if (!cgroup_is_descendant(parent, task_cgroupns_root(current))) {
> +		ret = -EPERM;
> +		goto out_unlock;
> +	}
> +
>  	root = parent->root;
>  
>  	/* allocate the cgroup and its ID, 0 is reserved for the root */
> @@ -4822,6 +4839,13 @@ static int cgroup_rmdir(struct kernfs_node *kn)
>  	if (!cgrp)
>  		return 0;
>  
> +	/* Allow rmdir only within process's cgroup namespace root.
> +	 * The process can't delete its own root anyways. */
> +	if (!cgroup_is_descendant(cgrp, task_cgroupns_root(current))) {
> +		cgroup_kn_unlock(kn);
> +		return -EPERM;
> +	}
> +
>  	ret = cgroup_destroy_locked(cgrp);
>  
>  	cgroup_kn_unlock(kn);
> @@ -5051,6 +5075,15 @@ int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns,
>  		if (root == &cgrp_dfl_root && !cgrp_dfl_root_visible)
>  			continue;
>  
> +		cgrp = task_cgroup_from_root(tsk, root);
> +
> +		/* The cgroup path on default hierarchy is shown only if it
> +		 * falls under current task's cgroupns-root.
> +		 */
> +		if (root == &cgrp_dfl_root &&
> +		    !cgroup_is_descendant(cgrp, task_cgroupns_root(current)))
> +			continue;
> +
>  		seq_printf(m, "%d:", root->hierarchy_id);
>  		for_each_subsys(ss, ssid)
>  			if (root->subsys_mask & (1 << ssid))
> @@ -5059,7 +5092,6 @@ int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns,
>  			seq_printf(m, "%sname=%s", count ? "," : "",
>  				   root->name);
>  		seq_putc(m, ':');
> -		cgrp = task_cgroup_from_root(tsk, root);
>  		path = cgroup_path(cgrp, buf, PATH_MAX);
>  		if (!path) {
>  			retval = -ENAMETOOLONG;
> -- 
> 2.1.0.rc2.206.gedb03e5
> 


* Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support
  2014-10-13 21:23     ` Aditya Kali
@ 2014-10-17  9:52         ` Serge E. Hallyn
  -1 siblings, 0 replies; 384+ messages in thread
From: Serge E. Hallyn @ 2014-10-17  9:52 UTC (permalink / raw)
  To: Aditya Kali
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, luto-kltTT9wpgjJwATOyAt5JVQ,
	tj-DgEjT+Ai2ygdnm+yROfE0A, cgroups-u79uwXL29TY76Z2rM5mHXA,
	mingo-H+wXaHxf7aLQT0dZR+AlfA

Quoting Aditya Kali (adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org):
> setns on a cgroup namespace is allowed only if
> * task has CAP_SYS_ADMIN in its current user-namespace and
>   over the user-namespace associated with target cgroupns.
> * task's current cgroup is descendent of the target cgroupns-root
>   cgroup.
> * target cgroupns-root is same as or deeper than task's current
>   cgroupns-root. This is so that the task cannot escape out of its
>   cgroupns-root. This also ensures that setns() only makes the task
>   get restricted to a deeper cgroup hierarchy.
> 
> Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>

Acked-by: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>

Below you allow setns to your own cgroupns.  I think that's fine,
but since you're not doing an explicit cgroup change anyway should
you just return 0 at top in that case to save some cpu time?
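
The suggested shortcut might look roughly like the following userspace model. It is only a sketch: struct model_ns, struct model_nsproxy and model_install() are hypothetical stand-ins for the kernel structures, and the slow_path_runs counter stands in for the locking and hierarchy checks of the real function.

```c
#include <assert.h>

/* Userspace model of a reference-counted namespace object. */
struct model_ns {
	int refcount;
};

struct model_nsproxy {
	struct model_ns *cgroup_ns;
};

/* Installs 'target' into 'proxy'. '*slow_path_runs' is incremented each
 * time the expensive part (locking, hierarchy checks) would run. */
static int model_install(struct model_nsproxy *proxy, struct model_ns *target,
			 int *slow_path_runs)
{
	assert(proxy && proxy->cgroup_ns && target && slow_path_runs);

	/* Shortcut: switching to the namespace we are already in is a no-op. */
	if (proxy->cgroup_ns == target)
		return 0;

	(*slow_path_runs)++;

	/* Take a reference on the new namespace, drop the old one. */
	target->refcount++;
	proxy->cgroup_ns->refcount--;
	proxy->cgroup_ns = target;
	return 0;
}
```

Whether such a shortcut belongs before or after the CAP_SYS_ADMIN checks is a policy question this message does not settle. Placing it first, as in the sketch, would let a task without the capability "succeed" at a no-op setns.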

> ---
>  kernel/cgroup_namespace.c | 44 ++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 42 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/cgroup_namespace.c b/kernel/cgroup_namespace.c
> index c16604f..c612946 100644
> --- a/kernel/cgroup_namespace.c
> +++ b/kernel/cgroup_namespace.c
> @@ -80,8 +80,48 @@ err_out:
>  
>  static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
>  {
> -	pr_info("setns not supported for cgroup namespace");
> -	return -EINVAL;
> +	struct cgroup_namespace *cgroup_ns = ns;
> +	struct task_struct *task = current;
> +	struct cgroup *cgrp = NULL;
> +	int err = 0;
> +
> +	if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN) ||
> +	    !ns_capable(cgroup_ns->user_ns, CAP_SYS_ADMIN))
> +		return -EPERM;
> +
> +	/* Prevent cgroup changes for this task. */
> +	threadgroup_lock(task);
> +
> +	cgrp = get_task_cgroup(task);
> +
> +	err = -EINVAL;
> +	if (!cgroup_on_dfl(cgrp))
> +		goto out_unlock;
> +
> +	/* Allow switch only if the task's current cgroup is descendant of the
> +	 * target cgroup_ns->root_cgrp.
> +	 */
> +	if (!cgroup_is_descendant(cgrp, cgroup_ns->root_cgrp))
> +		goto out_unlock;
> +
> +	/* Only allow setns to a cgroupns root-ed deeper than task's current
> +	 * cgroupns-root. This will make sure that tasks cannot escape their
> +	 * cgroupns by attaching to parent cgroupns.
> +	 */
> +	if (!cgroup_is_descendant(cgroup_ns->root_cgrp,
> +				  task_cgroupns_root(task)))
> +		goto out_unlock;
> +
> +	err = 0;
> +	get_cgroup_ns(cgroup_ns);
> +	put_cgroup_ns(nsproxy->cgroup_ns);
> +	nsproxy->cgroup_ns = cgroup_ns;
> +
> +out_unlock:
> +	threadgroup_unlock(current);
> +	if (cgrp)
> +		cgroup_put(cgrp);
> +	return err;
>  }
>  
>  static void *cgroupns_get(struct task_struct *task)
> -- 
> 2.1.0.rc2.206.gedb03e5
> 


* Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support
@ 2014-10-17  9:52         ` Serge E. Hallyn
  0 siblings, 0 replies; 384+ messages in thread
From: Serge E. Hallyn @ 2014-10-17  9:52 UTC (permalink / raw)
  To: Aditya Kali
  Cc: tj, lizefan, serge.hallyn, luto, cgroups, linux-kernel,
	linux-api, mingo, containers, jnagal

Quoting Aditya Kali (adityakali@google.com):
> setns on a cgroup namespace is allowed only if
> * task has CAP_SYS_ADMIN in its current user-namespace and
>   over the user-namespace associated with target cgroupns.
> * task's current cgroup is descendent of the target cgroupns-root
>   cgroup.
> * target cgroupns-root is same as or deeper than task's current
>   cgroupns-root. This is so that the task cannot escape out of its
>   cgroupns-root. This also ensures that setns() only makes the task
>   get restricted to a deeper cgroup hierarchy.
> 
> Signed-off-by: Aditya Kali <adityakali@google.com>

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>

Below you allow setns to your own cgroupns.  I think that's fine,
but since you're not doing an explicit cgroup change anyway should
you just return 0 at top in that case to save some cpu time?

> ---
>  kernel/cgroup_namespace.c | 44 ++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 42 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/cgroup_namespace.c b/kernel/cgroup_namespace.c
> index c16604f..c612946 100644
> --- a/kernel/cgroup_namespace.c
> +++ b/kernel/cgroup_namespace.c
> @@ -80,8 +80,48 @@ err_out:
>  
>  static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
>  {
> -	pr_info("setns not supported for cgroup namespace");
> -	return -EINVAL;
> +	struct cgroup_namespace *cgroup_ns = ns;
> +	struct task_struct *task = current;
> +	struct cgroup *cgrp = NULL;
> +	int err = 0;
> +
> +	if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN) ||
> +	    !ns_capable(cgroup_ns->user_ns, CAP_SYS_ADMIN))
> +		return -EPERM;
> +
> +	/* Prevent cgroup changes for this task. */
> +	threadgroup_lock(task);
> +
> +	cgrp = get_task_cgroup(task);
> +
> +	err = -EINVAL;
> +	if (!cgroup_on_dfl(cgrp))
> +		goto out_unlock;
> +
> +	/* Allow switch only if the task's current cgroup is descendant of the
> +	 * target cgroup_ns->root_cgrp.
> +	 */
> +	if (!cgroup_is_descendant(cgrp, cgroup_ns->root_cgrp))
> +		goto out_unlock;
> +
> +	/* Only allow setns to a cgroupns root-ed deeper than task's current
> +	 * cgroupns-root. This will make sure that tasks cannot escape their
> +	 * cgroupns by attaching to parent cgroupns.
> +	 */
> +	if (!cgroup_is_descendant(cgroup_ns->root_cgrp,
> +				  task_cgroupns_root(task)))
> +		goto out_unlock;
> +
> +	err = 0;
> +	get_cgroup_ns(cgroup_ns);
> +	put_cgroup_ns(nsproxy->cgroup_ns);
> +	nsproxy->cgroup_ns = cgroup_ns;
> +
> +out_unlock:
> +	threadgroup_unlock(current);
> +	if (cgrp)
> +		cgroup_put(cgrp);
> +	return err;
>  }
>  
>  static void *cgroupns_get(struct task_struct *task)
> -- 
> 2.1.0.rc2.206.gedb03e5
> 


* Re: [PATCHv1 8/8] cgroup: mount cgroupns-root when inside non-init cgroupns
  2014-10-13 21:23   ` [PATCHv1 8/8] cgroup: mount cgroupns-root when inside non-init cgroupns Aditya Kali
@ 2014-10-17 12:19         ` Serge E. Hallyn
  0 siblings, 0 replies; 384+ messages in thread
From: Serge E. Hallyn @ 2014-10-17 12:19 UTC (permalink / raw)
  To: Aditya Kali
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, luto-kltTT9wpgjJwATOyAt5JVQ,
	tj-DgEjT+Ai2ygdnm+yROfE0A, cgroups-u79uwXL29TY76Z2rM5mHXA,
	mingo-H+wXaHxf7aLQT0dZR+AlfA

Quoting Aditya Kali (adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org):
> This patch enables cgroup mounting inside userns when a process
> has appropriate privileges. The cgroup filesystem mounted is
> rooted at the cgroupns-root. Thus, in a container-setup, only
> the hierarchy under the cgroupns-root is exposed inside the container.
> This allows container management tools to run inside the containers
> without depending on any global state.
> In order to support this, a new kernfs API is added to look up the
> dentry for the cgroupns-root.
> 
> Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>

Acked-by: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>

> ---
>  fs/kernfs/mount.c      | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/kernfs.h |  2 ++
>  kernel/cgroup.c        | 47 +++++++++++++++++++++++++++++++++++++++++++++--
>  3 files changed, 95 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
> index f973ae9..e334f45 100644
> --- a/fs/kernfs/mount.c
> +++ b/fs/kernfs/mount.c
> @@ -62,6 +62,54 @@ struct kernfs_root *kernfs_root_from_sb(struct super_block *sb)
>  	return NULL;
>  }
>  
> +/**
> + * kernfs_obtain_root - create new root dentry for the given kernfs_node.
> + * @sb: the kernfs super_block
> + * @kn: kernfs_node for which a dentry is needed
> + *
> + * This can be used by callers which want to mount only a part of the kernfs
> + * as root of the filesystem.
> + */
> +struct dentry *kernfs_obtain_root(struct super_block *sb,
> +				  struct kernfs_node *kn)
> +{
> +	struct dentry *dentry;
> +	struct inode *inode;
> +
> +	BUG_ON(sb->s_op != &kernfs_sops);
> +
> +	/* inode for the given kernfs_node should already exist. */
> +	inode = ilookup(sb, kn->ino);
> +	if (!inode) {
> +		pr_debug("kernfs: could not get inode for '");
> +		pr_cont_kernfs_path(kn);
> +		pr_cont("'.\n");
> +		return ERR_PTR(-EINVAL);
> +	}
> +
> +	/* instantiate and link root dentry */
> +	dentry = d_obtain_root(inode);
> +	if (IS_ERR(dentry)) {
> +		pr_debug("kernfs: could not get dentry for '");
> +		pr_cont_kernfs_path(kn);
> +		pr_cont("'.\n");
> +		return dentry;
> +	}
> +
> +	/* If this is a new dentry, set it up. We need kernfs_mutex because this
> +	 * may be called by callers other than kernfs_fill_super. */
> +	mutex_lock(&kernfs_mutex);
> +	if (!dentry->d_fsdata) {
> +		kernfs_get(kn);
> +		dentry->d_fsdata = kn;
> +	} else {
> +		WARN_ON(dentry->d_fsdata != kn);
> +	}
> +	mutex_unlock(&kernfs_mutex);
> +
> +	return dentry;
> +}
> +
>  static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
>  {
>  	struct kernfs_super_info *info = kernfs_info(sb);
> diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
> index 3c2be75..b9538e0 100644
> --- a/include/linux/kernfs.h
> +++ b/include/linux/kernfs.h
> @@ -274,6 +274,8 @@ void kernfs_put(struct kernfs_node *kn);
>  struct kernfs_node *kernfs_node_from_dentry(struct dentry *dentry);
>  struct kernfs_root *kernfs_root_from_sb(struct super_block *sb);
>  
> +struct dentry *kernfs_obtain_root(struct super_block *sb,
> +				  struct kernfs_node *kn);
>  struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
>  				       unsigned int flags, void *priv);
>  void kernfs_destroy_root(struct kernfs_root *root);
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 2fc0dfa..ef27dc4 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -1302,6 +1302,13 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
>  
>  	memset(opts, 0, sizeof(*opts));
>  
> +	/* Implicitly add CGRP_ROOT_SANE_BEHAVIOR if inside a non-init cgroup
> +	 * namespace.
> +	 */
> +	if (current->nsproxy->cgroup_ns != &init_cgroup_ns) {
> +		opts->flags |= CGRP_ROOT_SANE_BEHAVIOR;
> +	}
> +
>  	while ((token = strsep(&o, ",")) != NULL) {
>  		nr_opts++;
>  
> @@ -1391,7 +1398,7 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
>  
>  	if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
>  		pr_warn("sane_behavior: this is still under development and its behaviors will change, proceed at your own risk\n");
> -		if (nr_opts != 1) {
> +		if (nr_opts > 1) {
>  			pr_err("sane_behavior: no other mount options allowed\n");
>  			return -EINVAL;
>  		}
> @@ -1581,6 +1588,15 @@ static void init_cgroup_root(struct cgroup_root *root,
>  		set_bit(CGRP_CPUSET_CLONE_CHILDREN, &root->cgrp.flags);
>  }
>  
> +struct dentry *cgroupns_get_root(struct super_block *sb,
> +				 struct cgroup_namespace *ns)
> +{
> +	struct dentry *nsdentry;
> +
> +	nsdentry = kernfs_obtain_root(sb, ns->root_cgrp->kn);
> +	return nsdentry;
> +}
> +
>  static int cgroup_setup_root(struct cgroup_root *root, unsigned int ss_mask)
>  {
>  	LIST_HEAD(tmp_links);
> @@ -1684,6 +1700,14 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
>  	int ret;
>  	int i;
>  	bool new_sb;
> +	struct cgroup_namespace *ns =
> +		get_cgroup_ns(current->nsproxy->cgroup_ns);
> +
> +	/* Check if the caller has permission to mount. */
> +	if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) {
> +		put_cgroup_ns(ns);
> +		return ERR_PTR(-EPERM);
> +	}
>  
>  	/*
>  	 * The first time anyone tries to mount a cgroup, enable the list
> @@ -1816,11 +1840,28 @@ out_free:
>  	kfree(opts.release_agent);
>  	kfree(opts.name);
>  
> -	if (ret)
> +	if (ret) {
> +		put_cgroup_ns(ns);
>  		return ERR_PTR(ret);
> +	}
>  
>  	dentry = kernfs_mount(fs_type, flags, root->kf_root,
>  				CGROUP_SUPER_MAGIC, &new_sb);
> +
> +	if (!IS_ERR(dentry)) {
> +		/* If this mount is for a non-init cgroup namespace, then
> +		 * instead of root's dentry, we return the dentry specific to
> +		 * the cgroupns->root_cgrp.
> +		 */
> +		if (ns != &init_cgroup_ns) {
> +			struct dentry *nsdentry;
> +
> +			nsdentry = cgroupns_get_root(dentry->d_sb, ns);
> +			dput(dentry);
> +			dentry = nsdentry;
> +		}
> +	}
> +
>  	if (IS_ERR(dentry) || !new_sb)
>  		cgroup_put(&root->cgrp);
>  
> @@ -1833,6 +1874,7 @@ out_free:
>  		deactivate_super(pinned_sb);
>  	}
>  
> +	put_cgroup_ns(ns);
>  	return dentry;
>  }
>  
> @@ -1861,6 +1903,7 @@ static struct file_system_type cgroup_fs_type = {
>  	.name = "cgroup",
>  	.mount = cgroup_mount,
>  	.kill_sb = cgroup_kill_sb,
> +	.fs_flags = FS_USERNS_MOUNT,
>  };
>  
>  static struct kobject *cgroup_kobj;
> -- 
> 2.1.0.rc2.206.gedb03e5
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
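
The change from "nr_opts != 1" to "nr_opts > 1" in
parse_cgroupfs_options() follows from the implicit flag: inside a
non-init cgroupns CGRP_ROOT_SANE_BEHAVIOR is set without any option
having been parsed. A minimal sketch of the two checks, assuming that a
plain "mount -t cgroup cgroup <mntpt>" passes no option string so that
nr_opts stays 0; sane_check_old() and sane_check_new() are hypothetical
names used only for this illustration:

```c
#include <errno.h>
#include <stdbool.h>

/* Check as it stood before the patch: exactly one option is required. */
static int sane_check_old(bool sane_behavior, int nr_opts)
{
	if (sane_behavior && nr_opts != 1)
		return -EINVAL;
	return 0;
}

/* Check after the patch: zero options or one option is accepted. */
static int sane_check_new(bool sane_behavior, int nr_opts)
{
	if (sane_behavior && nr_opts > 1)
		return -EINVAL;
	return 0;
}
```

Under that assumption the old check would reject an option-less mount
from inside a cgroupns (flag set, nr_opts == 0), while the new check
accepts it and still rejects any combination of sane_behavior with
other options.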

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv1 0/8] CGroup Namespaces
  2014-10-13 21:23   ` Aditya Kali
@ 2014-10-19  4:54       ` Eric W. Biederman
  -1 siblings, 0 replies; 384+ messages in thread
From: Eric W. Biederman @ 2014-10-19  4:54 UTC (permalink / raw)
  To: Aditya Kali
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, luto-kltTT9wpgjJwATOyAt5JVQ,
	tj-DgEjT+Ai2ygdnm+yROfE0A, cgroups-u79uwXL29TY76Z2rM5mHXA,
	mingo-H+wXaHxf7aLQT0dZR+AlfA

Aditya Kali <adityakali@google.com> writes:

> Second take at the Cgroup Namespace patch-set.
>
> Major changes from RFC (V0):
> 1. setns support for cgroupns
> 2. 'mount -t cgroup cgroup <mntpt>' from inside a cgroupns now
>    mounts the cgroup hierarchy with cgroupns-root as the filesystem root.
> 3. writes to cgroup files outside of cgroupns-root are not allowed
> 4. visibility of /proc/<pid>/cgroup is further restricted by not showing
>    anything if the <pid> is in a sibling cgroupns and its cgroup falls outside
>    your cgroupns-root.
>
> More details in the writeup below.

This definitely looks like the right direction to go, and something that
in some form or another I had been asking for since cgroups were merged.
So I am very glad to see this work moving forward.

I had hoped that we might just be able to be clever with remounting
cgroupfs but 2 things stand in the way.
1) /proc/<pid>/cgroup (but proc could capture that).
2) providing a hard guarantee that tasks stay within a subset of the
   cgroup hierarchy.

So I think this clearly meets the requirements for a new namespace.

We need to have the discussion on chmod of files on cgroupfs.  There is
a notion that has floated around that only systemd or only root (with
the appropriate capabilities) should be allowed to set resource limits
in cgroupfs.  In practice that is nonsense.  If an attribute is
properly bound in its hierarchy it should be safe to change.

Not all attributes are properly bound to the hierarchy, and some are, or
at least were, dangerous for anyone except root to set.  So I suggest
that a CFTYPE flag, perhaps CFTYPE_UNPRIV, be added for attributes that
are safe to allow anyone to set, and that CFTYPE_UNPRIV be required
before we chmod a cgroup attribute from root.

That would be complementary work, and not strictly tied to the cgroup
namespaces, but unprivileged cgroup namespaces don't make much sense
without that work.

Eric
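
One possible reading of the CFTYPE_UNPRIV suggestion above, sketched in
userspace. Everything here is an assumption made for illustration: the
flag value, struct model_cftype, and model_cft_may_chmod() are
hypothetical, and the sketch interprets "chmod from root" as granting
write access to group or other.

```c
#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical flag value; the proposal above does not define one. */
#define MODEL_CFTYPE_UNPRIV	(1U << 0)

/* Hypothetical stand-in for the relevant part of struct cftype. */
struct model_cftype {
	const char *name;
	unsigned int flags;
};

/*
 * Refuse a mode change that would let non-owners write the attribute
 * unless the attribute is marked safe for unprivileged use.
 */
static int model_cft_may_chmod(const struct model_cftype *cft, mode_t mode)
{
	if ((mode & (S_IWGRP | S_IWOTH)) &&
	    !(cft->flags & MODEL_CFTYPE_UNPRIV))
		return -EPERM;
	return 0;
}
```

Which attributes would actually carry the flag is exactly the open
question; the sketch only shows where such a check could sit.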

> Background
>   Cgroups and Namespaces are used together to create “virtual”
>   containers that isolate the host environment from the processes
>   running in the container. But since cgroups themselves are not
>   “virtualized”, the task is always able to see the global cgroup view
>   through the cgroupfs mount and via the /proc/self/cgroup file.
>
>   $ cat /proc/self/cgroup 
>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>
>   This exposure of cgroup names to the processes running inside a
>   container results in some problems:
>   (1) The container names are typically host-container-management-agent
>       (systemd, docker/libcontainer, etc.) data and leaking its name (or
>       leaking the hierarchy) reveals too much information about the host
>       system.
>   (2) It makes the container migration across machines (CRIU) more
>       difficult as the container names need to be unique across the
>       machines in the migration domain.
>   (3) It makes it difficult to run container management tools (like
>       docker/libcontainer, lmctfy, etc.) within virtual containers
>       without adding dependency on some state/agent present outside the
>       container.
>
>   Note that the feature proposed here is completely different than the
>   “ns cgroup” feature which existed in the linux kernel until recently.
>   The ns cgroup also attempted to connect cgroups and namespaces by
>   creating a new cgroup every time a new namespace was created. It did
>   not solve any of the above mentioned problems and was later dropped
>   from the kernel. Incidentally though, it used the same config option
>   name CONFIG_CGROUP_NS as used in my prototype!
>
> Introducing CGroup Namespaces
>   With unified cgroup hierarchy
>   (Documentation/cgroups/unified-hierarchy.txt), the containers can now
>   have a much more coherent cgroup view and it's easy to associate a
>   container with a single cgroup. This also allows us to virtualize the
>   cgroup view for tasks inside the container.
>
>   The new CGroup Namespace allows a process to “unshare” its cgroup
>   hierarchy starting from the cgroup it's currently in.
>   For Ex:
>   $ cat /proc/self/cgroup
>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>   $ ls -l /proc/self/ns/cgroup
>   lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
>   $ ~/unshare -c  # calls unshare(CLONE_NEWCGROUP) and exec’s /bin/bash
>   [ns]$ ls -l /proc/self/ns/cgroup
>   lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
>   # From within new cgroupns, process sees that it's in the root cgroup
>   [ns]$ cat /proc/self/cgroup
>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
>
>   # From global cgroupns:
>   $ cat /proc/<pid>/cgroup
>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>
>   # Unshare cgroupns along with userns and mountns
>   # Following calls unshare(CLONE_NEWCGROUP|CLONE_NEWUSER|CLONE_NEWNS), then
>   # sets up uid/gid map and exec’s /bin/bash
>   $ ~/unshare -c -u -m
>
>   # Originally, we were in /batchjobs/c_job_id1 cgroup. Mount our own cgroup
>   # hierarchy.
>   [ns]$ mount -t cgroup cgroup /tmp/cgroup
>   [ns]$ ls -l /tmp/cgroup
>   total 0
>   -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.controllers
>   -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.populated
>   -rw-r--r-- 1 root root 0 2014-10-13 09:25 cgroup.procs
>   -rw-r--r-- 1 root root 0 2014-10-13 09:32 cgroup.subtree_control
>
>   The cgroupns-root (/batchjobs/c_job_id1 in above example) becomes the
>   filesystem root for the namespace specific cgroupfs mount.
>
>   The virtualization of /proc/self/cgroup file combined with restricting
>   the view of cgroup hierarchy by namespace-private cgroupfs mount
>   should provide a completely isolated cgroup view inside the container.
>
>   In its current form, the cgroup namespaces patchset provides the following
>   behavior:
>
>   (1) The “root” cgroup for a cgroup namespace is the cgroup in which
>       the process calling unshare is running.
>       For ex. if a process in /batchjobs/c_job_id1 cgroup calls unshare,
>       cgroup /batchjobs/c_job_id1 becomes the cgroupns-root.
>       For the init_cgroup_ns, this is the real root (“/”) cgroup
>       (identified in code as cgrp_dfl_root.cgrp).
>
>   (2) The cgroupns-root cgroup does not change even if the namespace
>       creator process later moves to a different cgroup.
>       $ ~/unshare -c # unshare cgroupns in some cgroup
>       [ns]$ cat /proc/self/cgroup 
>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/ 
>       [ns]$ mkdir sub_cgrp_1
>       [ns]$ echo 0 > sub_cgrp_1/cgroup.procs
>       [ns]$ cat /proc/self/cgroup 
>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
>
>   (3) Each process gets its CGROUPNS specific view of
>       /proc/<pid>/cgroup.
>   (a) Processes running inside the cgroup namespace will be able to see
>       cgroup paths (in /proc/self/cgroup) only inside their root cgroup
>       [ns]$ sleep 100000 &  # From within unshared cgroupns
>       [1] 7353
>       [ns]$ echo 7353 > sub_cgrp_1/cgroup.procs
>       [ns]$ cat /proc/7353/cgroup
>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
>
>   (b) From global cgroupns, the real cgroup path will be visible:
>       $ cat /proc/7353/cgroup
>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
>
>   (c) From a sibling cgroupns (cgroupns root-ed at a sibling cgroup), no cgroup
>       path will be visible:
>       # ns2's cgroupns-root is at '/batchjobs/c_job_id2'
>       [ns2]$ cat /proc/7353/cgroup
>       [ns2]$
>       This is same as when cgroup hierarchy is not mounted at all.
>       (In correct container setup though, it should not be possible to
>        access PIDs in another container in the first place.)
>
>   (4) Processes inside a cgroupns are not allowed to move out of the
>       cgroupns-root. This is true even if a privileged process in global
>       cgroupns tries to move the process out of its cgroupns-root.
>
>       # From global cgroupns
>       $ cat /proc/7353/cgroup
>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
>       # cgroupns-root for 7353 is /batchjobs/c_job_id1
>       $ echo 7353 > batchjobs/c_job_id2/cgroup.procs
>       -bash: echo: write error: Operation not permitted
>
>   (5) Setns to another cgroup namespace is allowed only when:
>       (a) process has CAP_SYS_ADMIN in its current userns
>       (b) process has CAP_SYS_ADMIN in the target cgroupns' userns
>       (c) the process's current cgroup is a descendant of the cgroupns-root
>           of the target namespace.
>       (d) the target cgroupns-root is a descendant of the current cgroupns-root.
>       The last check (d) prevents processes from escaping their cgroupns-root by
>       attaching to parent cgroupns. Thus, setns is allowed only when the process
>       is trying to restrict itself to a deeper cgroup hierarchy.
>
>   (6) When some thread from a multi-threaded process unshares its
>       cgroup-namespace, the new cgroupns gets applied to the entire
>       process (all the threads). This should be OK since
>       unified-hierarchy only allows process-level containerization. So
>       all the threads in the process will have the same cgroup. And both
>       - changing cgroups and unsharing namespaces - are protected under
>       threadgroup_lock(task).
>
>   (7) The cgroup namespace is alive as long as there is at least 1
>       process inside it. When the last process exits, the cgroup
>       namespace is destroyed. The cgroupns-root and the actual cgroups
>       remain though.
>
>   (8) 'mount -t cgroup cgroup <mntpt>' when called from within cgroupns mounts
>       the unified cgroup hierarchy with cgroupns-root as the filesystem root.
>       The process needs CAP_SYS_ADMIN in its userns and mntns. This allows the
>       container management tools to be run inside the containers transparently.
>
> Implementation
>   The current patch-set is based on top of Tejun Heo's cgroup tree (for-next
>   branch). It's fairly non-intrusive and provides above mentioned
>   features.
>
> Possible extensions of CGROUPNS:
>   (1) The Documentation/cgroups/unified-hierarchy.txt mentions use of
>       capabilities to restrict cgroups to administrative users. CGroup
>       namespaces could be of help here. With cgroup namespaces, it might
>       be possible to delegate administration of sub-cgroups under a
>       cgroupns-root to the cgroupns owner.




> ---
>  fs/kernfs/dir.c                  |  53 +++++++++---
>  fs/kernfs/mount.c                |  48 +++++++++++
>  fs/proc/namespaces.c             |   3 +
>  include/linux/cgroup.h           |  41 +++++++++-
>  include/linux/cgroup_namespace.h |  62 +++++++++++++++
>  include/linux/kernfs.h           |   5 ++
>  include/linux/nsproxy.h          |   2 +
>  include/linux/proc_ns.h          |   4 +
>  include/uapi/linux/sched.h       |   3 +-
>  init/Kconfig                     |   9 +++
>  kernel/Makefile                  |   1 +
>  kernel/cgroup.c                  | 139 ++++++++++++++++++++++++++------
>  kernel/cgroup_namespace.c        | 168 +++++++++++++++++++++++++++++++++++++++
>  kernel/fork.c                    |   2 +-
>  kernel/nsproxy.c                 |  19 ++++-
>  15 files changed, 518 insertions(+), 41 deletions(-)
>  create mode 100644 include/linux/cgroup_namespace.h
>  create mode 100644 kernel/cgroup_namespace.c
>
> [PATCHv1 1/8] kernfs: Add API to generate relative kernfs path
> [PATCHv1 2/8] sched: new clone flag CLONE_NEWCGROUP for cgroup
> [PATCHv1 3/8] cgroup: add function to get task's cgroup on default
> [PATCHv1 4/8] cgroup: export cgroup_get() and cgroup_put()
> [PATCHv1 5/8] cgroup: introduce cgroup namespaces
> [PATCHv1 6/8] cgroup: restrict cgroup operations within task's cgroupns
> [PATCHv1 7/8] cgroup: cgroup namespace setns support
> [PATCHv1 8/8] cgroup: mount cgroupns-root when inside non-init cgroupns
> _______________________________________________
> Containers mailing list
> Containers@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/containers
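
The /proc/<pid>/cgroup behaviour described in items (3)(a) to (3)(c) of
the quoted writeup can be modelled with path strings. This is a sketch
only: model_cgroupns_path() is a hypothetical name, cgroups are
represented as absolute paths, and the empty result for a cgroup outside
the reader's cgroupns-root follows the description in (3)(c) rather than
any particular kernel code path.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical model: write into @buf the path of @cgrp as seen from a
 * cgroupns rooted at @ns_root. Returns 0 if the cgroup is visible, or
 * -1 with an empty @buf if it falls outside @ns_root.
 */
static int model_cgroupns_path(const char *cgrp, const char *ns_root,
			       char *buf, size_t len)
{
	size_t n = strlen(ns_root);

	if (len == 0)
		return -1;
	buf[0] = '\0';

	/* Global cgroupns: the real path is shown unchanged. */
	if (strcmp(ns_root, "/") == 0) {
		snprintf(buf, len, "%s", cgrp);
		return 0;
	}

	/* Outside the cgroupns-root (e.g. a sibling): show nothing. */
	if (strncmp(cgrp, ns_root, n) != 0 ||
	    (cgrp[n] != '\0' && cgrp[n] != '/'))
		return -1;

	/* Inside: strip the root prefix; the root itself becomes "/". */
	snprintf(buf, len, "%s", cgrp[n] ? cgrp + n : "/");
	return 0;
}
```

For the example task 7353 in /batchjobs/c_job_id1/sub_cgrp_1, a reader
rooted at /batchjobs/c_job_id1 gets /sub_cgrp_1, a reader in the global
cgroupns gets the full path, and a reader rooted at /batchjobs/c_job_id2
gets nothing.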

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv1 0/8] CGroup Namespaces
@ 2014-10-19  4:54       ` Eric W. Biederman
  0 siblings, 0 replies; 384+ messages in thread
From: Eric W. Biederman @ 2014-10-19  4:54 UTC (permalink / raw)
  To: Aditya Kali
  Cc: tj, lizefan, serge.hallyn, luto, cgroups, linux-kernel,
	linux-api, mingo, containers

Aditya Kali <adityakali@google.com> writes:

> Second take at the Cgroup Namespace patch-set.
>
> Major changes from RFC (V0):
> 1. setns support for cgroupns
> 2. 'mount -t cgroup cgroup <mntpt>' from inside a cgroupns now
>    mounts the cgroup hierarchy with cgroupns-root as the filesystem root.
> 3. writes to cgroup files outside of cgroupns-root are not allowed
> 4. visibility of /proc/<pid>/cgroup is further restricted by not showing
>    anything if the <pid> is in a sibling cgroupns and its cgroup falls outside
>    your cgroupns-root.
>
> More details in the writeup below.

This definitely looks like the right direction to go, and something that
in some form or another I had been asking for since cgroups were merged.
So I am very glad to see this work moving forward.

I had hoped that we might just be able to be clever with remounting
cgroupfs but 2 things stand in the way.
1) /proc/<pid>/cgroup (but proc could capture that).
2) providing a hard guarantee that tasks stay within a subset of the
   cgroup hierarchy.

So I think this clearly meets the requirements for a new namespace.

We need to have the discussion on chmod of files on cgroupfs.  There is
a notion that has floated around that only systemd or only root (with
the appropriate capabilities) should be allowed to set resource limits
in cgroupfs.  In practice that is nonsense.  If an attribute is
properly bound in its hierarchy it should be safe to change.

Not all attributes are properly bound to the hierarchy, and some are, or
at least were, dangerous for anyone except root to set.  So I suggest
that a CFTYPE flag, perhaps CFTYPE_UNPRIV, be added for attributes that
are safe to allow anyone to set, and that CFTYPE_UNPRIV be required
before we chmod a cgroup attribute from root.

That would be complementary work, and not strictly tied to the cgroup
namespaces, but unprivileged cgroup namespaces don't make much sense
without that work.

Eric

> Background
>   Cgroups and Namespaces are used together to create “virtual”
>   containers that isolate the host environment from the processes
>   running in the container. But since cgroups themselves are not
>   “virtualized”, the task is always able to see the global cgroup view
>   through the cgroupfs mount and via the /proc/self/cgroup file.
>
>   $ cat /proc/self/cgroup 
>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>
>   This exposure of cgroup names to the processes running inside a
>   container results in some problems:
>   (1) The container names are typically host-container-management-agent
>       (systemd, docker/libcontainer, etc.) data and leaking its name (or
>       leaking the hierarchy) reveals too much information about the host
>       system.
>   (2) It makes the container migration across machines (CRIU) more
>       difficult as the container names need to be unique across the
>       machines in the migration domain.
>   (3) It makes it difficult to run container management tools (like
>       docker/libcontainer, lmctfy, etc.) within virtual containers
>       without adding dependency on some state/agent present outside the
>       container.
>
>   Note that the feature proposed here is completely different than the
>   “ns cgroup” feature which existed in the linux kernel until recently.
>   The ns cgroup also attempted to connect cgroups and namespaces by
>   creating a new cgroup every time a new namespace was created. It did
>   not solve any of the above mentioned problems and was later dropped
>   from the kernel. Incidentally though, it used the same config option
>   name CONFIG_CGROUP_NS as used in my prototype!
>
> Introducing CGroup Namespaces
>   With unified cgroup hierarchy
>   (Documentation/cgroups/unified-hierarchy.txt), the containers can now
>   have a much more coherent cgroup view and it's easy to associate a
>   container with a single cgroup. This also allows us to virtualize the
>   cgroup view for tasks inside the container.
>
>   The new CGroup Namespace allows a process to “unshare” its cgroup
>   hierarchy starting from the cgroup it's currently in.
>   For Ex:
>   $ cat /proc/self/cgroup
>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>   $ ls -l /proc/self/ns/cgroup
>   lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
>   $ ~/unshare -c  # calls unshare(CLONE_NEWCGROUP) and exec’s /bin/bash
>   [ns]$ ls -l /proc/self/ns/cgroup
>   lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
>   # From within the new cgroupns, the process sees that it's in the root cgroup
>   [ns]$ cat /proc/self/cgroup
>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
>
>   # From global cgroupns:
>   $ cat /proc/<pid>/cgroup
>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>
>   # Unshare cgroupns along with userns and mountns
>   # Following calls unshare(CLONE_NEWCGROUP|CLONE_NEWUSER|CLONE_NEWNS), then
>   # sets up uid/gid map and exec’s /bin/bash
>   $ ~/unshare -c -u -m
>
>   # Originally, we were in /batchjobs/c_job_id1 cgroup. Mount our own cgroup
>   # hierarchy.
>   [ns]$ mount -t cgroup cgroup /tmp/cgroup
>   [ns]$ ls -l /tmp/cgroup
>   total 0
>   -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.controllers
>   -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.populated
>   -rw-r--r-- 1 root root 0 2014-10-13 09:25 cgroup.procs
>   -rw-r--r-- 1 root root 0 2014-10-13 09:32 cgroup.subtree_control
>
>   The cgroupns-root (/batchjobs/c_job_id1 in above example) becomes the
>   filesystem root for the namespace specific cgroupfs mount.
>
>   The virtualization of /proc/self/cgroup file combined with restricting
>   the view of cgroup hierarchy by namespace-private cgroupfs mount
>   should provide a completely isolated cgroup view inside the container.
>
>   In its current form, the cgroup namespaces patchset provides the following
>   behavior:
>
>   (1) The “root” cgroup for a cgroup namespace is the cgroup in which
>       the process calling unshare is running.
>       For example, if a process in the /batchjobs/c_job_id1 cgroup calls unshare,
>       cgroup /batchjobs/c_job_id1 becomes the cgroupns-root.
>       For the init_cgroup_ns, this is the real root (“/”) cgroup
>       (identified in code as cgrp_dfl_root.cgrp).
>
>   (2) The cgroupns-root cgroup does not change even if the namespace
>       creator process later moves to a different cgroup.
>       $ ~/unshare -c # unshare cgroupns in some cgroup
>       [ns]$ cat /proc/self/cgroup 
>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/ 
>       [ns]$ mkdir sub_cgrp_1
>       [ns]$ echo 0 > sub_cgrp_1/cgroup.procs
>       [ns]$ cat /proc/self/cgroup 
>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
>
>   (3) Each process gets its cgroupns-specific view of
>       /proc/<pid>/cgroup.
>   (a) Processes running inside the cgroup namespace will be able to see
>       cgroup paths (in /proc/self/cgroup) only inside their root cgroup
>       [ns]$ sleep 100000 &  # From within unshared cgroupns
>       [1] 7353
>       [ns]$ echo 7353 > sub_cgrp_1/cgroup.procs
>       [ns]$ cat /proc/7353/cgroup
>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
>
>   (b) From global cgroupns, the real cgroup path will be visible:
>       $ cat /proc/7353/cgroup
>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
>
>   (c) From a sibling cgroupns (cgroupns root-ed at a sibling cgroup), no cgroup
>       path will be visible:
>       # ns2's cgroupns-root is at '/batchjobs/c_job_id2'
>       [ns2]$ cat /proc/7353/cgroup
>       [ns2]$
>       This is the same as when the cgroup hierarchy is not mounted at all.
>       (In a correct container setup, though, it should not be possible to
>        access PIDs in another container in the first place.)
>
>   (4) Processes inside a cgroupns are not allowed to move out of the
>       cgroupns-root. This is true even if a privileged process in global
>       cgroupns tries to move the process out of its cgroupns-root.
>
>       # From global cgroupns
>       $ cat /proc/7353/cgroup
>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
>       # cgroupns-root for 7353 is /batchjobs/c_job_id1
>       $ echo 7353 > batchjobs/c_job_id2/cgroup.procs
>       -bash: echo: write error: Operation not permitted
>
>   (5) Setns to another cgroup namespace is allowed only when:
>       (a) the process has CAP_SYS_ADMIN in its current userns
>       (b) the process has CAP_SYS_ADMIN in the target cgroupns' userns
>       (c) the process's current cgroup is a descendant of the
>           cgroupns-root of the target namespace
>       (d) the target cgroupns-root is a descendant of the current
>           cgroupns-root.
>       The last check (d) prevents processes from escaping their cgroupns-root by
>       attaching to a parent cgroupns. Thus, setns is allowed only when the process
>       is trying to restrict itself to a deeper cgroup hierarchy.
>
>   (6) When some thread from a multi-threaded process unshares its
>       cgroup-namespace, the new cgroupns gets applied to the entire
>       process (all the threads). This should be OK since
>       unified-hierarchy only allows process-level containerization. So
>       all the threads in the process will have the same cgroup. And both
>       - changing cgroups and unsharing namespaces - are protected under
>       threadgroup_lock(task).
>
>   (7) The cgroup namespace is alive as long as there is at least one
>       process inside it. When the last process exits, the cgroup
>       namespace is destroyed. The cgroupns-root and the actual cgroups
>       remain though.
>
>   (8) 'mount -t cgroup cgroup <mntpt>' when called from within cgroupns mounts
>       the unified cgroup hierarchy with cgroupns-root as the filesystem root.
>       The process needs CAP_SYS_ADMIN in its userns and mntns. This allows the
>       container management tools to be run inside the containers transparently.
>
> Implementation
>   The current patch-set is based on top of Tejun Heo's cgroup tree (for-next
>   branch). It's fairly non-intrusive and provides the above-mentioned
>   features.
>
> Possible extensions of CGROUPNS:
>   (1) The Documentation/cgroups/unified-hierarchy.txt mentions use of
>       capabilities to restrict cgroups to administrative users. CGroup
>       namespaces could be of help here. With cgroup namespaces, it might
>       be possible to delegate administration of sub-cgroups under a
>       cgroupns-root to the cgroupns owner.
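
To make rules (3) and (5) quoted above concrete, here is a minimal
userspace sketch that models cgroups as absolute path strings. It is not
code from the patchset: path_is_descendant(), ns_view() and
setns_paths_ok() are hypothetical names, and the capability checks (5)(a)
and (5)(b) are left out. It assumes that a cgroup counts as a descendant
of itself, which is how the examples in the cover letter behave.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * True if 'path' is 'ancestor' itself or lies below it.
 * Both are absolute cgroup paths without a trailing slash ("/" is the root).
 */
bool path_is_descendant(const char *path, const char *ancestor)
{
	size_t n;

	assert(path && ancestor);
	if (strcmp(ancestor, "/") == 0)
		return path[0] == '/';
	n = strlen(ancestor);
	if (strncmp(path, ancestor, n) != 0)
		return false;
	/* Reject sibling names that merely share a prefix (job1 vs job10). */
	return path[n] == '\0' || path[n] == '/';
}

/*
 * Rule (3): the path a reader whose cgroupns-root is 'nsroot' would see
 * for a task in cgroup 'path'. NULL means "nothing is shown".
 */
const char *ns_view(const char *path, const char *nsroot)
{
	size_t n;

	if (!path_is_descendant(path, nsroot))
		return NULL;
	if (strcmp(nsroot, "/") == 0)
		return path;
	n = strlen(nsroot);
	return path[n] == '\0' ? "/" : path + n;
}

/* Rule (5)(c) and (5)(d) only; capabilities are not modelled. */
bool setns_paths_ok(const char *cur_cgroup, const char *cur_nsroot,
		    const char *target_nsroot)
{
	return path_is_descendant(cur_cgroup, target_nsroot) &&
	       path_is_descendant(target_nsroot, cur_nsroot);
}
```

With the paths from the examples above,
ns_view("/batchjobs/c_job_id1/sub_cgrp_1", "/batchjobs/c_job_id1") gives
"/sub_cgrp_1", while the same task viewed from a namespace rooted at
/batchjobs/c_job_id2 gives NULL, matching cases (3)(a) and (3)(c).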




> ---
>  fs/kernfs/dir.c                  |  53 +++++++++---
>  fs/kernfs/mount.c                |  48 +++++++++++
>  fs/proc/namespaces.c             |   3 +
>  include/linux/cgroup.h           |  41 +++++++++-
>  include/linux/cgroup_namespace.h |  62 +++++++++++++++
>  include/linux/kernfs.h           |   5 ++
>  include/linux/nsproxy.h          |   2 +
>  include/linux/proc_ns.h          |   4 +
>  include/uapi/linux/sched.h       |   3 +-
>  init/Kconfig                     |   9 +++
>  kernel/Makefile                  |   1 +
>  kernel/cgroup.c                  | 139 ++++++++++++++++++++++++++------
>  kernel/cgroup_namespace.c        | 168 +++++++++++++++++++++++++++++++++++++++
>  kernel/fork.c                    |   2 +-
>  kernel/nsproxy.c                 |  19 ++++-
>  15 files changed, 518 insertions(+), 41 deletions(-)
>  create mode 100644 include/linux/cgroup_namespace.h
>  create mode 100644 kernel/cgroup_namespace.c
>
> [PATCHv1 1/8] kernfs: Add API to generate relative kernfs path
> [PATCHv1 2/8] sched: new clone flag CLONE_NEWCGROUP for cgroup
> [PATCHv1 3/8] cgroup: add function to get task's cgroup on default
> [PATCHv1 4/8] cgroup: export cgroup_get() and cgroup_put()
> [PATCHv1 5/8] cgroup: introduce cgroup namespaces
> [PATCHv1 6/8] cgroup: restrict cgroup operations within task's cgroupns
> [PATCHv1 7/8] cgroup: cgroup namespace setns support
> [PATCHv1 8/8] cgroup: mount cgroupns-root when inside non-init cgroupns
> _______________________________________________
> Containers mailing list
> Containers@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv1 6/8] cgroup: restrict cgroup operations within task's cgroupns
  2014-10-13 21:23       ` Aditya Kali
@ 2014-10-19  4:57           ` Eric W. Biederman
  -1 siblings, 0 replies; 384+ messages in thread
From: Eric W. Biederman @ 2014-10-19  4:57 UTC (permalink / raw)
  To: Aditya Kali
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, luto-kltTT9wpgjJwATOyAt5JVQ,
	tj-DgEjT+Ai2ygdnm+yROfE0A, cgroups-u79uwXL29TY76Z2rM5mHXA,
	mingo-H+wXaHxf7aLQT0dZR+AlfA

Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> writes:

> Restrict the following operations to within the calling task's
> cgroupns-root:
> * cgroup_mkdir & cgroup_rmdir
> * cgroup_attach_task
> * writes to cgroup files
>
> Also, reading the /proc/<pid>/cgroup file is now restricted
> to tasks under the same cgroupns-root. If a task tries to look
> at the cgroup of another task outside of its cgroupns-root, then
> it won't be able to see anything for the default hierarchy.
> This is the same as if the cgroups were not mounted.

So I think this patch is out of order.  

We should add the namespace infrastructure and the restrictions before
we allow creation of the namespace.  Otherwise there is a bisection
point where cgroup namespaces are broken or at the very least have a
security hole.  Since we can anticipate this let's see if we can figure
out how to avoid it.

Eric


> Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> ---
>  kernel/cgroup.c | 34 +++++++++++++++++++++++++++++++++-
>  1 file changed, 33 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index f8099b4..2fc0dfa 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -2318,6 +2318,12 @@ static int cgroup_attach_task(struct cgroup *dst_cgrp,
>  	struct task_struct *task;
>  	int ret;
>  
> +	/* Only allow changing cgroups accessible within task's cgroup
> +	 * namespace. i.e. 'dst_cgrp' should be a descendant of task's
> +	 * cgroupns->root_cgrp. */
> +	if (!cgroup_is_descendant(dst_cgrp, task_cgroupns_root(leader)))
> +		return -EPERM;
> +
>  	/* look up all src csets */
>  	down_read(&css_set_rwsem);
>  	rcu_read_lock();
> @@ -2882,6 +2888,10 @@ static ssize_t cgroup_file_write(struct kernfs_open_file *of, char *buf,
>  	struct cgroup_subsys_state *css;
>  	int ret;
>  
> +	/* Reject writes to cgroup files outside of task's cgroupns-root. */
> +	if (!cgroup_is_descendant(cgrp, task_cgroupns_root(current)))
> +		return -EINVAL;
> +
>  	if (cft->write)
>  		return cft->write(of, buf, nbytes, off);
>  
> @@ -4560,6 +4570,13 @@ static int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
>  	parent = cgroup_kn_lock_live(parent_kn);
>  	if (!parent)
>  		return -ENODEV;
> +
> +	/* Allow mkdir only within process's cgroup namespace root. */
> +	if (!cgroup_is_descendant(parent, task_cgroupns_root(current))) {
> +		ret = -EPERM;
> +		goto out_unlock;
> +	}
> +
>  	root = parent->root;
>  
>  	/* allocate the cgroup and its ID, 0 is reserved for the root */
> @@ -4822,6 +4839,13 @@ static int cgroup_rmdir(struct kernfs_node *kn)
>  	if (!cgrp)
>  		return 0;
>  
> +	/* Allow rmdir only within process's cgroup namespace root.
> +	 * The process can't delete its own root anyways. */
> +	if (!cgroup_is_descendant(cgrp, task_cgroupns_root(current))) {
> +		cgroup_kn_unlock(kn);
> +		return -EPERM;
> +	}
> +
>  	ret = cgroup_destroy_locked(cgrp);
>  
>  	cgroup_kn_unlock(kn);
> @@ -5051,6 +5075,15 @@ int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns,
>  		if (root == &cgrp_dfl_root && !cgrp_dfl_root_visible)
>  			continue;
>  
> +		cgrp = task_cgroup_from_root(tsk, root);
> +
> +		/* The cgroup path on default hierarchy is shown only if it
> +		 * falls under current task's cgroupns-root.
> +		 */
> +		if (root == &cgrp_dfl_root &&
> +		    !cgroup_is_descendant(cgrp, task_cgroupns_root(current)))
> +			continue;
> +
>  		seq_printf(m, "%d:", root->hierarchy_id);
>  		for_each_subsys(ss, ssid)
>  			if (root->subsys_mask & (1 << ssid))
> @@ -5059,7 +5092,6 @@ int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns,
>  			seq_printf(m, "%sname=%s", count ? "," : "",
>  				   root->name);
>  		seq_putc(m, ':');
> -		cgrp = task_cgroup_from_root(tsk, root);
>  		path = cgroup_path(cgrp, buf, PATH_MAX);
>  		if (!path) {
>  			retval = -ENAMETOOLONG;
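
All five hunks quoted above apply the same test: walk from the cgroup
being operated on up through its parents and see whether the task's
cgroupns-root is on that path. A minimal userspace model of that test
follows, assuming nothing more than a parent pointer per cgroup. struct
cgrp_model and both function names are made up for illustration and are
not the kernel's types or helpers.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a cgroup: only the link to its parent matters here. */
struct cgrp_model {
	struct cgrp_model *parent;	/* NULL for the hierarchy root */
};

/* True if 'cgrp' is 'ancestor' itself or sits anywhere below it. */
bool cgrp_model_is_descendant(const struct cgrp_model *cgrp,
			      const struct cgrp_model *ancestor)
{
	while (cgrp) {
		if (cgrp == ancestor)
			return true;
		cgrp = cgrp->parent;
	}
	return false;
}

/*
 * Shape of the attach/mkdir/rmdir checks in the quoted diff: the
 * operation is refused unless the destination is inside the
 * caller's cgroupns-root.
 */
int cgrp_model_may_attach(const struct cgrp_model *dst,
			  const struct cgrp_model *nsroot)
{
	assert(dst && nsroot);
	return cgrp_model_is_descendant(dst, nsroot) ? 0 : -EPERM;
}
```

One detail visible in the quoted diff: cgroup_file_write returns -EINVAL
for this condition, while the other call sites return -EPERM.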

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support
  2014-10-16 21:47                 ` Serge E. Hallyn
@ 2014-10-19  5:23                     ` Eric W. Biederman
  -1 siblings, 0 replies; 384+ messages in thread
From: Eric W. Biederman @ 2014-10-19  5:23 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andy Lutomirski,
	Ingo Molnar, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA

"Serge E. Hallyn" <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> writes:

> Quoting Aditya Kali (adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org):
>> On Thu, Oct 16, 2014 at 2:12 PM, Serge E. Hallyn <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> wrote:
>> > Quoting Aditya Kali (adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org):
>> >> setns on a cgroup namespace is allowed only if
>> >> * task has CAP_SYS_ADMIN in its current user-namespace and
>> >>   over the user-namespace associated with target cgroupns.
>> >> * task's current cgroup is descendent of the target cgroupns-root
>> >>   cgroup.
>> >
>> > What is the point of this?
>> >
>> > If I'm a user logged into
>> > /lxc/c1/user.slice/user-1000.slice/session-c12.scope and I start
>> > a container which is in
>> > /lxc/c1/user.slice/user-1000.slice/session-c12.scope/x1
>> > then I will want to be able to enter the container's cgroup.
>> > The container's cgroup root is under my own (satisfying the
>> > below condition) but my cgroup is not a descendant of the
>> > container's cgroup.
>> >
>> This condition is there because we don't want to do implicit cgroup
>> changes when a process attaches to another cgroupns. cgroupns tries to
>> preserve the invariant that at any point, your current cgroup is
>> always under the cgroupns-root of your cgroup namespace. But in your
>> example, if we allow a process in "session-c12.scope" container to
>> attach to cgroupns root'ed at "session-c12.scope/x1" container
>> (without implicitly moving its cgroup), then this invariant won't
>> hold.
>
> Oh, I see.  Guess that should be workable.  Thanks.

Which has me looking at what the rules are for moving through
the cgroup hierarchy.

As long as we have write access to cgroup.procs and are allowed
to open the file for write, we can move any of our own tasks
into the cgroup.  So the cgroup namespace rules don't seem
to be a problem.

Andy, can you please take a look at the permission checks in
__cgroup_procs_write?

As I read the code I see 3 security gaffes in the permission check.
- Using current->cred instead of file->f_cred.
- Not checking tcred->euid.
- Checking GLOBAL_ROOT_UID instead of having a capable call.

The file permissions on cgroup.procs seem just sufficient to keep
those bugs from being easily exploitable.
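
For reference, the three points can be put side by side in a small
userspace model. This is only a sketch: struct cred_model and both
functions are hypothetical. The "current" variant assumes the existing
check compares the writer's euid against root and against the target's
uid and suid, which is what the bullets suggest. The "strict" variant is
just one possible reading of what a fix could look like, not a proposed
patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy credentials; cap_sys_admin stands in for a capable()-style check. */
struct cred_model {
	unsigned int uid, euid, suid;
	bool cap_sys_admin;
};

/*
 * Assumed shape of the existing check: 'writer' is whoever calls
 * write(2), and the target's euid is never looked at.
 */
int procs_write_check_current(const struct cred_model *writer,
			      const struct cred_model *target)
{
	assert(writer && target);
	if (writer->euid != 0 &&
	    writer->euid != target->uid &&
	    writer->euid != target->suid)
		return -EACCES;
	return 0;
}

/*
 * Hypothetical stricter variant: 'opener' is whoever opened the file,
 * privilege comes from a capability rather than uid 0, and the
 * target's euid has to match as well.
 */
int procs_write_check_strict(const struct cred_model *opener,
			     const struct cred_model *target)
{
	assert(opener && target);
	if (opener->cap_sys_admin)
		return 0;
	if (opener->euid == target->uid &&
	    opener->euid == target->euid &&
	    opener->euid == target->suid)
		return 0;
	return -EACCES;
}
```

In this model a target with uid 1000 but euid and suid 0 is accepted by
the first function for a writer with euid 1000, and rejected by the
second.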

Eric

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support
       [not found]                     ` <87iojgmy3o.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2014-10-19 18:26                       ` Andy Lutomirski
  0 siblings, 0 replies; 384+ messages in thread
From: Andy Lutomirski @ 2014-10-19 18:26 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Ingo Molnar, Tejun Heo,
	cgroups-u79uwXL29TY76Z2rM5mHXA

On Sat, Oct 18, 2014 at 10:23 PM, Eric W. Biederman
<ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
> "Serge E. Hallyn" <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> writes:
>
>> Quoting Aditya Kali (adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org):
>>> On Thu, Oct 16, 2014 at 2:12 PM, Serge E. Hallyn <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> wrote:
>>> > Quoting Aditya Kali (adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org):
>>> >> setns on a cgroup namespace is allowed only if
>>> >> * task has CAP_SYS_ADMIN in its current user-namespace and
>>> >>   over the user-namespace associated with target cgroupns.
>>> >> * task's current cgroup is descendent of the target cgroupns-root
>>> >>   cgroup.
>>> >
>>> > What is the point of this?
>>> >
>>> > If I'm a user logged into
>>> > /lxc/c1/user.slice/user-1000.slice/session-c12.scope and I start
>>> > a container which is in
>>> > /lxc/c1/user.slice/user-1000.slice/session-c12.scope/x1
>>> > then I will want to be able to enter the container's cgroup.
>>> > The container's cgroup root is under my own (satisfying the
>>> > below condition) but my cgroup is not a descendant of the
>>> > container's cgroup.
>>> >
>>> This condition is there because we don't want to do implicit cgroup
>>> changes when a process attaches to another cgroupns. cgroupns tries to
>>> preserve the invariant that at any point, your current cgroup is
>>> always under the cgroupns-root of your cgroup namespace. But in your
>>> example, if we allow a process in "session-c12.scope" container to
>>> attach to cgroupns root'ed at "session-c12.scope/x1" container
>>> (without implicitly moving its cgroup), then this invariant won't
>>> hold.
>>
>> Oh, I see.  Guess that should be workable.  Thanks.
>
> Which has me looking at what the rules are for moving through
> the cgroup hierarchy.
>
> As long as we have write access to cgroup.procs and are allowed
> to open the file for write, we can move any of our own tasks
> into the cgroup.  So the cgroup namespace rules don't seem
> to be a problem.
>
> Andy can you please take a look at the permission checks in
> __cgroup_procs_write.

The actual requirements for calling that function haven't changed,
right?  IOW, what does this have to do with cgroupns?  Is the idea
that you want a privileged user wrt a cgroupns's userns to be able to
use this?  If so:

Yes, that current_cred() thing is bogus.  (Actually, this is probably
exploitable right now if any cgroup.procs inode anywhere on the system
lets non-root write.)  (Can we have some kernel debugging option that
makes any use of current_cred() in write(2) warn?)

We really need a weaker version of may_ptrace for this kind of stuff.
Maybe the existing may_ptrace stuff is okay, actually.  But this is
completely missing group checks, cap checks, capabilities wrt the
userns, etc.

Also, I think that, if this version of the patchset allows non-init
userns to unshare cgroupns, then the issue of what permission is
needed to lock the cgroup hierarchy like that needs to be addressed,
because unshare(CLONE_NEWUSER|CLONE_NEWCGROUP) will effectively pin
the calling task with no permission required.  Bolting on a fix later
will be a mess.
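
For concreteness, the call in question can be probed from userspace with
a sketch like the following. It assumes the flag values CLONE_NEWUSER =
0x10000000 and CLONE_NEWCGROUP = 0x02000000 as found in mainline uapi
headers (the value used by this RFC may differ), goes through syscall(2)
so it does not depend on libc exposing the flag, and runs in a forked
child so the caller's own namespaces are left alone.
probe_unshare_cgroupns() is a made-up helper name.

```c
#include <errno.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Assumed values; only defined here if the libc headers did not. */
#ifndef CLONE_NEWUSER
#define CLONE_NEWUSER   0x10000000
#endif
#ifndef CLONE_NEWCGROUP
#define CLONE_NEWCGROUP 0x02000000
#endif

/*
 * Try unshare(CLONE_NEWUSER | CLONE_NEWCGROUP) in a child process.
 * Returns 0 if the child's unshare succeeded, the child's errno
 * (1..255) if it failed, or -1 if fork/wait itself failed.
 */
int probe_unshare_cgroupns(void)
{
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return -1;
	if (pid == 0) {
		long rc = syscall(SYS_unshare, CLONE_NEWUSER | CLONE_NEWCGROUP);
		int err = errno;

		if (rc == 0)
			_exit(0);
		/* Exit codes are 8 bits wide; clamp anything unexpected. */
		_exit(err > 0 && err < 256 ? err : 255);
	}
	if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
		return -1;
	return WEXITSTATUS(status);
}
```

A return of 0 from an unprivileged caller would mean the task was able to
pin itself this way without holding any capability in the initial userns.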

--Andy

>
> As I read the code I see 3 security gaffes in the permission check.
> - Using current->cred instead of file->f_cred.
> - Not checking tcred->euid.
> - Checking GLOBAL_ROOT_UID instead of having a capable call.
>
> The file permissions on cgroup.procs seem just sufficient to keep
> those bugs from being easily exploitable.
>
> Eric



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support
@ 2014-10-19 18:26                       ` Andy Lutomirski
  0 siblings, 0 replies; 384+ messages in thread
From: Andy Lutomirski @ 2014-10-19 18:26 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Serge E. Hallyn, Aditya Kali, Linux API, Linux Containers,
	Serge Hallyn, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Ingo Molnar

On Sat, Oct 18, 2014 at 10:23 PM, Eric W. Biederman
<ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
> "Serge E. Hallyn" <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> writes:
>
>> Quoting Aditya Kali (adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org):
>>> On Thu, Oct 16, 2014 at 2:12 PM, Serge E. Hallyn <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> wrote:
>>> > Quoting Aditya Kali (adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org):
>>> >> setns on a cgroup namespace is allowed only if
>>> >> * task has CAP_SYS_ADMIN in its current user-namespace and
>>> >>   over the user-namespace associated with target cgroupns.
>>> >> * task's current cgroup is descendent of the target cgroupns-root
>>> >>   cgroup.
>>> >
>>> > What is the point of this?
>>> >
>>> > If I'm a user logged into
>>> > /lxc/c1/user.slice/user-1000.slice/session-c12.scope and I start
>>> > a container which is in
>>> > /lxc/c1/user.slice/user-1000.slice/session-c12.scope/x1
>>> > then I will want to be able to enter the container's cgroup.
>>> > The container's cgroup root is under my own (satisfying the
>>> > below condition) but my cgroup is not a descendant of the
>>> > container's cgroup.
>>> >
>>> This condition is there because we don't want to do implicit cgroup
>>> changes when a process attaches to another cgroupns. cgroupns tries to
>>> preserve the invariant that at any point, your current cgroup is
>>> always under the cgroupns-root of your cgroup namespace. But in your
>>> example, if we allow a process in "session-c12.scope" container to
>>> attach to cgroupns root'ed at "session-c12.scope/x1" container
>>> (without implicitly moving its cgroup), then this invariant won't
>>> hold.
>>
>> Oh, I see.  Guess that should be workable.  Thanks.
>
> Which has me looking at what the rules are for moving through
> the cgroup hierarchy.
>
> As long as we have write access to cgroup.procs and are allowed
> to open the file for write, we can move any of our own tasks
> into the cgroup.  So the cgroup namespace rules don't seem
> to be a problem.
>
> Andy can you please take a look at the permission checks in
> __cgroup_procs_write.

The actual requirements for calling that function haven't changed,
right?  IOW, what does this have to do with cgroupns?  Is the idea
that you want a privileged user wrt a cgroupns's userns to be able to
use this?  If so:

Yes, that current_cred() thing is bogus.  (Actually, this is probably
exploitable right now if any cgroup.procs inode anywhere on the system
lets non-root write.)  (Can we have some kernel debugging option that
makes any use of current_cred() in write(2) warn?)

We really need a weaker version of may_ptrace for this kind of stuff.
Maybe the existing may_ptrace stuff is okay, actually.  But this is
completely missing group checks, cap checks, capabilities wrt the
userns, etc.

Also, I think that, if this version of the patchset allows non-init
userns to unshare cgroupns, then the issue of what permission is
needed to lock the cgroup hierarchy like that needs to be addressed,
because unshare(CLONE_NEWUSER|CLONE_NEWCGROUP) will effectively pin
the calling task with no permission required.  Bolting on a fix later
will be a mess.

--Andy

>
> As I read the code I see 3 security gaffes in the permission check.
> - Using current->cred instead of file->f_cred.
> - Not checking tcred->euid.
> - Checking GLOBAL_ROOT_UID instead of having a capable call.
>
> The file permissions on cgroup.procs seem just sufficient to keep
> those bugs from being easily exploitable.
>
> Eric



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 384+ messages in thread
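
The three points Eric lists can be modelled in plain userspace C. What
follows is a minimal sketch and not kernel code: struct model_cred,
its cap_over_target flag and model_may_move_task() are hypothetical
names invented for illustration, and the rule shown is only one
possible reading of the list. The caller is expected to pass the
credentials captured when cgroup.procs was opened (the role of
file->f_cred), not those of whoever later calls write(2).

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the kernel's struct cred; every field is illustrative. */
struct model_cred {
	unsigned int uid;
	unsigned int euid;
	unsigned int suid;
	/* Models "has the needed capability over the target's userns". */
	bool cap_over_target;
};

/*
 * Hypothetical check: may the opener of cgroup.procs move `target`?
 *  - `opener` is the credential recorded at open time, so handing the
 *    open file to a more privileged writer gains nothing.
 *  - Privilege is expressed as a capability flag, not as euid == 0.
 *  - The target's euid is compared as well as its uid and suid.
 */
static bool model_may_move_task(const struct model_cred *opener,
				const struct model_cred *target)
{
	assert(opener && target);

	if (opener->cap_over_target)
		return true;

	return opener->euid == target->uid ||
	       opener->euid == target->euid ||
	       opener->euid == target->suid;
}
```

In this model an opener with euid 0 but without the capability flag is
refused when the target belongs to another user, which is the
difference between comparing against GLOBAL_ROOT_UID and making a
capable call.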

* Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support
       [not found]                       ` <CALCETrUC=yW72d2hDzjESmZAt85x1WcGz4L-DrtY5YXAQxbpMA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2014-10-20  4:55                         ` Eric W.Biederman
  0 siblings, 0 replies; 384+ messages in thread
From: Eric W.Biederman @ 2014-10-20  4:55 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Ingo Molnar, Tejun Heo,
	cgroups-u79uwXL29TY76Z2rM5mHXA



On October 19, 2014 1:26:29 PM CDT, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>On Sat, Oct 18, 2014 at 10:23 PM, Eric W. Biederman
><ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>> "Serge E. Hallyn" <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> writes:
>>
>>> Quoting Aditya Kali (adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org):
>>>> On Thu, Oct 16, 2014 at 2:12 PM, Serge E. Hallyn <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org>
>wrote:
>>>> > Quoting Aditya Kali (adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org):
>>>> >> setns on a cgroup namespace is allowed only if
>>>> >> * task has CAP_SYS_ADMIN in its current user-namespace and
>>>> >>   over the user-namespace associated with target cgroupns.
>>>> >> * task's current cgroup is descendent of the target
>cgroupns-root
>>>> >>   cgroup.
>>>> >
>>>> > What is the point of this?
>>>> >
>>>> > If I'm a user logged into
>>>> > /lxc/c1/user.slice/user-1000.slice/session-c12.scope and I start
>>>> > a container which is in
>>>> > /lxc/c1/user.slice/user-1000.slice/session-c12.scope/x1
>>>> > then I will want to be able to enter the container's cgroup.
>>>> > The container's cgroup root is under my own (satisfying the
>>>> > below condition) but my cgroup is not a descendent of the
>>>> > container's cgroup.
>>>> >
>>>> This condition is there because we don't want to do implicit cgroup
>>>> changes when a process attaches to another cgroupns. cgroupns tries
>to
>>>> preserve the invariant that at any point, your current cgroup is
>>>> always under the cgroupns-root of your cgroup namespace. But in
>your
>>>> example, if we allow a process in "session-c12.scope" container to
>>>> attach to cgroupns root'ed at "session-c12.scope/x1" container
>>>> (without implicitly moving its cgroup), then this invariant won't
>>>> hold.
>>>
>>> Oh, I see.  Guess that should be workable.  Thanks.
>>
>> Which has me looking at what the rules are for moving through
>> the cgroup hierarchy.
>>
>> As long as we have write access to cgroup.procs and are allowed
>> to open the file for write, we can move any of our own tasks
>> into the cgroup.  So the cgroup namespace rules don't seem
>> to be a problem.
>>
>> Andy can you please take a look at the permission checks in
>> __cgroup_procs_write.
>
>The actual requirements for calling that function haven't changed,
>right?  IOW, what does this have to do with cgroupns?

Excluding user namespaces the requirements have not changed.  

The immediate correlation is that to enter a cgroupns you must first put your process in one of its cgroups.

So I was examining what it would take to enter the cgroup of cgroupns.

> Is the idea
>that you want a privileged user wrt a cgroupns's userns to be able to
>use this?  If so:
>
>Yes, that current_cred() thing is bogus.  (Actually, this is probably
>exploitable right now if any cgroup.procs inode anywhere on the system
>lets non-root write.)  (Can we have some kernel debugging option that
>makes any use of current_cred() in write(2) warn?)
>
>We really need a weaker version of may_ptrace for this kind of stuff.
>Maybe the existing may_ptrace stuff is okay, actually.  But this is
>completely missing group checks, cap checks, capabilities wrt the
>userns, etc.
>
>Also, I think that, if this version of the patchset allows non-init
>userns to unshare cgroupns, then the issue of what permission is
>needed to lock the cgroup hierarchy like that needs to be addressed,
>because unshare(CLONE_NEWUSER|CLONE_NEWCGROUP) will effectively pin
>the calling task with no permission required.  Bolting on a fix later
>will be a mess.

I imagine the pinning would be like the userns.

Ah but there is a potentially serious issue with the pinning.
With pinning we can make it impossible for root to move us to a different cgroup. 

I am not certain how serious that is but it bears thinking about.
If we don't implement pinning we should be able to implement everything with just filesystem mount options, and no new namespace required.

Sigh.

I am too tired tonight to see the end game in this.

Eric
>> As I read the code I see 3 security gaffes in the permission check.
>> - Using current->cred instead of file->f_cred.
>> - Not checking tcred->euid.
>> - Checking GLOBAL_ROOT_UID instead of having a capable call.
>>
>> The file permissions on cgroup.procs seem just sufficient to keep
>> those bugs from being easily exploitable.
>>
>> Eric

^ permalink raw reply	[flat|nested] 384+ messages in thread
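
The rule discussed above (a task may only attach to a cgroupns whose
root cgroup is at or above the task's current cgroup) boils down to a
prefix test on cgroup paths. A minimal userspace sketch, assuming
absolute '/'-separated paths; path_is_under_root() is a hypothetical
helper written for illustration and is not taken from the patchset:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: true when the absolute cgroup path `path` is
 * `root` itself or lies somewhere below it.
 */
static bool path_is_under_root(const char *root, const char *path)
{
	size_t n;

	assert(root && path);

	/* Ignore trailing '/' on the root so that "/" and "/a/" both work. */
	n = strlen(root);
	while (n > 0 && root[n - 1] == '/')
		n--;

	if (strncmp(root, path, n) != 0)
		return false;

	/*
	 * The match has to end on a component boundary, otherwise
	 * "/a/bc" would wrongly count as being under "/a/b".
	 */
	return path[n] == '\0' || path[n] == '/';
}
```

With Serge's example, a task in ".../session-c12.scope" fails the test
against a root of ".../session-c12.scope/x1", so it would have to be
moved into x1 (via cgroup.procs) before the setns could succeed.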

* Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support
       [not found]                         ` <44072106-c0f3-46b8-b2b5-9b1cbd1b7d88-2ueSQiBKiTY7tOexoI0I+QC/G2K4zDHf@public.gmane.org>
@ 2014-10-21  0:20                           ` Andy Lutomirski
  0 siblings, 0 replies; 384+ messages in thread
From: Andy Lutomirski @ 2014-10-21  0:20 UTC (permalink / raw)
  To: Eric W.Biederman
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Ingo Molnar, Tejun Heo,
	cgroups-u79uwXL29TY76Z2rM5mHXA

On Sun, Oct 19, 2014 at 9:55 PM, Eric W.Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>
>
> On October 19, 2014 1:26:29 PM CDT, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>>On Sat, Oct 18, 2014 at 10:23 PM, Eric W. Biederman
>><ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>>> "Serge E. Hallyn" <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> writes:
>>>
>>>> Quoting Aditya Kali (adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org):
>>>>> On Thu, Oct 16, 2014 at 2:12 PM, Serge E. Hallyn <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org>
>>wrote:
>>>>> > Quoting Aditya Kali (adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org):
>>>>> >> setns on a cgroup namespace is allowed only if
>>>>> >> * task has CAP_SYS_ADMIN in its current user-namespace and
>>>>> >>   over the user-namespace associated with target cgroupns.
>>>>> >> * task's current cgroup is descendent of the target
>>cgroupns-root
>>>>> >>   cgroup.
>>>>> >
>>>>> > What is the point of this?
>>>>> >
>>>>> > If I'm a user logged into
>>>>> > /lxc/c1/user.slice/user-1000.slice/session-c12.scope and I start
>>>>> > a container which is in
>>>>> > /lxc/c1/user.slice/user-1000.slice/session-c12.scope/x1
>>>>> > then I will want to be able to enter the container's cgroup.
>>>>> > The container's cgroup root is under my own (satisfying the
> >>>>> > below condition) but my cgroup is not a descendent of the
>>>>> > container's cgroup.
>>>>> >
>>>>> This condition is there because we don't want to do implicit cgroup
>>>>> changes when a process attaches to another cgroupns. cgroupns tries
>>to
>>>>> preserve the invariant that at any point, your current cgroup is
>>>>> always under the cgroupns-root of your cgroup namespace. But in
>>your
>>>>> example, if we allow a process in "session-c12.scope" container to
>>>>> attach to cgroupns root'ed at "session-c12.scope/x1" container
>>>>> (without implicitly moving its cgroup), then this invariant won't
>>>>> hold.
>>>>
>>>> Oh, I see.  Guess that should be workable.  Thanks.
>>>
>>> Which has me looking at what the rules are for moving through
>>> the cgroup hierarchy.
>>>
>>> As long as we have write access to cgroup.procs and are allowed
>>> to open the file for write, we can move any of our own tasks
>>> into the cgroup.  So the cgroup namespace rules don't seem
>>> to be a problem.
>>>
>>> Andy can you please take a look at the permission checks in
>>> __cgroup_procs_write.
>>
>>The actual requirements for calling that function haven't changed,
>>right?  IOW, what does this have to do with cgroupns?
>
> Excluding user namespaces the requirements have not changed.
>
> The immediate correlation is that to enter a cgroupns you must first put your process in one of its cgroups.
>
> So I was examining what it would take to enter the cgroup of cgroupns.
>
>> Is the idea
>>that you want a privileged user wrt a cgroupns's userns to be able to
>>use this?  If so:
>>
>>Yes, that current_cred() thing is bogus.  (Actually, this is probably
>>exploitable right now if any cgroup.procs inode anywhere on the system
>>lets non-root write.)  (Can we have some kernel debugging option that
>>makes any use of current_cred() in write(2) warn?)
>>
>>We really need a weaker version of may_ptrace for this kind of stuff.
>>Maybe the existing may_ptrace stuff is okay, actually.  But this is
>>completely missing group checks, cap checks, capabilities wrt the
>>userns, etc.
>>
>>Also, I think that, if this version of the patchset allows non-init
>>userns to unshare cgroupns, then the issue of what permission is
>>needed to lock the cgroup hierarchy like that needs to be addressed,
>>because unshare(CLONE_NEWUSER|CLONE_NEWCGROUP) will effectively pin
>>the calling task with no permission required.  Bolting on a fix later
>>will be a mess.
>
> I imagine the pinning would be like the userns.
>
> Ah but there is a potentially serious issue with the pinning.
> With pinning we can make it impossible for root to move us to a different cgroup.
>
> I am not certain how serious that is but it bears thinking about.
> If we don't implement pinning we should be able to implement everything with just filesystem mount options, and no new namespace required.
>
> Sigh.
>
> I am too tired tonight to see the end game in this.

Possible solution:

Ditch the pinning.  That is, if you're outside a cgroupns (or you have
a non-ns-confined cgroupfs mounted), then you can move a task in a
cgroupns outside of its root cgroup.  If you do this, then the task
thinks its cgroup is something like "../foo" or "../../foo".

While we're at it, consider making setns for a cgroupns *not* change
the caller's cgroup.  Is there any reason it really needs to?

Thoughts?

--Andy

^ permalink raw reply	[flat|nested] 384+ messages in thread
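
To make the "../foo" idea concrete: if a task can end up outside its
cgroupns root, its cgroup has to be reported relative to that root.
The sketch below shows one way to compute such a path in userspace C.
cgroup_path_relative() is a hypothetical helper, and the output
convention ("." for the root itself, no leading '/') is an assumption
made for this example rather than anything specified in the thread.
Both inputs are assumed to be absolute paths without a trailing '/'.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: write into `out` the cgroup `path` as seen from
 * the namespace root `root`.  Returns 0 on success, -1 if `out` is too
 * small.
 */
static int cgroup_path_relative(const char *root, const char *path,
				char *out, size_t outlen)
{
	size_t i = 0, common = 0, used = 0, len;
	const char *p;

	assert(root && path && out);
	assert(root[0] == '/' && path[0] == '/');

	/* Longest common prefix that ends on a component boundary. */
	while (root[i] != '\0' && root[i] == path[i]) {
		if (root[i] == '/')
			common = i;
		i++;
	}
	if ((root[i] == '\0' || root[i] == '/') &&
	    (path[i] == '\0' || path[i] == '/'))
		common = i;

	/* One "../" for every root component that is not shared. */
	for (p = root + common; *p != '\0'; p++) {
		if (*p != '/' && p[-1] == '/') {
			if (used + 3 >= outlen)
				return -1;
			memcpy(out + used, "../", 3);
			used += 3;
		}
	}

	/* Append what is left of the target path, minus its leading '/'. */
	p = path + common;
	while (*p == '/')
		p++;
	len = strlen(p);
	if (used + len >= outlen)
		return -1;
	memcpy(out + used, p, len + 1);
	used += len;

	if (used == 0) {
		/* Same cgroup as the root. */
		if (outlen < 2)
			return -1;
		memcpy(out, ".", 2);
	} else if (out[used - 1] == '/') {
		/* Target is an ancestor: turn "../" into "..". */
		out[used - 1] = '\0';
	}
	return 0;
}
```

For a namespace rooted at "/a/b", a task moved to "/a/foo" is reported
as "../foo", while a task still at "/a/b/x1" is reported as "x1".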

* Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support
       [not found]                         ` <44072106-c0f3-46b8-b2b5-9b1cbd1b7d88-2ueSQiBKiTY7tOexoI0I+QC/G2K4zDHf@public.gmane.org>
@ 2014-10-21  0:20                           ` Andy Lutomirski
  0 siblings, 0 replies; 384+ messages in thread
From: Andy Lutomirski @ 2014-10-21  0:20 UTC (permalink / raw)
  To: Eric W.Biederman
  Cc: Serge E. Hallyn, Aditya Kali, Linux API, Linux Containers,
	Serge Hallyn, linux-kernel, Tejun Heo, cgroups, Ingo Molnar

On Sun, Oct 19, 2014 at 9:55 PM, Eric W.Biederman <ebiederm@xmission.com> wrote:
>
>
> On October 19, 2014 1:26:29 PM CDT, Andy Lutomirski <luto@amacapital.net> wrote:
>>On Sat, Oct 18, 2014 at 10:23 PM, Eric W. Biederman
>><ebiederm@xmission.com> wrote:
>>> "Serge E. Hallyn" <serge@hallyn.com> writes:
>>>
>>>> Quoting Aditya Kali (adityakali@google.com):
>>>>> On Thu, Oct 16, 2014 at 2:12 PM, Serge E. Hallyn <serge@hallyn.com>
>>wrote:
>>>>> > Quoting Aditya Kali (adityakali@google.com):
>>>>> >> setns on a cgroup namespace is allowed only if
>>>>> >> * task has CAP_SYS_ADMIN in its current user-namespace and
>>>>> >>   over the user-namespace associated with target cgroupns.
>>>>> >> * task's current cgroup is descendent of the target
>>cgroupns-root
>>>>> >>   cgroup.
>>>>> >
>>>>> > What is the point of this?
>>>>> >
>>>>> > If I'm a user logged into
>>>>> > /lxc/c1/user.slice/user-1000.slice/session-c12.scope and I start
>>>>> > a container which is in
>>>>> > /lxc/c1/user.slice/user-1000.slice/session-c12.scope/x1
>>>>> > then I will want to be able to enter the container's cgroup.
>>>>> > The container's cgroup root is under my own (satisfying the
>>>>> > below condition0 but my cgroup is not a descendent of the
>>>>> > container's cgroup.
>>>>> >
>>>>> This condition is there because we don't want to do implicit cgroup
>>>>> changes when a process attaches to another cgroupns. cgroupns tries
>>to
>>>>> preserve the invariant that at any point, your current cgroup is
>>>>> always under the cgroupns-root of your cgroup namespace. But in
>>your
>>>>> example, if we allow a process in "session-c12.scope" container to
>>>>> attach to cgroupns root'ed at "session-c12.scope/x1" container
>>>>> (without implicitly moving its cgroup), then this invariant won't
>>>>> hold.
>>>>
>>>> Oh, I see.  Guess that should be workable.  Thanks.
>>>
>>> Which has me looking at what the rules are for moving through
>>> the cgroup hierarchy.
>>>
>>> As long as we have write access to cgroup.procs and are allowed
>>> to open the file for write, we can move any of our own tasks
>>> into the cgroup.  So the cgroup namespace rules don't seem
>>> to be a problem.
>>>
>>> Andy can you please take a look at the permission checks in
>>> __cgroup_procs_write.
>>
>>The actual requirements for calling that function haven't changed,
>>right?  IOW, what does this have to do with cgroupns?
>
> Excluding user namespaces the requirements have not changed.
>
> The immediate correlation is that to enter a cgroupns you must first put your process in one of it's cgroups.
>
> So I was examining what it would take to enter the cgroup of cgroupns.
>
>> Is the idea
>>that you want a privileged user wrt a cgroupns's userns to be able to
>>use this?  If so:
>>
>>Yes, that current_cred() thing is bogus.  (Actually, this is probably
>>exploitable right now if any cgroup.procs inode anywhere on the system
>>lets non-root write.)  (Can we have some kernel debugging option that
>>makes any use of current_cred() in write(2) warn?)
>>
>>We really need a weaker version of may_ptrace for this kind of stuff.
>>Maybe the existing may_ptrace stuff is okay, actually.  But this is
>>completely missing group checks, cap checks, capabilities wrt the
>>userns, etc.
>>
>>Also, I think that, if this version of the patchset allows non-init
>>userns to unshare cgroupns, then the issue of what permission is
>>needed to lock the cgroup hierarchy like that needs to be addressed,
>>because unshare(CLONE_NEWUSER|CLONE_NEWCGROUP) will effectively pin
>>the calling task with no permission required.  Bolting on a fix later
>>will be a mess.
>
> I imagine the pinning would be like the userns.
>
> Ah but there is a potentially serious issue with the pinning.
> With pinning we can make it impossible for root to move us to a different cgroup.
>
> I am not certain how serious that is but it bears thinking about.
> If we don't implement pinning we should be able to implement everything with just filesystem mount options, and no new namespace required.
>
> Sigh.
>
> I am too tired tonight to see the end game in this.

Possible solution:

Ditch the pinning.  That is, if you're outside a cgroupns (or you have
a non-ns-confined cgroupfs mounted), then you can move a task in a
cgroupns outside of its root cgroup.  If you do this, then the task
thinks its cgroup is something like "../foo" or "../../foo".
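
To make the "../foo" idea concrete, here is a rough user-space sketch of
how a cgroup path could be rendered relative to a cgroupns root.  The
helper name cgroup_path_in_ns is made up for illustration, and this is
not the kernel implementation; it only assumes both arguments are
absolute paths in the global hierarchy.

```python
import posixpath


def cgroup_path_in_ns(ns_root, cgroup):
    """Render `cgroup` as seen from a namespace rooted at `ns_root`.

    Hypothetical helper: both arguments are absolute cgroup paths in the
    global hierarchy, e.g. "/batchjobs/c_job_id1".
    """
    rel = posixpath.relpath(cgroup, start=ns_root)
    # A task sitting exactly at the namespace root is shown as "/".
    if rel == ".":
        return "/"
    # A task moved outside the root gets a path that climbs out of it.
    if rel == ".." or rel.startswith("../"):
        return rel
    # Anything below the root looks like an ordinary absolute path.
    return "/" + rel
```

With a namespace rooted at /batchjobs/c_job_id1, a task that an outside
manager moved to /batchjobs/foo would see "../foo", and one moved to
/foo would see "../../foo".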

While we're at it, consider making setns for a cgroupns *not* change
the caller's cgroup.  Is there any reason it really needs to?

Thoughts?

--Andy

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support
       [not found]                           ` <CALCETrXhGnBM_xx=Auz3WRQXkqhGGTWuZN=PU+A9HZ7Ek27FLA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2014-10-21  4:49                             ` Eric W. Biederman
  0 siblings, 0 replies; 384+ messages in thread
From: Eric W. Biederman @ 2014-10-21  4:49 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Ingo Molnar, Tejun Heo,
	cgroups-u79uwXL29TY76Z2rM5mHXA

Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> writes:

> On Sun, Oct 19, 2014 at 9:55 PM, Eric W.Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>>
>>
>> On October 19, 2014 1:26:29 PM CDT, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:

>>> Is the idea
>>>that you want a privileged user wrt a cgroupns's userns to be able to
>>>use this?  If so:
>>>
>>>Yes, that current_cred() thing is bogus.  (Actually, this is probably
>>>exploitable right now if any cgroup.procs inode anywhere on the system
>>>lets non-root write.)  (Can we have some kernel debugging option that
>>>makes any use of current_cred() in write(2) warn?)
>>>
>>>We really need a weaker version of may_ptrace for this kind of stuff.
>>>Maybe the existing may_ptrace stuff is okay, actually.  But this is
>>>completely missing group checks, cap checks, capabilities wrt the
>>>userns, etc.
>>>
>>>Also, I think that, if this version of the patchset allows non-init
>>>userns to unshare cgroupns, then the issue of what permission is
>>>needed to lock the cgroup hierarchy like that needs to be addressed,
>>>because unshare(CLONE_NEWUSER|CLONE_NEWCGROUP) will effectively pin
>>>the calling task with no permission required.  Bolting on a fix later
>>>will be a mess.
>>
>> I imagine the pinning would be like the userns.
>>
>> Ah but there is a potentially serious issue with the pinning.
>> With pinning we can make it impossible for root to move us to a different cgroup.
>>
>> I am not certain how serious that is but it bears thinking about.
>> If we don't implement pinning we should be able to implement everything with just filesystem mount options, and no new namespace required.
>>
>> Sigh.
>>
>> I am too tired tonight to see the end game in this.
>
> Possible solution:
>
> Ditch the pinning.  That is, if you're outside a cgroupns (or you have
> a non-ns-confined cgroupfs mounted), then you can move a task in a
> cgroupns outside of its root cgroup.  If you do this, then the task
> thinks its cgroup is something like "../foo" or "../../foo".

Of the possible solutions that seems attractive to me, simply because
we sometimes want to allow clever things to occur.

Does anyone know of a reason (beyond pretty printing) why we need
cgroupns to restrict the subset of cgroups processes can be in?

I would expect permissions on the cgroup directories themselves, and
limited visibility, to be enough (in general) to achieve the desired
visibility.

> While we're at it, consider making setns for a cgroupns *not* change
> the caller's cgroup.  Is there any reason it really needs to?

setns doesn't, but nsenter is going to need to change the cgroup
if the pinning requirement is kept.  nsenter is going to want to
change the cgroup if the pinning requirement is dropped.

Eric

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support
  2014-10-21  4:49                             ` Eric W. Biederman
@ 2014-10-21  5:03                                 ` Andy Lutomirski
  -1 siblings, 0 replies; 384+ messages in thread
From: Andy Lutomirski @ 2014-10-21  5:03 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Ingo Molnar, Tejun Heo,
	cgroups-u79uwXL29TY76Z2rM5mHXA

On Mon, Oct 20, 2014 at 9:49 PM, Eric W. Biederman
<ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
> Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> writes:
>
>> On Sun, Oct 19, 2014 at 9:55 PM, Eric W.Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>>>
>>>
>>> On October 19, 2014 1:26:29 PM CDT, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>
>>>> Is the idea
>>>>that you want a privileged user wrt a cgroupns's userns to be able to
>>>>use this?  If so:
>>>>
>>>>Yes, that current_cred() thing is bogus.  (Actually, this is probably
>>>>exploitable right now if any cgroup.procs inode anywhere on the system
>>>>lets non-root write.)  (Can we have some kernel debugging option that
>>>>makes any use of current_cred() in write(2) warn?)
>>>>
>>>>We really need a weaker version of may_ptrace for this kind of stuff.
>>>>Maybe the existing may_ptrace stuff is okay, actually.  But this is
>>>>completely missing group checks, cap checks, capabilities wrt the
>>>>userns, etc.
>>>>
>>>>Also, I think that, if this version of the patchset allows non-init
>>>>userns to unshare cgroupns, then the issue of what permission is
>>>>needed to lock the cgroup hierarchy like that needs to be addressed,
>>>>because unshare(CLONE_NEWUSER|CLONE_NEWCGROUP) will effectively pin
>>>>the calling task with no permission required.  Bolting on a fix later
>>>>will be a mess.
>>>
>>> I imagine the pinning would be like the userns.
>>>
>>> Ah but there is a potentially serious issue with the pinning.
>>> With pinning we can make it impossible for root to move us to a different cgroup.
>>>
>>> I am not certain how serious that is but it bears thinking about.
>>> If we don't implement pinning we should be able to implement everything with just filesystem mount options, and no new namespace required.
>>>
>>> Sigh.
>>>
>>> I am too tired tonight to see the end game in this.
>>
>> Possible solution:
>>
>> Ditch the pinning.  That is, if you're outside a cgroupns (or you have
>> a non-ns-confined cgroupfs mounted), then you can move a task in a
>> cgroupns outside of its root cgroup.  If you do this, then the task
>> thinks its cgroup is something like "../foo" or "../../foo".
>
> Of the possible solutions that seems attractive to me, simply because
> we sometimes want to allow clever things to occur.
>
> Does anyone know of a reason (beyond pretty printing) why we need
> cgroupns to restrict the subset of cgroups processes can be in?
>
> I would expect permissions on the cgroup directories themselves, and
> limited visibility, to be enough (in general) to achieve the desired
> visibility.

This makes the security impact of cgroupns very easy to understand,
right?  Because there really won't be any -- cgroupns only affects
reads from /proc and what cgroupfs shows, but it doesn't change any
actual cgroups, nor does it affect any cgroup *changes*.
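
Since the only thing that changes is what a reader sees, it may help to
spell out what is being read.  A minimal sketch, assuming the usual
"hierarchy-ID:controller-list:path" line format of /proc/<pid>/cgroup
(as in the example from the cover letter); CgroupEntry and
parse_proc_cgroup are illustrative names, not an existing API:

```python
from typing import List, NamedTuple


class CgroupEntry(NamedTuple):
    hierarchy_id: int
    controllers: List[str]
    path: str


def parse_proc_cgroup(text):
    """Parse the contents of a /proc/<pid>/cgroup file into entries."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the first two colons only, so a ':' in the path survives.
        hid, controllers, path = line.split(":", 2)
        names = controllers.split(",") if controllers else []
        entries.append(CgroupEntry(int(hid), names, path))
    return entries
```

Under the proposal only the path field would differ between a reader
inside the cgroupns and one outside it; the task's actual cgroup
membership is the same either way.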

>
>> While we're at it, consider making setns for a cgroupns *not* change
>> the caller's cgroup.  Is there any reason it really needs to?
>
> setns doesn't, but nsenter is going to need to change the cgroup
> if the pinning requirement is kept.  nsenter is going to want to
> change the cgroup if the pinning requirement is dropped.
>

It seems easy enough for nsenter to change the cgroup all by itself.
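
For illustration, a sketch of the user-space half: as noted earlier in
the thread, a task with write access to a cgroup's cgroup.procs file can
move itself there by writing its pid.  The function name and the
cgroup_dir argument are assumptions made for this example; the
subsequent setns() call is left out.

```python
import os


def move_to_cgroup(cgroup_dir, pid=None):
    """Move `pid` (default: the calling process) into `cgroup_dir`.

    `cgroup_dir` is the cgroup's directory on a mounted cgroupfs; the
    write fails with the usual OSError if permission checks do not pass.
    """
    if pid is None:
        pid = os.getpid()
    procs_file = os.path.join(cgroup_dir, "cgroup.procs")
    with open(procs_file, "w") as f:
        f.write("%d\n" % pid)
```

A tool like nsenter could do this step first and then call setns() on
the target's cgroup namespace, so the kernel would not need to change
the cgroup implicitly.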

--Andy

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support
       [not found]                                 ` <CALCETrVkMtsnEh57jFZrdx5vHbz97BdO7OuupT+xVNnWpJjxng-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2014-10-21  5:42                                   ` Eric W. Biederman
  0 siblings, 0 replies; 384+ messages in thread
From: Eric W. Biederman @ 2014-10-21  5:42 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Ingo Molnar, Tejun Heo,
	cgroups-u79uwXL29TY76Z2rM5mHXA

Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> writes:

> On Mon, Oct 20, 2014 at 9:49 PM, Eric W. Biederman
> <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>> Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> writes:
>>
>>> On Sun, Oct 19, 2014 at 9:55 PM, Eric W.Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>>>>
>>>>
>>>> On October 19, 2014 1:26:29 PM CDT, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>>
>>>>> Is the idea
>>>>>that you want a privileged user wrt a cgroupns's userns to be able to
>>>>>use this?  If so:
>>>>>
>>>>>Yes, that current_cred() thing is bogus.  (Actually, this is probably
>>>>>exploitable right now if any cgroup.procs inode anywhere on the system
>>>>>lets non-root write.)  (Can we have some kernel debugging option that
>>>>>makes any use of current_cred() in write(2) warn?)
>>>>>
>>>>>We really need a weaker version of may_ptrace for this kind of stuff.
>>>>>Maybe the existing may_ptrace stuff is okay, actually.  But this is
>>>>>completely missing group checks, cap checks, capabilities wrt the
>>>>>userns, etc.
>>>>>
>>>>>Also, I think that, if this version of the patchset allows non-init
>>>>>userns to unshare cgroupns, then the issue of what permission is
>>>>>needed to lock the cgroup hierarchy like that needs to be addressed,
>>>>>because unshare(CLONE_NEWUSER|CLONE_NEWCGROUP) will effectively pin
>>>>>the calling task with no permission required.  Bolting on a fix later
>>>>>will be a mess.
>>>>
>>>> I imagine the pinning would be like the userns.
>>>>
>>>> Ah but there is a potentially serious issue with the pinning.
>>>> With pinning we can make it impossible for root to move us to a different cgroup.
>>>>
>>>> I am not certain how serious that is but it bears thinking about.
>>>> If we don't implement pinning we should be able to implement everything with just filesystem mount options, and no new namespace required.
>>>>
>>>> Sigh.
>>>>
>>>> I am too tired tonight to see the end game in this.
>>>
>>> Possible solution:
>>>
>>> Ditch the pinning.  That is, if you're outside a cgroupns (or you have
>>> a non-ns-confined cgroupfs mounted), then you can move a task in a
>>> cgroupns outside of its root cgroup.  If you do this, then the task
>>> thinks its cgroup is something like "../foo" or "../../foo".
>>
>> Of the possible solutions that seems attractive to me, simply because
>> we sometimes want to allow clever things to occur.
>>
>> Does anyone know of a reason (beyond pretty printing) why we need
>> cgroupns to restrict the subset of cgroups processes can be in?
>>
>> I would expect permissions on the cgroup directories themselves, and
>> limited visibility, to be enough (in general) to achieve the desired
>> visibility.
>
> This makes the security impact of cgroupns very easy to understand,
> right?  Because there really won't be any -- cgroupns only affects
> reads from /proc and what cgroupfs shows, but it doesn't change any
> actual cgroups, nor does it affect any cgroup *changes*.

It seems like what we have described is chcgrouproot aka chroot for
cgroups.  At which point I think there are potentially similar security
issues as for chroot.  Can we confuse a setuid root process if we make
its cgroup names look different?

Of course the confusing root concern is handled by the usual namespace
security checks that are already present.

I do wonder whether, if we think of this as chcgrouproot, there is a
simpler implementation.

>>> While we're at it, consider making setns for a cgroupns *not* change
>>> the caller's cgroup.  Is there any reason it really needs to?
>>
>> setns doesn't, but nsenter is going to need to change the cgroup
>> if the pinning requirement is kept.  nsenter is going to want to
>> change the cgroup if the pinning requirement is dropped.
>>
>
> It seems easy enough for nsenter to change the cgroup all by itself.

Again.  I don't think anyone has suggested or implemented anything
different.

Eric

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support
       [not found]                                 ` <CALCETrVkMtsnEh57jFZrdx5vHbz97BdO7OuupT+xVNnWpJjxng-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2014-10-21  5:42                                   ` Eric W. Biederman
  0 siblings, 0 replies; 384+ messages in thread
From: Eric W. Biederman @ 2014-10-21  5:42 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Serge E. Hallyn, Aditya Kali, Linux API, Linux Containers,
	Serge Hallyn, linux-kernel, Tejun Heo, cgroups, Ingo Molnar

Andy Lutomirski <luto@amacapital.net> writes:

> On Mon, Oct 20, 2014 at 9:49 PM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>> Andy Lutomirski <luto@amacapital.net> writes:
>>
>>> On Sun, Oct 19, 2014 at 9:55 PM, Eric W.Biederman <ebiederm@xmission.com> wrote:
>>>>
>>>>
>>>> On October 19, 2014 1:26:29 PM CDT, Andy Lutomirski <luto@amacapital.net> wrote:
>>
>>>>> Is the idea
>>>>>that you want a privileged user wrt a cgroupns's userns to be able to
>>>>>use this?  If so:
>>>>>
>>>>>Yes, that current_cred() thing is bogus.  (Actually, this is probably
>>>>>exploitable right now if any cgroup.procs inode anywhere on the system
>>>>>lets non-root write.)  (Can we have some kernel debugging option that
>>>>>makes any use of current_cred() in write(2) warn?)
>>>>>
>>>>>We really need a weaker version of may_ptrace for this kind of stuff.
>>>>>Maybe the existing may_ptrace stuff is okay, actually.  But this is
>>>>>completely missing group checks, cap checks, capabilities wrt the
>>>>>userns, etc.
>>>>>
>>>>>Also, I think that, if this version of the patchset allows non-init
>>>>>userns to unshare cgroupns, then the issue of what permission is
>>>>>needed to lock the cgroup hierarchy like that needs to be addressed,
>>>>>because unshare(CLONE_NEWUSER|CLONE_NEWCGROUP) will effectively pin
>>>>>the calling task with no permission required.  Bolting on a fix later
>>>>>will be a mess.
>>>>
>>>> I imagine the pinning would be like the userns.
>>>>
>>>> Ah but there is a potentially serious issue with the pinning.
>>>> With pinning we can make it impossible for root to move us to a different cgroup.
>>>>
>>>> I am not certain how serious that is but it bears thinking about.
>>>> If we don't implement pinning we should be able to implent everything with just filesystem mount options, and no new namespace required.
>>>>
>>>> Sigh.
>>>>
>>>> I am too tired tonight to see the end game in this.
>>>
>>> Possible solution:
>>>
>>> Ditch the pinning.  That is, if you're outside a cgroupns (or you have
>>> a non-ns-confined cgroupfs mounted), then you can move a task in a
>>> cgroupns outside of its root cgroup.  If you do this, then the task
>>> thinks its cgroup is something like "../foo" or "../../foo".
>>
>> Of the possible solutions that seems attractive to me, simply because
>> we sometimes want to allow clever things to occur.
>>
>> Does anyone know of a reason (beyond pretty printing) why we need
>> cgroupns to restrict the subset of cgroups processes can be in?
>>
>> I would expect permissions on the cgroup directories themselves, and
>> limited visiblilty would be (in general) to achieve the desired
>> visiblity.
>
> This makes the security impact of cgroupns very easy to understand,
> right?  Because there really won't be any -- cgroupns only affects
> reads from /proc and what cgroupfs shows, but it doesn't change any
> actual cgroups, nor does it affect any cgroup *changes*.

It seems like what we have described is chcgrouproot aka chroot for
cgroups.  At which point I think there are potentially similar security
issues as for chroot.  Can we confuse a setuid root process if we make
its cgroup names look different?

Of course the confusing root concern is handled by the usual namespace
security checks that are already present.

I do wonder if we think of this as chcgrouproot if there is a simpler
implementation.

>>> While we're at it, consider making setns for a cgroupns *not* change
>>> the caller's cgroup.  Is there any reason it really needs to?
>>
>> setns doesn't but nsenter is going to need to change the cgroup
>> if the pinning requirement is kept.  nsenter is going to want to
>> change the cgroup if the pinning requirement is dropped.
>>
>
> It seems easy enough for nsenter to change the cgroup all by itself.

Again.  I don't think anyone has suggested or implemented anything
different.

Eric

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support
  2014-10-21  5:42                                   ` Eric W. Biederman
@ 2014-10-21  5:49                                       ` Andy Lutomirski
  -1 siblings, 0 replies; 384+ messages in thread
From: Andy Lutomirski @ 2014-10-21  5:49 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Ingo Molnar, Tejun Heo,
	cgroups-u79uwXL29TY76Z2rM5mHXA

On Mon, Oct 20, 2014 at 10:42 PM, Eric W. Biederman
<ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
> Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> writes:
>
>> On Mon, Oct 20, 2014 at 9:49 PM, Eric W. Biederman
>> <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>>> Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> writes:
>>>> Possible solution:
>>>>
>>>> Ditch the pinning.  That is, if you're outside a cgroupns (or you have
>>>> a non-ns-confined cgroupfs mounted), then you can move a task in a
>>>> cgroupns outside of its root cgroup.  If you do this, then the task
>>>> thinks its cgroup is something like "../foo" or "../../foo".
>>>
>>> Of the possible solutions that seems attractive to me, simply because
>>> we sometimes want to allow clever things to occur.
>>>
>>> Does anyone know of a reason (beyond pretty printing) why we need
>>> cgroupns to restrict the subset of cgroups processes can be in?
>>>
>>> I would expect permissions on the cgroup directories themselves, and
>>> limited visibility would be (in general) to achieve the desired
>>> visibility.
>>
>> This makes the security impact of cgroupns very easy to understand,
>> right?  Because there really won't be any -- cgroupns only affects
>> reads from /proc and what cgroupfs shows, but it doesn't change any
>> actual cgroups, nor does it affect any cgroup *changes*.
>
> It seems like what we have described is chcgrouproot aka chroot for
> cgroups.  At which point I think there are potentially similar security
> issues as for chroot.  Can we confuse a setuid root process if we make
> its cgroup names look different?
>
> Of course the confusing root concern is handled by the usual namespace
> security checks that are already present.

I think that the chroot issues are mostly in two categories: setuid
confusion (not an issue here as you described) and chroot escapes.
cgroupns escapes aren't a big deal, I think -- admins should deny the
confined task the right to write to cgroupfs outside its hierarchy, by
setting cgroupfs permissions appropriately and/or avoiding mounting
cgroupfs outside the hierarchy.

>
> I do wonder if we think of this as chcgrouproot if there is a simpler
> implementation.

Could be.  I'll defer to Aditya for that one.

>
>>>> While we're at it, consider making setns for a cgroupns *not* change
>>>> the caller's cgroup.  Is there any reason it really needs to?
>>>
>>> setns doesn't but nsenter is going to need to change the cgroup
>>> if the pinning requirement is kept.  nsenter is going to want to
>>> change the cgroup if the pinning requirement is dropped.
>>>
>>
>> It seems easy enough for nsenter to change the cgroup all by itself.
>
> Again.  I don't think anyone has suggested or implemented anything
> different.

The current patchset seems to punt on this decision by just failing
the setns call if the caller is outside the cgroup in question.

--Andy

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support
       [not found]                                       ` <CALCETrVFKvtHpTfY3kuE5ZTrwQAzuDmk6dm-mbQffDHAZmq-KQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2014-10-21 18:49                                         ` Aditya Kali
  0 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-10-21 18:49 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Ingo Molnar,
	Eric W. Biederman, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA

On Mon, Oct 20, 2014 at 10:49 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
> On Mon, Oct 20, 2014 at 10:42 PM, Eric W. Biederman
> <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>> Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> writes:
>>
>>> On Mon, Oct 20, 2014 at 9:49 PM, Eric W. Biederman
>>> <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>>>> Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> writes:
>>>>> Possible solution:
>>>>>
>>>>> Ditch the pinning.  That is, if you're outside a cgroupns (or you have
>>>>> a non-ns-confined cgroupfs mounted), then you can move a task in a
>>>>> cgroupns outside of its root cgroup.  If you do this, then the task
>>>>> thinks its cgroup is something like "../foo" or "../../foo".
>>>>
>>>> Of the possible solutions that seems attractive to me, simply because
>>>> we sometimes want to allow clever things to occur.
>>>>
>>>> Does anyone know of a reason (beyond pretty printing) why we need
>>>> cgroupns to restrict the subset of cgroups processes can be in?
>>>>
>>>> I would expect permissions on the cgroup directories themselves, and
>>>> limited visibility would be (in general) to achieve the desired
>>>> visibility.
>>>
>>> This makes the security impact of cgroupns very easy to understand,
>>> right?  Because there really won't be any -- cgroupns only affects
>>> reads from /proc and what cgroupfs shows, but it doesn't change any
>>> actual cgroups, nor does it affect any cgroup *changes*.
>>
>> It seems like what we have described is chcgrouproot aka chroot for
>> cgroups.  At which point I think there are potentially similar security
>> issues as for chroot.  Can we confuse a setuid root process if we make
>> its cgroup names look different?
>>
>> Of course the confusing root concern is handled by the usual namespace
>> security checks that are already present.
>
> I think that the chroot issues are mostly in two categories: setuid
> confusion (not an issue here as you described) and chroot escapes.
> cgroupns escapes aren't a big deal, I think -- admins should deny the
> confined task the right to write to cgroupfs outside its hierarchy, by
> setting cgroupfs permissions appropriately and/or avoiding mounting
> cgroupfs outside the hierarchy.
>
>>
>> I do wonder if we think of this as chcgrouproot if there is a simpler
>> implementation.
>
> Could be.  I'll defer to Aditya for that one.
>

More than chcgrouproot, it's probably closer to pivot_cgroup_root. In
addition to restricting the process to a cgroup-root, new processes
entering the container should also be implicitly contained within the
cgroup-root of that container. Implementing pivot_cgroup_root would
probably involve overloading mount-namespace to now understand cgroup
filesystem too. I did attempt combining cgroupns-root with mntns
earlier (not via a new syscall though), but came to the conclusion
that it's just simpler to have a separate cgroup namespace and get
clear semantics. One of the issues was that implicitly changing cgroup
on setns to mntns seemed like a huge undesirable side-effect.

About pinning: I really feel that it should be OK to pin processes
within cgroupns-root. I think that's one of the most important features
of cgroup-namespace since its most common use case is to containerize
untrusted processes - processes that, for their entire lifetime, need
to remain inside their container. And with explicit permission from
cgroup subsystem (something like cgroup.may_unshare as you had
suggested previously), we can make sure that unprivileged processes
cannot pin themselves. Also, maintaining this invariant (your current
cgroup is always under your cgroupns-root) keeps the code and the
semantics simple.
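
To make that invariant concrete, here is a minimal sketch in Python. It is
not code from the patchset: the name is_within_ns_root and the use of
absolute cgroup path strings are assumptions made for illustration, and the
kernel would presumably check cgroup ancestry rather than compare strings.

```python
import posixpath


def is_within_ns_root(ns_root: str, cgroup: str) -> bool:
    """Return True if `cgroup` is `ns_root` itself or one of its descendants.

    Both arguments are absolute cgroup paths as seen from the host,
    e.g. "/batchjobs/c_job_id1". (Hypothetical helper, illustration only.)
    """
    ns_root = posixpath.normpath(ns_root)
    cgroup = posixpath.normpath(cgroup)
    # commonpath compares whole path components, so a sibling such as
    # "/batchjobs/c_job_id10" is not mistaken for a child of
    # "/batchjobs/c_job_id1".
    return posixpath.commonpath([ns_root, cgroup]) == ns_root
```

With pinning, any attempt to move a task to a cgroup for which this check
is false would be refused; without pinning, the move is allowed and only
the displayed path changes.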

If we ditch the pinning requirement and allow the containerized
process to move outside of its cgroupns-root, we will have to address
at least the following:
* what does its /proc/self/cgroup (and /proc/<pid>/cgroup in general)
look like? We might need to just not show anything in
/proc/<pid>/cgroup in such a case (for default hierarchy).
* how should future setns() and unshare() by such process behave?
* 'mount -t cgroup cgroup <mnt>' by such a process will yield unexpected result
* container will not remain migratable
* added code complexity to handle above scenarios

I understand that having a process pinned to a cgroup hierarchy might
seem inconvenient. But even today (without cgroup namespaces), moving
a task from one cgroup to another can fail for reasons outside the
control of the task attempting the move (even if it's privileged). So
userspace should already handle this scenario. I feel it's not
worth adding complexity to the kernel for this.

>>
>>>>> While we're at it, consider making setns for a cgroupns *not* change
>>>>> the caller's cgroup.  Is there any reason it really needs to?
>>>>
>>>> setns doesn't but nsenter is going to need to change the cgroup
>>>> if the pinning requirement is kept.  nsenter is going to want to
>>>> change the cgroup if the pinning requirement is dropped.
>>>>
>>>
>>> It seems easy enough for nsenter to change the cgroup all by itself.
>>
>> Again.  I don't think anyone has suggested or implemented anything
>> different.
>
> The current patchset seems to punt on this decision by just failing
> the setns call if the caller is outside the cgroup in question.
>
> --Andy

-- 
Aditya

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support
  2014-10-21 18:49                                         ` Aditya Kali
@ 2014-10-21 19:02                                             ` Andy Lutomirski
  -1 siblings, 0 replies; 384+ messages in thread
From: Andy Lutomirski @ 2014-10-21 19:02 UTC (permalink / raw)
  To: Aditya Kali
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Ingo Molnar,
	Eric W. Biederman, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA

On Tue, Oct 21, 2014 at 11:49 AM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
> On Mon, Oct 20, 2014 at 10:49 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>> On Mon, Oct 20, 2014 at 10:42 PM, Eric W. Biederman
>> <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>>>
>>> I do wonder if we think of this as chcgrouproot if there is a simpler
>>> implementation.
>>
>> Could be.  I'll defer to Aditya for that one.
>>
>
> More than chcgrouproot, it's probably closer to pivot_cgroup_root. In
> addition to restricting the process to a cgroup-root, new processes
> entering the container should also be implicitly contained within the
> cgroup-root of that container.

Why?  Concretely, why should this be in the kernel namespace code
instead of in userspace?

> Implementing pivot_cgroup_root would
> probably involve overloading mount-namespace to now understand cgroup
> filesystem too. I did attempt combining cgroupns-root with mntns
> earlier (not via a new syscall though), but came to the conclusion
> that it's just simpler to have a separate cgroup namespace and get
> clear semantics. One of the issues was that implicitly changing cgroup
> on setns to mntns seemed like a huge undesirable side-effect.
>
> About pinning: I really feel that it should be OK to pin processes
> within cgroupns-root. I think thats one of the most important feature
> of cgroup-namespace since its most common usecase is to containerize
> un-trusted processes - processes that, for their entire lifetime, need
> to remain inside their container.

So don't let them out.  None of the other namespaces have this kind of
constraint:

 - If you're in a mntns, you can still use fds from outside.
 - If you're in a netns, you can still use sockets from outside the namespace.
 - If you're in an ipcns, you can still use ipc handles from outside.

etc.

> And with explicit permission from
> cgroup subsystem (something like cgroup.may_unshare as you had
> suggested previously), we can make sure that unprivileged processes
> cannot pin themselves. Also, maintaining this invariant (your current
> cgroup is always under your cgroupns-root) keeps the code and the
> semantics simple.

I actually think it makes the semantics more complex.  The less policy
you stick in the kernel, the easier it is to understand the impact of
that policy.

>
> If we ditch the pinning requirement and allow the containarized
> process to move outside of its cgroupns-root, we will have to address
> atleast the following:
> * what does its /proc/self/cgroup  (and /proc/<pid>/cgroup in general)
> look like? We might need to just not show anything in
> /proc/<pid>/cgroup in such case (for default hierarchy).

The process should see the cgroup path relative to its cgroup ns.
Whether this requires a new /proc mount or happens automatically is an
open question.  (I *hate* procfs for reasons like this.)
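
As a rough illustration of "path relative to its cgroup ns" (a sketch only, not code from the patch set): the function name and the "/.."-style rendering for a cgroup outside the namespace root are assumptions made here for illustration.

```python
import posixpath


def cgroup_path_in_ns(abs_cgroup, ns_root):
    """Render an absolute cgroup path as seen from a cgroup namespace root.

    Hypothetical helper: both arguments are absolute paths within one
    cgroup hierarchy, e.g. "/batchjobs/c_job_id1".
    """
    rel = posixpath.relpath(abs_cgroup, ns_root)
    # relpath() returns "." when the two paths are the same directory.
    if rel == ".":
        return "/"
    # A cgroup outside the root comes back as "../..." and is shown as such.
    return "/" + rel


print(cgroup_path_in_ns("/batchjobs/c_job_id1/sub", "/batchjobs/c_job_id1"))
print(cgroup_path_in_ns("/batchjobs/c_job_id1", "/batchjobs/c_job_id1"))
print(cgroup_path_in_ns("/batchjobs/c_job_id2", "/batchjobs/c_job_id1"))
```

The third call is the unpinned case under discussion: the task sits outside its namespace root, so any relative rendering has to step upward through "..".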

> * how should future setns() and unshare() by such process behave?

Open question.

> * 'mount -t cgroup cgroup <mnt>' by such a process will yield unexpected result

You could disallow that and instead require 'mount -t cgroup -o
cgrouproot=. cgroup mnt' where '.' will be resolved at mount time
relative to the caller's cgroupns.
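
A minimal sketch of how such a mount-time resolution might behave. Note that "cgrouproot=" is only a proposal in this message, not an existing mount option; the function name and the choice to reject values that escape the caller's namespace root are assumptions made for illustration.

```python
import posixpath


def resolve_cgrouproot(caller_ns_root, option_value):
    """Resolve a hypothetical 'cgrouproot=' value against the caller's
    cgroup namespace root, at mount time.

    Returns the absolute cgroup path that would become the mount's root.
    Raises PermissionError if the value points outside the namespace root
    (an assumed policy, not something specified in the thread).
    """
    root = posixpath.normpath(caller_ns_root)
    resolved = posixpath.normpath(posixpath.join(root, option_value))
    # Compare on whole path components so "/a/b10" is not treated as
    # being inside "/a/b1".
    prefix = root.rstrip("/") + "/"
    if resolved != root and not resolved.startswith(prefix):
        raise PermissionError("cgrouproot escapes the caller's cgroupns")
    return resolved


print(resolve_cgrouproot("/batchjobs/c_job_id1", "."))
print(resolve_cgrouproot("/batchjobs/c_job_id1", "workers/w0"))
```

Resolving "." once at mount time means the mount keeps that root even if the mounting task is later moved to another cgroup.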

> * container will not remain migratable

Why not?

> * added code complexity to handle above scenarios
>
> I understand that having process pinned to a cgroup hierarchy might
> seem inconvenient. But even today (without cgroup namespaces), moving
> a task from one cgroup to another can fail for reasons outside of
> control of the task attempting the move (even if its privileged). So
> the userspace should already handle this scenario. I feel its not
> worth to add complexity in the kernel for this.

--Andy

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support
  2014-10-21 19:02                                             ` Andy Lutomirski
@ 2014-10-21 22:33                                                 ` Aditya Kali
  -1 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-10-21 22:33 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Ingo Molnar,
	Eric W. Biederman, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA

On Tue, Oct 21, 2014 at 12:02 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
> On Tue, Oct 21, 2014 at 11:49 AM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>> On Mon, Oct 20, 2014 at 10:49 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>>> On Mon, Oct 20, 2014 at 10:42 PM, Eric W. Biederman
>>> <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>>>>
>>>> I do wonder if we think of this as chcgrouproot if there is a simpler
>>>> implementation.
>>>
>>> Could be.  I'll defer to Aditya for that one.
>>>
>>
>> More than chcgrouproot, its probably closer to pivot_cgroup_root. In
>> addition to restricting the process to a cgroup-root, new processes
>> entering the container should also be implicitly contained within the
>> cgroup-root of that container.
>
> Why?  Concretely, why should this be in the kernel namespace code
> instead of in userspace?
>

Userspace can do it too, though then there is the possibility of
having processes in the same mount namespace with different
cgroup-roots. Deriving the contents of /proc/<pid>/cgroup becomes even
more complex. That's another reason why it might not be a good idea to
tie cgroups to the mount namespace.

>> Implementing pivot_cgroup_root would
>> probably involve overloading mount-namespace to now understand cgroup
>> filesystem too. I did attempt combining cgroupns-root with mntns
>> earlier (not via a new syscall though), but came to the conclusion
>> that its just simpler to have a separate cgroup namespace and get
>> clear semantics. One of the issues was that implicitly changing cgroup
>> on setns to mntns seemed like a huge undesirable side-effect.
>>
>> About pinning: I really feel that it should be OK to pin processes
>> within cgroupns-root. I think thats one of the most important feature
>> of cgroup-namespace since its most common usecase is to containerize
>> un-trusted processes - processes that, for their entire lifetime, need
>> to remain inside their container.
>
> So don't let them out.  None of the other namespaces have this kind of
> constraint:
>
>  - If you're in a mntns, you can still use fds from outside.
>  - If you're in a netns, you can still use sockets from outside the namespace.
>  - If you're in an ipcns, you can still use ipc handles from outside.

But none of the namespaces allow you to allocate new fds/sockets/ipc
handles in the outside namespace. I think moving a process outside of
cgroupns-root is like allocating a resource outside of your namespace.

>
> etc.

>
>> And with explicit permission from
>> cgroup subsystem (something like cgroup.may_unshare as you had
>> suggested previously), we can make sure that unprivileged processes
>> cannot pin themselves. Also, maintaining this invariant (your current
>> cgroup is always under your cgroupns-root) keeps the code and the
>> semantics simple.
>
> I actually think it makes the semantics more complex.  The less policy
> you stick in the kernel, the easier it is to understand the impact of
> that policy.
>

My inclination is towards keeping things simpler, in both code and
configuration. I agree that cgroupns might seem "less flexible", but
in its current form it encourages consistent container configuration.
If you have a process that needs to move around between cgroups
belonging to different containers, then that process should probably
not be inside any container's cgroup namespace. Allowing that will
just make the cgroup namespace pretty much meaningless.
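
The pinning invariant being argued for ("your current cgroup is always under your cgroupns-root") amounts to a descendant check on the target cgroup. The sketch below is a userspace illustration only; the function names are hypothetical and this is not the check implemented in the patches.

```python
import posixpath


def is_under_ns_root(target_cgroup, ns_root):
    """True if target_cgroup is ns_root itself or one of its descendants.

    Both arguments are absolute cgroup paths in the same hierarchy.
    """
    target = posixpath.normpath(target_cgroup)
    root = posixpath.normpath(ns_root)
    # commonpath() compares whole components, so "/a/b10" is not
    # mistaken for a child of "/a/b1".
    return posixpath.commonpath([target, root]) == root


def may_migrate(target_cgroup, ns_root):
    """Hypothetical policy hook: allow a move only inside the ns root."""
    return is_under_ns_root(target_cgroup, ns_root)


print(may_migrate("/batchjobs/c_job_id1/workers", "/batchjobs/c_job_id1"))
print(may_migrate("/batchjobs/c_job_id10", "/batchjobs/c_job_id1"))
```

Without pinning, the same check could still be applied by a container manager in userspace, which is the alternative raised elsewhere in this thread.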

>>
>> If we ditch the pinning requirement and allow the containarized
>> process to move outside of its cgroupns-root, we will have to address
>> atleast the following:
>> * what does its /proc/self/cgroup  (and /proc/<pid>/cgroup in general)
>> look like? We might need to just not show anything in
>> /proc/<pid>/cgroup in such case (for default hierarchy).
>
> The process should see the cgroup path relative to its cgroup ns.
> Whether this requires a new /proc mount or happens automatically is an
> open question.  (I *hate* procfs for reasons like this.)
>
>> * how should future setns() and unshare() by such process behave?
>
> Open question.
>
>> * 'mount -t cgroup cgroup <mnt>' by such a process will yield unexpected result
>
> You could disallow that and instead require 'mount -t cgroup -o
> cgrouproot=. cgroup mnt' where '.' will be resolved at mount time
> relative to the caller's cgroupns.
>
>> * container will not remain migratable
>
> Why not?
>

Well, the processes running outside of the cgroupns root will be exposed
to information outside of the container (i.e., their /proc/self/cgroup
will show paths involving other containers and potentially system-level
information). So unless you also restore those outside cgroups, it will
be difficult to restore these processes. The whole point of virtualizing
the /proc/self/cgroup view was so that the processes don't see outside
cgroups.
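
For reference, each line of /proc/<pid>/cgroup has the form "hierarchy-ID:controller-list:path", and the path field is what leaks host naming. A small sketch of a parser follows, using the sample line from the cover letter of this series; the function name is made up for illustration.

```python
def parse_proc_cgroup(text):
    """Parse the contents of /proc/<pid>/cgroup.

    Returns a list of (hierarchy_id, controllers, path) tuples.
    """
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split at most twice: the path itself may contain ':'.
        hierarchy_id, controllers, path = line.split(":", 2)
        names = controllers.split(",") if controllers else []
        entries.append((int(hierarchy_id), names, path))
    return entries


sample = "0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1\n"
for hierarchy_id, controllers, path in parse_proc_cgroup(sample):
    print(hierarchy_id, controllers, path)
```

A checkpoint/restore tool that records this path has to be able to recreate it on the destination machine, which is why a path naming other containers is a problem for migration.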

>> * added code complexity to handle above scenarios
>>
>> I understand that having process pinned to a cgroup hierarchy might
>> seem inconvenient. But even today (without cgroup namespaces), moving
>> a task from one cgroup to another can fail for reasons outside of
>> control of the task attempting the move (even if its privileged). So
>> the userspace should already handle this scenario. I feel its not
>> worth to add complexity in the kernel for this.
>
> --Andy



-- 
Aditya

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support
       [not found]                                                 ` <CAGr1F2FdQ4VF1_o7mdybZ-WhLLhFxdgkNnzotHOwnhLU8W+YCw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2014-10-21 22:42                                                   ` Andy Lutomirski
  0 siblings, 0 replies; 384+ messages in thread
From: Andy Lutomirski @ 2014-10-21 22:42 UTC (permalink / raw)
  To: Aditya Kali
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Ingo Molnar,
	Eric W. Biederman, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA

On Tue, Oct 21, 2014 at 3:33 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
> On Tue, Oct 21, 2014 at 12:02 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>> On Tue, Oct 21, 2014 at 11:49 AM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>>> On Mon, Oct 20, 2014 at 10:49 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>>>> On Mon, Oct 20, 2014 at 10:42 PM, Eric W. Biederman
>>>> <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>>>>>
>>>>> I do wonder if we think of this as chcgrouproot if there is a simpler
>>>>> implementation.
>>>>
>>>> Could be.  I'll defer to Aditya for that one.
>>>>
>>>
>>> More than chcgrouproot, its probably closer to pivot_cgroup_root. In
>>> addition to restricting the process to a cgroup-root, new processes
>>> entering the container should also be implicitly contained within the
>>> cgroup-root of that container.
>>
>> Why?  Concretely, why should this be in the kernel namespace code
>> instead of in userspace?
>>
>
> Userspace can do it too. Though then there will be possibility of
> having processes in the same mount namespace with different
> cgroup-roots. Deriving contents of /proc/<pid>/cgroup becomes even
> more complex. Thats another reason why it might not be good idea to
> tie cgroups with mount namespace.
>
>>> Implementing pivot_cgroup_root would
>>> probably involve overloading mount-namespace to now understand cgroup
>>> filesystem too. I did attempt combining cgroupns-root with mntns
>>> earlier (not via a new syscall though), but came to the conclusion
>>> that its just simpler to have a separate cgroup namespace and get
>>> clear semantics. One of the issues was that implicitly changing cgroup
>>> on setns to mntns seemed like a huge undesirable side-effect.
>>>
>>> About pinning: I really feel that it should be OK to pin processes
>>> within cgroupns-root. I think thats one of the most important feature
>>> of cgroup-namespace since its most common usecase is to containerize
>>> un-trusted processes - processes that, for their entire lifetime, need
>>> to remain inside their container.
>>
>> So don't let them out.  None of the other namespaces have this kind of
>> constraint:
>>
>>  - If you're in a mntns, you can still use fds from outside.
>>  - If you're in a netns, you can still use sockets from outside the namespace.
>>  - If you're in an ipcns, you can still use ipc handles from outside.
>
> But none of the namespaces allow you to allocate new fds/sockets/ipc
> handles in the outside namespace. I think moving a process outside of
> cgroupns-root is like allocating a resource outside of your namespace.

In a pidns, you can see outside tasks if you have an outside procfs
mounted, but, if you don't, then you can't.  Wouldn't cgroupns be just
like that?  You wouldn't be able to escape your cgroup as long as you
don't have an inappropriate cgroupfs mounted.


>>
>>> And with explicit permission from
>>> cgroup subsystem (something like cgroup.may_unshare as you had
>>> suggested previously), we can make sure that unprivileged processes
>>> cannot pin themselves. Also, maintaining this invariant (your current
>>> cgroup is always under your cgroupns-root) keeps the code and the
>>> semantics simple.
>>
>> I actually think it makes the semantics more complex.  The less policy
>> you stick in the kernel, the easier it is to understand the impact of
>> that policy.
>>
>
> My inclination is towards keeping things simpler - both in code as
> well as in configuration. I agree that cgroupns might seem
> "less-flexible", but in its current form, it encourages consistent
> container configuration. If you have a process that needs to move
> around between cgroups belonging to different containers, then that
> process should probably not be inside any container's cgroup
> namespace. Allowing that will just make the cgroup namespace
> pretty-much meaningless.

The problem with pinning is that preventing it causes problems
(specifically, either something potentially complex and incompatible
needs to be added or unprivileged processes will be able to pin
themselves).

Unless I'm missing something, a normal cgroupns user doesn't actually
need kernel pinning support to effectively constrain its members'
cgroups.

>
>>>
>>> If we ditch the pinning requirement and allow the containarized
>>> process to move outside of its cgroupns-root, we will have to address
>>> atleast the following:
>>> * what does its /proc/self/cgroup  (and /proc/<pid>/cgroup in general)
>>> look like? We might need to just not show anything in
>>> /proc/<pid>/cgroup in such case (for default hierarchy).
>>
>> The process should see the cgroup path relative to its cgroup ns.
>> Whether this requires a new /proc mount or happens automatically is an
>> open question.  (I *hate* procfs for reasons like this.)
>>
>>> * how should future setns() and unshare() by such process behave?
>>
>> Open question.
>>
>>> * 'mount -t cgroup cgroup <mnt>' by such a process will yield unexpected result
>>
>> You could disallow that and instead require 'mount -t cgroup -o
>> cgrouproot=. cgroup mnt' where '.' will be resolved at mount time
>> relative to the caller's cgroupns.
>>
>>> * container will not remain migratable
>>
>> Why not?
>>
>
> Well, the processes running outside of cgroupns root will be exposed
> to information outside of the container (i.e., its /proc/self/cgroup
> will show paths involving other containers and potentially system
> level information). So unless you even restore them, it will be
> difficult to restore these processes. The whole point of virtualizing
> the /proc/self/cgroup view was so that the processes don't see outside
> cgroups.
>

So don't do that?

--Andy

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support
@ 2014-10-21 22:42                                                   ` Andy Lutomirski
  0 siblings, 0 replies; 384+ messages in thread
From: Andy Lutomirski @ 2014-10-21 22:42 UTC (permalink / raw)
  To: Aditya Kali
  Cc: Eric W. Biederman, Serge E. Hallyn, Linux API, Linux Containers,
	Serge Hallyn, linux-kernel, Tejun Heo, cgroups, Ingo Molnar

On Tue, Oct 21, 2014 at 3:33 PM, Aditya Kali <adityakali@google.com> wrote:
> On Tue, Oct 21, 2014 at 12:02 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Tue, Oct 21, 2014 at 11:49 AM, Aditya Kali <adityakali@google.com> wrote:
>>> On Mon, Oct 20, 2014 at 10:49 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>> On Mon, Oct 20, 2014 at 10:42 PM, Eric W. Biederman
>>>> <ebiederm@xmission.com> wrote:
>>>>>
>>>>> I do wonder if we think of this as chcgrouproot if there is a simpler
>>>>> implementation.
>>>>
>>>> Could be.  I'll defer to Aditya for that one.
>>>>
>>>
>>> More than chcgrouproot, its probably closer to pivot_cgroup_root. In
>>> addition to restricting the process to a cgroup-root, new processes
>>> entering the container should also be implicitly contained within the
>>> cgroup-root of that container.
>>
>> Why?  Concretely, why should this be in the kernel namespace code
>> instead of in userspace?
>>
>
> Userspace can do it too. Though then there will be possibility of
> having processes in the same mount namespace with different
> cgroup-roots. Deriving contents of /proc/<pid>/cgroup becomes even
> more complex. Thats another reason why it might not be good idea to
> tie cgroups with mount namespace.
>
>>> Implementing pivot_cgroup_root would
>>> probably involve overloading mount-namespace to now understand cgroup
>>> filesystem too. I did attempt combining cgroupns-root with mntns
>>> earlier (not via a new syscall though), but came to the conclusion
>>> that its just simpler to have a separate cgroup namespace and get
>>> clear semantics. One of the issues was that implicitly changing cgroup
>>> on setns to mntns seemed like a huge undesirable side-effect.
>>>
>>> About pinning: I really feel that it should be OK to pin processes
>>> within cgroupns-root. I think thats one of the most important feature
>>> of cgroup-namespace since its most common usecase is to containerize
>>> un-trusted processes - processes that, for their entire lifetime, need
>>> to remain inside their container.
>>
>> So don't let them out.  None of the other namespaces have this kind of
>> constraint:
>>
>>  - If you're in a mntns, you can still use fds from outside.
>>  - If you're in a netns, you can still use sockets from outside the namespace.
>>  - If you're in an ipcns, you can still use ipc handles from outside.
>
> But none of the namespaces allow you to allocate new fds/sockets/ipc
> handles in the outside namespace. I think moving a process outside of
> cgroupns-root is like allocating a resource outside of your namespace.

In a pidns, you can see outside tasks if you have an outside procfs
mounted, but, if you don't, then you can't.  Wouldn't cgroupns be just
like that?  You wouldn't be able to escape your cgroup as long as you
don't have an inappropriate cgroupfs mounted.


>>
>>> And with explicit permission from
>>> cgroup subsystem (something like cgroup.may_unshare as you had
>>> suggested previously), we can make sure that unprivileged processes
>>> cannot pin themselves. Also, maintaining this invariant (your current
>>> cgroup is always under your cgroupns-root) keeps the code and the
>>> semantics simple.
>>
>> I actually think it makes the semantics more complex.  The less policy
>> you stick in the kernel, the easier it is to understand the impact of
>> that policy.
>>
>
> My inclination is towards keeping things simpler - both in code as
> well as in configuration. I agree that cgroupns might seem
> "less-flexible", but in its current form, it encourages consistent
> container configuration. If you have a process that needs to move
> around between cgroups belonging to different containers, then that
> process should probably not be inside any container's cgroup
> namespace. Allowing that will just make the cgroup namespace
> pretty-much meaningless.

The problem with pinning is that preventing it causes problems
(specifically, either something potentially complex and incompatible
needs to be added or unprivileged processes will be able to pin
themselves).

Unless I'm missing something, a normal cgroupns user doesn't actually
need kernel pinning support to effectively constrain its members'
cgroups.

>
>>>
>>> If we ditch the pinning requirement and allow the containarized
>>> process to move outside of its cgroupns-root, we will have to address
>>> atleast the following:
>>> * what does its /proc/self/cgroup  (and /proc/<pid>/cgroup in general)
>>> look like? We might need to just not show anything in
>>> /proc/<pid>/cgroup in such case (for default hierarchy).
>>
>> The process should see the cgroup path relative to its cgroup ns.
>> Whether this requires a new /proc mount or happens automatically is an
>> open question.  (I *hate* procfs for reasons like this.)
>>
>>> * how should future setns() and unshare() by such process behave?
>>
>> Open question.
>>
>>> * 'mount -t cgroup cgroup <mnt>' by such a process will yield unexpected result
>>
>> You could disallow that and instead require 'mount -t cgroup -o
>> cgrouproot=. cgroup mnt' where '.' will be resolved at mount time
>> relative to the caller's cgroupns.
>>
>>> * container will not remain migratable
>>
>> Why not?
>>
>
> Well, the processes running outside of cgroupns root will be exposed
> to information outside of the container (i.e., its /proc/self/cgroup
> will show paths involving other containers and potentially system
> level information). So unless you even restore them, it will be
> difficult to restore these processes. The whole point of virtualizing
> the /proc/self/cgroup view was so that the processes don't see outside
> cgroups.
>

So don't do that?

--Andy

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support
  2014-10-21 22:42                                                   ` Andy Lutomirski
@ 2014-10-22  0:46                                                       ` Aditya Kali
  -1 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-10-22  0:46 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel, Ingo Molnar,
	Eric W. Biederman, Tejun Heo, cgroups

On Tue, Oct 21, 2014 at 3:42 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Tue, Oct 21, 2014 at 3:33 PM, Aditya Kali <adityakali@google.com> wrote:
>> On Tue, Oct 21, 2014 at 12:02 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>> On Tue, Oct 21, 2014 at 11:49 AM, Aditya Kali <adityakali@google.com> wrote:
>>>> On Mon, Oct 20, 2014 at 10:49 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>>> On Mon, Oct 20, 2014 at 10:42 PM, Eric W. Biederman
>>>>> <ebiederm@xmission.com> wrote:
>>>>>>
>>>>>> I do wonder if we think of this as chcgrouproot if there is a simpler
>>>>>> implementation.
>>>>>
>>>>> Could be.  I'll defer to Aditya for that one.
>>>>>
>>>>
>>>> More than chcgrouproot, its probably closer to pivot_cgroup_root. In
>>>> addition to restricting the process to a cgroup-root, new processes
>>>> entering the container should also be implicitly contained within the
>>>> cgroup-root of that container.
>>>
>>> Why?  Concretely, why should this be in the kernel namespace code
>>> instead of in userspace?
>>>
>>
>> Userspace can do it too. Though then there will be possibility of
>> having processes in the same mount namespace with different
>> cgroup-roots. Deriving contents of /proc/<pid>/cgroup becomes even
>> more complex. Thats another reason why it might not be good idea to
>> tie cgroups with mount namespace.
>>
>>>> Implementing pivot_cgroup_root would
>>>> probably involve overloading mount-namespace to now understand cgroup
>>>> filesystem too. I did attempt combining cgroupns-root with mntns
>>>> earlier (not via a new syscall though), but came to the conclusion
>>>> that its just simpler to have a separate cgroup namespace and get
>>>> clear semantics. One of the issues was that implicitly changing cgroup
>>>> on setns to mntns seemed like a huge undesirable side-effect.
>>>>
>>>> About pinning: I really feel that it should be OK to pin processes
>>>> within cgroupns-root. I think thats one of the most important feature
>>>> of cgroup-namespace since its most common usecase is to containerize
>>>> un-trusted processes - processes that, for their entire lifetime, need
>>>> to remain inside their container.
>>>
>>> So don't let them out.  None of the other namespaces have this kind of
>>> constraint:
>>>
>>>  - If you're in a mntns, you can still use fds from outside.
>>>  - If you're in a netns, you can still use sockets from outside the namespace.
>>>  - If you're in an ipcns, you can still use ipc handles from outside.
>>
>> But none of the namespaces allow you to allocate new fds/sockets/ipc
>> handles in the outside namespace. I think moving a process outside of
>> cgroupns-root is like allocating a resource outside of your namespace.
>
> In a pidns, you can see outside tasks if you have an outside procfs
> mounted, but, if you don't, then you can't.  Wouldn't cgroupns be just
> like that?  You wouldn't be able to escape your cgroup as long as you
> don't have an inappropriate cgroupfs mounted.
>

I am not sure we should depend only on restricted visibility for this,
though. More details below.

>
>>>
>>>> And with explicit permission from
>>>> cgroup subsystem (something like cgroup.may_unshare as you had
>>>> suggested previously), we can make sure that unprivileged processes
>>>> cannot pin themselves. Also, maintaining this invariant (your current
>>>> cgroup is always under your cgroupns-root) keeps the code and the
>>>> semantics simple.
>>>
>>> I actually think it makes the semantics more complex.  The less policy
>>> you stick in the kernel, the easier it is to understand the impact of
>>> that policy.
>>>
>>
>> My inclination is towards keeping things simpler - both in code as
>> well as in configuration. I agree that cgroupns might seem
>> "less-flexible", but in its current form, it encourages consistent
>> container configuration. If you have a process that needs to move
>> around between cgroups belonging to different containers, then that
>> process should probably not be inside any container's cgroup
>> namespace. Allowing that will just make the cgroup namespace
>> pretty-much meaningless.
>
> The problem with pinning is that preventing it causes problems
> (specifically, either something potentially complex and incompatible
> needs to be added or unprivileged processes will be able to pin
> themselves).
>
> Unless I'm missing something, a normal cgroupns user doesn't actually
> need kernel pinning support to effectively constrain its members'
> cgroups.
>

So there are 2 scenarios to consider:

We have 2 containers with cgroups: /container1 and /container2
Assume process P is running under cgroupns-root '/container1'

(1) process P wants to 'write' to cgroup.procs outside its
cgroupns-root (say to /container2/cgroup.procs)
(2) An admin process running in init_cgroup_ns (or any parent cgroupns
with cgroupns-root above /container1) wants to write pid of process P
to /container2/cgroup.procs (which lies outside of P's cgroupns-root)

For (1), I think it's OK to reject such a write. This is consistent
with the restriction in cgroup_file_write added in 'Patch 6' of this
set. I believe this should be independent of visibility of the cgroup
hierarchy for P.

For (2), we may allow the write to succeed if we make sure that the
process doing the write is an admin process (with CAP_SYS_ADMIN in its
userns AND over P's cgroupns->user_ns).
If this write succeeds, then:
(a) process P's /proc/<pid>/cgroup does not show anything when viewed
by 'self' or any other process in P's cgroupns. I would really like
to avoid showing relative paths or paths outside the cgroupns-root.
(b) if process P does 'mount -t cgroup cgroup <mnt>', it will still be
only able to mount and see cgroup hierarchy under its cgroupns-root
(d) if process P tries to write to any cgroup file outside of its
cgroupns-root (assuming that hierarchy is visible to it for whatever
reason), it will fail as in (1)

i.e., in summary, you can't escape out of cgroupns-root yourself. You
will need help from an admin process running under some parent
cgroupns-root to move you out. Is that workable for your usecase? Most
of the things above already happen with the current patch-set, so it
should be easy to enable this.
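
To make the two scenarios concrete, here is a rough userspace sketch of
the rule in C. It is only an illustration under simplifying assumptions:
cgroups are represented as absolute path strings, the writer_is_admin
flag stands in for the CAP_SYS_ADMIN checks mentioned above, and the
names cgroup_path_is_under() and may_move_task() are made up for this
example and do not come from the patch set.

```c
#include <stdbool.h>
#include <string.h>

/* Return true if 'path' equals 'root' or lies beneath it. */
bool cgroup_path_is_under(const char *root, const char *path)
{
	size_t n = strlen(root);

	/* "/" is an ancestor of every absolute path. */
	if (n == 1 && root[0] == '/')
		return path[0] == '/';
	if (strncmp(root, path, n) != 0)
		return false;
	/* Reject sibling prefixes such as /container1 vs /container10. */
	return path[n] == '\0' || path[n] == '/';
}

/*
 * Hypothetical check for writing a task's pid to <target>/cgroup.procs.
 *   writer_root     - cgroupns-root of the process doing the write
 *   writer_is_admin - stand-in for the CAP_SYS_ADMIN checks
 *   task_root       - cgroupns-root of the task being moved
 *   target          - cgroup whose cgroup.procs is written
 */
bool may_move_task(const char *writer_root, bool writer_is_admin,
		   const char *task_root, const char *target)
{
	/* Scenario (1): nobody writes outside their own cgroupns-root. */
	if (!cgroup_path_is_under(writer_root, target))
		return false;
	/* Scenario (2): moving a task out of its cgroupns-root needs an admin. */
	if (!cgroup_path_is_under(task_root, target))
		return writer_is_admin;
	return true;
}
```

With this rule, P (cgroupns-root /container1) is refused when it writes
its own pid to /container2/cgroup.procs, while an admin whose
cgroupns-root is / is allowed to move P there.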

Though there are still some open issues like:
* what happens if you move all the processes out of /container1 and
then 'rmdir /container1'? As it is now, you won't be able to setns()
to that cgroupns anymore. But the cgroupns will still hang around
until the processes switch their cgroupns.
* should we then also allow setns() without first entering the
cgroupns-root? setns also checks the same conditions as in (a) plus it
checks that your current cgroup is a descendant of the target
cgroupns-root. Alternatively we can special-case setns() to your own
cgroupns so that it doesn't fail.
* migration for these processes will be tricky, if not impossible. But
the admin trying to do this probably doesn't care about it or will
provision for it.

>>
>>>>
>>>> If we ditch the pinning requirement and allow the containarized
>>>> process to move outside of its cgroupns-root, we will have to address
>>>> atleast the following:
>>>> * what does its /proc/self/cgroup  (and /proc/<pid>/cgroup in general)
>>>> look like? We might need to just not show anything in
>>>> /proc/<pid>/cgroup in such case (for default hierarchy).
>>>
>>> The process should see the cgroup path relative to its cgroup ns.
>>> Whether this requires a new /proc mount or happens automatically is an
>>> open question.  (I *hate* procfs for reasons like this.)
>>>
>>>> * how should future setns() and unshare() by such process behave?
>>>
>>> Open question.
>>>
>>>> * 'mount -t cgroup cgroup <mnt>' by such a process will yield unexpected result
>>>
>>> You could disallow that and instead require 'mount -t cgroup -o
>>> cgrouproot=. cgroup mnt' where '.' will be resolved at mount time
>>> relative to the caller's cgroupns.
>>>
>>>> * container will not remain migratable
>>>
>>> Why not?
>>>
>>
>> Well, the processes running outside of cgroupns root will be exposed
>> to information outside of the container (i.e., its /proc/self/cgroup
>> will show paths involving other containers and potentially system
>> level information). So unless you even restore them, it will be
>> difficult to restore these processes. The whole point of virtualizing
>> the /proc/self/cgroup view was so that the processes don't see outside
>> cgroups.
>>
>
> So don't do that?
>

A lot of non-cgroup-manager userspace processes have legitimate reasons
to read /proc/self/cgroup. One example is to register for OOM
notifications. Migratability of the container is also very important.
So "don't do that" is not always an option :)
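
For reference, this is roughly what such a reader has to do with each
line of /proc/self/cgroup, which uses the "id:controllers:path" layout
shown in the cover letter. The sketch below is illustrative only: the
struct cgroup_entry type, the buffer sizes and the parse_cgroup_line()
name are assumptions made for this example. A process would typically
use the resulting path to locate its own cgroup directory, which is why
the path it sees matters.

```c
#include <stdlib.h>
#include <string.h>

/* One parsed line of /proc/<pid>/cgroup (sizes are arbitrary here). */
struct cgroup_entry {
	int hierarchy_id;
	char controllers[128];
	char path[256];
};

/* Parse "id:controllers:path"; returns 0 on success, -1 on bad input. */
int parse_cgroup_line(const char *line, struct cgroup_entry *out)
{
	const char *c1 = strchr(line, ':');
	const char *c2 = c1 ? strchr(c1 + 1, ':') : NULL;
	size_t clen, plen;

	if (!c1 || !c2)
		return -1;

	out->hierarchy_id = atoi(line);         /* digits before first ':' */
	clen = (size_t)(c2 - c1 - 1);           /* controller list length */
	plen = strcspn(c2 + 1, "\n");           /* path, without newline */
	if (clen >= sizeof(out->controllers) || plen >= sizeof(out->path))
		return -1;

	memcpy(out->controllers, c1 + 1, clen);
	out->controllers[clen] = '\0';
	memcpy(out->path, c2 + 1, plen);
	out->path[plen] = '\0';
	return 0;
}
```

If the path field is empty or points outside what the process can
mount, a reader like this has nothing usable to work with.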


> --Andy

-- 
Aditya

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support
  2014-10-22  0:46                                                       ` Aditya Kali
@ 2014-10-22  0:58                                                           ` Andy Lutomirski
  -1 siblings, 0 replies; 384+ messages in thread
From: Andy Lutomirski @ 2014-10-22  0:58 UTC (permalink / raw)
  To: Aditya Kali
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel, Ingo Molnar,
	Eric W. Biederman, Tejun Heo, cgroups

On Tue, Oct 21, 2014 at 5:46 PM, Aditya Kali <adityakali@google.com> wrote:
> On Tue, Oct 21, 2014 at 3:42 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Tue, Oct 21, 2014 at 3:33 PM, Aditya Kali <adityakali@google.com> wrote:
>>> On Tue, Oct 21, 2014 at 12:02 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>> On Tue, Oct 21, 2014 at 11:49 AM, Aditya Kali <adityakali@google.com> wrote:
>>>>> On Mon, Oct 20, 2014 at 10:49 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>>>> On Mon, Oct 20, 2014 at 10:42 PM, Eric W. Biederman
>>>>>> <ebiederm@xmission.com> wrote:
>>>>>>>
>>>>>>> I do wonder if we think of this as chcgrouproot if there is a simpler
>>>>>>> implementation.
>>>>>>
>>>>>> Could be.  I'll defer to Aditya for that one.
>>>>>>
>>>>>
>>>>> More than chcgrouproot, its probably closer to pivot_cgroup_root. In
>>>>> addition to restricting the process to a cgroup-root, new processes
>>>>> entering the container should also be implicitly contained within the
>>>>> cgroup-root of that container.
>>>>
>>>> Why?  Concretely, why should this be in the kernel namespace code
>>>> instead of in userspace?
>>>>
>>>
>>> Userspace can do it too. Though then there will be possibility of
>>> having processes in the same mount namespace with different
>>> cgroup-roots. Deriving contents of /proc/<pid>/cgroup becomes even
>>> more complex. Thats another reason why it might not be good idea to
>>> tie cgroups with mount namespace.
>>>
>>>>> Implementing pivot_cgroup_root would
>>>>> probably involve overloading mount-namespace to now understand cgroup
>>>>> filesystem too. I did attempt combining cgroupns-root with mntns
>>>>> earlier (not via a new syscall though), but came to the conclusion
>>>>> that its just simpler to have a separate cgroup namespace and get
>>>>> clear semantics. One of the issues was that implicitly changing cgroup
>>>>> on setns to mntns seemed like a huge undesirable side-effect.
>>>>>
>>>>> About pinning: I really feel that it should be OK to pin processes
>>>>> within cgroupns-root. I think thats one of the most important feature
>>>>> of cgroup-namespace since its most common usecase is to containerize
>>>>> un-trusted processes - processes that, for their entire lifetime, need
>>>>> to remain inside their container.
>>>>
>>>> So don't let them out.  None of the other namespaces have this kind of
>>>> constraint:
>>>>
>>>>  - If you're in a mntns, you can still use fds from outside.
>>>>  - If you're in a netns, you can still use sockets from outside the namespace.
>>>>  - If you're in an ipcns, you can still use ipc handles from outside.
>>>
>>> But none of the namespaces allow you to allocate new fds/sockets/ipc
>>> handles in the outside namespace. I think moving a process outside of
>>> cgroupns-root is like allocating a resource outside of your namespace.
>>
>> In a pidns, you can see outside tasks if you have an outside procfs
>> mounted, but, if you don't, then you can't.  Wouldn't cgroupns be just
>> like that?  You wouldn't be able to escape your cgroup as long as you
>> don't have an inappropriate cgroupfs mounted.
>>
>
> I am not if we should only depend on restricted visibility for this
> though. More details below.
>
>>
>>>>
>>>>> And with explicit permission from
>>>>> cgroup subsystem (something like cgroup.may_unshare as you had
>>>>> suggested previously), we can make sure that unprivileged processes
>>>>> cannot pin themselves. Also, maintaining this invariant (your current
>>>>> cgroup is always under your cgroupns-root) keeps the code and the
>>>>> semantics simple.
>>>>
>>>> I actually think it makes the semantics more complex.  The less policy
>>>> you stick in the kernel, the easier it is to understand the impact of
>>>> that policy.
>>>>
>>>
>>> My inclination is towards keeping things simpler - both in code as
>>> well as in configuration. I agree that cgroupns might seem
>>> "less-flexible", but in its current form, it encourages consistent
>>> container configuration. If you have a process that needs to move
>>> around between cgroups belonging to different containers, then that
>>> process should probably not be inside any container's cgroup
>>> namespace. Allowing that will just make the cgroup namespace
>>> pretty-much meaningless.
>>
>> The problem with pinning is that preventing it causes problems
>> (specifically, either something potentially complex and incompatible
>> needs to be added or unprivileged processes will be able to pin
>> themselves).
>>
>> Unless I'm missing something, a normal cgroupns user doesn't actually
>> need kernel pinning support to effectively constrain its members'
>> cgroups.
>>
>
> So there are 2 scenarios to consider:
>
> We have 2 containers with cgroups: /container1 and /container2
> Assume process P is running under cgroupns-root '/container1'
>
> (1) process P wants to 'write' to cgroup.procs outside its
> cgroupns-root (say to /container2/cgroup.procs)

This, at least, doesn't have the problem with unprivileged processes
pinning themselves.

> (2) An admin process running in init_cgroup_ns (or any parent cgroupns
> with cgroupns-root above /container1) wants to write pid of process P
> to /container2/cgroup.procs (which lies outside of P's cgroupns-root)
>
> For (1), I think its ok to reject such a write. This is consistent
> with the restriction in cgroup_file_write added in 'Patch 6' of this
> set. I believe this should be independent of visibility of the cgroup
> hierarchy for P.
>
> For (2), we may allow the write to succeed if we make sure that the
> process doing the write is an admin process (with CAP_SYS_ADMIN in its
> userns AND over P's cgroupns->user_ns).

Why is its userns relevant?

Why not just check whether the target cgroup is in the process doing
the write's cgroupns? (NB: you need to check f_cred, here, not
current_cred(), but that's orthogonal.)  Then the policy becomes: no
user of cgroupfs can move any process outside of the cgroupfs's user's
cgroupns root.

I think I'm okay with this.
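
As a rough illustration of the open-time versus write-time point, here
is a small userspace model in C. Everything in it is hypothetical:
struct procs_file, procs_open() and procs_write_allowed() are names
invented for this sketch, and the only thing it tries to show is that
the decision depends on who opened the file, not on who later happens
to call write() on the fd.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical model of an open cgroup.procs file. */
struct procs_file {
	const char *target_cgroup;  /* cgroup that owns this cgroup.procs */
	const char *opener_ns_root; /* opener's cgroupns-root, saved at open */
};

/* Return true if 'path' equals 'root' or lies beneath it. */
bool path_is_under(const char *root, const char *path)
{
	size_t n = strlen(root);

	if (n == 1 && root[0] == '/')
		return path[0] == '/';
	if (strncmp(root, path, n) != 0)
		return false;
	return path[n] == '\0' || path[n] == '/';
}

/* Record the opener's cgroupns-root when the file is opened. */
struct procs_file procs_open(const char *target_cgroup,
			     const char *opener_ns_root)
{
	struct procs_file f = { target_cgroup, opener_ns_root };
	return f;
}

/*
 * The write-time caller is deliberately not an argument: the fd may
 * have been passed to another task, so only the opener's view counts.
 */
bool procs_write_allowed(const struct procs_file *f)
{
	return path_is_under(f->opener_ns_root, f->target_cgroup);
}
```

In this model a file opened from the root cgroupns stays writable even
if the fd is handed to a task inside /container1, and a file on
/container2 opened from inside /container1 is never writable.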

> If this write succeeds, then:
> (a) process P's /proc/<pid>/cgroup does not show anything when viewed
> by 'self' or any other process in P's cgrgroupns. I would really like
> to avoid showing relative paths or paths outside the cgroupns-root

The empty string seems just as problematic to me.

> (b) if process P does 'mount -t cgroup cgroup <mnt>', it will still be
> only able to mount and see cgroup hierarchy under its cgroupns-root
> (d) if process P tries to write to any cgroup file outside of its
> cgroupns-root (assuming that hierarchy is visible to it for whatever
> reason), it will fail as in (1)

I'm still unconvinced that this serves any purpose.  If you give
DAC/MAC permission to a task to write to something, and you give it
access to an fd or mount pointing there, and you don't want it writing
there, then *don't do that*.  I'm not really seeing why cgroupns needs
special treatment here.

>
> i.e., in summary, you can't escape out of cgroupns-root yourself. You
> will need help from an admin process running under some parent
> cgroupns-root to move you out. Is that workable for your usecase? Most
> of the things above already happen with the current patch-set, so it
> should be easy to enable this.
>
> Though there are still some open issues like:
> * what happens if you move all the processes out of /container1 and
> then 'rmdir /container1'? As it is now, you won't be able to setns()
> to that cgroupns anymore. But the cgroupns will still hang around
> until the processes switch their cgroupns.

Seems okay.

> * should we then also allow setns() without first entering the
> cgroupns-root? setns also checks the same conditions as in (a) plus it
> checks that your current cgroup is descendant of target cgroupns-root.
> Alternatively we can special-case setns() to own cgroupns so that it
> doesn't fail.

I think setns should completely ignore the caller's cgroup and should
not change it.  Userspace can do this.

> * migration for these processes will be tricky, if not impossible. But
> the admin trying to do this probably doesn't care about it or will
> provision for it.

Migration for processes in a mntns that have a current directory
outside their mntns is also difficult or impossible.  Same with
pidnses with an fd pointing at /proc/self from an outside-the-pid-ns
procfs.  Nothing new here.

--Andy

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support
  2014-10-22  0:58                                                           ` Andy Lutomirski
@ 2014-10-22 18:37                                                               ` Aditya Kali
  -1 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-10-22 18:37 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Ingo Molnar,
	Eric W. Biederman, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA

On Tue, Oct 21, 2014 at 5:58 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
> On Tue, Oct 21, 2014 at 5:46 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>> On Tue, Oct 21, 2014 at 3:42 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>>> On Tue, Oct 21, 2014 at 3:33 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>>>> On Tue, Oct 21, 2014 at 12:02 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>>>>> On Tue, Oct 21, 2014 at 11:49 AM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>>>>>> On Mon, Oct 20, 2014 at 10:49 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>>>>>>> On Mon, Oct 20, 2014 at 10:42 PM, Eric W. Biederman
>>>>>>> <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>>>>>>>>
>>>>>>>> I do wonder if we think of this as chcgrouproot if there is a simpler
>>>>>>>> implementation.
>>>>>>>
>>>>>>> Could be.  I'll defer to Aditya for that one.
>>>>>>>
>>>>>>
>>>>>> More than chcgrouproot, it's probably closer to pivot_cgroup_root. In
>>>>>> addition to restricting the process to a cgroup-root, new processes
>>>>>> entering the container should also be implicitly contained within the
>>>>>> cgroup-root of that container.
>>>>>
>>>>> Why?  Concretely, why should this be in the kernel namespace code
>>>>> instead of in userspace?
>>>>>
>>>>
>>>> Userspace can do it too. Though then there will be possibility of
>>>> having processes in the same mount namespace with different
>>>> cgroup-roots. Deriving contents of /proc/<pid>/cgroup becomes even
>>>> more complex. That's another reason why it might not be a good idea to
>>>> tie cgroups with mount namespace.
>>>>
>>>>>> Implementing pivot_cgroup_root would
>>>>>> probably involve overloading mount-namespace to now understand cgroup
>>>>>> filesystem too. I did attempt combining cgroupns-root with mntns
>>>>>> earlier (not via a new syscall though), but came to the conclusion
>>>>>> that it's just simpler to have a separate cgroup namespace and get
>>>>>> clear semantics. One of the issues was that implicitly changing cgroup
>>>>>> on setns to mntns seemed like a huge undesirable side-effect.
>>>>>>
>>>>>> About pinning: I really feel that it should be OK to pin processes
>>>>>> within cgroupns-root. I think that's one of the most important features
>>>>>> of cgroup-namespace since its most common usecase is to containerize
>>>>>> un-trusted processes - processes that, for their entire lifetime, need
>>>>>> to remain inside their container.
>>>>>
>>>>> So don't let them out.  None of the other namespaces have this kind of
>>>>> constraint:
>>>>>
>>>>>  - If you're in a mntns, you can still use fds from outside.
>>>>>  - If you're in a netns, you can still use sockets from outside the namespace.
>>>>>  - If you're in an ipcns, you can still use ipc handles from outside.
>>>>
>>>> But none of the namespaces allow you to allocate new fds/sockets/ipc
>>>> handles in the outside namespace. I think moving a process outside of
>>>> cgroupns-root is like allocating a resource outside of your namespace.
>>>
>>> In a pidns, you can see outside tasks if you have an outside procfs
>>> mounted, but, if you don't, then you can't.  Wouldn't cgroupns be just
>>> like that?  You wouldn't be able to escape your cgroup as long as you
>>> don't have an inappropriate cgroupfs mounted.
>>>
>>
>> I am not sure if we should only depend on restricted visibility for this
>> though. More details below.
>>
>>>
>>>>>
>>>>>> And with explicit permission from
>>>>>> cgroup subsystem (something like cgroup.may_unshare as you had
>>>>>> suggested previously), we can make sure that unprivileged processes
>>>>>> cannot pin themselves. Also, maintaining this invariant (your current
>>>>>> cgroup is always under your cgroupns-root) keeps the code and the
>>>>>> semantics simple.
>>>>>
>>>>> I actually think it makes the semantics more complex.  The less policy
>>>>> you stick in the kernel, the easier it is to understand the impact of
>>>>> that policy.
>>>>>
>>>>
>>>> My inclination is towards keeping things simpler - both in code as
>>>> well as in configuration. I agree that cgroupns might seem
>>>> "less-flexible", but in its current form, it encourages consistent
>>>> container configuration. If you have a process that needs to move
>>>> around between cgroups belonging to different containers, then that
>>>> process should probably not be inside any container's cgroup
>>>> namespace. Allowing that will just make the cgroup namespace
>>>> pretty-much meaningless.
>>>
>>> The problem with pinning is that preventing it causes problems
>>> (specifically, either something potentially complex and incompatible
>>> needs to be added or unprivileged processes will be able to pin
>>> themselves).
>>>
>>> Unless I'm missing something, a normal cgroupns user doesn't actually
>>> need kernel pinning support to effectively constrain its members'
>>> cgroups.
>>>
>>
>> So there are 2 scenarios to consider:
>>
>> We have 2 containers with cgroups: /container1 and /container2
>> Assume process P is running under cgroupns-root '/container1'
>>
>> (1) process P wants to 'write' to cgroup.procs outside its
>> cgroupns-root (say to /container2/cgroup.procs)
>
> This, at least, doesn't have the problem with unprivileged processes
> pinning themselves.
>
>> (2) An admin process running in init_cgroup_ns (or any parent cgroupns
>> with cgroupns-root above /container1) wants to write pid of process P
>> to /container2/cgroup.procs (which lies outside of P's cgroupns-root)
>>
>> For (1), I think it's ok to reject such a write. This is consistent
>> with the restriction in cgroup_file_write added in 'Patch 6' of this
>> set. I believe this should be independent of visibility of the cgroup
>> hierarchy for P.
>>
>> For (2), we may allow the write to succeed if we make sure that the
>> process doing the write is an admin process (with CAP_SYS_ADMIN in its
>> userns AND over P's cgroupns->user_ns).
>
> Why is its userns relevant?
>
> Why not just check whether the target cgroup is in the process doing
> the write's cgroupns? (NB: you need to check f_cred, here, not
> current_cred(), but that's orthogonal.)  Then the policy becomes: no
> user of cgroupfs can move any process outside of the cgroupfs's user's
> cgroupns root.
>
Humm .. it doesn't have to be. I think it's simpler to not enforce
artificial permission checks unless there is a security concern (and
in this case, there doesn't seem to be any). So I will leave the
capability check out here.

> I think I'm okay with this.
>
>> If this write succeeds, then:
>> (a) process P's /proc/<pid>/cgroup does not show anything when viewed
>> by 'self' or any other process in P's cgroupns. I would really like
>> to avoid showing relative paths or paths outside the cgroupns-root
>
> The empty string seems just as problematic to me.

Actually, there is no right answer here. Our options are:
* show relative path
-- this will break userspace as /proc/<pid>/cgroup does not show
relative paths today. This is also very ambiguous (is it relative to
cgroupns-root or relative to the cgroup of the /proc/<pid>/cgroup reader?).

* show absolute path
-- this will also be wrong as the process won't be able to make sense of
it unless it has exposure to the global cgroup hierarchy.
-- the worse case is that the global path also exists under the
cgroupns-root ... so now the process thinks it's in a completely wrong
cgroup
-- this exposes system

* show only "/"
-- this is arguably better, but if the process tries to verify that
its pid is in cgroup.procs of the cgroupns-root, it's in for a
surprise!

In any case, whatever we expose, userspace won't be able to use
this path correctly (worse yet, it associates the wrong cgroup with that
path). So I think it's best to not print out the line for the default
hierarchy at all. This happens today when cgroupfs is not mounted. I
am open to other suggestions.
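
To make the proposal concrete, here is a rough userspace model of it in
Python. This is only a sketch and not kernel code: the helper names
(is_descendant, virtualized_path) are made up for illustration, and
cgroup paths are assumed to be normalized absolute strings. It returns
the path relative to the reader's cgroupns-root when the task is inside
that root, and None (meaning "print no line") when it is not.

```python
from typing import Optional


def is_descendant(cgroup: str, root: str) -> bool:
    """True if `cgroup` is `root` itself or lies beneath it (hypothetical helper)."""
    if root == "/":
        return True
    # Compare on a path-component boundary so /container10 is not
    # mistaken for a child of /container1.
    return cgroup == root or cgroup.startswith(root + "/")


def virtualized_path(task_cgroup: str, ns_root: str) -> Optional[str]:
    """Path to show in /proc/<pid>/cgroup for a reader rooted at `ns_root`.

    Returns None when the task has been moved outside the root, which
    models the "do not print the line at all" option.
    """
    if not is_descendant(task_cgroup, ns_root):
        return None
    if ns_root == "/":
        return task_cgroup
    relative = task_cgroup[len(ns_root):]
    return relative or "/"


inside = virtualized_path("/container1/job", "/container1")
at_root = virtualized_path("/container1", "/container1")
outside = virtualized_path("/container2", "/container1")
```

With this model, a task that an admin has moved to /container2 simply
has no entry from the point of view of readers rooted at /container1,
rather than a relative or global path.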

>
>> (b) if process P does 'mount -t cgroup cgroup <mnt>', it will still be
>> only able to mount and see cgroup hierarchy under its cgroupns-root
>> (d) if process P tries to write to any cgroup file outside of its
>> cgroupns-root (assuming that hierarchy is visible to it for whatever
>> reason), it will fail as in (1)
>
> I'm still unconvinced that this serves any purpose.  If you give
> DAC/MAC permission to a task to write to something, and you give it
> access to an fd or mount pointing there, and you don't want it writing
> there, then *don't do that*.  I'm not really seeing why cgroupns needs
> special treatment here.
>

There was a suggestion on the previous version of this patch-set that
we need to prevent processes inside a cgroupns from modifying the
settings of cgroups outside of their cgroupns-root. But I agree with
your point that cgroupns should not enforce unnecessary access-control
restrictions. Its job is only to virtualize the view of the
/proc/<pid>/cgroup file as much as possible (100% virtualized for a
correctly set up container). This will get rid of most of patch 6/8
"cgroup: restrict cgroup operations within task's cgroupns" of this
series. The only check we keep is in cgroup_attach_task(), which
ensures that the target cgroup is a descendant of current's cgroupns-root
and prevents processes from escaping their cgroupns on their own.
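
As an illustration of the one remaining check, here is a small Python
model of the two scenarios discussed above. It is a sketch under
assumptions, not the actual cgroup_attach_task() logic: may_attach is a
hypothetical name, and paths are assumed to be normalized absolute
cgroup paths with no trailing slash (except for "/" itself).

```python
import posixpath


def may_attach(writer_ns_root: str, target_cgroup: str) -> bool:
    """Allow a move only if the target lies under the writer's cgroupns-root.

    commonpath() compares whole path components, so /container10 is not
    treated as being under /container1.
    """
    common = posixpath.commonpath([writer_ns_root, target_cgroup])
    return common == writer_ns_root


# Scenario (1): P, rooted at /container1, writes to /container2/cgroup.procs.
p_can_escape = may_attach("/container1", "/container2")

# Scenario (2): an admin in the initial cgroupns (root "/") writes P's pid
# to /container2/cgroup.procs.
admin_can_move = may_attach("/", "/container2")
```

The point of the model is that the decision depends only on the
writer's cgroupns-root and the target cgroup, with no extra capability
check involved.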

>>
>> i.e., in summary, you can't escape out of cgroupns-root yourself. You
>> will need help from an admin process running under some parent
>> cgroupns-root to move you out. Is that workable for your usecase? Most
>> of the things above already happen with the current patch-set, so it
>> should be easy to enable this.
>>
>> Though there are still some open issues like:
>> * what happens if you move all the processes out of /container1 and
>> then 'rmdir /container1'? As it is now, you won't be able to setns()
>> to that cgroupns anymore. But the cgroupns will still hang around
>> until the processes switch their cgroupns.
>
> Seems okay.
>
>> * should we then also allow setns() without first entering the
>> cgroupns-root? setns also checks the same conditions as in (a) plus it
>> checks that your current cgroup is descendant of target cgroupns-root.
>> Alternatively we can special-case setns() to own cgroupns so that it
>> doesn't fail.
>
> I think setns should completely ignore the caller's cgroup and should
> not change it.  Userspace can do this.
>

All of the above changes more or less mean that tasks cannot pin themselves
by unsharing cgroupns. Do you agree that we don't need that "explicit
permission from cgroupfs" anymore (via a cgroup.may_unshare file or some
other mechanism)?

>> * migration for these processes will be tricky, if not impossible. But
>> the admin trying to do this probably doesn't care about it or will
>> provision for it.
>
> Migration for processes in a mntns that have a current directory
> outside their mntns is also difficult or impossible.  Same with
> pidnses with an fd pointing at /proc/self from an outside-the-pid-ns
> procfs.  Nothing new here.
>
> --Andy

Thanks for the review!

-- 
Aditya

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support
@ 2014-10-22 18:37                                                               ` Aditya Kali
  0 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-10-22 18:37 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Eric W. Biederman, Serge E. Hallyn, Linux API, Linux Containers,
	Serge Hallyn, linux-kernel, Tejun Heo, cgroups, Ingo Molnar

On Tue, Oct 21, 2014 at 5:58 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Tue, Oct 21, 2014 at 5:46 PM, Aditya Kali <adityakali@google.com> wrote:
>> On Tue, Oct 21, 2014 at 3:42 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>> On Tue, Oct 21, 2014 at 3:33 PM, Aditya Kali <adityakali@google.com> wrote:
>>>> On Tue, Oct 21, 2014 at 12:02 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>>> On Tue, Oct 21, 2014 at 11:49 AM, Aditya Kali <adityakali@google.com> wrote:
>>>>>> On Mon, Oct 20, 2014 at 10:49 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>>>>> On Mon, Oct 20, 2014 at 10:42 PM, Eric W. Biederman
>>>>>>> <ebiederm@xmission.com> wrote:
>>>>>>>>
>>>>>>>> I do wonder if we think of this as chcgrouproot if there is a simpler
>>>>>>>> implementation.
>>>>>>>
>>>>>>> Could be.  I'll defer to Aditya for that one.
>>>>>>>
>>>>>>
>>>>>> More than chcgrouproot, its probably closer to pivot_cgroup_root. In
>>>>>> addition to restricting the process to a cgroup-root, new processes
>>>>>> entering the container should also be implicitly contained within the
>>>>>> cgroup-root of that container.
>>>>>
>>>>> Why?  Concretely, why should this be in the kernel namespace code
>>>>> instead of in userspace?
>>>>>
>>>>
>>>> Userspace can do it too. Though then there will be possibility of
>>>> having processes in the same mount namespace with different
>>>> cgroup-roots. Deriving contents of /proc/<pid>/cgroup becomes even
>>>> more complex. Thats another reason why it might not be good idea to
>>>> tie cgroups with mount namespace.
>>>>
>>>>>> Implementing pivot_cgroup_root would
>>>>>> probably involve overloading mount-namespace to now understand cgroup
>>>>>> filesystem too. I did attempt combining cgroupns-root with mntns
>>>>>> earlier (not via a new syscall though), but came to the conclusion
>>>>>> that its just simpler to have a separate cgroup namespace and get
>>>>>> clear semantics. One of the issues was that implicitly changing cgroup
>>>>>> on setns to mntns seemed like a huge undesirable side-effect.
>>>>>>
>>>>>> About pinning: I really feel that it should be OK to pin processes
>>>>>> within cgroupns-root. I think thats one of the most important feature
>>>>>> of cgroup-namespace since its most common usecase is to containerize
>>>>>> un-trusted processes - processes that, for their entire lifetime, need
>>>>>> to remain inside their container.
>>>>>
>>>>> So don't let them out.  None of the other namespaces have this kind of
>>>>> constraint:
>>>>>
>>>>>  - If you're in a mntns, you can still use fds from outside.
>>>>>  - If you're in a netns, you can still use sockets from outside the namespace.
>>>>>  - If you're in an ipcns, you can still use ipc handles from outside.
>>>>
>>>> But none of the namespaces allow you to allocate new fds/sockets/ipc
>>>> handles in the outside namespace. I think moving a process outside of
>>>> cgroupns-root is like allocating a resource outside of your namespace.
>>>
>>> In a pidns, you can see outside tasks if you have an outside procfs
>>> mounted, but, if you don't, then you can't.  Wouldn't cgroupns be just
>>> like that?  You wouldn't be able to escape your cgroup as long as you
>>> don't have an inappropriate cgroupfs mounted.
>>>
>>
>> I am not if we should only depend on restricted visibility for this
>> though. More details below.
>>
>>>
>>>>>
>>>>>> And with explicit permission from
>>>>>> cgroup subsystem (something like cgroup.may_unshare as you had
>>>>>> suggested previously), we can make sure that unprivileged processes
>>>>>> cannot pin themselves. Also, maintaining this invariant (your current
>>>>>> cgroup is always under your cgroupns-root) keeps the code and the
>>>>>> semantics simple.
>>>>>
>>>>> I actually think it makes the semantics more complex.  The less policy
>>>>> you stick in the kernel, the easier it is to understand the impact of
>>>>> that policy.
>>>>>
>>>>
>>>> My inclination is towards keeping things simpler - both in code as
>>>> well as in configuration. I agree that cgroupns might seem
>>>> "less-flexible", but in its current form, it encourages consistent
>>>> container configuration. If you have a process that needs to move
>>>> around between cgroups belonging to different containers, then that
>>>> process should probably not be inside any container's cgroup
>>>> namespace. Allowing that will just make the cgroup namespace
>>>> pretty-much meaningless.
>>>
>>> The problem with pinning is that preventing it causes problems
>>> (specifically, either something potentially complex and incompatible
>>> needs to be added or unprivileged processes will be able to pin
>>> themselves).
>>>
>>> Unless I'm missing something, a normal cgroupns user doesn't actually
>>> need kernel pinning support to effectively constrain its members'
>>> cgroups.
>>>
>>
>> So there are 2 scenarios to consider:
>>
>> We have 2 containers with cgroups: /container1 and /container2
>> Assume process P is running under cgroupns-root '/container1'
>>
>> (1) process P wants to 'write' to cgroup.procs outside its
>> cgroupns-root (say to /container2/cgroup.procs)
>
> This, at least, doesn't have the problem with unprivileged processes
> pinning themselves.
>
>> (2) An admin process running in init_cgroup_ns (or any parent cgroupns
>> with cgroupns-root above /container1) wants to write pid of process P
>> to /container2/cgroup.procs (which lies outside of P's cgroupns-root)
>>
>> For (1), I think its ok to reject such a write. This is consistent
>> with the restriction in cgroup_file_write added in 'Patch 6' of this
>> set. I believe this should be independent of visibility of the cgroup
>> hierarchy for P.
>>
>> For (2), we may allow the write to succeed if we make sure that the
>> process doing the write is an admin process (with CAP_SYS_ADMIN in its
>> userns AND over P's cgroupns->user_ns).
>
> Why is its userns relevant?
>
> Why not just check whether the target cgroup is in the process doing
> the write's cgroupns? (NB: you need to check f_cred, here, not
> current_cred(), but that's orthogonal.)  Then the policy becomes: no
> user of cgroupfs can move any process outside of the cgroupfs's user's
> cgroupns root.
>
Humm .. it doesn't have to be. I think its simpler to not enforce
artificial permission checks unless there is a security concern (and
in this case, there doesn't seem to be any). So I will leave the
capability check out from here.

> I think I'm okay with this.
>
>> If this write succeeds, then:
>> (a) process P's /proc/<pid>/cgroup does not show anything when viewed
>> by 'self' or any other process in P's cgrgroupns. I would really like
>> to avoid showing relative paths or paths outside the cgroupns-root
>
> The empty string seems just as problematic to me.

Actually, there is no right answer here. Our options are:
* show relative path
-- this will break userspace as /proc/<pid>/cgroup does not show
relative paths today. This is also very ambiguous (is it relative to
cgroupns-root or relative to /proc/<pid>cgroup file reader's cgroup?).

* show absolute path
-- this will also wrong as the process won't be able to make sense of
it unless it has exposure to the global cgroup hierarchy.
-- worse case is this that the global path also exists under the
cgroupns-root ... so now the process thinks its in completely wrong
cgroup
-- this exposes system

* show only "/"
-- this is arguably better, but if the process tires to verify that
its pid is in cgroup.procs of the cgroupns-root, its in for a
surprise!

In either case, whatever we expose, the userspace won't be able to use
this path correctly (worse yet, it associates wrong cgroup for that
path). So I think its best to not print out the line for default
hierarchy at all. This happens today when cgroupfs is not mounted. I
am open to other suggestions.

>
>> (b) if process P does 'mount -t cgroup cgroup <mnt>', it will still be
>> only able to mount and see cgroup hierarchy under its cgroupns-root
>> (d) if process P tries to write to any cgroup file outside of its
>> cgroupns-root (assuming that hierarchy is visible to it for whatever
>> reason), it will fail as in (1)
>
> I'm still unconvinced that this serves any purpose.  If you give
> DAC/MAC permission to a task to write to something, and you give it
> access to an fd or mount pointing there, and you don't want it writing
> there, then *don't do that*.  I'm not really seeing why cgroupns needs
> special treatment here.
>

There was a suggestion on the previous version of this patch-set that
we need to prevent processes inside cgroupns to not be able to modify
settings of cgroups outside of its cgroupns-root. But I agree with
your point that cgroupns should not enforce unnecessary access-control
restrictions. Its job is only to virtualize the view of
/proc/<pid>/cgroup file as much as possible (100% virtualized for a
correctly setup container). This will get rid of most of patch 6/8
"cgroup: restrict cgroup operations within task's cgroupns" of this
series. The only check we keep is in cgroup_attach_task() which
ensures that target-cgroup is descendant of current's cgroupns-root
and prevents processes from escaping their cgroupns on their own.

>>
>> i.e., in summary, you can't escape out of cgroupns-root yourself. You
>> will need help from an admin process running under some parent
>> cgroupns-root to move you out. Is that workable for your usecase? Most
>> of the things above already happen with the current patch-set, so it
>> should be easy to enable this.
>>
>> Though there are still some open issues like:
>> * what happens if you move all the processes out of /container1 and
>> then 'rmdir /container1'? As it is now, you won't be able to setns()
>> to that cgroupns anymore. But the cgroupns will still hang around
>> until the processes switch their cgroupns.
>
> Seems okay.
>
>> * should we then also allow setns() without first entering the
>> cgroupns-root? setns also checks the same conditions as in (a) plus it
>> checks that your current cgroup is descendant of target cgroupns-root.
>> Alternatively we can special-case setns() to own cgroupns so that it
>> doesn't fail.
>
> I think setns should completely ignore the caller's cgroup and should
> not change it.  Userspace can do this.
>

All of the above changes more or less mean that tasks cannot pin
themselves by unsharing cgroupns. Do you agree that we don't need that
"explicit permission from cgroupfs" anymore (via a cgroup.may_unshare
file or some other mechanism)?

>> * migration for these processes will be tricky, if not impossible. But
>> the admin trying to do this probably doesn't care about it or will
>> provision for it.
>
> Migration for processes in a mntns that have a current directory
> outside their mntns is also difficult or impossible.  Same with
> pidnses with an fd pointing at /proc/self from an outside-the-pid-ns
> procfs.  Nothing new here.
>
> --Andy

Thanks for the review!

-- 
Aditya

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support
  2014-10-22 18:37                                                               ` Aditya Kali
@ 2014-10-22 18:50                                                                   ` Andy Lutomirski
  -1 siblings, 0 replies; 384+ messages in thread
From: Andy Lutomirski @ 2014-10-22 18:50 UTC (permalink / raw)
  To: Aditya Kali
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Ingo Molnar,
	Eric W. Biederman, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA

On Wed, Oct 22, 2014 at 11:37 AM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
> On Tue, Oct 21, 2014 at 5:58 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>> On Tue, Oct 21, 2014 at 5:46 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>>> On Tue, Oct 21, 2014 at 3:42 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>>>> On Tue, Oct 21, 2014 at 3:33 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>>>>>>
>>>>>>> And with explicit permission from
>>>>>>> cgroup subsystem (something like cgroup.may_unshare as you had
>>>>>>> suggested previously), we can make sure that unprivileged processes
>>>>>>> cannot pin themselves. Also, maintaining this invariant (your current
>>>>>>> cgroup is always under your cgroupns-root) keeps the code and the
>>>>>>> semantics simple.
>>>>>>
>>>>>> I actually think it makes the semantics more complex.  The less policy
>>>>>> you stick in the kernel, the easier it is to understand the impact of
>>>>>> that policy.
>>>>>>
>>>>>
>>>>> My inclination is towards keeping things simpler - both in code as
>>>>> well as in configuration. I agree that cgroupns might seem
>>>>> "less-flexible", but in its current form, it encourages consistent
>>>>> container configuration. If you have a process that needs to move
>>>>> around between cgroups belonging to different containers, then that
>>>>> process should probably not be inside any container's cgroup
>>>>> namespace. Allowing that will just make the cgroup namespace
>>>>> pretty-much meaningless.
>>>>
>>>> The problem with pinning is that preventing it causes problems
>>>> (specifically, either something potentially complex and incompatible
>>>> needs to be added or unprivileged processes will be able to pin
>>>> themselves).
>>>>
>>>> Unless I'm missing something, a normal cgroupns user doesn't actually
>>>> need kernel pinning support to effectively constrain its members'
>>>> cgroups.
>>>>
>>>
>>> So there are 2 scenarios to consider:
>>>
>>> We have 2 containers with cgroups: /container1 and /container2
>>> Assume process P is running under cgroupns-root '/container1'
>>>
>>> (1) process P wants to 'write' to cgroup.procs outside its
>>> cgroupns-root (say to /container2/cgroup.procs)
>>
>> This, at least, doesn't have the problem with unprivileged processes
>> pinning themselves.
>>
>>> (2) An admin process running in init_cgroup_ns (or any parent cgroupns
>>> with cgroupns-root above /container1) wants to write pid of process P
>>> to /container2/cgroup.procs (which lies outside of P's cgroupns-root)
>>>
>>> For (1), I think it's OK to reject such a write. This is consistent
>>> with the restriction in cgroup_file_write added in 'Patch 6' of this
>>> set. I believe this should be independent of visibility of the cgroup
>>> hierarchy for P.
>>>
>>> For (2), we may allow the write to succeed if we make sure that the
>>> process doing the write is an admin process (with CAP_SYS_ADMIN in its
>>> userns AND over P's cgroupns->user_ns).
>>
>> Why is its userns relevant?
>>
>> Why not just check whether the target cgroup is in the process doing
>> the write's cgroupns? (NB: you need to check f_cred, here, not
>> current_cred(), but that's orthogonal.)  Then the policy becomes: no
>> user of cgroupfs can move any process outside of the cgroupfs's user's
>> cgroupns root.
>>
> Hmm ... it doesn't have to be. I think it's simpler to not enforce
> artificial permission checks unless there is a security concern (and
> in this case, there doesn't seem to be any). So I will leave the
> capability check out of here.
>
>> I think I'm okay with this.
>>
>>> If this write succeeds, then:
>>> (a) process P's /proc/<pid>/cgroup does not show anything when viewed
>>> by 'self' or any other process in P's cgroupns. I would really like
>>> to avoid showing relative paths or paths outside the cgroupns-root
>>
>> The empty string seems just as problematic to me.
>
> Actually, there is no right answer here. Our options are:
> * show relative path
> -- this will break userspace as /proc/<pid>/cgroup does not show
> relative paths today. This is also very ambiguous (is it relative to
> cgroupns-root or relative to /proc/<pid>/cgroup file reader's cgroup?).
>

Confused now.  If ".." in /proc/pid/cgroup would be ambiguous, then so
would a path relative to cgroupns root, right?  Or am I missing
something?

(I'm not saying that ".." is beautiful or that it won't confuse
things.  I'm just not sure why it's ambiguous.)

> * show absolute path
> -- this will also be wrong, as the process won't be able to make sense
> of it unless it has exposure to the global cgroup hierarchy.
> -- the worst case is that the global path also exists under the
> cgroupns-root ... so now the process thinks it's in a completely wrong
> cgroup
> -- this exposes system
>
> * show only "/"
> -- this is arguably better, but if the process tries to verify that
> its pid is in cgroup.procs of the cgroupns-root, it's in for a
> surprise!
>
> In either case, whatever we expose, the userspace won't be able to use
> this path correctly (worse yet, it associates the wrong cgroup with
> that path). So I think it's best to not print out the line for the
> default hierarchy at all. This happens today when cgroupfs is not
> mounted. I am open to other suggestions.

I suppose that ".." is a possible security problem.  If I can force
you to see lots of ..s in there, then I might be able to get you to
write outside cgroupfs.

Grr.  No great solution here.  I suppose that the empty string isn't
so bad.  We could also write something obviously invalid like
"(unreachable)".  As long as no one actually creates a cgroup called
"(unreachable)", then this could result in errors but not actual
confusion.

>>> * should we then also allow setns() without first entering the
>>> cgroupns-root? setns also checks the same conditions as in (a) plus it
>>> checks that your current cgroup is descendant of target cgroupns-root.
>>> Alternatively we can special-case setns() to own cgroupns so that it
>>> doesn't fail.
>>
>> I think setns should completely ignore the caller's cgroup and should
>> not change it.  Userspace can do this.
>>
>
> All of the above changes more or less mean that tasks cannot pin
> themselves by unsharing cgroupns. Do you agree that we don't need that
> "explicit permission from cgroupfs" anymore (via a cgroup.may_unshare
> file or some other mechanism)?

Yes, I agree.

>
>>> * migration for these processes will be tricky, if not impossible. But
>>> the admin trying to do this probably doesn't care about it or will
>>> provision for it.
>>
>> Migration for processes in a mntns that have a current directory
>> outside their mntns is also difficult or impossible.  Same with
>> pidnses with an fd pointing at /proc/self from an outside-the-pid-ns
>> procfs.  Nothing new here.
>>
>> --Andy
>
> Thanks for the review!

No problem.

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv1 6/8] cgroup: restrict cgroup operations within task's cgroupns
       [not found]           ` <20141017092814.GA8848-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
@ 2014-10-22 19:06             ` Aditya Kali
  0 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-10-22 19:06 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andy Lutomirski,
	Eric W. Biederman, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA,
	Ingo Molnar

On Fri, Oct 17, 2014 at 2:28 AM, Serge E. Hallyn <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> wrote:
> Quoting Aditya Kali (adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org):
>> Restrict following operations within the calling tasks:
>> * cgroup_mkdir & cgroup_rmdir
>> * cgroup_attach_task
>> * writes to cgroup files outside of task's cgroupns-root
>>
>> Also, read of /proc/<pid>/cgroup file is now restricted only
>> to tasks under same cgroupns-root. If a task tries to look
>> at cgroup of another task outside of its cgroupns-root, then
>> it won't be able to see anything for the default hierarchy.
>> This is same as if the cgroups are not mounted.
>>
>> Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
>
> So this is a bit different from some other namespaces - if I
> have an open fd to a file, then setns into a mntns where that
> file is not addressable, I can still use the file.
>
> I guess not allowing attach to a cgroup outside our ns is a
> good failsafe as we'll otherwise risk falling off a cliff in
> some code, but I'm not sure the cgroup_file_write/mkdir/rmdir
> restrictions are needed.  (And really I can fchdir to a
> directory not in my ns, so the cgroup-attach restriction isn't
> any more justified).
>

As discussed on another thread, most of the restrictions in this patch
are undesirable and will be removed in the next version. Even the
restriction in cgroup_attach_task() will change to something like:

-     if (!cgroup_is_descendant(dst_cgrp, task_cgroupns_root(leader)))
+     if (!cgroup_is_descendant(dst_cgrp, task_cgroupns_root(current)))
            return -EPERM;

i.e., we don't care about the cgroup of the process being moved. We
only check whether the writer has access to the dst_cgrp.

So I will just drop this patch in the next version and merge the
cgroup_attach_task() change in another patch.

> Still I'm not strictly opposed to this, so
>
> Acked-by: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
>
> just wanted to point this out.
>
>> ---
>>  kernel/cgroup.c | 34 +++++++++++++++++++++++++++++++++-
>>  1 file changed, 33 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
>> index f8099b4..2fc0dfa 100644
>> --- a/kernel/cgroup.c
>> +++ b/kernel/cgroup.c
>> @@ -2318,6 +2318,12 @@ static int cgroup_attach_task(struct cgroup *dst_cgrp,
>>       struct task_struct *task;
>>       int ret;
>>
>> +     /* Only allow changing cgroups accessible within task's cgroup
>> +      * namespace. i.e. 'dst_cgrp' should be a descendant of task's
>> +      * cgroupns->root_cgrp. */
>> +     if (!cgroup_is_descendant(dst_cgrp, task_cgroupns_root(leader)))
>> +             return -EPERM;
>> +
>>       /* look up all src csets */
>>       down_read(&css_set_rwsem);
>>       rcu_read_lock();
>> @@ -2882,6 +2888,10 @@ static ssize_t cgroup_file_write(struct kernfs_open_file *of, char *buf,
>>       struct cgroup_subsys_state *css;
>>       int ret;
>>
>> +     /* Reject writes to cgroup files outside of task's cgroupns-root. */
>> +     if (!cgroup_is_descendant(cgrp, task_cgroupns_root(current)))
>> +             return -EINVAL;
>> +
>>       if (cft->write)
>>               return cft->write(of, buf, nbytes, off);
>>
>> @@ -4560,6 +4570,13 @@ static int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
>>       parent = cgroup_kn_lock_live(parent_kn);
>>       if (!parent)
>>               return -ENODEV;
>> +
>> +     /* Allow mkdir only within process's cgroup namespace root. */
>> +     if (!cgroup_is_descendant(parent, task_cgroupns_root(current))) {
>> +             ret = -EPERM;
>> +             goto out_unlock;
>> +     }
>> +
>>       root = parent->root;
>>
>>       /* allocate the cgroup and its ID, 0 is reserved for the root */
>> @@ -4822,6 +4839,13 @@ static int cgroup_rmdir(struct kernfs_node *kn)
>>       if (!cgrp)
>>               return 0;
>>
>> +     /* Allow rmdir only within process's cgroup namespace root.
>> +      * The process can't delete its own root anyways. */
>> +     if (!cgroup_is_descendant(cgrp, task_cgroupns_root(current))) {
>> +             cgroup_kn_unlock(kn);
>> +             return -EPERM;
>> +     }
>> +
>>       ret = cgroup_destroy_locked(cgrp);
>>
>>       cgroup_kn_unlock(kn);
>> @@ -5051,6 +5075,15 @@ int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns,
>>               if (root == &cgrp_dfl_root && !cgrp_dfl_root_visible)
>>                       continue;
>>
>> +             cgrp = task_cgroup_from_root(tsk, root);
>> +
>> +             /* The cgroup path on default hierarchy is shown only if it
>> +              * falls under current task's cgroupns-root.
>> +              */
>> +             if (root == &cgrp_dfl_root &&
>> +                 !cgroup_is_descendant(cgrp, task_cgroupns_root(current)))
>> +                     continue;
>> +
>>               seq_printf(m, "%d:", root->hierarchy_id);
>>               for_each_subsys(ss, ssid)
>>                       if (root->subsys_mask & (1 << ssid))
>> @@ -5059,7 +5092,6 @@ int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns,
>>                       seq_printf(m, "%sname=%s", count ? "," : "",
>>                                  root->name);
>>               seq_putc(m, ':');
>> -             cgrp = task_cgroup_from_root(tsk, root);
>>               path = cgroup_path(cgrp, buf, PATH_MAX);
>>               if (!path) {
>>                       retval = -ENAMETOOLONG;
>> --
>> 2.1.0.rc2.206.gedb03e5
>>
>> _______________________________________________
>> Containers mailing list
>> Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
>> https://lists.linuxfoundation.org/mailman/listinfo/containers


Thanks for the review!

-- 
Aditya

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv1 6/8] cgroup: restrict cgroup operations within task's cgroupns
@ 2014-10-22 19:06             ` Aditya Kali
  0 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-10-22 19:06 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Tejun Heo, Li Zefan, Serge Hallyn, Andy Lutomirski,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Linux API, Ingo Molnar,
	Linux Containers, Eric W. Biederman

On Fri, Oct 17, 2014 at 2:28 AM, Serge E. Hallyn <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> wrote:
> Quoting Aditya Kali (adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org):
>> Restrict following operations within the calling tasks:
>> * cgroup_mkdir & cgroup_rmdir
>> * cgroup_attach_task
>> * writes to cgroup files outside of task's cgroupns-root
>>
>> Also, read of /proc/<pid>/cgroup file is now restricted only
>> to tasks under same cgroupns-root. If a task tries to look
>> at cgroup of another task outside of its cgroupns-root, then
>> it won't be able to see anything for the default hierarchy.
>> This is same as if the cgroups are not mounted.
>>
>> Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
>
> So this is a bit different from some other namespaces - if I
> have an open fd to a file, then setns into a mntns where that
> file is not addressable, I can still use the file.
>
> I guess not allowing attach to a cgroup outside our ns is a
> good failsafe as we'll otherwise risk falling off a cliff in
> some code, but I'm not sure the cgroup_file_write/mkdir/rmdir
> restrictions are needed.  (And really I can fchdir to a
> directory not in my ns, so the cgroup-attach restriction isn't
> any more justified).
>

As discussed on another thread, most of the restrictions in this patch
are undesirable and will be removed in the next version. Even the
restriction in cgroup_attach_task() will change to something like:

-     if (!cgroup_is_descendant(dst_cgrp, task_cgroupns_root(leader)))
+     if (!cgroup_is_descendant(dst_cgrp, task_cgroupns_root(current)))
            return -EPERM;

i.e., we don't care about the cgroup of the process being moved. We
only check whether the writer has access to the dst_cgrp.

So I will just drop this patch in the next version and merge the
cgroup_attach_task() change in another patch.

> Still I'm not strictly opposed to this, so
>
> Acked-by: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
>
> just wanted to point this out.
>
>> ---
>>  kernel/cgroup.c | 34 +++++++++++++++++++++++++++++++++-
>>  1 file changed, 33 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
>> index f8099b4..2fc0dfa 100644
>> --- a/kernel/cgroup.c
>> +++ b/kernel/cgroup.c
>> @@ -2318,6 +2318,12 @@ static int cgroup_attach_task(struct cgroup *dst_cgrp,
>>       struct task_struct *task;
>>       int ret;
>>
>> +     /* Only allow changing cgroups accessible within task's cgroup
>> +      * namespace. i.e. 'dst_cgrp' should be a descendant of task's
>> +      * cgroupns->root_cgrp. */
>> +     if (!cgroup_is_descendant(dst_cgrp, task_cgroupns_root(leader)))
>> +             return -EPERM;
>> +
>>       /* look up all src csets */
>>       down_read(&css_set_rwsem);
>>       rcu_read_lock();
>> @@ -2882,6 +2888,10 @@ static ssize_t cgroup_file_write(struct kernfs_open_file *of, char *buf,
>>       struct cgroup_subsys_state *css;
>>       int ret;
>>
>> +     /* Reject writes to cgroup files outside of task's cgroupns-root. */
>> +     if (!cgroup_is_descendant(cgrp, task_cgroupns_root(current)))
>> +             return -EINVAL;
>> +
>>       if (cft->write)
>>               return cft->write(of, buf, nbytes, off);
>>
>> @@ -4560,6 +4570,13 @@ static int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
>>       parent = cgroup_kn_lock_live(parent_kn);
>>       if (!parent)
>>               return -ENODEV;
>> +
>> +     /* Allow mkdir only within process's cgroup namespace root. */
>> +     if (!cgroup_is_descendant(parent, task_cgroupns_root(current))) {
>> +             ret = -EPERM;
>> +             goto out_unlock;
>> +     }
>> +
>>       root = parent->root;
>>
>>       /* allocate the cgroup and its ID, 0 is reserved for the root */
>> @@ -4822,6 +4839,13 @@ static int cgroup_rmdir(struct kernfs_node *kn)
>>       if (!cgrp)
>>               return 0;
>>
>> +     /* Allow rmdir only within process's cgroup namespace root.
>> +      * The process can't delete its own root anyways. */
>> +     if (!cgroup_is_descendant(cgrp, task_cgroupns_root(current))) {
>> +             cgroup_kn_unlock(kn);
>> +             return -EPERM;
>> +     }
>> +
>>       ret = cgroup_destroy_locked(cgrp);
>>
>>       cgroup_kn_unlock(kn);
>> @@ -5051,6 +5075,15 @@ int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns,
>>               if (root == &cgrp_dfl_root && !cgrp_dfl_root_visible)
>>                       continue;
>>
>> +             cgrp = task_cgroup_from_root(tsk, root);
>> +
>> +             /* The cgroup path on default hierarchy is shown only if it
>> +              * falls under current task's cgroupns-root.
>> +              */
>> +             if (root == &cgrp_dfl_root &&
>> +                 !cgroup_is_descendant(cgrp, task_cgroupns_root(current)))
>> +                     continue;
>> +
>>               seq_printf(m, "%d:", root->hierarchy_id);
>>               for_each_subsys(ss, ssid)
>>                       if (root->subsys_mask & (1 << ssid))
>> @@ -5059,7 +5092,6 @@ int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns,
>>                       seq_printf(m, "%sname=%s", count ? "," : "",
>>                                  root->name);
>>               seq_putc(m, ':');
>> -             cgrp = task_cgroup_from_root(tsk, root);
>>               path = cgroup_path(cgrp, buf, PATH_MAX);
>>               if (!path) {
>>                       retval = -ENAMETOOLONG;
>> --
>> 2.1.0.rc2.206.gedb03e5
>>
>> _______________________________________________
>> Containers mailing list
>> Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
>> https://lists.linuxfoundation.org/mailman/listinfo/containers


Thanks for the review!

-- 
Aditya

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support
  2014-10-22 18:37                                                               ` Aditya Kali
@ 2014-10-22 19:42                                                                   ` Tejun Heo
  -1 siblings, 0 replies; 384+ messages in thread
From: Tejun Heo @ 2014-10-22 19:42 UTC (permalink / raw)
  To: Aditya Kali
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andy Lutomirski,
	Ingo Molnar, Eric W. Biederman, cgroups-u79uwXL29TY76Z2rM5mHXA

Hello,

On Wed, Oct 22, 2014 at 11:37:55AM -0700, Aditya Kali wrote:
...
> Actually, there is no right answer here. Our options are:
> * show relative path
> -- this will break userspace as /proc/<pid>/cgroup does not show
> relative paths today. This is also very ambiguous (is it relative to
> cgroupns-root or relative to /proc/<pid>/cgroup file reader's cgroup?).

Let's go with this w/o pinning.  The only necessary feature for
cgroupns is making /proc/*/cgroup relative to its own root.  It's
not like containers can avoid trusting their outside world anyway, and
playing tricks with things like this tends to lead to weird surprises
down the road.  If userland messes up, userland messes up.
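
As a rough illustration of what "relative to its own root" means, here is
a minimal userspace sketch that works on path strings. It is not how the
kernel computes the path; toy_path_in_ns() is a hypothetical helper, it
assumes both arguments are absolute paths without a trailing slash (apart
from "/" itself), and returning NULL for a cgroup outside the root is a
choice made for this sketch only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Return the part of 'cgrp_path' that lies below 'ns_root', keeping the
 * leading '/'. Returns "/" when the cgroup is the namespace root itself
 * and NULL when the cgroup is not under that root at all.
 */
static const char *toy_path_in_ns(const char *ns_root, const char *cgrp_path)
{
	size_t n;

	assert(ns_root && cgrp_path);
	n = strlen(ns_root);

	/* The host root "/" is an ancestor of every absolute path. */
	if (n == 1 && ns_root[0] == '/')
		return cgrp_path[0] == '/' ? cgrp_path : NULL;

	if (strncmp(cgrp_path, ns_root, n) != 0)
		return NULL;
	if (cgrp_path[n] == '\0')
		return "/";		/* task sits in the namespace root */
	if (cgrp_path[n] == '/')
		return cgrp_path + n;	/* strip the root prefix */
	return NULL;			/* "/batchjobs2" is not under "/batchjobs" */
}
```

For example, with a namespace root of /batchjobs/c_job_id1, a task in that
same cgroup would be reported at "/" in this model, and a task in a child
cgroup "sub" at "/sub".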

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv1 5/8] cgroup: introduce cgroup namespaces
  2014-10-16 16:37         ` Serge E. Hallyn
@ 2014-10-24  1:03             ` Aditya Kali
  -1 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-10-24  1:03 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andy Lutomirski, Tejun Heo,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Ingo Molnar

I will include the suggested changes in the new patchset. Some comments inline.

On Thu, Oct 16, 2014 at 9:37 AM, Serge E. Hallyn <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> wrote:
> Quoting Aditya Kali (adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org):
>> Introduce the ability to create new cgroup namespace. The newly created
>> cgroup namespace remembers the 'struct cgroup *root_cgrp' at the point
>> of creation of the cgroup namespace. The task that creates the new
>> cgroup namespace and all its future children will now be restricted only
>> to the cgroup hierarchy under this root_cgrp.
>> The main purpose of cgroup namespace is to virtualize the contents
>> of /proc/self/cgroup file. Processes inside a cgroup namespace
>> are only able to see paths relative to their namespace root.
>> This allows container-tools (like libcontainer, lxc, lmctfy, etc.)
>> to create completely virtualized containers without leaking system
>> level cgroup hierarchy to the task.
>> This patch only implements the 'unshare' part of the cgroupns.
>>
>> Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
>
> I'm not sure that the CONFIG_CGROUP_NS is worthwhile.  If you already
> have cgroups in the kernel this won't add much in the way of memory
> usage, right?  And I think the 'experimental' argument has long since
> been squashed.  So I'd argue for simplifying this patch by removing
> CONFIG_CGROUP_NS.
>

With no pinning involved, I think it's safe to enable the feature
without needing a config option. I have removed it from the next version.
The feature is now implicitly available with CONFIG_CGROUPS.

> (more below)
>
>> ---
>>  fs/proc/namespaces.c             |   3 +
>>  include/linux/cgroup.h           |  18 +++++-
>>  include/linux/cgroup_namespace.h |  62 +++++++++++++++++++
>>  include/linux/nsproxy.h          |   2 +
>>  include/linux/proc_ns.h          |   4 ++
>>  init/Kconfig                     |   9 +++
>>  kernel/Makefile                  |   1 +
>>  kernel/cgroup.c                  |  11 ++++
>>  kernel/cgroup_namespace.c        | 128 +++++++++++++++++++++++++++++++++++++++
>>  kernel/fork.c                    |   2 +-
>>  kernel/nsproxy.c                 |  19 +++++-
>>  11 files changed, 255 insertions(+), 4 deletions(-)
>>
>> diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
>> index 8902609..e04ed4b 100644
>> --- a/fs/proc/namespaces.c
>> +++ b/fs/proc/namespaces.c
>> @@ -32,6 +32,9 @@ static const struct proc_ns_operations *ns_entries[] = {
>>       &userns_operations,
>>  #endif
>>       &mntns_operations,
>> +#ifdef CONFIG_CGROUP_NS
>> +     &cgroupns_operations,
>> +#endif
>>  };
>>
>>  static const struct file_operations ns_file_operations = {
>> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
>> index 4a0eb2d..aa86495 100644
>> --- a/include/linux/cgroup.h
>> +++ b/include/linux/cgroup.h
>> @@ -22,6 +22,8 @@
>>  #include <linux/seq_file.h>
>>  #include <linux/kernfs.h>
>>  #include <linux/wait.h>
>> +#include <linux/nsproxy.h>
>> +#include <linux/types.h>
>>
>>  #ifdef CONFIG_CGROUPS
>>
>> @@ -460,6 +462,13 @@ struct cftype {
>>  #endif
>>  };
>>
>> +struct cgroup_namespace {
>> +     atomic_t                count;
>> +     unsigned int            proc_inum;
>> +     struct user_namespace   *user_ns;
>> +     struct cgroup           *root_cgrp;
>> +};
>> +
>>  extern struct cgroup_root cgrp_dfl_root;
>>  extern struct css_set init_css_set;
>>
>> @@ -584,10 +593,17 @@ static inline int cgroup_name(struct cgroup *cgrp, char *buf, size_t buflen)
>>       return kernfs_name(cgrp->kn, buf, buflen);
>>  }
>>
>> +static inline char * __must_check cgroup_path_ns(struct cgroup_namespace *ns,
>> +                                              struct cgroup *cgrp, char *buf,
>> +                                              size_t buflen)
>> +{
>> +     return kernfs_path_from_node(ns->root_cgrp->kn, cgrp->kn, buf, buflen);
>> +}
>> +
>>  static inline char * __must_check cgroup_path(struct cgroup *cgrp, char *buf,
>>                                             size_t buflen)
>>  {
>> -     return kernfs_path(cgrp->kn, buf, buflen);
>> +     return cgroup_path_ns(current->nsproxy->cgroup_ns, cgrp, buf, buflen);
>>  }
>>
>>  static inline void pr_cont_cgroup_name(struct cgroup *cgrp)
>> diff --git a/include/linux/cgroup_namespace.h b/include/linux/cgroup_namespace.h
>> new file mode 100644
>> index 0000000..9f637fe
>> --- /dev/null
>> +++ b/include/linux/cgroup_namespace.h
>> @@ -0,0 +1,62 @@
>> +#ifndef _LINUX_CGROUP_NAMESPACE_H
>> +#define _LINUX_CGROUP_NAMESPACE_H
>> +
>> +#include <linux/nsproxy.h>
>> +#include <linux/cgroup.h>
>> +#include <linux/types.h>
>> +#include <linux/user_namespace.h>
>> +
>> +extern struct cgroup_namespace init_cgroup_ns;
>> +
>> +static inline struct cgroup *task_cgroupns_root(struct task_struct *tsk)
>> +{
>> +     return tsk->nsproxy->cgroup_ns->root_cgrp;
>
> Per the rules in nsproxy.h, you should be taking the task_lock here.
>
> (If you are making assumptions about tsk then you need to state them
> here - I only looked quickly enough that you pass in 'leader')
>

In the new version of the patch, we call this function only for the
'current' task. As per nsproxy.h, no special precautions are needed when
reading the current task's nsproxy. So I just remodeled this function into
"current_cgroupns_root(void)".

>> +}
>> +
>> +#ifdef CONFIG_CGROUP_NS
>> +
>> +extern void free_cgroup_ns(struct cgroup_namespace *ns);
>> +
>> +static inline struct cgroup_namespace *get_cgroup_ns(
>> +             struct cgroup_namespace *ns)
>> +{
>> +     if (ns)
>> +             atomic_inc(&ns->count);
>> +     return ns;
>> +}
>> +
>> +static inline void put_cgroup_ns(struct cgroup_namespace *ns)
>> +{
>> +     if (ns && atomic_dec_and_test(&ns->count))
>> +             free_cgroup_ns(ns);
>> +}
>> +
>> +extern struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
>> +                                            struct user_namespace *user_ns,
>> +                                            struct cgroup_namespace *old_ns);
>> +
>> +#else  /* CONFIG_CGROUP_NS */
>> +
>> +static inline struct cgroup_namespace *get_cgroup_ns(
>> +             struct cgroup_namespace *ns)
>> +{
>> +     return &init_cgroup_ns;
>> +}
>> +
>> +static inline void put_cgroup_ns(struct cgroup_namespace *ns)
>> +{
>> +}
>> +
>> +static inline struct cgroup_namespace *copy_cgroup_ns(
>> +             unsigned long flags,
>> +             struct user_namespace *user_ns,
>> +             struct cgroup_namespace *old_ns) {
>> +     if (flags & CLONE_NEWCGROUP)
>> +             return ERR_PTR(-EINVAL);
>> +
>> +     return old_ns;
>> +}
>> +
>> +#endif  /* CONFIG_CGROUP_NS */
>> +
>> +#endif  /* _LINUX_CGROUP_NAMESPACE_H */
>> diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
>> index 35fa08f..ac0d65b 100644
>> --- a/include/linux/nsproxy.h
>> +++ b/include/linux/nsproxy.h
>> @@ -8,6 +8,7 @@ struct mnt_namespace;
>>  struct uts_namespace;
>>  struct ipc_namespace;
>>  struct pid_namespace;
>> +struct cgroup_namespace;
>>  struct fs_struct;
>>
>>  /*
>> @@ -33,6 +34,7 @@ struct nsproxy {
>>       struct mnt_namespace *mnt_ns;
>>       struct pid_namespace *pid_ns_for_children;
>>       struct net           *net_ns;
>> +     struct cgroup_namespace *cgroup_ns;
>>  };
>>  extern struct nsproxy init_nsproxy;
>>
>> diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
>> index 34a1e10..e56dd73 100644
>> --- a/include/linux/proc_ns.h
>> +++ b/include/linux/proc_ns.h
>> @@ -6,6 +6,8 @@
>>
>>  struct pid_namespace;
>>  struct nsproxy;
>> +struct task_struct;
>> +struct inode;
>>
>>  struct proc_ns_operations {
>>       const char *name;
>> @@ -27,6 +29,7 @@ extern const struct proc_ns_operations ipcns_operations;
>>  extern const struct proc_ns_operations pidns_operations;
>>  extern const struct proc_ns_operations userns_operations;
>>  extern const struct proc_ns_operations mntns_operations;
>> +extern const struct proc_ns_operations cgroupns_operations;
>>
>>  /*
>>   * We always define these enumerators
>> @@ -37,6 +40,7 @@ enum {
>>       PROC_UTS_INIT_INO       = 0xEFFFFFFEU,
>>       PROC_USER_INIT_INO      = 0xEFFFFFFDU,
>>       PROC_PID_INIT_INO       = 0xEFFFFFFCU,
>> +     PROC_CGROUP_INIT_INO    = 0xEFFFFFFBU,
>>  };
>>
>>  #ifdef CONFIG_PROC_FS
>> diff --git a/init/Kconfig b/init/Kconfig
>> index e84c642..c3be001 100644
>> --- a/init/Kconfig
>> +++ b/init/Kconfig
>> @@ -1144,6 +1144,15 @@ config DEBUG_BLK_CGROUP
>>       Enable some debugging help. Currently it exports additional stat
>>       files in a cgroup which can be useful for debugging.
>>
>> +config CGROUP_NS
>> +     bool "CGroup Namespaces"
>> +     default n
>> +     help
>> +       This options enables CGroup Namespaces which can be used to isolate
>> +       cgroup paths. This feature is only useful when unified cgroup
>> +       hierarchy is in use (i.e. cgroups are mounted with sane_behavior
>> +       option).
>> +
>>  endif # CGROUPS
>>
>>  config CHECKPOINT_RESTORE
>> diff --git a/kernel/Makefile b/kernel/Makefile
>> index dc5c775..75334f8 100644
>> --- a/kernel/Makefile
>> +++ b/kernel/Makefile
>> @@ -51,6 +51,7 @@ obj-$(CONFIG_KEXEC) += kexec.o
>>  obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
>>  obj-$(CONFIG_COMPAT) += compat.o
>>  obj-$(CONFIG_CGROUPS) += cgroup.o
>> +obj-$(CONFIG_CGROUP_NS) += cgroup_namespace.o
>>  obj-$(CONFIG_CGROUP_FREEZER) += cgroup_freezer.o
>>  obj-$(CONFIG_CPUSETS) += cpuset.o
>>  obj-$(CONFIG_UTS_NS) += utsname.o
>> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
>> index 2b3e9f9..f8099b4 100644
>> --- a/kernel/cgroup.c
>> +++ b/kernel/cgroup.c
>> @@ -57,6 +57,8 @@
>>  #include <linux/vmalloc.h> /* TODO: replace with more sophisticated array */
>>  #include <linux/kthread.h>
>>  #include <linux/delay.h>
>> +#include <linux/proc_ns.h>
>> +#include <linux/cgroup_namespace.h>
>>
>>  #include <linux/atomic.h>
>>
>> @@ -195,6 +197,15 @@ static void kill_css(struct cgroup_subsys_state *css);
>>  static int cgroup_addrm_files(struct cgroup *cgrp, struct cftype cfts[],
>>                             bool is_add);
>>
>> +struct cgroup_namespace init_cgroup_ns = {
>> +     .count = {
>> +             .counter = 1,
>> +     },
>> +     .proc_inum = PROC_CGROUP_INIT_INO,
>> +     .user_ns = &init_user_ns,
>
> This might mean that you should bump the init_user_ns refcount.
>

Hmm. It doesn't look like the other namespaces are doing it, though (e.g.,
init_pid_ns or init_ipc_ns). The initial count in init_user_ns is set
to 3, which only accounts for some current users, but not all. I will
increment it for init_cgroup_ns nevertheless (in cgroup_init()).

>> +     .root_cgrp = &cgrp_dfl_root.cgrp,
>> +};
>> +
>>  /* IDR wrappers which synchronize using cgroup_idr_lock */
>>  static int cgroup_idr_alloc(struct idr *idr, void *ptr, int start, int end,
>>                           gfp_t gfp_mask)
>> diff --git a/kernel/cgroup_namespace.c b/kernel/cgroup_namespace.c
>> new file mode 100644
>> index 0000000..c16604f
>> --- /dev/null
>> +++ b/kernel/cgroup_namespace.c
>> @@ -0,0 +1,128 @@
>> +
>> +#include <linux/cgroup.h>
>> +#include <linux/cgroup_namespace.h>
>> +#include <linux/sched.h>
>> +#include <linux/slab.h>
>> +#include <linux/nsproxy.h>
>> +#include <linux/proc_ns.h>
>> +
>> +static struct cgroup_namespace *alloc_cgroup_ns(void)
>> +{
>> +     struct cgroup_namespace *new_ns;
>> +
>> +     new_ns = kmalloc(sizeof(struct cgroup_namespace), GFP_KERNEL);
>> +     if (new_ns)
>> +             atomic_set(&new_ns->count, 1);
>> +     return new_ns;
>> +}
>> +
>> +void free_cgroup_ns(struct cgroup_namespace *ns)
>> +{
>> +     cgroup_put(ns->root_cgrp);
>> +     put_user_ns(ns->user_ns);
>
> This is a problem on the error path in copy_cgroup_ns.  The
> alloc_cgroup_ns() doesn't initialize these values, so if
> you should fail in proc_alloc_inum() you'll show up here
> with random values in ns->*.
>

I don't see the codepath that leads to calling free_cgroup_ns() with
uninitialized members. We don't call free_cgroup_ns() on the error
path in copy_cgroup_ns().

>> +     proc_free_inum(ns->proc_inum);

BTW, I was missing the actual kfree(ns) here. Added it.
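
The get/put pattern with the final free included can be sketched in
userspace as follows. This is an illustration under assumptions, not the
patch itself: it uses C11 atomics instead of the kernel's atomic_t, all
toy_ns_* names are hypothetical, and 'freed_flag' exists only so that a
caller can observe when the free happens.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

struct toy_ns {
	atomic_int count;	/* number of live references */
	int *freed_flag;	/* hypothetical hook to observe the free */
};

/* Allocate a namespace object holding one reference for the caller. */
static struct toy_ns *toy_ns_alloc(int *freed_flag)
{
	struct toy_ns *ns = malloc(sizeof(*ns));

	if (ns) {
		atomic_init(&ns->count, 1);
		ns->freed_flag = freed_flag;
	}
	return ns;
}

/* Release everything the object owns, then the object itself. */
static void toy_ns_free(struct toy_ns *ns)
{
	assert(atomic_load(&ns->count) == 0);
	if (ns->freed_flag)
		*ns->freed_flag = 1;
	free(ns);	/* the step that was missing from free_cgroup_ns() */
}

static struct toy_ns *toy_ns_get(struct toy_ns *ns)
{
	if (ns)
		atomic_fetch_add(&ns->count, 1);
	return ns;
}

static void toy_ns_put(struct toy_ns *ns)
{
	/* fetch_sub returns the old value; 1 means the last ref was dropped */
	if (ns && atomic_fetch_sub(&ns->count, 1) == 1)
		toy_ns_free(ns);
}
```

In this model the object survives the first put after a get and is freed
only on the put that drops the count to zero.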

>> +}
>> +EXPORT_SYMBOL(free_cgroup_ns);
>> +
>> +struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
>> +                                     struct user_namespace *user_ns,
>> +                                     struct cgroup_namespace *old_ns)
>> +{
>> +     struct cgroup_namespace *new_ns = NULL;
>> +     struct cgroup *cgrp = NULL;
>> +     int err;
>> +
>> +     BUG_ON(!old_ns);
>> +
>> +     if (!(flags & CLONE_NEWCGROUP))
>> +             return get_cgroup_ns(old_ns);
>> +
>> +     /* Allow only sysadmin to create cgroup namespace. */
>> +     err = -EPERM;
>> +     if (!ns_capable(user_ns, CAP_SYS_ADMIN))
>> +             goto err_out;
>> +
>> +     /* Prevent cgroup changes for this task. */
>> +     threadgroup_lock(current);
>> +
>> +     cgrp = get_task_cgroup(current);
>> +
>> +     /* Creating new CGROUPNS is supported only when unified hierarchy is in
>> +      * use. */
>
> Oh, drat.  Well, I'll take it, but under protest  :)
>

Actually, I realized that this comment and the check below are bogus.
get_task_cgroup(current) always returns the cgroup on the default
hierarchy only, so the check below is unnecessary.
What this comment should really say is that a cgroup namespace only
virtualizes the cgroup path for the default (unified) hierarchy. It's
fine if you have other hierarchies mounted too; for those
hierarchies, the full (non-virtualized) cgroup path will simply be
visible in /proc/self/cgroup. So cgroupns won't help there.

I have updated the comment in the new version of the patch.
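
To make the intended behaviour concrete, here is a small userspace sketch
of how one line of a /proc/self/cgroup-like listing could be produced
under that rule. Everything in it is an assumption for illustration:
'struct toy_hier_entry' and toy_format_line() are hypothetical, the
"id:controllers:path" layout merely mirrors the example at the top of the
thread, and paths are assumed absolute with no trailing slash.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

struct toy_hier_entry {
	int hierarchy_id;
	bool is_default;		/* true for the default (unified) hierarchy */
	const char *controllers;
	const char *full_path;		/* path as seen from the host root */
};

/*
 * Format one line into 'buf'. Only the default hierarchy has the
 * namespace root stripped from its path; every other hierarchy keeps
 * the full, non-virtualized path. Returns what snprintf() returns.
 */
static int toy_format_line(const struct toy_hier_entry *e,
			   const char *ns_root, char *buf, size_t len)
{
	const char *path = e->full_path;
	size_t n = strlen(ns_root);

	assert(e && ns_root && buf);
	if (e->is_default && strncmp(path, ns_root, n) == 0 &&
	    (path[n] == '/' || path[n] == '\0'))
		path = path[n] ? path + n : "/";

	return snprintf(buf, len, "%d:%s:%s",
			e->hierarchy_id, e->controllers, path);
}
```

With a namespace root of /batchjobs/c_job_id1, a default-hierarchy entry
for /batchjobs/c_job_id1/sub would be shown with the path "/sub" in this
model, while an entry on any other hierarchy would still show the whole
/batchjobs/c_job_id1/sub path.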

>> +     err = -EINVAL;
>> +     if (!cgroup_on_dfl(cgrp))
>> +             goto err_out_unlock;
>> +
>> +     err = -ENOMEM;
>> +     new_ns = alloc_cgroup_ns();
>> +     if (!new_ns)
>> +             goto err_out_unlock;
>> +
>> +     err = proc_alloc_inum(&new_ns->proc_inum);
>> +     if (err)
>> +             goto err_out_unlock;
>> +
>> +     new_ns->user_ns = get_user_ns(user_ns);
>> +     new_ns->root_cgrp = cgrp;
>> +
>> +     threadgroup_unlock(current);
>> +
>> +     return new_ns;
>> +
>> +err_out_unlock:
>> +     threadgroup_unlock(current);
>> +err_out:
>> +     if (cgrp)
>> +             cgroup_put(cgrp);
>> +     kfree(new_ns);
>> +     return ERR_PTR(err);
>> +}
>> +
>> +static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
>> +{
>> +     pr_info("setns not supported for cgroup namespace");
>> +     return -EINVAL;
>> +}
>> +
>> +static void *cgroupns_get(struct task_struct *task)
>> +{
>> +     struct cgroup_namespace *ns = NULL;
>> +     struct nsproxy *nsproxy;
>> +
>> +     rcu_read_lock();
>> +     nsproxy = task->nsproxy;
>> +     if (nsproxy) {
>> +             ns = nsproxy->cgroup_ns;
>> +             get_cgroup_ns(ns);
>> +     }
>> +     rcu_read_unlock();
>> +
>> +     return ns;
>> +}
>> +
>> +static void cgroupns_put(void *ns)
>> +{
>> +     put_cgroup_ns(ns);
>> +}
>> +
>> +static unsigned int cgroupns_inum(void *ns)
>> +{
>> +     struct cgroup_namespace *cgroup_ns = ns;
>> +
>> +     return cgroup_ns->proc_inum;
>> +}
>> +
>> +const struct proc_ns_operations cgroupns_operations = {
>> +     .name           = "cgroup",
>> +     .type           = CLONE_NEWCGROUP,
>> +     .get            = cgroupns_get,
>> +     .put            = cgroupns_put,
>> +     .install        = cgroupns_install,
>> +     .inum           = cgroupns_inum,
>> +};
>> +
>> +static __init int cgroup_namespaces_init(void)
>> +{
>> +     return 0;
>> +}
>> +subsys_initcall(cgroup_namespaces_init);
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index 0cf9cdb..cc06851 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -1790,7 +1790,7 @@ static int check_unshare_flags(unsigned long unshare_flags)
>>       if (unshare_flags & ~(CLONE_THREAD|CLONE_FS|CLONE_NEWNS|CLONE_SIGHAND|
>>                               CLONE_VM|CLONE_FILES|CLONE_SYSVSEM|
>>                               CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWNET|
>> -                             CLONE_NEWUSER|CLONE_NEWPID))
>> +                             CLONE_NEWUSER|CLONE_NEWPID|CLONE_NEWCGROUP))
>>               return -EINVAL;
>>       /*
>>        * Not implemented, but pretend it works if there is nothing to
>> diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
>> index ef42d0a..a8b1970 100644
>> --- a/kernel/nsproxy.c
>> +++ b/kernel/nsproxy.c
>> @@ -25,6 +25,7 @@
>>  #include <linux/proc_ns.h>
>>  #include <linux/file.h>
>>  #include <linux/syscalls.h>
>> +#include <linux/cgroup_namespace.h>
>>
>>  static struct kmem_cache *nsproxy_cachep;
>>
>> @@ -39,6 +40,7 @@ struct nsproxy init_nsproxy = {
>>  #ifdef CONFIG_NET
>>       .net_ns                 = &init_net,
>>  #endif
>> +     .cgroup_ns              = &init_cgroup_ns,
>>  };
>>
>>  static inline struct nsproxy *create_nsproxy(void)
>> @@ -92,6 +94,13 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
>>               goto out_pid;
>>       }
>>
>> +     new_nsp->cgroup_ns = copy_cgroup_ns(flags, user_ns,
>> +                                         tsk->nsproxy->cgroup_ns);
>> +     if (IS_ERR(new_nsp->cgroup_ns)) {
>> +             err = PTR_ERR(new_nsp->cgroup_ns);
>> +             goto out_cgroup;
>> +     }
>> +
>>       new_nsp->net_ns = copy_net_ns(flags, user_ns, tsk->nsproxy->net_ns);
>>       if (IS_ERR(new_nsp->net_ns)) {
>>               err = PTR_ERR(new_nsp->net_ns);
>> @@ -101,6 +110,9 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
>>       return new_nsp;
>>
>>  out_net:
>> +     if (new_nsp->cgroup_ns)
>> +             put_cgroup_ns(new_nsp->cgroup_ns);
>> +out_cgroup:
>>       if (new_nsp->pid_ns_for_children)
>>               put_pid_ns(new_nsp->pid_ns_for_children);
>>  out_pid:
>> @@ -128,7 +140,8 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
>>       struct nsproxy *new_ns;
>>
>>       if (likely(!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
>> -                           CLONE_NEWPID | CLONE_NEWNET)))) {
>> +                           CLONE_NEWPID | CLONE_NEWNET |
>> +                           CLONE_NEWCGROUP)))) {
>>               get_nsproxy(old_ns);
>>               return 0;
>>       }
>> @@ -165,6 +178,8 @@ void free_nsproxy(struct nsproxy *ns)
>>               put_ipc_ns(ns->ipc_ns);
>>       if (ns->pid_ns_for_children)
>>               put_pid_ns(ns->pid_ns_for_children);
>> +     if (ns->cgroup_ns)
>> +             put_cgroup_ns(ns->cgroup_ns);
>>       put_net(ns->net_ns);
>>       kmem_cache_free(nsproxy_cachep, ns);
>>  }
>> @@ -180,7 +195,7 @@ int unshare_nsproxy_namespaces(unsigned long unshare_flags,
>>       int err = 0;
>>
>>       if (!(unshare_flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
>> -                            CLONE_NEWNET | CLONE_NEWPID)))
>> +                            CLONE_NEWNET | CLONE_NEWPID | CLONE_NEWCGROUP)))
>>               return 0;
>>
>>       user_ns = new_cred ? new_cred->user_ns : current_user_ns();
>> --
>> 2.1.0.rc2.206.gedb03e5
>>
>> _______________________________________________
>> Containers mailing list
>> Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
>> https://lists.linuxfoundation.org/mailman/listinfo/containers


Thanks for the review!
-- 
Aditya

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv1 5/8] cgroup: introduce cgroup namespaces
@ 2014-10-24  1:03             ` Aditya Kali
  0 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-10-24  1:03 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Tejun Heo, Li Zefan, Serge Hallyn, Andy Lutomirski, cgroups,
	linux-kernel, Linux API, Ingo Molnar, Linux Containers

I will include the suggested changes in the new patchset. Some comments inline.

On Thu, Oct 16, 2014 at 9:37 AM, Serge E. Hallyn <serge@hallyn.com> wrote:
> Quoting Aditya Kali (adityakali@google.com):
>> Introduce the ability to create new cgroup namespace. The newly created
>> cgroup namespace remembers the 'struct cgroup *root_cgrp' at the point
>> of creation of the cgroup namespace. The task that creates the new
>> cgroup namespace and all its future children will now be restricted only
>> to the cgroup hierarchy under this root_cgrp.
>> The main purpose of cgroup namespace is to virtualize the contents
>> of /proc/self/cgroup file. Processes inside a cgroup namespace
>> are only able to see paths relative to their namespace root.
>> This allows container-tools (like libcontainer, lxc, lmctfy, etc.)
>> to create completely virtualized containers without leaking system
>> level cgroup hierarchy to the task.
>> This patch only implements the 'unshare' part of the cgroupns.
>>
>> Signed-off-by: Aditya Kali <adityakali@google.com>
>
> I'm not sure that the CONFIG_CGROUP_NS is worthwhile.  If you already
> have cgroups in the kernel this won't add much in the way of memory
> usage, right?  And I think the 'experimental' argument has long since
> been squashed.  So I'd argue for simplifying this patch by removing
> CONFIG_CGROUP_NS.
>

With no pinning involved, I think it's safe to enable the feature
without needing a config option. I have removed it from the next version.
The feature is now implicitly available with CONFIG_CGROUPS.

> (more below)
>
>> ---
>>  fs/proc/namespaces.c             |   3 +
>>  include/linux/cgroup.h           |  18 +++++-
>>  include/linux/cgroup_namespace.h |  62 +++++++++++++++++++
>>  include/linux/nsproxy.h          |   2 +
>>  include/linux/proc_ns.h          |   4 ++
>>  init/Kconfig                     |   9 +++
>>  kernel/Makefile                  |   1 +
>>  kernel/cgroup.c                  |  11 ++++
>>  kernel/cgroup_namespace.c        | 128 +++++++++++++++++++++++++++++++++++++++
>>  kernel/fork.c                    |   2 +-
>>  kernel/nsproxy.c                 |  19 +++++-
>>  11 files changed, 255 insertions(+), 4 deletions(-)
>>
>> diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
>> index 8902609..e04ed4b 100644
>> --- a/fs/proc/namespaces.c
>> +++ b/fs/proc/namespaces.c
>> @@ -32,6 +32,9 @@ static const struct proc_ns_operations *ns_entries[] = {
>>       &userns_operations,
>>  #endif
>>       &mntns_operations,
>> +#ifdef CONFIG_CGROUP_NS
>> +     &cgroupns_operations,
>> +#endif
>>  };
>>
>>  static const struct file_operations ns_file_operations = {
>> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
>> index 4a0eb2d..aa86495 100644
>> --- a/include/linux/cgroup.h
>> +++ b/include/linux/cgroup.h
>> @@ -22,6 +22,8 @@
>>  #include <linux/seq_file.h>
>>  #include <linux/kernfs.h>
>>  #include <linux/wait.h>
>> +#include <linux/nsproxy.h>
>> +#include <linux/types.h>
>>
>>  #ifdef CONFIG_CGROUPS
>>
>> @@ -460,6 +462,13 @@ struct cftype {
>>  #endif
>>  };
>>
>> +struct cgroup_namespace {
>> +     atomic_t                count;
>> +     unsigned int            proc_inum;
>> +     struct user_namespace   *user_ns;
>> +     struct cgroup           *root_cgrp;
>> +};
>> +
>>  extern struct cgroup_root cgrp_dfl_root;
>>  extern struct css_set init_css_set;
>>
>> @@ -584,10 +593,17 @@ static inline int cgroup_name(struct cgroup *cgrp, char *buf, size_t buflen)
>>       return kernfs_name(cgrp->kn, buf, buflen);
>>  }
>>
>> +static inline char * __must_check cgroup_path_ns(struct cgroup_namespace *ns,
>> +                                              struct cgroup *cgrp, char *buf,
>> +                                              size_t buflen)
>> +{
>> +     return kernfs_path_from_node(ns->root_cgrp->kn, cgrp->kn, buf, buflen);
>> +}
>> +
>>  static inline char * __must_check cgroup_path(struct cgroup *cgrp, char *buf,
>>                                             size_t buflen)
>>  {
>> -     return kernfs_path(cgrp->kn, buf, buflen);
>> +     return cgroup_path_ns(current->nsproxy->cgroup_ns, cgrp, buf, buflen);
>>  }
>>
>>  static inline void pr_cont_cgroup_name(struct cgroup *cgrp)
>> diff --git a/include/linux/cgroup_namespace.h b/include/linux/cgroup_namespace.h
>> new file mode 100644
>> index 0000000..9f637fe
>> --- /dev/null
>> +++ b/include/linux/cgroup_namespace.h
>> @@ -0,0 +1,62 @@
>> +#ifndef _LINUX_CGROUP_NAMESPACE_H
>> +#define _LINUX_CGROUP_NAMESPACE_H
>> +
>> +#include <linux/nsproxy.h>
>> +#include <linux/cgroup.h>
>> +#include <linux/types.h>
>> +#include <linux/user_namespace.h>
>> +
>> +extern struct cgroup_namespace init_cgroup_ns;
>> +
>> +static inline struct cgroup *task_cgroupns_root(struct task_struct *tsk)
>> +{
>> +     return tsk->nsproxy->cgroup_ns->root_cgrp;
>
> Per the rules in nsproxy.h, you should be taking the task_lock here.
>
> (If you are making assumptions about tsk then you need to state them
> here - I only looked quickly enough that you pass in 'leader')
>

In the new version of the patch, we call this function only for the
'current' task. As per nsproxy.h, no special precautions are needed when
reading the current task's nsproxy. So I just remodeled this function into
"current_cgroupns_root(void)".

>> +}
>> +
>> +#ifdef CONFIG_CGROUP_NS
>> +
>> +extern void free_cgroup_ns(struct cgroup_namespace *ns);
>> +
>> +static inline struct cgroup_namespace *get_cgroup_ns(
>> +             struct cgroup_namespace *ns)
>> +{
>> +     if (ns)
>> +             atomic_inc(&ns->count);
>> +     return ns;
>> +}
>> +
>> +static inline void put_cgroup_ns(struct cgroup_namespace *ns)
>> +{
>> +     if (ns && atomic_dec_and_test(&ns->count))
>> +             free_cgroup_ns(ns);
>> +}
>> +
>> +extern struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
>> +                                            struct user_namespace *user_ns,
>> +                                            struct cgroup_namespace *old_ns);
>> +
>> +#else  /* CONFIG_CGROUP_NS */
>> +
>> +static inline struct cgroup_namespace *get_cgroup_ns(
>> +             struct cgroup_namespace *ns)
>> +{
>> +     return &init_cgroup_ns;
>> +}
>> +
>> +static inline void put_cgroup_ns(struct cgroup_namespace *ns)
>> +{
>> +}
>> +
>> +static inline struct cgroup_namespace *copy_cgroup_ns(
>> +             unsigned long flags,
>> +             struct user_namespace *user_ns,
>> +             struct cgroup_namespace *old_ns) {
>> +     if (flags & CLONE_NEWCGROUP)
>> +             return ERR_PTR(-EINVAL);
>> +
>> +     return old_ns;
>> +}
>> +
>> +#endif  /* CONFIG_CGROUP_NS */
>> +
>> +#endif  /* _LINUX_CGROUP_NAMESPACE_H */
>> diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
>> index 35fa08f..ac0d65b 100644
>> --- a/include/linux/nsproxy.h
>> +++ b/include/linux/nsproxy.h
>> @@ -8,6 +8,7 @@ struct mnt_namespace;
>>  struct uts_namespace;
>>  struct ipc_namespace;
>>  struct pid_namespace;
>> +struct cgroup_namespace;
>>  struct fs_struct;
>>
>>  /*
>> @@ -33,6 +34,7 @@ struct nsproxy {
>>       struct mnt_namespace *mnt_ns;
>>       struct pid_namespace *pid_ns_for_children;
>>       struct net           *net_ns;
>> +     struct cgroup_namespace *cgroup_ns;
>>  };
>>  extern struct nsproxy init_nsproxy;
>>
>> diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
>> index 34a1e10..e56dd73 100644
>> --- a/include/linux/proc_ns.h
>> +++ b/include/linux/proc_ns.h
>> @@ -6,6 +6,8 @@
>>
>>  struct pid_namespace;
>>  struct nsproxy;
>> +struct task_struct;
>> +struct inode;
>>
>>  struct proc_ns_operations {
>>       const char *name;
>> @@ -27,6 +29,7 @@ extern const struct proc_ns_operations ipcns_operations;
>>  extern const struct proc_ns_operations pidns_operations;
>>  extern const struct proc_ns_operations userns_operations;
>>  extern const struct proc_ns_operations mntns_operations;
>> +extern const struct proc_ns_operations cgroupns_operations;
>>
>>  /*
>>   * We always define these enumerators
>> @@ -37,6 +40,7 @@ enum {
>>       PROC_UTS_INIT_INO       = 0xEFFFFFFEU,
>>       PROC_USER_INIT_INO      = 0xEFFFFFFDU,
>>       PROC_PID_INIT_INO       = 0xEFFFFFFCU,
>> +     PROC_CGROUP_INIT_INO    = 0xEFFFFFFBU,
>>  };
>>
>>  #ifdef CONFIG_PROC_FS
>> diff --git a/init/Kconfig b/init/Kconfig
>> index e84c642..c3be001 100644
>> --- a/init/Kconfig
>> +++ b/init/Kconfig
>> @@ -1144,6 +1144,15 @@ config DEBUG_BLK_CGROUP
>>       Enable some debugging help. Currently it exports additional stat
>>       files in a cgroup which can be useful for debugging.
>>
>> +config CGROUP_NS
>> +     bool "CGroup Namespaces"
>> +     default n
>> +     help
>> +       This option enables CGroup Namespaces which can be used to isolate
>> +       cgroup paths. This feature is only useful when unified cgroup
>> +       hierarchy is in use (i.e. cgroups are mounted with sane_behavior
>> +       option).
>> +
>>  endif # CGROUPS
>>
>>  config CHECKPOINT_RESTORE
>> diff --git a/kernel/Makefile b/kernel/Makefile
>> index dc5c775..75334f8 100644
>> --- a/kernel/Makefile
>> +++ b/kernel/Makefile
>> @@ -51,6 +51,7 @@ obj-$(CONFIG_KEXEC) += kexec.o
>>  obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
>>  obj-$(CONFIG_COMPAT) += compat.o
>>  obj-$(CONFIG_CGROUPS) += cgroup.o
>> +obj-$(CONFIG_CGROUP_NS) += cgroup_namespace.o
>>  obj-$(CONFIG_CGROUP_FREEZER) += cgroup_freezer.o
>>  obj-$(CONFIG_CPUSETS) += cpuset.o
>>  obj-$(CONFIG_UTS_NS) += utsname.o
>> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
>> index 2b3e9f9..f8099b4 100644
>> --- a/kernel/cgroup.c
>> +++ b/kernel/cgroup.c
>> @@ -57,6 +57,8 @@
>>  #include <linux/vmalloc.h> /* TODO: replace with more sophisticated array */
>>  #include <linux/kthread.h>
>>  #include <linux/delay.h>
>> +#include <linux/proc_ns.h>
>> +#include <linux/cgroup_namespace.h>
>>
>>  #include <linux/atomic.h>
>>
>> @@ -195,6 +197,15 @@ static void kill_css(struct cgroup_subsys_state *css);
>>  static int cgroup_addrm_files(struct cgroup *cgrp, struct cftype cfts[],
>>                             bool is_add);
>>
>> +struct cgroup_namespace init_cgroup_ns = {
>> +     .count = {
>> +             .counter = 1,
>> +     },
>> +     .proc_inum = PROC_CGROUP_INIT_INO,
>> +     .user_ns = &init_user_ns,
>
> This might mean that you should bump the init_user_ns refcount.
>

Humm. Doesn't look like all other namespaces are doing it though (ex:
init_pid_ns or init_ipc_ns). The initial count in init_user_ns is set
to 3 which only accounts for some current users, but not all. I will
increment it for init_cgroup_ns nevertheless (in cgroup_init()).

>> +     .root_cgrp = &cgrp_dfl_root.cgrp,
>> +};
>> +
>>  /* IDR wrappers which synchronize using cgroup_idr_lock */
>>  static int cgroup_idr_alloc(struct idr *idr, void *ptr, int start, int end,
>>                           gfp_t gfp_mask)
>> diff --git a/kernel/cgroup_namespace.c b/kernel/cgroup_namespace.c
>> new file mode 100644
>> index 0000000..c16604f
>> --- /dev/null
>> +++ b/kernel/cgroup_namespace.c
>> @@ -0,0 +1,128 @@
>> +
>> +#include <linux/cgroup.h>
>> +#include <linux/cgroup_namespace.h>
>> +#include <linux/sched.h>
>> +#include <linux/slab.h>
>> +#include <linux/nsproxy.h>
>> +#include <linux/proc_ns.h>
>> +
>> +static struct cgroup_namespace *alloc_cgroup_ns(void)
>> +{
>> +     struct cgroup_namespace *new_ns;
>> +
>> +     new_ns = kmalloc(sizeof(struct cgroup_namespace), GFP_KERNEL);
>> +     if (new_ns)
>> +             atomic_set(&new_ns->count, 1);
>> +     return new_ns;
>> +}
>> +
>> +void free_cgroup_ns(struct cgroup_namespace *ns)
>> +{
>> +     cgroup_put(ns->root_cgrp);
>> +     put_user_ns(ns->user_ns);
>
> This is a problem on the error path in copy_cgroup_ns.  The
> alloc_cgroup_ns() doesn't initialize these values, so if
> you should fail in proc_alloc_inum() you'll show up here
> with random values in ns->*.
>

I don't see the codepath that leads to calling free_cgroup_ns() with
uninitialized members. We don't call free_cgroup_ns() on the error
path in copy_cgroup_ns().

>> +     proc_free_inum(ns->proc_inum);

BTW, I was missing the actual kfree(ns) here. Added it.

>> +}
>> +EXPORT_SYMBOL(free_cgroup_ns);
>> +
>> +struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
>> +                                     struct user_namespace *user_ns,
>> +                                     struct cgroup_namespace *old_ns)
>> +{
>> +     struct cgroup_namespace *new_ns = NULL;
>> +     struct cgroup *cgrp = NULL;
>> +     int err;
>> +
>> +     BUG_ON(!old_ns);
>> +
>> +     if (!(flags & CLONE_NEWCGROUP))
>> +             return get_cgroup_ns(old_ns);
>> +
>> +     /* Allow only sysadmin to create cgroup namespace. */
>> +     err = -EPERM;
>> +     if (!ns_capable(user_ns, CAP_SYS_ADMIN))
>> +             goto err_out;
>> +
>> +     /* Prevent cgroup changes for this task. */
>> +     threadgroup_lock(current);
>> +
>> +     cgrp = get_task_cgroup(current);
>> +
>> +     /* Creating new CGROUPNS is supported only when unified hierarchy is in
>> +      * use. */
>
> Oh, drat.  Well, I'll take, it, but under protest  :)
>

Actually, I realized that this comment and the check below are bogus.
The 'get_task_cgroup(current)' always returns only the cgroup on the
default hierarchy. And so, the check below is unnecessary.
What this comment should really say is that cgroup namespace only
virtualizes the cgroup path for the default (unified) hierarchy. It's
fine if you have other hierarchies mounted too. Just that for those
hierarchies, the full (non-virtualized) cgroup path will be visible in
/proc/self/cgroup. So cgroupns won't help there.

I have updated the comment in the new version of the patch.

>> +     err = -EINVAL;
>> +     if (!cgroup_on_dfl(cgrp))
>> +             goto err_out_unlock;
>> +
>> +     err = -ENOMEM;
>> +     new_ns = alloc_cgroup_ns();
>> +     if (!new_ns)
>> +             goto err_out_unlock;
>> +
>> +     err = proc_alloc_inum(&new_ns->proc_inum);
>> +     if (err)
>> +             goto err_out_unlock;
>> +
>> +     new_ns->user_ns = get_user_ns(user_ns);
>> +     new_ns->root_cgrp = cgrp;
>> +
>> +     threadgroup_unlock(current);
>> +
>> +     return new_ns;
>> +
>> +err_out_unlock:
>> +     threadgroup_unlock(current);
>> +err_out:
>> +     if (cgrp)
>> +             cgroup_put(cgrp);
>> +     kfree(new_ns);
>> +     return ERR_PTR(err);
>> +}
>> +
>> +static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
>> +{
>> +     pr_info("setns not supported for cgroup namespace");
>> +     return -EINVAL;
>> +}
>> +
>> +static void *cgroupns_get(struct task_struct *task)
>> +{
>> +     struct cgroup_namespace *ns = NULL;
>> +     struct nsproxy *nsproxy;
>> +
>> +     rcu_read_lock();
>> +     nsproxy = task->nsproxy;
>> +     if (nsproxy) {
>> +             ns = nsproxy->cgroup_ns;
>> +             get_cgroup_ns(ns);
>> +     }
>> +     rcu_read_unlock();
>> +
>> +     return ns;
>> +}
>> +
>> +static void cgroupns_put(void *ns)
>> +{
>> +     put_cgroup_ns(ns);
>> +}
>> +
>> +static unsigned int cgroupns_inum(void *ns)
>> +{
>> +     struct cgroup_namespace *cgroup_ns = ns;
>> +
>> +     return cgroup_ns->proc_inum;
>> +}
>> +
>> +const struct proc_ns_operations cgroupns_operations = {
>> +     .name           = "cgroup",
>> +     .type           = CLONE_NEWCGROUP,
>> +     .get            = cgroupns_get,
>> +     .put            = cgroupns_put,
>> +     .install        = cgroupns_install,
>> +     .inum           = cgroupns_inum,
>> +};
>> +
>> +static __init int cgroup_namespaces_init(void)
>> +{
>> +     return 0;
>> +}
>> +subsys_initcall(cgroup_namespaces_init);
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index 0cf9cdb..cc06851 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -1790,7 +1790,7 @@ static int check_unshare_flags(unsigned long unshare_flags)
>>       if (unshare_flags & ~(CLONE_THREAD|CLONE_FS|CLONE_NEWNS|CLONE_SIGHAND|
>>                               CLONE_VM|CLONE_FILES|CLONE_SYSVSEM|
>>                               CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWNET|
>> -                             CLONE_NEWUSER|CLONE_NEWPID))
>> +                             CLONE_NEWUSER|CLONE_NEWPID|CLONE_NEWCGROUP))
>>               return -EINVAL;
>>       /*
>>        * Not implemented, but pretend it works if there is nothing to
>> diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
>> index ef42d0a..a8b1970 100644
>> --- a/kernel/nsproxy.c
>> +++ b/kernel/nsproxy.c
>> @@ -25,6 +25,7 @@
>>  #include <linux/proc_ns.h>
>>  #include <linux/file.h>
>>  #include <linux/syscalls.h>
>> +#include <linux/cgroup_namespace.h>
>>
>>  static struct kmem_cache *nsproxy_cachep;
>>
>> @@ -39,6 +40,7 @@ struct nsproxy init_nsproxy = {
>>  #ifdef CONFIG_NET
>>       .net_ns                 = &init_net,
>>  #endif
>> +     .cgroup_ns              = &init_cgroup_ns,
>>  };
>>
>>  static inline struct nsproxy *create_nsproxy(void)
>> @@ -92,6 +94,13 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
>>               goto out_pid;
>>       }
>>
>> +     new_nsp->cgroup_ns = copy_cgroup_ns(flags, user_ns,
>> +                                         tsk->nsproxy->cgroup_ns);
>> +     if (IS_ERR(new_nsp->cgroup_ns)) {
>> +             err = PTR_ERR(new_nsp->cgroup_ns);
>> +             goto out_cgroup;
>> +     }
>> +
>>       new_nsp->net_ns = copy_net_ns(flags, user_ns, tsk->nsproxy->net_ns);
>>       if (IS_ERR(new_nsp->net_ns)) {
>>               err = PTR_ERR(new_nsp->net_ns);
>> @@ -101,6 +110,9 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
>>       return new_nsp;
>>
>>  out_net:
>> +     if (new_nsp->cgroup_ns)
>> +             put_cgroup_ns(new_nsp->cgroup_ns);
>> +out_cgroup:
>>       if (new_nsp->pid_ns_for_children)
>>               put_pid_ns(new_nsp->pid_ns_for_children);
>>  out_pid:
>> @@ -128,7 +140,8 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
>>       struct nsproxy *new_ns;
>>
>>       if (likely(!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
>> -                           CLONE_NEWPID | CLONE_NEWNET)))) {
>> +                           CLONE_NEWPID | CLONE_NEWNET |
>> +                           CLONE_NEWCGROUP)))) {
>>               get_nsproxy(old_ns);
>>               return 0;
>>       }
>> @@ -165,6 +178,8 @@ void free_nsproxy(struct nsproxy *ns)
>>               put_ipc_ns(ns->ipc_ns);
>>       if (ns->pid_ns_for_children)
>>               put_pid_ns(ns->pid_ns_for_children);
>> +     if (ns->cgroup_ns)
>> +             put_cgroup_ns(ns->cgroup_ns);
>>       put_net(ns->net_ns);
>>       kmem_cache_free(nsproxy_cachep, ns);
>>  }
>> @@ -180,7 +195,7 @@ int unshare_nsproxy_namespaces(unsigned long unshare_flags,
>>       int err = 0;
>>
>>       if (!(unshare_flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
>> -                            CLONE_NEWNET | CLONE_NEWPID)))
>> +                            CLONE_NEWNET | CLONE_NEWPID | CLONE_NEWCGROUP)))
>>               return 0;
>>
>>       user_ns = new_cred ? new_cred->user_ns : current_user_ns();
>> --
>> 2.1.0.rc2.206.gedb03e5
>>
>> _______________________________________________
>> Containers mailing list
>> Containers@lists.linux-foundation.org
>> https://lists.linuxfoundation.org/mailman/listinfo/containers


Thanks for the review!
-- 
Aditya

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv1 5/8] cgroup: introduce cgroup namespaces
  2014-10-24  1:03             ` Aditya Kali
@ 2014-10-25  3:16                 ` Serge E. Hallyn
  -1 siblings, 0 replies; 384+ messages in thread
From: Serge E. Hallyn @ 2014-10-25  3:16 UTC (permalink / raw)
  To: Aditya Kali
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andy Lutomirski, Tejun Heo,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Ingo Molnar

Quoting Aditya Kali (adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org):
> >> +void free_cgroup_ns(struct cgroup_namespace *ns)
> >> +{
> >> +     cgroup_put(ns->root_cgrp);
> >> +     put_user_ns(ns->user_ns);
> >
> > This is a problem on the error path in copy_cgroup_ns.  The
> > alloc_cgroup_ns() doesn't initialize these values, so if
> > you should fail in proc_alloc_inum() you'll show up here
> > with random values in ns->*.
> >
> 
> I don't see the codepath that leads to calling free_cgroup_ns() with
> uninitialized members. We don't call free_cgroup_ns() on the error
> path in copy_cgroup_ns().

Hm, yeah, I'm not seeing it now, sorry.

^ permalink raw reply	[flat|nested] 384+ messages in thread

* [PATCHv2 0/7] CGroup Namespaces
       [not found] <adityakali-cgroupns>
@ 2014-10-31 19:18   ` Aditya Kali
  2014-07-17 19:52 ` Aditya Kali
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-10-31 19:18 UTC (permalink / raw)
  To: tj-DgEjT+Ai2ygdnm+yROfE0A, lizefan-hv44wF8Li93QT0dZR+AlfA,
	serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA, luto-kltTT9wpgjJwATOyAt5JVQ,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, mingo-H+wXaHxf7aLQT0dZR+AlfA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Another attempt at the CGroup Namespace patch-set. This incorporates
suggestions on the previous patch-set.

Changes from V1:
1. No pinning of processes within cgroupns. Tasks can be freely moved
   across cgroups even outside of their cgroupns-root. Usual DAC/MAC policies
   apply as before.
2. Path in /proc/<pid>/cgroup is now always shown and is relative to
   cgroupns-root. So path can contain '/..' strings depending on cgroupns-root
   of the reader and cgroup of <pid>.
3. setns() does not require the process to first move under target
   cgroupns-root.

Changes from RFC (V0):
1. setns support for cgroupns
2. 'mount -t cgroup cgroup <mntpt>' from inside a cgroupns now
   mounts the cgroup hierarchy with cgroupns-root as the filesystem root.
3. writes to cgroup files outside of cgroupns-root are not allowed
4. visibility of /proc/<pid>/cgroup is further restricted by not showing
   anything if the <pid> is in a sibling cgroupns and its cgroup falls outside
   your cgroupns-root.

More details in the writeup below.

Background
  Cgroups and Namespaces are used together to create “virtual”
  containers that isolate the host environment from the processes
  running in the container. But since cgroups themselves are not
  “virtualized”, a task is always able to see the global cgroup view
  through the cgroupfs mount and via the /proc/self/cgroup file.

  $ cat /proc/self/cgroup
  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
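
  Each line of /proc/<pid>/cgroup has the form
  "hierarchy-id:controller-list:cgroup-path". As a rough userspace
  illustration of what a process inside a container can extract from it,
  here is a minimal parsing sketch. The struct and function names
  (cgroup_entry, parse_cgroup_line) and the buffer sizes are assumptions
  made for this example and are not part of the patch-set.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* One parsed entry of /proc/<pid>/cgroup (illustrative names and sizes). */
struct cgroup_entry {
	int hierarchy_id;        /* e.g. 0 in the example above */
	char controllers[128];   /* comma-separated controller list */
	char path[256];          /* cgroup path as seen by the reader */
};

/*
 * Split "ID:controllers:path" at the first two colons.
 * Returns 0 on success, -1 if the line is malformed or a field
 * does not fit into the fixed-size buffers.
 */
int parse_cgroup_line(const char *line, struct cgroup_entry *out)
{
	const char *c1, *c2;
	char *end;
	long id;
	size_t len;

	assert(line && out);

	c1 = strchr(line, ':');
	if (!c1 || c1 == line)
		return -1;
	c2 = strchr(c1 + 1, ':');
	if (!c2)
		return -1;

	/* The hierarchy id must be a non-negative number ending at c1. */
	id = strtol(line, &end, 10);
	if (end != c1 || id < 0)
		return -1;
	out->hierarchy_id = (int)id;

	len = (size_t)(c2 - (c1 + 1));
	if (len >= sizeof(out->controllers))
		return -1;
	memcpy(out->controllers, c1 + 1, len);
	out->controllers[len] = '\0';

	/* The path runs to the end of the line; drop a trailing newline. */
	len = strcspn(c2 + 1, "\n");
	if (len == 0 || len >= sizeof(out->path))
		return -1;
	memcpy(out->path, c2 + 1, len);
	out->path[len] = '\0';
	return 0;
}
```

  With the line shown above, the parsed path field is
  "/batchjobs/c_job_id1", which is exactly the host-side name that the
  container gets to see today.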

  This exposure of cgroup names to the processes running inside a
  container results in some problems:
  (1) The container names are typically host-container-management-agent
      (systemd, docker/libcontainer, etc.) data and leaking its name (or
      leaking the hierarchy) reveals too much information about the host
      system.
  (2) It makes the container migration across machines (CRIU) more
      difficult as the container names need to be unique across the
      machines in the migration domain.
  (3) It makes it difficult to run container management tools (like
      docker/libcontainer, lmctfy, etc.) within virtual containers
      without adding dependency on some state/agent present outside the
      container.

  Note that the feature proposed here is completely different than the
  “ns cgroup” feature which existed in the linux kernel until recently.
  The ns cgroup also attempted to connect cgroups and namespaces by
  creating a new cgroup every time a new namespace was created. It did
  not solve any of the above mentioned problems and was later dropped
  from the kernel. Incidentally though, it used the same config option
  name CONFIG_CGROUP_NS as used in my prototype!

Introducing CGroup Namespaces
  With unified cgroup hierarchy
  (Documentation/cgroups/unified-hierarchy.txt), the containers can now
  have a much more coherent cgroup view and it's easy to associate a
  container with a single cgroup. This also allows us to virtualize the
  cgroup view for tasks inside the container.

  The new CGroup Namespace allows a process to “unshare” its cgroup
  hierarchy starting from the cgroup it's currently in.
  For example:
  $ cat /proc/self/cgroup
  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
  $ ls -l /proc/self/ns/cgroup
  lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
  $ ~/unshare -c  # calls unshare(CLONE_NEWCGROUP) and exec’s /bin/bash
  [ns]$ ls -l /proc/self/ns/cgroup
  lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
  # From within new cgroupns, process sees that it's in the root cgroup
  [ns]$ cat /proc/self/cgroup
  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/

  # From global cgroupns:
  $ cat /proc/<pid>/cgroup
  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
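
  The namespace identity printed by 'ls -l /proc/self/ns/cgroup' above can
  also be read from a program. The following is a minimal userspace sketch,
  assuming the link target has the "cgroup:[<inum>]" form shown in the
  example. The helper names read_cgroup_ns_link() and parse_ns_inum() are
  hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read the target of /proc/self/ns/cgroup (e.g. "cgroup:[4026531835]")
 * into buf. Returns 0 on success, -1 if the link cannot be read (for
 * instance on a kernel without cgroup namespaces) or buf is too small.
 */
int read_cgroup_ns_link(char *buf, size_t buflen)
{
	ssize_t n;

	assert(buf && buflen > 0);

	n = readlink("/proc/self/ns/cgroup", buf, buflen - 1);
	if (n < 0)
		return -1;
	/* readlink() does not NUL-terminate; a full buffer may be truncated. */
	if ((size_t)n == buflen - 1)
		return -1;
	buf[n] = '\0';
	return 0;
}

/*
 * Extract the inode number from a "cgroup:[<inum>]" link target.
 * Returns 0 on success, -1 if the string has a different form.
 */
int parse_ns_inum(const char *link, unsigned long *inum)
{
	assert(link && inum);
	return sscanf(link, "cgroup:[%lu]", inum) == 1 ? 0 : -1;
}
```

  Two processes that report the same inode number share a cgroupns; after
  unshare(CLONE_NEWCGROUP) the number changes, as in the session above.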

  # Unshare cgroupns along with userns and mountns
  # Following calls unshare(CLONE_NEWCGROUP|CLONE_NEWUSER|CLONE_NEWNS), then
  # sets up uid/gid map and exec’s /bin/bash
  $ ~/unshare -c -u -m

  # Originally, we were in /batchjobs/c_job_id1 cgroup. Mount our own cgroup
  # hierarchy.
  [ns]$ mount -t cgroup cgroup /tmp/cgroup
  [ns]$ ls -l /tmp/cgroup
  total 0
  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.controllers
  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.populated
  -rw-r--r-- 1 root root 0 2014-10-13 09:25 cgroup.procs
  -rw-r--r-- 1 root root 0 2014-10-13 09:32 cgroup.subtree_control

  The cgroupns-root (/batchjobs/c_job_id1 in above example) becomes the
  filesystem root for the namespace specific cgroupfs mount.

  The virtualization of /proc/self/cgroup file combined with restricting
  the view of cgroup hierarchy by namespace-private cgroupfs mount
  should provide a completely isolated cgroup view inside the container.

  In its current form, the cgroup namespaces patchset provides the following
  behavior:

  (1) The “root” cgroup for a cgroup namespace is the cgroup in which
      the process calling unshare is running.
      For example, if a process in /batchjobs/c_job_id1 cgroup calls unshare,
      cgroup /batchjobs/c_job_id1 becomes the cgroupns-root.
      For the init_cgroup_ns, this is the real root (“/”) cgroup
      (identified in code as cgrp_dfl_root.cgrp).

  (2) The cgroupns-root cgroup does not change even if the namespace
      creator process later moves to a different cgroup.
      $ ~/unshare -c # unshare cgroupns in some cgroup
      [ns]$ cat /proc/self/cgroup
      0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
      [ns]$ mkdir sub_cgrp_1
      [ns]$ echo 0 > sub_cgrp_1/cgroup.procs
      [ns]$ cat /proc/self/cgroup
      0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1

  (3) Each process gets its CGROUPNS specific view of /proc/<pid>/cgroup
  (a) Processes running inside the cgroup namespace will be able to see
      cgroup paths (in /proc/self/cgroup) only inside their root cgroup
      [ns]$ sleep 100000 &  # From within unshared cgroupns
      [1] 7353
      [ns]$ echo 7353 > sub_cgrp_1/cgroup.procs
      [ns]$ cat /proc/7353/cgroup
      0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1

  (b) From global cgroupns, the real cgroup path will be visible:
      $ cat /proc/7353/cgroup
      0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1

  (c) From a sibling cgroupns (cgroupns root-ed at a different cgroup), cgroup
      path relative to its own cgroupns-root will be shown:
      # ns2's cgroupns-root is at '/batchjobs/c_job_id2'
      [ns2]$ cat /proc/7353/cgroup
      0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/../c_job_id1/sub_cgrp_1
      [ns2]$
      Note that the relative path always starts with '/' to indicate that it's
      relative to the cgroupns-root of the caller.

  (4) Processes inside a cgroupns can move in-and-out of the cgroupns-root
      (if they have proper access to external cgroups).
      # From inside cgroupns (with cgroupns-root at /batchjobs/c_job_id1), and
      # assuming that the global hierarchy is still accessible inside cgroupns:
      $ cat /proc/7353/cgroup
      0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
      $ echo 7353 > batchjobs/c_job_id2/cgroup.procs
      $ cat /proc/7353/cgroup
      0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/../c_job_id2

      Note that this kind of setup is not encouraged. A task inside cgroupns
      should only be exposed to its own cgroupns hierarchy. Otherwise it makes
      the virtualization of /proc/<pid>/cgroup less useful.

  (5) Setns to another cgroup namespace is allowed when:
      (a) the process has CAP_SYS_ADMIN in its current userns
      (b) the process has CAP_SYS_ADMIN in the target cgroupns' userns
      No implicit cgroup changes happen when attaching to another cgroupns. It
      is expected that someone moves the attaching process under the target
      cgroupns-root.

  (6) When some thread from a multi-threaded process unshares its
      cgroup-namespace, the new cgroupns gets applied to the entire
      process (all the threads). This should be OK since
      unified-hierarchy only allows process-level containerization. So
      all the threads in the process will have the same cgroup. And both
      - changing cgroups and unsharing namespaces - are protected under
      threadgroup_lock(task).

  (7) The cgroup namespace is alive as long as there is at least one
      process inside it. When the last process exits, the cgroup
      namespace is destroyed. The cgroupns-root and the actual cgroups
      remain though.

  (8) 'mount -t cgroup cgroup <mntpt>' when called from within cgroupns mounts
      the unified cgroup hierarchy with cgroupns-root as the filesystem root.
      The process needs CAP_SYS_ADMIN in its userns and mntns.
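
  The relative paths described in (3) and (4) can be reproduced with a
  small userspace sketch. This is not the kernfs helper added in patch 1/7;
  the function name cgroup_path_from_ns_root() and its signature are made
  up for illustration, and both inputs are assumed to be absolute,
  normalized paths (no trailing '/', no "." or ".." components).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Length of the longest common prefix of a and b that ends on a path
 * component boundary.
 */
static size_t common_prefix_len(const char *a, const char *b)
{
	size_t i = 0, last = 0;

	while (a[i] && b[i] && a[i] == b[i]) {
		if (a[i] == '/')
			last = i;
		i++;
	}
	/* Both sides stopped at a component boundary: the whole run matches. */
	if ((a[i] == '\0' || a[i] == '/') && (b[i] == '\0' || b[i] == '/'))
		return i;
	return last;
}

/*
 * Write into buf the path of cgroup 'target' as seen by a reader whose
 * cgroupns-root is 'ns_root'. One "/.." is emitted for every component of
 * ns_root below the common ancestor, followed by the rest of target.
 * Returns 0 on success, -1 if buf is too small.
 */
int cgroup_path_from_ns_root(const char *ns_root, const char *target,
			     char *buf, size_t buflen)
{
	size_t n, used = 0;
	const char *p;
	int ret;

	assert(ns_root && target && buf);
	assert(ns_root[0] == '/' && target[0] == '/');

	if (buflen < 2)
		return -1;
	buf[0] = '\0';

	/* Treat the real root "/" as the empty prefix. */
	if (strcmp(ns_root, "/") == 0)
		ns_root = "";
	if (strcmp(target, "/") == 0)
		target = "";

	n = common_prefix_len(ns_root, target);

	for (p = ns_root + n; *p; p++) {
		if (*p != '/')
			continue;
		if (used + 3 >= buflen)
			return -1;
		memcpy(buf + used, "/..", 3);
		used += 3;
		buf[used] = '\0';
	}

	ret = snprintf(buf + used, buflen - used, "%s", target + n);
	if (ret < 0 || (size_t)ret >= buflen - used)
		return -1;
	used += (size_t)ret;

	/* The target is the cgroupns-root itself. */
	if (used == 0)
		strcpy(buf, "/");
	return 0;
}
```

  For a reader rooted at /batchjobs/c_job_id1, a target of
  /batchjobs/c_job_id2 yields "/../c_job_id2", matching the example in (4),
  while a target equal to the cgroupns-root yields "/".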

Implementation
  The current patch-set is based on top of Tejun Heo's cgroup tree (for-next
  branch). It's fairly non-intrusive and provides the above-mentioned
  features.

Possible extensions of CGROUPNS:
  (1) The Documentation/cgroups/unified-hierarchy.txt mentions use of
      capabilities to restrict cgroups to administrative users. CGroup
      namespaces could be of help here. With cgroup namespaces, it might
      be possible to delegate administration of sub-cgroups under a
      cgroupns-root to the cgroupns owner.

---

 fs/kernfs/dir.c                  | 194 ++++++++++++++++++++++++++++++++++-----
 fs/kernfs/mount.c                |  48 ++++++++++
 fs/proc/namespaces.c             |   1 +
 include/linux/cgroup.h           |  41 ++++++++-
 include/linux/cgroup_namespace.h |  36 ++++++++
 include/linux/kernfs.h           |   5 +
 include/linux/nsproxy.h          |   2 +
 include/linux/proc_ns.h          |   4 +
 include/uapi/linux/sched.h       |   3 +-
 kernel/Makefile                  |   2 +-
 kernel/cgroup.c                  | 108 +++++++++++++++++-----
 kernel/cgroup_namespace.c        | 148 +++++++++++++++++++++++++++++
 kernel/fork.c                    |   2 +-
 kernel/nsproxy.c                 |  19 +++-
 14 files changed, 561 insertions(+), 52 deletions(-)
 create mode 100644 include/linux/cgroup_namespace.h
 create mode 100644 kernel/cgroup_namespace.c

 [PATCHv2 1/7] kernfs: Add API to generate relative kernfs path
 [PATCHv2 2/7] sched: new clone flag CLONE_NEWCGROUP for cgroup
 [PATCHv2 3/7] cgroup: add function to get task's cgroup on default
 [PATCHv2 4/7] cgroup: export cgroup_get() and cgroup_put()
 [PATCHv2 5/7] cgroup: introduce cgroup namespaces
 [PATCHv2 6/7] cgroup: cgroup namespace setns support
 [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns

^ permalink raw reply	[flat|nested] 384+ messages in thread

* [PATCHv2 0/7] CGroup Namespaces
@ 2014-10-31 19:18   ` Aditya Kali
  0 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-10-31 19:18 UTC (permalink / raw)
  To: tj, lizefan, serge.hallyn, luto, ebiederm, cgroups, linux-kernel,
	linux-api, mingo
  Cc: containers, jnagal

Another attempt at the CGroup Namespace patch-set. This incorporates
suggestions on the previous patch-set.

Changes from V1:
1. No pinning of processes within cgroupns. Tasks can be freely moved
   across cgroups even outside of their cgroupns-root. Usual DAC/MAC policies
   apply as before.
2. Path in /proc/<pid>/cgroup is now always shown and is relative to
   cgroupns-root. So path can contain '/..' strings depending on cgroupns-root
   of the reader and cgroup of <pid>.
3. setns() does not require the process to first move under target
   cgroupns-root.

Changes from RFC (V0):
1. setns support for cgroupns
2. 'mount -t cgroup cgroup <mntpt>' from inside a cgroupns now
   mounts the cgroup hierarchy with cgroupns-root as the filesystem root.
3. writes to cgroup files outside of cgroupns-root are not allowed
4. visibility of /proc/<pid>/cgroup is further restricted by not showing
   anything if the <pid> is in a sibling cgroupns and its cgroup falls outside
   your cgroupns-root.

More details in the writeup below.

Background
  Cgroups and Namespaces are used together to create “virtual”
  containers that isolate the host environment from the processes
  running in the container. But since cgroups themselves are not
  “virtualized”, a task is always able to see the global cgroup view
  through the cgroupfs mount and via the /proc/self/cgroup file.

  $ cat /proc/self/cgroup
  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1

  This exposure of cgroup names to the processes running inside a
  container results in some problems:
  (1) The container names are typically host-container-management-agent
      (systemd, docker/libcontainer, etc.) data and leaking its name (or
      leaking the hierarchy) reveals too much information about the host
      system.
  (2) It makes the container migration across machines (CRIU) more
      difficult as the container names need to be unique across the
      machines in the migration domain.
  (3) It makes it difficult to run container management tools (like
      docker/libcontainer, lmctfy, etc.) within virtual containers
      without adding dependency on some state/agent present outside the
      container.

  Note that the feature proposed here is completely different than the
  “ns cgroup” feature which existed in the linux kernel until recently.
  The ns cgroup also attempted to connect cgroups and namespaces by
  creating a new cgroup every time a new namespace was created. It did
  not solve any of the above mentioned problems and was later dropped
  from the kernel. Incidentally though, it used the same config option
  name CONFIG_CGROUP_NS as used in my prototype!

Introducing CGroup Namespaces
  With unified cgroup hierarchy
  (Documentation/cgroups/unified-hierarchy.txt), the containers can now
  have a much more coherent cgroup view and it's easy to associate a
  container with a single cgroup. This also allows us to virtualize the
  cgroup view for tasks inside the container.

  The new CGroup Namespace allows a process to “unshare” its cgroup
  hierarchy starting from the cgroup it's currently in.
  For example:
  $ cat /proc/self/cgroup
  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
  $ ls -l /proc/self/ns/cgroup
  lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
  $ ~/unshare -c  # calls unshare(CLONE_NEWCGROUP) and exec’s /bin/bash
  [ns]$ ls -l /proc/self/ns/cgroup
  lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
  # From within new cgroupns, process sees that it's in the root cgroup
  [ns]$ cat /proc/self/cgroup
  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/

  # From global cgroupns:
  $ cat /proc/<pid>/cgroup
  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1

  # Unshare cgroupns along with userns and mountns
  # Following calls unshare(CLONE_NEWCGROUP|CLONE_NEWUSER|CLONE_NEWNS), then
  # sets up uid/gid map and exec’s /bin/bash
  $ ~/unshare -c -u -m

  # Originally, we were in /batchjobs/c_job_id1 cgroup. Mount our own cgroup
  # hierarchy.
  [ns]$ mount -t cgroup cgroup /tmp/cgroup
  [ns]$ ls -l /tmp/cgroup
  total 0
  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.controllers
  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.populated
  -rw-r--r-- 1 root root 0 2014-10-13 09:25 cgroup.procs
  -rw-r--r-- 1 root root 0 2014-10-13 09:32 cgroup.subtree_control

  The cgroupns-root (/batchjobs/c_job_id1 in above example) becomes the
  filesystem root for the namespace specific cgroupfs mount.

  The virtualization of /proc/self/cgroup file combined with restricting
  the view of cgroup hierarchy by namespace-private cgroupfs mount
  should provide a completely isolated cgroup view inside the container.

  In its current form, the cgroup namespaces patchset provides the following
  behavior:

  (1) The “root” cgroup for a cgroup namespace is the cgroup in which
      the process calling unshare is running.
      For example, if a process in the /batchjobs/c_job_id1 cgroup calls unshare,
      cgroup /batchjobs/c_job_id1 becomes the cgroupns-root.
      For the init_cgroup_ns, this is the real root (“/”) cgroup
      (identified in code as cgrp_dfl_root.cgrp).

  (2) The cgroupns-root cgroup does not change even if the namespace
      creator process later moves to a different cgroup.
      $ ~/unshare -c # unshare cgroupns in some cgroup
      [ns]$ cat /proc/self/cgroup
      0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
      [ns]$ mkdir sub_cgrp_1
      [ns]$ echo 0 > sub_cgrp_1/cgroup.procs
      [ns]$ cat /proc/self/cgroup
      0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1

  (3) Each process gets its CGROUPNS specific view of /proc/<pid>/cgroup
  (a) Processes running inside the cgroup namespace will be able to see
      cgroup paths (in /proc/self/cgroup) only inside their root cgroup
      [ns]$ sleep 100000 &  # From within unshared cgroupns
      [1] 7353
      [ns]$ echo 7353 > sub_cgrp_1/cgroup.procs
      [ns]$ cat /proc/7353/cgroup
      0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1

  (b) From global cgroupns, the real cgroup path will be visible:
      $ cat /proc/7353/cgroup
      0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1

  (c) From a sibling cgroupns (cgroupns root-ed at a different cgroup), cgroup
      path relative to its own cgroupns-root will be shown:
      # ns2's cgroupns-root is at '/batchjobs/c_job_id2'
      [ns2]$ cat /proc/7353/cgroup
      0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/../c_job_id1/sub_cgrp_1
      [ns2]$
      Note that the relative path always starts with '/' to indicate that it is
      relative to the cgroupns-root of the caller.

  (4) Processes inside a cgroupns can move in-and-out of the cgroupns-root
      (if they have proper access to external cgroups).
      # From inside cgroupns (with cgroupns-root at /batchjobs/c_job_id1), and
      # assuming that the global hierarchy is still accessible inside cgroupns:
      $ cat /proc/7353/cgroup
      0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
      $ echo 7353 > batchjobs/c_job_id2/cgroup.procs
      $ cat /proc/7353/cgroup
      0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/../c_job_id2

      Note that this kind of setup is not encouraged. A task inside cgroupns
      should only be exposed to its own cgroupns hierarchy. Otherwise it makes
      the virtualization of /proc/<pid>/cgroup less useful.

  (5) Setns to another cgroup namespace is allowed when:
      (a) the process has CAP_SYS_ADMIN in its current userns
      (b) the process has CAP_SYS_ADMIN in the target cgroupns' userns
      No implicit cgroup changes happen when attaching to another cgroupns. It
      is expected that someone moves the attaching process under the target
      cgroupns-root.

  (6) When some thread from a multi-threaded process unshares its
      cgroup-namespace, the new cgroupns gets applied to the entire
      process (all the threads). This should be OK since
      unified-hierarchy only allows process-level containerization. So
      all the threads in the process will have the same cgroup. And both
      - changing cgroups and unsharing namespaces - are protected under
      threadgroup_lock(task).

  (7) The cgroup namespace is alive as long as there is at least one
      process inside it. When the last process exits, the cgroup
      namespace is destroyed. The cgroupns-root and the actual cgroups
      remain though.

  (8) 'mount -t cgroup cgroup <mntpt>' when called from within cgroupns mounts
      the unified cgroup hierarchy with cgroupns-root as the filesystem root.
      The process needs CAP_SYS_ADMIN in its userns and mntns.
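
  Item (5) above can be sketched from userspace as follows. This is an
  illustration under assumptions, not code from the patchset: the helper
  name join_cgroupns_of() is hypothetical, the flag value is the one
  proposed in patch 2/7, and the /proc/<pid>/ns/cgroup path is taken
  from the examples earlier in this mail. As item (5) notes, attaching
  does not move the caller into the target cgroupns-root; that has to be
  done separately.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Flag value proposed in patch 2/7; libc headers may not define it. */
#ifndef CLONE_NEWCGROUP
#define CLONE_NEWCGROUP 0x02000000
#endif

/*
 * Attach the calling process to the cgroup namespace of @pid.
 * Returns 0 on success, -1 on failure with errno set by open() or
 * by the setns system call.
 */
int join_cgroupns_of(pid_t pid)
{
	char path[64];
	int fd, saved;
	long ret;

	snprintf(path, sizeof(path), "/proc/%ld/ns/cgroup", (long)pid);
	fd = open(path, O_RDONLY | O_CLOEXEC);
	if (fd < 0)
		return -1;

	/* setns(fd, nstype): nstype restricts fd to a cgroup namespace. */
	ret = syscall(SYS_setns, fd, CLONE_NEWCGROUP);

	/* close() may overwrite errno, so preserve the setns result. */
	saved = errno;
	close(fd);
	errno = saved;

	return ret == 0 ? 0 : -1;
}
```

  A caller lacking CAP_SYS_ADMIN in either user namespace named in item
  (5) should see this return -1.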

Implementation
  The current patch-set is based on top of Tejun Heo's cgroup tree (for-next
  branch). It is fairly non-intrusive and provides the above-mentioned
  features.

Possible extensions of CGROUPNS:
  (1) The Documentation/cgroups/unified-hierarchy.txt mentions use of
      capabilities to restrict cgroups to administrative users. CGroup
      namespaces could be of help here. With cgroup namespaces, it might
      be possible to delegate administration of sub-cgroups under a
      cgroupns-root to the cgroupns owner.

---

 fs/kernfs/dir.c                  | 194 ++++++++++++++++++++++++++++++++++-----
 fs/kernfs/mount.c                |  48 ++++++++++
 fs/proc/namespaces.c             |   1 +
 include/linux/cgroup.h           |  41 ++++++++-
 include/linux/cgroup_namespace.h |  36 ++++++++
 include/linux/kernfs.h           |   5 +
 include/linux/nsproxy.h          |   2 +
 include/linux/proc_ns.h          |   4 +
 include/uapi/linux/sched.h       |   3 +-
 kernel/Makefile                  |   2 +-
 kernel/cgroup.c                  | 108 +++++++++++++++++-----
 kernel/cgroup_namespace.c        | 148 +++++++++++++++++++++++++++++
 kernel/fork.c                    |   2 +-
 kernel/nsproxy.c                 |  19 +++-
 14 files changed, 561 insertions(+), 52 deletions(-)
 create mode 100644 include/linux/cgroup_namespace.h
 create mode 100644 kernel/cgroup_namespace.c

 [PATCHv2 1/7] kernfs: Add API to generate relative kernfs path
 [PATCHv2 2/7] sched: new clone flag CLONE_NEWCGROUP for cgroup namespace
 [PATCHv2 3/7] cgroup: add function to get task's cgroup on default hierarchy
 [PATCHv2 4/7] cgroup: export cgroup_get() and cgroup_put()
 [PATCHv2 5/7] cgroup: introduce cgroup namespaces
 [PATCHv2 6/7] cgroup: cgroup namespace setns support
 [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns

^ permalink raw reply	[flat|nested] 384+ messages in thread

* [PATCHv2 1/7] kernfs: Add API to generate relative kernfs path
  2014-10-31 19:18   ` Aditya Kali
@ 2014-10-31 19:18       ` Aditya Kali
  -1 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-10-31 19:18 UTC (permalink / raw)
  To: tj-DgEjT+Ai2ygdnm+yROfE0A, lizefan-hv44wF8Li93QT0dZR+AlfA,
	serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA, luto-kltTT9wpgjJwATOyAt5JVQ,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, mingo-H+wXaHxf7aLQT0dZR+AlfA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

The new function kernfs_path_from_node() generates and returns the
kernfs path of a given kernfs_node relative to a given parent
kernfs_node.

Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
 fs/kernfs/dir.c        | 194 +++++++++++++++++++++++++++++++++++++++++++------
 include/linux/kernfs.h |   3 +
 2 files changed, 176 insertions(+), 21 deletions(-)
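
The relative-path rule this patch implements can be modelled in
userspace over plain path strings. The sketch below is only an
illustration: the patch itself walks kernfs_node parent pointers, and
the name relpath_from() and its int return convention are hypothetical.
It assumes both inputs are normalized absolute paths ("/" or "/a/b",
no trailing slash). The output format follows the kernel-doc examples
in the diff: a leading '/', one "/.." per level up to the common
ancestor, then the remaining components of the target.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Write the path of @to relative to @from into @buf.
 * Returns 0 on success; on a too-small buffer, nul-terminates @buf
 * and returns -1.
 */
int relpath_from(const char *from, const char *to, char *buf, size_t buflen)
{
	size_t i = 0, common = 0, ups = 0, need;
	const char *s, *rest;
	char *p = buf;

	if (buflen == 0)
		return -1;

	/* Treat the root "/" as an empty list of components. */
	if (strcmp(from, "/") == 0)
		from = "";
	if (strcmp(to, "/") == 0)
		to = "";

	/* Find the longest common prefix that ends on a component boundary. */
	for (;;) {
		bool from_end = from[i] == '\0' || from[i] == '/';
		bool to_end = to[i] == '\0' || to[i] == '/';

		if (from_end && to_end)
			common = i;
		if (from[i] == '\0' || to[i] == '\0' || from[i] != to[i])
			break;
		i++;
	}

	/* Each component left in @from costs one "/..". */
	for (s = from + common; *s; s++)
		if (*s == '/')
			ups++;
	rest = to + common;

	need = 3 * ups + strlen(rest);
	if (need == 0)
		need = 1;	/* identical nodes are reported as "/" */
	if (need + 1 > buflen) {
		buf[0] = '\0';
		return -1;
	}

	if (ups == 0 && *rest == '\0') {
		strcpy(buf, "/");
		return 0;
	}
	while (ups--) {
		memcpy(p, "/..", 3);
		p += 3;
	}
	strcpy(p, rest);
	return 0;
}
```

With from = "/batchjobs/c_job_id1" and to = "/batchjobs/c_job_id2" this
yields "/../c_job_id2", the form shown in item (4) of the cover letter.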

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 1c77193..e49c365 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -44,28 +44,158 @@ static int kernfs_name_locked(struct kernfs_node *kn, char *buf, size_t buflen)
 	return strlcpy(buf, kn->parent ? kn->name : "/", buflen);
 }
 
-static char * __must_check kernfs_path_locked(struct kernfs_node *kn, char *buf,
-					      size_t buflen)
+/**
+ * kernfs_node_depth - compute depth of the kernfs node from root.
+ * The root node itself is considered to be at depth 0.
+ */
+static size_t kernfs_node_depth(struct kernfs_node *kn)
 {
-	char *p = buf + buflen;
+	size_t depth = 0;
+
+	BUG_ON(!kn);
+	while (kn->parent) {
+		depth++;
+		kn = kn->parent;
+	}
+	return depth;
+}
+
+/**
+ * kernfs_path_from_node_locked - find a relative path from @kn_from to @kn_to
+ * @kn_from: reference node of the path
+ * @kn_to: kernfs node to which path is needed
+ * @buf: buffer to copy the path into
+ * @buflen: size of @buf
+ *
+ * We need to handle couple of scenarios here:
+ * [1] when @kn_from is an ancestor of @kn_to at some level
+ * kn_from: /n1/n2/n3
+ * kn_to:   /n1/n2/n3/n4/n5
+ * result:  /n4/n5
+ *
+ * [2] when @kn_from is on a different hierarchy and we need to find common
+ * ancestor between @kn_from and @kn_to.
+ * kn_from: /n1/n2/n3/n4
+ * kn_to:   /n1/n2/n5
+ * result:  /../../n5
+ * OR
+ * kn_from: /n1/n2/n3/n4/n5   [depth=5]
+ * kn_to:   /n1/n2/n3         [depth=3]
+ * result:  /../..
+ */
+static char * __must_check kernfs_path_from_node_locked(
+	struct kernfs_node *kn_from,
+	struct kernfs_node *kn_to,
+	char *buf,
+	size_t buflen)
+{
+	char *p = buf;
+	struct kernfs_node *kn;
+	size_t depth_from = 0, depth_to, d;
 	int len;
 
-	*--p = '\0';
+	/* We need at least 2 bytes to write "/\0". */
+	BUG_ON(buflen < 2);
 
-	do {
-		len = strlen(kn->name);
-		if (p - buf < len + 1) {
-			buf[0] = '\0';
-			p = NULL;
-			break;
+	if (kn_from == kn_to) {
+		*p = '/';
+		*(p + 1) = '\0';
+		return p;
+	}
+
+	/* We can find the relative path only if both the nodes belong to the
+	 * same kernfs root.
+	 */
+	if (kn_from) {
+		BUG_ON(kernfs_root(kn_from) != kernfs_root(kn_to));
+		depth_from = kernfs_node_depth(kn_from);
+	}
+
+	depth_to = kernfs_node_depth(kn_to);
+
+	/* We compose path from left to right. So first write out all possible
+	 * "/.." strings needed to reach from 'kn_from' to the common ancestor.
+	 */
+	if (kn_from) {
+		while (depth_from > depth_to) {
+			len = strlen("/..");
+			if ((buflen - (p - buf)) < len + 1) {
+				/* buffer not big enough. */
+				buf[0] = '\0';
+				return NULL;
+			}
+			memcpy(p, "/..", len);
+			p += len;
+			*p = '\0';
+			--depth_from;
+			kn_from = kn_from->parent;
 		}
+
+		d = depth_to;
+		kn = kn_to;
+		while (depth_from < d) {
+			kn = kn->parent;
+			d--;
+		}
+
+		/* Now we have 'depth_from == depth_to' at this point. Add more
+		 * "/.."s until we reach common ancestor. In the worst case,
+		 * root node will be the common ancestor.
+		 */
+		while (depth_from > 0) {
+			/* If we reached common ancestor, stop. */
+			if (kn_from == kn)
+				break;
+			len = strlen("/..");
+			if ((buflen - (p - buf)) < len + 1) {
+				/* buffer not big enough. */
+				buf[0] = '\0';
+				return NULL;
+			}
+			memcpy(p, "/..", len);
+			p += len;
+			*p = '\0';
+			--depth_from;
+			kn_from = kn_from->parent;
+			kn = kn->parent;
+		}
+	}
+
+	/* Figure out how many bytes we need to write the path.
+	 */
+	d = depth_to;
+	kn = kn_to;
+	len = 0;
+	while (depth_from < d) {
+		/* Account for "/<name>". */
+		len += strlen(kn->name) + 1;
+		kn = kn->parent;
+		--d;
+	}
+
+	if ((buflen - (p - buf)) < len + 1) {
+		/* buffer not big enough. */
+		buf[0] = '\0';
+		return NULL;
+	}
+
+	/* We have enough space. Move 'p' ahead by computed length and start
+	 * writing node names into buffer.
+	 */
+	p += len;
+	*p = '\0';
+	d = depth_to;
+	kn = kn_to;
+	while (d > depth_from) {
+		len = strlen(kn->name);
 		p -= len;
 		memcpy(p, kn->name, len);
 		*--p = '/';
 		kn = kn->parent;
-	} while (kn && kn->parent);
+		--d;
+	}
 
-	return p;
+	return buf;
 }
 
 /**
@@ -92,26 +222,48 @@ int kernfs_name(struct kernfs_node *kn, char *buf, size_t buflen)
 }
 
 /**
- * kernfs_path - build full path of a given node
+ * kernfs_path_from_node - build path of node @kn relative to @kn_root.
+ * @kn_root: parent kernfs_node relative to which we need to build the path
  * @kn: kernfs_node of interest
- * @buf: buffer to copy @kn's name into
+ * @buf: buffer to copy @kn's path into
  * @buflen: size of @buf
  *
- * Builds and returns the full path of @kn in @buf of @buflen bytes.  The
- * path is built from the end of @buf so the returned pointer usually
- * doesn't match @buf.  If @buf isn't long enough, @buf is nul terminated
+ * Builds and returns @kn's path relative to @kn_root. @kn_root and @kn must
+ * be on the same kernfs-root. If @kn_root is not parent of @kn, then a relative
+ * path (which includes '..'s) as needed to reach from @kn_root to @kn is
+ * returned.
+ * The path may be built from the end of @buf so the returned pointer may not
+ * match @buf.  If @buf isn't long enough, @buf is nul terminated
  * and %NULL is returned.
  */
-char *kernfs_path(struct kernfs_node *kn, char *buf, size_t buflen)
+char *kernfs_path_from_node(struct kernfs_node *kn_root, struct kernfs_node *kn,
+			    char *buf, size_t buflen)
 {
 	unsigned long flags;
 	char *p;
 
 	spin_lock_irqsave(&kernfs_rename_lock, flags);
-	p = kernfs_path_locked(kn, buf, buflen);
+	p = kernfs_path_from_node_locked(kn_root, kn, buf, buflen);
 	spin_unlock_irqrestore(&kernfs_rename_lock, flags);
 	return p;
 }
+EXPORT_SYMBOL_GPL(kernfs_path_from_node);
+
+/**
+ * kernfs_path - build full path of a given node
+ * @kn: kernfs_node of interest
+ * @buf: buffer to copy @kn's name into
+ * @buflen: size of @buf
+ *
+ * Builds and returns the full path of @kn in @buf of @buflen bytes.  The
+ * path is built from the end of @buf so the returned pointer usually
+ * doesn't match @buf.  If @buf isn't long enough, @buf is nul terminated
+ * and %NULL is returned.
+ */
+char *kernfs_path(struct kernfs_node *kn, char *buf, size_t buflen)
+{
+	return kernfs_path_from_node(NULL, kn, buf, buflen);
+}
 EXPORT_SYMBOL_GPL(kernfs_path);
 
 /**
@@ -145,8 +297,8 @@ void pr_cont_kernfs_path(struct kernfs_node *kn)
 
 	spin_lock_irqsave(&kernfs_rename_lock, flags);
 
-	p = kernfs_path_locked(kn, kernfs_pr_cont_buf,
-			       sizeof(kernfs_pr_cont_buf));
+	p = kernfs_path_from_node_locked(NULL, kn, kernfs_pr_cont_buf,
+					 sizeof(kernfs_pr_cont_buf));
 	if (p)
 		pr_cont("%s", p);
 	else
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 30faf79..3c2be75 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -258,6 +258,9 @@ static inline bool kernfs_ns_enabled(struct kernfs_node *kn)
 }
 
 int kernfs_name(struct kernfs_node *kn, char *buf, size_t buflen);
+char * __must_check kernfs_path_from_node(struct kernfs_node *root_kn,
+					  struct kernfs_node *kn, char *buf,
+					  size_t buflen);
 char * __must_check kernfs_path(struct kernfs_node *kn, char *buf,
 				size_t buflen);
 void pr_cont_kernfs_name(struct kernfs_node *kn);
-- 
2.1.0.rc2.206.gedb03e5

^ permalink raw reply related	[flat|nested] 384+ messages in thread

* [PATCHv2 2/7] sched: new clone flag CLONE_NEWCGROUP for cgroup namespace
  2014-10-31 19:18   ` Aditya Kali
@ 2014-10-31 19:18       ` Aditya Kali
  -1 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-10-31 19:18 UTC (permalink / raw)
  To: tj-DgEjT+Ai2ygdnm+yROfE0A, lizefan-hv44wF8Li93QT0dZR+AlfA,
	serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA, luto-kltTT9wpgjJwATOyAt5JVQ,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, mingo-H+wXaHxf7aLQT0dZR+AlfA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

CLONE_NEWCGROUP will be used to create a new cgroup namespace.

Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
 include/uapi/linux/sched.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)
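
Once the rest of the series wires the flag up, userspace could tell
whether a kernel exposes cgroup namespaces, and whether two tasks share
one, by reading the /proc/<pid>/ns/cgroup link shown in the cover
letter. A minimal sketch follows; the helper name cgroupns_id() is
hypothetical, and the "cgroup:[inode]" link format is assumed from the
cover letter examples.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Copy the target of /proc/<pid>/ns/cgroup (for example
 * "cgroup:[4026531835]") into @buf as a nul-terminated string.
 * Returns 0 on success, -1 if the link cannot be read, which is
 * what a kernel without cgroup namespace support would produce.
 */
int cgroupns_id(pid_t pid, char *buf, size_t buflen)
{
	char path[64];
	ssize_t n;

	if (buflen == 0)
		return -1;
	snprintf(path, sizeof(path), "/proc/%ld/ns/cgroup", (long)pid);

	/* readlink() does not nul-terminate, so reserve one byte. */
	n = readlink(path, buf, buflen - 1);
	if (n < 0)
		return -1;
	buf[n] = '\0';
	return 0;
}
```

Two tasks for which cgroupns_id() yields the same string are in the
same cgroup namespace.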

diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 34f9d73..2f90d00 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -21,8 +21,7 @@
 #define CLONE_DETACHED		0x00400000	/* Unused, ignored */
 #define CLONE_UNTRACED		0x00800000	/* set if the tracing process can't force CLONE_PTRACE on this clone */
 #define CLONE_CHILD_SETTID	0x01000000	/* set the TID in the child */
-/* 0x02000000 was previously the unused CLONE_STOPPED (Start in stopped state)
-   and is now available for re-use. */
+#define CLONE_NEWCGROUP		0x02000000	/* New cgroup namespace */
 #define CLONE_NEWUTS		0x04000000	/* New utsname group? */
 #define CLONE_NEWIPC		0x08000000	/* New ipcs */
 #define CLONE_NEWUSER		0x10000000	/* New user namespace */
-- 
2.1.0.rc2.206.gedb03e5

^ permalink raw reply related	[flat|nested] 384+ messages in thread

* [PATCHv2 3/7] cgroup: add function to get task's cgroup on default hierarchy
  2014-10-31 19:18   ` Aditya Kali
@ 2014-10-31 19:18       ` Aditya Kali
  -1 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-10-31 19:18 UTC (permalink / raw)
  To: tj-DgEjT+Ai2ygdnm+yROfE0A, lizefan-hv44wF8Li93QT0dZR+AlfA,
	serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA, luto-kltTT9wpgjJwATOyAt5JVQ,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, mingo-H+wXaHxf7aLQT0dZR+AlfA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

get_task_cgroup() returns the (reference counted) cgroup of the
given task on the default hierarchy.

Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
 include/linux/cgroup.h |  1 +
 kernel/cgroup.c        | 25 +++++++++++++++++++++++++
 2 files changed, 26 insertions(+)
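
get_task_cgroup() is kernel-internal. For comparison, userspace sees the
same piece of information as one line of /proc/<pid>/cgroup. The sketch
below parses such a line. It is an illustration only: the name
parse_cgroup_line() is hypothetical, and the assumption that the
default hierarchy is reported with hierarchy ID 0 is taken from the
examples in the cover letter.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Parse one line of /proc/<pid>/cgroup, which has the form
 * "ID:controllers:path". The line is expected without a trailing
 * newline. On success, stores the hierarchy ID in *id, points *path
 * at the path part inside @line, and returns 0. Returns -1 if the
 * line is malformed.
 */
int parse_cgroup_line(const char *line, int *id, const char **path)
{
	const char *second;
	char *end;
	long v;

	v = strtol(line, &end, 10);
	if (end == line || *end != ':' || v < 0)
		return -1;

	/* The controller list sits between the first and second ':'. */
	second = strchr(end + 1, ':');
	if (!second || second[1] != '/')
		return -1;

	*id = (int)v;
	*path = second + 1;
	return 0;
}
```

A caller interested in the default hierarchy would keep the line whose
ID is 0 and use the returned path.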

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 1d51968..80ed6e0 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -579,6 +579,7 @@ static inline void pr_cont_cgroup_path(struct cgroup *cgrp)
 }
 
 char *task_cgroup_path(struct task_struct *task, char *buf, size_t buflen);
+struct cgroup *get_task_cgroup(struct task_struct *task);
 
 int cgroup_add_dfl_cftypes(struct cgroup_subsys *ss, struct cftype *cfts);
 int cgroup_add_legacy_cftypes(struct cgroup_subsys *ss, struct cftype *cfts);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 136ecea..50fa8e3 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1917,6 +1917,31 @@ char *task_cgroup_path(struct task_struct *task, char *buf, size_t buflen)
 }
 EXPORT_SYMBOL_GPL(task_cgroup_path);
 
+/*
+ * get_task_cgroup - returns the cgroup of the task in the default cgroup
+ * hierarchy.
+ *
+ * @task: target task
+ * This function returns the @task's cgroup on the default cgroup hierarchy. The
+ * returned cgroup has its reference incremented (by calling cgroup_get()). So
+ * the caller must cgroup_put() the obtained reference once it is done with it.
+ */
+struct cgroup *get_task_cgroup(struct task_struct *task)
+{
+	struct cgroup *cgrp;
+
+	mutex_lock(&cgroup_mutex);
+	down_read(&css_set_rwsem);
+
+	cgrp = task_cgroup_from_root(task, &cgrp_dfl_root);
+	cgroup_get(cgrp);
+
+	up_read(&css_set_rwsem);
+	mutex_unlock(&cgroup_mutex);
+	return cgrp;
+}
+EXPORT_SYMBOL_GPL(get_task_cgroup);
+
 /* used to track tasks and other necessary states during migration */
 struct cgroup_taskset {
 	/* the src and dst cset list running through cset->mg_node */
-- 
2.1.0.rc2.206.gedb03e5

^ permalink raw reply related	[flat|nested] 384+ messages in thread

* [PATCHv2 4/7] cgroup: export cgroup_get() and cgroup_put()
  2014-10-31 19:18   ` Aditya Kali
@ 2014-10-31 19:18       ` Aditya Kali
  -1 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-10-31 19:18 UTC (permalink / raw)
  To: tj-DgEjT+Ai2ygdnm+yROfE0A, lizefan-hv44wF8Li93QT0dZR+AlfA,
	serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA, luto-kltTT9wpgjJwATOyAt5JVQ,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, mingo-H+wXaHxf7aLQT0dZR+AlfA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Move cgroup_get() and cgroup_put() into cgroup.h so that
they can be called from other places.

Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
 include/linux/cgroup.h | 22 ++++++++++++++++++++++
 kernel/cgroup.c        | 22 ----------------------
 2 files changed, 22 insertions(+), 22 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 80ed6e0..4a0eb2d 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -521,6 +521,28 @@ static inline bool cgroup_on_dfl(const struct cgroup *cgrp)
 	return cgrp->root == &cgrp_dfl_root;
 }
 
+/* convenient tests for these bits */
+static inline bool cgroup_is_dead(const struct cgroup *cgrp)
+{
+	return !(cgrp->self.flags & CSS_ONLINE);
+}
+
+static inline void cgroup_get(struct cgroup *cgrp)
+{
+	WARN_ON_ONCE(cgroup_is_dead(cgrp));
+	css_get(&cgrp->self);
+}
+
+static inline bool cgroup_tryget(struct cgroup *cgrp)
+{
+	return css_tryget(&cgrp->self);
+}
+
+static inline void cgroup_put(struct cgroup *cgrp)
+{
+	css_put(&cgrp->self);
+}
+
 /* no synchronization, the result can only be used as a hint */
 static inline bool cgroup_has_tasks(struct cgroup *cgrp)
 {
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 50fa8e3..9c622b9 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -284,12 +284,6 @@ static struct cgroup_subsys_state *cgroup_e_css(struct cgroup *cgrp,
 	return cgroup_css(cgrp, ss);
 }
 
-/* convenient tests for these bits */
-static inline bool cgroup_is_dead(const struct cgroup *cgrp)
-{
-	return !(cgrp->self.flags & CSS_ONLINE);
-}
-
 struct cgroup_subsys_state *of_css(struct kernfs_open_file *of)
 {
 	struct cgroup *cgrp = of->kn->parent->priv;
@@ -1002,22 +996,6 @@ static umode_t cgroup_file_mode(const struct cftype *cft)
 	return mode;
 }
 
-static void cgroup_get(struct cgroup *cgrp)
-{
-	WARN_ON_ONCE(cgroup_is_dead(cgrp));
-	css_get(&cgrp->self);
-}
-
-static bool cgroup_tryget(struct cgroup *cgrp)
-{
-	return css_tryget(&cgrp->self);
-}
-
-static void cgroup_put(struct cgroup *cgrp)
-{
-	css_put(&cgrp->self);
-}
-
 /**
  * cgroup_refresh_child_subsys_mask - update child_subsys_mask
  * @cgrp: the target cgroup
-- 
2.1.0.rc2.206.gedb03e5


* [PATCHv2 5/7] cgroup: introduce cgroup namespaces
  2014-10-31 19:18   ` Aditya Kali
@ 2014-10-31 19:18       ` Aditya Kali
  -1 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-10-31 19:18 UTC (permalink / raw)
  To: tj, lizefan, serge.hallyn, luto, ebiederm, cgroups, linux-kernel,
	linux-api, mingo
  Cc: containers, jnagal, Aditya Kali

Introduce the ability to create a new cgroup namespace. The newly created
cgroup namespace remembers the cgroup of the process at the point
of creation of the cgroup namespace (referred to as cgroupns-root).
The main purpose of a cgroup namespace is to virtualize the contents
of the /proc/self/cgroup file. Processes inside a cgroup namespace
are only able to see paths relative to their namespace root
(unless they are moved outside of their cgroupns-root, at which point
they will see a relative path from their cgroupns-root).
For a correctly set up container this enables container tools
(like libcontainer, lxc, lmctfy, etc.) to create completely virtualized
containers without leaking the system-level cgroup hierarchy to the task.
This patch only implements the 'unshare' part of the cgroupns.
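
To make "paths relative to their namespace root" concrete, here is a small
userspace sketch of the path arithmetic. It is only an illustration: the
patch itself delegates to kernfs_path_from_node(), and rendering a cgroup
outside the root by climbing with ".." components is an assumption of this
sketch. ns_relative_path() and append() are hypothetical names.

```c
#include <stddef.h>
#include <string.h>

/* Append 's' to 'buf' at offset '*len'; returns -1 if it does not fit. */
static int append(char *buf, size_t buflen, size_t *len, const char *s)
{
	size_t n = strlen(s);

	if (*len + n + 1 > buflen)
		return -1;
	memcpy(buf + *len, s, n + 1);
	*len += n;
	return 0;
}

/*
 * Write 'path' as seen from 'root' into 'buf'. Both are absolute,
 * '/'-separated paths without a trailing slash ("/" is the top).
 * Returns 0 on success, -1 if 'buf' is too small.
 */
int ns_relative_path(const char *root, const char *path,
		     char *buf, size_t buflen)
{
	size_t i = 0, common = 0, len = 0;
	const char *c;

	/* Treat the top-level "/" as the empty prefix. */
	if (strcmp(root, "/") == 0)
		root = "";
	if (strcmp(path, "/") == 0)
		path = "";

	/* Find the longest shared prefix that ends on a component boundary. */
	for (;;) {
		int root_end = (root[i] == '\0' || root[i] == '/');
		int path_end = (path[i] == '\0' || path[i] == '/');

		if (root_end && path_end)
			common = i;
		if (root[i] == '\0' || root[i] != path[i])
			break;
		i++;
	}

	if (buflen == 0)
		return -1;
	buf[0] = '\0';

	/* One ".." for every root component that 'path' does not share. */
	for (c = root + common; *c; c++)
		if (*c == '/' && append(buf, buflen, &len, "/..") < 0)
			return -1;

	/* Then the part of 'path' below the shared prefix. */
	if (append(buf, buflen, &len, path + common) < 0)
		return -1;

	/* 'path' is the namespace root itself. */
	if (len == 0 && append(buf, buflen, &len, "/") < 0)
		return -1;

	return 0;
}
```

With a cgroupns-root of /batchjobs/c_job_id1, this sketch reports the root
itself as "/" and /batchjobs/c_job_id1/sub as "/sub", so the host-side
prefix is not visible from inside the namespace.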

Signed-off-by: Aditya Kali <adityakali@google.com>
---
 fs/proc/namespaces.c             |   1 +
 include/linux/cgroup.h           |  18 +++++-
 include/linux/cgroup_namespace.h |  36 +++++++++++
 include/linux/nsproxy.h          |   2 +
 include/linux/proc_ns.h          |   4 ++
 kernel/Makefile                  |   2 +-
 kernel/cgroup.c                  |  14 ++++
 kernel/cgroup_namespace.c        | 134 +++++++++++++++++++++++++++++++++++++++
 kernel/fork.c                    |   2 +-
 kernel/nsproxy.c                 |  19 +++++-
 10 files changed, 227 insertions(+), 5 deletions(-)
 create mode 100644 include/linux/cgroup_namespace.h
 create mode 100644 kernel/cgroup_namespace.c

diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index 8902609..55bc5da 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -32,6 +32,7 @@ static const struct proc_ns_operations *ns_entries[] = {
 	&userns_operations,
 #endif
 	&mntns_operations,
+	&cgroupns_operations,
 };
 
 static const struct file_operations ns_file_operations = {
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 4a0eb2d..aa86495 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -22,6 +22,8 @@
 #include <linux/seq_file.h>
 #include <linux/kernfs.h>
 #include <linux/wait.h>
+#include <linux/nsproxy.h>
+#include <linux/types.h>
 
 #ifdef CONFIG_CGROUPS
 
@@ -460,6 +462,13 @@ struct cftype {
 #endif
 };
 
+struct cgroup_namespace {
+	atomic_t		count;
+	unsigned int		proc_inum;
+	struct user_namespace	*user_ns;
+	struct cgroup		*root_cgrp;
+};
+
 extern struct cgroup_root cgrp_dfl_root;
 extern struct css_set init_css_set;
 
@@ -584,10 +593,17 @@ static inline int cgroup_name(struct cgroup *cgrp, char *buf, size_t buflen)
 	return kernfs_name(cgrp->kn, buf, buflen);
 }
 
+static inline char * __must_check cgroup_path_ns(struct cgroup_namespace *ns,
+						 struct cgroup *cgrp, char *buf,
+						 size_t buflen)
+{
+	return kernfs_path_from_node(ns->root_cgrp->kn, cgrp->kn, buf, buflen);
+}
+
 static inline char * __must_check cgroup_path(struct cgroup *cgrp, char *buf,
 					      size_t buflen)
 {
-	return kernfs_path(cgrp->kn, buf, buflen);
+	return cgroup_path_ns(current->nsproxy->cgroup_ns, cgrp, buf, buflen);
 }
 
 static inline void pr_cont_cgroup_name(struct cgroup *cgrp)
diff --git a/include/linux/cgroup_namespace.h b/include/linux/cgroup_namespace.h
new file mode 100644
index 0000000..0b97b8d
--- /dev/null
+++ b/include/linux/cgroup_namespace.h
@@ -0,0 +1,36 @@
+#ifndef _LINUX_CGROUP_NAMESPACE_H
+#define _LINUX_CGROUP_NAMESPACE_H
+
+#include <linux/nsproxy.h>
+#include <linux/cgroup.h>
+#include <linux/types.h>
+#include <linux/user_namespace.h>
+
+extern struct cgroup_namespace init_cgroup_ns;
+
+static inline struct cgroup *current_cgroupns_root(void)
+{
+	return current->nsproxy->cgroup_ns->root_cgrp;
+}
+
+extern void free_cgroup_ns(struct cgroup_namespace *ns);
+
+static inline struct cgroup_namespace *get_cgroup_ns(
+		struct cgroup_namespace *ns)
+{
+	if (ns)
+		atomic_inc(&ns->count);
+	return ns;
+}
+
+static inline void put_cgroup_ns(struct cgroup_namespace *ns)
+{
+	if (ns && atomic_dec_and_test(&ns->count))
+		free_cgroup_ns(ns);
+}
+
+extern struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
+					       struct user_namespace *user_ns,
+					       struct cgroup_namespace *old_ns);
+
+#endif  /* _LINUX_CGROUP_NAMESPACE_H */
diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
index 35fa08f..ac0d65b 100644
--- a/include/linux/nsproxy.h
+++ b/include/linux/nsproxy.h
@@ -8,6 +8,7 @@ struct mnt_namespace;
 struct uts_namespace;
 struct ipc_namespace;
 struct pid_namespace;
+struct cgroup_namespace;
 struct fs_struct;
 
 /*
@@ -33,6 +34,7 @@ struct nsproxy {
 	struct mnt_namespace *mnt_ns;
 	struct pid_namespace *pid_ns_for_children;
 	struct net 	     *net_ns;
+	struct cgroup_namespace *cgroup_ns;
 };
 extern struct nsproxy init_nsproxy;
 
diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
index 34a1e10..e56dd73 100644
--- a/include/linux/proc_ns.h
+++ b/include/linux/proc_ns.h
@@ -6,6 +6,8 @@
 
 struct pid_namespace;
 struct nsproxy;
+struct task_struct;
+struct inode;
 
 struct proc_ns_operations {
 	const char *name;
@@ -27,6 +29,7 @@ extern const struct proc_ns_operations ipcns_operations;
 extern const struct proc_ns_operations pidns_operations;
 extern const struct proc_ns_operations userns_operations;
 extern const struct proc_ns_operations mntns_operations;
+extern const struct proc_ns_operations cgroupns_operations;
 
 /*
  * We always define these enumerators
@@ -37,6 +40,7 @@ enum {
 	PROC_UTS_INIT_INO	= 0xEFFFFFFEU,
 	PROC_USER_INIT_INO	= 0xEFFFFFFDU,
 	PROC_PID_INIT_INO	= 0xEFFFFFFCU,
+	PROC_CGROUP_INIT_INO	= 0xEFFFFFFBU,
 };
 
 #ifdef CONFIG_PROC_FS
diff --git a/kernel/Makefile b/kernel/Makefile
index dc5c775..d9731e2 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -50,7 +50,7 @@ obj-$(CONFIG_BSD_PROCESS_ACCT) += acct.o
 obj-$(CONFIG_KEXEC) += kexec.o
 obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
 obj-$(CONFIG_COMPAT) += compat.o
-obj-$(CONFIG_CGROUPS) += cgroup.o
+obj-$(CONFIG_CGROUPS) += cgroup.o cgroup_namespace.o
 obj-$(CONFIG_CGROUP_FREEZER) += cgroup_freezer.o
 obj-$(CONFIG_CPUSETS) += cpuset.o
 obj-$(CONFIG_UTS_NS) += utsname.o
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 9c622b9..7e5d597 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -57,6 +57,8 @@
 #include <linux/vmalloc.h> /* TODO: replace with more sophisticated array */
 #include <linux/kthread.h>
 #include <linux/delay.h>
+#include <linux/proc_ns.h>
+#include <linux/cgroup_namespace.h>
 
 #include <linux/atomic.h>
 
@@ -195,6 +197,15 @@ static void kill_css(struct cgroup_subsys_state *css);
 static int cgroup_addrm_files(struct cgroup *cgrp, struct cftype cfts[],
 			      bool is_add);
 
+struct cgroup_namespace init_cgroup_ns = {
+	.count = {
+		.counter = 1,
+	},
+	.proc_inum = PROC_CGROUP_INIT_INO,
+	.user_ns = &init_user_ns,
+	.root_cgrp = &cgrp_dfl_root.cgrp,
+};
+
 /* IDR wrappers which synchronize using cgroup_idr_lock */
 static int cgroup_idr_alloc(struct idr *idr, void *ptr, int start, int end,
 			    gfp_t gfp_mask)
@@ -4550,6 +4561,7 @@ static int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
 	parent = cgroup_kn_lock_live(parent_kn);
 	if (!parent)
 		return -ENODEV;
+
 	root = parent->root;
 
 	/* allocate the cgroup and its ID, 0 is reserved for the root */
@@ -4922,6 +4934,8 @@ int __init cgroup_init(void)
 	unsigned long key;
 	int ssid, err;
 
+	get_user_ns(init_cgroup_ns.user_ns);
+
 	BUG_ON(cgroup_init_cftypes(NULL, cgroup_dfl_base_files));
 	BUG_ON(cgroup_init_cftypes(NULL, cgroup_legacy_base_files));
 
diff --git a/kernel/cgroup_namespace.c b/kernel/cgroup_namespace.c
new file mode 100644
index 0000000..7e9bda0
--- /dev/null
+++ b/kernel/cgroup_namespace.c
@@ -0,0 +1,134 @@
+/*
+ *  Copyright (C) 2014 Google Inc.
+ *
+ *  Author: Aditya Kali (adityakali@google.com)
+ *
+ *  This program is free software; you can redistribute it and/or modify it
+ *  under the terms of the GNU General Public License as published by the Free
+ *  Software Foundation, version 2 of the License.
+ */
+
+#include <linux/cgroup.h>
+#include <linux/cgroup_namespace.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/nsproxy.h>
+#include <linux/proc_ns.h>
+
+static struct cgroup_namespace *alloc_cgroup_ns(void)
+{
+	struct cgroup_namespace *new_ns;
+
+	new_ns = kzalloc(sizeof(struct cgroup_namespace), GFP_KERNEL);
+	if (new_ns)
+		atomic_set(&new_ns->count, 1);
+	return new_ns;
+}
+
+void free_cgroup_ns(struct cgroup_namespace *ns)
+{
+	cgroup_put(ns->root_cgrp);
+	put_user_ns(ns->user_ns);
+	proc_free_inum(ns->proc_inum);
+	kfree(ns);
+}
+EXPORT_SYMBOL(free_cgroup_ns);
+
+struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
+					struct user_namespace *user_ns,
+					struct cgroup_namespace *old_ns)
+{
+	struct cgroup_namespace *new_ns = NULL;
+	struct cgroup *cgrp = NULL;
+	int err;
+
+	BUG_ON(!old_ns);
+
+	if (!(flags & CLONE_NEWCGROUP))
+		return get_cgroup_ns(old_ns);
+
+	/* Allow only sysadmin to create cgroup namespace. */
+	err = -EPERM;
+	if (!ns_capable(user_ns, CAP_SYS_ADMIN))
+		goto err_out;
+
+	/* Prevent cgroup changes for this task. */
+	threadgroup_lock(current);
+
+	/* CGROUPNS only virtualizes the cgroup path on the unified hierarchy.
+	 */
+	cgrp = get_task_cgroup(current);
+
+	err = -ENOMEM;
+	new_ns = alloc_cgroup_ns();
+	if (!new_ns)
+		goto err_out_unlock;
+
+	err = proc_alloc_inum(&new_ns->proc_inum);
+	if (err)
+		goto err_out_unlock;
+
+	new_ns->user_ns = get_user_ns(user_ns);
+	new_ns->root_cgrp = cgrp;
+
+	threadgroup_unlock(current);
+
+	return new_ns;
+
+err_out_unlock:
+	threadgroup_unlock(current);
+err_out:
+	if (cgrp)
+		cgroup_put(cgrp);
+	kfree(new_ns);
+	return ERR_PTR(err);
+}
+
+static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
+{
+	pr_info("setns not supported for cgroup namespace");
+	return -EINVAL;
+}
+
+static void *cgroupns_get(struct task_struct *task)
+{
+	struct cgroup_namespace *ns = NULL;
+	struct nsproxy *nsproxy;
+
+	rcu_read_lock();
+	nsproxy = task->nsproxy;
+	if (nsproxy) {
+		ns = nsproxy->cgroup_ns;
+		get_cgroup_ns(ns);
+	}
+	rcu_read_unlock();
+
+	return ns;
+}
+
+static void cgroupns_put(void *ns)
+{
+	put_cgroup_ns(ns);
+}
+
+static unsigned int cgroupns_inum(void *ns)
+{
+	struct cgroup_namespace *cgroup_ns = ns;
+
+	return cgroup_ns->proc_inum;
+}
+
+const struct proc_ns_operations cgroupns_operations = {
+	.name		= "cgroup",
+	.type		= CLONE_NEWCGROUP,
+	.get		= cgroupns_get,
+	.put		= cgroupns_put,
+	.install	= cgroupns_install,
+	.inum		= cgroupns_inum,
+};
+
+static __init int cgroup_namespaces_init(void)
+{
+	return 0;
+}
+subsys_initcall(cgroup_namespaces_init);
diff --git a/kernel/fork.c b/kernel/fork.c
index 9b7d746..d22d793 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1797,7 +1797,7 @@ static int check_unshare_flags(unsigned long unshare_flags)
 	if (unshare_flags & ~(CLONE_THREAD|CLONE_FS|CLONE_NEWNS|CLONE_SIGHAND|
 				CLONE_VM|CLONE_FILES|CLONE_SYSVSEM|
 				CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWNET|
-				CLONE_NEWUSER|CLONE_NEWPID))
+				CLONE_NEWUSER|CLONE_NEWPID|CLONE_NEWCGROUP))
 		return -EINVAL;
 	/*
 	 * Not implemented, but pretend it works if there is nothing to
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index ef42d0a..a8b1970 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -25,6 +25,7 @@
 #include <linux/proc_ns.h>
 #include <linux/file.h>
 #include <linux/syscalls.h>
+#include <linux/cgroup_namespace.h>
 
 static struct kmem_cache *nsproxy_cachep;
 
@@ -39,6 +40,7 @@ struct nsproxy init_nsproxy = {
 #ifdef CONFIG_NET
 	.net_ns			= &init_net,
 #endif
+	.cgroup_ns		= &init_cgroup_ns,
 };
 
 static inline struct nsproxy *create_nsproxy(void)
@@ -92,6 +94,13 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
 		goto out_pid;
 	}
 
+	new_nsp->cgroup_ns = copy_cgroup_ns(flags, user_ns,
+					    tsk->nsproxy->cgroup_ns);
+	if (IS_ERR(new_nsp->cgroup_ns)) {
+		err = PTR_ERR(new_nsp->cgroup_ns);
+		goto out_cgroup;
+	}
+
 	new_nsp->net_ns = copy_net_ns(flags, user_ns, tsk->nsproxy->net_ns);
 	if (IS_ERR(new_nsp->net_ns)) {
 		err = PTR_ERR(new_nsp->net_ns);
@@ -101,6 +110,9 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
 	return new_nsp;
 
 out_net:
+	if (new_nsp->cgroup_ns)
+		put_cgroup_ns(new_nsp->cgroup_ns);
+out_cgroup:
 	if (new_nsp->pid_ns_for_children)
 		put_pid_ns(new_nsp->pid_ns_for_children);
 out_pid:
@@ -128,7 +140,8 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
 	struct nsproxy *new_ns;
 
 	if (likely(!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
-			      CLONE_NEWPID | CLONE_NEWNET)))) {
+			      CLONE_NEWPID | CLONE_NEWNET |
+			      CLONE_NEWCGROUP)))) {
 		get_nsproxy(old_ns);
 		return 0;
 	}
@@ -165,6 +178,8 @@ void free_nsproxy(struct nsproxy *ns)
 		put_ipc_ns(ns->ipc_ns);
 	if (ns->pid_ns_for_children)
 		put_pid_ns(ns->pid_ns_for_children);
+	if (ns->cgroup_ns)
+		put_cgroup_ns(ns->cgroup_ns);
 	put_net(ns->net_ns);
 	kmem_cache_free(nsproxy_cachep, ns);
 }
@@ -180,7 +195,7 @@ int unshare_nsproxy_namespaces(unsigned long unshare_flags,
 	int err = 0;
 
 	if (!(unshare_flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
-			       CLONE_NEWNET | CLONE_NEWPID)))
+			       CLONE_NEWNET | CLONE_NEWPID | CLONE_NEWCGROUP)))
 		return 0;
 
 	user_ns = new_cred ? new_cred->user_ns : current_user_ns();
-- 
2.1.0.rc2.206.gedb03e5


* [PATCHv2 6/7] cgroup: cgroup namespace setns support
       [not found]   ` <1414783141-6947-1-git-send-email-adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
                       ` (4 preceding siblings ...)
  2014-10-31 19:18       ` Aditya Kali
@ 2014-10-31 19:19     ` Aditya Kali
  2014-10-31 19:19     ` [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns Aditya Kali
  2014-11-04 13:10       ` Vivek Goyal
  7 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-10-31 19:19 UTC (permalink / raw)
  To: tj-DgEjT+Ai2ygdnm+yROfE0A, lizefan-hv44wF8Li93QT0dZR+AlfA,
	serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA, luto-kltTT9wpgjJwATOyAt5JVQ,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, mingo-H+wXaHxf7aLQT0dZR+AlfA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

setns on a cgroup namespace is allowed only if the task has
CAP_SYS_ADMIN in its current user-namespace and over the
user-namespace associated with the target cgroupns.
No implicit cgroup changes happen when attaching to another
cgroupns. It is expected that someone moves the attaching
process under the target cgroupns-root.

Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
 kernel/cgroup_namespace.c | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/kernel/cgroup_namespace.c b/kernel/cgroup_namespace.c
index 7e9bda0..0803575 100644
--- a/kernel/cgroup_namespace.c
+++ b/kernel/cgroup_namespace.c
@@ -86,8 +86,22 @@ err_out:
 
 static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
 {
-	pr_info("setns not supported for cgroup namespace");
-	return -EINVAL;
+	struct cgroup_namespace *cgroup_ns = ns;
+
+	if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN) ||
+	    !ns_capable(cgroup_ns->user_ns, CAP_SYS_ADMIN))
+		return -EPERM;
+
+	/* Prevent cgroup changes for this task. */
+	threadgroup_lock(current);
+
+	get_cgroup_ns(cgroup_ns);
+	put_cgroup_ns(nsproxy->cgroup_ns);
+	nsproxy->cgroup_ns = cgroup_ns;
+
+	threadgroup_unlock(current);
+
+	return 0;
 }
 
 static void *cgroupns_get(struct task_struct *task)
-- 
2.1.0.rc2.206.gedb03e5
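
A minimal userspace sketch of the rule described in the commit message
above: both capability checks must pass, and the proxy's namespace
reference is then swapped with get-before-put ordering. This is not
kernel code. The model_* types and functions are hypothetical stand-ins,
the two bools stand in for the ns_capable() results, and locking is
left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct cgroup_namespace: only a refcount. */
struct model_cgroup_ns {
	int refcount;
};

/* Hypothetical stand-in for struct nsproxy. */
struct model_nsproxy {
	struct model_cgroup_ns *cgroup_ns;
};

static void model_get_ns(struct model_cgroup_ns *ns)
{
	ns->refcount++;
}

static void model_put_ns(struct model_cgroup_ns *ns)
{
	assert(ns->refcount > 0);
	ns->refcount--;
}

/*
 * Mirrors the shape of cgroupns_install() in the patch: refuse with
 * -EPERM unless the caller is privileged in both user namespaces, then
 * point the proxy at the target namespace.
 */
static int model_cgroupns_install(struct model_nsproxy *proxy,
				  struct model_cgroup_ns *target,
				  bool admin_in_current_userns,
				  bool admin_in_target_userns)
{
	if (!admin_in_current_userns || !admin_in_target_userns)
		return -EPERM;

	/* Take the new reference before dropping the old one. */
	model_get_ns(target);
	model_put_ns(proxy->cgroup_ns);
	proxy->cgroup_ns = target;
	return 0;
}
```

Taking the new reference first keeps the sequence safe even when the
target is the namespace the proxy already points at. Nothing in the
model moves the task between cgroups, which matches the "no implicit
cgroup changes" statement in the commit message.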

^ permalink raw reply related	[flat|nested] 384+ messages in thread

* [PATCHv2 6/7] cgroup: cgroup namespace setns support
       [not found]   ` <1414783141-6947-1-git-send-email-adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2014-10-31 19:19     ` Aditya Kali
  2014-10-31 19:18       ` Aditya Kali
                       ` (6 subsequent siblings)
  7 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-10-31 19:19 UTC (permalink / raw)
  To: tj, lizefan, serge.hallyn, luto, ebiederm, cgroups, linux-kernel,
	linux-api, mingo
  Cc: containers, jnagal, Aditya Kali

setns on a cgroup namespace is allowed only if the task has
CAP_SYS_ADMIN in its current user-namespace and over the
user-namespace associated with the target cgroupns.
No implicit cgroup changes happen when attaching to another
cgroupns. It is expected that someone moves the attaching
process under the target cgroupns-root.

Signed-off-by: Aditya Kali <adityakali@google.com>
---
 kernel/cgroup_namespace.c | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/kernel/cgroup_namespace.c b/kernel/cgroup_namespace.c
index 7e9bda0..0803575 100644
--- a/kernel/cgroup_namespace.c
+++ b/kernel/cgroup_namespace.c
@@ -86,8 +86,22 @@ err_out:
 
 static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
 {
-	pr_info("setns not supported for cgroup namespace");
-	return -EINVAL;
+	struct cgroup_namespace *cgroup_ns = ns;
+
+	if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN) ||
+	    !ns_capable(cgroup_ns->user_ns, CAP_SYS_ADMIN))
+		return -EPERM;
+
+	/* Prevent cgroup changes for this task. */
+	threadgroup_lock(current);
+
+	get_cgroup_ns(cgroup_ns);
+	put_cgroup_ns(nsproxy->cgroup_ns);
+	nsproxy->cgroup_ns = cgroup_ns;
+
+	threadgroup_unlock(current);
+
+	return 0;
 }
 
 static void *cgroupns_get(struct task_struct *task)
-- 
2.1.0.rc2.206.gedb03e5


^ permalink raw reply related	[flat|nested] 384+ messages in thread

* [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns
       [not found]   ` <1414783141-6947-1-git-send-email-adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
                       ` (5 preceding siblings ...)
  2014-10-31 19:19     ` [PATCHv2 6/7] cgroup: cgroup namespace setns support Aditya Kali
@ 2014-10-31 19:19     ` Aditya Kali
  2014-11-04 13:10       ` Vivek Goyal
  7 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-10-31 19:19 UTC (permalink / raw)
  To: tj-DgEjT+Ai2ygdnm+yROfE0A, lizefan-hv44wF8Li93QT0dZR+AlfA,
	serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA, luto-kltTT9wpgjJwATOyAt5JVQ,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, mingo-H+wXaHxf7aLQT0dZR+AlfA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

This patch enables cgroup mounting inside userns when a process
has appropriate privileges. The cgroup filesystem mounted is
rooted at the cgroupns-root. Thus, in a container-setup, only
the hierarchy under the cgroupns-root is exposed inside the container.
This allows container management tools to run inside the containers
without depending on any global state.
In order to support this, a new kernfs API is added to look up the
dentry for the cgroupns-root.

Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
 fs/kernfs/mount.c      | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/kernfs.h |  2 ++
 kernel/cgroup.c        | 47 +++++++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 95 insertions(+), 2 deletions(-)

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index f973ae9..e334f45 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -62,6 +62,54 @@ struct kernfs_root *kernfs_root_from_sb(struct super_block *sb)
 	return NULL;
 }
 
+/**
+ * kernfs_make_root - create new root dentry for the given kernfs_node.
+ * @sb: the kernfs super_block
+ * @kn: kernfs_node for which a dentry is needed
+ *
+ * This can be used by callers which want to mount only a part of the kernfs
+ * as root of the filesystem.
+ */
+struct dentry *kernfs_obtain_root(struct super_block *sb,
+				  struct kernfs_node *kn)
+{
+	struct dentry *dentry;
+	struct inode *inode;
+
+	BUG_ON(sb->s_op != &kernfs_sops);
+
+	/* inode for the given kernfs_node should already exist. */
+	inode = ilookup(sb, kn->ino);
+	if (!inode) {
+		pr_debug("kernfs: could not get inode for '");
+		pr_cont_kernfs_path(kn);
+		pr_cont("'.\n");
+		return ERR_PTR(-EINVAL);
+	}
+
+	/* instantiate and link root dentry */
+	dentry = d_obtain_root(inode);
+	if (!dentry) {
+		pr_debug("kernfs: could not get dentry for '");
+		pr_cont_kernfs_path(kn);
+		pr_cont("'.\n");
+		return ERR_PTR(-ENOMEM);
+	}
+
+	/* If this is a new dentry, set it up. We need kernfs_mutex because this
+	 * may be called by callers other than kernfs_fill_super. */
+	mutex_lock(&kernfs_mutex);
+	if (!dentry->d_fsdata) {
+		kernfs_get(kn);
+		dentry->d_fsdata = kn;
+	} else {
+		WARN_ON(dentry->d_fsdata != kn);
+	}
+	mutex_unlock(&kernfs_mutex);
+
+	return dentry;
+}
+
 static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
 {
 	struct kernfs_super_info *info = kernfs_info(sb);
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 3c2be75..b9538e0 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -274,6 +274,8 @@ void kernfs_put(struct kernfs_node *kn);
 struct kernfs_node *kernfs_node_from_dentry(struct dentry *dentry);
 struct kernfs_root *kernfs_root_from_sb(struct super_block *sb);
 
+struct dentry *kernfs_obtain_root(struct super_block *sb,
+				  struct kernfs_node *kn);
 struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
 				       unsigned int flags, void *priv);
 void kernfs_destroy_root(struct kernfs_root *root);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 7e5d597..250aaec 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1302,6 +1302,13 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
 
 	memset(opts, 0, sizeof(*opts));
 
+	/* Implicitly add CGRP_ROOT_SANE_BEHAVIOR if inside a non-init cgroup
+	 * namespace.
+	 */
+	if (current->nsproxy->cgroup_ns != &init_cgroup_ns) {
+		opts->flags |= CGRP_ROOT_SANE_BEHAVIOR;
+	}
+
 	while ((token = strsep(&o, ",")) != NULL) {
 		nr_opts++;
 
@@ -1391,7 +1398,7 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
 
 	if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
 		pr_warn("sane_behavior: this is still under development and its behaviors will change, proceed at your own risk\n");
-		if (nr_opts != 1) {
+		if (nr_opts > 1) {
 			pr_err("sane_behavior: no other mount options allowed\n");
 			return -EINVAL;
 		}
@@ -1581,6 +1588,15 @@ static void init_cgroup_root(struct cgroup_root *root,
 		set_bit(CGRP_CPUSET_CLONE_CHILDREN, &root->cgrp.flags);
 }
 
+struct dentry *cgroupns_get_root(struct super_block *sb,
+				 struct cgroup_namespace *ns)
+{
+	struct dentry *nsdentry;
+
+	nsdentry = kernfs_obtain_root(sb, ns->root_cgrp->kn);
+	return nsdentry;
+}
+
 static int cgroup_setup_root(struct cgroup_root *root, unsigned int ss_mask)
 {
 	LIST_HEAD(tmp_links);
@@ -1685,6 +1701,14 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
 	int ret;
 	int i;
 	bool new_sb;
+	struct cgroup_namespace *ns =
+		get_cgroup_ns(current->nsproxy->cgroup_ns);
+
+	/* Check if the caller has permission to mount. */
+	if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) {
+		put_cgroup_ns(ns);
+		return ERR_PTR(-EPERM);
+	}
 
 	/*
 	 * The first time anyone tries to mount a cgroup, enable the list
@@ -1817,11 +1841,28 @@ out_free:
 	kfree(opts.release_agent);
 	kfree(opts.name);
 
-	if (ret)
+	if (ret) {
+		put_cgroup_ns(ns);
 		return ERR_PTR(ret);
+	}
 
 	dentry = kernfs_mount(fs_type, flags, root->kf_root,
 				CGROUP_SUPER_MAGIC, &new_sb);
+
+	if (!IS_ERR(dentry) && (root == &cgrp_dfl_root)) {
+		/* If this mount is for the default hierarchy in non-init cgroup
+		 * namespace, then instead of root cgroup's dentry, we return
+		 * the dentry corresponding to the cgroupns->root_cgrp.
+		 */
+		if (ns != &init_cgroup_ns) {
+			struct dentry *nsdentry;
+
+			nsdentry = cgroupns_get_root(dentry->d_sb, ns);
+			dput(dentry);
+			dentry = nsdentry;
+		}
+	}
+
 	if (IS_ERR(dentry) || !new_sb)
 		cgroup_put(&root->cgrp);
 
@@ -1834,6 +1875,7 @@ out_free:
 		deactivate_super(pinned_sb);
 	}
 
+	put_cgroup_ns(ns);
 	return dentry;
 }
 
@@ -1862,6 +1904,7 @@ static struct file_system_type cgroup_fs_type = {
 	.name = "cgroup",
 	.mount = cgroup_mount,
 	.kill_sb = cgroup_kill_sb,
+	.fs_flags = FS_USERNS_MOUNT,
 };
 
 static struct kobject *cgroup_kobj;
-- 
2.1.0.rc2.206.gedb03e5
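
A minimal userspace sketch of the root-selection rule that the
cgroup_mount() hunk above adds: for the default hierarchy, a caller in a
non-init cgroup namespace gets its cgroupns-root as the mount root, and
every other case keeps the hierarchy root. This is not kernel code. The
mnt_model_* names are hypothetical, and dentries, superblocks and
reference counting are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a cgroup: just its path. */
struct mnt_model_cgroup {
	const char *path;
};

/* Hypothetical stand-in for struct cgroup_namespace. */
struct mnt_model_cgroup_ns {
	const struct mnt_model_cgroup *root_cgrp;
};

/*
 * Pick the cgroup that should appear as "/" of the new mount.
 * Only the default hierarchy is affected, and only when the caller's
 * namespace is not the initial one.
 */
static const struct mnt_model_cgroup *
mnt_model_pick_root(const struct mnt_model_cgroup *hierarchy_root,
		    bool is_default_hierarchy,
		    const struct mnt_model_cgroup_ns *ns,
		    const struct mnt_model_cgroup_ns *init_ns)
{
	assert(hierarchy_root && ns && init_ns);

	if (is_default_hierarchy && ns != init_ns)
		return ns->root_cgrp;

	return hierarchy_root;
}
```

With a namespace rooted at /batchjobs/c_job_id1, a default-hierarchy
mount from inside that namespace would show only that subtree, while
the same mount from the initial namespace still shows the whole
hierarchy.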

^ permalink raw reply related	[flat|nested] 384+ messages in thread

* [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns
  2014-10-31 19:18   ` Aditya Kali
  (?)
  (?)
@ 2014-10-31 19:19   ` Aditya Kali
  2014-11-01  0:07       ` Andy Lutomirski
                       ` (2 more replies)
  -1 siblings, 3 replies; 384+ messages in thread
From: Aditya Kali @ 2014-10-31 19:19 UTC (permalink / raw)
  To: tj, lizefan, serge.hallyn, luto, ebiederm, cgroups, linux-kernel,
	linux-api, mingo
  Cc: containers, jnagal, Aditya Kali

This patch enables cgroup mounting inside userns when a process
has appropriate privileges. The cgroup filesystem mounted is
rooted at the cgroupns-root. Thus, in a container-setup, only
the hierarchy under the cgroupns-root is exposed inside the container.
This allows container management tools to run inside the containers
without depending on any global state.
In order to support this, a new kernfs API is added to look up the
dentry for the cgroupns-root.

Signed-off-by: Aditya Kali <adityakali@google.com>
---
 fs/kernfs/mount.c      | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/kernfs.h |  2 ++
 kernel/cgroup.c        | 47 +++++++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 95 insertions(+), 2 deletions(-)

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index f973ae9..e334f45 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -62,6 +62,54 @@ struct kernfs_root *kernfs_root_from_sb(struct super_block *sb)
 	return NULL;
 }
 
+/**
+ * kernfs_make_root - create new root dentry for the given kernfs_node.
+ * @sb: the kernfs super_block
+ * @kn: kernfs_node for which a dentry is needed
+ *
+ * This can be used by callers which want to mount only a part of the kernfs
+ * as root of the filesystem.
+ */
+struct dentry *kernfs_obtain_root(struct super_block *sb,
+				  struct kernfs_node *kn)
+{
+	struct dentry *dentry;
+	struct inode *inode;
+
+	BUG_ON(sb->s_op != &kernfs_sops);
+
+	/* inode for the given kernfs_node should already exist. */
+	inode = ilookup(sb, kn->ino);
+	if (!inode) {
+		pr_debug("kernfs: could not get inode for '");
+		pr_cont_kernfs_path(kn);
+		pr_cont("'.\n");
+		return ERR_PTR(-EINVAL);
+	}
+
+	/* instantiate and link root dentry */
+	dentry = d_obtain_root(inode);
+	if (!dentry) {
+		pr_debug("kernfs: could not get dentry for '");
+		pr_cont_kernfs_path(kn);
+		pr_cont("'.\n");
+		return ERR_PTR(-ENOMEM);
+	}
+
+	/* If this is a new dentry, set it up. We need kernfs_mutex because this
+	 * may be called by callers other than kernfs_fill_super. */
+	mutex_lock(&kernfs_mutex);
+	if (!dentry->d_fsdata) {
+		kernfs_get(kn);
+		dentry->d_fsdata = kn;
+	} else {
+		WARN_ON(dentry->d_fsdata != kn);
+	}
+	mutex_unlock(&kernfs_mutex);
+
+	return dentry;
+}
+
 static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
 {
 	struct kernfs_super_info *info = kernfs_info(sb);
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 3c2be75..b9538e0 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -274,6 +274,8 @@ void kernfs_put(struct kernfs_node *kn);
 struct kernfs_node *kernfs_node_from_dentry(struct dentry *dentry);
 struct kernfs_root *kernfs_root_from_sb(struct super_block *sb);
 
+struct dentry *kernfs_obtain_root(struct super_block *sb,
+				  struct kernfs_node *kn);
 struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
 				       unsigned int flags, void *priv);
 void kernfs_destroy_root(struct kernfs_root *root);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 7e5d597..250aaec 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1302,6 +1302,13 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
 
 	memset(opts, 0, sizeof(*opts));
 
+	/* Implicitly add CGRP_ROOT_SANE_BEHAVIOR if inside a non-init cgroup
+	 * namespace.
+	 */
+	if (current->nsproxy->cgroup_ns != &init_cgroup_ns) {
+		opts->flags |= CGRP_ROOT_SANE_BEHAVIOR;
+	}
+
 	while ((token = strsep(&o, ",")) != NULL) {
 		nr_opts++;
 
@@ -1391,7 +1398,7 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
 
 	if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
 		pr_warn("sane_behavior: this is still under development and its behaviors will change, proceed at your own risk\n");
-		if (nr_opts != 1) {
+		if (nr_opts > 1) {
 			pr_err("sane_behavior: no other mount options allowed\n");
 			return -EINVAL;
 		}
@@ -1581,6 +1588,15 @@ static void init_cgroup_root(struct cgroup_root *root,
 		set_bit(CGRP_CPUSET_CLONE_CHILDREN, &root->cgrp.flags);
 }
 
+struct dentry *cgroupns_get_root(struct super_block *sb,
+				 struct cgroup_namespace *ns)
+{
+	struct dentry *nsdentry;
+
+	nsdentry = kernfs_obtain_root(sb, ns->root_cgrp->kn);
+	return nsdentry;
+}
+
 static int cgroup_setup_root(struct cgroup_root *root, unsigned int ss_mask)
 {
 	LIST_HEAD(tmp_links);
@@ -1685,6 +1701,14 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
 	int ret;
 	int i;
 	bool new_sb;
+	struct cgroup_namespace *ns =
+		get_cgroup_ns(current->nsproxy->cgroup_ns);
+
+	/* Check if the caller has permission to mount. */
+	if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) {
+		put_cgroup_ns(ns);
+		return ERR_PTR(-EPERM);
+	}
 
 	/*
 	 * The first time anyone tries to mount a cgroup, enable the list
@@ -1817,11 +1841,28 @@ out_free:
 	kfree(opts.release_agent);
 	kfree(opts.name);
 
-	if (ret)
+	if (ret) {
+		put_cgroup_ns(ns);
 		return ERR_PTR(ret);
+	}
 
 	dentry = kernfs_mount(fs_type, flags, root->kf_root,
 				CGROUP_SUPER_MAGIC, &new_sb);
+
+	if (!IS_ERR(dentry) && (root == &cgrp_dfl_root)) {
+		/* If this mount is for the default hierarchy in non-init cgroup
+		 * namespace, then instead of root cgroup's dentry, we return
+		 * the dentry corresponding to the cgroupns->root_cgrp.
+		 */
+		if (ns != &init_cgroup_ns) {
+			struct dentry *nsdentry;
+
+			nsdentry = cgroupns_get_root(dentry->d_sb, ns);
+			dput(dentry);
+			dentry = nsdentry;
+		}
+	}
+
 	if (IS_ERR(dentry) || !new_sb)
 		cgroup_put(&root->cgrp);
 
@@ -1834,6 +1875,7 @@ out_free:
 		deactivate_super(pinned_sb);
 	}
 
+	put_cgroup_ns(ns);
 	return dentry;
 }
 
@@ -1862,6 +1904,7 @@ static struct file_system_type cgroup_fs_type = {
 	.name = "cgroup",
 	.mount = cgroup_mount,
 	.kill_sb = cgroup_kill_sb,
+	.fs_flags = FS_USERNS_MOUNT,
 };
 
 static struct kobject *cgroup_kobj;
-- 
2.1.0.rc2.206.gedb03e5


^ permalink raw reply related	[flat|nested] 384+ messages in thread

* Re: [PATCHv2 5/7] cgroup: introduce cgroup namespaces
  2014-10-31 19:18       ` Aditya Kali
@ 2014-11-01  0:02           ` Andy Lutomirski
  -1 siblings, 0 replies; 384+ messages in thread
From: Andy Lutomirski @ 2014-11-01  0:02 UTC (permalink / raw)
  To: Aditya Kali
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Eric W. Biederman,
	Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Ingo Molnar

On Fri, Oct 31, 2014 at 12:18 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
> Introduce the ability to create a new cgroup namespace. The newly created
> cgroup namespace remembers the cgroup of the process at the point
> of creation of the cgroup namespace (referred to as cgroupns-root).
> The main purpose of cgroup namespace is to virtualize the contents
> of /proc/self/cgroup file. Processes inside a cgroup namespace
> are only able to see paths relative to their namespace root
> (unless they are moved outside of their cgroupns-root, at which point
>  they will see a relative path from their cgroupns-root).
> For a correctly set up container this enables container-tools
> (like libcontainer, lxc, lmctfy, etc.) to create completely virtualized
> containers without leaking system level cgroup hierarchy to the task.
> This patch only implements the 'unshare' part of the cgroupns.
>

> +       /* Prevent cgroup changes for this task. */
> +       threadgroup_lock(current);

This could just be me being dense, but what is the lock for?

> +
> +       /* CGROUPNS only virtualizes the cgroup path on the unified hierarchy.
> +        */
> +       cgrp = get_task_cgroup(current);
> +
> +       err = -ENOMEM;
> +       new_ns = alloc_cgroup_ns();
> +       if (!new_ns)
> +               goto err_out_unlock;
> +
> +       err = proc_alloc_inum(&new_ns->proc_inum);
> +       if (err)
> +               goto err_out_unlock;
> +
> +       new_ns->user_ns = get_user_ns(user_ns);
> +       new_ns->root_cgrp = cgrp;
> +
> +       threadgroup_unlock(current);
> +
> +       return new_ns;
> +
> +err_out_unlock:
> +       threadgroup_unlock(current);
> +err_out:
> +       if (cgrp)
> +               cgroup_put(cgrp);
> +       kfree(new_ns);
> +       return ERR_PTR(err);
> +}
> +
> +static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
> +{
> +       pr_info("setns not supported for cgroup namespace");
> +       return -EINVAL;
> +}
> +
> +static void *cgroupns_get(struct task_struct *task)
> +{
> +       struct cgroup_namespace *ns = NULL;
> +       struct nsproxy *nsproxy;
> +
> +       rcu_read_lock();
> +       nsproxy = task->nsproxy;
> +       if (nsproxy) {
> +               ns = nsproxy->cgroup_ns;
> +               get_cgroup_ns(ns);
> +       }
> +       rcu_read_unlock();

How is this correct?  Other namespaces do it too, so it Must Be
Correct (tm), but I don't understand.  What is RCU protecting?

--Andy

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv2 5/7] cgroup: introduce cgroup namespaces
@ 2014-11-01  0:02           ` Andy Lutomirski
  0 siblings, 0 replies; 384+ messages in thread
From: Andy Lutomirski @ 2014-11-01  0:02 UTC (permalink / raw)
  To: Aditya Kali
  Cc: Tejun Heo, Li Zefan, Serge Hallyn, Eric W. Biederman, cgroups,
	linux-kernel, Linux API, Ingo Molnar, Linux Containers,
	Rohit Jnagal

On Fri, Oct 31, 2014 at 12:18 PM, Aditya Kali <adityakali@google.com> wrote:
> Introduce the ability to create a new cgroup namespace. The newly created
> cgroup namespace remembers the cgroup of the process at the point
> of creation of the cgroup namespace (referred to as cgroupns-root).
> The main purpose of cgroup namespace is to virtualize the contents
> of /proc/self/cgroup file. Processes inside a cgroup namespace
> are only able to see paths relative to their namespace root
> (unless they are moved outside of their cgroupns-root, at which point
>  they will see a relative path from their cgroupns-root).
> For a correctly set up container this enables container-tools
> (like libcontainer, lxc, lmctfy, etc.) to create completely virtualized
> containers without leaking system level cgroup hierarchy to the task.
> This patch only implements the 'unshare' part of the cgroupns.
>

> +       /* Prevent cgroup changes for this task. */
> +       threadgroup_lock(current);

This could just be me being dense, but what is the lock for?

> +
> +       /* CGROUPNS only virtualizes the cgroup path on the unified hierarchy.
> +        */
> +       cgrp = get_task_cgroup(current);
> +
> +       err = -ENOMEM;
> +       new_ns = alloc_cgroup_ns();
> +       if (!new_ns)
> +               goto err_out_unlock;
> +
> +       err = proc_alloc_inum(&new_ns->proc_inum);
> +       if (err)
> +               goto err_out_unlock;
> +
> +       new_ns->user_ns = get_user_ns(user_ns);
> +       new_ns->root_cgrp = cgrp;
> +
> +       threadgroup_unlock(current);
> +
> +       return new_ns;
> +
> +err_out_unlock:
> +       threadgroup_unlock(current);
> +err_out:
> +       if (cgrp)
> +               cgroup_put(cgrp);
> +       kfree(new_ns);
> +       return ERR_PTR(err);
> +}
> +
> +static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
> +{
> +       pr_info("setns not supported for cgroup namespace");
> +       return -EINVAL;
> +}
> +
> +static void *cgroupns_get(struct task_struct *task)
> +{
> +       struct cgroup_namespace *ns = NULL;
> +       struct nsproxy *nsproxy;
> +
> +       rcu_read_lock();
> +       nsproxy = task->nsproxy;
> +       if (nsproxy) {
> +               ns = nsproxy->cgroup_ns;
> +               get_cgroup_ns(ns);
> +       }
> +       rcu_read_unlock();

How is this correct?  Other namespaces do it too, so it Must Be
Correct (tm), but I don't understand.  What is RCU protecting?

--Andy

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns
       [not found]     ` <1414783141-6947-8-git-send-email-adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2014-11-01  0:07       ` Andy Lutomirski
  2014-11-01  1:09         ` Eric W. Biederman
  2014-11-04  1:59       ` Aditya Kali
  2 siblings, 0 replies; 384+ messages in thread
From: Andy Lutomirski @ 2014-11-01  0:07 UTC (permalink / raw)
  To: Aditya Kali
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Eric W. Biederman,
	Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Ingo Molnar

On Fri, Oct 31, 2014 at 12:19 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
> This patch enables cgroup mounting inside userns when a process
> has appropriate privileges. The cgroup filesystem mounted is
> rooted at the cgroupns-root. Thus, in a container-setup, only
> the hierarchy under the cgroupns-root is exposed inside the container.
> This allows container management tools to run inside the containers
> without depending on any global state.
> In order to support this, a new kernfs API is added to look up the
> dentry for the cgroupns-root.
>
> Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> ---
>  fs/kernfs/mount.c      | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/kernfs.h |  2 ++
>  kernel/cgroup.c        | 47 +++++++++++++++++++++++++++++++++++++++++++++--
>  3 files changed, 95 insertions(+), 2 deletions(-)
>
> diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
> index f973ae9..e334f45 100644
> --- a/fs/kernfs/mount.c
> +++ b/fs/kernfs/mount.c
> @@ -62,6 +62,54 @@ struct kernfs_root *kernfs_root_from_sb(struct super_block *sb)
>         return NULL;
>  }
>
> +/**
> + * kernfs_make_root - create new root dentry for the given kernfs_node.
> + * @sb: the kernfs super_block
> + * @kn: kernfs_node for which a dentry is needed
> + *
> + * This can be used by callers which want to mount only a part of the kernfs
> + * as root of the filesystem.
> + */
> +struct dentry *kernfs_obtain_root(struct super_block *sb,
> +                                 struct kernfs_node *kn)
> +{

I can't usefully review this, but kernfs_make_root and
kernfs_obtain_root aren't the same string...

> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 7e5d597..250aaec 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -1302,6 +1302,13 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
>
>         memset(opts, 0, sizeof(*opts));
>
> +       /* Implicitly add CGRP_ROOT_SANE_BEHAVIOR if inside a non-init cgroup
> +        * namespace.
> +        */
> +       if (current->nsproxy->cgroup_ns != &init_cgroup_ns) {
> +               opts->flags |= CGRP_ROOT_SANE_BEHAVIOR;
> +       }
> +

I don't like this implicit stuff.  Can you just return -EINVAL if sane
behavior isn't requested?

>         while ((token = strsep(&o, ",")) != NULL) {
>                 nr_opts++;
>
> @@ -1391,7 +1398,7 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
>
>         if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
>                 pr_warn("sane_behavior: this is still under development and its behaviors will change, proceed at your own risk\n");
> -               if (nr_opts != 1) {
> +               if (nr_opts > 1) {
>                         pr_err("sane_behavior: no other mount options allowed\n");
>                         return -EINVAL;

This looks wrong.  But, if you make the change above, then it'll be right.

> @@ -1685,6 +1701,14 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
>         int ret;
>         int i;
>         bool new_sb;
> +       struct cgroup_namespace *ns =
> +               get_cgroup_ns(current->nsproxy->cgroup_ns);
> +
> +       /* Check if the caller has permission to mount. */
> +       if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) {
> +               put_cgroup_ns(ns);
> +               return ERR_PTR(-EPERM);
> +       }

Why is this necessary?

> @@ -1862,6 +1904,7 @@ static struct file_system_type cgroup_fs_type = {
>         .name = "cgroup",
>         .mount = cgroup_mount,
>         .kill_sb = cgroup_kill_sb,
> +       .fs_flags = FS_USERNS_MOUNT,

Aargh, another one!  Eric, can you either ack or nack my patch?
Because if my patch goes in, then this line may need to change.  Or
not, but if a stable release with cgroupfs and without my patch
happens, then we'll have an ABI break.

--Andy
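
One possible reading of the sane_behavior exchange above, as a
userspace sketch. It compares the option check as posted (flag set
implicitly in a non-init cgroupns, then "nr_opts > 1") with the
suggested alternative (fail with -EINVAL unless the flag is requested,
keeping the original "nr_opts != 1"). This is not kernel code. The
opts_model_* names are hypothetical, and nr_opts counts every mount
option token, including the sane_behavior token itself when present.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Check as posted in v2: flag is forced on outside the init cgroupns. */
static int opts_model_check_implicit(bool in_init_ns, bool sane_requested,
				     int nr_opts)
{
	bool sane = sane_requested || !in_init_ns;

	assert(nr_opts >= 0);

	/* nr_opts may be 0 here, so "!= 1" would reject a bare mount. */
	if (sane && nr_opts > 1)
		return -EINVAL;
	return 0;
}

/* Suggested alternative: the flag must be requested explicitly. */
static int opts_model_check_explicit(bool in_init_ns, bool sane_requested,
				     int nr_opts)
{
	assert(nr_opts >= 0);

	if (!in_init_ns && !sane_requested)
		return -EINVAL;

	/* The flag itself is one token, so exactly one token is allowed. */
	if (sane_requested && nr_opts != 1)
		return -EINVAL;
	return 0;
}
```

Under the implicit variant, a single option other than sane_behavior
passes the count check inside a non-init cgroupns, which may be what
"this looks wrong" refers to. Under the explicit variant the original
"!= 1" comparison stays correct, because the flag can only be set by a
token that is itself counted.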

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns
       [not found]     ` <1414783141-6947-8-git-send-email-adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2014-11-01  0:07       ` Andy Lutomirski
  2014-11-01  1:09         ` Eric W. Biederman
  2014-11-04  1:59       ` Aditya Kali
  2 siblings, 0 replies; 384+ messages in thread
From: Andy Lutomirski @ 2014-11-01  0:07 UTC (permalink / raw)
  To: Aditya Kali
  Cc: Tejun Heo, Li Zefan, Serge Hallyn, Eric W. Biederman, cgroups,
	linux-kernel, Linux API, Ingo Molnar, Linux Containers,
	Rohit Jnagal

On Fri, Oct 31, 2014 at 12:19 PM, Aditya Kali <adityakali@google.com> wrote:
> This patch enables cgroup mounting inside userns when a process
> has appropriate privileges. The cgroup filesystem mounted is
> rooted at the cgroupns-root. Thus, in a container-setup, only
> the hierarchy under the cgroupns-root is exposed inside the container.
> This allows container management tools to run inside the containers
> without depending on any global state.
> In order to support this, a new kernfs API is added to look up the
> dentry for the cgroupns-root.
>
> Signed-off-by: Aditya Kali <adityakali@google.com>
> ---
>  fs/kernfs/mount.c      | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/kernfs.h |  2 ++
>  kernel/cgroup.c        | 47 +++++++++++++++++++++++++++++++++++++++++++++--
>  3 files changed, 95 insertions(+), 2 deletions(-)
>
> diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
> index f973ae9..e334f45 100644
> --- a/fs/kernfs/mount.c
> +++ b/fs/kernfs/mount.c
> @@ -62,6 +62,54 @@ struct kernfs_root *kernfs_root_from_sb(struct super_block *sb)
>         return NULL;
>  }
>
> +/**
> + * kernfs_make_root - create new root dentry for the given kernfs_node.
> + * @sb: the kernfs super_block
> + * @kn: kernfs_node for which a dentry is needed
> + *
> + * This can be used by callers which want to mount only a part of the kernfs
> + * as root of the filesystem.
> + */
> +struct dentry *kernfs_obtain_root(struct super_block *sb,
> +                                 struct kernfs_node *kn)
> +{

I can't usefully review this, but kernfs_make_root and
kernfs_obtain_root aren't the same string...

> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 7e5d597..250aaec 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -1302,6 +1302,13 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
>
>         memset(opts, 0, sizeof(*opts));
>
> +       /* Implicitly add CGRP_ROOT_SANE_BEHAVIOR if inside a non-init cgroup
> +        * namespace.
> +        */
> +       if (current->nsproxy->cgroup_ns != &init_cgroup_ns) {
> +               opts->flags |= CGRP_ROOT_SANE_BEHAVIOR;
> +       }
> +

I don't like this implicit stuff.  Can you just return -EINVAL if sane
behavior isn't requested?
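
As a rough illustration of the explicit check suggested here, a minimal
userspace sketch follows.  It is not kernel code: struct opts_model,
parse_opts_model(), MODEL_SANE_BEHAVIOR and the in_init_ns parameter are
hypothetical stand-ins for cgroup_sb_opts, parse_cgroupfs_options(),
CGRP_ROOT_SANE_BEHAVIOR and the cgroup_ns != &init_cgroup_ns test, and the
option string is assumed to be "__DEVEL__sane_behavior".

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MODEL_SANE_BEHAVIOR 0x1u	/* stand-in for CGRP_ROOT_SANE_BEHAVIOR */

/* Hypothetical, cut-down stand-in for struct cgroup_sb_opts. */
struct opts_model {
	unsigned int flags;
};

/*
 * Userspace model only: parse a comma-separated option string (modified
 * in place) and refuse the mount when sane behavior was not requested
 * explicitly outside the init cgroup namespace.
 */
static int parse_opts_model(char *data, bool in_init_ns,
			    struct opts_model *opts)
{
	char *token, *next;
	int nr_opts = 0;

	memset(opts, 0, sizeof(*opts));

	for (token = data; token != NULL; token = next) {
		char *comma = strchr(token, ',');

		next = comma ? comma + 1 : NULL;
		if (comma)
			*comma = '\0';
		if (*token == '\0')
			continue;	/* the model simply skips empty tokens */

		nr_opts++;
		if (strcmp(token, "__DEVEL__sane_behavior") == 0)
			opts->flags |= MODEL_SANE_BEHAVIOR;
	}

	/* No implicit flag: fail instead of silently changing behavior. */
	if (!in_init_ns && !(opts->flags & MODEL_SANE_BEHAVIOR))
		return -EINVAL;

	/* With the flag always explicit, the "!= 1" test stays valid. */
	if ((opts->flags & MODEL_SANE_BEHAVIOR) && nr_opts != 1)
		return -EINVAL;

	return 0;
}
```

In this model a mount with no options from inside a non-init namespace
fails with -EINVAL rather than quietly picking up the flag, and the
existing nr_opts != 1 test needs no relaxation.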

>         while ((token = strsep(&o, ",")) != NULL) {
>                 nr_opts++;
>
> @@ -1391,7 +1398,7 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
>
>         if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
>                 pr_warn("sane_behavior: this is still under development and its behaviors will change, proceed at your own risk\n");
> -               if (nr_opts != 1) {
> +               if (nr_opts > 1) {
>                         pr_err("sane_behavior: no other mount options allowed\n");
>                         return -EINVAL;

This looks wrong.  But, if you make the change above, then it'll be right.

> @@ -1685,6 +1701,14 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
>         int ret;
>         int i;
>         bool new_sb;
> +       struct cgroup_namespace *ns =
> +               get_cgroup_ns(current->nsproxy->cgroup_ns);
> +
> +       /* Check if the caller has permission to mount. */
> +       if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) {
> +               put_cgroup_ns(ns);
> +               return ERR_PTR(-EPERM);
> +       }

Why is this necessary?

> @@ -1862,6 +1904,7 @@ static struct file_system_type cgroup_fs_type = {
>         .name = "cgroup",
>         .mount = cgroup_mount,
>         .kill_sb = cgroup_kill_sb,
> +       .fs_flags = FS_USERNS_MOUNT,

Aargh, another one!  Eric, can you either ack or nack my patch?
Because if my patch goes in, then this line may need to change.  Or
not, but if a stable release with cgroupfs and without my patch
happens, then we'll have an ABI break.

--Andy

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv2 5/7] cgroup: introduce cgroup namespaces
  2014-11-01  0:02           ` Andy Lutomirski
@ 2014-11-01  0:58               ` Eric W. Biederman
  -1 siblings, 0 replies; 384+ messages in thread
From: Eric W. Biederman @ 2014-11-01  0:58 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel, Tejun Heo,
	cgroups, Ingo Molnar

Andy Lutomirski <luto@amacapital.net> writes:

> On Fri, Oct 31, 2014 at 12:18 PM, Aditya Kali <adityakali@google.com> wrote:

<snip>

>> +static void *cgroupns_get(struct task_struct *task)
>> +{
>> +       struct cgroup_namespace *ns = NULL;
>> +       struct nsproxy *nsproxy;
>> +
>> +       rcu_read_lock();
>> +       nsproxy = task->nsproxy;
>> +       if (nsproxy) {
>> +               ns = nsproxy->cgroup_ns;
>> +               get_cgroup_ns(ns);
>> +       }
>> +       rcu_read_unlock();
>
> How is this correct?  Other namespaces do it too, so it Must Be
> Correct (tm), but I don't understand.  What is RCU protecting?

The code is not correct.  The code needs to use task_lock.

RCU used to protect nsproxy, and now task_lock protects nsproxy.
For the reasons behind all of this, I refer you to the commit
that changed this and the comment in nsproxy.h:

commit 728dba3a39c66b3d8ac889ddbe38b5b1c264aec3
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Mon Feb 3 19:13:49 2014 -0800

    namespaces: Use task_lock and not rcu to protect nsproxy
    
    The synchronous synchronize_rcu in switch_task_namespaces makes setns
    a sufficiently expensive system call that people have complained.
    
    Upon inspection, nsproxy no longer needs rcu protection for remote
    reads.  Remote reads are rare, so optimize for same-process reads and
    writes by switching to task_lock instead.
    
    This yields a simpler to understand lock, and a faster setns system call.
    
    In particular this fixes a performance regression observed
    by Rafael David Tinoco <rafael.tinoco@canonical.com>.
    
    This is effectively a revert of Pavel Emelyanov's commit
    cf7b708c8d1d7a27736771bcf4c457b332b0f818 Make access to task's nsproxy lighter
    from 2007.  The race this originally fixed no longer exists as
    do_notify_parent uses task_active_pid_ns(parent) instead of
    parent->nsproxy.
    
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
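
To make the locking rule concrete, here is a minimal userspace sketch of
reading nsproxy under a per-task lock rather than under RCU.  It is a
model under stated assumptions, not the kernel implementation: the
*_model types and functions are hypothetical, and a C11 atomic_flag
spinlock stands in for task_lock()/task_unlock().

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures. */
struct cgroup_ns_model {
	atomic_int count;		/* models the namespace refcount */
};

struct nsproxy_model {
	struct cgroup_ns_model *cgroup_ns;
};

struct task_model {
	atomic_flag alloc_lock;		/* models the lock taken by task_lock() */
	struct nsproxy_model *nsproxy;	/* NULL once the task has exited */
};

static void task_lock_model(struct task_model *t)
{
	while (atomic_flag_test_and_set_explicit(&t->alloc_lock,
						 memory_order_acquire))
		;	/* spin until the lock is free */
}

static void task_unlock_model(struct task_model *t)
{
	atomic_flag_clear_explicit(&t->alloc_lock, memory_order_release);
}

/*
 * Take a reference on the task's cgroup namespace.  The nsproxy pointer
 * is only read while the task lock is held, so a concurrent switch of
 * namespaces (which takes the same lock) cannot free it underneath us.
 */
static struct cgroup_ns_model *cgroupns_get_model(struct task_model *task)
{
	struct cgroup_ns_model *ns = NULL;
	struct nsproxy_model *nsproxy;

	task_lock_model(task);
	nsproxy = task->nsproxy;
	if (nsproxy) {
		ns = nsproxy->cgroup_ns;
		atomic_fetch_add(&ns->count, 1);	/* models get_cgroup_ns() */
	}
	task_unlock_model(task);

	return ns;
}
```

The only change relative to the quoted cgroupns_get() is which primitive
brackets the read of task->nsproxy; the NULL check and the reference
bump stay as they were.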

Eric

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns
  2014-10-31 19:19   ` [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns Aditya Kali
@ 2014-11-01  1:09         ` Eric W. Biederman
  2014-11-04  1:59       ` Aditya Kali
       [not found]     ` <1414783141-6947-8-git-send-email-adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
  2 siblings, 0 replies; 384+ messages in thread
From: Eric W. Biederman @ 2014-11-01  1:09 UTC (permalink / raw)
  To: Aditya Kali
  Cc: linux-api, containers, serge.hallyn, linux-kernel, luto, tj,
	cgroups, mingo

Aditya Kali <adityakali@google.com> writes:

> This patch enables cgroup mounting inside userns when a process
> has appropriate privileges. The cgroup filesystem mounted is
> rooted at the cgroupns-root. Thus, in a container-setup, only
> the hierarchy under the cgroupns-root is exposed inside the container.
> This allows container management tools to run inside the containers
> without depending on any global state.
> In order to support this, a new kernfs api is added to lookup the
> dentry for the cgroupns-root.

There is a misdesign in this.  Because files already exist we need the
protections that are present in proc and sysfs that only allow you to
mount the filesystem if it is already mounted.  Otherwise you can wind
up mounting this cgroupfs in a chroot jail when the global root would
not like you to see it.  cgroupfs isn't as bad as proc and sys but there
is at the very least an information leak in mounting it.

Given that we are effectively performing a bind mount in this patch, and
that we need to require cgroupfs be mounted anyway (to be safe), I don't
see the point of this change.

If we could change the set of cgroups visible in cgroupfs, I could
probably see the point.  But as it is, this change seems to be pointless.
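
The "only mountable if already mounted" protection can be modelled in a
few lines of userspace C.  This is a sketch under assumptions, not the
kernel's check (which, if I recall the name correctly, is
fs_fully_visible() and is more involved): struct mount_model,
fs_visible_model() and may_mount_model() are hypothetical names, and a
mount namespace is reduced to a flat array of mounts.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, flattened view of one mount in a mount namespace. */
struct mount_model {
	const char *fstype;	/* e.g. "proc", "sysfs", "cgroup" */
	bool at_fs_root;	/* mount exposes the filesystem's own root */
};

/* True if the namespace already contains a full mount of fstype. */
static bool fs_visible_model(const struct mount_model *mnts, size_t n,
			     const char *fstype)
{
	for (size_t i = 0; i < n; i++)
		if (mnts[i].at_fs_root && strcmp(mnts[i].fstype, fstype) == 0)
			return true;
	return false;
}

/*
 * A caller outside the initial user namespace may create a fresh mount
 * of fstype only if an equivalent mount is already visible to it, so a
 * new mount can never reveal more than the caller could already see.
 */
static int may_mount_model(const struct mount_model *mnts, size_t n,
			   const char *fstype, bool in_init_userns)
{
	if (in_init_userns)
		return 0;
	return fs_visible_model(mnts, n, fstype) ? 0 : -EPERM;
}
```

In the chroot-jail case described above, the jail's namespace has no
cgroupfs mount, so the model returns -EPERM; once cgroupfs is already
visible, a new mount adds nothing over a bind mount.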

Eric


> Signed-off-by: Aditya Kali <adityakali@google.com>
> ---
>  fs/kernfs/mount.c      | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/kernfs.h |  2 ++
>  kernel/cgroup.c        | 47 +++++++++++++++++++++++++++++++++++++++++++++--
>  3 files changed, 95 insertions(+), 2 deletions(-)
>
> diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
> index f973ae9..e334f45 100644
> --- a/fs/kernfs/mount.c
> +++ b/fs/kernfs/mount.c
> @@ -62,6 +62,54 @@ struct kernfs_root *kernfs_root_from_sb(struct super_block *sb)
>  	return NULL;
>  }
>  
> +/**
> + * kernfs_make_root - create new root dentry for the given kernfs_node.
> + * @sb: the kernfs super_block
> + * @kn: kernfs_node for which a dentry is needed
> + *
> + * This can used used by callers which want to mount only a part of the kernfs
> + * as root of the filesystem.
> + */
> +struct dentry *kernfs_obtain_root(struct super_block *sb,
> +				  struct kernfs_node *kn)
> +{
> +	struct dentry *dentry;
> +	struct inode *inode;
> +
> +	BUG_ON(sb->s_op != &kernfs_sops);
> +
> +	/* inode for the given kernfs_node should already exist. */
> +	inode = ilookup(sb, kn->ino);
> +	if (!inode) {
> +		pr_debug("kernfs: could not get inode for '");
> +		pr_cont_kernfs_path(kn);
> +		pr_cont("'.\n");
> +		return ERR_PTR(-EINVAL);
> +	}
> +
> +	/* instantiate and link root dentry */
> +	dentry = d_obtain_root(inode);
> +	if (!dentry) {
> +		pr_debug("kernfs: could not get dentry for '");
> +		pr_cont_kernfs_path(kn);
> +		pr_cont("'.\n");
> +		return ERR_PTR(-ENOMEM);
> +	}
> +
> +	/* If this is a new dentry, set it up. We need kernfs_mutex because this
> +	 * may be called by callers other than kernfs_fill_super. */
> +	mutex_lock(&kernfs_mutex);
> +	if (!dentry->d_fsdata) {
> +		kernfs_get(kn);
> +		dentry->d_fsdata = kn;
> +	} else {
> +		WARN_ON(dentry->d_fsdata != kn);
> +	}
> +	mutex_unlock(&kernfs_mutex);
> +
> +	return dentry;
> +}
> +
>  static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
>  {
>  	struct kernfs_super_info *info = kernfs_info(sb);
> diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
> index 3c2be75..b9538e0 100644
> --- a/include/linux/kernfs.h
> +++ b/include/linux/kernfs.h
> @@ -274,6 +274,8 @@ void kernfs_put(struct kernfs_node *kn);
>  struct kernfs_node *kernfs_node_from_dentry(struct dentry *dentry);
>  struct kernfs_root *kernfs_root_from_sb(struct super_block *sb);
>  
> +struct dentry *kernfs_obtain_root(struct super_block *sb,
> +				  struct kernfs_node *kn);
>  struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
>  				       unsigned int flags, void *priv);
>  void kernfs_destroy_root(struct kernfs_root *root);
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 7e5d597..250aaec 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -1302,6 +1302,13 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
>  
>  	memset(opts, 0, sizeof(*opts));
>  
> +	/* Implicitly add CGRP_ROOT_SANE_BEHAVIOR if inside a non-init cgroup
> +	 * namespace.
> +	 */
> +	if (current->nsproxy->cgroup_ns != &init_cgroup_ns) {
> +		opts->flags |= CGRP_ROOT_SANE_BEHAVIOR;
> +	}
> +
>  	while ((token = strsep(&o, ",")) != NULL) {
>  		nr_opts++;
>  
> @@ -1391,7 +1398,7 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
>  
>  	if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
>  		pr_warn("sane_behavior: this is still under development and its behaviors will change, proceed at your own risk\n");
> -		if (nr_opts != 1) {
> +		if (nr_opts > 1) {
>  			pr_err("sane_behavior: no other mount options allowed\n");
>  			return -EINVAL;
>  		}
> @@ -1581,6 +1588,15 @@ static void init_cgroup_root(struct cgroup_root *root,
>  		set_bit(CGRP_CPUSET_CLONE_CHILDREN, &root->cgrp.flags);
>  }
>  
> +struct dentry *cgroupns_get_root(struct super_block *sb,
> +				 struct cgroup_namespace *ns)
> +{
> +	struct dentry *nsdentry;
> +
> +	nsdentry = kernfs_obtain_root(sb, ns->root_cgrp->kn);
> +	return nsdentry;
> +}
> +
>  static int cgroup_setup_root(struct cgroup_root *root, unsigned int ss_mask)
>  {
>  	LIST_HEAD(tmp_links);
> @@ -1685,6 +1701,14 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
>  	int ret;
>  	int i;
>  	bool new_sb;
> +	struct cgroup_namespace *ns =
> +		get_cgroup_ns(current->nsproxy->cgroup_ns);
> +
> +	/* Check if the caller has permission to mount. */
> +	if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) {
> +		put_cgroup_ns(ns);
> +		return ERR_PTR(-EPERM);
> +	}
>  
>  	/*
>  	 * The first time anyone tries to mount a cgroup, enable the list
> @@ -1817,11 +1841,28 @@ out_free:
>  	kfree(opts.release_agent);
>  	kfree(opts.name);
>  
> -	if (ret)
> +	if (ret) {
> +		put_cgroup_ns(ns);
>  		return ERR_PTR(ret);
> +	}
>  
>  	dentry = kernfs_mount(fs_type, flags, root->kf_root,
>  				CGROUP_SUPER_MAGIC, &new_sb);
> +
> +	if (!IS_ERR(dentry) && (root == &cgrp_dfl_root)) {
> +		/* If this mount is for the default hierarchy in non-init cgroup
> +		 * namespace, then instead of root cgroup's dentry, we return
> +		 * the dentry corresponding to the cgroupns->root_cgrp.
> +		 */
> +		if (ns != &init_cgroup_ns) {
> +			struct dentry *nsdentry;
> +
> +			nsdentry = cgroupns_get_root(dentry->d_sb, ns);
> +			dput(dentry);
> +			dentry = nsdentry;
> +		}
> +	}
> +
>  	if (IS_ERR(dentry) || !new_sb)
>  		cgroup_put(&root->cgrp);
>  
> @@ -1834,6 +1875,7 @@ out_free:
>  		deactivate_super(pinned_sb);
>  	}
>  
> +	put_cgroup_ns(ns);
>  	return dentry;
>  }
>  
> @@ -1862,6 +1904,7 @@ static struct file_system_type cgroup_fs_type = {
>  	.name = "cgroup",
>  	.mount = cgroup_mount,
>  	.kill_sb = cgroup_kill_sb,
> +	.fs_flags = FS_USERNS_MOUNT,
>  };
>  
>  static struct kobject *cgroup_kobj;

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns
       [not found]       ` <CALCETrXTaZ3SJ_t-gnbc93BVZXg-912NqO78kFd0Tpi-5-dZoQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2014-11-01  2:59         ` Eric W. Biederman
  2014-11-03 23:12           ` Aditya Kali
  1 sibling, 0 replies; 384+ messages in thread
From: Eric W. Biederman @ 2014-11-01  2:59 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel, Tejun Heo,
	cgroups, Ingo Molnar

Andy Lutomirski <luto@amacapital.net> writes:
>> @@ -1862,6 +1904,7 @@ static struct file_system_type cgroup_fs_type = {
>>         .name = "cgroup",
>>         .mount = cgroup_mount,
>>         .kill_sb = cgroup_kill_sb,
>> +       .fs_flags = FS_USERNS_MOUNT,
>
> Aargh, another one!  Eric, can you either ack or nack my patch?
> Because if my patch goes in, then this line may need to change.  Or
> not, but if a stable release with cgroupfs and without my patch
> happens, then we'll have an ABI break.

cgroupfs has no device nodes.  So as long as we are consistent in any
given release what happens here is orthogonal.

I don't remember if we have managed to get the original problem fixed
with the trivial backportable solution.  I think so.

My apologies for not getting to that.  I haven't even had time to shepherd
through the associated regression fix.  I probably just lost track of
them, but I haven't found the Tested-By's for it yet.

Nor have I had time to dig through and figure out how to safely deal
with umount -l aka MNT_DETACH.

Along with the question about what to do with nodev, there is also
your patch about nosuid.

Starting in about 5 minutes I am going to be mostly offline until
sometime in the 3rd week of November as I haul all of my stuff across
the country to someplace that actually has winter and where my allergies
don't kill me.

I am going to have to review and merge a lot of code as soon as I am
back to being a programmer full time again.  There is a lot of
interesting stuff coming in right now.

Eric

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns
  2014-11-01  2:59         ` Eric W. Biederman
@ 2014-11-01  3:29             ` Andy Lutomirski
  -1 siblings, 0 replies; 384+ messages in thread
From: Andy Lutomirski @ 2014-11-01  3:29 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Ingo Molnar

On Fri, Oct 31, 2014 at 7:59 PM, Eric W. Biederman
<ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
> Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> writes:
>>> @@ -1862,6 +1904,7 @@ static struct file_system_type cgroup_fs_type = {
>>>         .name = "cgroup",
>>>         .mount = cgroup_mount,
>>>         .kill_sb = cgroup_kill_sb,
>>> +       .fs_flags = FS_USERNS_MOUNT,
>>
>> Aargh, another one!  Eric, can you either ack or nack my patch?
>> Because if my patch goes in, then this line may need to change.  Or
>> not, but if a stable release with cgroupfs and without my patch
>> happens, then we'll have an ABI break.
>
> cgroupfs has no device nodes.  So as long as we are consistent in any
> given release what happens here is orthogonal.
>
> I don't remember if we have managed to get the original problem fixed
> with the trivial backportable solution.  I think so.

I don't remember.  I think the problem is still there, since I think
my patch still applies, and my patch conflicts with your fix.  It's
been long enough that I'm not sure it's worth applying your patch as
an interim fix.

>
> My apologies for not getting to that.  I haven't even had time to
> shepherd through the associated regression fix.  I probably just lost
> track of them, but I haven't found the Tested-By's for it yet.

No worries.  I've tested it, but it's my patch, so there's a big grain
of salt there.  I think Serge tested it, too.

>
> Nor have I had time to dig through and figure out how to safely deal
> with umount -l aka MOUNT_DETACH.

If you're talking about the do_remount_sb thing, that's already in Linus' tree.

>
> Along with the question about what to do with nodev, there is also
> your patch about nosuid.

The nosuid patch has a couple versions, and I'm not sure which version
I prefer.  It's certainly debatable.

>
> Starting in about 5 minutes I am going to be mostly offline until
> sometime in the 3rd week in November as I haul all of my stuff across
> the country to someplace that actually has winter and where my
> allergies don't kill me.

Have fun!

--Andy

>
> I am going to have to review and merge a lot of code as soon as I am
> back to being a programmer full time again.  There is a lot of
> interesting stuff coming in right now.
>
> Eric



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns
       [not found]         ` <87y4rvrakn.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2014-11-03 22:43           ` Aditya Kali
       [not found]             ` <CAGr1F2Hd_PS_AscBGMXdZC9qkHGRUp-MeQvJksDOQkRBB3RGoA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2014-11-04 13:46               ` Tejun Heo
  2014-11-03 22:46           ` Aditya Kali
  1 sibling, 2 replies; 384+ messages in thread
From: Aditya Kali @ 2014-11-03 22:43 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andy Lutomirski, Tejun Heo,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Ingo Molnar

On Fri, Oct 31, 2014 at 6:09 PM, Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
wrote:

> Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> writes:
>
> > This patch enables cgroup mounting inside userns when a process
> > has appropriate privileges. The cgroup filesystem mounted is
> > rooted at the cgroupns-root. Thus, in a container-setup, only
> > the hierarchy under the cgroupns-root is exposed inside the container.
> > This allows container management tools to run inside the containers
> > without depending on any global state.
> > In order to support this, a new kernfs API is added to look up the
> > dentry for the cgroupns-root.
>
> There is a misdesign in this.  Because files already exist we need the
> protections that are present in proc and sysfs that only allow you to
> mount the filesystem if it is already mounted.  Otherwise you can wind
> up mounting this cgroupfs in a chroot jail when the global root would
> not like you to see it.  cgroupfs isn't as bad as proc and sys but there
> is at the very least an information leak in mounting it.
>
>
I think simply mounting the cgroupfs doesn't give you any more information
than what you already know about the system, especially if the
visibility is restricted to the process's cgroupns-root. The cgroups
still won't be writable by the user, so I think it should be fine to allow
mounting?



> Given that we are effectively performing a bind mount in this patch, and
> that we need to require cgroupfs be mounted anyway (to be safe), I don't
> see the point of this change.
>
> If we could change the set of cgroups visible in cgroupfs I could
> probably see the point.  But as it is this change seems to be pointless.
>
>
I agree that this is effectively bind-mounting, but doing this in the kernel
makes it really convenient for userspace. The process that sets up the
container doesn't need to care whether it should bind-mount cgroupfs inside
the container or not. The tasks inside the container can mount cgroupfs on
an as-needed basis. The root container manager can simply unshare cgroupns and
forget about the internal setup. I think this is useful simply because
it makes life much simpler for userspace.



> Eric
>
>
> > Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> > ---
> >  fs/kernfs/mount.c      | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
> >  include/linux/kernfs.h |  2 ++
> >  kernel/cgroup.c        | 47 +++++++++++++++++++++++++++++++++++++++++++++--
> >  3 files changed, 95 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
> > index f973ae9..e334f45 100644
> > --- a/fs/kernfs/mount.c
> > +++ b/fs/kernfs/mount.c
> > @@ -62,6 +62,54 @@ struct kernfs_root *kernfs_root_from_sb(struct super_block *sb)
> >       return NULL;
> >  }
> >
> > +/**
> > + * kernfs_obtain_root - create new root dentry for the given kernfs_node.
> > + * @sb: the kernfs super_block
> > + * @kn: kernfs_node for which a dentry is needed
> > + *
> > + * This can be used by callers which want to mount only a part of the kernfs
> > + * as root of the filesystem.
> > + */
> > +struct dentry *kernfs_obtain_root(struct super_block *sb,
> > +                               struct kernfs_node *kn)
> > +{
> > +     struct dentry *dentry;
> > +     struct inode *inode;
> > +
> > +     BUG_ON(sb->s_op != &kernfs_sops);
> > +
> > +     /* inode for the given kernfs_node should already exist. */
> > +     inode = ilookup(sb, kn->ino);
> > +     if (!inode) {
> > +             pr_debug("kernfs: could not get inode for '");
> > +             pr_cont_kernfs_path(kn);
> > +             pr_cont("'.\n");
> > +             return ERR_PTR(-EINVAL);
> > +     }
> > +
> > +     /* instantiate and link root dentry */
> > +     dentry = d_obtain_root(inode);
> > +     if (!dentry) {
> > +             pr_debug("kernfs: could not get dentry for '");
> > +             pr_cont_kernfs_path(kn);
> > +             pr_cont("'.\n");
> > +             return ERR_PTR(-ENOMEM);
> > +     }
> > +
> > +     /* If this is a new dentry, set it up. We need kernfs_mutex because this
> > +      * may be called by callers other than kernfs_fill_super. */
> > +     mutex_lock(&kernfs_mutex);
> > +     if (!dentry->d_fsdata) {
> > +             kernfs_get(kn);
> > +             dentry->d_fsdata = kn;
> > +     } else {
> > +             WARN_ON(dentry->d_fsdata != kn);
> > +     }
> > +     mutex_unlock(&kernfs_mutex);
> > +
> > +     return dentry;
> > +}
> > +
> >  static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
> >  {
> >       struct kernfs_super_info *info = kernfs_info(sb);
> > diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
> > index 3c2be75..b9538e0 100644
> > --- a/include/linux/kernfs.h
> > +++ b/include/linux/kernfs.h
> > @@ -274,6 +274,8 @@ void kernfs_put(struct kernfs_node *kn);
> >  struct kernfs_node *kernfs_node_from_dentry(struct dentry *dentry);
> >  struct kernfs_root *kernfs_root_from_sb(struct super_block *sb);
> >
> > +struct dentry *kernfs_obtain_root(struct super_block *sb,
> > +                               struct kernfs_node *kn);
> >  struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
> >                                      unsigned int flags, void *priv);
> >  void kernfs_destroy_root(struct kernfs_root *root);
> > diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> > index 7e5d597..250aaec 100644
> > --- a/kernel/cgroup.c
> > +++ b/kernel/cgroup.c
> > @@ -1302,6 +1302,13 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
> >
> >       memset(opts, 0, sizeof(*opts));
> >
> > +     /* Implicitly add CGRP_ROOT_SANE_BEHAVIOR if inside a non-init cgroup
> > +      * namespace.
> > +      */
> > +     if (current->nsproxy->cgroup_ns != &init_cgroup_ns) {
> > +             opts->flags |= CGRP_ROOT_SANE_BEHAVIOR;
> > +     }
> > +
> >       while ((token = strsep(&o, ",")) != NULL) {
> >               nr_opts++;
> >
> > @@ -1391,7 +1398,7 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
> >
> >       if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
> >               pr_warn("sane_behavior: this is still under development and its behaviors will change, proceed at your own risk\n");
> > -             if (nr_opts != 1) {
> > +             if (nr_opts > 1) {
> >                       pr_err("sane_behavior: no other mount options allowed\n");
> >                       return -EINVAL;
> >               }
> > @@ -1581,6 +1588,15 @@ static void init_cgroup_root(struct cgroup_root *root,
> >               set_bit(CGRP_CPUSET_CLONE_CHILDREN, &root->cgrp.flags);
> >  }
> >
> > +struct dentry *cgroupns_get_root(struct super_block *sb,
> > +                              struct cgroup_namespace *ns)
> > +{
> > +     struct dentry *nsdentry;
> > +
> > +     nsdentry = kernfs_obtain_root(sb, ns->root_cgrp->kn);
> > +     return nsdentry;
> > +}
> > +
> >  static int cgroup_setup_root(struct cgroup_root *root, unsigned int ss_mask)
> >  {
> >       LIST_HEAD(tmp_links);
> > @@ -1685,6 +1701,14 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
> >       int ret;
> >       int i;
> >       bool new_sb;
> > +     struct cgroup_namespace *ns =
> > +             get_cgroup_ns(current->nsproxy->cgroup_ns);
> > +
> > +     /* Check if the caller has permission to mount. */
> > +     if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) {
> > +             put_cgroup_ns(ns);
> > +             return ERR_PTR(-EPERM);
> > +     }
> >
> >       /*
> >        * The first time anyone tries to mount a cgroup, enable the list
> > @@ -1817,11 +1841,28 @@ out_free:
> >       kfree(opts.release_agent);
> >       kfree(opts.name);
> >
> > -     if (ret)
> > +     if (ret) {
> > +             put_cgroup_ns(ns);
> >               return ERR_PTR(ret);
> > +     }
> >
> >       dentry = kernfs_mount(fs_type, flags, root->kf_root,
> >                               CGROUP_SUPER_MAGIC, &new_sb);
> > +
> > +     if (!IS_ERR(dentry) && (root == &cgrp_dfl_root)) {
> > +             /* If this mount is for the default hierarchy in non-init cgroup
> > +              * namespace, then instead of root cgroup's dentry, we return
> > +              * the dentry corresponding to the cgroupns->root_cgrp.
> > +              */
> > +             if (ns != &init_cgroup_ns) {
> > +                     struct dentry *nsdentry;
> > +
> > +                     nsdentry = cgroupns_get_root(dentry->d_sb, ns);
> > +                     dput(dentry);
> > +                     dentry = nsdentry;
> > +             }
> > +     }
> > +
> >       if (IS_ERR(dentry) || !new_sb)
> >               cgroup_put(&root->cgrp);
> >
> > @@ -1834,6 +1875,7 @@ out_free:
> >               deactivate_super(pinned_sb);
> >       }
> >
> > +     put_cgroup_ns(ns);
> >       return dentry;
> >  }
> >
> > @@ -1862,6 +1904,7 @@ static struct file_system_type cgroup_fs_type = {
> >       .name = "cgroup",
> >       .mount = cgroup_mount,
> >       .kill_sb = cgroup_kill_sb,
> > +     .fs_flags = FS_USERNS_MOUNT,
> >  };
> >
> >  static struct kobject *cgroup_kobj;
>



-- 
Aditya

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns
       [not found]         ` <87y4rvrakn.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  2014-11-03 22:43           ` Aditya Kali
@ 2014-11-03 22:46           ` Aditya Kali
  1 sibling, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-11-03 22:46 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andy Lutomirski, Tejun Heo,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Ingo Molnar

(sorry for accidental non-plain-text response earlier).

On Fri, Oct 31, 2014 at 6:09 PM, Eric W. Biederman
<ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
> Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> writes:
>
>> This patch enables cgroup mounting inside userns when a process
>> has appropriate privileges. The cgroup filesystem mounted is
>> rooted at the cgroupns-root. Thus, in a container-setup, only
>> the hierarchy under the cgroupns-root is exposed inside the container.
>> This allows container management tools to run inside the containers
>> without depending on any global state.
>> In order to support this, a new kernfs API is added to look up the
>> dentry for the cgroupns-root.
>
> There is a misdesign in this.  Because files already exist we need the
> protections that are present in proc and sysfs that only allow you to
> mount the filesystem if it is already mounted.  Otherwise you can wind
> up mounting this cgroupfs in a chroot jail when the global root would
> not like you to see it.  cgroupfs isn't as bad as proc and sys but there
> is at the very least an information leak in mounting it.
>

I think simply mounting the cgroupfs doesn't give you any more
information than what you already know about the system,
especially if the visibility is restricted to the process's
cgroupns-root. The cgroups still won't be writable by the user, so I
think it should be fine to allow mounting?
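
A minimal userspace sketch of what "visibility restricted to the
cgroupns-root" could mean for a single path. This is only a model under
stated assumptions, not the kernel's implementation:
cgroupns_relative_path() is a hypothetical helper, and treating paths
outside the namespace root as simply invisible is an assumption of this
sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not a kernel or libc API): return the part of
 * @path that is visible from a namespace rooted at @ns_root, or NULL
 * when @path lies outside that subtree.
 */
const char *cgroupns_relative_path(const char *ns_root, const char *path)
{
	size_t n;

	assert(ns_root && path);

	/* Ignore trailing slashes so that a root of "/" matches everything. */
	n = strlen(ns_root);
	while (n > 0 && ns_root[n - 1] == '/')
		n--;

	if (strncmp(path, ns_root, n) != 0)
		return NULL;		/* different subtree */
	if (path[n] == '\0')
		return "/";		/* the namespace root itself */
	if (path[n] != '/')
		return NULL;		/* sibling that merely shares a prefix */
	return path + n;		/* path relative to the namespace root */
}
```

With the cgroup from the cover letter as the namespace root,
"/batchjobs/c_job_id1/sub" would be reported as "/sub", while a sibling
such as "/batchjobs/c_job_id10" would not be visible at all.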

> Given that we are effectively performing a bind mount in this patch, and
> that we need to require cgroupfs be mounted anyway (to be safe), I don't
> see the point of this change.
>
> If we could change the set of cgroups visible in cgroupfs I could
> probably see the point.  But as it is this change seems to be pointless.
>
>

I agree that this is effectively bind-mounting, but doing this in
the kernel makes it really convenient for userspace. The process that
sets up the container doesn't need to care whether it should
bind-mount cgroupfs inside the container or not. The tasks inside the
container can mount cgroupfs on an as-needed basis. The root container
manager can simply unshare cgroupns and forget about the internal
setup. I think this is useful simply because it makes life
much simpler for userspace.
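
The userspace flow described above can be sketched roughly as follows.
This is an illustration under stated assumptions, not code from the
patch set: enter_new_cgroupns() and mount_cgroupfs() are hypothetical
helpers, CLONE_NEWCGROUP falls back to 0x02000000 (the value mainline
kernels define) when the libc headers lack it, and the filesystem type
is left to the caller ("cgroup" with this series applied; mainline
kernels use "cgroup2" for the unified hierarchy).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <stddef.h>
#include <sys/mount.h>

#ifndef CLONE_NEWCGROUP
#define CLONE_NEWCGROUP 0x02000000	/* assumption: mainline value */
#endif

/*
 * Hypothetical helper for the container manager: leave the parent's
 * cgroup namespace.  Returns 0 on success or a negative errno value.
 */
int enter_new_cgroupns(void)
{
	return unshare(CLONE_NEWCGROUP) == 0 ? 0 : -errno;
}

/*
 * Hypothetical helper for a task inside the container: mount cgroupfs
 * at @target when it is needed.  No bind mount has to be prepared by
 * the manager beforehand.  Returns 0 or a negative errno value.
 */
int mount_cgroupfs(const char *fstype, const char *target)
{
	assert(fstype && target);
	return mount("none", target, fstype, 0, NULL) == 0 ? 0 : -errno;
}
```

The manager would call enter_new_cgroupns() once while setting up the
container; any sufficiently privileged task inside could later call
mount_cgroupfs() on a directory of its choice. Both calls need
CAP_SYS_ADMIN in the relevant user namespace, so the sketch fails with
a negative errno when run unprivileged.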

> Eric
>
>
>> Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
>> ---
>>  fs/kernfs/mount.c      | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
>>  include/linux/kernfs.h |  2 ++
>>  kernel/cgroup.c        | 47 +++++++++++++++++++++++++++++++++++++++++++++--
>>  3 files changed, 95 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
>> index f973ae9..e334f45 100644
>> --- a/fs/kernfs/mount.c
>> +++ b/fs/kernfs/mount.c
>> @@ -62,6 +62,54 @@ struct kernfs_root *kernfs_root_from_sb(struct super_block *sb)
>>       return NULL;
>>  }
>>
>> +/**
>> + * kernfs_obtain_root - create new root dentry for the given kernfs_node.
>> + * @sb: the kernfs super_block
>> + * @kn: kernfs_node for which a dentry is needed
>> + *
>> + * This can be used by callers which want to mount only a part of the kernfs
>> + * as root of the filesystem.
>> + */
>> +struct dentry *kernfs_obtain_root(struct super_block *sb,
>> +                               struct kernfs_node *kn)
>> +{
>> +     struct dentry *dentry;
>> +     struct inode *inode;
>> +
>> +     BUG_ON(sb->s_op != &kernfs_sops);
>> +
>> +     /* inode for the given kernfs_node should already exist. */
>> +     inode = ilookup(sb, kn->ino);
>> +     if (!inode) {
>> +             pr_debug("kernfs: could not get inode for '");
>> +             pr_cont_kernfs_path(kn);
>> +             pr_cont("'.\n");
>> +             return ERR_PTR(-EINVAL);
>> +     }
>> +
>> +     /* instantiate and link root dentry */
>> +     dentry = d_obtain_root(inode);
>> +     if (!dentry) {
>> +             pr_debug("kernfs: could not get dentry for '");
>> +             pr_cont_kernfs_path(kn);
>> +             pr_cont("'.\n");
>> +             return ERR_PTR(-ENOMEM);
>> +     }
>> +
>> +     /* If this is a new dentry, set it up. We need kernfs_mutex because this
>> +      * may be called by callers other than kernfs_fill_super. */
>> +     mutex_lock(&kernfs_mutex);
>> +     if (!dentry->d_fsdata) {
>> +             kernfs_get(kn);
>> +             dentry->d_fsdata = kn;
>> +     } else {
>> +             WARN_ON(dentry->d_fsdata != kn);
>> +     }
>> +     mutex_unlock(&kernfs_mutex);
>> +
>> +     return dentry;
>> +}
>> +
>>  static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
>>  {
>>       struct kernfs_super_info *info = kernfs_info(sb);
>> diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
>> index 3c2be75..b9538e0 100644
>> --- a/include/linux/kernfs.h
>> +++ b/include/linux/kernfs.h
>> @@ -274,6 +274,8 @@ void kernfs_put(struct kernfs_node *kn);
>>  struct kernfs_node *kernfs_node_from_dentry(struct dentry *dentry);
>>  struct kernfs_root *kernfs_root_from_sb(struct super_block *sb);
>>
>> +struct dentry *kernfs_obtain_root(struct super_block *sb,
>> +                               struct kernfs_node *kn);
>>  struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
>>                                      unsigned int flags, void *priv);
>>  void kernfs_destroy_root(struct kernfs_root *root);
>> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
>> index 7e5d597..250aaec 100644
>> --- a/kernel/cgroup.c
>> +++ b/kernel/cgroup.c
>> @@ -1302,6 +1302,13 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
>>
>>       memset(opts, 0, sizeof(*opts));
>>
>> +     /* Implicitly add CGRP_ROOT_SANE_BEHAVIOR if inside a non-init cgroup
>> +      * namespace.
>> +      */
>> +     if (current->nsproxy->cgroup_ns != &init_cgroup_ns) {
>> +             opts->flags |= CGRP_ROOT_SANE_BEHAVIOR;
>> +     }
>> +
>>       while ((token = strsep(&o, ",")) != NULL) {
>>               nr_opts++;
>>
>> @@ -1391,7 +1398,7 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
>>
>>       if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
>>               pr_warn("sane_behavior: this is still under development and its behaviors will change, proceed at your own risk\n");
>> -             if (nr_opts != 1) {
>> +             if (nr_opts > 1) {
>>                       pr_err("sane_behavior: no other mount options allowed\n");
>>                       return -EINVAL;
>>               }
>> @@ -1581,6 +1588,15 @@ static void init_cgroup_root(struct cgroup_root *root,
>>               set_bit(CGRP_CPUSET_CLONE_CHILDREN, &root->cgrp.flags);
>>  }
>>
>> +struct dentry *cgroupns_get_root(struct super_block *sb,
>> +                              struct cgroup_namespace *ns)
>> +{
>> +     struct dentry *nsdentry;
>> +
>> +     nsdentry = kernfs_obtain_root(sb, ns->root_cgrp->kn);
>> +     return nsdentry;
>> +}
>> +
>>  static int cgroup_setup_root(struct cgroup_root *root, unsigned int ss_mask)
>>  {
>>       LIST_HEAD(tmp_links);
>> @@ -1685,6 +1701,14 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
>>       int ret;
>>       int i;
>>       bool new_sb;
>> +     struct cgroup_namespace *ns =
>> +             get_cgroup_ns(current->nsproxy->cgroup_ns);
>> +
>> +     /* Check if the caller has permission to mount. */
>> +     if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) {
>> +             put_cgroup_ns(ns);
>> +             return ERR_PTR(-EPERM);
>> +     }
>>
>>       /*
>>        * The first time anyone tries to mount a cgroup, enable the list
>> @@ -1817,11 +1841,28 @@ out_free:
>>       kfree(opts.release_agent);
>>       kfree(opts.name);
>>
>> -     if (ret)
>> +     if (ret) {
>> +             put_cgroup_ns(ns);
>>               return ERR_PTR(ret);
>> +     }
>>
>>       dentry = kernfs_mount(fs_type, flags, root->kf_root,
>>                               CGROUP_SUPER_MAGIC, &new_sb);
>> +
>> +     if (!IS_ERR(dentry) && (root == &cgrp_dfl_root)) {
>> +             /* If this mount is for the default hierarchy in non-init cgroup
>> +              * namespace, then instead of root cgroup's dentry, we return
>> +              * the dentry corresponding to the cgroupns->root_cgrp.
>> +              */
>> +             if (ns != &init_cgroup_ns) {
>> +                     struct dentry *nsdentry;
>> +
>> +                     nsdentry = cgroupns_get_root(dentry->d_sb, ns);
>> +                     dput(dentry);
>> +                     dentry = nsdentry;
>> +             }
>> +     }
>> +
>>       if (IS_ERR(dentry) || !new_sb)
>>               cgroup_put(&root->cgrp);
>>
>> @@ -1834,6 +1875,7 @@ out_free:
>>               deactivate_super(pinned_sb);
>>       }
>>
>> +     put_cgroup_ns(ns);
>>       return dentry;
>>  }
>>
>> @@ -1862,6 +1904,7 @@ static struct file_system_type cgroup_fs_type = {
>>       .name = "cgroup",
>>       .mount = cgroup_mount,
>>       .kill_sb = cgroup_kill_sb,
>> +     .fs_flags = FS_USERNS_MOUNT,
>>  };
>>
>>  static struct kobject *cgroup_kobj;



-- 
Aditya

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns
       [not found]         ` <87y4rvrakn.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2014-11-03 22:46           ` Aditya Kali
  2014-11-03 22:46           ` Aditya Kali
  1 sibling, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-11-03 22:46 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Tejun Heo, Li Zefan, Serge Hallyn, Andy Lutomirski, cgroups,
	linux-kernel, Linux API, Ingo Molnar, Linux Containers,
	Rohit Jnagal

(sorry for accidental non-plain-text response earlier).

On Fri, Oct 31, 2014 at 6:09 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> Aditya Kali <adityakali@google.com> writes:
>
>> This patch enables cgroup mounting inside userns when a process
>> has appropriate privileges. The cgroup filesystem mounted is
>> rooted at the cgroupns-root. Thus, in a container-setup, only
>> the hierarchy under the cgroupns-root is exposed inside the container.
>> This allows container management tools to run inside the containers
>> without depending on any global state.
>> In order to support this, a new kernfs API is added to look up the
>> dentry for the cgroupns-root.
>
> There is a misdesign in this.  Because files already exist we need the
> protections that are present in proc and sysfs that only allow you to
> mount the filesystem if it is already mounted.  Otherwise you can wind
> up mounting this cgroupfs in a chroot jail when the global root would
> not like you to see it.  cgroupfs isn't as bad as proc and sys but there
> is at the very least an information leak in mounting it.
>

I think simply mounting the cgroupfs doesn't give you any more
information than what you already know about the system,
especially if the visibility is restricted to the process's
cgroupns-root. The cgroups still won't be writable by the user, so I
think it should be fine to allow mounting?

> Given that we are effectively performing a bind mount in this patch, and
> that we need to require cgroupfs be mounted anyway (to be safe), I don't
> see the point of this change.
>
> If we could change the set of cgroups visible in cgroupfs I could
> probably see the point.  But as it is this change seems to be pointless.
>

I agree that this is effectively bind-mounting, but doing this in
the kernel makes it really convenient for userspace. The process that
sets up the container doesn't need to care whether it should
bind-mount cgroupfs inside the container or not. The tasks inside the
container can mount cgroupfs on an as-needed basis. The root container
manager can simply unshare cgroupns and forget about the internal
setup. I think this is useful simply because it makes life
much simpler for userspace.

> Eric
>
>
>> Signed-off-by: Aditya Kali <adityakali@google.com>
>> ---
>>  fs/kernfs/mount.c      | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
>>  include/linux/kernfs.h |  2 ++
>>  kernel/cgroup.c        | 47 +++++++++++++++++++++++++++++++++++++++++++++--
>>  3 files changed, 95 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
>> index f973ae9..e334f45 100644
>> --- a/fs/kernfs/mount.c
>> +++ b/fs/kernfs/mount.c
>> @@ -62,6 +62,54 @@ struct kernfs_root *kernfs_root_from_sb(struct super_block *sb)
>>       return NULL;
>>  }
>>
>> +/**
>> + * kernfs_obtain_root - create new root dentry for the given kernfs_node.
>> + * @sb: the kernfs super_block
>> + * @kn: kernfs_node for which a dentry is needed
>> + *
>> + * This can be used by callers which want to mount only a part of the kernfs
>> + * as root of the filesystem.
>> + */
>> +struct dentry *kernfs_obtain_root(struct super_block *sb,
>> +                               struct kernfs_node *kn)
>> +{
>> +     struct dentry *dentry;
>> +     struct inode *inode;
>> +
>> +     BUG_ON(sb->s_op != &kernfs_sops);
>> +
>> +     /* inode for the given kernfs_node should already exist. */
>> +     inode = ilookup(sb, kn->ino);
>> +     if (!inode) {
>> +             pr_debug("kernfs: could not get inode for '");
>> +             pr_cont_kernfs_path(kn);
>> +             pr_cont("'.\n");
>> +             return ERR_PTR(-EINVAL);
>> +     }
>> +
>> +     /* instantiate and link root dentry */
>> +     dentry = d_obtain_root(inode);
>> +     if (!dentry) {
>> +             pr_debug("kernfs: could not get dentry for '");
>> +             pr_cont_kernfs_path(kn);
>> +             pr_cont("'.\n");
>> +             return ERR_PTR(-ENOMEM);
>> +     }
>> +
>> +     /* If this is a new dentry, set it up. We need kernfs_mutex because this
>> +      * may be called by callers other than kernfs_fill_super. */
>> +     mutex_lock(&kernfs_mutex);
>> +     if (!dentry->d_fsdata) {
>> +             kernfs_get(kn);
>> +             dentry->d_fsdata = kn;
>> +     } else {
>> +             WARN_ON(dentry->d_fsdata != kn);
>> +     }
>> +     mutex_unlock(&kernfs_mutex);
>> +
>> +     return dentry;
>> +}
>> +
>>  static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
>>  {
>>       struct kernfs_super_info *info = kernfs_info(sb);
>> diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
>> index 3c2be75..b9538e0 100644
>> --- a/include/linux/kernfs.h
>> +++ b/include/linux/kernfs.h
>> @@ -274,6 +274,8 @@ void kernfs_put(struct kernfs_node *kn);
>>  struct kernfs_node *kernfs_node_from_dentry(struct dentry *dentry);
>>  struct kernfs_root *kernfs_root_from_sb(struct super_block *sb);
>>
>> +struct dentry *kernfs_obtain_root(struct super_block *sb,
>> +                               struct kernfs_node *kn);
>>  struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
>>                                      unsigned int flags, void *priv);
>>  void kernfs_destroy_root(struct kernfs_root *root);
>> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
>> index 7e5d597..250aaec 100644
>> --- a/kernel/cgroup.c
>> +++ b/kernel/cgroup.c
>> @@ -1302,6 +1302,13 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
>>
>>       memset(opts, 0, sizeof(*opts));
>>
>> +     /* Implicitly add CGRP_ROOT_SANE_BEHAVIOR if inside a non-init cgroup
>> +      * namespace.
>> +      */
>> +     if (current->nsproxy->cgroup_ns != &init_cgroup_ns) {
>> +             opts->flags |= CGRP_ROOT_SANE_BEHAVIOR;
>> +     }
>> +
>>       while ((token = strsep(&o, ",")) != NULL) {
>>               nr_opts++;
>>
>> @@ -1391,7 +1398,7 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
>>
>>       if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
>>               pr_warn("sane_behavior: this is still under development and its behaviors will change, proceed at your own risk\n");
>> -             if (nr_opts != 1) {
>> +             if (nr_opts > 1) {
>>                       pr_err("sane_behavior: no other mount options allowed\n");
>>                       return -EINVAL;
>>               }
>> @@ -1581,6 +1588,15 @@ static void init_cgroup_root(struct cgroup_root *root,
>>               set_bit(CGRP_CPUSET_CLONE_CHILDREN, &root->cgrp.flags);
>>  }
>>
>> +struct dentry *cgroupns_get_root(struct super_block *sb,
>> +                              struct cgroup_namespace *ns)
>> +{
>> +     struct dentry *nsdentry;
>> +
>> +     nsdentry = kernfs_obtain_root(sb, ns->root_cgrp->kn);
>> +     return nsdentry;
>> +}
>> +
>>  static int cgroup_setup_root(struct cgroup_root *root, unsigned int ss_mask)
>>  {
>>       LIST_HEAD(tmp_links);
>> @@ -1685,6 +1701,14 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
>>       int ret;
>>       int i;
>>       bool new_sb;
>> +     struct cgroup_namespace *ns =
>> +             get_cgroup_ns(current->nsproxy->cgroup_ns);
>> +
>> +     /* Check if the caller has permission to mount. */
>> +     if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) {
>> +             put_cgroup_ns(ns);
>> +             return ERR_PTR(-EPERM);
>> +     }
>>
>>       /*
>>        * The first time anyone tries to mount a cgroup, enable the list
>> @@ -1817,11 +1841,28 @@ out_free:
>>       kfree(opts.release_agent);
>>       kfree(opts.name);
>>
>> -     if (ret)
>> +     if (ret) {
>> +             put_cgroup_ns(ns);
>>               return ERR_PTR(ret);
>> +     }
>>
>>       dentry = kernfs_mount(fs_type, flags, root->kf_root,
>>                               CGROUP_SUPER_MAGIC, &new_sb);
>> +
>> +     if (!IS_ERR(dentry) && (root == &cgrp_dfl_root)) {
>> +             /* If this mount is for the default hierarchy in non-init cgroup
>> +              * namespace, then instead of root cgroup's dentry, we return
>> +              * the dentry corresponding to the cgroupns->root_cgrp.
>> +              */
>> +             if (ns != &init_cgroup_ns) {
>> +                     struct dentry *nsdentry;
>> +
>> +                     nsdentry = cgroupns_get_root(dentry->d_sb, ns);
>> +                     dput(dentry);
>> +                     dentry = nsdentry;
>> +             }
>> +     }
>> +
>>       if (IS_ERR(dentry) || !new_sb)
>>               cgroup_put(&root->cgrp);
>>
>> @@ -1834,6 +1875,7 @@ out_free:
>>               deactivate_super(pinned_sb);
>>       }
>>
>> +     put_cgroup_ns(ns);
>>       return dentry;
>>  }
>>
>> @@ -1862,6 +1904,7 @@ static struct file_system_type cgroup_fs_type = {
>>       .name = "cgroup",
>>       .mount = cgroup_mount,
>>       .kill_sb = cgroup_kill_sb,
>> +     .fs_flags = FS_USERNS_MOUNT,
>>  };
>>
>>  static struct kobject *cgroup_kobj;



-- 
Aditya

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns
@ 2014-11-03 22:46           ` Aditya Kali
  0 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-11-03 22:46 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Tejun Heo, Li Zefan, Serge Hallyn, Andy Lutomirski,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Linux API, Ingo Molnar,
	Linux Containers, Rohit Jnagal

(sorry for accidental non-plain-text response earlier).

On Fri, Oct 31, 2014 at 6:09 PM, Eric W. Biederman
<ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
> Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> writes:
>
>> This patch enables cgroup mounting inside userns when a process
>> as appropriate privileges. The cgroup filesystem mounted is
>> rooted at the cgroupns-root. Thus, in a container-setup, only
>> the hierarchy under the cgroupns-root is exposed inside the container.
>> This allows container management tools to run inside the containers
>> without depending on any global state.
>> In order to support this, a new kernfs api is added to lookup the
>> dentry for the cgroupns-root.
>
> There is a misdesign in this.  Because files already exist we need the
> protections that are present in proc and sysfs that only allow you to
> mount the filesystem if it is already mounted.  Otherwise you can wind
> up mounting this cgroupfs in a chroot jail when the global root would
> not like you to see it.  cgroupfs isn't as bad as proc and sys but there
> is at the very least an information leak in mounting it.
>

I think simply mounting the cgroupfs doesn't give you any more
information than what you already know about the system, especially
if the visibility is restricted to the process's cgroupns-root. The
cgroups still won't be writable by the user, so I think it should be
fine to allow mounting?

> Given that we are effectively performing a bind mount in this patch, and
> that we need to require cgroupfs be mounted anyway (to be safe).
>
> I don't see the point of this change.
>
> If we could change the set of cgroups or visible in cgroupfs I could
> probably see the point.  But as it is this change seems to be pointless.
>

I agree that this is effectively bind-mounting, but doing this in the
kernel makes it really convenient for userspace. The process that
sets up the container doesn't need to care whether it should
bind-mount cgroupfs inside the container or not. The tasks inside the
container can mount cgroupfs on an as-needed basis. The root container
manager can simply unshare cgroupns and forget about the internal
setup. I think this is useful just for the reason that it makes life
much simpler for userspace.
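
For illustration only, the visible effect of rooting the mount at the
cgroupns-root can be modelled as a path translation. The sketch below
is not kernel code: path_in_cgroupns() is a hypothetical helper, and it
assumes the namespace root is given without a trailing slash (except
for "/" itself).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not kernel code: given an absolute cgroup path and
 * the path of the namespace's root cgroup, return the path as it would
 * appear from inside the namespace, or NULL if the cgroup lies outside
 * the namespace root and would not be visible at all.
 */
static const char *path_in_cgroupns(const char *cgrp_path, const char *ns_root)
{
	size_t root_len;

	assert(cgrp_path != NULL && ns_root != NULL);

	/* A namespace rooted at "/" sees every path unchanged. */
	if (strcmp(ns_root, "/") == 0)
		return cgrp_path;

	root_len = strlen(ns_root);
	if (strncmp(cgrp_path, ns_root, root_len) != 0)
		return NULL;

	/* Exact match: the namespace root itself shows up as "/". */
	if (cgrp_path[root_len] == '\0')
		return "/";

	/* Reject siblings that merely share a string prefix. */
	if (cgrp_path[root_len] != '/')
		return NULL;

	return cgrp_path + root_len;
}
```

With a namespace rooted at /batchjobs/c_job_id1, the cgroup
/batchjobs/c_job_id1/sub would appear as /sub, while anything outside
that subtree maps to NULL (not visible).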

> Eric
>
>
>> Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
>> ---
>>  fs/kernfs/mount.c      | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
>>  include/linux/kernfs.h |  2 ++
>>  kernel/cgroup.c        | 47 +++++++++++++++++++++++++++++++++++++++++++++--
>>  3 files changed, 95 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
>> index f973ae9..e334f45 100644
>> --- a/fs/kernfs/mount.c
>> +++ b/fs/kernfs/mount.c
>> @@ -62,6 +62,54 @@ struct kernfs_root *kernfs_root_from_sb(struct super_block *sb)
>>       return NULL;
>>  }
>>
>> +/**
>> + * kernfs_make_root - create new root dentry for the given kernfs_node.
>> + * @sb: the kernfs super_block
>> + * @kn: kernfs_node for which a dentry is needed
>> + *
>> + * This can used used by callers which want to mount only a part of the kernfs
>> + * as root of the filesystem.
>> + */
>> +struct dentry *kernfs_obtain_root(struct super_block *sb,
>> +                               struct kernfs_node *kn)
>> +{
>> +     struct dentry *dentry;
>> +     struct inode *inode;
>> +
>> +     BUG_ON(sb->s_op != &kernfs_sops);
>> +
>> +     /* inode for the given kernfs_node should already exist. */
>> +     inode = ilookup(sb, kn->ino);
>> +     if (!inode) {
>> +             pr_debug("kernfs: could not get inode for '");
>> +             pr_cont_kernfs_path(kn);
>> +             pr_cont("'.\n");
>> +             return ERR_PTR(-EINVAL);
>> +     }
>> +
>> +     /* instantiate and link root dentry */
>> +     dentry = d_obtain_root(inode);
>> +     if (!dentry) {
>> +             pr_debug("kernfs: could not get dentry for '");
>> +             pr_cont_kernfs_path(kn);
>> +             pr_cont("'.\n");
>> +             return ERR_PTR(-ENOMEM);
>> +     }
>> +
>> +     /* If this is a new dentry, set it up. We need kernfs_mutex because this
>> +      * may be called by callers other than kernfs_fill_super. */
>> +     mutex_lock(&kernfs_mutex);
>> +     if (!dentry->d_fsdata) {
>> +             kernfs_get(kn);
>> +             dentry->d_fsdata = kn;
>> +     } else {
>> +             WARN_ON(dentry->d_fsdata != kn);
>> +     }
>> +     mutex_unlock(&kernfs_mutex);
>> +
>> +     return dentry;
>> +}
>> +
>>  static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
>>  {
>>       struct kernfs_super_info *info = kernfs_info(sb);
>> diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
>> index 3c2be75..b9538e0 100644
>> --- a/include/linux/kernfs.h
>> +++ b/include/linux/kernfs.h
>> @@ -274,6 +274,8 @@ void kernfs_put(struct kernfs_node *kn);
>>  struct kernfs_node *kernfs_node_from_dentry(struct dentry *dentry);
>>  struct kernfs_root *kernfs_root_from_sb(struct super_block *sb);
>>
>> +struct dentry *kernfs_obtain_root(struct super_block *sb,
>> +                               struct kernfs_node *kn);
>>  struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
>>                                      unsigned int flags, void *priv);
>>  void kernfs_destroy_root(struct kernfs_root *root);
>> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
>> index 7e5d597..250aaec 100644
>> --- a/kernel/cgroup.c
>> +++ b/kernel/cgroup.c
>> @@ -1302,6 +1302,13 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
>>
>>       memset(opts, 0, sizeof(*opts));
>>
>> +     /* Implicitly add CGRP_ROOT_SANE_BEHAVIOR if inside a non-init cgroup
>> +      * namespace.
>> +      */
>> +     if (current->nsproxy->cgroup_ns != &init_cgroup_ns) {
>> +             opts->flags |= CGRP_ROOT_SANE_BEHAVIOR;
>> +     }
>> +
>>       while ((token = strsep(&o, ",")) != NULL) {
>>               nr_opts++;
>>
>> @@ -1391,7 +1398,7 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
>>
>>       if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
>>               pr_warn("sane_behavior: this is still under development and its behaviors will change, proceed at your own risk\n");
>> -             if (nr_opts != 1) {
>> +             if (nr_opts > 1) {
>>                       pr_err("sane_behavior: no other mount options allowed\n");
>>                       return -EINVAL;
>>               }
>> @@ -1581,6 +1588,15 @@ static void init_cgroup_root(struct cgroup_root *root,
>>               set_bit(CGRP_CPUSET_CLONE_CHILDREN, &root->cgrp.flags);
>>  }
>>
>> +struct dentry *cgroupns_get_root(struct super_block *sb,
>> +                              struct cgroup_namespace *ns)
>> +{
>> +     struct dentry *nsdentry;
>> +
>> +     nsdentry = kernfs_obtain_root(sb, ns->root_cgrp->kn);
>> +     return nsdentry;
>> +}
>> +
>>  static int cgroup_setup_root(struct cgroup_root *root, unsigned int ss_mask)
>>  {
>>       LIST_HEAD(tmp_links);
>> @@ -1685,6 +1701,14 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
>>       int ret;
>>       int i;
>>       bool new_sb;
>> +     struct cgroup_namespace *ns =
>> +             get_cgroup_ns(current->nsproxy->cgroup_ns);
>> +
>> +     /* Check if the caller has permission to mount. */
>> +     if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) {
>> +             put_cgroup_ns(ns);
>> +             return ERR_PTR(-EPERM);
>> +     }
>>
>>       /*
>>        * The first time anyone tries to mount a cgroup, enable the list
>> @@ -1817,11 +1841,28 @@ out_free:
>>       kfree(opts.release_agent);
>>       kfree(opts.name);
>>
>> -     if (ret)
>> +     if (ret) {
>> +             put_cgroup_ns(ns);
>>               return ERR_PTR(ret);
>> +     }
>>
>>       dentry = kernfs_mount(fs_type, flags, root->kf_root,
>>                               CGROUP_SUPER_MAGIC, &new_sb);
>> +
>> +     if (!IS_ERR(dentry) && (root == &cgrp_dfl_root)) {
>> +             /* If this mount is for the default hierarchy in non-init cgroup
>> +              * namespace, then instead of root cgroup's dentry, we return
>> +              * the dentry corresponding to the cgroupns->root_cgrp.
>> +              */
>> +             if (ns != &init_cgroup_ns) {
>> +                     struct dentry *nsdentry;
>> +
>> +                     nsdentry = cgroupns_get_root(dentry->d_sb, ns);
>> +                     dput(dentry);
>> +                     dentry = nsdentry;
>> +             }
>> +     }
>> +
>>       if (IS_ERR(dentry) || !new_sb)
>>               cgroup_put(&root->cgrp);
>>
>> @@ -1834,6 +1875,7 @@ out_free:
>>               deactivate_super(pinned_sb);
>>       }
>>
>> +     put_cgroup_ns(ns);
>>       return dentry;
>>  }
>>
>> @@ -1862,6 +1904,7 @@ static struct file_system_type cgroup_fs_type = {
>>       .name = "cgroup",
>>       .mount = cgroup_mount,
>>       .kill_sb = cgroup_kill_sb,
>> +     .fs_flags = FS_USERNS_MOUNT,
>>  };
>>
>>  static struct kobject *cgroup_kobj;



-- 
Aditya

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns
  2014-11-03 22:43           ` Aditya Kali
@ 2014-11-03 22:56                 ` Andy Lutomirski
  2014-11-04 13:46               ` Tejun Heo
  1 sibling, 0 replies; 384+ messages in thread
From: Andy Lutomirski @ 2014-11-03 22:56 UTC (permalink / raw)
  To: Aditya Kali
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Eric W. Biederman,
	Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Ingo Molnar

On Mon, Nov 3, 2014 at 2:43 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>
>
> On Fri, Oct 31, 2014 at 6:09 PM, Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
> wrote:
>>
>> Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> writes:
>>
>> > This patch enables cgroup mounting inside userns when a process
>> > as appropriate privileges. The cgroup filesystem mounted is
>> > rooted at the cgroupns-root. Thus, in a container-setup, only
>> > the hierarchy under the cgroupns-root is exposed inside the container.
>> > This allows container management tools to run inside the containers
>> > without depending on any global state.
>> > In order to support this, a new kernfs api is added to lookup the
>> > dentry for the cgroupns-root.
>>
>> There is a misdesign in this.  Because files already exist we need the
>> protections that are present in proc and sysfs that only allow you to
>> mount the filesystem if it is already mounted.  Otherwise you can wind
>> up mounting this cgroupfs in a chroot jail when the global root would
>> not like you to see it.  cgroupfs isn't as bad as proc and sys but there
>> is at the very least an information leak in mounting it.
>>
>
> I think simply mounting the cgroupfs doesn't give you any more information
> than what you don't already know about the system ; specially if the
> visibility is restricted within the process's cgroupns-root. The cgroups
> still wont be writable by the user, so I think it should be fine to allow
> mounting?
>

Can we try to figure out a better way to do this than checking at
mount time for a fully-visible procfs/sysfs/cgroupfs?  The current
approach is unpleasant to deal with.

For example, we could check the equivalent conditions when the userns
is created and store them in a per-userns flags field.
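
A rough sketch of that idea, purely as a toy model: all type, field and
flag names below are made up for illustration and do not correspond to
anything in the kernel tree. The visibility conditions are evaluated
once when the namespace is created, and mount time only tests a cached
bit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits; the names are illustrative only. */
#define TOY_SAW_PROC	(1u << 0)
#define TOY_SAW_SYSFS	(1u << 1)
#define TOY_SAW_CGROUP	(1u << 2)

/* Toy stand-in for a user namespace carrying the cached flags. */
struct toy_userns {
	unsigned int visible_fs_flags;
};

/* What the creator could fully see when the namespace was created. */
struct toy_creator_view {
	bool proc_fully_visible;
	bool sysfs_fully_visible;
	bool cgroupfs_fully_visible;
};

/* Evaluate the visibility conditions once, at creation time. */
static void toy_userns_init(struct toy_userns *ns,
			    const struct toy_creator_view *view)
{
	assert(ns != NULL && view != NULL);

	ns->visible_fs_flags = 0;
	if (view->proc_fully_visible)
		ns->visible_fs_flags |= TOY_SAW_PROC;
	if (view->sysfs_fully_visible)
		ns->visible_fs_flags |= TOY_SAW_SYSFS;
	if (view->cgroupfs_fully_visible)
		ns->visible_fs_flags |= TOY_SAW_CGROUP;
}

/* At mount time only the cached bit is consulted. */
static bool toy_may_mount(const struct toy_userns *ns, unsigned int fs_flag)
{
	assert(ns != NULL);
	return (ns->visible_fs_flags & fs_flag) != 0;
}
```

The trade-off in such a scheme is that the answer is frozen at creation
time, so later changes to the creator's mount table would not be
reflected.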

>
>>
>> Given that we are effectively performing a bind mount in this patch, and
>> that we need to require cgroupfs be mounted anyway (to be safe).
>>
>> I don't see the point of this change.
>>
>> If we could change the set of cgroups or visible in cgroupfs I could
>> probably see the point.  But as it is this change seems to be pointless.
>>
>
> I agree that this is effectively bind-mounting, but doing this in kernel
> makes it really convenient for the userspace. The process that sets up the
> container doesn't need to care whether it should bind-mount cgroupfs inside
> the container or not. The tasks inside the container can mount cgroupfs on
> as-needed basis. The root container manager can simply unshare cgroupns and
> forget about the internal setup. I think this is useful just for the reason
> that it makes life much simpler for userspace.
>

If we add the fully-visible check at mount time, then I almost agree
with Eric.  I say almost because fs_fully_visible isn't checking
whether the superblock root is the thing that's mounted, and, if we
fix that, then bind-mounting like this becomes impossible and we'd
have to refine the check.

But if we come up with something less gross than checking for fs
visibility at mount time, then I agree with Aditya: let's let mount do
the right thing, since there may be nothing there to bind mount.  If
we go that route, then I think we might want to make it explicit:
require a mount flag like root=. to indicate that we want to be rooted
at our cgroupns's root cgroup.
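
As a sketch of what recognising such an explicit option could look
like, assuming the literal token is "root=." in a comma-separated
option string: the function name is hypothetical and this is plain
userspace C, not the cgroup option parser.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical parser: report whether a comma-separated mount option
 * string contains the exact token "root=.".
 */
static bool wants_cgroupns_root(const char *opts)
{
	static const char token[] = "root=.";
	const size_t token_len = sizeof(token) - 1;

	assert(opts != NULL);

	while (*opts != '\0') {
		const char *comma = strchr(opts, ',');
		size_t len = comma ? (size_t)(comma - opts) : strlen(opts);

		/* Only an exact, whole-token match counts. */
		if (len == token_len && strncmp(opts, token, len) == 0)
			return true;
		if (!comma)
			break;
		opts = comma + 1;
	}
	return false;
}
```

A mount without the token would then keep today's behaviour, and only a
caller that asks for it would get a mount rooted at its cgroupns root.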

--Andy

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns
  2014-11-01  0:07       ` Andy Lutomirski
@ 2014-11-03 23:12           ` Aditya Kali
  -1 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-11-03 23:12 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Eric W. Biederman,
	Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Ingo Molnar

On Fri, Oct 31, 2014 at 5:07 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
> On Fri, Oct 31, 2014 at 12:19 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>> This patch enables cgroup mounting inside userns when a process
>> as appropriate privileges. The cgroup filesystem mounted is
>> rooted at the cgroupns-root. Thus, in a container-setup, only
>> the hierarchy under the cgroupns-root is exposed inside the container.
>> This allows container management tools to run inside the containers
>> without depending on any global state.
>> In order to support this, a new kernfs api is added to lookup the
>> dentry for the cgroupns-root.
>>
>> Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
>> ---
>>  fs/kernfs/mount.c      | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
>>  include/linux/kernfs.h |  2 ++
>>  kernel/cgroup.c        | 47 +++++++++++++++++++++++++++++++++++++++++++++--
>>  3 files changed, 95 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
>> index f973ae9..e334f45 100644
>> --- a/fs/kernfs/mount.c
>> +++ b/fs/kernfs/mount.c
>> @@ -62,6 +62,54 @@ struct kernfs_root *kernfs_root_from_sb(struct super_block *sb)
>>         return NULL;
>>  }
>>
>> +/**
>> + * kernfs_make_root - create new root dentry for the given kernfs_node.
>> + * @sb: the kernfs super_block
>> + * @kn: kernfs_node for which a dentry is needed
>> + *
>> + * This can used used by callers which want to mount only a part of the kernfs
>> + * as root of the filesystem.
>> + */
>> +struct dentry *kernfs_obtain_root(struct super_block *sb,
>> +                                 struct kernfs_node *kn)
>> +{
>
> I can't usefully review this, but kernfs_make_root and
> kernfs_obtain_root aren't the same string...
>
>> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
>> index 7e5d597..250aaec 100644
>> --- a/kernel/cgroup.c
>> +++ b/kernel/cgroup.c
>> @@ -1302,6 +1302,13 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
>>
>>         memset(opts, 0, sizeof(*opts));
>>
>> +       /* Implicitly add CGRP_ROOT_SANE_BEHAVIOR if inside a non-init cgroup
>> +        * namespace.
>> +        */
>> +       if (current->nsproxy->cgroup_ns != &init_cgroup_ns) {
>> +               opts->flags |= CGRP_ROOT_SANE_BEHAVIOR;
>> +       }
>> +
>
> I don't like this implicit stuff.  Can you just return -EINVAL if sane
> behavior isn't requested?
>

I think the sane-behavior flag is only temporary and will be removed
anyway, right? So I didn't bother asking the user to supply it. But I
can make the change as you suggested. We just have to make sure that
tasks inside a cgroupns cannot mount non-default hierarchies, as that
would be a regression.

>>         while ((token = strsep(&o, ",")) != NULL) {
>>                 nr_opts++;
>>
>> @@ -1391,7 +1398,7 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
>>
>>         if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
>>                 pr_warn("sane_behavior: this is still under development and its behaviors will change, proceed at your own risk\n");
>> -               if (nr_opts != 1) {
>> +               if (nr_opts > 1) {
>>                         pr_err("sane_behavior: no other mount options allowed\n");
>>                         return -EINVAL;
>
> This looks wrong.  But, if you make the change above, then it'll be right.
>

It would have been nice if a simple 'mount -t cgroup cgroup <mnt>' from
inside a cgroupns did the right thing automatically.
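
To make the two variants of the option-count check easier to compare,
here is a toy model. It is not the kernel's parse_cgroupfs_options();
the enum and function names are invented, and the "implicit" branch is
only one possible reading of the patch as posted.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Two hypothetical policies discussed in the thread. */
enum sane_policy {
	SANE_IMPLICIT,	/* flag is forced on inside a non-init cgroupns */
	SANE_EXPLICIT,	/* caller must pass the option itself */
};

/*
 * Toy model, not kernel code. nr_opts is the number of option tokens the
 * caller passed; sane_requested says whether one of them asked for sane
 * behavior; in_cgroupns says whether the caller is in a non-init cgroupns.
 * Returns 0 if the options are acceptable, -EINVAL otherwise.
 */
static int check_mount_opts(enum sane_policy policy, bool in_cgroupns,
			    bool sane_requested, int nr_opts)
{
	assert(nr_opts >= 0);
	assert(!sane_requested || nr_opts >= 1);

	if (policy == SANE_EXPLICIT) {
		if (in_cgroupns && !sane_requested)
			return -EINVAL;
		if (sane_requested && nr_opts != 1)
			return -EINVAL;
		return 0;
	}

	/* SANE_IMPLICIT: the flag can be set with no token counted. */
	if ((sane_requested || in_cgroupns) && nr_opts > 1)
		return -EINVAL;
	return 0;
}
```

In this model the implicit policy accepts a bare mount from inside a
cgroupns (nr_opts == 0), but it also lets a single unrelated option
through, which may be why the relaxed check looked wrong. The explicit
policy keeps the original "exactly one option" rule.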


>> @@ -1685,6 +1701,14 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
>>         int ret;
>>         int i;
>>         bool new_sb;
>> +       struct cgroup_namespace *ns =
>> +               get_cgroup_ns(current->nsproxy->cgroup_ns);
>> +
>> +       /* Check if the caller has permission to mount. */
>> +       if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) {
>> +               put_cgroup_ns(ns);
>> +               return ERR_PTR(-EPERM);
>> +       }
>
> Why is this necessary?
>

Without this, if I unshare userns and mntns (but not cgroupns), I will
be able to mount my parent's cgroupfs hierarchy. This is a deviation
from what's allowed today (i.e., today I can't mount cgroupfs even
after unsharing userns & mntns). This check is there to prevent that
unintended side effect of the cgroupns feature.
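
The scenario can be sketched with a deliberately simplified model of
the capability check. The toy_* types and functions are made up for
this illustration; the real ns_capable() has additional rules (for
example around the namespace owner) that are ignored here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy user namespace: only the parent link matters for this model. */
struct toy_user_ns {
	const struct toy_user_ns *parent;	/* NULL for the initial ns */
};

/* Toy cgroup namespace, owned by the userns it was created in. */
struct toy_cgroup_ns {
	const struct toy_user_ns *user_ns;
};

/* A task holds CAP_SYS_ADMIN (or not) relative to its own userns. */
struct toy_task {
	const struct toy_user_ns *user_ns;
	bool has_sys_admin;
};

/*
 * Simplified rule: the capability applies in the task's own user
 * namespace and in any of its descendants, never in an ancestor.
 */
static bool toy_ns_capable(const struct toy_task *task,
			   const struct toy_user_ns *target)
{
	const struct toy_user_ns *ns;

	assert(task != NULL && target != NULL);

	if (!task->has_sys_admin)
		return false;
	for (ns = target; ns != NULL; ns = ns->parent)
		if (ns == task->user_ns)
			return true;
	return false;
}

/* The mount is allowed only with privilege over the cgroupns owner. */
static bool toy_may_mount_cgroupfs(const struct toy_task *task,
				   const struct toy_cgroup_ns *cgns)
{
	assert(cgns != NULL);
	return toy_ns_capable(task, cgns->user_ns);
}
```

A task that unshared only its userns is still in the parent's cgroupns,
which is owned by the parent userns, so the check fails. Once it also
unshares cgroupns, the new cgroupns is owned by its own userns and the
check passes.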

>> @@ -1862,6 +1904,7 @@ static struct file_system_type cgroup_fs_type = {
>>         .name = "cgroup",
>>         .mount = cgroup_mount,
>>         .kill_sb = cgroup_kill_sb,
>> +       .fs_flags = FS_USERNS_MOUNT,
>
> Aargh, another one!  Eric, can you either ack or nack my patch?
> Because if my patch goes in, then this line may need to change.  Or
> not, but if a stable release with cgroupfs and without my patch
> happens, then we'll have an ABI break.
>
> --Andy



-- 
Aditya

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns
       [not found]           ` <CAGr1F2FuPQxLraYv7PstJ9c8H-XQsgawaAtj4AS77B+_0k2o+A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2014-11-03 23:15             ` Andy Lutomirski
  2014-11-04 13:57             ` Tejun Heo
  1 sibling, 0 replies; 384+ messages in thread
From: Andy Lutomirski @ 2014-11-03 23:15 UTC (permalink / raw)
  To: Aditya Kali
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Eric W. Biederman,
	Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Ingo Molnar

On Mon, Nov 3, 2014 at 3:12 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
> On Fri, Oct 31, 2014 at 5:07 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>> On Fri, Oct 31, 2014 at 12:19 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>>> This patch enables cgroup mounting inside userns when a process
>>> as appropriate privileges. The cgroup filesystem mounted is
>>> rooted at the cgroupns-root. Thus, in a container-setup, only
>>> the hierarchy under the cgroupns-root is exposed inside the container.
>>> This allows container management tools to run inside the containers
>>> without depending on any global state.
>>> In order to support this, a new kernfs api is added to lookup the
>>> dentry for the cgroupns-root.
>>>
>>> Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
>>> ---
>>>  fs/kernfs/mount.c      | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
>>>  include/linux/kernfs.h |  2 ++
>>>  kernel/cgroup.c        | 47 +++++++++++++++++++++++++++++++++++++++++++++--
>>>  3 files changed, 95 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
>>> index f973ae9..e334f45 100644
>>> --- a/fs/kernfs/mount.c
>>> +++ b/fs/kernfs/mount.c
>>> @@ -62,6 +62,54 @@ struct kernfs_root *kernfs_root_from_sb(struct super_block *sb)
>>>         return NULL;
>>>  }
>>>
>>> +/**
>>> + * kernfs_make_root - create new root dentry for the given kernfs_node.
>>> + * @sb: the kernfs super_block
>>> + * @kn: kernfs_node for which a dentry is needed
>>> + *
>>> + * This can used used by callers which want to mount only a part of the kernfs
>>> + * as root of the filesystem.
>>> + */
>>> +struct dentry *kernfs_obtain_root(struct super_block *sb,
>>> +                                 struct kernfs_node *kn)
>>> +{
>>
>> I can't usefully review this, but kernfs_make_root and
>> kernfs_obtain_root aren't the same string...
>>
>>> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
>>> index 7e5d597..250aaec 100644
>>> --- a/kernel/cgroup.c
>>> +++ b/kernel/cgroup.c
>>> @@ -1302,6 +1302,13 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
>>>
>>>         memset(opts, 0, sizeof(*opts));
>>>
>>> +       /* Implicitly add CGRP_ROOT_SANE_BEHAVIOR if inside a non-init cgroup
>>> +        * namespace.
>>> +        */
>>> +       if (current->nsproxy->cgroup_ns != &init_cgroup_ns) {
>>> +               opts->flags |= CGRP_ROOT_SANE_BEHAVIOR;
>>> +       }
>>> +
>>
>> I don't like this implicit stuff.  Can you just return -EINVAL if sane
>> behavior isn't requested?
>>
>
> I think the sane-behavior flag is only temporary and will be removed
> anyways, right? So I didn't bother asking user to supply it. But I can
> make the change as you suggested. We just have to make sure that tasks
> inside cgroupns cannot mount non-default hierarchies as it would be a
> regression.
>
>>>         while ((token = strsep(&o, ",")) != NULL) {
>>>                 nr_opts++;
>>>
>>> @@ -1391,7 +1398,7 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
>>>
>>>         if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
>>>                 pr_warn("sane_behavior: this is still under development and its behaviors will change, proceed at your own risk\n");
>>> -               if (nr_opts != 1) {
>>> +               if (nr_opts > 1) {
>>>                         pr_err("sane_behavior: no other mount options allowed\n");
>>>                         return -EINVAL;
>>
>> This looks wrong.  But, if you make the change above, then it'll be right.
>>
>
> It would have been nice if simple 'mount -t cgroup cgroup <mnt>' from
> cgroupns does the right thing automatically.
>

This is a debatable point, but it's not what I meant.  Won't your code
let 'mount -t cgroup -o one_evil_flag cgroup mountpoint' through?
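
A minimal user-space sketch of the concern, assuming (as the quoted hunk
suggests) that the implicit flag is OR-ed into opts->flags without
incrementing nr_opts. model_parse(), MODEL_SANE_BEHAVIOR and MODEL_EINVAL
are hypothetical stand-ins for illustration, not kernel code, and
"noprefix" stands in for the single extra option:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for CGRP_ROOT_SANE_BEHAVIOR and -EINVAL. */
#define MODEL_SANE_BEHAVIOR 0x1u
#define MODEL_EINVAL        (-22)

/*
 * Models only the option counting and the "nr_opts > 1" check from the
 * quoted hunks. in_cgroupns stands for
 * current->nsproxy->cgroup_ns != &init_cgroup_ns.
 * Returns 0 if this check lets the option string through.
 */
static int model_parse(const char *data, bool in_cgroupns)
{
	unsigned int flags = 0;
	int nr_opts = 0;
	const char *p = data;

	/* Implicit flag: set without touching nr_opts (assumption). */
	if (in_cgroupns)
		flags |= MODEL_SANE_BEHAVIOR;

	/* Walk the comma-separated tokens, counting each one. */
	while (p != NULL) {
		size_t len = strcspn(p, ",");

		nr_opts++;
		if (len == 0)
			return MODEL_EINVAL;	/* empty token */
		if (len == strlen("__DEVEL__sane_behavior") &&
		    strncmp(p, "__DEVEL__sane_behavior", len) == 0)
			flags |= MODEL_SANE_BEHAVIOR;

		p = (p[len] == ',') ? p + len + 1 : NULL;
	}

	/* The check as changed by the patch ("!= 1" became "> 1"). */
	if ((flags & MODEL_SANE_BEHAVIOR) && nr_opts > 1)
		return MODEL_EINVAL;
	return 0;
}
```

In this model, model_parse("noprefix", true) counts one token and passes
the check, while model_parse("__DEVEL__sane_behavior,noprefix", true)
counts two and is rejected. Whether the real parser behaves the same way
depends on code outside the quoted hunks.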

>
>>> @@ -1685,6 +1701,14 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
>>>         int ret;
>>>         int i;
>>>         bool new_sb;
>>> +       struct cgroup_namespace *ns =
>>> +               get_cgroup_ns(current->nsproxy->cgroup_ns);
>>> +
>>> +       /* Check if the caller has permission to mount. */
>>> +       if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) {
>>> +               put_cgroup_ns(ns);
>>> +               return ERR_PTR(-EPERM);
>>> +       }
>>
>> Why is this necessary?
>>
>
> Without this, if I unshare userns and mntns (but no cgroupns), I will
> be able to mount my parent's cgroupfs hierarchy. This is deviation
> from whats allowed today (i.e., today I can't mount cgroupfs even
> after unsharing userns & mntns). This check is there to prevent the
> unintended effect of cgroupns feature.

Oh, I get it.  I misunderstood the code.

I guess this is reasonable.  If it annoys anyone, it can be reverted
or weakened.

--Andy

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns
       [not found]             ` <CALCETrW64-6xC6psP-8k0H-1GfVnWBTeEBNSrE_sH+-DFtuZQQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2014-11-03 23:23               ` Aditya Kali
  0 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-11-03 23:23 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Eric W. Biederman,
	Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Ingo Molnar

On Mon, Nov 3, 2014 at 3:15 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
> On Mon, Nov 3, 2014 at 3:12 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>> On Fri, Oct 31, 2014 at 5:07 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>>> On Fri, Oct 31, 2014 at 12:19 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>>>> This patch enables cgroup mounting inside userns when a process
>>>> as appropriate privileges. The cgroup filesystem mounted is
>>>> rooted at the cgroupns-root. Thus, in a container-setup, only
>>>> the hierarchy under the cgroupns-root is exposed inside the container.
>>>> This allows container management tools to run inside the containers
>>>> without depending on any global state.
>>>> In order to support this, a new kernfs api is added to lookup the
>>>> dentry for the cgroupns-root.
>>>>
>>>> Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
>>>> ---
>>>>  fs/kernfs/mount.c      | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
>>>>  include/linux/kernfs.h |  2 ++
>>>>  kernel/cgroup.c        | 47 +++++++++++++++++++++++++++++++++++++++++++++--
>>>>  3 files changed, 95 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
>>>> index f973ae9..e334f45 100644
>>>> --- a/fs/kernfs/mount.c
>>>> +++ b/fs/kernfs/mount.c
>>>> @@ -62,6 +62,54 @@ struct kernfs_root *kernfs_root_from_sb(struct super_block *sb)
>>>>         return NULL;
>>>>  }
>>>>
>>>> +/**
>>>> + * kernfs_make_root - create new root dentry for the given kernfs_node.
>>>> + * @sb: the kernfs super_block
>>>> + * @kn: kernfs_node for which a dentry is needed
>>>> + *
>>>> + * This can used used by callers which want to mount only a part of the kernfs
>>>> + * as root of the filesystem.
>>>> + */
>>>> +struct dentry *kernfs_obtain_root(struct super_block *sb,
>>>> +                                 struct kernfs_node *kn)
>>>> +{
>>>
>>> I can't usefully review this, but kernfs_make_root and
>>> kernfs_obtain_root aren't the same string...
>>>
>>>> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
>>>> index 7e5d597..250aaec 100644
>>>> --- a/kernel/cgroup.c
>>>> +++ b/kernel/cgroup.c
>>>> @@ -1302,6 +1302,13 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
>>>>
>>>>         memset(opts, 0, sizeof(*opts));
>>>>
>>>> +       /* Implicitly add CGRP_ROOT_SANE_BEHAVIOR if inside a non-init cgroup
>>>> +        * namespace.
>>>> +        */
>>>> +       if (current->nsproxy->cgroup_ns != &init_cgroup_ns) {
>>>> +               opts->flags |= CGRP_ROOT_SANE_BEHAVIOR;
>>>> +       }
>>>> +
>>>
>>> I don't like this implicit stuff.  Can you just return -EINVAL if sane
>>> behavior isn't requested?
>>>
>>
>> I think the sane-behavior flag is only temporary and will be removed
>> anyways, right? So I didn't bother asking user to supply it. But I can
>> make the change as you suggested. We just have to make sure that tasks
>> inside cgroupns cannot mount non-default hierarchies as it would be a
>> regression.
>>
>>>>         while ((token = strsep(&o, ",")) != NULL) {
>>>>                 nr_opts++;
>>>>
>>>> @@ -1391,7 +1398,7 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
>>>>
>>>>         if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
>>>>                 pr_warn("sane_behavior: this is still under development and its behaviors will change, proceed at your own risk\n");
>>>> -               if (nr_opts != 1) {
>>>> +               if (nr_opts > 1) {
>>>>                         pr_err("sane_behavior: no other mount options allowed\n");
>>>>                         return -EINVAL;
>>>
>>> This looks wrong.  But, if you make the change above, then it'll be right.
>>>
>>
>> It would have been nice if simple 'mount -t cgroup cgroup <mnt>' from
>> cgroupns does the right thing automatically.
>>
>
> This is a debatable point, but it's not what I meant.  Won't your code
> let 'mount -t cgroup -o one_evil_flag cgroup mountpoint' through?
>

I don't think so. This check "if (nr_opts > 1)" is nested under "if
(opts->flags & CGRP_ROOT_SANE_BEHAVIOR)". So we know that there is
at least 1 option ('__DEVEL__sane_behavior') present (implicit or not).
Adding 'one_evil_flag' will make nr_opts = 2 and result in -EINVAL
here.

>>
>>>> @@ -1685,6 +1701,14 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
>>>>         int ret;
>>>>         int i;
>>>>         bool new_sb;
>>>> +       struct cgroup_namespace *ns =
>>>> +               get_cgroup_ns(current->nsproxy->cgroup_ns);
>>>> +
>>>> +       /* Check if the caller has permission to mount. */
>>>> +       if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) {
>>>> +               put_cgroup_ns(ns);
>>>> +               return ERR_PTR(-EPERM);
>>>> +       }
>>>
>>> Why is this necessary?
>>>
>>
>> Without this, if I unshare userns and mntns (but no cgroupns), I will
>> be able to mount my parent's cgroupfs hierarchy. This is deviation
>> from whats allowed today (i.e., today I can't mount cgroupfs even
>> after unsharing userns & mntns). This check is there to prevent the
>> unintended effect of cgroupns feature.
>
> Oh, I get it.  I misunderstood the code.
>
> I guess this is reasonable.  If it annoys anyone, it can be reverted
> or weakened.
>
> --Andy



-- 
Aditya

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns
@ 2014-11-03 23:23               ` Aditya Kali
  0 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-11-03 23:23 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Tejun Heo, Li Zefan, Serge Hallyn, Eric W. Biederman,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Linux API, Ingo Molnar,
	Linux Containers, Rohit Jnagal

On Mon, Nov 3, 2014 at 3:15 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
> On Mon, Nov 3, 2014 at 3:12 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>> On Fri, Oct 31, 2014 at 5:07 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>>> On Fri, Oct 31, 2014 at 12:19 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>>>> This patch enables cgroup mounting inside userns when a process
>>>> as appropriate privileges. The cgroup filesystem mounted is
>>>> rooted at the cgroupns-root. Thus, in a container-setup, only
>>>> the hierarchy under the cgroupns-root is exposed inside the container.
>>>> This allows container management tools to run inside the containers
>>>> without depending on any global state.
>>>> In order to support this, a new kernfs api is added to lookup the
>>>> dentry for the cgroupns-root.
>>>>
>>>> Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
>>>> ---
>>>>  fs/kernfs/mount.c      | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
>>>>  include/linux/kernfs.h |  2 ++
>>>>  kernel/cgroup.c        | 47 +++++++++++++++++++++++++++++++++++++++++++++--
>>>>  3 files changed, 95 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
>>>> index f973ae9..e334f45 100644
>>>> --- a/fs/kernfs/mount.c
>>>> +++ b/fs/kernfs/mount.c
>>>> @@ -62,6 +62,54 @@ struct kernfs_root *kernfs_root_from_sb(struct super_block *sb)
>>>>         return NULL;
>>>>  }
>>>>
>>>> +/**
>>>> + * kernfs_make_root - create new root dentry for the given kernfs_node.
>>>> + * @sb: the kernfs super_block
>>>> + * @kn: kernfs_node for which a dentry is needed
>>>> + *
>>>> + * This can used used by callers which want to mount only a part of the kernfs
>>>> + * as root of the filesystem.
>>>> + */
>>>> +struct dentry *kernfs_obtain_root(struct super_block *sb,
>>>> +                                 struct kernfs_node *kn)
>>>> +{
>>>
>>> I can't usefully review this, but kernfs_make_root and
>>> kernfs_obtain_root aren't the same string...
>>>
>>>> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
>>>> index 7e5d597..250aaec 100644
>>>> --- a/kernel/cgroup.c
>>>> +++ b/kernel/cgroup.c
>>>> @@ -1302,6 +1302,13 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
>>>>
>>>>         memset(opts, 0, sizeof(*opts));
>>>>
>>>> +       /* Implicitly add CGRP_ROOT_SANE_BEHAVIOR if inside a non-init cgroup
>>>> +        * namespace.
>>>> +        */
>>>> +       if (current->nsproxy->cgroup_ns != &init_cgroup_ns) {
>>>> +               opts->flags |= CGRP_ROOT_SANE_BEHAVIOR;
>>>> +       }
>>>> +
>>>
>>> I don't like this implicit stuff.  Can you just return -EINVAL if sane
>>> behavior isn't requested?
>>>
>>
>> I think the sane-behavior flag is only temporary and will be removed
>> anyway, right? So I didn't bother asking the user to supply it. But I can
>> make the change as you suggested. We just have to make sure that tasks
>> inside cgroupns cannot mount non-default hierarchies as it would be a
>> regression.
>>
>>>>         while ((token = strsep(&o, ",")) != NULL) {
>>>>                 nr_opts++;
>>>>
>>>> @@ -1391,7 +1398,7 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
>>>>
>>>>         if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
>>>>                 pr_warn("sane_behavior: this is still under development and its behaviors will change, proceed at your own risk\n");
>>>> -               if (nr_opts != 1) {
>>>> +               if (nr_opts > 1) {
>>>>                         pr_err("sane_behavior: no other mount options allowed\n");
>>>>                         return -EINVAL;
>>>
>>> This looks wrong.  But, if you make the change above, then it'll be right.
>>>
>>
>> It would have been nice if simple 'mount -t cgroup cgroup <mnt>' from
>> cgroupns does the right thing automatically.
>>
>
> This is a debatable point, but it's not what I meant.  Won't your code
> let 'mount -t cgroup -o one_evil_flag cgroup mountpoint' through?
>

I don't think so. This check "if (nr_opts > 1)" is nested under "if
(opts->flags & CGRP_ROOT_SANE_BEHAVIOR)". So we know that there is
at least 1 option ('__DEVEL__sane_behavior') present (implicit or not).
Addition of 'one_evil_flag' will make nr_opts = 2 and result in EINVAL
here.

>>
>>>> @@ -1685,6 +1701,14 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
>>>>         int ret;
>>>>         int i;
>>>>         bool new_sb;
>>>> +       struct cgroup_namespace *ns =
>>>> +               get_cgroup_ns(current->nsproxy->cgroup_ns);
>>>> +
>>>> +       /* Check if the caller has permission to mount. */
>>>> +       if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) {
>>>> +               put_cgroup_ns(ns);
>>>> +               return ERR_PTR(-EPERM);
>>>> +       }
>>>
>>> Why is this necessary?
>>>
>>
>> Without this, if I unshare userns and mntns (but not cgroupns), I will
>> be able to mount my parent's cgroupfs hierarchy. This is a deviation
>> from what's allowed today (i.e., today I can't mount cgroupfs even
>> after unsharing userns & mntns). This check is there to prevent that
>> unintended effect of the cgroupns feature.
>
> Oh, I get it.  I misunderstood the code.
>
> I guess this is reasonable.  If it annoys anyone, it can be reverted
> or weakened.
>
> --Andy



-- 
Aditya

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv2 5/7] cgroup: introduce cgroup namespaces
  2014-11-01  0:02           ` Andy Lutomirski
@ 2014-11-03 23:40               ` Aditya Kali
  -1 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-11-03 23:40 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Eric W. Biederman,
	Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Ingo Molnar

On Fri, Oct 31, 2014 at 5:02 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
> On Fri, Oct 31, 2014 at 12:18 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>> Introduce the ability to create a new cgroup namespace. The newly created
>> cgroup namespace remembers the cgroup of the process at the point
>> of creation of the cgroup namespace (referred to as cgroupns-root).
>> The main purpose of cgroup namespace is to virtualize the contents
>> of /proc/self/cgroup file. Processes inside a cgroup namespace
>> are only able to see paths relative to their namespace root
>> (unless they are moved outside of their cgroupns-root, at which point
>>  they will see a relative path from their cgroupns-root).
>> For a correctly set up container this enables container-tools
>> (like libcontainer, lxc, lmctfy, etc.) to create completely virtualized
>> containers without leaking system level cgroup hierarchy to the task.
>> This patch only implements the 'unshare' part of the cgroupns.
>>
>
>> +       /* Prevent cgroup changes for this task. */
>> +       threadgroup_lock(current);
>
> This could just be me being dense, but what is the lock for?
>

threadgroup_lock() is there to prevent the task from changing cgroups
while we are unsharing cgroupns.
But it seems that this might be unnecessary now because we have
removed the pinning restriction. Without pinning, we don't care if the
task cgroup changes underneath us. I will remove it from here as well
as from cgroupns_install().

>> +
>> +       /* CGROUPNS only virtualizes the cgroup path on the unified hierarchy.
>> +        */
>> +       cgrp = get_task_cgroup(current);
>> +
>> +       err = -ENOMEM;
>> +       new_ns = alloc_cgroup_ns();
>> +       if (!new_ns)
>> +               goto err_out_unlock;
>> +
>> +       err = proc_alloc_inum(&new_ns->proc_inum);
>> +       if (err)
>> +               goto err_out_unlock;
>> +
>> +       new_ns->user_ns = get_user_ns(user_ns);
>> +       new_ns->root_cgrp = cgrp;
>> +
>> +       threadgroup_unlock(current);
>> +
>> +       return new_ns;
>> +
>> +err_out_unlock:
>> +       threadgroup_unlock(current);
>> +err_out:
>> +       if (cgrp)
>> +               cgroup_put(cgrp);
>> +       kfree(new_ns);
>> +       return ERR_PTR(err);
>> +}
>> +
>> +static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
>> +{
>> +       pr_info("setns not supported for cgroup namespace");
>> +       return -EINVAL;
>> +}
>> +
>> +static void *cgroupns_get(struct task_struct *task)
>> +{
>> +       struct cgroup_namespace *ns = NULL;
>> +       struct nsproxy *nsproxy;
>> +
>> +       rcu_read_lock();
>> +       nsproxy = task->nsproxy;
>> +       if (nsproxy) {
>> +               ns = nsproxy->cgroup_ns;
>> +               get_cgroup_ns(ns);
>> +       }
>> +       rcu_read_unlock();
>
> How is this correct?  Other namespaces do it too, so it Must Be
> Correct (tm), but I don't understand.  What is RCU protecting?
>
> --Andy



-- 
Aditya

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv2 5/7] cgroup: introduce cgroup namespaces
  2014-11-01  0:58               ` Eric W. Biederman
@ 2014-11-03 23:42                   ` Aditya Kali
  -1 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-11-03 23:42 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andy Lutomirski, Tejun Heo,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Ingo Molnar

On Fri, Oct 31, 2014 at 5:58 PM, Eric W. Biederman
<ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
> Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> writes:
>
>> On Fri, Oct 31, 2014 at 12:18 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>
> <snip>
>
>>> +static void *cgroupns_get(struct task_struct *task)
>>> +{
>>> +       struct cgroup_namespace *ns = NULL;
>>> +       struct nsproxy *nsproxy;
>>> +
>>> +       rcu_read_lock();
>>> +       nsproxy = task->nsproxy;
>>> +       if (nsproxy) {
>>> +               ns = nsproxy->cgroup_ns;
>>> +               get_cgroup_ns(ns);
>>> +       }
>>> +       rcu_read_unlock();
>>
>> How is this correct?  Other namespaces do it too, so it Must Be
>> Correct (tm), but I don't understand.  What is RCU protecting?
>
> The code is not correct.  The code needs to use task_lock.
>
> RCU used to protect nsproxy, and now task_lock protects nsproxy.
> For the reasons behind all of this I refer you to the commit
> that changed this, and the comment in nsproxy.h
>

My bad. This should be under task_lock. I will fix it.
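
A rough userspace sketch of what taking the reference under the task
lock looks like. The model_* names are hypothetical stand-ins, an
atomic_flag spinlock plays the role of task_lock(), and the reference
count is a plain int for brevity; this is not the kernel code or the
eventual fix.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; none of these are the kernel types. */
struct model_cgroup_ns {
	int count;			/* plain int here; real refcounts are atomic */
};

struct model_nsproxy {
	struct model_cgroup_ns *cgroup_ns;
};

struct model_task {
	atomic_flag alloc_lock;		/* plays the role of task_lock() */
	struct model_nsproxy *nsproxy;	/* NULL once the task has exited */
};

static void model_task_lock(struct model_task *t)
{
	/* Spin until the flag was previously clear, i.e. we own the lock. */
	while (atomic_flag_test_and_set_explicit(&t->alloc_lock,
						 memory_order_acquire))
		;
}

static void model_task_unlock(struct model_task *t)
{
	atomic_flag_clear_explicit(&t->alloc_lock, memory_order_release);
}

/*
 * Read t->nsproxy and take the namespace reference while holding the
 * lock, so the pointer cannot be switched between the read and the get.
 * Returns NULL if the task no longer has an nsproxy.
 */
static struct model_cgroup_ns *model_cgroupns_get(struct model_task *t)
{
	struct model_cgroup_ns *ns = NULL;

	model_task_lock(t);
	if (t->nsproxy) {
		ns = t->nsproxy->cgroup_ns;
		ns->count++;
	}
	model_task_unlock(t);

	return ns;
}
```

The point of the shape is only that the NULL check, the pointer read and
the reference grab all sit inside one locked region.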

> commit 728dba3a39c66b3d8ac889ddbe38b5b1c264aec3
> Author: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
> Date:   Mon Feb 3 19:13:49 2014 -0800
>
>     namespaces: Use task_lock and not rcu to protect nsproxy
>
>     The synchronous synchronize_rcu in switch_task_namespaces makes setns
>     a sufficiently expensive system call that people have complained.
>
>     Upon inspection nsproxy no longer needs rcu protection for remote reads.
>     Remote reads are rare.  So optimize for same process reads and writes
>     by switching to task_lock instead.
>
>     This yields a simpler to understand lock, and a faster setns system call.
>
>     In particular this fixes a performance regression observed
>     by Rafael David Tinoco <rafael.tinoco-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>.
>
>     This is effectively a revert of Pavel Emelyanov's commit
>     cf7b708c8d1d7a27736771bcf4c457b332b0f818 Make access to task's nsproxy lighter
>     from 2007.  The race this originally fixed no longer exists as
>     do_notify_parent uses task_active_pid_ns(parent) instead of
>     parent->nsproxy.
>
>     Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
>
> Eric



-- 
Aditya

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns
  2014-11-03 23:23               ` Aditya Kali
@ 2014-11-03 23:48                   ` Andy Lutomirski
  -1 siblings, 0 replies; 384+ messages in thread
From: Andy Lutomirski @ 2014-11-03 23:48 UTC (permalink / raw)
  To: Aditya Kali
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Eric W. Biederman,
	Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Ingo Molnar

On Mon, Nov 3, 2014 at 3:23 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
> On Mon, Nov 3, 2014 at 3:15 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>> On Mon, Nov 3, 2014 at 3:12 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>>> On Fri, Oct 31, 2014 at 5:07 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>>>> On Fri, Oct 31, 2014 at 12:19 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>>>>>         if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
>>>>>                 pr_warn("sane_behavior: this is still under development and its behaviors will change, proceed at your own risk\n");
>>>>> -               if (nr_opts != 1) {
>>>>> +               if (nr_opts > 1) {
>>>>>                         pr_err("sane_behavior: no other mount options allowed\n");
>>>>>                         return -EINVAL;
>>>>
>>>> This looks wrong.  But, if you make the change above, then it'll be right.
>>>>
>>>
>>> It would have been nice if simple 'mount -t cgroup cgroup <mnt>' from
>>> cgroupns does the right thing automatically.
>>>
>>
>> This is a debatable point, but it's not what I meant.  Won't your code
>> let 'mount -t cgroup -o one_evil_flag cgroup mountpoint' through?
>>
>
> I don't think so. This check "if (nr_opts > 1)" is nested under "if
> (opts->flags & CGRP_ROOT_SANE_BEHAVIOR)". So we know that there is
> at least 1 option ('__DEVEL__sane_behavior') present (implicit or not).
> Addition of 'one_evil_flag' will make nr_opts = 2 and result in EINVAL
> here.

But the implicit __DEVEL__sane_behavior doesn't increment nr_opts, right?

--Andy

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns
       [not found]                   ` <CALCETrUB_xx5zno26k5UjAFt77nZTpgyndD4AuBSZxiZBNjXSw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2014-11-04  0:12                     ` Aditya Kali
  0 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-11-04  0:12 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Eric W. Biederman,
	Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Ingo Molnar

On Mon, Nov 3, 2014 at 3:48 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
> On Mon, Nov 3, 2014 at 3:23 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>> On Mon, Nov 3, 2014 at 3:15 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>>> On Mon, Nov 3, 2014 at 3:12 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>>>> On Fri, Oct 31, 2014 at 5:07 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>>>>> On Fri, Oct 31, 2014 at 12:19 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>>>>>>         if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
>>>>>>                 pr_warn("sane_behavior: this is still under development and its behaviors will change, proceed at your own risk\n");
>>>>>> -               if (nr_opts != 1) {
>>>>>> +               if (nr_opts > 1) {
>>>>>>                         pr_err("sane_behavior: no other mount options allowed\n");
>>>>>>                         return -EINVAL;
>>>>>
>>>>> This looks wrong.  But, if you make the change above, then it'll be right.
>>>>>
>>>>
>>>> It would have been nice if simple 'mount -t cgroup cgroup <mnt>' from
>>>> cgroupns does the right thing automatically.
>>>>
>>>
>>> This is a debatable point, but it's not what I meant.  Won't your code
>>> let 'mount -t cgroup -o one_evil_flag cgroup mountpoint' through?
>>>
>>
>> I don't think so. This check "if (nr_opts > 1)" is nested under "if
>> (opts->flags & CGRP_ROOT_SANE_BEHAVIOR)". So we know that there is
>> at least 1 option ('__DEVEL__sane_behavior') present (implicit or not).
>> Addition of 'one_evil_flag' will make nr_opts = 2 and result in EINVAL
>> here.
>
> But the implicit __DEVEL__sane_behavior doesn't increment nr_opts, right?
>

Yes. Hence this change makes sure that we don't return EINVAL when
nr_opts == 0 or nr_opts == 1 :)
That way, both of the following are equivalent when inside non-init cgroupns:

(1) $ mount -t cgroup -o __DEVEL__sane_behavior cgroup mountpoint
(2) $ mount -t cgroup cgroup mountpoint

Any other mount option will trigger the error here.


> --Andy

-- 
Aditya

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns
  2014-11-04  0:12                     ` Aditya Kali
@ 2014-11-04  0:17                         ` Andy Lutomirski
  -1 siblings, 0 replies; 384+ messages in thread
From: Andy Lutomirski @ 2014-11-04  0:17 UTC (permalink / raw)
  To: Aditya Kali
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Eric W. Biederman,
	Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Ingo Molnar

On Mon, Nov 3, 2014 at 4:12 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
> On Mon, Nov 3, 2014 at 3:48 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>> On Mon, Nov 3, 2014 at 3:23 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>>> On Mon, Nov 3, 2014 at 3:15 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>>>> On Mon, Nov 3, 2014 at 3:12 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>>>>> On Fri, Oct 31, 2014 at 5:07 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>>>>>> On Fri, Oct 31, 2014 at 12:19 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>>>>>>>         if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
>>>>>>>                 pr_warn("sane_behavior: this is still under development and its behaviors will change, proceed at your own risk\n");
>>>>>>> -               if (nr_opts != 1) {
>>>>>>> +               if (nr_opts > 1) {
>>>>>>>                         pr_err("sane_behavior: no other mount options allowed\n");
>>>>>>>                         return -EINVAL;
>>>>>>
>>>>>> This looks wrong.  But, if you make the change above, then it'll be right.
>>>>>>
>>>>>
>>>>> It would have been nice if simple 'mount -t cgroup cgroup <mnt>' from
>>>>> cgroupns does the right thing automatically.
>>>>>
>>>>
>>>> This is a debatable point, but it's not what I meant.  Won't your code
>>>> let 'mount -t cgroup -o one_evil_flag cgroup mountpoint' through?
>>>>
>>>
>>> I don't think so. This check "if (nr_opts > 1)" is nested under "if
>>> (opts->flags & CGRP_ROOT_SANE_BEHAVIOR)". So we know that there is
>>> at least 1 option ('__DEVEL__sane_behavior') present (implicit or not).
>>> Addition of 'one_evil_flag' will make nr_opts = 2 and result in EINVAL
>>> here.
>>
>> But the implicit __DEVEL__sane_behavior doesn't increment nr_opts, right?
>>
>
> Yes. Hence this change makes sure that we don't return EINVAL when
> nr_opts == 0 or nr_opts == 1 :)
> That way, both of the following are equivalent when inside non-init cgroupns:
>
> (1) $ mount -t cgroup -o __DEVEL__sane_behavior cgroup mountpoint
> (2) $ mount -t cgroup cgroup mountpoint
>
> Any other mount option will trigger the error here.

I still don't get it.  Can you walk me through why mount -o
some_other_option -t cgroup cgroup mountpoint causes -EINVAL?
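
For what it's worth, the counting can be checked with a small userspace
model of the quoted hunk. This is only a sketch: check_opts(),
SANE_BEHAVIOR and the in_non_init_cgroupns parameter are made-up names,
the tokenizer is a plain strchr() loop rather than the kernel's strsep()
loop, and nothing else in parse_cgroupfs_options() (which may well
reject an option for other reasons) is modeled.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's CGRP_ROOT_SANE_BEHAVIOR bit. */
#define SANE_BEHAVIOR 0x1u

/*
 * Userspace model of the quoted hunk, not the kernel function.
 * 'data' is the comma-separated option string, or NULL for no options.
 * Returns 0 if the "nr_opts > 1" test lets the mount through and
 * -1 (standing in for -EINVAL) if it rejects it.
 */
static int check_opts(const char *data, bool in_non_init_cgroupns)
{
	static const char sane[] = "__DEVEL__sane_behavior";
	unsigned int flags = 0;
	unsigned int nr_opts = 0;
	const char *p = data;

	/* Implicit flag: set without touching nr_opts. */
	if (in_non_init_cgroupns)
		flags |= SANE_BEHAVIOR;

	/* Count every comma-separated token, as the nr_opts++ loop does. */
	while (p != NULL) {
		const char *comma = strchr(p, ',');
		size_t len = comma ? (size_t)(comma - p) : strlen(p);

		nr_opts++;
		if (len == strlen(sane) && strncmp(p, sane, len) == 0)
			flags |= SANE_BEHAVIOR;

		p = comma ? comma + 1 : NULL;
	}

	/* The check as changed by the patch: "!= 1" became "> 1". */
	if ((flags & SANE_BEHAVIOR) && nr_opts > 1)
		return -1;

	return 0;
}
```

In this model, check_opts("one_evil_flag", true) counts one option and
is not rejected by the nr_opts > 1 test, while
check_opts("__DEVEL__sane_behavior,one_evil_flag", true) counts two and
is rejected.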

--Andy

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns
  2014-11-04  0:17                         ` Andy Lutomirski
@ 2014-11-04  0:49                             ` Aditya Kali
  -1 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-11-04  0:49 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Eric W. Biederman,
	Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Ingo Molnar

On Mon, Nov 3, 2014 at 4:17 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
> On Mon, Nov 3, 2014 at 4:12 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>> On Mon, Nov 3, 2014 at 3:48 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>>> On Mon, Nov 3, 2014 at 3:23 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>>>> On Mon, Nov 3, 2014 at 3:15 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>>>>> On Mon, Nov 3, 2014 at 3:12 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>>>>>> On Fri, Oct 31, 2014 at 5:07 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>>>>>>> On Fri, Oct 31, 2014 at 12:19 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>>>>>>>>         if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
>>>>>>>>                 pr_warn("sane_behavior: this is still under development and its behaviors will change, proceed at your own risk\n");
>>>>>>>> -               if (nr_opts != 1) {
>>>>>>>> +               if (nr_opts > 1) {
>>>>>>>>                         pr_err("sane_behavior: no other mount options allowed\n");
>>>>>>>>                         return -EINVAL;
>>>>>>>
>>>>>>> This looks wrong.  But, if you make the change above, then it'll be right.
>>>>>>>
>>>>>>
>>>>>> It would have been nice if simple 'mount -t cgroup cgroup <mnt>' from
>>>>>> cgroupns does the right thing automatically.
>>>>>>
>>>>>
>>>>> This is a debatable point, but it's not what I meant.  Won't your code
>>>>> let 'mount -t cgroup -o one_evil_flag cgroup mountpoint' through?
>>>>>
>>>>
>>>> I don't think so. This check "if (nr_opts > 1)" is nested under "if
>>>> (opts->flags & CGRP_ROOT_SANE_BEHAVIOR)". So we know that there is
>>>> at least 1 option ('__DEVEL__sane_behavior') present (implicit or not).
>>>> Addition of 'one_evil_flag' will make nr_opts = 2 and result in EINVAL
>>>> here.
>>>
>>> But the implicit __DEVEL__sane_behavior doesn't increment nr_opts, right?
>>>
>>
>> Yes. Hence this change makes sure that we don't return EINVAL when
>> nr_opts == 0 or nr_opts == 1 :)
>> That way, both of the following are equivalent when inside non-init cgroupns:
>>
>> (1) $ mount -t cgroup -o __DEVEL__sane_behavior cgroup mountpoint
>> (2) $ mount -t cgroup cgroup mountpoint
>>
>> Any other mount option will trigger the error here.
>
> I still don't get it.  Can you walk me through why mount -o
> some_other_option -t cgroup cgroup mountpoint causes -EINVAL?
>

Argh! You are right. I was totally convinced that this works. But it
clearly doesn't if you specify 1 legit mount option. I wanted to make
it work for both cases (1) and (2) above. But then this check will
have to be changed :(
Sorry about the back and forth. I am just going to make it return
EINVAL if __DEVEL__sane_behavior is not specified, as suggested in the
beginning.

> --Andy

-- 
Aditya
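
The problem conceded above can be reproduced with a small user-space
model. This is only a sketch, not the kernel code: struct mount_req,
check_before(), check_after() and in_cgroupns() are hypothetical names,
and the model assumes (as stated in the thread) that the implicit
__DEVEL__sane_behavior flag is set inside a non-init cgroupns without
incrementing nr_opts.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical user-space model of the option check; not the kernel code.
 * nr_opts counts only options passed explicitly with -o. sane_flag models
 * CGRP_ROOT_SANE_BEHAVIOR, which may be set implicitly inside a non-init
 * cgroupns without incrementing nr_opts.
 */
struct mount_req {
	int nr_opts;
	bool sane_flag;
};

/* Check before the patch: exactly one option is allowed. */
static int check_before(const struct mount_req *req)
{
	assert(req && req->nr_opts >= 0);
	if (req->sane_flag && req->nr_opts != 1)
		return -EINVAL;
	return 0;
}

/* Check proposed in the patch: zero or one option is allowed. */
static int check_after(const struct mount_req *req)
{
	assert(req && req->nr_opts >= 0);
	if (req->sane_flag && req->nr_opts > 1)
		return -EINVAL;
	return 0;
}

/* Build the request for a mount issued inside a non-init cgroupns. */
static struct mount_req in_cgroupns(int explicit_opts)
{
	struct mount_req req = { .nr_opts = explicit_opts, .sane_flag = true };
	return req;
}
```

In this model, in_cgroupns(1) describes both '-o __DEVEL__sane_behavior'
and '-o some_other_option': the counter alone cannot tell them apart, so
check_after() accepts either one. That is why the check has to look at
which option was passed rather than at how many.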

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv2 5/7] cgroup: introduce cgroup namespaces
  2014-10-31 19:18       ` Aditya Kali
@ 2014-11-04  1:56           ` Aditya Kali
  -1 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-11-04  1:56 UTC (permalink / raw)
  To: tj-DgEjT+Ai2ygdnm+yROfE0A, lizefan-hv44wF8Li93QT0dZR+AlfA,
	serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA, luto-kltTT9wpgjJwATOyAt5JVQ,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, mingo-H+wXaHxf7aLQT0dZR+AlfA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA


Introduce the ability to create a new cgroup namespace. The newly created
cgroup namespace remembers the cgroup of the process at the point
of creation of the cgroup namespace (referred to as the cgroupns-root).
The main purpose of the cgroup namespace is to virtualize the contents
of the /proc/self/cgroup file. Processes inside a cgroup namespace
are only able to see paths relative to their namespace root
(unless they are moved outside of their cgroupns-root, at which point
they will see a relative path from their cgroupns-root).
For a correctly set up container this enables container tools
(like libcontainer, lxc, lmctfy, etc.) to create completely virtualized
containers without leaking the system-level cgroup hierarchy to the task.
This patch only implements the 'unshare' part of the cgroupns.
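
The path virtualization described above can be illustrated with a small
user-space sketch. It is not the kernel implementation (the patch relies
on kernfs_path_from_node() for that); path_from_root() is a hypothetical
helper, and rendering a cgroup outside the cgroupns-root with "/.."
components is an assumption about what such a relative path could look
like.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: render the absolute cgroup path `cgrp` as seen from
 * the namespace root `root`. Both inputs are absolute, '/'-separated paths
 * without a trailing slash (the hierarchy root is "/").
 * Returns 0 on success, -1 on bad input or if `buf` is too small.
 */
static int path_from_root(const char *root, const char *cgrp,
			  char *buf, size_t buflen)
{
	size_t i = 0, common = 0, used = 0, len;
	const char *p, *rest;

	assert(buf != NULL);
	if (!root || !cgrp || root[0] != '/' || cgrp[0] != '/' || buflen == 0)
		return -1;

	/* Treat "/" as a path with zero components. */
	if (strcmp(root, "/") == 0)
		root = "";
	if (strcmp(cgrp, "/") == 0)
		cgrp = "";

	/* Longest common prefix that ends on a component boundary. */
	while (root[i] != '\0' && root[i] == cgrp[i]) {
		i++;
		if ((root[i] == '/' || root[i] == '\0') &&
		    (cgrp[i] == '/' || cgrp[i] == '\0'))
			common = i;
	}

	/* One "/.." for every root component below the common prefix. */
	for (p = root + common; *p != '\0'; p++) {
		if (*p != '/')
			continue;
		if (used + 3 >= buflen)
			return -1;
		memcpy(buf + used, "/..", 3);
		used += 3;
	}

	/* Then the part of cgrp that lies below the common prefix. */
	rest = cgrp + common;
	len = strlen(rest);
	if (used + len >= buflen)
		return -1;
	memcpy(buf + used, rest, len);
	used += len;

	/* The namespace root itself is shown as "/". */
	if (used == 0) {
		if (buflen < 2)
			return -1;
		buf[used++] = '/';
	}
	buf[used] = '\0';
	return 0;
}
```

With root "/batchjobs/c_job_id1", the cgroup "/batchjobs/c_job_id1/sub"
is rendered as "/sub" and the root itself as "/", so the host-side prefix
never reaches the task.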

Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
  fs/proc/namespaces.c             |   1 +
  include/linux/cgroup.h           |  18 +++++-
  include/linux/cgroup_namespace.h |  36 +++++++++++
  include/linux/nsproxy.h          |   2 +
  include/linux/proc_ns.h          |   4 ++
  kernel/Makefile                  |   2 +-
  kernel/cgroup.c                  |  14 +++++
  kernel/cgroup_namespace.c        | 127 +++++++++++++++++++++++++++++++++++++++
  kernel/fork.c                    |   2 +-
  kernel/nsproxy.c                 |  19 +++++-
  10 files changed, 220 insertions(+), 5 deletions(-)
  create mode 100644 include/linux/cgroup_namespace.h
  create mode 100644 kernel/cgroup_namespace.c

diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index 8902609..55bc5da 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -32,6 +32,7 @@ static const struct proc_ns_operations *ns_entries[] = {
  	&userns_operations,
  #endif
  	&mntns_operations,
+	&cgroupns_operations,
  };

  static const struct file_operations ns_file_operations = {
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 4a0eb2d..aa86495 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -22,6 +22,8 @@
  #include <linux/seq_file.h>
  #include <linux/kernfs.h>
  #include <linux/wait.h>
+#include <linux/nsproxy.h>
+#include <linux/types.h>

  #ifdef CONFIG_CGROUPS

@@ -460,6 +462,13 @@ struct cftype {
  #endif
  };

+struct cgroup_namespace {
+	atomic_t		count;
+	unsigned int		proc_inum;
+	struct user_namespace	*user_ns;
+	struct cgroup		*root_cgrp;
+};
+
  extern struct cgroup_root cgrp_dfl_root;
  extern struct css_set init_css_set;

@@ -584,10 +593,17 @@ static inline int cgroup_name(struct cgroup *cgrp, char *buf, size_t buflen)
  	return kernfs_name(cgrp->kn, buf, buflen);
  }

+static inline char * __must_check cgroup_path_ns(struct cgroup_namespace *ns,
+						 struct cgroup *cgrp, char *buf,
+						 size_t buflen)
+{
+	return kernfs_path_from_node(ns->root_cgrp->kn, cgrp->kn, buf, buflen);
+}
+
  static inline char * __must_check cgroup_path(struct cgroup *cgrp, char *buf,
  					      size_t buflen)
  {
-	return kernfs_path(cgrp->kn, buf, buflen);
+	return cgroup_path_ns(current->nsproxy->cgroup_ns, cgrp, buf, buflen);
  }

  static inline void pr_cont_cgroup_name(struct cgroup *cgrp)
diff --git a/include/linux/cgroup_namespace.h b/include/linux/cgroup_namespace.h
new file mode 100644
index 0000000..0b97b8d
--- /dev/null
+++ b/include/linux/cgroup_namespace.h
@@ -0,0 +1,36 @@
+#ifndef _LINUX_CGROUP_NAMESPACE_H
+#define _LINUX_CGROUP_NAMESPACE_H
+
+#include <linux/nsproxy.h>
+#include <linux/cgroup.h>
+#include <linux/types.h>
+#include <linux/user_namespace.h>
+
+extern struct cgroup_namespace init_cgroup_ns;
+
+static inline struct cgroup *current_cgroupns_root(void)
+{
+	return current->nsproxy->cgroup_ns->root_cgrp;
+}
+
+extern void free_cgroup_ns(struct cgroup_namespace *ns);
+
+static inline struct cgroup_namespace *get_cgroup_ns(
+		struct cgroup_namespace *ns)
+{
+	if (ns)
+		atomic_inc(&ns->count);
+	return ns;
+}
+
+static inline void put_cgroup_ns(struct cgroup_namespace *ns)
+{
+	if (ns && atomic_dec_and_test(&ns->count))
+		free_cgroup_ns(ns);
+}
+
+extern struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
+					       struct user_namespace *user_ns,
+					       struct cgroup_namespace *old_ns);
+
+#endif  /* _LINUX_CGROUP_NAMESPACE_H */
diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
index 35fa08f..ac0d65b 100644
--- a/include/linux/nsproxy.h
+++ b/include/linux/nsproxy.h
@@ -8,6 +8,7 @@ struct mnt_namespace;
  struct uts_namespace;
  struct ipc_namespace;
  struct pid_namespace;
+struct cgroup_namespace;
  struct fs_struct;

  /*
@@ -33,6 +34,7 @@ struct nsproxy {
  	struct mnt_namespace *mnt_ns;
  	struct pid_namespace *pid_ns_for_children;
  	struct net 	     *net_ns;
+	struct cgroup_namespace *cgroup_ns;
  };
  extern struct nsproxy init_nsproxy;

diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
index 34a1e10..e56dd73 100644
--- a/include/linux/proc_ns.h
+++ b/include/linux/proc_ns.h
@@ -6,6 +6,8 @@

  struct pid_namespace;
  struct nsproxy;
+struct task_struct;
+struct inode;

  struct proc_ns_operations {
  	const char *name;
@@ -27,6 +29,7 @@ extern const struct proc_ns_operations ipcns_operations;
  extern const struct proc_ns_operations pidns_operations;
  extern const struct proc_ns_operations userns_operations;
  extern const struct proc_ns_operations mntns_operations;
+extern const struct proc_ns_operations cgroupns_operations;

  /*
   * We always define these enumerators
@@ -37,6 +40,7 @@ enum {
  	PROC_UTS_INIT_INO	= 0xEFFFFFFEU,
  	PROC_USER_INIT_INO	= 0xEFFFFFFDU,
  	PROC_PID_INIT_INO	= 0xEFFFFFFCU,
+	PROC_CGROUP_INIT_INO	= 0xEFFFFFFBU,
  };

  #ifdef CONFIG_PROC_FS
diff --git a/kernel/Makefile b/kernel/Makefile
index dc5c775..d9731e2 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -50,7 +50,7 @@ obj-$(CONFIG_BSD_PROCESS_ACCT) += acct.o
  obj-$(CONFIG_KEXEC) += kexec.o
  obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
  obj-$(CONFIG_COMPAT) += compat.o
-obj-$(CONFIG_CGROUPS) += cgroup.o
+obj-$(CONFIG_CGROUPS) += cgroup.o cgroup_namespace.o
  obj-$(CONFIG_CGROUP_FREEZER) += cgroup_freezer.o
  obj-$(CONFIG_CPUSETS) += cpuset.o
  obj-$(CONFIG_UTS_NS) += utsname.o
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 9c622b9..7e5d597 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -57,6 +57,8 @@
  #include <linux/vmalloc.h> /* TODO: replace with more sophisticated array */
  #include <linux/kthread.h>
  #include <linux/delay.h>
+#include <linux/proc_ns.h>
+#include <linux/cgroup_namespace.h>

  #include <linux/atomic.h>

@@ -195,6 +197,15 @@ static void kill_css(struct cgroup_subsys_state *css);
  static int cgroup_addrm_files(struct cgroup *cgrp, struct cftype cfts[],
  			      bool is_add);

+struct cgroup_namespace init_cgroup_ns = {
+	.count = {
+		.counter = 1,
+	},
+	.proc_inum = PROC_CGROUP_INIT_INO,
+	.user_ns = &init_user_ns,
+	.root_cgrp = &cgrp_dfl_root.cgrp,
+};
+
  /* IDR wrappers which synchronize using cgroup_idr_lock */
  static int cgroup_idr_alloc(struct idr *idr, void *ptr, int start, int end,
  			    gfp_t gfp_mask)
@@ -4550,6 +4561,7 @@ static int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
  	parent = cgroup_kn_lock_live(parent_kn);
  	if (!parent)
  		return -ENODEV;
+
  	root = parent->root;

  	/* allocate the cgroup and its ID, 0 is reserved for the root */
@@ -4922,6 +4934,8 @@ int __init cgroup_init(void)
  	unsigned long key;
  	int ssid, err;

+	get_user_ns(init_cgroup_ns.user_ns);
+
  	BUG_ON(cgroup_init_cftypes(NULL, cgroup_dfl_base_files));
  	BUG_ON(cgroup_init_cftypes(NULL, cgroup_legacy_base_files));

diff --git a/kernel/cgroup_namespace.c b/kernel/cgroup_namespace.c
new file mode 100644
index 0000000..0e0ef3a
--- /dev/null
+++ b/kernel/cgroup_namespace.c
@@ -0,0 +1,127 @@
+/*
+ *  Copyright (C) 2014 Google Inc.
+ *
+ *  Author: Aditya Kali (adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org)
+ *
+ *  This program is free software; you can redistribute it and/or modify it
+ *  under the terms of the GNU General Public License as published by the Free
+ *  Software Foundation, version 2 of the License.
+ */
+
+#include <linux/cgroup.h>
+#include <linux/cgroup_namespace.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/nsproxy.h>
+#include <linux/proc_ns.h>
+
+static struct cgroup_namespace *alloc_cgroup_ns(void)
+{
+	struct cgroup_namespace *new_ns;
+
+	new_ns = kzalloc(sizeof(struct cgroup_namespace), GFP_KERNEL);
+	if (new_ns)
+		atomic_set(&new_ns->count, 1);
+	return new_ns;
+}
+
+void free_cgroup_ns(struct cgroup_namespace *ns)
+{
+	cgroup_put(ns->root_cgrp);
+	put_user_ns(ns->user_ns);
+	proc_free_inum(ns->proc_inum);
+	kfree(ns);
+}
+EXPORT_SYMBOL(free_cgroup_ns);
+
+struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
+					struct user_namespace *user_ns,
+					struct cgroup_namespace *old_ns)
+{
+	struct cgroup_namespace *new_ns = NULL;
+	struct cgroup *cgrp = NULL;
+	int err;
+
+	BUG_ON(!old_ns);
+
+	if (!(flags & CLONE_NEWCGROUP))
+		return get_cgroup_ns(old_ns);
+
+	/* Allow only sysadmin to create cgroup namespace. */
+	err = -EPERM;
+	if (!ns_capable(user_ns, CAP_SYS_ADMIN))
+		goto err_out;
+
+	/* CGROUPNS only virtualizes the cgroup path on the unified hierarchy.
+	 */
+	cgrp = get_task_cgroup(current);
+
+	err = -ENOMEM;
+	new_ns = alloc_cgroup_ns();
+	if (!new_ns)
+		goto err_out;
+
+	err = proc_alloc_inum(&new_ns->proc_inum);
+	if (err)
+		goto err_out;
+
+	new_ns->user_ns = get_user_ns(user_ns);
+	new_ns->root_cgrp = cgrp;
+
+	return new_ns;
+
+err_out:
+	if (cgrp)
+		cgroup_put(cgrp);
+	kfree(new_ns);
+	return ERR_PTR(err);
+}
+
+static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
+{
+	pr_info("setns not supported for cgroup namespace\n");
+	return -EINVAL;
+}
+
+static void *cgroupns_get(struct task_struct *task)
+{
+	struct cgroup_namespace *ns = NULL;
+	struct nsproxy *nsproxy;
+
+	task_lock(task);
+	nsproxy = task->nsproxy;
+	if (nsproxy) {
+		ns = nsproxy->cgroup_ns;
+		get_cgroup_ns(ns);
+	}
+	task_unlock(task);
+
+	return ns;
+}
+
+static void cgroupns_put(void *ns)
+{
+	put_cgroup_ns(ns);
+}
+
+static unsigned int cgroupns_inum(void *ns)
+{
+	struct cgroup_namespace *cgroup_ns = ns;
+
+	return cgroup_ns->proc_inum;
+}
+
+const struct proc_ns_operations cgroupns_operations = {
+	.name		= "cgroup",
+	.type		= CLONE_NEWCGROUP,
+	.get		= cgroupns_get,
+	.put		= cgroupns_put,
+	.install	= cgroupns_install,
+	.inum		= cgroupns_inum,
+};
+
+static __init int cgroup_namespaces_init(void)
+{
+	return 0;
+}
+subsys_initcall(cgroup_namespaces_init);
diff --git a/kernel/fork.c b/kernel/fork.c
index 9b7d746..d22d793 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1797,7 +1797,7 @@ static int check_unshare_flags(unsigned long unshare_flags)
  	if (unshare_flags & ~(CLONE_THREAD|CLONE_FS|CLONE_NEWNS|CLONE_SIGHAND|
  				CLONE_VM|CLONE_FILES|CLONE_SYSVSEM|
  				CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWNET|
-				CLONE_NEWUSER|CLONE_NEWPID))
+				CLONE_NEWUSER|CLONE_NEWPID|CLONE_NEWCGROUP))
  		return -EINVAL;
  	/*
  	 * Not implemented, but pretend it works if there is nothing to
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index ef42d0a..a8b1970 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -25,6 +25,7 @@
  #include <linux/proc_ns.h>
  #include <linux/file.h>
  #include <linux/syscalls.h>
+#include <linux/cgroup_namespace.h>

  static struct kmem_cache *nsproxy_cachep;

@@ -39,6 +40,7 @@ struct nsproxy init_nsproxy = {
  #ifdef CONFIG_NET
  	.net_ns			= &init_net,
  #endif
+	.cgroup_ns		= &init_cgroup_ns,
  };

  static inline struct nsproxy *create_nsproxy(void)
@@ -92,6 +94,13 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
  		goto out_pid;
  	}

+	new_nsp->cgroup_ns = copy_cgroup_ns(flags, user_ns,
+					    tsk->nsproxy->cgroup_ns);
+	if (IS_ERR(new_nsp->cgroup_ns)) {
+		err = PTR_ERR(new_nsp->cgroup_ns);
+		goto out_cgroup;
+	}
+
  	new_nsp->net_ns = copy_net_ns(flags, user_ns, tsk->nsproxy->net_ns);
  	if (IS_ERR(new_nsp->net_ns)) {
  		err = PTR_ERR(new_nsp->net_ns);
@@ -101,6 +110,9 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
  	return new_nsp;

  out_net:
+	if (new_nsp->cgroup_ns)
+		put_cgroup_ns(new_nsp->cgroup_ns);
+out_cgroup:
  	if (new_nsp->pid_ns_for_children)
  		put_pid_ns(new_nsp->pid_ns_for_children);
  out_pid:
@@ -128,7 +140,8 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
  	struct nsproxy *new_ns;

  	if (likely(!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
-			      CLONE_NEWPID | CLONE_NEWNET)))) {
+			      CLONE_NEWPID | CLONE_NEWNET |
+			      CLONE_NEWCGROUP)))) {
  		get_nsproxy(old_ns);
  		return 0;
  	}
@@ -165,6 +178,8 @@ void free_nsproxy(struct nsproxy *ns)
  		put_ipc_ns(ns->ipc_ns);
  	if (ns->pid_ns_for_children)
  		put_pid_ns(ns->pid_ns_for_children);
+	if (ns->cgroup_ns)
+		put_cgroup_ns(ns->cgroup_ns);
  	put_net(ns->net_ns);
  	kmem_cache_free(nsproxy_cachep, ns);
  }
@@ -180,7 +195,7 @@ int unshare_nsproxy_namespaces(unsigned long unshare_flags,
  	int err = 0;

  	if (!(unshare_flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
-			       CLONE_NEWNET | CLONE_NEWPID)))
+			       CLONE_NEWNET | CLONE_NEWPID | CLONE_NEWCGROUP)))
  		return 0;

  	user_ns = new_cred ? new_cred->user_ns : current_user_ns();
-- 
2.1.0.rc2.206.gedb03e5
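
The get/put pairing used by get_cgroup_ns() and put_cgroup_ns() in the
patch above can be sketched in user space with C11 atomics. This is a
rough analogue under stated assumptions, not the kernel code: struct
demo_ns and the demo_ns_*() functions are hypothetical, atomic_int stands
in for atomic_t, and the freed_flag member exists only so that the
release can be observed from outside.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct cgroup_namespace. */
struct demo_ns {
	atomic_int count;	/* reference count, starts at 1 */
	int *freed_flag;	/* set to 1 on release; observation aid only */
};

/* Allocate a namespace object holding one reference for the caller. */
static struct demo_ns *demo_ns_alloc(int *freed_flag)
{
	struct demo_ns *ns = calloc(1, sizeof(*ns));

	if (ns) {
		atomic_init(&ns->count, 1);
		ns->freed_flag = freed_flag;
	}
	return ns;
}

/* Take an extra reference; NULL is passed through unchanged. */
static struct demo_ns *demo_ns_get(struct demo_ns *ns)
{
	if (ns)
		atomic_fetch_add(&ns->count, 1);
	return ns;
}

/* Drop a reference and release the object when the last one goes away. */
static void demo_ns_put(struct demo_ns *ns)
{
	int prev;

	if (!ns)
		return;
	/* atomic_fetch_sub() returns the value before the decrement. */
	prev = atomic_fetch_sub(&ns->count, 1);
	assert(prev > 0);
	if (prev == 1) {
		if (ns->freed_flag)
			*ns->freed_flag = 1;
		free(ns);
	}
}
```

Every demo_ns_get() must be balanced by exactly one demo_ns_put(); the
object is released only by the put that drops the count from 1 to 0,
which mirrors what atomic_dec_and_test() decides in the patch.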

^ permalink raw reply related	[flat|nested] 384+ messages in thread

* Re: [PATCHv2 5/7] cgroup: introduce cgroup namespaces
@ 2014-11-04  1:56           ` Aditya Kali
  0 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-11-04  1:56 UTC (permalink / raw)
  To: tj, lizefan, serge.hallyn, luto, ebiederm, cgroups, linux-kernel,
	linux-api, mingo
  Cc: containers, jnagal


Introduce the ability to create a new cgroup namespace. The newly created
cgroup namespace remembers the cgroup of the process at the point
of creation of the cgroup namespace (referred to as the cgroupns-root).
The main purpose of the cgroup namespace is to virtualize the contents
of the /proc/self/cgroup file. Processes inside a cgroup namespace
are only able to see paths relative to their namespace root
(unless they are moved outside of their cgroupns-root, at which point
they will see a relative path from their cgroupns-root).
For a correctly set up container this enables container tools
(like libcontainer, lxc, lmctfy, etc.) to create completely virtualized
containers without leaking the system-level cgroup hierarchy to the task.
This patch only implements the 'unshare' part of the cgroupns.

Signed-off-by: Aditya Kali <adityakali@google.com>
---
  fs/proc/namespaces.c             |   1 +
  include/linux/cgroup.h           |  18 +++++-
  include/linux/cgroup_namespace.h |  36 +++++++++++
  include/linux/nsproxy.h          |   2 +
  include/linux/proc_ns.h          |   4 ++
  kernel/Makefile                  |   2 +-
  kernel/cgroup.c                  |  14 +++++
  kernel/cgroup_namespace.c        | 127 +++++++++++++++++++++++++++++++++++++++
  kernel/fork.c                    |   2 +-
  kernel/nsproxy.c                 |  19 +++++-
  10 files changed, 220 insertions(+), 5 deletions(-)
  create mode 100644 include/linux/cgroup_namespace.h
  create mode 100644 kernel/cgroup_namespace.c

diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index 8902609..55bc5da 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -32,6 +32,7 @@ static const struct proc_ns_operations *ns_entries[] = {
  	&userns_operations,
  #endif
  	&mntns_operations,
+	&cgroupns_operations,
  };

  static const struct file_operations ns_file_operations = {
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 4a0eb2d..aa86495 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -22,6 +22,8 @@
  #include <linux/seq_file.h>
  #include <linux/kernfs.h>
  #include <linux/wait.h>
+#include <linux/nsproxy.h>
+#include <linux/types.h>

  #ifdef CONFIG_CGROUPS

@@ -460,6 +462,13 @@ struct cftype {
  #endif
  };

+struct cgroup_namespace {
+	atomic_t		count;
+	unsigned int		proc_inum;
+	struct user_namespace	*user_ns;
+	struct cgroup		*root_cgrp;
+};
+
  extern struct cgroup_root cgrp_dfl_root;
  extern struct css_set init_css_set;

@@ -584,10 +593,17 @@ static inline int cgroup_name(struct cgroup *cgrp, char *buf, size_t buflen)
  	return kernfs_name(cgrp->kn, buf, buflen);
  }

+static inline char * __must_check cgroup_path_ns(struct cgroup_namespace *ns,
+						 struct cgroup *cgrp, char *buf,
+						 size_t buflen)
+{
+	return kernfs_path_from_node(ns->root_cgrp->kn, cgrp->kn, buf, buflen);
+}
+
  static inline char * __must_check cgroup_path(struct cgroup *cgrp, char *buf,
  					      size_t buflen)
  {
-	return kernfs_path(cgrp->kn, buf, buflen);
+	return cgroup_path_ns(current->nsproxy->cgroup_ns, cgrp, buf, buflen);
  }

  static inline void pr_cont_cgroup_name(struct cgroup *cgrp)
diff --git a/include/linux/cgroup_namespace.h b/include/linux/cgroup_namespace.h
new file mode 100644
index 0000000..0b97b8d
--- /dev/null
+++ b/include/linux/cgroup_namespace.h
@@ -0,0 +1,36 @@
+#ifndef _LINUX_CGROUP_NAMESPACE_H
+#define _LINUX_CGROUP_NAMESPACE_H
+
+#include <linux/nsproxy.h>
+#include <linux/cgroup.h>
+#include <linux/types.h>
+#include <linux/user_namespace.h>
+
+extern struct cgroup_namespace init_cgroup_ns;
+
+static inline struct cgroup *current_cgroupns_root(void)
+{
+	return current->nsproxy->cgroup_ns->root_cgrp;
+}
+
+extern void free_cgroup_ns(struct cgroup_namespace *ns);
+
+static inline struct cgroup_namespace *get_cgroup_ns(
+		struct cgroup_namespace *ns)
+{
+	if (ns)
+		atomic_inc(&ns->count);
+	return ns;
+}
+
+static inline void put_cgroup_ns(struct cgroup_namespace *ns)
+{
+	if (ns && atomic_dec_and_test(&ns->count))
+		free_cgroup_ns(ns);
+}
+
+extern struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
+					       struct user_namespace *user_ns,
+					       struct cgroup_namespace *old_ns);
+
+#endif  /* _LINUX_CGROUP_NAMESPACE_H */
diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
index 35fa08f..ac0d65b 100644
--- a/include/linux/nsproxy.h
+++ b/include/linux/nsproxy.h
@@ -8,6 +8,7 @@ struct mnt_namespace;
  struct uts_namespace;
  struct ipc_namespace;
  struct pid_namespace;
+struct cgroup_namespace;
  struct fs_struct;

  /*
@@ -33,6 +34,7 @@ struct nsproxy {
  	struct mnt_namespace *mnt_ns;
  	struct pid_namespace *pid_ns_for_children;
  	struct net 	     *net_ns;
+	struct cgroup_namespace *cgroup_ns;
  };
  extern struct nsproxy init_nsproxy;

diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
index 34a1e10..e56dd73 100644
--- a/include/linux/proc_ns.h
+++ b/include/linux/proc_ns.h
@@ -6,6 +6,8 @@

  struct pid_namespace;
  struct nsproxy;
+struct task_struct;
+struct inode;

  struct proc_ns_operations {
  	const char *name;
@@ -27,6 +29,7 @@ extern const struct proc_ns_operations ipcns_operations;
  extern const struct proc_ns_operations pidns_operations;
  extern const struct proc_ns_operations userns_operations;
  extern const struct proc_ns_operations mntns_operations;
+extern const struct proc_ns_operations cgroupns_operations;

  /*
   * We always define these enumerators
@@ -37,6 +40,7 @@ enum {
  	PROC_UTS_INIT_INO	= 0xEFFFFFFEU,
  	PROC_USER_INIT_INO	= 0xEFFFFFFDU,
  	PROC_PID_INIT_INO	= 0xEFFFFFFCU,
+	PROC_CGROUP_INIT_INO	= 0xEFFFFFFBU,
  };

  #ifdef CONFIG_PROC_FS
diff --git a/kernel/Makefile b/kernel/Makefile
index dc5c775..d9731e2 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -50,7 +50,7 @@ obj-$(CONFIG_BSD_PROCESS_ACCT) += acct.o
  obj-$(CONFIG_KEXEC) += kexec.o
  obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
  obj-$(CONFIG_COMPAT) += compat.o
-obj-$(CONFIG_CGROUPS) += cgroup.o
+obj-$(CONFIG_CGROUPS) += cgroup.o cgroup_namespace.o
  obj-$(CONFIG_CGROUP_FREEZER) += cgroup_freezer.o
  obj-$(CONFIG_CPUSETS) += cpuset.o
  obj-$(CONFIG_UTS_NS) += utsname.o
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 9c622b9..7e5d597 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -57,6 +57,8 @@
  #include <linux/vmalloc.h> /* TODO: replace with more sophisticated array */
  #include <linux/kthread.h>
  #include <linux/delay.h>
+#include <linux/proc_ns.h>
+#include <linux/cgroup_namespace.h>

  #include <linux/atomic.h>

@@ -195,6 +197,15 @@ static void kill_css(struct cgroup_subsys_state *css);
  static int cgroup_addrm_files(struct cgroup *cgrp, struct cftype cfts[],
  			      bool is_add);

+struct cgroup_namespace init_cgroup_ns = {
+	.count = {
+		.counter = 1,
+	},
+	.proc_inum = PROC_CGROUP_INIT_INO,
+	.user_ns = &init_user_ns,
+	.root_cgrp = &cgrp_dfl_root.cgrp,
+};
+
  /* IDR wrappers which synchronize using cgroup_idr_lock */
  static int cgroup_idr_alloc(struct idr *idr, void *ptr, int start, int end,
  			    gfp_t gfp_mask)
@@ -4550,6 +4561,7 @@ static int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
  	parent = cgroup_kn_lock_live(parent_kn);
  	if (!parent)
  		return -ENODEV;
+
  	root = parent->root;

  	/* allocate the cgroup and its ID, 0 is reserved for the root */
@@ -4922,6 +4934,8 @@ int __init cgroup_init(void)
  	unsigned long key;
  	int ssid, err;

+	get_user_ns(init_cgroup_ns.user_ns);
+
  	BUG_ON(cgroup_init_cftypes(NULL, cgroup_dfl_base_files));
  	BUG_ON(cgroup_init_cftypes(NULL, cgroup_legacy_base_files));

diff --git a/kernel/cgroup_namespace.c b/kernel/cgroup_namespace.c
new file mode 100644
index 0000000..0e0ef3a
--- /dev/null
+++ b/kernel/cgroup_namespace.c
@@ -0,0 +1,127 @@
+/*
+ *  Copyright (C) 2014 Google Inc.
+ *
+ *  Author: Aditya Kali (adityakali@google.com)
+ *
+ *  This program is free software; you can redistribute it and/or modify it
+ *  under the terms of the GNU General Public License as published by the Free
+ *  Software Foundation, version 2 of the License.
+ */
+
+#include <linux/cgroup.h>
+#include <linux/cgroup_namespace.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/nsproxy.h>
+#include <linux/proc_ns.h>
+
+static struct cgroup_namespace *alloc_cgroup_ns(void)
+{
+	struct cgroup_namespace *new_ns;
+
+	new_ns = kzalloc(sizeof(struct cgroup_namespace), GFP_KERNEL);
+	if (new_ns)
+		atomic_set(&new_ns->count, 1);
+	return new_ns;
+}
+
+void free_cgroup_ns(struct cgroup_namespace *ns)
+{
+	cgroup_put(ns->root_cgrp);
+	put_user_ns(ns->user_ns);
+	proc_free_inum(ns->proc_inum);
+	kfree(ns);
+}
+EXPORT_SYMBOL(free_cgroup_ns);
+
+struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
+					struct user_namespace *user_ns,
+					struct cgroup_namespace *old_ns)
+{
+	struct cgroup_namespace *new_ns = NULL;
+	struct cgroup *cgrp = NULL;
+	int err;
+
+	BUG_ON(!old_ns);
+
+	if (!(flags & CLONE_NEWCGROUP))
+		return get_cgroup_ns(old_ns);
+
+	/* Allow only sysadmin to create cgroup namespace. */
+	err = -EPERM;
+	if (!ns_capable(user_ns, CAP_SYS_ADMIN))
+		goto err_out;
+
+	/* CGROUPNS only virtualizes the cgroup path on the unified hierarchy.
+	 */
+	cgrp = get_task_cgroup(current);
+
+	err = -ENOMEM;
+	new_ns = alloc_cgroup_ns();
+	if (!new_ns)
+		goto err_out;
+
+	err = proc_alloc_inum(&new_ns->proc_inum);
+	if (err)
+		goto err_out;
+
+	new_ns->user_ns = get_user_ns(user_ns);
+	new_ns->root_cgrp = cgrp;
+
+	return new_ns;
+
+err_out:
+	if (cgrp)
+		cgroup_put(cgrp);
+	kfree(new_ns);
+	return ERR_PTR(err);
+}
+
+static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
+{
+	pr_info("setns not supported for cgroup namespace\n");
+	return -EINVAL;
+}
+
+static void *cgroupns_get(struct task_struct *task)
+{
+	struct cgroup_namespace *ns = NULL;
+	struct nsproxy *nsproxy;
+
+	task_lock(task);
+	nsproxy = task->nsproxy;
+	if (nsproxy) {
+		ns = nsproxy->cgroup_ns;
+		get_cgroup_ns(ns);
+	}
+	task_unlock(task);
+
+	return ns;
+}
+
+static void cgroupns_put(void *ns)
+{
+	put_cgroup_ns(ns);
+}
+
+static unsigned int cgroupns_inum(void *ns)
+{
+	struct cgroup_namespace *cgroup_ns = ns;
+
+	return cgroup_ns->proc_inum;
+}
+
+const struct proc_ns_operations cgroupns_operations = {
+	.name		= "cgroup",
+	.type		= CLONE_NEWCGROUP,
+	.get		= cgroupns_get,
+	.put		= cgroupns_put,
+	.install	= cgroupns_install,
+	.inum		= cgroupns_inum,
+};
+
+static __init int cgroup_namespaces_init(void)
+{
+	return 0;
+}
+subsys_initcall(cgroup_namespaces_init);
diff --git a/kernel/fork.c b/kernel/fork.c
index 9b7d746..d22d793 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1797,7 +1797,7 @@ static int check_unshare_flags(unsigned long unshare_flags)
  	if (unshare_flags & ~(CLONE_THREAD|CLONE_FS|CLONE_NEWNS|CLONE_SIGHAND|
  				CLONE_VM|CLONE_FILES|CLONE_SYSVSEM|
  				CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWNET|
-				CLONE_NEWUSER|CLONE_NEWPID))
+				CLONE_NEWUSER|CLONE_NEWPID|CLONE_NEWCGROUP))
  		return -EINVAL;
  	/*
  	 * Not implemented, but pretend it works if there is nothing to
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index ef42d0a..a8b1970 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -25,6 +25,7 @@
  #include <linux/proc_ns.h>
  #include <linux/file.h>
  #include <linux/syscalls.h>
+#include <linux/cgroup_namespace.h>

  static struct kmem_cache *nsproxy_cachep;

@@ -39,6 +40,7 @@ struct nsproxy init_nsproxy = {
  #ifdef CONFIG_NET
  	.net_ns			= &init_net,
  #endif
+	.cgroup_ns		= &init_cgroup_ns,
  };

  static inline struct nsproxy *create_nsproxy(void)
@@ -92,6 +94,13 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
  		goto out_pid;
  	}

+	new_nsp->cgroup_ns = copy_cgroup_ns(flags, user_ns,
+					    tsk->nsproxy->cgroup_ns);
+	if (IS_ERR(new_nsp->cgroup_ns)) {
+		err = PTR_ERR(new_nsp->cgroup_ns);
+		goto out_cgroup;
+	}
+
  	new_nsp->net_ns = copy_net_ns(flags, user_ns, tsk->nsproxy->net_ns);
  	if (IS_ERR(new_nsp->net_ns)) {
  		err = PTR_ERR(new_nsp->net_ns);
@@ -101,6 +110,9 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
  	return new_nsp;

  out_net:
+	if (new_nsp->cgroup_ns)
+		put_cgroup_ns(new_nsp->cgroup_ns);
+out_cgroup:
  	if (new_nsp->pid_ns_for_children)
  		put_pid_ns(new_nsp->pid_ns_for_children);
  out_pid:
@@ -128,7 +140,8 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
  	struct nsproxy *new_ns;

  	if (likely(!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
-			      CLONE_NEWPID | CLONE_NEWNET)))) {
+			      CLONE_NEWPID | CLONE_NEWNET |
+			      CLONE_NEWCGROUP)))) {
  		get_nsproxy(old_ns);
  		return 0;
  	}
@@ -165,6 +178,8 @@ void free_nsproxy(struct nsproxy *ns)
  		put_ipc_ns(ns->ipc_ns);
  	if (ns->pid_ns_for_children)
  		put_pid_ns(ns->pid_ns_for_children);
+	if (ns->cgroup_ns)
+		put_cgroup_ns(ns->cgroup_ns);
  	put_net(ns->net_ns);
  	kmem_cache_free(nsproxy_cachep, ns);
  }
@@ -180,7 +195,7 @@ int unshare_nsproxy_namespaces(unsigned long unshare_flags,
  	int err = 0;

  	if (!(unshare_flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
-			       CLONE_NEWNET | CLONE_NEWPID)))
+			       CLONE_NEWNET | CLONE_NEWPID | CLONE_NEWCGROUP)))
  		return 0;

  	user_ns = new_cred ? new_cred->user_ns : current_user_ns();
-- 
2.1.0.rc2.206.gedb03e5
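
For readers following the lifetime rules, the reference counting that alloc_cgroup_ns() and free_cgroup_ns() rely on can be modeled in userspace. This is only a sketch under assumptions: get_cgroup_ns() and put_cgroup_ns() live in include/linux/cgroup_namespace.h, which is not quoted in this message, so the ns_model type and the ns_alloc/ns_get/ns_put names below are hypothetical stand-ins, and C11 atomics replace the kernel's atomic_t.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for struct cgroup_namespace;
 * only the reference count is modeled. */
struct ns_model {
	atomic_int count;
	int *freed_flag;	/* lets a caller observe the final put */
};

/* Mirrors alloc_cgroup_ns(): the creator holds the first reference. */
static struct ns_model *ns_alloc(int *freed_flag)
{
	struct ns_model *ns = calloc(1, sizeof(*ns));

	if (ns) {
		atomic_init(&ns->count, 1);
		ns->freed_flag = freed_flag;
	}
	return ns;
}

/* Take an extra reference; returns the same pointer for chaining. */
static struct ns_model *ns_get(struct ns_model *ns)
{
	if (ns) {
		assert(atomic_load(&ns->count) > 0);
		atomic_fetch_add(&ns->count, 1);
	}
	return ns;
}

/* Drop a reference; the holder of the last one frees the object. */
static void ns_put(struct ns_model *ns)
{
	/* atomic_fetch_sub() returns the previous value. */
	if (ns && atomic_fetch_sub(&ns->count, 1) == 1) {
		if (ns->freed_flag)
			*ns->freed_flag = 1;
		free(ns);
	}
}
```

In this model, the non-CLONE_NEWCGROUP path of copy_cgroup_ns() corresponds to ns_get() on the old namespace, and free_nsproxy() corresponds to one ns_put().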

^ permalink raw reply related	[flat|nested] 384+ messages in thread

* Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns
       [not found]     ` <1414783141-6947-8-git-send-email-adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
  2014-11-01  0:07       ` Andy Lutomirski
  2014-11-01  1:09         ` Eric W. Biederman
@ 2014-11-04  1:59       ` Aditya Kali
  2 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-11-04  1:59 UTC (permalink / raw)
  To: tj-DgEjT+Ai2ygdnm+yROfE0A, lizefan-hv44wF8Li93QT0dZR+AlfA,
	serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA, luto-kltTT9wpgjJwATOyAt5JVQ,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, mingo-H+wXaHxf7aLQT0dZR+AlfA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

This patch enables cgroup mounting inside a userns when a process
has appropriate privileges. The cgroup filesystem mounted is
rooted at the cgroupns-root. Thus, in a container setup, only
the hierarchy under the cgroupns-root is exposed inside the container.
This allows container management tools to run inside the containers
without depending on any global state.
In order to support this, a new kernfs API is added to look up the
dentry for the cgroupns-root.

Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
  fs/kernfs/mount.c      | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
  include/linux/kernfs.h |  2 ++
  kernel/cgroup.c        | 46 +++++++++++++++++++++++++++++++++++++++++++++-
  3 files changed, 95 insertions(+), 1 deletion(-)
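
A rough userspace view of what this patch permits, assuming a kernel with this series applied and a caller holding CAP_SYS_ADMIN in the user namespace that owns its cgroupns. The helper name mount_cgroupns_root and any target path are illustrative only. __DEVEL__sane_behavior is the development-era mount option that sets CGRP_ROOT_SANE_BEHAVIOR, which the parse_cgroupfs_options() hunk below requires inside a non-init cgroupns.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>

/*
 * Try to mount the default hierarchy at @target.
 * Returns 0 on success, -1 with errno set on failure.
 * With this patch, a caller in a non-init cgroupns would see the
 * mount rooted at its cgroupns-root instead of the global root.
 */
static int mount_cgroupns_root(const char *target)
{
	assert(target != NULL);

	if (mount("cgroup", target, "cgroup", 0,
		  "__DEVEL__sane_behavior") < 0) {
		fprintf(stderr, "mount %s: %s\n", target, strerror(errno));
		return -1;
	}
	return 0;
}
```

Without the option, the same call from inside a non-init cgroupns is expected to fail with EINVAL under this patch.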

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index f973ae9..efe5e15 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -62,6 +62,54 @@ struct kernfs_root *kernfs_root_from_sb(struct super_block *sb)
  	return NULL;
  }

+/**
+ * kernfs_obtain_root - get a dentry for the given kernfs_node
+ * @sb: the kernfs super_block
+ * @kn: kernfs_node for which a dentry is needed
+ *
+ * This can be used by callers which want to mount only a part of the kernfs
+ * as root of the filesystem.
+ */
+struct dentry *kernfs_obtain_root(struct super_block *sb,
+				  struct kernfs_node *kn)
+{
+	struct dentry *dentry;
+	struct inode *inode;
+
+	BUG_ON(sb->s_op != &kernfs_sops);
+
+	/* inode for the given kernfs_node should already exist. */
+	inode = ilookup(sb, kn->ino);
+	if (!inode) {
+		pr_debug("kernfs: could not get inode for '");
+		pr_cont_kernfs_path(kn);
+		pr_cont("'.\n");
+		return ERR_PTR(-EINVAL);
+	}
+
+	/* instantiate and link root dentry */
+	dentry = d_obtain_root(inode);
+	if (IS_ERR(dentry)) {
+		pr_debug("kernfs: could not get dentry for '");
+		pr_cont_kernfs_path(kn);
+		pr_cont("'.\n");
+		return dentry;
+	}
+
+	/* If this is a new dentry, set it up. We need kernfs_mutex because this
+	 * may be called by callers other than kernfs_fill_super. */
+	mutex_lock(&kernfs_mutex);
+	if (!dentry->d_fsdata) {
+		kernfs_get(kn);
+		dentry->d_fsdata = kn;
+	} else {
+		WARN_ON(dentry->d_fsdata != kn);
+	}
+	mutex_unlock(&kernfs_mutex);
+
+	return dentry;
+}
+
  static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
  {
  	struct kernfs_super_info *info = kernfs_info(sb);
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 3c2be75..b9538e0 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -274,6 +274,8 @@ void kernfs_put(struct kernfs_node *kn);
  struct kernfs_node *kernfs_node_from_dentry(struct dentry *dentry);
  struct kernfs_root *kernfs_root_from_sb(struct super_block *sb);

+struct dentry *kernfs_obtain_root(struct super_block *sb,
+				  struct kernfs_node *kn);
  struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
  				       unsigned int flags, void *priv);
  void kernfs_destroy_root(struct kernfs_root *root);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 7e5d597..8008c4c 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1389,6 +1389,14 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
  			return -ENOENT;
  	}

+	/* If inside a non-init cgroup namespace, only allow default hierarchy
+	 * to be mounted.
+	 */
+	if ((current->nsproxy->cgroup_ns != &init_cgroup_ns) &&
+	    !(opts->flags & CGRP_ROOT_SANE_BEHAVIOR)) {
+		return -EINVAL;
+	}
+
  	if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
  		pr_warn("sane_behavior: this is still under development and its behaviors will change, proceed at your own risk\n");
  		if (nr_opts != 1) {
@@ -1581,6 +1589,15 @@ static void init_cgroup_root(struct cgroup_root *root,
  		set_bit(CGRP_CPUSET_CLONE_CHILDREN, &root->cgrp.flags);
  }

+struct dentry *cgroupns_get_root(struct super_block *sb,
+				 struct cgroup_namespace *ns)
+{
+	struct dentry *nsdentry;
+
+	nsdentry = kernfs_obtain_root(sb, ns->root_cgrp->kn);
+	return nsdentry;
+}
+
  static int cgroup_setup_root(struct cgroup_root *root, unsigned int ss_mask)
  {
  	LIST_HEAD(tmp_links);
@@ -1685,6 +1702,14 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
  	int ret;
  	int i;
  	bool new_sb;
+	struct cgroup_namespace *ns =
+		get_cgroup_ns(current->nsproxy->cgroup_ns);
+
+	/* Check if the caller has permission to mount. */
+	if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) {
+		put_cgroup_ns(ns);
+		return ERR_PTR(-EPERM);
+	}

  	/*
  	 * The first time anyone tries to mount a cgroup, enable the list
@@ -1817,11 +1842,28 @@ out_free:
  	kfree(opts.release_agent);
  	kfree(opts.name);

-	if (ret)
+	if (ret) {
+		put_cgroup_ns(ns);
  		return ERR_PTR(ret);
+	}

  	dentry = kernfs_mount(fs_type, flags, root->kf_root,
  				CGROUP_SUPER_MAGIC, &new_sb);
+
+	if (!IS_ERR(dentry) && (root == &cgrp_dfl_root)) {
+		/* If this mount is for the default hierarchy in non-init cgroup
+		 * namespace, then instead of root cgroup's dentry, we return
+		 * the dentry corresponding to the cgroupns->root_cgrp.
+		 */
+		if (ns != &init_cgroup_ns) {
+			struct dentry *nsdentry;
+
+			nsdentry = cgroupns_get_root(dentry->d_sb, ns);
+			dput(dentry);
+			dentry = nsdentry;
+		}
+	}
+
  	if (IS_ERR(dentry) || !new_sb)
  		cgroup_put(&root->cgrp);

@@ -1834,6 +1876,7 @@ out_free:
  		deactivate_super(pinned_sb);
  	}

+	put_cgroup_ns(ns);
  	return dentry;
  }

@@ -1862,6 +1905,7 @@ static struct file_system_type cgroup_fs_type = {
  	.name = "cgroup",
  	.mount = cgroup_mount,
  	.kill_sb = cgroup_kill_sb,
+	.fs_flags = FS_USERNS_MOUNT,
  };

  static struct kobject *cgroup_kobj;
-- 
2.1.0.rc2.206.gedb03e5
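
kernfs_obtain_root() reports failure through the kernel's error-pointer convention, and cgroup_mount() tests the result with IS_ERR(). A small userspace model of that convention may help when reading the error paths. It is a sketch only: the three helpers imitate include/linux/err.h rather than include it, and demo_lookup() with its demo_object is a hypothetical function invented for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Encode a negative errno in the top MAX_ERRNO addresses. */
static inline void *ERR_PTR(long error)
{
	assert(error < 0 && error >= -MAX_ERRNO);
	return (void *)(intptr_t)error;
}

/* Recover the errno from an error pointer. */
static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* True only for pointers in the reserved error range. */
static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical lookup: fails with -EINVAL for negative keys. */
static int demo_object = 42;

static void *demo_lookup(int key)
{
	if (key < 0)
		return ERR_PTR(-EINVAL);
	return &demo_object;
}
```

Note that an error pointer is never NULL and NULL is not an error pointer, so a plain NULL test on such a return value does not detect failure.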

^ permalink raw reply related	[flat|nested] 384+ messages in thread

* Re: [PATCHv2 0/7] CGroup Namespaces
  2014-10-31 19:18   ` Aditya Kali
@ 2014-11-04 13:10       ` Vivek Goyal
  -1 siblings, 0 replies; 384+ messages in thread
From: Vivek Goyal @ 2014-11-04 13:10 UTC (permalink / raw)
  To: Aditya Kali
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, luto-kltTT9wpgjJwATOyAt5JVQ,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w, tj-DgEjT+Ai2ygdnm+yROfE0A,
	cgroups-u79uwXL29TY76Z2rM5mHXA, mingo-H+wXaHxf7aLQT0dZR+AlfA

On Fri, Oct 31, 2014 at 12:18:54PM -0700, Aditya Kali wrote:
[..]
>  fs/kernfs/dir.c                  | 194 ++++++++++++++++++++++++++++++++++-----
>  fs/kernfs/mount.c                |  48 ++++++++++
>  fs/proc/namespaces.c             |   1 +
>  include/linux/cgroup.h           |  41 ++++++++-
>  include/linux/cgroup_namespace.h |  36 ++++++++
>  include/linux/kernfs.h           |   5 +
>  include/linux/nsproxy.h          |   2 +
>  include/linux/proc_ns.h          |   4 +
>  include/uapi/linux/sched.h       |   3 +-
>  kernel/Makefile                  |   2 +-
>  kernel/cgroup.c                  | 108 +++++++++++++++++-----
>  kernel/cgroup_namespace.c        | 148 +++++++++++++++++++++++++++++
>  kernel/fork.c                    |   2 +-
>  kernel/nsproxy.c                 |  19 +++-

Hi Aditya,

Can we provide a documentation file for cgroup namespace behavior? Say,
Documentation/namespaces/cgroup-namespace.txt.

Namespaces are complicated and it might be a good idea to keep one .txt
file for each namespace.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns
       [not found]             ` <CAGr1F2Hd_PS_AscBGMXdZC9qkHGRUp-MeQvJksDOQkRBB3RGoA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2014-11-03 22:56                 ` Andy Lutomirski
@ 2014-11-04 13:46               ` Tejun Heo
  1 sibling, 0 replies; 384+ messages in thread
From: Tejun Heo @ 2014-11-04 13:46 UTC (permalink / raw)
  To: Aditya Kali
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andy Lutomirski,
	Eric W. Biederman, cgroups-u79uwXL29TY76Z2rM5mHXA, Ingo Molnar

Hello, Aditya.

On Mon, Nov 03, 2014 at 02:43:47PM -0800, Aditya Kali wrote:
> I agree that this is effectively bind-mounting, but doing this in kernel
> makes it really convenient for the userspace. The process that sets up the
> container doesn't need to care whether it should bind-mount cgroupfs inside
> the container or not. The tasks inside the container can mount cgroupfs on
> as-needed basis. The root container manager can simply unshare cgroupns and
> forget about the internal setup. I think this is useful just for the reason
> that it makes life much simpler for userspace.

If it's okay to require userland to just do bind mounting, I'd be far
happier with that.  cgroup mount code is already overcomplicated
because of the dynamic matching of supers to mounts when it could just
have told userland to use bind mounting.  Doesn't the host side have
to set up some of the filesystem layouts anyway?  Does it really
matter that we require the host to set up cgroup hierarchy too?

Thanks.

-- 
tejun
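
The bind-mount alternative discussed above can be sketched from the host side as follows. This is a minimal illustration under assumptions, not code from the series: the helper name bind_cgroup_subtree and any paths passed to it are hypothetical, and the caller is assumed to be a container manager that already has cgroupfs mounted on the host and knows the container's cgroup directory.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>

/*
 * Bind-mount the container's cgroup directory (@subtree, a path
 * under the host's cgroupfs mount) onto @target inside the
 * container's root filesystem.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int bind_cgroup_subtree(const char *subtree, const char *target)
{
	assert(subtree != NULL && target != NULL);

	if (mount(subtree, target, NULL, MS_BIND, NULL) < 0) {
		fprintf(stderr, "bind %s -> %s: %s\n",
			subtree, target, strerror(errno));
		return -1;
	}
	return 0;
}
```

The trade-off raised in the thread is visible here: the host must perform this step during container setup, whereas the in-kernel approach lets tasks inside the container mount cgroupfs on demand.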

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns
       [not found]           ` <CAGr1F2FuPQxLraYv7PstJ9c8H-XQsgawaAtj4AS77B+_0k2o+A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2014-11-03 23:15             ` Andy Lutomirski
@ 2014-11-04 13:57             ` Tejun Heo
  1 sibling, 0 replies; 384+ messages in thread
From: Tejun Heo @ 2014-11-04 13:57 UTC (permalink / raw)
  To: Aditya Kali
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andy Lutomirski,
	Eric W. Biederman, cgroups-u79uwXL29TY76Z2rM5mHXA, Ingo Molnar

Hello, Aditya.

On Mon, Nov 03, 2014 at 03:12:28PM -0800, Aditya Kali wrote:
> I think the sane-behavior flag is only temporary and will be removed
> anyways, right? So I didn't bother asking user to supply it. But I can
> make the change as you suggested. We just have to make sure that tasks
> inside cgroupns cannot mount non-default hierarchies as it would be a
> regression.

I'm not sure whether supporting mounting from inside a ns is even
necessary but, if it is, can't you just test against cgrp_dfl_root?
There's no reason to do anything differently for ns mounting.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns
       [not found]               ` <20141104134633.GA14014-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
@ 2014-11-04 15:00                 ` Andy Lutomirski
  0 siblings, 0 replies; 384+ messages in thread
From: Andy Lutomirski @ 2014-11-04 15:00 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Eric W. Biederman,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Ingo Molnar

On Tue, Nov 4, 2014 at 5:46 AM, Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> Hello, Aditya.
>
> On Mon, Nov 03, 2014 at 02:43:47PM -0800, Aditya Kali wrote:
>> I agree that this is effectively bind-mounting, but doing this in kernel
>> makes it really convenient for the userspace. The process that sets up the
>> container doesn't need to care whether it should bind-mount cgroupfs inside
>> the container or not. The tasks inside the container can mount cgroupfs on
>> as-needed basis. The root container manager can simply unshare cgroupns and
>> forget about the internal setup. I think this is useful just for the reason
>> that it makes life much simpler for userspace.
>
> If it's okay to require userland to just do bind mounting, I'd be far
> happier with that.  cgroup mount code is already overcomplicated
> because of the dynamic matching of supers to mounts when it could just
> have told userland to use bind mounting.  Doesn't the host side have
> to set up some of the filesystem layouts anyway?  Does it really
> matter that we require the host to set up cgroup hierarchy too?
>

Sort of, but only sort of.

You can create a container by unsharing namespaces, mounting
everything, and then calling pivot_root.  But this is unpleasant
because of the strange way that pid namespaces work -- you generally
have to fork first, so this gets tedious.  And it doesn't integrate
well with things like fstab or other container-side configuration
mechanisms.

It's nicer if you can unshare namespaces, mount the bare minimum,
pivot_root, and let the contained software do as much setup as
possible.

--Andy
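
The flow described above (unshare, fork, mount the bare minimum,
pivot_root, then hand over to the contained software) might look
roughly like the following sketch. It is only an illustration under
stated assumptions: container_enter() and child_setup() are
hypothetical names, the rootfs and init paths are supplied by the
caller, and the only mount done on the host side is the bind mount
that pivot_root itself needs.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: runs in the child, which is PID 1 of the new
 * pid namespace, and switches the root to new_root. */
static int child_setup(const char *new_root)
{
	/* Keep mount changes from propagating back to the host. */
	if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) < 0)
		return -errno;
	/* pivot_root requires new_root to be a mount point. */
	if (mount(new_root, new_root, NULL, MS_BIND | MS_REC, NULL) < 0)
		return -errno;
	if (chdir(new_root) < 0)
		return -errno;
	/* Stack the old root on the new one, then detach the old root. */
	if (syscall(SYS_pivot_root, ".", ".") < 0)
		return -errno;
	if (umount2(".", MNT_DETACH) < 0)
		return -errno;
	if (chdir("/") < 0)
		return -errno;
	return 0;
}

/* Hypothetical entry point: returns a negative errno on setup failure,
 * otherwise the exit status of the container's init. */
int container_enter(const char *new_root, const char *init)
{
	struct stat st;
	int status;
	pid_t pid;

	/* Validate the rootfs before touching any namespace. */
	if (stat(new_root, &st) < 0)
		return -errno;
	if (!S_ISDIR(st.st_mode))
		return -ENOTDIR;

	/* CLONE_NEWPID only takes effect for children, hence the fork. */
	if (unshare(CLONE_NEWNS | CLONE_NEWPID) < 0)
		return -errno;

	pid = fork();
	if (pid < 0)
		return -errno;
	if (pid == 0) {
		if (child_setup(new_root) != 0)
			_exit(1);
		/* The contained init mounts proc, cgroupfs, etc. itself. */
		execl(init, init, (char *)NULL);
		_exit(127);
	}

	if (waitpid(pid, &status, 0) < 0)
		return -errno;
	return WIFEXITED(status) ? WEXITSTATUS(status)
				 : 128 + WTERMSIG(status);
}
```

Everything beyond the root switch is left to the contained init, which
is the division of labour argued for above.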

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns
  2014-11-04 15:00                 ` Andy Lutomirski
@ 2014-11-04 15:50                     ` Serge E. Hallyn
  -1 siblings, 0 replies; 384+ messages in thread
From: Serge E. Hallyn @ 2014-11-04 15:50 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Ingo Molnar,
	Eric W. Biederman, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA

Quoting Andy Lutomirski (luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org):
> On Tue, Nov 4, 2014 at 5:46 AM, Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> > Hello, Aditya.
> >
> > On Mon, Nov 03, 2014 at 02:43:47PM -0800, Aditya Kali wrote:
> >> I agree that this is effectively bind-mounting, but doing this in kernel
> >> makes it really convenient for the userspace. The process that sets up the
> >> container doesn't need to care whether it should bind-mount cgroupfs inside
> >> the container or not. The tasks inside the container can mount cgroupfs on
> >> as-needed basis. The root container manager can simply unshare cgroupns and
> >> forget about the internal setup. I think this is useful just for the reason
> >> that it makes life much simpler for userspace.
> >
> > If it's okay to require userland to just do bind mounting, I'd be far
> > happier with that.  cgroup mount code is already overcomplicated
> > because of the dynamic matching of supers to mounts when it could just
> > have told userland to use bind mounting.  Doesn't the host side have
> > to set up some of the filesystem layouts anyway?  Does it really
> > matter that we require the host to set up cgroup hierarchy too?
> >
> 
> Sort of, but only sort of.
> 
> You can create a container by unsharing namespaces, mounting
> everything, and then calling pivot_root.  But this is unpleasant
> because of the strange way that pid namespaces work -- you generally
> have to fork first, so this gets tedious.  And it doesn't integrate
> well with things like fstab or other container-side configuration
> mechanisms.
> 
> It's nicer if you can unshare namespaces, mount the bare minimum,
> pivot_root, and let the contained software do as much setup as
> possible.

Also, the bind-mount requires the container manager to know where
the guest distro will want the cgroups mounted.

-serge

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns
       [not found]             ` <20141104135726.GB14014-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
@ 2014-11-06 17:28               ` Aditya Kali
  0 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-11-06 17:28 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andy Lutomirski,
	Eric W. Biederman, cgroups-u79uwXL29TY76Z2rM5mHXA, Ingo Molnar

On Tue, Nov 4, 2014 at 5:57 AM, Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> Hello, Aditya.
>
> On Mon, Nov 03, 2014 at 03:12:28PM -0800, Aditya Kali wrote:
>> I think the sane-behavior flag is only temporary and will be removed
>> anyways, right? So I didn't bother asking user to supply it. But I can
>> make the change as you suggested. We just have to make sure that tasks
>> inside cgroupns cannot mount non-default hierarchies as it would be a
>> regression.
>
> I'm not sure whether supporting mounting from inside a ns is even
> necessary but, if it is, can't you just test against cgrp_dfl_root?
> There's no reason to do anything differently for ns mounting.
>

I am not sure I fully understand what you mean. But we don't have a
way to test against cgrp_dfl_root while parsing mount-options. The
only way we know that the user is trying to mount a default hierarchy is
via the sane_behavior flag. So I need to test against this flag if
we want to restrict processes inside cgroupns to mounting the default
hierarchy only.
Or are you suggesting that it's OK for nsown_capable(CAP_SYS_ADMIN)
processes to mount any cgroup hierarchy (irrespective of their
cgroupns)? I assumed that this would be undesirable.

> Thanks.
>
> --
> tejun


Thanks,
-- 
Aditya
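
The option-parsing check described in this message can be modeled in
user space roughly as below. This is a sketch, not the patch itself:
has_option() and cgroupns_mount_allowed() are hypothetical names, the
boolean argument stands in for however the kernel would detect a
non-init cgroupns, and "__DEVEL__sane_behavior" is assumed to be the
option token that selects the default hierarchy.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: true if the comma-separated option string
 * contains the exact token `want`. */
static bool has_option(const char *opts, const char *want)
{
	size_t wlen = strlen(want);
	const char *p = opts;

	while (*p) {
		const char *end = strchr(p, ',');
		size_t len = end ? (size_t)(end - p) : strlen(p);

		if (len == wlen && strncmp(p, want, len) == 0)
			return true;
		if (!end)
			break;
		p = end + 1;
	}
	return false;
}

/* Model of the restriction: outside a cgroupns any hierarchy may be
 * mounted; inside one, only a mount that asks for the default
 * hierarchy (via the sane_behavior option) is allowed. */
static bool cgroupns_mount_allowed(const char *opts,
				   bool in_non_init_cgroupns)
{
	if (!in_non_init_cgroupns)
		return true;
	return has_option(opts, "__DEVEL__sane_behavior");
}
```

Because the decision is made from the option string alone, it can run
during option parsing, before any hierarchy root has been looked up.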

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv2 0/7] CGroup Namespaces
  2014-11-04 13:10       ` Vivek Goyal
@ 2014-11-06 17:33           ` Aditya Kali
  -1 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-11-06 17:33 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andy Lutomirski,
	Eric W. Biederman, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA,
	Ingo Molnar

On Tue, Nov 4, 2014 at 5:10 AM, Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> On Fri, Oct 31, 2014 at 12:18:54PM -0700, Aditya Kali wrote:
> [..]
>>  fs/kernfs/dir.c                  | 194 ++++++++++++++++++++++++++++++++++-----
>>  fs/kernfs/mount.c                |  48 ++++++++++
>>  fs/proc/namespaces.c             |   1 +
>>  include/linux/cgroup.h           |  41 ++++++++-
>>  include/linux/cgroup_namespace.h |  36 ++++++++
>>  include/linux/kernfs.h           |   5 +
>>  include/linux/nsproxy.h          |   2 +
>>  include/linux/proc_ns.h          |   4 +
>>  include/uapi/linux/sched.h       |   3 +-
>>  kernel/Makefile                  |   2 +-
>>  kernel/cgroup.c                  | 108 +++++++++++++++++-----
>>  kernel/cgroup_namespace.c        | 148 +++++++++++++++++++++++++++++
>>  kernel/fork.c                    |   2 +-
>>  kernel/nsproxy.c                 |  19 +++-
>
> Hi Aditya,
>
> Can we provide a documentation file for cgroup namespace behavior, say
> Documentation/namespaces/cgroup-namespace.txt?
>
Yes, definitely. I will add it as soon as we have a consensus on the
overall series.

> Namespaces are complicated and it might be a good idea to keep one .txt
> file for each namespace.
>
> Thanks
> Vivek


Thanks,
-- 
Aditya

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns
       [not found]                     ` <20141104155052.GA7027-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
@ 2014-11-12 17:48                       ` Aditya Kali
  0 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-11-12 17:48 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andy Lutomirski,
	Ingo Molnar, Eric W. Biederman, Tejun Heo,
	cgroups-u79uwXL29TY76Z2rM5mHXA

I agree with what Andy and Serge have to say. The ability to mount
cgroupfs inside userns also seems consistent with other kernel
interfaces like sysfs, procfs, etc.

Though it would be great if we can at least merge the rest of the
patches first while we address the mounting part.

Thanks for your feedback.

On Tue, Nov 4, 2014 at 7:50 AM, Serge E. Hallyn <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> wrote:
>
> Quoting Andy Lutomirski (luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org):
> > On Tue, Nov 4, 2014 at 5:46 AM, Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> > > Hello, Aditya.
> > >
> > > On Mon, Nov 03, 2014 at 02:43:47PM -0800, Aditya Kali wrote:
> > >> I agree that this is effectively bind-mounting, but doing this in kernel
> > >> makes it really convenient for the userspace. The process that sets up the
> > >> container doesn't need to care whether it should bind-mount cgroupfs inside
> > >> the container or not. The tasks inside the container can mount cgroupfs on
> > >> as-needed basis. The root container manager can simply unshare cgroupns and
> > >> forget about the internal setup. I think this is useful just for the reason
> > >> that it makes life much simpler for userspace.
> > >
> > > If it's okay to require userland to just do bind mounting, I'd be far
> > > happier with that.  cgroup mount code is already overcomplicated
> > > because of the dynamic matching of supers to mounts when it could just
> > > have told userland to use bind mounting.  Doesn't the host side have
> > > to set up some of the filesystem layouts anyway?  Does it really
> > > matter that we require the host to set up cgroup hierarchy too?
> > >
> >
> > Sort of, but only sort of.
> >
> > You can create a container by unsharing namespaces, mounting
> > everything, and then calling pivot_root.  But this is unpleasant
> > because of the strange way that pid namespaces work -- you generally
> > have to fork first, so this gets tedious.  And it doesn't integrate
> > well with things like fstab or other container-side configuration
> > mechanisms.
> >
> > It's nicer if you can unshare namespaces, mount the bare minimum,
> > pivot_root, and let the contained software do as much setup as
> > possible.
>
> Also, the bind-mount requires the container manager to know where
> the guest distro will want the cgroups mounted.
>
> -serge

-- 
Aditya

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv2 0/7] CGroup Namespaces
  2014-11-06 17:33           ` Aditya Kali
@ 2014-11-26 22:58               ` Richard Weinberger
  -1 siblings, 0 replies; 384+ messages in thread
From: Richard Weinberger @ 2014-11-26 22:58 UTC (permalink / raw)
  To: Aditya Kali
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andy Lutomirski,
	Eric W. Biederman, Tejun Heo, cgroups mailinglist, Ingo Molnar

On Thu, Nov 6, 2014 at 6:33 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
> On Tue, Nov 4, 2014 at 5:10 AM, Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>> On Fri, Oct 31, 2014 at 12:18:54PM -0700, Aditya Kali wrote:
>> [..]
>>>  fs/kernfs/dir.c                  | 194 ++++++++++++++++++++++++++++++++++-----
>>>  fs/kernfs/mount.c                |  48 ++++++++++
>>>  fs/proc/namespaces.c             |   1 +
>>>  include/linux/cgroup.h           |  41 ++++++++-
>>>  include/linux/cgroup_namespace.h |  36 ++++++++
>>>  include/linux/kernfs.h           |   5 +
>>>  include/linux/nsproxy.h          |   2 +
>>>  include/linux/proc_ns.h          |   4 +
>>>  include/uapi/linux/sched.h       |   3 +-
>>>  kernel/Makefile                  |   2 +-
>>>  kernel/cgroup.c                  | 108 +++++++++++++++++-----
>>>  kernel/cgroup_namespace.c        | 148 +++++++++++++++++++++++++++++
>>>  kernel/fork.c                    |   2 +-
>>>  kernel/nsproxy.c                 |  19 +++-
>>
>> Hi Aditya,
>>
>> Can we provide a documentation file for cgroup namespace behavior. Say,
>> Documentation/namespaces/cgroup-namespace.txt.
>>
> Yes, definitely. I will add it as soon as we have a consensus on the
> overall series.

Do you have a public git repository which contains your patches?

-- 
Thanks,
//richard

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv2 0/7] CGroup Namespaces
  2014-11-26 22:58               ` Richard Weinberger
@ 2014-12-02 19:14                   ` Aditya Kali
  -1 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-12-02 19:14 UTC (permalink / raw)
  To: Richard Weinberger
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andy Lutomirski,
	Eric W. Biederman, Tejun Heo, cgroups mailinglist, Ingo Molnar

On Wed, Nov 26, 2014 at 2:58 PM, Richard Weinberger
<richard.weinberger-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>
> On Thu, Nov 6, 2014 at 6:33 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
> > On Tue, Nov 4, 2014 at 5:10 AM, Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> >> On Fri, Oct 31, 2014 at 12:18:54PM -0700, Aditya Kali wrote:
> >> [..]
> >>>  fs/kernfs/dir.c                  | 194 ++++++++++++++++++++++++++++++++++-----
> >>>  fs/kernfs/mount.c                |  48 ++++++++++
> >>>  fs/proc/namespaces.c             |   1 +
> >>>  include/linux/cgroup.h           |  41 ++++++++-
> >>>  include/linux/cgroup_namespace.h |  36 ++++++++
> >>>  include/linux/kernfs.h           |   5 +
> >>>  include/linux/nsproxy.h          |   2 +
> >>>  include/linux/proc_ns.h          |   4 +
> >>>  include/uapi/linux/sched.h       |   3 +-
> >>>  kernel/Makefile                  |   2 +-
> >>>  kernel/cgroup.c                  | 108 +++++++++++++++++-----
> >>>  kernel/cgroup_namespace.c        | 148 +++++++++++++++++++++++++++++
> >>>  kernel/fork.c                    |   2 +-
> >>>  kernel/nsproxy.c                 |  19 +++-
> >>
> >> Hi Aditya,
> >>
> >> Can we provide a documentation file for cgroup namespace behavior. Say,
> >> Documentation/namespaces/cgroup-namespace.txt.
> >>
> > Yes, definitely. I will add it as soon as we have a consensus on the
> > overall series.
>
> Do you have a public git repository which contains your patches?
>

Hi, sorry for the late reply. I don't have these in a public git repo yet,
but I will try to post them on github or somewhere.
Also, I found a bug in this patchset that crashes the kernel in some
cases (when both unified and split hierarchies are mounted). I have a
fix and will send out the patches (with documentation) soon.

>
> --
> Thanks,
> //richard

Thanks,
-- 
Aditya

^ permalink raw reply	[flat|nested] 384+ messages in thread

* [PATCHv3 0/8] CGroup Namespaces
       [not found] <adityakali-cgroupns>
@ 2014-12-05  1:55   ` Aditya Kali
  2014-07-17 19:52 ` Aditya Kali
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-12-05  1:55 UTC (permalink / raw)
  To: tj-DgEjT+Ai2ygdnm+yROfE0A, lizefan-hv44wF8Li93QT0dZR+AlfA,
	serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA, luto-kltTT9wpgjJwATOyAt5JVQ,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, mingo-H+wXaHxf7aLQT0dZR+AlfA
  Cc: richard.weinberger-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Another spin of the CGroup Namespaces feature.

Changes from V2:
1. Added documentation in Documentation/cgroups/namespace.txt
2. Fixed a bug that caused a crash
3. Incorporated some other suggestions from last patchset:
   - removed use of threadgroup_lock() while creating new cgroupns
   - use task_lock() instead of rcu_read_lock() while accessing
     task->nsproxy
   - optimized setns() to own cgroupns
   - simplified code around sane-behavior mount option parsing
4. Restored ACKs from Serge Hallyn from v1 on a few patches that have
   not changed since then.

Changes from V1:
1. No pinning of processes within cgroupns. Tasks can be freely moved
   across cgroups even outside of their cgroupns-root. Usual DAC/MAC policies
   apply as before.
2. Path in /proc/<pid>/cgroup is now always shown and is relative to
   cgroupns-root. So path can contain '/..' strings depending on cgroupns-root
   of the reader and cgroup of <pid>.
3. setns() does not require the process to first move under target
   cgroupns-root.

Changes from RFC (V0):
1. setns support for cgroupns
2. 'mount -t cgroup cgroup <mntpt>' from inside a cgroupns now
   mounts the cgroup hierarchy with cgroupns-root as the filesystem root.
3. writes to cgroup files outside of cgroupns-root are not allowed
4. visibility of /proc/<pid>/cgroup is further restricted by not showing
   anything if the <pid> is in a sibling cgroupns and its cgroup falls outside
   your cgroupns-root.

---
 Documentation/cgroups/namespace.txt | 147 +++++++++++++++++++++++++++
 fs/kernfs/dir.c                     | 195 ++++++++++++++++++++++++++++++++----
 fs/kernfs/mount.c                   |  48 +++++++++
 fs/proc/namespaces.c                |   1 +
 include/linux/cgroup.h              |  52 +++++++++-
 include/linux/cgroup_namespace.h    |  36 +++++++
 include/linux/kernfs.h              |   5 +
 include/linux/nsproxy.h             |   2 +
 include/linux/proc_ns.h             |   4 +
 include/uapi/linux/sched.h          |   3 +-
 kernel/Makefile                     |   2 +-
 kernel/cgroup.c                     | 106 +++++++++++++++-----
 kernel/cgroup_namespace.c           | 140 ++++++++++++++++++++++++++
 kernel/fork.c                       |   2 +-
 kernel/nsproxy.c                    |  19 +++-
 15 files changed, 711 insertions(+), 51 deletions(-)
 create mode 100644 Documentation/cgroups/namespace.txt
 create mode 100644 include/linux/cgroup_namespace.h
 create mode 100644 kernel/cgroup_namespace.c

[PATCHv3 1/8] kernfs: Add API to generate relative kernfs path
[PATCHv3 2/8] sched: new clone flag CLONE_NEWCGROUP for cgroup
[PATCHv3 3/8] cgroup: add function to get task's cgroup on default
[PATCHv3 4/8] cgroup: export cgroup_get() and cgroup_put()
[PATCHv3 5/8] cgroup: introduce cgroup namespaces
[PATCHv3 6/8] cgroup: cgroup namespace setns support
[PATCHv3 7/8] cgroup: mount cgroupns-root when inside non-init cgroupns
[PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces

^ permalink raw reply	[flat|nested] 384+ messages in thread

* [PATCHv3 1/8] kernfs: Add API to generate relative kernfs path
       [not found]   ` <1417744550-6461-1-git-send-email-adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2014-12-05  1:55     ` Aditya Kali
  2014-12-05  1:55       ` Aditya Kali
                       ` (7 subsequent siblings)
  8 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-12-05  1:55 UTC (permalink / raw)
  To: tj-DgEjT+Ai2ygdnm+yROfE0A, lizefan-hv44wF8Li93QT0dZR+AlfA,
	serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA, luto-kltTT9wpgjJwATOyAt5JVQ,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, mingo-H+wXaHxf7aLQT0dZR+AlfA
  Cc: richard.weinberger-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

The new function kernfs_path_from_node() generates and returns the
kernfs path of a given kernfs_node relative to a given reference
kernfs_node, which need not be an ancestor.

Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
 fs/kernfs/dir.c        | 195 +++++++++++++++++++++++++++++++++++++++++++------
 include/linux/kernfs.h |   3 +
 2 files changed, 177 insertions(+), 21 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 1c77193..cb225a7 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -44,28 +44,159 @@ static int kernfs_name_locked(struct kernfs_node *kn, char *buf, size_t buflen)
 	return strlcpy(buf, kn->parent ? kn->name : "/", buflen);
 }
 
-static char * __must_check kernfs_path_locked(struct kernfs_node *kn, char *buf,
-					      size_t buflen)
+/**
+ * kernfs_node_depth - compute depth of the kernfs node from root.
+ * The root node itself is considered to be at depth 0.
+ */
+static size_t kernfs_node_depth(struct kernfs_node *kn)
 {
-	char *p = buf + buflen;
+	size_t depth = 0;
+
+	BUG_ON(!kn);
+	while (kn->parent) {
+		depth++;
+		kn = kn->parent;
+	}
+	return depth;
+}
+
+/**
+ * kernfs_path_from_node_locked - find a relative path from @kn_from to @kn_to
+ * @kn_from: reference node of the path
+ * @kn_to: kernfs node to which path is needed
+ * @buf: buffer to copy the path into
+ * @buflen: size of @buf
+ *
+ * We need to handle a couple of scenarios here:
+ * [1] when @kn_from is an ancestor of @kn_to at some level
+ * kn_from: /n1/n2/n3
+ * kn_to:   /n1/n2/n3/n4/n5
+ * result:  /n4/n5
+ *
+ * [2] when @kn_from is on a different hierarchy and we need to find common
+ * ancestor between @kn_from and @kn_to.
+ * kn_from: /n1/n2/n3/n4
+ * kn_to:   /n1/n2/n5
+ * result:  /../../n5
+ * OR
+ * kn_from: /n1/n2/n3/n4/n5   [depth=5]
+ * kn_to:   /n1/n2/n3         [depth=3]
+ * result:  /../..
+ */
+static char * __must_check kernfs_path_from_node_locked(
+	struct kernfs_node *kn_from,
+	struct kernfs_node *kn_to,
+	char *buf,
+	size_t buflen)
+{
+	char *p = buf;
+	struct kernfs_node *kn;
+	size_t depth_from = 0, depth_to, d;
 	int len;
 
-	*--p = '\0';
+	/* We need at least 2 bytes to write "/\0". */
+	BUG_ON(buflen < 2);
 
-	do {
-		len = strlen(kn->name);
-		if (p - buf < len + 1) {
-			buf[0] = '\0';
-			p = NULL;
-			break;
+	/* Short-circuit the easy cases: same node, or kn_to is the root. */
+	if ((kn_from == kn_to) || (!kn_from && !kn_to->parent)) {
+		*p = '/';
+		*(p + 1) = '\0';
+		return p;
+	}
+
+	/* We can find the relative path only if both the nodes belong to the
+	 * same kernfs root.
+	 */
+	if (kn_from) {
+		BUG_ON(kernfs_root(kn_from) != kernfs_root(kn_to));
+		depth_from = kernfs_node_depth(kn_from);
+	}
+
+	depth_to = kernfs_node_depth(kn_to);
+
+	/* We compose path from left to right. So first write out all possible
+	 * "/.." strings needed to reach from 'kn_from' to the common ancestor.
+	 */
+	if (kn_from) {
+		while (depth_from > depth_to) {
+			len = strlen("/..");
+			if ((buflen - (p - buf)) < len + 1) {
+				/* buffer not big enough. */
+				buf[0] = '\0';
+				return NULL;
+			}
+			memcpy(p, "/..", len);
+			p += len;
+			*p = '\0';
+			--depth_from;
+			kn_from = kn_from->parent;
 		}
+
+		d = depth_to;
+		kn = kn_to;
+		while (depth_from < d) {
+			kn = kn->parent;
+			d--;
+		}
+
+		/* 'kn_from' and 'kn' are now at the same depth. Add more
+		 * "/.."s until we reach the common ancestor. In the worst
+		 * case, the root node is the common ancestor.
+		 */
+		while (depth_from > 0) {
+			/* If we reached common ancestor, stop. */
+			if (kn_from == kn)
+				break;
+			len = strlen("/..");
+			if ((buflen - (p - buf)) < len + 1) {
+				/* buffer not big enough. */
+				buf[0] = '\0';
+				return NULL;
+			}
+			memcpy(p, "/..", len);
+			p += len;
+			*p = '\0';
+			--depth_from;
+			kn_from = kn_from->parent;
+			kn = kn->parent;
+		}
+	}
+
+	/* Figure out how many bytes we need to write the path.
+	 */
+	d = depth_to;
+	kn = kn_to;
+	len = 0;
+	while (depth_from < d) {
+		/* Account for "/<name>". */
+		len += strlen(kn->name) + 1;
+		kn = kn->parent;
+		--d;
+	}
+
+	if ((buflen - (p - buf)) < len + 1) {
+		/* buffer not big enough. */
+		buf[0] = '\0';
+		return NULL;
+	}
+
+	/* We have enough space. Move 'p' ahead by computed length and start
+	 * writing node names into buffer.
+	 */
+	p += len;
+	*p = '\0';
+	d = depth_to;
+	kn = kn_to;
+	while (d > depth_from) {
+		len = strlen(kn->name);
 		p -= len;
 		memcpy(p, kn->name, len);
 		*--p = '/';
 		kn = kn->parent;
-	} while (kn && kn->parent);
+		--d;
+	}
 
-	return p;
+	return buf;
 }
 
 /**
@@ -92,26 +223,48 @@ int kernfs_name(struct kernfs_node *kn, char *buf, size_t buflen)
 }
 
 /**
- * kernfs_path - build full path of a given node
+ * kernfs_path_from_node - build path of node @kn relative to @kn_root.
+ * @kn_root: parent kernfs_node relative to which we need to build the path
  * @kn: kernfs_node of interest
- * @buf: buffer to copy @kn's name into
+ * @buf: buffer to copy @kn's path into
  * @buflen: size of @buf
  *
- * Builds and returns the full path of @kn in @buf of @buflen bytes.  The
- * path is built from the end of @buf so the returned pointer usually
- * doesn't match @buf.  If @buf isn't long enough, @buf is nul terminated
+ * Builds and returns @kn's path relative to @kn_root. @kn_root and @kn must
+ * be on the same kernfs-root. If @kn_root is not parent of @kn, then a relative
+ * path (which includes '..'s) as needed to reach from @kn_root to @kn is
+ * returned.
+ * The path may be built from the end of @buf so the returned pointer may not
+ * match @buf.  If @buf isn't long enough, @buf is nul terminated
  * and %NULL is returned.
  */
-char *kernfs_path(struct kernfs_node *kn, char *buf, size_t buflen)
+char *kernfs_path_from_node(struct kernfs_node *kn_root, struct kernfs_node *kn,
+			    char *buf, size_t buflen)
 {
 	unsigned long flags;
 	char *p;
 
 	spin_lock_irqsave(&kernfs_rename_lock, flags);
-	p = kernfs_path_locked(kn, buf, buflen);
+	p = kernfs_path_from_node_locked(kn_root, kn, buf, buflen);
 	spin_unlock_irqrestore(&kernfs_rename_lock, flags);
 	return p;
 }
+EXPORT_SYMBOL_GPL(kernfs_path_from_node);
+
+/**
+ * kernfs_path - build full path of a given node
+ * @kn: kernfs_node of interest
+ * @buf: buffer to copy @kn's name into
+ * @buflen: size of @buf
+ *
+ * Builds and returns the full path of @kn in @buf of @buflen bytes.  The
+ * path is built from the end of @buf so the returned pointer usually
+ * doesn't match @buf.  If @buf isn't long enough, @buf is nul terminated
+ * and %NULL is returned.
+ */
+char *kernfs_path(struct kernfs_node *kn, char *buf, size_t buflen)
+{
+	return kernfs_path_from_node(NULL, kn, buf, buflen);
+}
 EXPORT_SYMBOL_GPL(kernfs_path);
 
 /**
@@ -145,8 +298,8 @@ void pr_cont_kernfs_path(struct kernfs_node *kn)
 
 	spin_lock_irqsave(&kernfs_rename_lock, flags);
 
-	p = kernfs_path_locked(kn, kernfs_pr_cont_buf,
-			       sizeof(kernfs_pr_cont_buf));
+	p = kernfs_path_from_node_locked(NULL, kn, kernfs_pr_cont_buf,
+					 sizeof(kernfs_pr_cont_buf));
 	if (p)
 		pr_cont("%s", p);
 	else
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 30faf79..3c2be75 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -258,6 +258,9 @@ static inline bool kernfs_ns_enabled(struct kernfs_node *kn)
 }
 
 int kernfs_name(struct kernfs_node *kn, char *buf, size_t buflen);
+char * __must_check kernfs_path_from_node(struct kernfs_node *kn_root,
+					  struct kernfs_node *kn, char *buf,
+					  size_t buflen);
 char * __must_check kernfs_path(struct kernfs_node *kn, char *buf,
 				size_t buflen);
 void pr_cont_kernfs_name(struct kernfs_node *kn);
-- 
2.2.0.rc0.207.ga3a616c
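
For readers who want to try the path logic outside the kernel, the
following is a rough userspace model of what the patch above computes.
It is a sketch under stated assumptions, not the kernel code: struct
node is a hypothetical stand-in for struct kernfs_node (name and parent
only), there is no locking, and both nodes are assumed to belong to the
same tree. It reproduces the three scenarios listed in the
kernfs_path_from_node_locked() comment.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct kernfs_node. */
struct node {
	const char *name;
	struct node *parent;	/* NULL for the root */
};

/* Depth from the root; the root itself is at depth 0. */
static size_t node_depth(const struct node *n)
{
	size_t depth = 0;

	while (n->parent) {
		depth++;
		n = n->parent;
	}
	return depth;
}

/*
 * Write the path of @to relative to @from into @buf.
 * Returns @buf, or NULL (with buf[0] = '\0') if @buf is too small.
 * Assumes @from and @to are non-NULL and in the same tree.
 */
static char *rel_path(const struct node *from, const struct node *to,
		      char *buf, size_t buflen)
{
	const struct node *a = from, *b = to, *n;
	size_t da = node_depth(from), db = node_depth(to);
	size_t ups = 0, names_len = 0, pos, d;

	assert(buflen >= 2);

	/* Level the two nodes; every step up from @from costs one "/..". */
	while (da > db) { a = a->parent; da--; ups++; }
	while (db > da) { b = b->parent; db--; }

	/* Climb in lockstep until the common ancestor (depth da) is hit. */
	while (a != b) { a = a->parent; b = b->parent; da--; ups++; }

	/* Bytes needed for the "/<name>" components below the ancestor. */
	for (n = to, d = node_depth(to); d > da; n = n->parent, d--)
		names_len += strlen(n->name) + 1;

	if (ups == 0 && names_len == 0) {	/* @from == @to */
		buf[0] = '/';
		buf[1] = '\0';
		return buf;
	}
	if (3 * ups + names_len + 1 > buflen) {
		buf[0] = '\0';
		return NULL;
	}

	/* "/.." prefixes go left to right, names are filled right to left. */
	for (pos = 0, d = 0; d < ups; d++, pos += 3)
		memcpy(buf + pos, "/..", 3);
	pos += names_len;
	buf[pos] = '\0';
	for (n = to, d = node_depth(to); d > da; n = n->parent, d--) {
		size_t len = strlen(n->name);

		pos -= len;
		memcpy(buf + pos, n->name, len);
		buf[--pos] = '/';
	}
	return buf;
}
```

Unlike the kernel function, this model does not accept a NULL @from;
passing the tree's root node as @from yields the absolute path instead.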

^ permalink raw reply related	[flat|nested] 384+ messages in thread

* [PATCHv3 1/8] kernfs: Add API to generate relative kernfs path
@ 2014-12-05  1:55     ` Aditya Kali
  0 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-12-05  1:55 UTC (permalink / raw)
  To: tj-DgEjT+Ai2ygdnm+yROfE0A, lizefan-hv44wF8Li93QT0dZR+AlfA,
	serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA, luto-kltTT9wpgjJwATOyAt5JVQ,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, mingo-H+wXaHxf7aLQT0dZR+AlfA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	jnagal-hpIqsD4AKlfQT0dZR+AlfA, vgoyal-H+wXaHxf7aLQT0dZR+AlfA,
	richard.weinberger-Re5JQEeQqe8AvxtiuMwx3w, Aditya Kali

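The new function kernfs_path_from_node() generates and returns the
kernfs path of a given kernfs_node relative to a given reference
kernfs_node (normally one of its ancestors). If the reference node is
not an ancestor, the returned path contains ".." components.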

Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
 fs/kernfs/dir.c        | 195 +++++++++++++++++++++++++++++++++++++++++++------
 include/linux/kernfs.h |   3 +
 2 files changed, 177 insertions(+), 21 deletions(-)
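
The scenarios listed in the comment above kernfs_path_from_node_locked()
can be tried out in userspace. What follows is only a minimal sketch under
stated assumptions, not the kernel code: struct node, MAX_DEPTH,
node_depth(), append() and path_from_node() are hypothetical stand-ins,
there is no locking, and both nodes are assumed to live in the same tree.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_DEPTH 64	/* assumed depth limit, for this sketch only */

/* Hypothetical userspace stand-in for struct kernfs_node. */
struct node {
	const char *name;
	struct node *parent;	/* NULL for the root */
};

/* Depth of @n below the root; the root itself is at depth 0. */
static size_t node_depth(const struct node *n)
{
	size_t depth = 0;

	while (n->parent) {
		depth++;
		n = n->parent;
	}
	return depth;
}

/* Append @s at offset *@pos; returns 0, or -1 if @s plus NUL does not fit. */
static int append(char *buf, size_t buflen, size_t *pos, const char *s)
{
	size_t len = strlen(s);

	if (buflen - *pos < len + 1)
		return -1;
	memcpy(buf + *pos, s, len + 1);
	*pos += len;
	return 0;
}

/*
 * Write the path of @to relative to @from into @buf.
 * Returns @buf, or NULL (with @buf emptied) if @buf is too small.
 */
static char *path_from_node(const struct node *from, const struct node *to,
			    char *buf, size_t buflen)
{
	const struct node *chain[MAX_DEPTH];
	size_t df = node_depth(from), dt = node_depth(to);
	size_t n = 0, pos = 0;

	assert(buflen >= 2 && dt <= MAX_DEPTH);
	buf[0] = '\0';

	/* Walk both nodes up until they meet at the common ancestor. */
	while (from != to) {
		int up_from = df >= dt;	/* decide before depths change */
		int up_to = dt >= df;

		if (up_from) {
			/* Each step up on the @from side costs one "/..". */
			if (append(buf, buflen, &pos, "/.."))
				goto too_small;
			from = from->parent;
			df--;
		}
		if (up_to) {
			/* Remember the names to print on the way down. */
			chain[n++] = to;
			to = to->parent;
			dt--;
		}
	}

	/* Emit "/<name>" from the common ancestor down to the target. */
	while (n > 0) {
		n--;
		if (append(buf, buflen, &pos, "/") ||
		    append(buf, buflen, &pos, chain[n]->name))
			goto too_small;
	}

	/* @from and @to were the same node: report it as "/". */
	if (pos == 0)
		append(buf, buflen, &pos, "/");
	return buf;

too_small:
	buf[0] = '\0';
	return NULL;
}
```

Unlike the patch, the sketch collects the target-side nodes in a small
array instead of measuring the length first and then filling the buffer
backwards. The resulting strings are meant to be the same for the cases
in the comment; passing the tree root as @from stands in for the patch's
kn_from == NULL case.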

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 1c77193..cb225a7 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -44,28 +44,159 @@ static int kernfs_name_locked(struct kernfs_node *kn, char *buf, size_t buflen)
 	return strlcpy(buf, kn->parent ? kn->name : "/", buflen);
 }
 
-static char * __must_check kernfs_path_locked(struct kernfs_node *kn, char *buf,
-					      size_t buflen)
+/**
+ * kernfs_node_depth - compute depth of the kernfs node from root.
+ * The root node itself is considered to be at depth 0.
+ */
+static size_t kernfs_node_depth(struct kernfs_node *kn)
 {
-	char *p = buf + buflen;
+	size_t depth = 0;
+
+	BUG_ON(!kn);
+	while (kn->parent) {
+		depth++;
+		kn = kn->parent;
+	}
+	return depth;
+}
+
+/**
+ * kernfs_path_from_node_locked - find a relative path from @kn_from to @kn_to
+ * @kn_from: reference node of the path
+ * @kn_to: kernfs node to which path is needed
+ * @buf: buffer to copy the path into
+ * @buflen: size of @buf
+ *
+ * We need to handle a couple of scenarios here:
+ * [1] when @kn_from is an ancestor of @kn_to at some level
+ * kn_from: /n1/n2/n3
+ * kn_to:   /n1/n2/n3/n4/n5
+ * result:  /n4/n5
+ *
+ * [2] when @kn_from is on a different hierarchy and we need to find common
+ * ancestor between @kn_from and @kn_to.
+ * kn_from: /n1/n2/n3/n4
+ * kn_to:   /n1/n2/n5
+ * result:  /../../n5
+ * OR
+ * kn_from: /n1/n2/n3/n4/n5   [depth=5]
+ * kn_to:   /n1/n2/n3         [depth=3]
+ * result:  /../..
+ */
+static char * __must_check kernfs_path_from_node_locked(
+	struct kernfs_node *kn_from,
+	struct kernfs_node *kn_to,
+	char *buf,
+	size_t buflen)
+{
+	char *p = buf;
+	struct kernfs_node *kn;
+	size_t depth_from = 0, depth_to, d;
 	int len;
 
-	*--p = '\0';
+	/* We need at least 2 bytes to write "/\0". */
+	BUG_ON(buflen < 2);
 
-	do {
-		len = strlen(kn->name);
-		if (p - buf < len + 1) {
-			buf[0] = '\0';
-			p = NULL;
-			break;
+	/* Short-circuit the easy case - kn_to is the root node. */
+	if ((kn_from == kn_to) || (!kn_from && !kn_to->parent)) {
+		*p = '/';
+		*(p + 1) = '\0';
+		return p;
+	}
+
+	/* We can find the relative path only if both the nodes belong to the
+	 * same kernfs root.
+	 */
+	if (kn_from) {
+		BUG_ON(kernfs_root(kn_from) != kernfs_root(kn_to));
+		depth_from = kernfs_node_depth(kn_from);
+	}
+
+	depth_to = kernfs_node_depth(kn_to);
+
+	/* We compose path from left to right. So first write out all possible
+	 * "/.." strings needed to reach from 'kn_from' to the common ancestor.
+	 */
+	if (kn_from) {
+		while (depth_from > depth_to) {
+			len = strlen("/..");
+			if ((buflen - (p - buf)) < len + 1) {
+				/* buffer not big enough. */
+				buf[0] = '\0';
+				return NULL;
+			}
+			memcpy(p, "/..", len);
+			p += len;
+			*p = '\0';
+			--depth_from;
+			kn_from = kn_from->parent;
 		}
+
+		d = depth_to;
+		kn = kn_to;
+		while (depth_from < d) {
+			kn = kn->parent;
+			d--;
+		}
+
+		/* Now we have 'depth_from == depth_to' at this point. Add more
+		 * "/.."s until we reach common ancestor. In the worst case,
+		 * root node will be the common ancestor.
+		 */
+		while (depth_from > 0) {
+			/* If we reached common ancestor, stop. */
+			if (kn_from == kn)
+				break;
+			len = strlen("/..");
+			if ((buflen - (p - buf)) < len + 1) {
+				/* buffer not big enough. */
+				buf[0] = '\0';
+				return NULL;
+			}
+			memcpy(p, "/..", len);
+			p += len;
+			*p = '\0';
+			--depth_from;
+			kn_from = kn_from->parent;
+			kn = kn->parent;
+		}
+	}
+
+	/* Figure out how many bytes we need to write the path.
+	 */
+	d = depth_to;
+	kn = kn_to;
+	len = 0;
+	while (depth_from < d) {
+		/* Account for "/<name>". */
+		len += strlen(kn->name) + 1;
+		kn = kn->parent;
+		--d;
+	}
+
+	if ((buflen - (p - buf)) < len + 1) {
+		/* buffer not big enough. */
+		buf[0] = '\0';
+		return NULL;
+	}
+
+	/* We have enough space. Move 'p' ahead by computed length and start
+	 * writing node names into buffer.
+	 */
+	p += len;
+	*p = '\0';
+	d = depth_to;
+	kn = kn_to;
+	while (d > depth_from) {
+		len = strlen(kn->name);
 		p -= len;
 		memcpy(p, kn->name, len);
 		*--p = '/';
 		kn = kn->parent;
-	} while (kn && kn->parent);
+		--d;
+	}
 
-	return p;
+	return buf;
 }
 
 /**
@@ -92,26 +223,48 @@ int kernfs_name(struct kernfs_node *kn, char *buf, size_t buflen)
 }
 
 /**
- * kernfs_path - build full path of a given node
+ * kernfs_path_from_node - build path of node @kn relative to @kn_root.
+ * @kn_root: parent kernfs_node relative to which we need to build the path
  * @kn: kernfs_node of interest
- * @buf: buffer to copy @kn's name into
+ * @buf: buffer to copy @kn's path into
  * @buflen: size of @buf
  *
- * Builds and returns the full path of @kn in @buf of @buflen bytes.  The
- * path is built from the end of @buf so the returned pointer usually
- * doesn't match @buf.  If @buf isn't long enough, @buf is nul terminated
+ * Builds and returns @kn's path relative to @kn_root. @kn_root and @kn must
+ * be on the same kernfs-root. If @kn_root is not an ancestor of @kn, then a
+ * relative path (which includes '..'s) as needed to reach from @kn_root to
+ * @kn is returned.
+ * The path may be built from the end of @buf so the returned pointer may not
+ * match @buf.  If @buf isn't long enough, @buf is nul terminated
  * and %NULL is returned.
  */
-char *kernfs_path(struct kernfs_node *kn, char *buf, size_t buflen)
+char *kernfs_path_from_node(struct kernfs_node *kn_root, struct kernfs_node *kn,
+			    char *buf, size_t buflen)
 {
 	unsigned long flags;
 	char *p;
 
 	spin_lock_irqsave(&kernfs_rename_lock, flags);
-	p = kernfs_path_locked(kn, buf, buflen);
+	p = kernfs_path_from_node_locked(kn_root, kn, buf, buflen);
 	spin_unlock_irqrestore(&kernfs_rename_lock, flags);
 	return p;
 }
+EXPORT_SYMBOL_GPL(kernfs_path_from_node);
+
+/**
+ * kernfs_path - build full path of a given node
+ * @kn: kernfs_node of interest
+ * @buf: buffer to copy @kn's path into
+ * @buflen: size of @buf
+ *
+ * Builds and returns the full path of @kn in @buf of @buflen bytes.  The
+ * path is built from the end of @buf so the returned pointer usually
+ * doesn't match @buf.  If @buf isn't long enough, @buf is nul terminated
+ * and %NULL is returned.
+ */
+char *kernfs_path(struct kernfs_node *kn, char *buf, size_t buflen)
+{
+	return kernfs_path_from_node(NULL, kn, buf, buflen);
+}
 EXPORT_SYMBOL_GPL(kernfs_path);
 
 /**
@@ -145,8 +298,8 @@ void pr_cont_kernfs_path(struct kernfs_node *kn)
 
 	spin_lock_irqsave(&kernfs_rename_lock, flags);
 
-	p = kernfs_path_locked(kn, kernfs_pr_cont_buf,
-			       sizeof(kernfs_pr_cont_buf));
+	p = kernfs_path_from_node_locked(NULL, kn, kernfs_pr_cont_buf,
+					 sizeof(kernfs_pr_cont_buf));
 	if (p)
 		pr_cont("%s", p);
 	else
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 30faf79..3c2be75 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -258,6 +258,9 @@ static inline bool kernfs_ns_enabled(struct kernfs_node *kn)
 }
 
 int kernfs_name(struct kernfs_node *kn, char *buf, size_t buflen);
+char * __must_check kernfs_path_from_node(struct kernfs_node *root_kn,
+					  struct kernfs_node *kn, char *buf,
+					  size_t buflen);
 char * __must_check kernfs_path(struct kernfs_node *kn, char *buf,
 				size_t buflen);
 void pr_cont_kernfs_name(struct kernfs_node *kn);
-- 
2.2.0.rc0.207.ga3a616c

^ permalink raw reply related	[flat|nested] 384+ messages in thread

* [PATCHv3 2/8] sched: new clone flag CLONE_NEWCGROUP for cgroup namespace
  2014-12-05  1:55   ` Aditya Kali
@ 2014-12-05  1:55       ` Aditya Kali
  -1 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-12-05  1:55 UTC (permalink / raw)
  To: tj-DgEjT+Ai2ygdnm+yROfE0A, lizefan-hv44wF8Li93QT0dZR+AlfA,
	serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA, luto-kltTT9wpgjJwATOyAt5JVQ,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, mingo-H+wXaHxf7aLQT0dZR+AlfA
  Cc: richard.weinberger-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

CLONE_NEWCGROUP will be used to create a new cgroup namespace.

Acked-by: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
 include/uapi/linux/sched.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)
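
From userspace, the new flag would be passed to unshare(2). The snippet
below is a hedged sketch of such a caller, assuming a kernel that carries
this series; unshare_cgroupns() is a hypothetical helper name, and the
fallback #define merely repeats the value from this patch for libc headers
that do not know the flag yet.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Value taken from this patch; newer headers may already define it. */
#ifndef CLONE_NEWCGROUP
#define CLONE_NEWCGROUP 0x02000000
#endif

/*
 * Ask the kernel to put the calling process into a new cgroup namespace.
 * Returns 0 on success or a negative errno, for example -EPERM without
 * CAP_SYS_ADMIN or -EINVAL on a kernel that rejects the flag.
 */
static int unshare_cgroupns(void)
{
	if (syscall(SYS_unshare, (long)CLONE_NEWCGROUP) == -1)
		return -errno;
	return 0;
}
```

The raw syscall is used here only so that the sketch does not depend on
the libc wrapper being declared; calling unshare(CLONE_NEWCGROUP) from
<sched.h> should be equivalent.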

diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 34f9d73..2f90d00 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -21,8 +21,7 @@
 #define CLONE_DETACHED		0x00400000	/* Unused, ignored */
 #define CLONE_UNTRACED		0x00800000	/* set if the tracing process can't force CLONE_PTRACE on this clone */
 #define CLONE_CHILD_SETTID	0x01000000	/* set the TID in the child */
-/* 0x02000000 was previously the unused CLONE_STOPPED (Start in stopped state)
-   and is now available for re-use. */
+#define CLONE_NEWCGROUP		0x02000000	/* New cgroup namespace */
 #define CLONE_NEWUTS		0x04000000	/* New utsname group? */
 #define CLONE_NEWIPC		0x08000000	/* New ipcs */
 #define CLONE_NEWUSER		0x10000000	/* New user namespace */
-- 
2.2.0.rc0.207.ga3a616c

^ permalink raw reply related	[flat|nested] 384+ messages in thread

* [PATCHv3 2/8] sched: new clone flag CLONE_NEWCGROUP for cgroup namespace
@ 2014-12-05  1:55       ` Aditya Kali
  0 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-12-05  1:55 UTC (permalink / raw)
  To: tj, lizefan, serge.hallyn, luto, ebiederm, cgroups, linux-kernel,
	linux-api, mingo
  Cc: containers, jnagal, vgoyal, richard.weinberger, Aditya Kali

CLONE_NEWCGROUP will be used to create a new cgroup namespace.

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Aditya Kali <adityakali@google.com>
---
 include/uapi/linux/sched.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 34f9d73..2f90d00 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -21,8 +21,7 @@
 #define CLONE_DETACHED		0x00400000	/* Unused, ignored */
 #define CLONE_UNTRACED		0x00800000	/* set if the tracing process can't force CLONE_PTRACE on this clone */
 #define CLONE_CHILD_SETTID	0x01000000	/* set the TID in the child */
-/* 0x02000000 was previously the unused CLONE_STOPPED (Start in stopped state)
-   and is now available for re-use. */
+#define CLONE_NEWCGROUP		0x02000000	/* New cgroup namespace */
 #define CLONE_NEWUTS		0x04000000	/* New utsname group? */
 #define CLONE_NEWIPC		0x08000000	/* New ipcs */
 #define CLONE_NEWUSER		0x10000000	/* New user namespace */
-- 
2.2.0.rc0.207.ga3a616c


^ permalink raw reply related	[flat|nested] 384+ messages in thread

* [PATCHv3 3/8] cgroup: add function to get task's cgroup on default hierarchy
  2014-12-05  1:55   ` Aditya Kali
@ 2014-12-05  1:55       ` Aditya Kali
  -1 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-12-05  1:55 UTC (permalink / raw)
  To: tj-DgEjT+Ai2ygdnm+yROfE0A, lizefan-hv44wF8Li93QT0dZR+AlfA,
	serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA, luto-kltTT9wpgjJwATOyAt5JVQ,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, mingo-H+wXaHxf7aLQT0dZR+AlfA
  Cc: richard.weinberger-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

get_task_cgroup() returns the (reference counted) cgroup of the
given task on the default hierarchy.

Acked-by: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
 include/linux/cgroup.h |  1 +
 kernel/cgroup.c        | 25 +++++++++++++++++++++++++
 2 files changed, 26 insertions(+)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 9fd99f5..d6930de 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -579,6 +579,7 @@ static inline void pr_cont_cgroup_path(struct cgroup *cgrp)
 }
 
 char *task_cgroup_path(struct task_struct *task, char *buf, size_t buflen);
+struct cgroup *get_task_cgroup(struct task_struct *task);
 
 int cgroup_add_dfl_cftypes(struct cgroup_subsys *ss, struct cftype *cfts);
 int cgroup_add_legacy_cftypes(struct cgroup_subsys *ss, struct cftype *cfts);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index bb263d0..5d8fc84 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1966,6 +1966,31 @@ char *task_cgroup_path(struct task_struct *task, char *buf, size_t buflen)
 }
 EXPORT_SYMBOL_GPL(task_cgroup_path);
 
+/*
+ * get_task_cgroup - returns the cgroup of the task in the default cgroup
+ * hierarchy.
+ *
+ * @task: target task
+ * This function returns the @task's cgroup on the default cgroup hierarchy. The
+ * returned cgroup has its reference incremented (by calling cgroup_get()). So
+ * the caller must cgroup_put() the obtained reference once it is done with it.
+ */
+struct cgroup *get_task_cgroup(struct task_struct *task)
+{
+	struct cgroup *cgrp;
+
+	mutex_lock(&cgroup_mutex);
+	down_read(&css_set_rwsem);
+
+	cgrp = task_cgroup_from_root(task, &cgrp_dfl_root);
+	cgroup_get(cgrp);
+
+	up_read(&css_set_rwsem);
+	mutex_unlock(&cgroup_mutex);
+	return cgrp;
+}
+EXPORT_SYMBOL_GPL(get_task_cgroup);
+
 /* used to track tasks and other necessary states during migration */
 struct cgroup_taskset {
 	/* the src and dst cset list running through cset->mg_node */
-- 
2.2.0.rc0.207.ga3a616c

^ permalink raw reply related	[flat|nested] 384+ messages in thread

* [PATCHv3 3/8] cgroup: add function to get task's cgroup on default hierarchy
@ 2014-12-05  1:55       ` Aditya Kali
  0 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-12-05  1:55 UTC (permalink / raw)
  To: tj, lizefan, serge.hallyn, luto, ebiederm, cgroups, linux-kernel,
	linux-api, mingo
  Cc: containers, jnagal, vgoyal, richard.weinberger, Aditya Kali

get_task_cgroup() returns the (reference counted) cgroup of the
given task on the default hierarchy.

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Aditya Kali <adityakali@google.com>
---
 include/linux/cgroup.h |  1 +
 kernel/cgroup.c        | 25 +++++++++++++++++++++++++
 2 files changed, 26 insertions(+)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 9fd99f5..d6930de 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -579,6 +579,7 @@ static inline void pr_cont_cgroup_path(struct cgroup *cgrp)
 }
 
 char *task_cgroup_path(struct task_struct *task, char *buf, size_t buflen);
+struct cgroup *get_task_cgroup(struct task_struct *task);
 
 int cgroup_add_dfl_cftypes(struct cgroup_subsys *ss, struct cftype *cfts);
 int cgroup_add_legacy_cftypes(struct cgroup_subsys *ss, struct cftype *cfts);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index bb263d0..5d8fc84 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1966,6 +1966,31 @@ char *task_cgroup_path(struct task_struct *task, char *buf, size_t buflen)
 }
 EXPORT_SYMBOL_GPL(task_cgroup_path);
 
+/*
+ * get_task_cgroup - returns the cgroup of the task in the default cgroup
+ * hierarchy.
+ *
+ * @task: target task
+ * This function returns the @task's cgroup on the default cgroup hierarchy. The
+ * returned cgroup has its reference incremented (by calling cgroup_get()). So
+ * the caller must cgroup_put() the obtained reference once it is done with it.
+ */
+struct cgroup *get_task_cgroup(struct task_struct *task)
+{
+	struct cgroup *cgrp;
+
+	mutex_lock(&cgroup_mutex);
+	down_read(&css_set_rwsem);
+
+	cgrp = task_cgroup_from_root(task, &cgrp_dfl_root);
+	cgroup_get(cgrp);
+
+	up_read(&css_set_rwsem);
+	mutex_unlock(&cgroup_mutex);
+	return cgrp;
+}
+EXPORT_SYMBOL_GPL(get_task_cgroup);
+
 /* used to track tasks and other necessary states during migration */
 struct cgroup_taskset {
 	/* the src and dst cset list running through cset->mg_node */
-- 
2.2.0.rc0.207.ga3a616c


^ permalink raw reply related	[flat|nested] 384+ messages in thread

* [PATCHv3 4/8] cgroup: export cgroup_get() and cgroup_put()
  2014-12-05  1:55   ` Aditya Kali
@ 2014-12-05  1:55       ` Aditya Kali
  -1 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-12-05  1:55 UTC (permalink / raw)
  To: tj-DgEjT+Ai2ygdnm+yROfE0A, lizefan-hv44wF8Li93QT0dZR+AlfA,
	serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA, luto-kltTT9wpgjJwATOyAt5JVQ,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, mingo-H+wXaHxf7aLQT0dZR+AlfA
  Cc: richard.weinberger-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Move cgroup_get(), cgroup_tryget(), cgroup_put() and cgroup_is_dead()
into cgroup.h as static inlines so that they can be called from other
files.

Acked-by: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
 include/linux/cgroup.h | 22 ++++++++++++++++++++++
 kernel/cgroup.c        | 22 ----------------------
 2 files changed, 22 insertions(+), 22 deletions(-)
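
For readers unfamiliar with the get/tryget/put convention these helpers
follow, here is a small userspace model. It is a sketch under stated
assumptions: struct obj and the obj_*() names are hypothetical, a plain
C11 atomic counter stands in for the css reference count, and the
CSS_ONLINE check done by cgroup_get() is not modelled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a reference-counted object. */
struct obj {
	atomic_int refcnt;	/* 0 means the last reference is gone */
	int released;		/* set when the last reference is dropped */
};

/* Unconditional get: the caller must already know the object is alive. */
static void obj_get(struct obj *o)
{
	int old = atomic_fetch_add(&o->refcnt, 1);

	assert(old > 0);
	(void)old;
}

/* Conditional get: fails rather than reviving a count that reached 0. */
static bool obj_tryget(struct obj *o)
{
	int old = atomic_load(&o->refcnt);

	while (old > 0) {
		/* On failure @old is reloaded and the loop re-checks it. */
		if (atomic_compare_exchange_weak(&o->refcnt, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop one reference; the final put releases the object. */
static void obj_put(struct obj *o)
{
	if (atomic_fetch_sub(&o->refcnt, 1) == 1)
		o->released = 1;
}
```

Every successful obj_get() or obj_tryget() must be paired with exactly
one obj_put(), which is the same contract get_task_cgroup() in patch 3/8
places on its callers.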

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index d6930de..6e7533b 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -521,6 +521,28 @@ static inline bool cgroup_on_dfl(const struct cgroup *cgrp)
 	return cgrp->root == &cgrp_dfl_root;
 }
 
+/* convenient tests for these bits */
+static inline bool cgroup_is_dead(const struct cgroup *cgrp)
+{
+	return !(cgrp->self.flags & CSS_ONLINE);
+}
+
+static inline void cgroup_get(struct cgroup *cgrp)
+{
+	WARN_ON_ONCE(cgroup_is_dead(cgrp));
+	css_get(&cgrp->self);
+}
+
+static inline bool cgroup_tryget(struct cgroup *cgrp)
+{
+	return css_tryget(&cgrp->self);
+}
+
+static inline void cgroup_put(struct cgroup *cgrp)
+{
+	css_put(&cgrp->self);
+}
+
 /* no synchronization, the result can only be used as a hint */
 static inline bool cgroup_has_tasks(struct cgroup *cgrp)
 {
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 5d8fc84..e12d36e 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -321,12 +321,6 @@ out_unlock:
 	return css;
 }
 
-/* convenient tests for these bits */
-static inline bool cgroup_is_dead(const struct cgroup *cgrp)
-{
-	return !(cgrp->self.flags & CSS_ONLINE);
-}
-
 struct cgroup_subsys_state *of_css(struct kernfs_open_file *of)
 {
 	struct cgroup *cgrp = of->kn->parent->priv;
@@ -1039,22 +1033,6 @@ static umode_t cgroup_file_mode(const struct cftype *cft)
 	return mode;
 }
 
-static void cgroup_get(struct cgroup *cgrp)
-{
-	WARN_ON_ONCE(cgroup_is_dead(cgrp));
-	css_get(&cgrp->self);
-}
-
-static bool cgroup_tryget(struct cgroup *cgrp)
-{
-	return css_tryget(&cgrp->self);
-}
-
-static void cgroup_put(struct cgroup *cgrp)
-{
-	css_put(&cgrp->self);
-}
-
 /**
  * cgroup_calc_child_subsys_mask - calculate child_subsys_mask
  * @cgrp: the target cgroup
-- 
2.2.0.rc0.207.ga3a616c

^ permalink raw reply related	[flat|nested] 384+ messages in thread

* [PATCHv3 4/8] cgroup: export cgroup_get() and cgroup_put()
@ 2014-12-05  1:55       ` Aditya Kali
  0 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-12-05  1:55 UTC (permalink / raw)
  To: tj, lizefan, serge.hallyn, luto, ebiederm, cgroups, linux-kernel,
	linux-api, mingo
  Cc: containers, jnagal, vgoyal, richard.weinberger, Aditya Kali

Move cgroup_get(), cgroup_tryget(), cgroup_put() and cgroup_is_dead()
into cgroup.h as static inlines so that they can be called from other
files.

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Aditya Kali <adityakali@google.com>
---
 include/linux/cgroup.h | 22 ++++++++++++++++++++++
 kernel/cgroup.c        | 22 ----------------------
 2 files changed, 22 insertions(+), 22 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index d6930de..6e7533b 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -521,6 +521,28 @@ static inline bool cgroup_on_dfl(const struct cgroup *cgrp)
 	return cgrp->root == &cgrp_dfl_root;
 }
 
+/* convenient tests for these bits */
+static inline bool cgroup_is_dead(const struct cgroup *cgrp)
+{
+	return !(cgrp->self.flags & CSS_ONLINE);
+}
+
+static inline void cgroup_get(struct cgroup *cgrp)
+{
+	WARN_ON_ONCE(cgroup_is_dead(cgrp));
+	css_get(&cgrp->self);
+}
+
+static inline bool cgroup_tryget(struct cgroup *cgrp)
+{
+	return css_tryget(&cgrp->self);
+}
+
+static inline void cgroup_put(struct cgroup *cgrp)
+{
+	css_put(&cgrp->self);
+}
+
 /* no synchronization, the result can only be used as a hint */
 static inline bool cgroup_has_tasks(struct cgroup *cgrp)
 {
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 5d8fc84..e12d36e 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -321,12 +321,6 @@ out_unlock:
 	return css;
 }
 
-/* convenient tests for these bits */
-static inline bool cgroup_is_dead(const struct cgroup *cgrp)
-{
-	return !(cgrp->self.flags & CSS_ONLINE);
-}
-
 struct cgroup_subsys_state *of_css(struct kernfs_open_file *of)
 {
 	struct cgroup *cgrp = of->kn->parent->priv;
@@ -1039,22 +1033,6 @@ static umode_t cgroup_file_mode(const struct cftype *cft)
 	return mode;
 }
 
-static void cgroup_get(struct cgroup *cgrp)
-{
-	WARN_ON_ONCE(cgroup_is_dead(cgrp));
-	css_get(&cgrp->self);
-}
-
-static bool cgroup_tryget(struct cgroup *cgrp)
-{
-	return css_tryget(&cgrp->self);
-}
-
-static void cgroup_put(struct cgroup *cgrp)
-{
-	css_put(&cgrp->self);
-}
-
 /**
  * cgroup_calc_child_subsys_mask - calculate child_subsys_mask
  * @cgrp: the target cgroup
-- 
2.2.0.rc0.207.ga3a616c


^ permalink raw reply related	[flat|nested] 384+ messages in thread

* [PATCHv3 5/8] cgroup: introduce cgroup namespaces
  2014-12-05  1:55   ` Aditya Kali
@ 2014-12-05  1:55       ` Aditya Kali
  -1 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-12-05  1:55 UTC (permalink / raw)
  To: tj-DgEjT+Ai2ygdnm+yROfE0A, lizefan-hv44wF8Li93QT0dZR+AlfA,
	serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA, luto-kltTT9wpgjJwATOyAt5JVQ,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, mingo-H+wXaHxf7aLQT0dZR+AlfA
  Cc: richard.weinberger-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Introduce the ability to create a new cgroup namespace. A newly created
cgroup namespace remembers the cgroup of the process at the point
of creation of the namespace (referred to as the cgroupns-root).
The main purpose of a cgroup namespace is to virtualize the contents
of the /proc/self/cgroup file. Processes inside a cgroup namespace
can only see paths relative to their namespace root
(unless they are moved outside of their cgroupns-root, at which point
they will see a relative path, containing ".." components, from their
cgroupns-root).
For a correctly set up container, this enables container tools
(like libcontainer, lxc, lmctfy, etc.) to create completely virtualized
containers without leaking the system-level cgroup hierarchy to the task.
This patch only implements the 'unshare' part of cgroupns.

Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
 fs/proc/namespaces.c             |   1 +
 include/linux/cgroup.h           |  29 ++++++++-
 include/linux/cgroup_namespace.h |  36 +++++++++++
 include/linux/nsproxy.h          |   2 +
 include/linux/proc_ns.h          |   4 ++
 kernel/Makefile                  |   2 +-
 kernel/cgroup.c                  |  13 ++++
 kernel/cgroup_namespace.c        | 127 +++++++++++++++++++++++++++++++++++++++
 kernel/fork.c                    |   2 +-
 kernel/nsproxy.c                 |  19 +++++-
 10 files changed, 230 insertions(+), 5 deletions(-)
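
Since the patch registers cgroupns_operations under the name "cgroup",
the namespace should show up as /proc/<pid>/ns/cgroup, with a link target
of the usual "<name>:[<inum>]" form. The sketch below shows how a tool
might read it; parse_cgroupns_link() and current_cgroupns_inum() are
hypothetical helpers, and the link format is assumed to match the other
namespace entries.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Inode number this patch reserves for the initial cgroup namespace. */
#define PROC_CGROUP_INIT_INO 0xEFFFFFFBU

/*
 * Parse a namespace link target of the form "cgroup:[<inum>]".
 * Returns 0 and stores the inode number, or -EINVAL if malformed.
 */
static int parse_cgroupns_link(const char *target, unsigned int *inum)
{
	char close;

	if (sscanf(target, "cgroup:[%u%c", inum, &close) != 2 || close != ']')
		return -EINVAL;
	return 0;
}

/*
 * Look up the cgroup namespace inode of the calling process.
 * Returns 0 on success or a negative errno, e.g. -ENOENT on a kernel
 * without cgroup namespace support.
 */
static int current_cgroupns_inum(unsigned int *inum)
{
	char target[64];
	ssize_t len;

	len = readlink("/proc/self/ns/cgroup", target, sizeof(target) - 1);
	if (len < 0)
		return -errno;
	target[len] = '\0';	/* readlink() does not NUL-terminate */
	return parse_cgroupns_link(target, inum);
}
```

Two processes share a cgroup namespace exactly when they report the same
inode number; a process that never unshared should report
PROC_CGROUP_INIT_INO.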

diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index 8902609..55bc5da 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -32,6 +32,7 @@ static const struct proc_ns_operations *ns_entries[] = {
 	&userns_operations,
 #endif
 	&mntns_operations,
+	&cgroupns_operations,
 };
 
 static const struct file_operations ns_file_operations = {
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 6e7533b..94a5a0c 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -22,6 +22,8 @@
 #include <linux/seq_file.h>
 #include <linux/kernfs.h>
 #include <linux/wait.h>
+#include <linux/nsproxy.h>
+#include <linux/types.h>
 
 #ifdef CONFIG_CGROUPS
 
@@ -460,6 +462,13 @@ struct cftype {
 #endif
 };
 
+struct cgroup_namespace {
+	atomic_t		count;
+	unsigned int		proc_inum;
+	struct user_namespace	*user_ns;
+	struct cgroup		*root_cgrp;
+};
+
 extern struct cgroup_root cgrp_dfl_root;
 extern struct css_set init_css_set;
 
@@ -584,10 +593,28 @@ static inline int cgroup_name(struct cgroup *cgrp, char *buf, size_t buflen)
 	return kernfs_name(cgrp->kn, buf, buflen);
 }
 
+static inline char * __must_check cgroup_path_ns(struct cgroup_namespace *ns,
+						 struct cgroup *cgrp, char *buf,
+						 size_t buflen)
+{
+	if (ns) {
+		BUG_ON(!cgroup_on_dfl(cgrp));
+		return kernfs_path_from_node(ns->root_cgrp->kn, cgrp->kn, buf,
+					     buflen);
+	} else {
+		return kernfs_path(cgrp->kn, buf, buflen);
+	}
+}
+
 static inline char * __must_check cgroup_path(struct cgroup *cgrp, char *buf,
 					      size_t buflen)
 {
-	return kernfs_path(cgrp->kn, buf, buflen);
+	if (cgroup_on_dfl(cgrp)) {
+		return cgroup_path_ns(current->nsproxy->cgroup_ns, cgrp, buf,
+				      buflen);
+	} else {
+		return cgroup_path_ns(NULL, cgrp, buf, buflen);
+	}
 }
 
 static inline void pr_cont_cgroup_name(struct cgroup *cgrp)
diff --git a/include/linux/cgroup_namespace.h b/include/linux/cgroup_namespace.h
new file mode 100644
index 0000000..0b97b8d
--- /dev/null
+++ b/include/linux/cgroup_namespace.h
@@ -0,0 +1,36 @@
+#ifndef _LINUX_CGROUP_NAMESPACE_H
+#define _LINUX_CGROUP_NAMESPACE_H
+
+#include <linux/nsproxy.h>
+#include <linux/cgroup.h>
+#include <linux/types.h>
+#include <linux/user_namespace.h>
+
+extern struct cgroup_namespace init_cgroup_ns;
+
+static inline struct cgroup *current_cgroupns_root(void)
+{
+	return current->nsproxy->cgroup_ns->root_cgrp;
+}
+
+extern void free_cgroup_ns(struct cgroup_namespace *ns);
+
+static inline struct cgroup_namespace *get_cgroup_ns(
+		struct cgroup_namespace *ns)
+{
+	if (ns)
+		atomic_inc(&ns->count);
+	return ns;
+}
+
+static inline void put_cgroup_ns(struct cgroup_namespace *ns)
+{
+	if (ns && atomic_dec_and_test(&ns->count))
+		free_cgroup_ns(ns);
+}
+
+extern struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
+					       struct user_namespace *user_ns,
+					       struct cgroup_namespace *old_ns);
+
+#endif  /* _LINUX_CGROUP_NAMESPACE_H */
diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
index 35fa08f..ac0d65b 100644
--- a/include/linux/nsproxy.h
+++ b/include/linux/nsproxy.h
@@ -8,6 +8,7 @@ struct mnt_namespace;
 struct uts_namespace;
 struct ipc_namespace;
 struct pid_namespace;
+struct cgroup_namespace;
 struct fs_struct;
 
 /*
@@ -33,6 +34,7 @@ struct nsproxy {
 	struct mnt_namespace *mnt_ns;
 	struct pid_namespace *pid_ns_for_children;
 	struct net 	     *net_ns;
+	struct cgroup_namespace *cgroup_ns;
 };
 extern struct nsproxy init_nsproxy;
 
diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
index 34a1e10..e56dd73 100644
--- a/include/linux/proc_ns.h
+++ b/include/linux/proc_ns.h
@@ -6,6 +6,8 @@
 
 struct pid_namespace;
 struct nsproxy;
+struct task_struct;
+struct inode;
 
 struct proc_ns_operations {
 	const char *name;
@@ -27,6 +29,7 @@ extern const struct proc_ns_operations ipcns_operations;
 extern const struct proc_ns_operations pidns_operations;
 extern const struct proc_ns_operations userns_operations;
 extern const struct proc_ns_operations mntns_operations;
+extern const struct proc_ns_operations cgroupns_operations;
 
 /*
  * We always define these enumerators
@@ -37,6 +40,7 @@ enum {
 	PROC_UTS_INIT_INO	= 0xEFFFFFFEU,
 	PROC_USER_INIT_INO	= 0xEFFFFFFDU,
 	PROC_PID_INIT_INO	= 0xEFFFFFFCU,
+	PROC_CGROUP_INIT_INO	= 0xEFFFFFFBU,
 };
 
 #ifdef CONFIG_PROC_FS
diff --git a/kernel/Makefile b/kernel/Makefile
index dc5c775..d9731e2 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -50,7 +50,7 @@ obj-$(CONFIG_BSD_PROCESS_ACCT) += acct.o
 obj-$(CONFIG_KEXEC) += kexec.o
 obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
 obj-$(CONFIG_COMPAT) += compat.o
-obj-$(CONFIG_CGROUPS) += cgroup.o
+obj-$(CONFIG_CGROUPS) += cgroup.o cgroup_namespace.o
 obj-$(CONFIG_CGROUP_FREEZER) += cgroup_freezer.o
 obj-$(CONFIG_CPUSETS) += cpuset.o
 obj-$(CONFIG_UTS_NS) += utsname.o
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index e12d36e..b1ae6d9 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -57,6 +57,8 @@
 #include <linux/vmalloc.h> /* TODO: replace with more sophisticated array */
 #include <linux/kthread.h>
 #include <linux/delay.h>
+#include <linux/proc_ns.h>
+#include <linux/cgroup_namespace.h>
 
 #include <linux/atomic.h>
 
@@ -195,6 +197,15 @@ static void kill_css(struct cgroup_subsys_state *css);
 static int cgroup_addrm_files(struct cgroup *cgrp, struct cftype cfts[],
 			      bool is_add);
 
+struct cgroup_namespace init_cgroup_ns = {
+	.count = {
+		.counter = 1,
+	},
+	.proc_inum = PROC_CGROUP_INIT_INO,
+	.user_ns = &init_user_ns,
+	.root_cgrp = &cgrp_dfl_root.cgrp,
+};
+
 /* IDR wrappers which synchronize using cgroup_idr_lock */
 static int cgroup_idr_alloc(struct idr *idr, void *ptr, int start, int end,
 			    gfp_t gfp_mask)
@@ -4989,6 +5000,8 @@ int __init cgroup_init(void)
 	unsigned long key;
 	int ssid, err;
 
+	get_user_ns(init_cgroup_ns.user_ns);
+
 	BUG_ON(cgroup_init_cftypes(NULL, cgroup_dfl_base_files));
 	BUG_ON(cgroup_init_cftypes(NULL, cgroup_legacy_base_files));
 
diff --git a/kernel/cgroup_namespace.c b/kernel/cgroup_namespace.c
new file mode 100644
index 0000000..0e0ef3a
--- /dev/null
+++ b/kernel/cgroup_namespace.c
@@ -0,0 +1,127 @@
+/*
+ *  Copyright (C) 2014 Google Inc.
+ *
+ *  Author: Aditya Kali (adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org)
+ *
+ *  This program is free software; you can redistribute it and/or modify it
+ *  under the terms of the GNU General Public License as published by the Free
+ *  Software Foundation, version 2 of the License.
+ */
+
+#include <linux/cgroup.h>
+#include <linux/cgroup_namespace.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/nsproxy.h>
+#include <linux/proc_ns.h>
+
+static struct cgroup_namespace *alloc_cgroup_ns(void)
+{
+	struct cgroup_namespace *new_ns;
+
+	new_ns = kzalloc(sizeof(struct cgroup_namespace), GFP_KERNEL);
+	if (new_ns)
+		atomic_set(&new_ns->count, 1);
+	return new_ns;
+}
+
+void free_cgroup_ns(struct cgroup_namespace *ns)
+{
+	cgroup_put(ns->root_cgrp);
+	put_user_ns(ns->user_ns);
+	proc_free_inum(ns->proc_inum);
+	kfree(ns);
+}
+EXPORT_SYMBOL(free_cgroup_ns);
+
+struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
+					struct user_namespace *user_ns,
+					struct cgroup_namespace *old_ns)
+{
+	struct cgroup_namespace *new_ns = NULL;
+	struct cgroup *cgrp = NULL;
+	int err;
+
+	BUG_ON(!old_ns);
+
+	if (!(flags & CLONE_NEWCGROUP))
+		return get_cgroup_ns(old_ns);
+
+	/* Allow only sysadmin to create cgroup namespace. */
+	err = -EPERM;
+	if (!ns_capable(user_ns, CAP_SYS_ADMIN))
+		goto err_out;
+
+	/* CGROUPNS only virtualizes the cgroup path on the unified hierarchy.
+	 */
+	cgrp = get_task_cgroup(current);
+
+	err = -ENOMEM;
+	new_ns = alloc_cgroup_ns();
+	if (!new_ns)
+		goto err_out;
+
+	err = proc_alloc_inum(&new_ns->proc_inum);
+	if (err)
+		goto err_out;
+
+	new_ns->user_ns = get_user_ns(user_ns);
+	new_ns->root_cgrp = cgrp;
+
+	return new_ns;
+
+err_out:
+	if (cgrp)
+		cgroup_put(cgrp);
+	kfree(new_ns);
+	return ERR_PTR(err);
+}
+
+static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
+{
+	pr_info("setns not supported for cgroup namespace");
+	return -EINVAL;
+}
+
+static void *cgroupns_get(struct task_struct *task)
+{
+	struct cgroup_namespace *ns = NULL;
+	struct nsproxy *nsproxy;
+
+	task_lock(task);
+	nsproxy = task->nsproxy;
+	if (nsproxy) {
+		ns = nsproxy->cgroup_ns;
+		get_cgroup_ns(ns);
+	}
+	task_unlock(task);
+
+	return ns;
+}
+
+static void cgroupns_put(void *ns)
+{
+	put_cgroup_ns(ns);
+}
+
+static unsigned int cgroupns_inum(void *ns)
+{
+	struct cgroup_namespace *cgroup_ns = ns;
+
+	return cgroup_ns->proc_inum;
+}
+
+const struct proc_ns_operations cgroupns_operations = {
+	.name		= "cgroup",
+	.type		= CLONE_NEWCGROUP,
+	.get		= cgroupns_get,
+	.put		= cgroupns_put,
+	.install	= cgroupns_install,
+	.inum		= cgroupns_inum,
+};
+
+static __init int cgroup_namespaces_init(void)
+{
+	return 0;
+}
+subsys_initcall(cgroup_namespaces_init);
diff --git a/kernel/fork.c b/kernel/fork.c
index 9b7d746..d22d793 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1797,7 +1797,7 @@ static int check_unshare_flags(unsigned long unshare_flags)
 	if (unshare_flags & ~(CLONE_THREAD|CLONE_FS|CLONE_NEWNS|CLONE_SIGHAND|
 				CLONE_VM|CLONE_FILES|CLONE_SYSVSEM|
 				CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWNET|
-				CLONE_NEWUSER|CLONE_NEWPID))
+				CLONE_NEWUSER|CLONE_NEWPID|CLONE_NEWCGROUP))
 		return -EINVAL;
 	/*
 	 * Not implemented, but pretend it works if there is nothing to
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index ef42d0a..a8b1970 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -25,6 +25,7 @@
 #include <linux/proc_ns.h>
 #include <linux/file.h>
 #include <linux/syscalls.h>
+#include <linux/cgroup_namespace.h>
 
 static struct kmem_cache *nsproxy_cachep;
 
@@ -39,6 +40,7 @@ struct nsproxy init_nsproxy = {
 #ifdef CONFIG_NET
 	.net_ns			= &init_net,
 #endif
+	.cgroup_ns		= &init_cgroup_ns,
 };
 
 static inline struct nsproxy *create_nsproxy(void)
@@ -92,6 +94,13 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
 		goto out_pid;
 	}
 
+	new_nsp->cgroup_ns = copy_cgroup_ns(flags, user_ns,
+					    tsk->nsproxy->cgroup_ns);
+	if (IS_ERR(new_nsp->cgroup_ns)) {
+		err = PTR_ERR(new_nsp->cgroup_ns);
+		goto out_cgroup;
+	}
+
 	new_nsp->net_ns = copy_net_ns(flags, user_ns, tsk->nsproxy->net_ns);
 	if (IS_ERR(new_nsp->net_ns)) {
 		err = PTR_ERR(new_nsp->net_ns);
@@ -101,6 +110,9 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
 	return new_nsp;
 
 out_net:
+	if (new_nsp->cgroup_ns)
+		put_cgroup_ns(new_nsp->cgroup_ns);
+out_cgroup:
 	if (new_nsp->pid_ns_for_children)
 		put_pid_ns(new_nsp->pid_ns_for_children);
 out_pid:
@@ -128,7 +140,8 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
 	struct nsproxy *new_ns;
 
 	if (likely(!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
-			      CLONE_NEWPID | CLONE_NEWNET)))) {
+			      CLONE_NEWPID | CLONE_NEWNET |
+			      CLONE_NEWCGROUP)))) {
 		get_nsproxy(old_ns);
 		return 0;
 	}
@@ -165,6 +178,8 @@ void free_nsproxy(struct nsproxy *ns)
 		put_ipc_ns(ns->ipc_ns);
 	if (ns->pid_ns_for_children)
 		put_pid_ns(ns->pid_ns_for_children);
+	if (ns->cgroup_ns)
+		put_cgroup_ns(ns->cgroup_ns);
 	put_net(ns->net_ns);
 	kmem_cache_free(nsproxy_cachep, ns);
 }
@@ -180,7 +195,7 @@ int unshare_nsproxy_namespaces(unsigned long unshare_flags,
 	int err = 0;
 
 	if (!(unshare_flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
-			       CLONE_NEWNET | CLONE_NEWPID)))
+			       CLONE_NEWNET | CLONE_NEWPID | CLONE_NEWCGROUP)))
 		return 0;
 
 	user_ns = new_cred ? new_cred->user_ns : current_user_ns();
-- 
2.2.0.rc0.207.ga3a616c
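
A note for readers following along: cgroup_path_ns() in the patch above reports a cgroup's path relative to ns->root_cgrp instead of the hierarchy root. The userspace sketch below models only that string transformation; it is not the kernel implementation. The helper name cgroupns_relpath() is hypothetical, and the "/.." notation for cgroups that sit outside the cgroupns-root is an assumption, since the exact output of this revision's kernfs_path_from_node() is not shown in this message.

```c
#include <stddef.h>
#include <string.h>

/*
 * Userspace model (NOT kernel code) of a cgroup path as seen from inside
 * a cgroup namespace. Both @root (the cgroupns-root) and @path are
 * absolute paths in the global hierarchy without trailing slashes.
 * Returns 0 on success, -1 if @buf is too small.
 */
static int cgroupns_relpath(const char *root, const char *path,
			    char *buf, size_t buflen)
{
	size_t i, common = 0, len = 0, restlen;
	const char *p, *rest;

	/* Treat the hierarchy root "/" as an empty prefix. */
	if (strcmp(root, "/") == 0)
		root = "";
	if (strcmp(path, "/") == 0)
		path = "";

	/* Longest common prefix that ends on a path-component boundary. */
	for (i = 0; ; i++) {
		int r_end = (root[i] == '/' || root[i] == '\0');
		int p_end = (path[i] == '/' || path[i] == '\0');

		if (r_end && p_end)
			common = i;
		if (root[i] == '\0' || root[i] != path[i])
			break;
	}

	/* One "/.." per component of the root that is not shared (assumed notation). */
	for (p = root + common; *p; p++) {
		if (*p != '/')
			continue;
		if (len + 3 >= buflen)
			return -1;
		memcpy(buf + len, "/..", 3);
		len += 3;
	}

	/* Append whatever remains of the target path, including its NUL. */
	rest = path + common;
	restlen = strlen(rest);
	if (len + restlen >= buflen)
		return -1;
	memcpy(buf + len, rest, restlen + 1);
	len += restlen;

	/* A task sitting exactly at the cgroupns-root sees "/". */
	if (len == 0) {
		if (buflen < 2)
			return -1;
		memcpy(buf, "/", 2);
	}
	return 0;
}
```

With root "/batchjobs/c_job_id1", the model maps "/batchjobs/c_job_id1/sub" to "/sub" and the root itself to "/", which is the virtualized view the commit message describes for /proc/self/cgroup.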

^ permalink raw reply related	[flat|nested] 384+ messages in thread

* [PATCHv3 5/8] cgroup: introduce cgroup namespaces
@ 2014-12-05  1:55       ` Aditya Kali
  0 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-12-05  1:55 UTC (permalink / raw)
  To: tj, lizefan, serge.hallyn, luto, ebiederm, cgroups, linux-kernel,
	linux-api, mingo
  Cc: containers, jnagal, vgoyal, richard.weinberger, Aditya Kali

Introduce the ability to create a new cgroup namespace. The newly created
cgroup namespace remembers the cgroup of the process at the point
of creation of the cgroup namespace (referred to as cgroupns-root).
The main purpose of a cgroup namespace is to virtualize the contents
of the /proc/self/cgroup file. Processes inside a cgroup namespace
are only able to see paths relative to their namespace root
(if they are moved outside of their cgroupns-root, they will still
see a path relative to their cgroupns-root).
For a correctly set up container this enables container tools
(like libcontainer, lxc, lmctfy, etc.) to create completely virtualized
containers without leaking the system-level cgroup hierarchy to the task.
This patch only implements the 'unshare' part of the cgroupns.

Signed-off-by: Aditya Kali <adityakali@google.com>
---
 fs/proc/namespaces.c             |   1 +
 include/linux/cgroup.h           |  29 ++++++++-
 include/linux/cgroup_namespace.h |  36 +++++++++++
 include/linux/nsproxy.h          |   2 +
 include/linux/proc_ns.h          |   4 ++
 kernel/Makefile                  |   2 +-
 kernel/cgroup.c                  |  13 ++++
 kernel/cgroup_namespace.c        | 127 +++++++++++++++++++++++++++++++++++++++
 kernel/fork.c                    |   2 +-
 kernel/nsproxy.c                 |  19 +++++-
 10 files changed, 230 insertions(+), 5 deletions(-)

diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index 8902609..55bc5da 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -32,6 +32,7 @@ static const struct proc_ns_operations *ns_entries[] = {
 	&userns_operations,
 #endif
 	&mntns_operations,
+	&cgroupns_operations,
 };
 
 static const struct file_operations ns_file_operations = {
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 6e7533b..94a5a0c 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -22,6 +22,8 @@
 #include <linux/seq_file.h>
 #include <linux/kernfs.h>
 #include <linux/wait.h>
+#include <linux/nsproxy.h>
+#include <linux/types.h>
 
 #ifdef CONFIG_CGROUPS
 
@@ -460,6 +462,13 @@ struct cftype {
 #endif
 };
 
+struct cgroup_namespace {
+	atomic_t		count;
+	unsigned int		proc_inum;
+	struct user_namespace	*user_ns;
+	struct cgroup		*root_cgrp;
+};
+
 extern struct cgroup_root cgrp_dfl_root;
 extern struct css_set init_css_set;
 
@@ -584,10 +593,28 @@ static inline int cgroup_name(struct cgroup *cgrp, char *buf, size_t buflen)
 	return kernfs_name(cgrp->kn, buf, buflen);
 }
 
+static inline char * __must_check cgroup_path_ns(struct cgroup_namespace *ns,
+						 struct cgroup *cgrp, char *buf,
+						 size_t buflen)
+{
+	if (ns) {
+		BUG_ON(!cgroup_on_dfl(cgrp));
+		return kernfs_path_from_node(ns->root_cgrp->kn, cgrp->kn, buf,
+					     buflen);
+	} else {
+		return kernfs_path(cgrp->kn, buf, buflen);
+	}
+}
+
 static inline char * __must_check cgroup_path(struct cgroup *cgrp, char *buf,
 					      size_t buflen)
 {
-	return kernfs_path(cgrp->kn, buf, buflen);
+	if (cgroup_on_dfl(cgrp)) {
+		return cgroup_path_ns(current->nsproxy->cgroup_ns, cgrp, buf,
+				      buflen);
+	} else {
+		return cgroup_path_ns(NULL, cgrp, buf, buflen);
+	}
 }
 
 static inline void pr_cont_cgroup_name(struct cgroup *cgrp)
diff --git a/include/linux/cgroup_namespace.h b/include/linux/cgroup_namespace.h
new file mode 100644
index 0000000..0b97b8d
--- /dev/null
+++ b/include/linux/cgroup_namespace.h
@@ -0,0 +1,36 @@
+#ifndef _LINUX_CGROUP_NAMESPACE_H
+#define _LINUX_CGROUP_NAMESPACE_H
+
+#include <linux/nsproxy.h>
+#include <linux/cgroup.h>
+#include <linux/types.h>
+#include <linux/user_namespace.h>
+
+extern struct cgroup_namespace init_cgroup_ns;
+
+static inline struct cgroup *current_cgroupns_root(void)
+{
+	return current->nsproxy->cgroup_ns->root_cgrp;
+}
+
+extern void free_cgroup_ns(struct cgroup_namespace *ns);
+
+static inline struct cgroup_namespace *get_cgroup_ns(
+		struct cgroup_namespace *ns)
+{
+	if (ns)
+		atomic_inc(&ns->count);
+	return ns;
+}
+
+static inline void put_cgroup_ns(struct cgroup_namespace *ns)
+{
+	if (ns && atomic_dec_and_test(&ns->count))
+		free_cgroup_ns(ns);
+}
+
+extern struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
+					       struct user_namespace *user_ns,
+					       struct cgroup_namespace *old_ns);
+
+#endif  /* _LINUX_CGROUP_NAMESPACE_H */
diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
index 35fa08f..ac0d65b 100644
--- a/include/linux/nsproxy.h
+++ b/include/linux/nsproxy.h
@@ -8,6 +8,7 @@ struct mnt_namespace;
 struct uts_namespace;
 struct ipc_namespace;
 struct pid_namespace;
+struct cgroup_namespace;
 struct fs_struct;
 
 /*
@@ -33,6 +34,7 @@ struct nsproxy {
 	struct mnt_namespace *mnt_ns;
 	struct pid_namespace *pid_ns_for_children;
 	struct net 	     *net_ns;
+	struct cgroup_namespace *cgroup_ns;
 };
 extern struct nsproxy init_nsproxy;
 
diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
index 34a1e10..e56dd73 100644
--- a/include/linux/proc_ns.h
+++ b/include/linux/proc_ns.h
@@ -6,6 +6,8 @@
 
 struct pid_namespace;
 struct nsproxy;
+struct task_struct;
+struct inode;
 
 struct proc_ns_operations {
 	const char *name;
@@ -27,6 +29,7 @@ extern const struct proc_ns_operations ipcns_operations;
 extern const struct proc_ns_operations pidns_operations;
 extern const struct proc_ns_operations userns_operations;
 extern const struct proc_ns_operations mntns_operations;
+extern const struct proc_ns_operations cgroupns_operations;
 
 /*
  * We always define these enumerators
@@ -37,6 +40,7 @@ enum {
 	PROC_UTS_INIT_INO	= 0xEFFFFFFEU,
 	PROC_USER_INIT_INO	= 0xEFFFFFFDU,
 	PROC_PID_INIT_INO	= 0xEFFFFFFCU,
+	PROC_CGROUP_INIT_INO	= 0xEFFFFFFBU,
 };
 
 #ifdef CONFIG_PROC_FS
diff --git a/kernel/Makefile b/kernel/Makefile
index dc5c775..d9731e2 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -50,7 +50,7 @@ obj-$(CONFIG_BSD_PROCESS_ACCT) += acct.o
 obj-$(CONFIG_KEXEC) += kexec.o
 obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
 obj-$(CONFIG_COMPAT) += compat.o
-obj-$(CONFIG_CGROUPS) += cgroup.o
+obj-$(CONFIG_CGROUPS) += cgroup.o cgroup_namespace.o
 obj-$(CONFIG_CGROUP_FREEZER) += cgroup_freezer.o
 obj-$(CONFIG_CPUSETS) += cpuset.o
 obj-$(CONFIG_UTS_NS) += utsname.o
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index e12d36e..b1ae6d9 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -57,6 +57,8 @@
 #include <linux/vmalloc.h> /* TODO: replace with more sophisticated array */
 #include <linux/kthread.h>
 #include <linux/delay.h>
+#include <linux/proc_ns.h>
+#include <linux/cgroup_namespace.h>
 
 #include <linux/atomic.h>
 
@@ -195,6 +197,15 @@ static void kill_css(struct cgroup_subsys_state *css);
 static int cgroup_addrm_files(struct cgroup *cgrp, struct cftype cfts[],
 			      bool is_add);
 
+struct cgroup_namespace init_cgroup_ns = {
+	.count = {
+		.counter = 1,
+	},
+	.proc_inum = PROC_CGROUP_INIT_INO,
+	.user_ns = &init_user_ns,
+	.root_cgrp = &cgrp_dfl_root.cgrp,
+};
+
 /* IDR wrappers which synchronize using cgroup_idr_lock */
 static int cgroup_idr_alloc(struct idr *idr, void *ptr, int start, int end,
 			    gfp_t gfp_mask)
@@ -4989,6 +5000,8 @@ int __init cgroup_init(void)
 	unsigned long key;
 	int ssid, err;
 
+	get_user_ns(init_cgroup_ns.user_ns);
+
 	BUG_ON(cgroup_init_cftypes(NULL, cgroup_dfl_base_files));
 	BUG_ON(cgroup_init_cftypes(NULL, cgroup_legacy_base_files));
 
diff --git a/kernel/cgroup_namespace.c b/kernel/cgroup_namespace.c
new file mode 100644
index 0000000..0e0ef3a
--- /dev/null
+++ b/kernel/cgroup_namespace.c
@@ -0,0 +1,127 @@
+/*
+ *  Copyright (C) 2014 Google Inc.
+ *
+ *  Author: Aditya Kali (adityakali@google.com)
+ *
+ *  This program is free software; you can redistribute it and/or modify it
+ *  under the terms of the GNU General Public License as published by the Free
+ *  Software Foundation, version 2 of the License.
+ */
+
+#include <linux/cgroup.h>
+#include <linux/cgroup_namespace.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/nsproxy.h>
+#include <linux/proc_ns.h>
+
+static struct cgroup_namespace *alloc_cgroup_ns(void)
+{
+	struct cgroup_namespace *new_ns;
+
+	new_ns = kzalloc(sizeof(struct cgroup_namespace), GFP_KERNEL);
+	if (new_ns)
+		atomic_set(&new_ns->count, 1);
+	return new_ns;
+}
+
+void free_cgroup_ns(struct cgroup_namespace *ns)
+{
+	cgroup_put(ns->root_cgrp);
+	put_user_ns(ns->user_ns);
+	proc_free_inum(ns->proc_inum);
+	kfree(ns);
+}
+EXPORT_SYMBOL(free_cgroup_ns);
+
+struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
+					struct user_namespace *user_ns,
+					struct cgroup_namespace *old_ns)
+{
+	struct cgroup_namespace *new_ns = NULL;
+	struct cgroup *cgrp = NULL;
+	int err;
+
+	BUG_ON(!old_ns);
+
+	if (!(flags & CLONE_NEWCGROUP))
+		return get_cgroup_ns(old_ns);
+
+	/* Allow only sysadmin to create cgroup namespace. */
+	err = -EPERM;
+	if (!ns_capable(user_ns, CAP_SYS_ADMIN))
+		goto err_out;
+
+	/* CGROUPNS only virtualizes the cgroup path on the unified hierarchy.
+	 */
+	cgrp = get_task_cgroup(current);
+
+	err = -ENOMEM;
+	new_ns = alloc_cgroup_ns();
+	if (!new_ns)
+		goto err_out;
+
+	err = proc_alloc_inum(&new_ns->proc_inum);
+	if (err)
+		goto err_out;
+
+	new_ns->user_ns = get_user_ns(user_ns);
+	new_ns->root_cgrp = cgrp;
+
+	return new_ns;
+
+err_out:
+	if (cgrp)
+		cgroup_put(cgrp);
+	kfree(new_ns);
+	return ERR_PTR(err);
+}
+
+static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
+{
+	pr_info("setns not supported for cgroup namespace");
+	return -EINVAL;
+}
+
+static void *cgroupns_get(struct task_struct *task)
+{
+	struct cgroup_namespace *ns = NULL;
+	struct nsproxy *nsproxy;
+
+	task_lock(task);
+	nsproxy = task->nsproxy;
+	if (nsproxy) {
+		ns = nsproxy->cgroup_ns;
+		get_cgroup_ns(ns);
+	}
+	task_unlock(task);
+
+	return ns;
+}
+
+static void cgroupns_put(void *ns)
+{
+	put_cgroup_ns(ns);
+}
+
+static unsigned int cgroupns_inum(void *ns)
+{
+	struct cgroup_namespace *cgroup_ns = ns;
+
+	return cgroup_ns->proc_inum;
+}
+
+const struct proc_ns_operations cgroupns_operations = {
+	.name		= "cgroup",
+	.type		= CLONE_NEWCGROUP,
+	.get		= cgroupns_get,
+	.put		= cgroupns_put,
+	.install	= cgroupns_install,
+	.inum		= cgroupns_inum,
+};
+
+static __init int cgroup_namespaces_init(void)
+{
+	return 0;
+}
+subsys_initcall(cgroup_namespaces_init);
diff --git a/kernel/fork.c b/kernel/fork.c
index 9b7d746..d22d793 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1797,7 +1797,7 @@ static int check_unshare_flags(unsigned long unshare_flags)
 	if (unshare_flags & ~(CLONE_THREAD|CLONE_FS|CLONE_NEWNS|CLONE_SIGHAND|
 				CLONE_VM|CLONE_FILES|CLONE_SYSVSEM|
 				CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWNET|
-				CLONE_NEWUSER|CLONE_NEWPID))
+				CLONE_NEWUSER|CLONE_NEWPID|CLONE_NEWCGROUP))
 		return -EINVAL;
 	/*
 	 * Not implemented, but pretend it works if there is nothing to
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index ef42d0a..a8b1970 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -25,6 +25,7 @@
 #include <linux/proc_ns.h>
 #include <linux/file.h>
 #include <linux/syscalls.h>
+#include <linux/cgroup_namespace.h>
 
 static struct kmem_cache *nsproxy_cachep;
 
@@ -39,6 +40,7 @@ struct nsproxy init_nsproxy = {
 #ifdef CONFIG_NET
 	.net_ns			= &init_net,
 #endif
+	.cgroup_ns		= &init_cgroup_ns,
 };
 
 static inline struct nsproxy *create_nsproxy(void)
@@ -92,6 +94,13 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
 		goto out_pid;
 	}
 
+	new_nsp->cgroup_ns = copy_cgroup_ns(flags, user_ns,
+					    tsk->nsproxy->cgroup_ns);
+	if (IS_ERR(new_nsp->cgroup_ns)) {
+		err = PTR_ERR(new_nsp->cgroup_ns);
+		goto out_cgroup;
+	}
+
 	new_nsp->net_ns = copy_net_ns(flags, user_ns, tsk->nsproxy->net_ns);
 	if (IS_ERR(new_nsp->net_ns)) {
 		err = PTR_ERR(new_nsp->net_ns);
@@ -101,6 +110,9 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
 	return new_nsp;
 
 out_net:
+	if (new_nsp->cgroup_ns)
+		put_cgroup_ns(new_nsp->cgroup_ns);
+out_cgroup:
 	if (new_nsp->pid_ns_for_children)
 		put_pid_ns(new_nsp->pid_ns_for_children);
 out_pid:
@@ -128,7 +140,8 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
 	struct nsproxy *new_ns;
 
 	if (likely(!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
-			      CLONE_NEWPID | CLONE_NEWNET)))) {
+			      CLONE_NEWPID | CLONE_NEWNET |
+			      CLONE_NEWCGROUP)))) {
 		get_nsproxy(old_ns);
 		return 0;
 	}
@@ -165,6 +178,8 @@ void free_nsproxy(struct nsproxy *ns)
 		put_ipc_ns(ns->ipc_ns);
 	if (ns->pid_ns_for_children)
 		put_pid_ns(ns->pid_ns_for_children);
+	if (ns->cgroup_ns)
+		put_cgroup_ns(ns->cgroup_ns);
 	put_net(ns->net_ns);
 	kmem_cache_free(nsproxy_cachep, ns);
 }
@@ -180,7 +195,7 @@ int unshare_nsproxy_namespaces(unsigned long unshare_flags,
 	int err = 0;
 
 	if (!(unshare_flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
-			       CLONE_NEWNET | CLONE_NEWPID)))
+			       CLONE_NEWNET | CLONE_NEWPID | CLONE_NEWCGROUP)))
 		return 0;
 
 	user_ns = new_cred ? new_cred->user_ns : current_user_ns();
-- 
2.2.0.rc0.207.ga3a616c
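
For anyone wanting to try the 'unshare' path from userspace, a minimal sketch follows. It assumes a kernel carrying this series, a caller with CAP_SYS_ADMIN (per the ns_capable() check in copy_cgroup_ns()), and that CLONE_NEWCGROUP has the value 0x02000000 used in mainline Linux; the flag definition is not part of this patch, so the fallback value is an assumption. The helper names are hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sched.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef CLONE_NEWCGROUP
#define CLONE_NEWCGROUP 0x02000000	/* assumed: value used in mainline Linux */
#endif

/*
 * Copy the first line of /proc/self/cgroup into @buf. On a system with
 * several hierarchies only the first entry is returned.
 * Returns 0 on success, -1 on failure.
 */
static int read_self_cgroup(char *buf, size_t buflen)
{
	FILE *f = fopen("/proc/self/cgroup", "r");

	if (!f)
		return -1;
	if (!fgets(buf, (int)buflen, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	return 0;
}

/*
 * Record /proc/self/cgroup before and after unsharing the cgroup
 * namespace. Returns 0 on success, -1 if reading or unshare(2) fails.
 */
static int demo_unshare_cgroupns(char *before, char *after, size_t buflen)
{
	if (read_self_cgroup(before, buflen))
		return -1;

	/* Raw syscall so the sketch does not depend on a libc wrapper. */
	if (syscall(SYS_unshare, CLONE_NEWCGROUP))
		return -1;

	return read_self_cgroup(after, buflen);
}
```

On a kernel with this patch the "after" line should show the path relative to the new cgroupns-root, while on other kernels the unshare call is expected to fail (typically with EINVAL or EPERM).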


^ permalink raw reply related	[flat|nested] 384+ messages in thread

* [PATCHv3 6/8] cgroup: cgroup namespace setns support
  2014-12-05  1:55   ` Aditya Kali
@ 2014-12-05  1:55       ` Aditya Kali
  -1 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-12-05  1:55 UTC (permalink / raw)
  To: tj, lizefan, serge.hallyn, luto, ebiederm, cgroups, linux-kernel,
	linux-api, mingo
  Cc: richard.weinberger, containers

setns on a cgroup namespace is allowed only if the
task has CAP_SYS_ADMIN in its current user-namespace and
over the user-namespace associated with the target cgroupns.
No implicit cgroup changes happen when attaching to another
cgroupns. It is expected that someone moves the attaching
process under the target cgroupns-root.

Signed-off-by: Aditya Kali <adityakali@google.com>
---
 kernel/cgroup_namespace.c | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/kernel/cgroup_namespace.c b/kernel/cgroup_namespace.c
index 0e0ef3a..ee0cc51 100644
--- a/kernel/cgroup_namespace.c
+++ b/kernel/cgroup_namespace.c
@@ -79,8 +79,21 @@ err_out:
 
 static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
 {
-	pr_info("setns not supported for cgroup namespace");
-	return -EINVAL;
+	struct cgroup_namespace *cgroup_ns = ns;
+
+	if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN) ||
+	    !ns_capable(cgroup_ns->user_ns, CAP_SYS_ADMIN))
+		return -EPERM;
+
+	/* Don't need to do anything if we are attaching to our own cgroupns. */
+	if (cgroup_ns == nsproxy->cgroup_ns)
+		return 0;
+
+	get_cgroup_ns(cgroup_ns);
+	put_cgroup_ns(nsproxy->cgroup_ns);
+	nsproxy->cgroup_ns = cgroup_ns;
+
+	return 0;
 }
 
 static void *cgroupns_get(struct task_struct *task)
-- 
2.2.0.rc0.207.ga3a616c

^ permalink raw reply related	[flat|nested] 384+ messages in thread

* [PATCHv3 7/8] cgroup: mount cgroupns-root when inside non-init cgroupns
       [not found]   ` <1417744550-6461-1-git-send-email-adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
                       ` (5 preceding siblings ...)
  2014-12-05  1:55       ` Aditya Kali
@ 2014-12-05  1:55     ` Aditya Kali
  2014-12-05  1:55       ` Aditya Kali
  2014-12-05  3:20     ` [PATCHv3 0/8] CGroup Namespaces Aditya Kali
  8 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-12-05  1:55 UTC (permalink / raw)
  To: tj, lizefan, serge.hallyn, luto, ebiederm, cgroups, linux-kernel,
	linux-api, mingo
  Cc: richard.weinberger, containers

This patch enables cgroup mounting inside a userns when a process
has appropriate privileges. The cgroup filesystem mounted is
rooted at the cgroupns-root. Thus, in a container setup, only
the hierarchy under the cgroupns-root is exposed inside the container.
This allows container management tools to run inside the containers
without depending on any global state.
In order to support this, a new kernfs API is added to look up the
dentry for the cgroupns-root.

Signed-off-by: Aditya Kali <adityakali@google.com>
---
 fs/kernfs/mount.c      | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/kernfs.h |  2 ++
 kernel/cgroup.c        | 46 +++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 95 insertions(+), 1 deletion(-)

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index f973ae9..efe5e15 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -62,6 +62,54 @@ struct kernfs_root *kernfs_root_from_sb(struct super_block *sb)
 	return NULL;
 }
 
+/**
+ * kernfs_obtain_root - get a dentry for the given kernfs_node
+ * @sb: the kernfs super_block
+ * @kn: kernfs_node for which a dentry is needed
+ *
+ * This can be used by callers which want to mount only a part of the kernfs
+ * as root of the filesystem.
+ */
+struct dentry *kernfs_obtain_root(struct super_block *sb,
+				  struct kernfs_node *kn)
+{
+	struct dentry *dentry;
+	struct inode *inode;
+
+	BUG_ON(sb->s_op != &kernfs_sops);
+
+	/* inode for the given kernfs_node should already exist. */
+	inode = ilookup(sb, kn->ino);
+	if (!inode) {
+		pr_debug("kernfs: could not get inode for '");
+		pr_cont_kernfs_path(kn);
+		pr_cont("'.\n");
+		return ERR_PTR(-EINVAL);
+	}
+
+	/* instantiate and link root dentry */
+	dentry = d_obtain_root(inode);
+	if (!dentry) {
+		pr_debug("kernfs: could not get dentry for '");
+		pr_cont_kernfs_path(kn);
+		pr_cont("'.\n");
+		return ERR_PTR(-ENOMEM);
+	}
+
+	/* If this is a new dentry, set it up. We need kernfs_mutex because this
+	 * may be called by callers other than kernfs_fill_super. */
+	mutex_lock(&kernfs_mutex);
+	if (!dentry->d_fsdata) {
+		kernfs_get(kn);
+		dentry->d_fsdata = kn;
+	} else {
+		WARN_ON(dentry->d_fsdata != kn);
+	}
+	mutex_unlock(&kernfs_mutex);
+
+	return dentry;
+}
+
 static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
 {
 	struct kernfs_super_info *info = kernfs_info(sb);
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 3c2be75..b9538e0 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -274,6 +274,8 @@ void kernfs_put(struct kernfs_node *kn);
 struct kernfs_node *kernfs_node_from_dentry(struct dentry *dentry);
 struct kernfs_root *kernfs_root_from_sb(struct super_block *sb);
 
+struct dentry *kernfs_obtain_root(struct super_block *sb,
+				  struct kernfs_node *kn);
 struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
 				       unsigned int flags, void *priv);
 void kernfs_destroy_root(struct kernfs_root *root);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index b1ae6d9..e779890 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1438,6 +1438,14 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
 			return -ENOENT;
 	}
 
+	/* If inside a non-init cgroup namespace, only allow default hierarchy
+	 * to be mounted.
+	 */
+	if ((current->nsproxy->cgroup_ns != &init_cgroup_ns) &&
+	    !(opts->flags & CGRP_ROOT_SANE_BEHAVIOR)) {
+		return -EINVAL;
+	}
+
 	if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
 		pr_warn("sane_behavior: this is still under development and its behaviors will change, proceed at your own risk\n");
 		if (nr_opts != 1) {
@@ -1630,6 +1638,15 @@ static void init_cgroup_root(struct cgroup_root *root,
 		set_bit(CGRP_CPUSET_CLONE_CHILDREN, &root->cgrp.flags);
 }
 
+struct dentry *cgroupns_get_root(struct super_block *sb,
+				 struct cgroup_namespace *ns)
+{
+	struct dentry *nsdentry;
+
+	nsdentry = kernfs_obtain_root(sb, ns->root_cgrp->kn);
+	return nsdentry;
+}
+
 static int cgroup_setup_root(struct cgroup_root *root, unsigned int ss_mask)
 {
 	LIST_HEAD(tmp_links);
@@ -1734,6 +1751,14 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
 	int ret;
 	int i;
 	bool new_sb;
+	struct cgroup_namespace *ns =
+		get_cgroup_ns(current->nsproxy->cgroup_ns);
+
+	/* Check if the caller has permission to mount. */
+	if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) {
+		put_cgroup_ns(ns);
+		return ERR_PTR(-EPERM);
+	}
 
 	/*
 	 * The first time anyone tries to mount a cgroup, enable the list
@@ -1866,11 +1891,28 @@ out_free:
 	kfree(opts.release_agent);
 	kfree(opts.name);
 
-	if (ret)
+	if (ret) {
+		put_cgroup_ns(ns);
 		return ERR_PTR(ret);
+	}
 
 	dentry = kernfs_mount(fs_type, flags, root->kf_root,
 				CGROUP_SUPER_MAGIC, &new_sb);
+
+	if (!IS_ERR(dentry) && (root == &cgrp_dfl_root)) {
+		/* If this mount is for the default hierarchy in non-init cgroup
+		 * namespace, then instead of root cgroup's dentry, we return
+		 * the dentry corresponding to the cgroupns->root_cgrp.
+		 */
+		if (ns != &init_cgroup_ns) {
+			struct dentry *nsdentry;
+
+			nsdentry = cgroupns_get_root(dentry->d_sb, ns);
+			dput(dentry);
+			dentry = nsdentry;
+		}
+	}
+
 	if (IS_ERR(dentry) || !new_sb)
 		cgroup_put(&root->cgrp);
 
@@ -1883,6 +1925,7 @@ out_free:
 		deactivate_super(pinned_sb);
 	}
 
+	put_cgroup_ns(ns);
 	return dentry;
 }
 
@@ -1911,6 +1954,7 @@ static struct file_system_type cgroup_fs_type = {
 	.name = "cgroup",
 	.mount = cgroup_mount,
 	.kill_sb = cgroup_kill_sb,
+	.fs_flags = FS_USERNS_MOUNT,
 };
 
 static struct kobject *cgroup_kobj;
-- 
2.2.0.rc0.207.ga3a616c
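
To make the mount side concrete, here is a small userspace sketch of what a container tool might do after unsharing its cgroup namespace. Per the check added to parse_cgroupfs_options(), only the default hierarchy may be mounted from a non-init cgroupns, so the sketch requests it explicitly. The option string "__DEVEL__sane_behavior" is an assumption based on how kernels of this period selected CGRP_ROOT_SANE_BEHAVIOR; it does not appear in this patch. The helper name is hypothetical.

```c
#include <sys/mount.h>

/*
 * Mount the default (unified) cgroup hierarchy at @target. When called
 * from inside a non-init cgroup namespace on a kernel with this patch,
 * the mount is expected to be rooted at the caller's cgroupns-root.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int mount_cgroupns_root(const char *target)
{
	/* Assumed option name; it is what set CGRP_ROOT_SANE_BEHAVIOR at the time. */
	return mount("cgroup", target, "cgroup", 0, "__DEVEL__sane_behavior");
}
```

The caller needs CAP_SYS_ADMIN over the user namespace that owns its cgroupns, as checked at the top of cgroup_mount(); without it the call should fail with EPERM.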

^ permalink raw reply related	[flat|nested] 384+ messages in thread

* [PATCHv3 7/8] cgroup: mount cgroupns-root when inside non-init cgroupns
       [not found]   ` <1417744550-6461-1-git-send-email-adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2014-12-05  1:55     ` Aditya Kali
  2014-12-05  1:55       ` Aditya Kali
                       ` (7 subsequent siblings)
  8 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-12-05  1:55 UTC (permalink / raw)
  To: tj, lizefan, serge.hallyn, luto, ebiederm, cgroups, linux-kernel,
	linux-api, mingo
  Cc: containers, jnagal, vgoyal, richard.weinberger, Aditya Kali

This patch enables cgroup mounting inside a userns when a process
has appropriate privileges. The cgroup filesystem mounted is
rooted at the cgroupns-root. Thus, in a container setup, only
the hierarchy under the cgroupns-root is exposed inside the container.
This allows container management tools to run inside the containers
without depending on any global state.
In order to support this, a new kernfs API is added to look up the
dentry for the cgroupns-root.

Signed-off-by: Aditya Kali <adityakali@google.com>
---
 fs/kernfs/mount.c      | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/kernfs.h |  2 ++
 kernel/cgroup.c        | 46 +++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 95 insertions(+), 1 deletion(-)

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index f973ae9..efe5e15 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -62,6 +62,54 @@ struct kernfs_root *kernfs_root_from_sb(struct super_block *sb)
 	return NULL;
 }
 
+/**
+ * kernfs_obtain_root - get a dentry for the given kernfs_node
+ * @sb: the kernfs super_block
+ * @kn: kernfs_node for which a dentry is needed
+ *
+ * This can be used by callers which want to mount only a part of the kernfs
+ * as root of the filesystem.
+ */
+struct dentry *kernfs_obtain_root(struct super_block *sb,
+				  struct kernfs_node *kn)
+{
+	struct dentry *dentry;
+	struct inode *inode;
+
+	BUG_ON(sb->s_op != &kernfs_sops);
+
+	/* inode for the given kernfs_node should already exist. */
+	inode = ilookup(sb, kn->ino);
+	if (!inode) {
+		pr_debug("kernfs: could not get inode for '");
+		pr_cont_kernfs_path(kn);
+		pr_cont("'.\n");
+		return ERR_PTR(-EINVAL);
+	}
+
+	/* instantiate and link root dentry */
+	dentry = d_obtain_root(inode);
+	if (!dentry) {
+		pr_debug("kernfs: could not get dentry for '");
+		pr_cont_kernfs_path(kn);
+		pr_cont("'.\n");
+		return ERR_PTR(-ENOMEM);
+	}
+
+	/* If this is a new dentry, set it up. We need kernfs_mutex because this
+	 * may be called by callers other than kernfs_fill_super. */
+	mutex_lock(&kernfs_mutex);
+	if (!dentry->d_fsdata) {
+		kernfs_get(kn);
+		dentry->d_fsdata = kn;
+	} else {
+		WARN_ON(dentry->d_fsdata != kn);
+	}
+	mutex_unlock(&kernfs_mutex);
+
+	return dentry;
+}
+
 static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
 {
 	struct kernfs_super_info *info = kernfs_info(sb);
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 3c2be75..b9538e0 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -274,6 +274,8 @@ void kernfs_put(struct kernfs_node *kn);
 struct kernfs_node *kernfs_node_from_dentry(struct dentry *dentry);
 struct kernfs_root *kernfs_root_from_sb(struct super_block *sb);
 
+struct dentry *kernfs_obtain_root(struct super_block *sb,
+				  struct kernfs_node *kn);
 struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
 				       unsigned int flags, void *priv);
 void kernfs_destroy_root(struct kernfs_root *root);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index b1ae6d9..e779890 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1438,6 +1438,14 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
 			return -ENOENT;
 	}
 
+	/* If inside a non-init cgroup namespace, only allow default hierarchy
+	 * to be mounted.
+	 */
+	if ((current->nsproxy->cgroup_ns != &init_cgroup_ns) &&
+	    !(opts->flags & CGRP_ROOT_SANE_BEHAVIOR)) {
+		return -EINVAL;
+	}
+
 	if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
 		pr_warn("sane_behavior: this is still under development and its behaviors will change, proceed at your own risk\n");
 		if (nr_opts != 1) {
@@ -1630,6 +1638,15 @@ static void init_cgroup_root(struct cgroup_root *root,
 		set_bit(CGRP_CPUSET_CLONE_CHILDREN, &root->cgrp.flags);
 }
 
+struct dentry *cgroupns_get_root(struct super_block *sb,
+				 struct cgroup_namespace *ns)
+{
+	struct dentry *nsdentry;
+
+	nsdentry = kernfs_obtain_root(sb, ns->root_cgrp->kn);
+	return nsdentry;
+}
+
 static int cgroup_setup_root(struct cgroup_root *root, unsigned int ss_mask)
 {
 	LIST_HEAD(tmp_links);
@@ -1734,6 +1751,14 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
 	int ret;
 	int i;
 	bool new_sb;
+	struct cgroup_namespace *ns =
+		get_cgroup_ns(current->nsproxy->cgroup_ns);
+
+	/* Check if the caller has permission to mount. */
+	if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) {
+		put_cgroup_ns(ns);
+		return ERR_PTR(-EPERM);
+	}
 
 	/*
 	 * The first time anyone tries to mount a cgroup, enable the list
@@ -1866,11 +1891,28 @@ out_free:
 	kfree(opts.release_agent);
 	kfree(opts.name);
 
-	if (ret)
+	if (ret) {
+		put_cgroup_ns(ns);
 		return ERR_PTR(ret);
+	}
 
 	dentry = kernfs_mount(fs_type, flags, root->kf_root,
 				CGROUP_SUPER_MAGIC, &new_sb);
+
+	if (!IS_ERR(dentry) && (root == &cgrp_dfl_root)) {
+		/* If this mount is for the default hierarchy in non-init cgroup
+		 * namespace, then instead of root cgroup's dentry, we return
+		 * the dentry corresponding to the cgroupns->root_cgrp.
+		 */
+		if (ns != &init_cgroup_ns) {
+			struct dentry *nsdentry;
+
+			nsdentry = cgroupns_get_root(dentry->d_sb, ns);
+			dput(dentry);
+			dentry = nsdentry;
+		}
+	}
+
 	if (IS_ERR(dentry) || !new_sb)
 		cgroup_put(&root->cgrp);
 
@@ -1883,6 +1925,7 @@ out_free:
 		deactivate_super(pinned_sb);
 	}
 
+	put_cgroup_ns(ns);
 	return dentry;
 }
 
@@ -1911,6 +1954,7 @@ static struct file_system_type cgroup_fs_type = {
 	.name = "cgroup",
 	.mount = cgroup_mount,
 	.kill_sb = cgroup_kill_sb,
+	.fs_flags = FS_USERNS_MOUNT,
 };
 
 static struct kobject *cgroup_kobj;
-- 
2.2.0.rc0.207.ga3a616c
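
The mount-time checks this patch adds to kernel/cgroup.c can be summarized
with a small user-space model. This is only a sketch under stated
assumptions: struct mount_request, enum mount_root and model_cgroup_mount()
are hypothetical names that do not exist in the kernel, and the three
booleans stand in for the ns_capable(), init_cgroup_ns and
CGRP_ROOT_SANE_BEHAVIOR tests in the diff above.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical inputs; each field mirrors one test made by the patch. */
struct mount_request {
	bool in_init_cgroupns;	/* nsproxy->cgroup_ns == &init_cgroup_ns */
	bool has_cap_sys_admin;	/* ns_capable(ns->user_ns, CAP_SYS_ADMIN) */
	bool sane_behavior;	/* CGRP_ROOT_SANE_BEHAVIOR was requested */
};

enum mount_root { ROOT_GLOBAL, ROOT_CGROUPNS };

/* Returns 0 and sets *root on success, or a negative errno value. */
static int model_cgroup_mount(const struct mount_request *req,
			      enum mount_root *root)
{
	/* cgroup_mount(): the capability check runs before option parsing. */
	if (!req->has_cap_sys_admin)
		return -EPERM;

	/* parse_cgroupfs_options(): a non-init cgroupns may only mount
	 * the default hierarchy. */
	if (!req->in_init_cgroupns && !req->sane_behavior)
		return -EINVAL;

	/* cgroup_mount(): only a default-hierarchy mount from a non-init
	 * cgroupns is re-rooted at the cgroupns-root. */
	if (req->sane_behavior && !req->in_init_cgroupns)
		*root = ROOT_CGROUPNS;
	else
		*root = ROOT_GLOBAL;
	return 0;
}
```

The model treats a sane_behavior mount as a mount of the default hierarchy,
which is an assumption made for illustration; the patch itself compares
root against &cgrp_dfl_root.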


^ permalink raw reply related	[flat|nested] 384+ messages in thread

* [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
  2014-12-05  1:55   ` Aditya Kali
@ 2014-12-05  1:55       ` Aditya Kali
  -1 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-12-05  1:55 UTC (permalink / raw)
  To: tj-DgEjT+Ai2ygdnm+yROfE0A, lizefan-hv44wF8Li93QT0dZR+AlfA,
	serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA, luto-kltTT9wpgjJwATOyAt5JVQ,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, mingo-H+wXaHxf7aLQT0dZR+AlfA
  Cc: richard.weinberger-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
 Documentation/cgroups/namespace.txt | 147 ++++++++++++++++++++++++++++++++++++
 1 file changed, 147 insertions(+)
 create mode 100644 Documentation/cgroups/namespace.txt

diff --git a/Documentation/cgroups/namespace.txt b/Documentation/cgroups/namespace.txt
new file mode 100644
index 0000000..6480379
--- /dev/null
+++ b/Documentation/cgroups/namespace.txt
@@ -0,0 +1,147 @@
+			CGroup Namespaces
+
+CGroup Namespace provides a mechanism to virtualize the view of the
+/proc/<pid>/cgroup file. The CLONE_NEWCGROUP clone-flag can be used with
+clone() and unshare() syscalls to create a new cgroup namespace.
+The process running inside the cgroup namespace will have its /proc/<pid>/cgroup
+output restricted to cgroupns-root. cgroupns-root is the cgroup of the process
+at the time of creation of the cgroup namespace.
+
+Prior to CGroup Namespace, the /proc/<pid>/cgroup file used to show complete
+path of the cgroup of a process. In a container setup (where a set of cgroups
+and namespaces are intended to isolate processes), the /proc/<pid>/cgroup file
+may leak potential system level information to the isolated processes.
+
+For Example:
+  $ cat /proc/self/cgroup
+  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
+
+The path '/batchjobs/container_id1' can generally be considered as system-data
+and it's desirable to not expose it to the isolated process.
+
+CGroup Namespaces can be used to restrict visibility of this path.
+For Example:
+  # Before creating cgroup namespace
+  $ ls -l /proc/self/ns/cgroup
+  lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
+  $ cat /proc/self/cgroup
+  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
+
+  # unshare(CLONE_NEWCGROUP) and exec /bin/bash
+  $ ~/unshare -c
+  [ns]$ ls -l /proc/self/ns/cgroup
+  lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
+  # From within new cgroupns, process sees that it's in the root cgroup
+  [ns]$ cat /proc/self/cgroup
+  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
+
+  # From global cgroupns:
+  $ cat /proc/<pid>/cgroup
+  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
+
+  # Unshare cgroupns along with userns and mountns
+  # Following calls unshare(CLONE_NEWCGROUP|CLONE_NEWUSER|CLONE_NEWNS), then
+  # sets up uid/gid map and execs /bin/bash
+  $ ~/unshare -c -u -m
+  # Originally, we were in /batchjobs/container_id1 cgroup. Mount our own cgroup
+  # hierarchy.
+  [ns]$ mount -t cgroup -o __DEVEL__sane_behavior cgroup /tmp/cgroup
+  [ns]$ ls -l /tmp/cgroup
+  total 0
+  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.controllers
+  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.populated
+  -rw-r--r-- 1 root root 0 2014-10-13 09:25 cgroup.procs
+  -rw-r--r-- 1 root root 0 2014-10-13 09:32 cgroup.subtree_control
+
+The cgroupns-root (/batchjobs/container_id1 in above example) becomes the
+filesystem root for the namespace specific cgroupfs mount.
+
+The virtualization of /proc/self/cgroup file combined with restricting
+the view of cgroup hierarchy by namespace-private cgroupfs mount
+should provide a completely isolated cgroup view inside the container.
+
+Note that CGroup Namespaces virtualize the path on the unified hierarchy only.
+If other hierarchies are mounted, /proc/<pid>/cgroup will continue to show the
+full cgroup path for those.
+
+In its current form, the cgroup namespaces patchset provides the following
+behavior:
+
+(1) The 'cgroupns-root' for a cgroup namespace is the cgroup in which
+    the process calling unshare is running.
+    For ex. if a process in /batchjobs/container_id1 cgroup calls unshare,
+    cgroup /batchjobs/container_id1 becomes the cgroupns-root.
+    For the init_cgroup_ns, this is the real root ('/') cgroup
+    (identified in code as cgrp_dfl_root.cgrp).
+
+(2) The cgroupns-root cgroup does not change even if the namespace
+    creator process later moves to a different cgroup.
+    $ ~/unshare -c # unshare cgroupns in some cgroup
+    [ns]$ cat /proc/self/cgroup
+    0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
+    [ns]$ mkdir sub_cgrp_1
+    [ns]$ echo 0 > sub_cgrp_1/cgroup.procs
+    [ns]$ cat /proc/self/cgroup
+    0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
+
+(3) Each process gets its CGROUPNS specific view of /proc/<pid>/cgroup
+(a) Processes running inside the cgroup namespace will be able to see
+    cgroup paths (in /proc/self/cgroup) only inside their root cgroup
+    [ns]$ sleep 100000 &  # From within unshared cgroupns
+    [1] 7353
+    [ns]$ echo 7353 > sub_cgrp_1/cgroup.procs
+    [ns]$ cat /proc/7353/cgroup
+    0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
+
+(b) From global cgroupns, the real cgroup path will be visible:
+    $ cat /proc/7353/cgroup
+    0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1/sub_cgrp_1
+
+(c) From a sibling cgroupns (cgroupns root-ed at a different cgroup), cgroup
+    path relative to its own cgroupns-root will be shown:
+    # ns2's cgroupns-root is at '/batchjobs/container_id2'
+    [ns2]$ cat /proc/7353/cgroup
+    0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/../container_id1/sub_cgrp_1
+
+    Note that the relative path always starts with '/' to indicate that it is
+    relative to the cgroupns-root of the caller.
+
+(4) Processes inside a cgroupns can move in-and-out of the cgroupns-root
+    (if they have proper access to external cgroups).
+    # From inside cgroupns (with cgroupns-root at /batchjobs/container_id1), and
+    # assuming that the global hierarchy is still accessible inside cgroupns:
+    $ cat /proc/7353/cgroup
+    0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
+    $ echo 7353 > batchjobs/container_id2/cgroup.procs
+    $ cat /proc/7353/cgroup
+    0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/../container_id2
+
+    Note that this kind of setup is not encouraged. A task inside cgroupns
+    should only be exposed to its own cgroupns hierarchy. Otherwise it makes
+    the virtualization of /proc/<pid>/cgroup less useful.
+
+(5) Setns to another cgroup namespace is allowed when:
+    (a) the process has CAP_SYS_ADMIN in its current userns
+    (b) the process has CAP_SYS_ADMIN in the target cgroupns' userns
+    No implicit cgroup changes happen with attaching to another cgroupns. It
+    is expected that someone moves the attaching process under the target
+    cgroupns-root.
+
+(6) When some thread from a multi-threaded process unshares its
+    cgroup-namespace, the new cgroupns gets applied to the entire
+    process (all the threads). This should be OK since
+    unified-hierarchy only allows process-level containerization. So
+    all the threads in the process will have the same cgroup.
+
+(7) The cgroup namespace is alive as long as there is at least one
+    process inside it. When the last process exits, the cgroup
+    namespace is destroyed. The cgroupns-root and the actual cgroups
+    remain though.
+
+(8) Namespace specific cgroup hierarchy can be mounted by a process running
+    inside cgroupns:
+    $ mount -t cgroup -o __DEVEL__sane_behavior cgroup $MOUNT_POINT
+
+    This will mount the unified cgroup hierarchy with cgroupns-root as the
+    filesystem root. The process needs CAP_SYS_ADMIN in its userns and mntns.
+
-- 
2.2.0.rc0.207.ga3a616c
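
The relative paths described in section (3) of the documentation above can
be reproduced with a small user-space sketch. It is not the kernfs code
from patch 1/8: cgroup_relative_path() is a hypothetical helper, and it
assumes both arguments are normalized absolute cgroup paths (leading '/',
no trailing or doubled slashes).

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Write into buf the path of cgroup cgrp as seen by a reader whose
 * cgroupns-root is ns_root. Returns 0 on success, -1 if buf is too small.
 */
static int cgroup_relative_path(const char *ns_root, const char *cgrp,
				char *buf, size_t len)
{
	size_t i, n, common = 0, used = 0;
	const char *p, *rest;

	/* Treat the root cgroup as an empty prefix of every other path. */
	if (strcmp(ns_root, "/") == 0)
		ns_root = "";
	if (strcmp(cgrp, "/") == 0)
		cgrp = "";

	/* Find the longest common prefix ending on a component boundary. */
	for (i = 0; ; i++) {
		bool a_end = ns_root[i] == '/' || ns_root[i] == '\0';
		bool b_end = cgrp[i] == '/' || cgrp[i] == '\0';

		if (a_end && b_end)
			common = i;
		if (ns_root[i] == '\0' || ns_root[i] != cgrp[i])
			break;
	}

	/* One "/.." for every component of ns_root below the common part. */
	for (p = ns_root + common; *p; p++) {
		if (*p != '/')
			continue;
		if (used + 3 >= len)
			return -1;
		memcpy(buf + used, "/..", 3);
		used += 3;
	}

	/* Then descend into what is left of the target cgroup's path. */
	rest = cgrp + common;
	n = strlen(rest);
	if (used + n == 0) {
		rest = "/";	/* reader and target are the same cgroup */
		n = 1;
	}
	if (used + n >= len)
		return -1;
	memcpy(buf + used, rest, n);
	buf[used + n] = '\0';
	return 0;
}
```

For a reader rooted at /batchjobs/container_id2 and a task in
/batchjobs/container_id1/sub_cgrp_1, the helper produces
/../container_id1/sub_cgrp_1; for a reader rooted at
/batchjobs/container_id1 it produces /sub_cgrp_1.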

^ permalink raw reply related	[flat|nested] 384+ messages in thread

* Re: [PATCHv3 0/8] CGroup Namespaces
       [not found]   ` <1417744550-6461-1-git-send-email-adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
                       ` (7 preceding siblings ...)
  2014-12-05  1:55       ` Aditya Kali
@ 2014-12-05  3:20     ` Aditya Kali
  8 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2014-12-05  3:20 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Serge Hallyn, Andy Lutomirski,
	Eric W. Biederman, cgroups mailinglist,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Linux API, Ingo Molnar
  Cc: Richard Weinberger, Linux Containers

These patches are now also hosted on github at
https://github.com/adityakali/linux/tree/cgroupns_v3.

Thanks,

On Thu, Dec 4, 2014 at 5:55 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
> Another spin for CGroup Namespaces feature.
>
> Changes from V2:
> 1. Added documentation in Documentation/cgroups/namespace.txt
> 2. Fixed a bug that caused crash
> 3. Incorporated some other suggestions from last patchset:
>    - removed use of threadgroup_lock() while creating new cgroupns
>    - use task_lock() instead of rcu_read_lock() while accessing
>      task->nsproxy
>    - optimized setns() to own cgroupns
>    - simplified code around sane-behavior mount option parsing
> 4. Restored ACKs from Serge Hallyn from v1 on few patches that have
>    not changed since then.
>
> Changes from V1:
> 1. No pinning of processes within cgroupns. Tasks can be freely moved
>    across cgroups even outside of their cgroupns-root. Usual DAC/MAC policies
>    apply as before.
> 2. Path in /proc/<pid>/cgroup is now always shown and is relative to
>    cgroupns-root. So path can contain '/..' strings depending on cgroupns-root
>    of the reader and cgroup of <pid>.
> 3. setns() does not require the process to first move under target
>    cgroupns-root.
>
> Changes from RFC (V0):
> 1. setns support for cgroupns
> 2. 'mount -t cgroup cgroup <mntpt>' from inside a cgroupns now
>    mounts the cgroup hierarchy with cgroupns-root as the filesystem root.
> 3. writes to cgroup files outside of cgroupns-root are not allowed
> 4. visibility of /proc/<pid>/cgroup is further restricted by not showing
>    anything if the <pid> is in a sibling cgroupns and its cgroup falls outside
>    your cgroupns-root.
>
> ---
>  Documentation/cgroups/namespace.txt | 147 +++++++++++++++++++++++++++
>  fs/kernfs/dir.c                     | 195 ++++++++++++++++++++++++++++++++----
>  fs/kernfs/mount.c                   |  48 +++++++++
>  fs/proc/namespaces.c                |   1 +
>  include/linux/cgroup.h              |  52 +++++++++-
>  include/linux/cgroup_namespace.h    |  36 +++++++
>  include/linux/kernfs.h              |   5 +
>  include/linux/nsproxy.h             |   2 +
>  include/linux/proc_ns.h             |   4 +
>  include/uapi/linux/sched.h          |   3 +-
>  kernel/Makefile                     |   2 +-
>  kernel/cgroup.c                     | 106 +++++++++++++++-----
>  kernel/cgroup_namespace.c           | 140 ++++++++++++++++++++++++++
>  kernel/fork.c                       |   2 +-
>  kernel/nsproxy.c                    |  19 +++-
>  15 files changed, 711 insertions(+), 51 deletions(-)
>  create mode 100644 Documentation/cgroups/namespace.txt
>  create mode 100644 include/linux/cgroup_namespace.h
>  create mode 100644 kernel/cgroup_namespace.c
>
> [PATCHv3 1/8] kernfs: Add API to generate relative kernfs path
> [PATCHv3 2/8] sched: new clone flag CLONE_NEWCGROUP for cgroup
> [PATCHv3 3/8] cgroup: add function to get task's cgroup on default
> [PATCHv3 4/8] cgroup: export cgroup_get() and cgroup_put()
> [PATCHv3 5/8] cgroup: introduce cgroup namespaces
> [PATCHv3 6/8] cgroup: cgroup namespace setns support
> [PATCHv3 7/8] cgroup: mount cgroupns-root when inside non-init cgroupns
> [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces



-- 
Aditya
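
For anyone trying the series from user space, the effect of the new
CLONE_NEWCGROUP flag (patch 2/8 in the list above) can be observed with a
sketch along the following lines. The helper names are hypothetical, the
fallback flag value is the mainline one and is assumed to match this
series, and the raw unshare syscall is used so the sketch does not depend
on a libc wrapper being declared. The call is expected to fail with EPERM
without CAP_SYS_ADMIN, or with EINVAL on a kernel without the patches.

```c
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef CLONE_NEWCGROUP
#define CLONE_NEWCGROUP 0x02000000	/* assumed; check <linux/sched.h> */
#endif

/* Copy the path field of one "id:controllers:path" line into out. */
static int cgroup_line_path(const char *line, char *out, size_t len)
{
	const char *p = strchr(line, ':');
	size_t n;

	if (!p || !(p = strchr(p + 1, ':')))
		return -1;
	p++;
	n = strcspn(p, "\n");
	if (n >= len)
		return -1;
	memcpy(out, p, n);
	out[n] = '\0';
	return 0;
}

/* Read the first line of /proc/self/cgroup and extract its path. */
static int self_cgroup_path(char *out, size_t len)
{
	char line[4096];
	FILE *f = fopen("/proc/self/cgroup", "r");
	int ret = -1;

	if (!f)
		return -1;
	if (fgets(line, sizeof(line), f))
		ret = cgroup_line_path(line, out, len);
	fclose(f);
	return ret;
}

/* Print the cgroup path before and after unsharing the cgroup namespace. */
static int show_cgroupns_effect(void)
{
	char path[4096];

	if (self_cgroup_path(path, sizeof(path)) == 0)
		printf("before: %s\n", path);
	if (syscall(SYS_unshare, CLONE_NEWCGROUP) != 0) {
		perror("unshare(CLONE_NEWCGROUP)");
		return -1;
	}
	if (self_cgroup_path(path, sizeof(path)) == 0)
		printf("after:  %s\n", path);
	return 0;
}
```

Calling show_cgroupns_effect() from main() as a privileged user on a
patched kernel should print the full path first and then a path relative
to the new cgroupns-root.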

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
  2014-12-05  1:55       ` Aditya Kali
@ 2014-12-12  8:54           ` Zefan Li
  -1 siblings, 0 replies; 384+ messages in thread
From: Zefan Li @ 2014-12-12  8:54 UTC (permalink / raw)
  To: Aditya Kali
  Cc: richard.weinberger-Re5JQEeQqe8AvxtiuMwx3w,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, luto-kltTT9wpgjJwATOyAt5JVQ,
	mingo-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	tj-DgEjT+Ai2ygdnm+yROfE0A, cgroups-u79uwXL29TY76Z2rM5mHXA

> +In its current form, the cgroup namespaces patchset provides the following
> +behavior:
> +
> +(1) The 'cgroupns-root' for a cgroup namespace is the cgroup in which
> +    the process calling unshare is running.
> +    For ex. if a process in /batchjobs/container_id1 cgroup calls unshare,
> +    cgroup /batchjobs/container_id1 becomes the cgroupns-root.
> +    For the init_cgroup_ns, this is the real root ('/') cgroup
> +    (identified in code as cgrp_dfl_root.cgrp).
> +
> +(2) The cgroupns-root cgroup does not change even if the namespace
> +    creator process later moves to a different cgroup.
> +    $ ~/unshare -c # unshare cgroupns in some cgroup
> +    [ns]$ cat /proc/self/cgroup
> +    0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
> +    [ns]$ mkdir sub_cgrp_1
> +    [ns]$ echo 0 > sub_cgrp_1/cgroup.procs
> +    [ns]$ cat /proc/self/cgroup
> +    0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
> +
> +(3) Each process gets its CGROUPNS specific view of /proc/<pid>/cgroup
> +(a) Processes running inside the cgroup namespace will be able to see
> +    cgroup paths (in /proc/self/cgroup) only inside their root cgroup
> +    [ns]$ sleep 100000 &  # From within unshared cgroupns
> +    [1] 7353
> +    [ns]$ echo 7353 > sub_cgrp_1/cgroup.procs
> +    [ns]$ cat /proc/7353/cgroup
> +    0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
> +
> +(b) From global cgroupns, the real cgroup path will be visible:
> +    $ cat /proc/7353/cgroup
> +    0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1/sub_cgrp_1
> +
> +(c) From a sibling cgroupns (cgroupns root-ed at a different cgroup), cgroup
> +    path relative to its own cgroupns-root will be shown:
> +    # ns2's cgroupns-root is at '/batchjobs/container_id2'
> +    [ns2]$ cat /proc/7353/cgroup
> +    0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/../container_id2/sub_cgrp_1

Should this be ../container_id1/sub_cgrp_1?

> +
> +    Note that the relative path always starts with '/' to indicate that it is
> +    relative to the cgroupns-root of the caller.

If a path doesn't start with '/', then it's a relative path, so why make it start with '/'?
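
To make the sibling-namespace case concrete, here is a minimal userspace
sketch of the path computation described in (3)(c). It is not the
kernfs_path_from_node() helper from patch 1/8; cgroup_rel_path() is a
hypothetical name, and the sketch assumes both inputs are absolute,
normalized cgroup paths with no trailing '/'. It keeps the leading '/'
convention from the documentation text.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: write into buf the path of cgroup 'to' as seen from
 * a cgroup namespace rooted at 'from'. Returns 0 on success, -1 if buf is
 * too small.
 */
static int cgroup_rel_path(const char *from, const char *to,
			   char *buf, size_t buflen)
{
	size_t i, common = 0, len = 0;
	const char *p;
	int n;

	assert(from[0] == '/' && to[0] == '/');

	/* Treat the real root "/" as an empty list of components. */
	if (strcmp(from, "/") == 0)
		from = "";
	if (strcmp(to, "/") == 0)
		to = "";

	/* Longest common prefix that ends on a component boundary. */
	for (i = 0; ; i++) {
		int from_end = (from[i] == '\0' || from[i] == '/');
		int to_end = (to[i] == '\0' || to[i] == '/');

		if (from_end && to_end)
			common = i;
		if (from[i] == '\0' || to[i] == '\0' || from[i] != to[i])
			break;
	}

	/* One "/.." for every component of 'from' below the common prefix. */
	for (p = from + common; *p; p++) {
		if (*p != '/')
			continue;
		n = snprintf(buf + len, buflen - len, "/..");
		if (n < 0 || (size_t)n >= buflen - len)
			return -1;
		len += (size_t)n;
	}

	/* Then the remainder of 'to', or "/" if nothing was emitted at all. */
	n = snprintf(buf + len, buflen - len, "%s",
		     (len || to[common]) ? to + common : "/");
	if (n < 0 || (size_t)n >= buflen - len)
		return -1;
	return 0;
}
```

With from = "/batchjobs/container_id2" and
to = "/batchjobs/container_id1/sub_cgrp_1" this yields
"/../container_id1/sub_cgrp_1", which is the container_id1 form suggested
above.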

> +
> +(4) Processes inside a cgroupns can move in-and-out of the cgroupns-root
> +    (if they have proper access to external cgroups).
> +    # From inside cgroupns (with cgroupns-root at /batchjobs/container_id1), and
> +    # assuming that the global hierarchy is still accessible inside cgroupns:
> +    $ cat /proc/7353/cgroup
> +    0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
> +    $ echo 7353 > batchjobs/container_id2/cgroup.procs
> +    $ cat /proc/7353/cgroup
> +    0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/../container_id2
> +
> +    Note that this kind of setup is not encouraged. A task inside cgroupns
> +    should only be exposed to its own cgroupns hierarchy. Otherwise it makes
> +    the virtualization of /proc/<pid>/cgroup less useful.
> +
> +(5) Setns to another cgroup namespace is allowed when:
> +    (a) the process has CAP_SYS_ADMIN in its current userns
> +    (b) the process has CAP_SYS_ADMIN in the target cgroupns' userns
> +    No implicit cgroup changes happen with attaching to another cgroupns. It
> +    is expected that the somone moves the attaching process under the target
> +    cgroupns-root.
> +

s/the somone/someone/

> +(6) When some thread from a multi-threaded process unshares its
> +    cgroup-namespace, the new cgroupns gets applied to the entire
> +    process (all the threads). This should be OK since
> +    unified-hierarchy only allows process-level containerization. So
> +    all the threads in the process will have the same cgroup.
> +
> +(7) The cgroup namespace is alive as long as there is atleast 1

s/atleast/at least/

> +    process inside it. When the last process exits, the cgroup
> +    namespace is destroyed. The cgroupns-root and the actual cgroups
> +    remain though.
> +
> +(8) Namespace specific cgroup hierarchy can be mounted by a process running
> +    inside cgroupns:
> +    $ mount -t cgroup -o __DEVEL__sane_behavior cgroup $MOUNT_POINT
> +
> +    This will mount the unified cgroup hierarchy with cgroupns-root as the
> +    filesystem root. The process needs CAP_SYS_ADMIN in its userns and mntns.
> +
> 

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv3 5/8] cgroup: introduce cgroup namespaces
  2014-12-05  1:55       ` Aditya Kali
@ 2014-12-12  8:54           ` Zefan Li
  -1 siblings, 0 replies; 384+ messages in thread
From: Zefan Li @ 2014-12-12  8:54 UTC (permalink / raw)
  To: Aditya Kali
  Cc: richard.weinberger-Re5JQEeQqe8AvxtiuMwx3w,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, luto-kltTT9wpgjJwATOyAt5JVQ,
	mingo-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	tj-DgEjT+Ai2ygdnm+yROfE0A, cgroups-u79uwXL29TY76Z2rM5mHXA

On 2014/12/5 9:55, Aditya Kali wrote:
> Introduce the ability to create new cgroup namespace. The newly created
> cgroup namespace remembers the cgroup of the process at the point
> of creation of the cgroup namespace (referred to as cgroupns-root).
> The main purpose of cgroup namespace is to virtualize the contents
> of /proc/self/cgroup file. Processes inside a cgroup namespace
> are only able to see paths relative to their namespace root
> (unless they are moved outside of their cgroupns-root, at which point
>  they will see a relative path from their cgroupns-root).
> For a correctly set up container this enables container-tools
> (like libcontainer, lxc, lmctfy, etc.) to create completely virtualized
> containers without leaking system level cgroup hierarchy to the task.
> This patch only implements the 'unshare' part of the cgroupns.
> 
> Signed-off-by: Aditya Kali <adityakali@google.com>
> ---
>  fs/proc/namespaces.c             |   1 +
>  include/linux/cgroup.h           |  29 ++++++++-
>  include/linux/cgroup_namespace.h |  36 +++++++++++
>  include/linux/nsproxy.h          |   2 +
>  include/linux/proc_ns.h          |   4 ++
>  kernel/Makefile                  |   2 +-
>  kernel/cgroup.c                  |  13 ++++
>  kernel/cgroup_namespace.c        | 127 +++++++++++++++++++++++++++++++++++++++
>  kernel/fork.c                    |   2 +-
>  kernel/nsproxy.c                 |  19 +++++-
>  10 files changed, 230 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
> index 8902609..55bc5da 100644
> --- a/fs/proc/namespaces.c
> +++ b/fs/proc/namespaces.c
> @@ -32,6 +32,7 @@ static const struct proc_ns_operations *ns_entries[] = {
>  	&userns_operations,
>  #endif
>  	&mntns_operations,
> +	&cgroupns_operations,

Should this be guarded with CONFIG_CGROUPS?

There are other changes that break the build if !CONFIG_CGROUPS.
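
Roughly what I have in mind, as a standalone userspace sketch rather than
a patch against the tree. The struct below is a reduced stand-in for the
real proc_ns_operations, and ns_entry_count() and ns_entry_name() are
hypothetical helpers added only so the effect is visible. The no-op stubs
assume that callers in nsproxy.c keep calling get_cgroup_ns() and
put_cgroup_ns() unconditionally.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Uncomment to model a kernel built with CONFIG_CGROUPS=y. */
/* #define CONFIG_CGROUPS 1 */

/* Reduced stand-in for the kernel's struct proc_ns_operations. */
struct proc_ns_operations {
	const char *name;
};

static const struct proc_ns_operations mntns_operations = { .name = "mnt" };

#ifdef CONFIG_CGROUPS
static const struct proc_ns_operations cgroupns_operations = { .name = "cgroup" };
#endif

/* The cgroup entry is only referenced when the symbol actually exists. */
static const struct proc_ns_operations *ns_entries[] = {
	&mntns_operations,
#ifdef CONFIG_CGROUPS
	&cgroupns_operations,
#endif
};

struct cgroup_namespace;

#ifndef CONFIG_CGROUPS
/* No-op stubs so that unconditional callers still compile. */
static inline struct cgroup_namespace *get_cgroup_ns(struct cgroup_namespace *ns)
{
	return ns;
}

static inline void put_cgroup_ns(struct cgroup_namespace *ns)
{
	(void)ns;
}
#endif

/* Hypothetical helpers, only here to make the table observable. */
static size_t ns_entry_count(void)
{
	return sizeof(ns_entries) / sizeof(ns_entries[0]);
}

static const char *ns_entry_name(size_t i)
{
	assert(i < ns_entry_count());
	return ns_entries[i]->name;
}
```

With CONFIG_CGROUPS left undefined the table holds only the mnt entry and
the stubs compile away; defining it adds the cgroup entry.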

>  };
>  
>  static const struct file_operations ns_file_operations = {
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 6e7533b..94a5a0c 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -22,6 +22,8 @@
>  #include <linux/seq_file.h>
>  #include <linux/kernfs.h>
>  #include <linux/wait.h>
> +#include <linux/nsproxy.h>
> +#include <linux/types.h>
>  
>  #ifdef CONFIG_CGROUPS
>  
> @@ -460,6 +462,13 @@ struct cftype {
>  #endif
>  };
>  
> +struct cgroup_namespace {
> +	atomic_t		count;
> +	unsigned int		proc_inum;
> +	struct user_namespace	*user_ns;
> +	struct cgroup		*root_cgrp;
> +};
> +
>  extern struct cgroup_root cgrp_dfl_root;
>  extern struct css_set init_css_set;
>  
> @@ -584,10 +593,28 @@ static inline int cgroup_name(struct cgroup *cgrp, char *buf, size_t buflen)
>  	return kernfs_name(cgrp->kn, buf, buflen);
>  }
>  
> +static inline char * __must_check cgroup_path_ns(struct cgroup_namespace *ns,
> +						 struct cgroup *cgrp, char *buf,
> +						 size_t buflen)
> +{
> +	if (ns) {
> +		BUG_ON(!cgroup_on_dfl(cgrp));
> +		return kernfs_path_from_node(ns->root_cgrp->kn, cgrp->kn, buf,
> +					     buflen);
> +	} else {
> +		return kernfs_path(cgrp->kn, buf, buflen);
> +	}
> +}
> +
>  static inline char * __must_check cgroup_path(struct cgroup *cgrp, char *buf,
>  					      size_t buflen)
>  {
> -	return kernfs_path(cgrp->kn, buf, buflen);
> +	if (cgroup_on_dfl(cgrp)) {
> +		return cgroup_path_ns(current->nsproxy->cgroup_ns, cgrp, buf,
> +				      buflen);
> +	} else {
> +		return cgroup_path_ns(NULL, cgrp, buf, buflen);
> +	}
>  }
>  
>  static inline void pr_cont_cgroup_name(struct cgroup *cgrp)
> diff --git a/include/linux/cgroup_namespace.h b/include/linux/cgroup_namespace.h
> new file mode 100644
> index 0000000..0b97b8d
> --- /dev/null
> +++ b/include/linux/cgroup_namespace.h
> @@ -0,0 +1,36 @@
> +#ifndef _LINUX_CGROUP_NAMESPACE_H
> +#define _LINUX_CGROUP_NAMESPACE_H
> +
> +#include <linux/nsproxy.h>
> +#include <linux/cgroup.h>
> +#include <linux/types.h>
> +#include <linux/user_namespace.h>
> +
> +extern struct cgroup_namespace init_cgroup_ns;
> +
> +static inline struct cgroup *current_cgroupns_root(void)
> +{
> +	return current->nsproxy->cgroup_ns->root_cgrp;
> +}
> +
> +extern void free_cgroup_ns(struct cgroup_namespace *ns);
> +
> +static inline struct cgroup_namespace *get_cgroup_ns(
> +		struct cgroup_namespace *ns)
> +{
> +	if (ns)
> +		atomic_inc(&ns->count);
> +	return ns;
> +}
> +
> +static inline void put_cgroup_ns(struct cgroup_namespace *ns)
> +{
> +	if (ns && atomic_dec_and_test(&ns->count))
> +		free_cgroup_ns(ns);
> +}
> +
> +extern struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
> +					       struct user_namespace *user_ns,
> +					       struct cgroup_namespace *old_ns);
> +
> +#endif  /* _LINUX_CGROUP_NAMESPACE_H */
> diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
> index 35fa08f..ac0d65b 100644
> --- a/include/linux/nsproxy.h
> +++ b/include/linux/nsproxy.h
> @@ -8,6 +8,7 @@ struct mnt_namespace;
>  struct uts_namespace;
>  struct ipc_namespace;
>  struct pid_namespace;
> +struct cgroup_namespace;
>  struct fs_struct;
>  
>  /*
> @@ -33,6 +34,7 @@ struct nsproxy {
>  	struct mnt_namespace *mnt_ns;
>  	struct pid_namespace *pid_ns_for_children;
>  	struct net 	     *net_ns;
> +	struct cgroup_namespace *cgroup_ns;
>  };
>  extern struct nsproxy init_nsproxy;
>  
> diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
> index 34a1e10..e56dd73 100644
> --- a/include/linux/proc_ns.h
> +++ b/include/linux/proc_ns.h
> @@ -6,6 +6,8 @@
>  
>  struct pid_namespace;
>  struct nsproxy;
> +struct task_struct;
> +struct inode;

These two lines seem unnecessary.

>  
>  struct proc_ns_operations {
>  	const char *name;
> @@ -27,6 +29,7 @@ extern const struct proc_ns_operations ipcns_operations;
>  extern const struct proc_ns_operations pidns_operations;
>  extern const struct proc_ns_operations userns_operations;
>  extern const struct proc_ns_operations mntns_operations;
> +extern const struct proc_ns_operations cgroupns_operations;
>  
>  /*
>   * We always define these enumerators
> @@ -37,6 +40,7 @@ enum {
>  	PROC_UTS_INIT_INO	= 0xEFFFFFFEU,
>  	PROC_USER_INIT_INO	= 0xEFFFFFFDU,
>  	PROC_PID_INIT_INO	= 0xEFFFFFFCU,
> +	PROC_CGROUP_INIT_INO	= 0xEFFFFFFBU,
>  };
>  
>  #ifdef CONFIG_PROC_FS
> diff --git a/kernel/Makefile b/kernel/Makefile
> index dc5c775..d9731e2 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -50,7 +50,7 @@ obj-$(CONFIG_BSD_PROCESS_ACCT) += acct.o
>  obj-$(CONFIG_KEXEC) += kexec.o
>  obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
>  obj-$(CONFIG_COMPAT) += compat.o
> -obj-$(CONFIG_CGROUPS) += cgroup.o
> +obj-$(CONFIG_CGROUPS) += cgroup.o cgroup_namespace.o
>  obj-$(CONFIG_CGROUP_FREEZER) += cgroup_freezer.o
>  obj-$(CONFIG_CPUSETS) += cpuset.o
>  obj-$(CONFIG_UTS_NS) += utsname.o
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index e12d36e..b1ae6d9 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -57,6 +57,8 @@
>  #include <linux/vmalloc.h> /* TODO: replace with more sophisticated array */
>  #include <linux/kthread.h>
>  #include <linux/delay.h>
> +#include <linux/proc_ns.h>
> +#include <linux/cgroup_namespace.h>
>  
>  #include <linux/atomic.h>
>  
> @@ -195,6 +197,15 @@ static void kill_css(struct cgroup_subsys_state *css);
>  static int cgroup_addrm_files(struct cgroup *cgrp, struct cftype cfts[],
>  			      bool is_add);
>  
> +struct cgroup_namespace init_cgroup_ns = {
> +	.count = {
> +		.counter = 1,
> +	},

.count = ATOMIC_INIT(1)
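
For illustration only, the same pattern in userspace C11. struct ns_ref,
NS_REF_INIT(), ns_get() and ns_put() are hypothetical names standing in
for struct cgroup_namespace, ATOMIC_INIT(), get_cgroup_ns() and
put_cgroup_ns(); <stdatomic.h> is used in place of the kernel's atomic_t
API, so this is a sketch of the idea and not kernel code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in for the refcounted part of struct cgroup_namespace. */
struct ns_ref {
	atomic_int count;
};

/* Plays the role of ATOMIC_INIT(): hides the layout of the counter. */
#define NS_REF_INIT(n) { .count = (n) }

/* Statically initialized with one reference, like init_cgroup_ns. */
static struct ns_ref init_ns = NS_REF_INIT(1);

/* Take a reference; tolerates NULL like get_cgroup_ns() in the patch. */
static struct ns_ref *ns_get(struct ns_ref *ns)
{
	if (ns)
		atomic_fetch_add(&ns->count, 1);
	return ns;
}

/*
 * Drop a reference. Returns true when the caller dropped the last one and
 * is therefore responsible for freeing the object.
 */
static bool ns_put(struct ns_ref *ns)
{
	int old;

	if (!ns)
		return false;
	old = atomic_fetch_sub(&ns->count, 1);
	assert(old > 0);	/* catch unbalanced puts */
	return old == 1;
}
```

The initializer macro keeps the static definition independent of how the
counter is laid out, which is the point of preferring ATOMIC_INIT(1) over
spelling out .counter by hand.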

> +	.proc_inum = PROC_CGROUP_INIT_INO,
> +	.user_ns = &init_user_ns,
> +	.root_cgrp = &cgrp_dfl_root.cgrp,
> +};
> +
>  /* IDR wrappers which synchronize using cgroup_idr_lock */
>  static int cgroup_idr_alloc(struct idr *idr, void *ptr, int start, int end,
>  			    gfp_t gfp_mask)
> @@ -4989,6 +5000,8 @@ int __init cgroup_init(void)
>  	unsigned long key;
>  	int ssid, err;
>  
> +	get_user_ns(init_cgroup_ns.user_ns);
> +
>  	BUG_ON(cgroup_init_cftypes(NULL, cgroup_dfl_base_files));
>  	BUG_ON(cgroup_init_cftypes(NULL, cgroup_legacy_base_files));
>  
> diff --git a/kernel/cgroup_namespace.c b/kernel/cgroup_namespace.c
> new file mode 100644
> index 0000000..0e0ef3a
> --- /dev/null
> +++ b/kernel/cgroup_namespace.c
> @@ -0,0 +1,127 @@
> +/*
> + *  Copyright (C) 2014 Google Inc.
> + *
> + *  Author: Aditya Kali (adityakali@google.com)
> + *
> + *  This program is free software; you can redistribute it and/or modify it
> + *  under the terms of the GNU General Public License as published by the Free
> + *  Software Foundation, version 2 of the License.
> + */
> +
> +#include <linux/cgroup.h>
> +#include <linux/cgroup_namespace.h>
> +#include <linux/sched.h>
> +#include <linux/slab.h>
> +#include <linux/nsproxy.h>
> +#include <linux/proc_ns.h>
> +
> +static struct cgroup_namespace *alloc_cgroup_ns(void)
> +{
> +	struct cgroup_namespace *new_ns;
> +
> +	new_ns = kzalloc(sizeof(struct cgroup_namespace), GFP_KERNEL);
> +	if (new_ns)
> +		atomic_set(&new_ns->count, 1);
> +	return new_ns;
> +}

Better fold this function into copy_cgroup_ns().
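
Something along these lines, shown as a userspace sketch with malloc-style
allocation. struct ns, alloc_inum() and copy_ns() are hypothetical
stand-ins for struct cgroup_namespace, proc_alloc_inum() and
copy_cgroup_ns(); the fail flag exists only so the error path can be
exercised.

```c
#include <errno.h>
#include <stdlib.h>

/* Reduced stand-in for struct cgroup_namespace. */
struct ns {
	int count;
	unsigned int inum;
};

/* Stand-in for proc_alloc_inum(); fails on request to exercise unwinding. */
static int alloc_inum(unsigned int *inum, int fail)
{
	if (fail)
		return -ENOSPC;
	*inum = 42;
	return 0;
}

/*
 * Allocation and refcount setup are done inline instead of in a separate
 * alloc helper. On failure NULL is returned and *err holds the error code.
 */
static struct ns *copy_ns(int fail_inum, int *err)
{
	struct ns *new_ns;

	*err = -ENOMEM;
	new_ns = calloc(1, sizeof(*new_ns));
	if (!new_ns)
		goto err_out;
	new_ns->count = 1;

	*err = alloc_inum(&new_ns->inum, fail_inum);
	if (*err)
		goto err_out;

	return new_ns;

err_out:
	free(new_ns);	/* free(NULL) is a no-op */
	return NULL;
}
```

There is a single caller, so the extra function adds little beyond the
indirection.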

> +
> +void free_cgroup_ns(struct cgroup_namespace *ns)
> +{
> +	cgroup_put(ns->root_cgrp);
> +	put_user_ns(ns->user_ns);
> +	proc_free_inum(ns->proc_inum);
> +	kfree(ns);
> +}
> +EXPORT_SYMBOL(free_cgroup_ns);

This should be a static inline function.

> +
> +struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
> +					struct user_namespace *user_ns,
> +					struct cgroup_namespace *old_ns)
> +{
> +	struct cgroup_namespace *new_ns = NULL;
> +	struct cgroup *cgrp = NULL;
> +	int err;
> +
> +	BUG_ON(!old_ns);
> +
> +	if (!(flags & CLONE_NEWCGROUP))
> +		return get_cgroup_ns(old_ns);
> +
> +	/* Allow only sysadmin to create cgroup namespace. */
> +	err = -EPERM;
> +	if (!ns_capable(user_ns, CAP_SYS_ADMIN))
> +		goto err_out;
> +
> +	/* CGROUPNS only virtualizes the cgroup path on the unified hierarchy.
> +	 */

The comment style should be

/*
 * ...
 */

> +	cgrp = get_task_cgroup(current);
> +
> +	err = -ENOMEM;
> +	new_ns = alloc_cgroup_ns();
> +	if (!new_ns)
> +		goto err_out;
> +
> +	err = proc_alloc_inum(&new_ns->proc_inum);
> +	if (err)
> +		goto err_out;
> +
> +	new_ns->user_ns = get_user_ns(user_ns);
> +	new_ns->root_cgrp = cgrp;
> +
> +	return new_ns;
> +
> +err_out:
> +	if (cgrp)
> +		cgroup_put(cgrp);
> +	kfree(new_ns);
> +	return ERR_PTR(err);
> +}
> +
> +static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
> +{
> +	pr_info("setns not supported for cgroup namespace");
> +	return -EINVAL;
> +}
> +
> +static void *cgroupns_get(struct task_struct *task)
> +{
> +	struct cgroup_namespace *ns = NULL;
> +	struct nsproxy *nsproxy;
> +
> +	task_lock(task);
> +	nsproxy = task->nsproxy;
> +	if (nsproxy) {
> +		ns = nsproxy->cgroup_ns;
> +		get_cgroup_ns(ns);
> +	}
> +	task_unlock(task);
> +
> +	return ns;
> +}
> +
> +static void cgroupns_put(void *ns)
> +{
> +	put_cgroup_ns(ns);
> +}
> +
> +static unsigned int cgroupns_inum(void *ns)
> +{
> +	struct cgroup_namespace *cgroup_ns = ns;
> +
> +	return cgroup_ns->proc_inum;
> +}
> +
> +const struct proc_ns_operations cgroupns_operations = {
> +	.name		= "cgroup",
> +	.type		= CLONE_NEWCGROUP,
> +	.get		= cgroupns_get,
> +	.put		= cgroupns_put,
> +	.install	= cgroupns_install,
> +	.inum		= cgroupns_inum,
> +};
> +
> +static __init int cgroup_namespaces_init(void)
> +{
> +	return 0;
> +}
> +subsys_initcall(cgroup_namespaces_init);

Why provide this unused init function?

> diff --git a/kernel/fork.c b/kernel/fork.c
> index 9b7d746..d22d793 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1797,7 +1797,7 @@ static int check_unshare_flags(unsigned long unshare_flags)
>  	if (unshare_flags & ~(CLONE_THREAD|CLONE_FS|CLONE_NEWNS|CLONE_SIGHAND|
>  				CLONE_VM|CLONE_FILES|CLONE_SYSVSEM|
>  				CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWNET|
> -				CLONE_NEWUSER|CLONE_NEWPID))
> +				CLONE_NEWUSER|CLONE_NEWPID|CLONE_NEWCGROUP))
>  		return -EINVAL;
>  	/*
>  	 * Not implemented, but pretend it works if there is nothing to
> diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
> index ef42d0a..a8b1970 100644
> --- a/kernel/nsproxy.c
> +++ b/kernel/nsproxy.c
> @@ -25,6 +25,7 @@
>  #include <linux/proc_ns.h>
>  #include <linux/file.h>
>  #include <linux/syscalls.h>
> +#include <linux/cgroup_namespace.h>
>  
>  static struct kmem_cache *nsproxy_cachep;
>  
> @@ -39,6 +40,7 @@ struct nsproxy init_nsproxy = {
>  #ifdef CONFIG_NET
>  	.net_ns			= &init_net,
>  #endif
> +	.cgroup_ns		= &init_cgroup_ns,
>  };
>  
>  static inline struct nsproxy *create_nsproxy(void)
> @@ -92,6 +94,13 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
>  		goto out_pid;
>  	}
>  
> +	new_nsp->cgroup_ns = copy_cgroup_ns(flags, user_ns,
> +					    tsk->nsproxy->cgroup_ns);
> +	if (IS_ERR(new_nsp->cgroup_ns)) {
> +		err = PTR_ERR(new_nsp->cgroup_ns);
> +		goto out_cgroup;
> +	}
> +
>  	new_nsp->net_ns = copy_net_ns(flags, user_ns, tsk->nsproxy->net_ns);
>  	if (IS_ERR(new_nsp->net_ns)) {
>  		err = PTR_ERR(new_nsp->net_ns);
> @@ -101,6 +110,9 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
>  	return new_nsp;
>  
>  out_net:
> +	if (new_nsp->cgroup_ns)
> +		put_cgroup_ns(new_nsp->cgroup_ns);
> +out_cgroup:
>  	if (new_nsp->pid_ns_for_children)
>  		put_pid_ns(new_nsp->pid_ns_for_children);
>  out_pid:
> @@ -128,7 +140,8 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
>  	struct nsproxy *new_ns;
>  
>  	if (likely(!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
> -			      CLONE_NEWPID | CLONE_NEWNET)))) {
> +			      CLONE_NEWPID | CLONE_NEWNET |
> +			      CLONE_NEWCGROUP)))) {
>  		get_nsproxy(old_ns);
>  		return 0;
>  	}
> @@ -165,6 +178,8 @@ void free_nsproxy(struct nsproxy *ns)
>  		put_ipc_ns(ns->ipc_ns);
>  	if (ns->pid_ns_for_children)
>  		put_pid_ns(ns->pid_ns_for_children);
> +	if (ns->cgroup_ns)
> +		put_cgroup_ns(ns->cgroup_ns);
>  	put_net(ns->net_ns);
>  	kmem_cache_free(nsproxy_cachep, ns);
>  }
> @@ -180,7 +195,7 @@ int unshare_nsproxy_namespaces(unsigned long unshare_flags,
>  	int err = 0;
>  
>  	if (!(unshare_flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
> -			       CLONE_NEWNET | CLONE_NEWPID)))
> +			       CLONE_NEWNET | CLONE_NEWPID | CLONE_NEWCGROUP)))
>  		return 0;
>  
>  	user_ns = new_cred ? new_cred->user_ns : current_user_ns();
> 

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv3 5/8] cgroup: introduce cgroup namespaces
@ 2014-12-12  8:54           ` Zefan Li
  0 siblings, 0 replies; 384+ messages in thread
From: Zefan Li @ 2014-12-12  8:54 UTC (permalink / raw)
  To: Aditya Kali
  Cc: tj, serge.hallyn, luto, ebiederm, cgroups, linux-kernel,
	linux-api, mingo, containers, jnagal, vgoyal, richard.weinberger

On 2014/12/5 9:55, Aditya Kali wrote:
> Introduce the ability to create new cgroup namespace. The newly created
> cgroup namespace remembers the cgroup of the process at the point
> of creation of the cgroup namespace (referred as cgroupns-root).
> The main purpose of cgroup namespace is to virtualize the contents
> of /proc/self/cgroup file. Processes inside a cgroup namespace
> are only able to see paths relative to their namespace root
> (unless they are moved outside of their cgroupns-root, at which point
>  they will see a relative path from their cgroupns-root).
> For a correctly setup container this enables container-tools
> (like libcontainer, lxc, lmctfy, etc.) to create completely virtualized
> containers without leaking system level cgroup hierarchy to the task.
> This patch only implements the 'unshare' part of the cgroupns.
> 
> Signed-off-by: Aditya Kali <adityakali@google.com>
> ---
>  fs/proc/namespaces.c             |   1 +
>  include/linux/cgroup.h           |  29 ++++++++-
>  include/linux/cgroup_namespace.h |  36 +++++++++++
>  include/linux/nsproxy.h          |   2 +
>  include/linux/proc_ns.h          |   4 ++
>  kernel/Makefile                  |   2 +-
>  kernel/cgroup.c                  |  13 ++++
>  kernel/cgroup_namespace.c        | 127 +++++++++++++++++++++++++++++++++++++++
>  kernel/fork.c                    |   2 +-
>  kernel/nsproxy.c                 |  19 +++++-
>  10 files changed, 230 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
> index 8902609..55bc5da 100644
> --- a/fs/proc/namespaces.c
> +++ b/fs/proc/namespaces.c
> @@ -32,6 +32,7 @@ static const struct proc_ns_operations *ns_entries[] = {
>  	&userns_operations,
>  #endif
>  	&mntns_operations,
> +	&cgroupns_operations,

Should be guarded with CONFIG_CGROUPS ?

There are other changes that break compile if !CONFIG_CGROUPS.

>  };
>  
>  static const struct file_operations ns_file_operations = {
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 6e7533b..94a5a0c 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -22,6 +22,8 @@
>  #include <linux/seq_file.h>
>  #include <linux/kernfs.h>
>  #include <linux/wait.h>
> +#include <linux/nsproxy.h>
> +#include <linux/types.h>
>  
>  #ifdef CONFIG_CGROUPS
>  
> @@ -460,6 +462,13 @@ struct cftype {
>  #endif
>  };
>  
> +struct cgroup_namespace {
> +	atomic_t		count;
> +	unsigned int		proc_inum;
> +	struct user_namespace	*user_ns;
> +	struct cgroup		*root_cgrp;
> +};
> +
>  extern struct cgroup_root cgrp_dfl_root;
>  extern struct css_set init_css_set;
>  
> @@ -584,10 +593,28 @@ static inline int cgroup_name(struct cgroup *cgrp, char *buf, size_t buflen)
>  	return kernfs_name(cgrp->kn, buf, buflen);
>  }
>  
> +static inline char * __must_check cgroup_path_ns(struct cgroup_namespace *ns,
> +						 struct cgroup *cgrp, char *buf,
> +						 size_t buflen)
> +{
> +	if (ns) {
> +		BUG_ON(!cgroup_on_dfl(cgrp));
> +		return kernfs_path_from_node(ns->root_cgrp->kn, cgrp->kn, buf,
> +					     buflen);
> +	} else {
> +		return kernfs_path(cgrp->kn, buf, buflen);
> +	}
> +}
> +
>  static inline char * __must_check cgroup_path(struct cgroup *cgrp, char *buf,
>  					      size_t buflen)
>  {
> -	return kernfs_path(cgrp->kn, buf, buflen);
> +	if (cgroup_on_dfl(cgrp)) {
> +		return cgroup_path_ns(current->nsproxy->cgroup_ns, cgrp, buf,
> +				      buflen);
> +	} else {
> +		return cgroup_path_ns(NULL, cgrp, buf, buflen);
> +	}
>  }
>  
>  static inline void pr_cont_cgroup_name(struct cgroup *cgrp)
> diff --git a/include/linux/cgroup_namespace.h b/include/linux/cgroup_namespace.h
> new file mode 100644
> index 0000000..0b97b8d
> --- /dev/null
> +++ b/include/linux/cgroup_namespace.h
> @@ -0,0 +1,36 @@
> +#ifndef _LINUX_CGROUP_NAMESPACE_H
> +#define _LINUX_CGROUP_NAMESPACE_H
> +
> +#include <linux/nsproxy.h>
> +#include <linux/cgroup.h>
> +#include <linux/types.h>
> +#include <linux/user_namespace.h>
> +
> +extern struct cgroup_namespace init_cgroup_ns;
> +
> +static inline struct cgroup *current_cgroupns_root(void)
> +{
> +	return current->nsproxy->cgroup_ns->root_cgrp;
> +}
> +
> +extern void free_cgroup_ns(struct cgroup_namespace *ns);
> +
> +static inline struct cgroup_namespace *get_cgroup_ns(
> +		struct cgroup_namespace *ns)
> +{
> +	if (ns)
> +		atomic_inc(&ns->count);
> +	return ns;
> +}
> +
> +static inline void put_cgroup_ns(struct cgroup_namespace *ns)
> +{
> +	if (ns && atomic_dec_and_test(&ns->count))
> +		free_cgroup_ns(ns);
> +}
> +
> +extern struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
> +					       struct user_namespace *user_ns,
> +					       struct cgroup_namespace *old_ns);
> +
> +#endif  /* _LINUX_CGROUP_NAMESPACE_H */
> diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
> index 35fa08f..ac0d65b 100644
> --- a/include/linux/nsproxy.h
> +++ b/include/linux/nsproxy.h
> @@ -8,6 +8,7 @@ struct mnt_namespace;
>  struct uts_namespace;
>  struct ipc_namespace;
>  struct pid_namespace;
> +struct cgroup_namespace;
>  struct fs_struct;
>  
>  /*
> @@ -33,6 +34,7 @@ struct nsproxy {
>  	struct mnt_namespace *mnt_ns;
>  	struct pid_namespace *pid_ns_for_children;
>  	struct net 	     *net_ns;
> +	struct cgroup_namespace *cgroup_ns;
>  };
>  extern struct nsproxy init_nsproxy;
>  
> diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
> index 34a1e10..e56dd73 100644
> --- a/include/linux/proc_ns.h
> +++ b/include/linux/proc_ns.h
> @@ -6,6 +6,8 @@
>  
>  struct pid_namespace;
>  struct nsproxy;
> +struct task_struct;
> +struct inode;

These two lines seem unnecessary.

>  
>  struct proc_ns_operations {
>  	const char *name;
> @@ -27,6 +29,7 @@ extern const struct proc_ns_operations ipcns_operations;
>  extern const struct proc_ns_operations pidns_operations;
>  extern const struct proc_ns_operations userns_operations;
>  extern const struct proc_ns_operations mntns_operations;
> +extern const struct proc_ns_operations cgroupns_operations;
>  
>  /*
>   * We always define these enumerators
> @@ -37,6 +40,7 @@ enum {
>  	PROC_UTS_INIT_INO	= 0xEFFFFFFEU,
>  	PROC_USER_INIT_INO	= 0xEFFFFFFDU,
>  	PROC_PID_INIT_INO	= 0xEFFFFFFCU,
> +	PROC_CGROUP_INIT_INO	= 0xEFFFFFFBU,
>  };
>  
>  #ifdef CONFIG_PROC_FS
> diff --git a/kernel/Makefile b/kernel/Makefile
> index dc5c775..d9731e2 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -50,7 +50,7 @@ obj-$(CONFIG_BSD_PROCESS_ACCT) += acct.o
>  obj-$(CONFIG_KEXEC) += kexec.o
>  obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
>  obj-$(CONFIG_COMPAT) += compat.o
> -obj-$(CONFIG_CGROUPS) += cgroup.o
> +obj-$(CONFIG_CGROUPS) += cgroup.o cgroup_namespace.o
>  obj-$(CONFIG_CGROUP_FREEZER) += cgroup_freezer.o
>  obj-$(CONFIG_CPUSETS) += cpuset.o
>  obj-$(CONFIG_UTS_NS) += utsname.o
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index e12d36e..b1ae6d9 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -57,6 +57,8 @@
>  #include <linux/vmalloc.h> /* TODO: replace with more sophisticated array */
>  #include <linux/kthread.h>
>  #include <linux/delay.h>
> +#include <linux/proc_ns.h>
> +#include <linux/cgroup_namespace.h>
>  
>  #include <linux/atomic.h>
>  
> @@ -195,6 +197,15 @@ static void kill_css(struct cgroup_subsys_state *css);
>  static int cgroup_addrm_files(struct cgroup *cgrp, struct cftype cfts[],
>  			      bool is_add);
>  
> +struct cgroup_namespace init_cgroup_ns = {
> +	.count = {
> +		.counter = 1,
> +	},

.count = ATOMIC_INIT(1)
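
For illustration, a sketch of the initializer with that change, as a
userspace model. The atomic_t typedef and ATOMIC_INIT() macro below are
stand-ins that mirror the kernel definitions, the struct is cut down to two
fields, and cgroup_ns_refcount() is a hypothetical accessor that is not part
of the patch:

```c
#include <assert.h>

/* Userspace stand-ins mirroring the kernel's atomic_t and ATOMIC_INIT(). */
typedef struct {
	int counter;
} atomic_t;
#define ATOMIC_INIT(i)	{ (i) }

/* Cut-down model of the structure added by this patch. */
struct cgroup_namespace {
	atomic_t	count;
	unsigned int	proc_inum;
};

/* Value taken from the enum this patch adds to include/linux/proc_ns.h. */
#define PROC_CGROUP_INIT_INO	0xEFFFFFFBU

struct cgroup_namespace init_cgroup_ns = {
	.count		= ATOMIC_INIT(1),	/* replaces .count = { .counter = 1 } */
	.proc_inum	= PROC_CGROUP_INIT_INO,
};

/* Hypothetical accessor so the initial reference count can be inspected. */
static int cgroup_ns_refcount(const struct cgroup_namespace *ns)
{
	assert(ns);
	return ns->count.counter;
}
```

The macro keeps the initializer independent of how atomic_t is laid out.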

> +	.proc_inum = PROC_CGROUP_INIT_INO,
> +	.user_ns = &init_user_ns,
> +	.root_cgrp = &cgrp_dfl_root.cgrp,
> +};
> +
>  /* IDR wrappers which synchronize using cgroup_idr_lock */
>  static int cgroup_idr_alloc(struct idr *idr, void *ptr, int start, int end,
>  			    gfp_t gfp_mask)
> @@ -4989,6 +5000,8 @@ int __init cgroup_init(void)
>  	unsigned long key;
>  	int ssid, err;
>  
> +	get_user_ns(init_cgroup_ns.user_ns);
> +
>  	BUG_ON(cgroup_init_cftypes(NULL, cgroup_dfl_base_files));
>  	BUG_ON(cgroup_init_cftypes(NULL, cgroup_legacy_base_files));
>  
> diff --git a/kernel/cgroup_namespace.c b/kernel/cgroup_namespace.c
> new file mode 100644
> index 0000000..0e0ef3a
> --- /dev/null
> +++ b/kernel/cgroup_namespace.c
> @@ -0,0 +1,127 @@
> +/*
> + *  Copyright (C) 2014 Google Inc.
> + *
> + *  Author: Aditya Kali (adityakali@google.com)
> + *
> + *  This program is free software; you can redistribute it and/or modify it
> + *  under the terms of the GNU General Public License as published by the Free
> + *  Software Foundation, version 2 of the License.
> + */
> +
> +#include <linux/cgroup.h>
> +#include <linux/cgroup_namespace.h>
> +#include <linux/sched.h>
> +#include <linux/slab.h>
> +#include <linux/nsproxy.h>
> +#include <linux/proc_ns.h>
> +
> +static struct cgroup_namespace *alloc_cgroup_ns(void)
> +{
> +	struct cgroup_namespace *new_ns;
> +
> +	new_ns = kzalloc(sizeof(struct cgroup_namespace), GFP_KERNEL);
> +	if (new_ns)
> +		atomic_set(&new_ns->count, 1);
> +	return new_ns;
> +}

Better fold this function into copy_cgroup_ns().
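
Roughly what the folded version could look like, as a userspace model only.
A plain int stands in for atomic_t, calloc() for kzalloc(), NULL for
ERR_PTR(-ENOMEM), and MODEL_CLONE_NEWCGROUP is a made-up flag value; the
capability check, proc inum allocation and cgroup reference from the patch
are left out:

```c
#include <assert.h>
#include <stdlib.h>

/* Cut-down userspace model; a plain int stands in for atomic_t. */
struct cgroup_namespace {
	int count;
};

/* Stand-in flag value for this model only. */
#define MODEL_CLONE_NEWCGROUP	0x1UL

/*
 * Model of copy_cgroup_ns() with alloc_cgroup_ns() folded in.
 * Returns NULL on allocation failure (the kernel code uses ERR_PTR()).
 */
struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
					struct cgroup_namespace *old_ns)
{
	struct cgroup_namespace *new_ns;

	assert(old_ns);			/* BUG_ON(!old_ns) in the patch */

	if (!(flags & MODEL_CLONE_NEWCGROUP)) {
		old_ns->count++;	/* get_cgroup_ns(old_ns) */
		return old_ns;
	}

	/* Formerly alloc_cgroup_ns(): allocate zeroed, start at one reference. */
	new_ns = calloc(1, sizeof(*new_ns));
	if (!new_ns)
		return NULL;
	new_ns->count = 1;

	return new_ns;
}
```

With a single caller, the allocation and the initial reference sit next to
the error handling that has to undo them.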

> +
> +void free_cgroup_ns(struct cgroup_namespace *ns)
> +{
> +	cgroup_put(ns->root_cgrp);
> +	put_user_ns(ns->user_ns);
> +	proc_free_inum(ns->proc_inum);
> +	kfree(ns);
> +}
> +EXPORT_SYMBOL(free_cgroup_ns);

This should be a static inline function.

> +
> +struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
> +					struct user_namespace *user_ns,
> +					struct cgroup_namespace *old_ns)
> +{
> +	struct cgroup_namespace *new_ns = NULL;
> +	struct cgroup *cgrp = NULL;
> +	int err;
> +
> +	BUG_ON(!old_ns);
> +
> +	if (!(flags & CLONE_NEWCGROUP))
> +		return get_cgroup_ns(old_ns);
> +
> +	/* Allow only sysadmin to create cgroup namespace. */
> +	err = -EPERM;
> +	if (!ns_capable(user_ns, CAP_SYS_ADMIN))
> +		goto err_out;
> +
> +	/* CGROUPNS only virtualizes the cgroup path on the unified hierarchy.
> +	 */

The comment style should be

/*
 * ...
 */

> +	cgrp = get_task_cgroup(current);
> +
> +	err = -ENOMEM;
> +	new_ns = alloc_cgroup_ns();
> +	if (!new_ns)
> +		goto err_out;
> +
> +	err = proc_alloc_inum(&new_ns->proc_inum);
> +	if (err)
> +		goto err_out;
> +
> +	new_ns->user_ns = get_user_ns(user_ns);
> +	new_ns->root_cgrp = cgrp;
> +
> +	return new_ns;
> +
> +err_out:
> +	if (cgrp)
> +		cgroup_put(cgrp);
> +	kfree(new_ns);
> +	return ERR_PTR(err);
> +}
> +
> +static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
> +{
> +	pr_info("setns not supported for cgroup namespace");
> +	return -EINVAL;
> +}
> +
> +static void *cgroupns_get(struct task_struct *task)
> +{
> +	struct cgroup_namespace *ns = NULL;
> +	struct nsproxy *nsproxy;
> +
> +	task_lock(task);
> +	nsproxy = task->nsproxy;
> +	if (nsproxy) {
> +		ns = nsproxy->cgroup_ns;
> +		get_cgroup_ns(ns);
> +	}
> +	task_unlock(task);
> +
> +	return ns;
> +}
> +
> +static void cgroupns_put(void *ns)
> +{
> +	put_cgroup_ns(ns);
> +}
> +
> +static unsigned int cgroupns_inum(void *ns)
> +{
> +	struct cgroup_namespace *cgroup_ns = ns;
> +
> +	return cgroup_ns->proc_inum;
> +}
> +
> +const struct proc_ns_operations cgroupns_operations = {
> +	.name		= "cgroup",
> +	.type		= CLONE_NEWCGROUP,
> +	.get		= cgroupns_get,
> +	.put		= cgroupns_put,
> +	.install	= cgroupns_install,
> +	.inum		= cgroupns_inum,
> +};
> +
> +static __init int cgroup_namespaces_init(void)
> +{
> +	return 0;
> +}
> +subsys_initcall(cgroup_namespaces_init);

Why provide this unused init function?

> diff --git a/kernel/fork.c b/kernel/fork.c
> index 9b7d746..d22d793 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1797,7 +1797,7 @@ static int check_unshare_flags(unsigned long unshare_flags)
>  	if (unshare_flags & ~(CLONE_THREAD|CLONE_FS|CLONE_NEWNS|CLONE_SIGHAND|
>  				CLONE_VM|CLONE_FILES|CLONE_SYSVSEM|
>  				CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWNET|
> -				CLONE_NEWUSER|CLONE_NEWPID))
> +				CLONE_NEWUSER|CLONE_NEWPID|CLONE_NEWCGROUP))
>  		return -EINVAL;
>  	/*
>  	 * Not implemented, but pretend it works if there is nothing to
> diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
> index ef42d0a..a8b1970 100644
> --- a/kernel/nsproxy.c
> +++ b/kernel/nsproxy.c
> @@ -25,6 +25,7 @@
>  #include <linux/proc_ns.h>
>  #include <linux/file.h>
>  #include <linux/syscalls.h>
> +#include <linux/cgroup_namespace.h>
>  
>  static struct kmem_cache *nsproxy_cachep;
>  
> @@ -39,6 +40,7 @@ struct nsproxy init_nsproxy = {
>  #ifdef CONFIG_NET
>  	.net_ns			= &init_net,
>  #endif
> +	.cgroup_ns		= &init_cgroup_ns,
>  };
>  
>  static inline struct nsproxy *create_nsproxy(void)
> @@ -92,6 +94,13 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
>  		goto out_pid;
>  	}
>  
> +	new_nsp->cgroup_ns = copy_cgroup_ns(flags, user_ns,
> +					    tsk->nsproxy->cgroup_ns);
> +	if (IS_ERR(new_nsp->cgroup_ns)) {
> +		err = PTR_ERR(new_nsp->cgroup_ns);
> +		goto out_cgroup;
> +	}
> +
>  	new_nsp->net_ns = copy_net_ns(flags, user_ns, tsk->nsproxy->net_ns);
>  	if (IS_ERR(new_nsp->net_ns)) {
>  		err = PTR_ERR(new_nsp->net_ns);
> @@ -101,6 +110,9 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
>  	return new_nsp;
>  
>  out_net:
> +	if (new_nsp->cgroup_ns)
> +		put_cgroup_ns(new_nsp->cgroup_ns);
> +out_cgroup:
>  	if (new_nsp->pid_ns_for_children)
>  		put_pid_ns(new_nsp->pid_ns_for_children);
>  out_pid:
> @@ -128,7 +140,8 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
>  	struct nsproxy *new_ns;
>  
>  	if (likely(!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
> -			      CLONE_NEWPID | CLONE_NEWNET)))) {
> +			      CLONE_NEWPID | CLONE_NEWNET |
> +			      CLONE_NEWCGROUP)))) {
>  		get_nsproxy(old_ns);
>  		return 0;
>  	}
> @@ -165,6 +178,8 @@ void free_nsproxy(struct nsproxy *ns)
>  		put_ipc_ns(ns->ipc_ns);
>  	if (ns->pid_ns_for_children)
>  		put_pid_ns(ns->pid_ns_for_children);
> +	if (ns->cgroup_ns)
> +		put_cgroup_ns(ns->cgroup_ns);
>  	put_net(ns->net_ns);
>  	kmem_cache_free(nsproxy_cachep, ns);
>  }
> @@ -180,7 +195,7 @@ int unshare_nsproxy_namespaces(unsigned long unshare_flags,
>  	int err = 0;
>  
>  	if (!(unshare_flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
> -			       CLONE_NEWNET | CLONE_NEWPID)))
> +			       CLONE_NEWNET | CLONE_NEWPID | CLONE_NEWCGROUP)))
>  		return 0;
>  
>  	user_ns = new_cred ? new_cred->user_ns : current_user_ns();
> 


^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv3 7/8] cgroup: mount cgroupns-root when inside non-init cgroupns
  2014-12-05  1:55     ` Aditya Kali
@ 2014-12-12  8:55         ` Zefan Li
  -1 siblings, 0 replies; 384+ messages in thread
From: Zefan Li @ 2014-12-12  8:55 UTC (permalink / raw)
  To: Aditya Kali
  Cc: richard.weinberger-Re5JQEeQqe8AvxtiuMwx3w,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, luto-kltTT9wpgjJwATOyAt5JVQ,
	mingo-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	tj-DgEjT+Ai2ygdnm+yROfE0A, cgroups-u79uwXL29TY76Z2rM5mHXA

On 2014/12/5 9:55, Aditya Kali wrote:
> This patch enables cgroup mounting inside userns when a process
> as appropriate privileges. The cgroup filesystem mounted is

s/as/has

> rooted at the cgroupns-root. Thus, in a container-setup, only
> the hierarchy under the cgroupns-root is exposed inside the container.
> This allows container management tools to run inside the containers
> without depending on any global state.
> In order to support this, a new kernfs api is added to lookup the
> dentry for the cgroupns-root.
> 
> Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> ---
>  fs/kernfs/mount.c      | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/kernfs.h |  2 ++
>  kernel/cgroup.c        | 46 +++++++++++++++++++++++++++++++++++++++++++++-
>  3 files changed, 95 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
> index f973ae9..efe5e15 100644
> --- a/fs/kernfs/mount.c
> +++ b/fs/kernfs/mount.c
> @@ -62,6 +62,54 @@ struct kernfs_root *kernfs_root_from_sb(struct super_block *sb)
>  	return NULL;
>  }
>  
> +/**
> + * kernfs_obtain_root - get a dentry for the given kernfs_node
> + * @sb: the kernfs super_block
> + * @kn: kernfs_node for which a dentry is needed
> + *
> + * This can used used by callers which want to mount only a part of the kernfs

s/used used/be used/

s/which/who

> + * as root of the filesystem.
> + */
> +struct dentry *kernfs_obtain_root(struct super_block *sb,
> +				  struct kernfs_node *kn)
> +{
> +	struct dentry *dentry;
> +	struct inode *inode;
> +
> +	BUG_ON(sb->s_op != &kernfs_sops);
> +
> +	/* inode for the given kernfs_node should already exist. */
> +	inode = ilookup(sb, kn->ino);
> +	if (!inode) {
> +		pr_debug("kernfs: could not get inode for '");
> +		pr_cont_kernfs_path(kn);
> +		pr_cont("'.\n");
> +		return ERR_PTR(-EINVAL);
> +	}
> +
> +	/* instantiate and link root dentry */
> +	dentry = d_obtain_root(inode);
> +	if (!dentry) {
> +		pr_debug("kernfs: could not get dentry for '");
> +		pr_cont_kernfs_path(kn);
> +		pr_cont("'.\n");
> +		return ERR_PTR(-ENOMEM);
> +	}
> +
> +	/* If this is a new dentry, set it up. We need kernfs_mutex because this
> +	 * may be called by callers other than kernfs_fill_super. */

/*
 * ...
 */

> +	mutex_lock(&kernfs_mutex);
> +	if (!dentry->d_fsdata) {
> +		kernfs_get(kn);
> +		dentry->d_fsdata = kn;
> +	} else {
> +		WARN_ON(dentry->d_fsdata != kn);
> +	}
> +	mutex_unlock(&kernfs_mutex);
> +
> +	return dentry;
> +}

Separate this into a standalone patch?

> +
>  static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
>  {
>  	struct kernfs_super_info *info = kernfs_info(sb);
> diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
> index 3c2be75..b9538e0 100644
> --- a/include/linux/kernfs.h
> +++ b/include/linux/kernfs.h
> @@ -274,6 +274,8 @@ void kernfs_put(struct kernfs_node *kn);
>  struct kernfs_node *kernfs_node_from_dentry(struct dentry *dentry);
>  struct kernfs_root *kernfs_root_from_sb(struct super_block *sb);
>  
> +struct dentry *kernfs_obtain_root(struct super_block *sb,
> +				  struct kernfs_node *kn);
>  struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
>  				       unsigned int flags, void *priv);
>  void kernfs_destroy_root(struct kernfs_root *root);
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index b1ae6d9..e779890 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -1438,6 +1438,14 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
>  			return -ENOENT;
>  	}
>  
> +	/* If inside a non-init cgroup namespace, only allow default hierarchy
> +	 * to be mounted.
> +	 */

/*
 * ...
 */

> +	if ((current->nsproxy->cgroup_ns != &init_cgroup_ns) &&
> +	    !(opts->flags & CGRP_ROOT_SANE_BEHAVIOR)) {
> +		return -EINVAL;
> +	}
> +
>  	if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
>  		pr_warn("sane_behavior: this is still under development and its behaviors will change, proceed at your own risk\n");
>  		if (nr_opts != 1) {
> @@ -1630,6 +1638,15 @@ static void init_cgroup_root(struct cgroup_root *root,
>  		set_bit(CGRP_CPUSET_CLONE_CHILDREN, &root->cgrp.flags);
>  }
>  
> +struct dentry *cgroupns_get_root(struct super_block *sb,
> +				 struct cgroup_namespace *ns)
> +{
> +	struct dentry *nsdentry;
> +
> +	nsdentry = kernfs_obtain_root(sb, ns->root_cgrp->kn);
> +	return nsdentry;
> +}
> +
>  static int cgroup_setup_root(struct cgroup_root *root, unsigned int ss_mask)
>  {
>  	LIST_HEAD(tmp_links);
> @@ -1734,6 +1751,14 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
>  	int ret;
>  	int i;
>  	bool new_sb;
> +	struct cgroup_namespace *ns =
> +		get_cgroup_ns(current->nsproxy->cgroup_ns);
> +
> +	/* Check if the caller has permission to mount. */
> +	if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) {
> +		put_cgroup_ns(ns);
> +		return ERR_PTR(-EPERM);
> +	}
>  
>  	/*
>  	 * The first time anyone tries to mount a cgroup, enable the list
> @@ -1866,11 +1891,28 @@ out_free:
>  	kfree(opts.release_agent);
>  	kfree(opts.name);
>  
> -	if (ret)
> +	if (ret) {
> +		put_cgroup_ns(ns);
>  		return ERR_PTR(ret);
> +	}
>  
>  	dentry = kernfs_mount(fs_type, flags, root->kf_root,
>  				CGROUP_SUPER_MAGIC, &new_sb);
> +
> +	if (!IS_ERR(dentry) && (root == &cgrp_dfl_root)) {
> +		/* If this mount is for the default hierarchy in non-init cgroup
> +		 * namespace, then instead of root cgroup's dentry, we return
> +		 * the dentry corresponding to the cgroupns->root_cgrp.
> +		 */
> +		if (ns != &init_cgroup_ns) {
> +			struct dentry *nsdentry;
> +
> +			nsdentry = cgroupns_get_root(dentry->d_sb, ns);
> +			dput(dentry);
> +			dentry = nsdentry;
> +		}
> +	}
> +
>  	if (IS_ERR(dentry) || !new_sb)
>  		cgroup_put(&root->cgrp);
>  
> @@ -1883,6 +1925,7 @@ out_free:
>  		deactivate_super(pinned_sb);
>  	}
>  
> +	put_cgroup_ns(ns);
>  	return dentry;
>  }
>  
> @@ -1911,6 +1954,7 @@ static struct file_system_type cgroup_fs_type = {
>  	.name = "cgroup",
>  	.mount = cgroup_mount,
>  	.kill_sb = cgroup_kill_sb,
> +	.fs_flags = FS_USERNS_MOUNT,
>  };
>  
>  static struct kobject *cgroup_kobj;
> 

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
  2014-12-05  1:55       ` Aditya Kali
@ 2014-12-14 23:05           ` Richard Weinberger
  -1 siblings, 0 replies; 384+ messages in thread
From: Richard Weinberger @ 2014-12-14 23:05 UTC (permalink / raw)
  To: Aditya Kali, tj-DgEjT+Ai2ygdnm+yROfE0A,
	lizefan-hv44wF8Li93QT0dZR+AlfA,
	serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA, luto-kltTT9wpgjJwATOyAt5JVQ,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, mingo-H+wXaHxf7aLQT0dZR+AlfA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Aditya,

I gave your patch set a try but it does not work for me.
Maybe you can shed some light on the issues I'm facing.
Sadly I still have not had time to dig into your code.

On 05.12.2014 at 02:55, Aditya Kali wrote:
> Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> ---
>  Documentation/cgroups/namespace.txt | 147 ++++++++++++++++++++++++++++++++++++
>  1 file changed, 147 insertions(+)
>  create mode 100644 Documentation/cgroups/namespace.txt
> 
> diff --git a/Documentation/cgroups/namespace.txt b/Documentation/cgroups/namespace.txt
> new file mode 100644
> index 0000000..6480379
> --- /dev/null
> +++ b/Documentation/cgroups/namespace.txt
> @@ -0,0 +1,147 @@
> +			CGroup Namespaces
> +
> +CGroup Namespace provides a mechanism to virtualize the view of the
> +/proc/<pid>/cgroup file. The CLONE_NEWCGROUP clone-flag can be used with
> +clone() and unshare() syscalls to create a new cgroup namespace.
> +The process running inside the cgroup namespace will have its /proc/<pid>/cgroup
> +output restricted to cgroupns-root. cgroupns-root is the cgroup of the process
> +at the time of creation of the cgroup namespace.
> +
> +Prior to CGroup Namespace, the /proc/<pid>/cgroup file used to show complete
> +path of the cgroup of a process. In a container setup (where a set of cgroups
> +and namespaces are intended to isolate processes), the /proc/<pid>/cgroup file
> +may leak potential system level information to the isolated processes.
> +
> +For Example:
> +  $ cat /proc/self/cgroup
> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
> +
> +The path '/batchjobs/container_id1' can generally be considered as system-data
> +and its desirable to not expose it to the isolated process.
> +
> +CGroup Namespaces can be used to restrict visibility of this path.
> +For Example:
> +  # Before creating cgroup namespace
> +  $ ls -l /proc/self/ns/cgroup
> +  lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
> +  $ cat /proc/self/cgroup
> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
> +
> +  # unshare(CLONE_NEWCGROUP) and exec /bin/bash
> +  $ ~/unshare -c
> +  [ns]$ ls -l /proc/self/ns/cgroup
> +  lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
> +  # From within new cgroupns, process sees that its in the root cgroup
> +  [ns]$ cat /proc/self/cgroup
> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
> +
> +  # From global cgroupns:
> +  $ cat /proc/<pid>/cgroup
> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
> +
> +  # Unshare cgroupns along with userns and mountns
> +  # Following calls unshare(CLONE_NEWCGROUP|CLONE_NEWUSER|CLONE_NEWNS), then
> +  # sets up uid/gid map and execs /bin/bash
> +  $ ~/unshare -c -u -m

This command does not issue CLONE_NEWUSER; -U does.
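
To make the difference concrete, here is a small sketch of how the option
letters would map to unshare(2) flags. It assumes the util-linux meanings of
-u (UTS), -U (user) and -m (mount), and assumes that -c in the patched
~/unshare means CLONE_NEWCGROUP. unshare_flags_for() is a hypothetical helper
and does not call unshare(2) itself:

```c
#include <assert.h>
#include <linux/sched.h>	/* CLONE_NEW* flag definitions */

/* Older kernel headers may lack this flag; 0x02000000 is the mainline value. */
#ifndef CLONE_NEWCGROUP
#define CLONE_NEWCGROUP 0x02000000
#endif

/*
 * Hypothetical helper: translate a string of option letters into
 * unshare(2) flags. The 'c' mapping is an assumption about the patched tool.
 */
static int unshare_flags_for(const char *opts)
{
	int flags = 0;

	assert(opts);
	for (; *opts; opts++) {
		switch (*opts) {
		case 'c':
			flags |= CLONE_NEWCGROUP;
			break;
		case 'u':
			flags |= CLONE_NEWUTS;	/* UTS, not user */
			break;
		case 'U':
			flags |= CLONE_NEWUSER;
			break;
		case 'm':
			flags |= CLONE_NEWNS;
			break;
		default:
			break;	/* unknown letters are ignored in this sketch */
		}
	}
	return flags;
}
```

Under these assumptions "-c -u -m" yields
CLONE_NEWCGROUP|CLONE_NEWUTS|CLONE_NEWNS, while the comment in the
documentation describes the flags that "-c -U -m" would produce.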

> +  # Originally, we were in /batchjobs/container_id1 cgroup. Mount our own cgroup
> +  # hierarchy.
> +  [ns]$ mount -t cgroup cgroup /tmp/cgroup
> +  [ns]$ ls -l /tmp/cgroup
> +  total 0
> +  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.controllers
> +  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.populated
> +  -rw-r--r-- 1 root root 0 2014-10-13 09:25 cgroup.procs
> +  -rw-r--r-- 1 root root 0 2014-10-13 09:32 cgroup.subtree_control

I've patched libvirt-lxc to issue CLONE_NEWCGROUP and to not bind mount cgroupfs into the container.
But I'm unable to mount cgroupfs within the container; mount(2) fails with EINVAL.
Also, /proc/self/cgroup still shows the cgroup paths from outside.

---cut---
container:/ # ls /sys/fs/cgroup/
container:/ # mount -t cgroup none /sys/fs/cgroup/
mount: wrong fs type, bad option, bad superblock on none,
       missing codepage or helper program, or other error

       In some cases useful info is found in syslog - try
       dmesg | tail or so.
container:/ # cat /proc/self/cgroup
8:memory:/machine/test00.libvirt-lxc
7:devices:/machine/test00.libvirt-lxc
6:hugetlb:/
5:cpuset:/machine/test00.libvirt-lxc
4:blkio:/machine/test00.libvirt-lxc
3:cpu,cpuacct:/machine/test00.libvirt-lxc
2:freezer:/machine/test00.libvirt-lxc
1:name=systemd:/user.slice/user-0.slice/session-c2.scope
container:/ # ls -la /proc/self/ns
total 0
dr-x--x--x 2 root root 0 Dec 14 23:02 .
dr-xr-xr-x 8 root root 0 Dec 14 23:02 ..
lrwxrwxrwx 1 root root 0 Dec 14 23:02 cgroup -> cgroup:[4026532240]
lrwxrwxrwx 1 root root 0 Dec 14 23:02 ipc -> ipc:[4026532238]
lrwxrwxrwx 1 root root 0 Dec 14 23:02 mnt -> mnt:[4026532235]
lrwxrwxrwx 1 root root 0 Dec 14 23:02 net -> net:[4026532242]
lrwxrwxrwx 1 root root 0 Dec 14 23:02 pid -> pid:[4026532239]
lrwxrwxrwx 1 root root 0 Dec 14 23:02 user -> user:[4026532234]
lrwxrwxrwx 1 root root 0 Dec 14 23:02 uts -> uts:[4026532236]
container:/ #

#host side
lxc-os132:~ # ls -la /proc/self/ns
total 0
dr-x--x--x 2 root root 0 Dec 14 23:56 .
dr-xr-xr-x 8 root root 0 Dec 14 23:56 ..
lrwxrwxrwx 1 root root 0 Dec 14 23:56 cgroup -> cgroup:[4026531835]
lrwxrwxrwx 1 root root 0 Dec 14 23:56 ipc -> ipc:[4026531839]
lrwxrwxrwx 1 root root 0 Dec 14 23:56 mnt -> mnt:[4026531840]
lrwxrwxrwx 1 root root 0 Dec 14 23:56 net -> net:[4026531957]
lrwxrwxrwx 1 root root 0 Dec 14 23:56 pid -> pid:[4026531836]
lrwxrwxrwx 1 root root 0 Dec 14 23:56 user -> user:[4026531837]
lrwxrwxrwx 1 root root 0 Dec 14 23:56 uts -> uts:[4026531838]
---cut---

Any ideas?

Thanks,
//richard

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
  2014-12-14 23:05           ` Richard Weinberger
@ 2015-01-05 22:48               ` Aditya Kali
  -1 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2015-01-05 22:48 UTC (permalink / raw)
  To: Richard Weinberger
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel, Andy Lutomirski,
	Eric W. Biederman, Tejun Heo, cgroups mailinglist, Ingo Molnar

On Sun, Dec 14, 2014 at 3:05 PM, Richard Weinberger <richard@nod.at> wrote:
> Aditya,
>
> I gave your patch set a try but it does not work for me.
> Maybe you can bring some light into the issues I'm facing.
> Sadly I still had no time to dig into your code.
>
> On 05.12.2014 at 02:55, Aditya Kali wrote:
>> Signed-off-by: Aditya Kali <adityakali@google.com>
>> ---
>>  Documentation/cgroups/namespace.txt | 147 ++++++++++++++++++++++++++++++++++++
>>  1 file changed, 147 insertions(+)
>>  create mode 100644 Documentation/cgroups/namespace.txt
>>
>> diff --git a/Documentation/cgroups/namespace.txt b/Documentation/cgroups/namespace.txt
>> new file mode 100644
>> index 0000000..6480379
>> --- /dev/null
>> +++ b/Documentation/cgroups/namespace.txt
>> @@ -0,0 +1,147 @@
>> +                     CGroup Namespaces
>> +
>> +CGroup Namespace provides a mechanism to virtualize the view of the
>> +/proc/<pid>/cgroup file. The CLONE_NEWCGROUP clone-flag can be used with
>> +clone() and unshare() syscalls to create a new cgroup namespace.
>> +The process running inside the cgroup namespace will have its /proc/<pid>/cgroup
>> +output restricted to cgroupns-root. cgroupns-root is the cgroup of the process
>> +at the time of creation of the cgroup namespace.
>> +
>> +Prior to CGroup Namespace, the /proc/<pid>/cgroup file used to show complete
>> +path of the cgroup of a process. In a container setup (where a set of cgroups
>> +and namespaces are intended to isolate processes), the /proc/<pid>/cgroup file
>> +may leak potential system level information to the isolated processes.
>> +
>> +For Example:
>> +  $ cat /proc/self/cgroup
>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
>> +
>> +The path '/batchjobs/container_id1' can generally be considered system data,
>> +and it is desirable not to expose it to the isolated process.
>> +
>> +CGroup Namespaces can be used to restrict visibility of this path.
>> +For Example:
>> +  # Before creating cgroup namespace
>> +  $ ls -l /proc/self/ns/cgroup
>> +  lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
>> +  $ cat /proc/self/cgroup
>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
>> +
>> +  # unshare(CLONE_NEWCGROUP) and exec /bin/bash
>> +  $ ~/unshare -c
>> +  [ns]$ ls -l /proc/self/ns/cgroup
>> +  lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
>> +  # From within new cgroupns, process sees that it's in the root cgroup
>> +  [ns]$ cat /proc/self/cgroup
>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
>> +
>> +  # From global cgroupns:
>> +  $ cat /proc/<pid>/cgroup
>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
>> +
>> +  # Unshare cgroupns along with userns and mountns
>> +  # Following calls unshare(CLONE_NEWCGROUP|CLONE_NEWUSER|CLONE_NEWNS), then
>> +  # sets up uid/gid map and execs /bin/bash
>> +  $ ~/unshare -c -u -m
>
> This command does not issue CLONE_NEWUSER, -U does.
>
I was using a custom unshare binary. But I will update the command
line to be similar to the one in util-linux.

>> +  # Originally, we were in /batchjobs/container_id1 cgroup. Mount our own cgroup
>> +  # hierarchy.
>> +  [ns]$ mount -t cgroup cgroup /tmp/cgroup
>> +  [ns]$ ls -l /tmp/cgroup
>> +  total 0
>> +  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.controllers
>> +  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.populated
>> +  -rw-r--r-- 1 root root 0 2014-10-13 09:25 cgroup.procs
>> +  -rw-r--r-- 1 root root 0 2014-10-13 09:32 cgroup.subtree_control
>
> I've patched libvirt-lxc to issue CLONE_NEWCGROUP and not bind mount cgroupfs into a container.
> But I'm unable to mount cgroupfs within the container, mount(2) is failing with EINVAL.
> And /proc/self/cgroup still shows the cgroup from outside.
>
> ---cut---
> container:/ # ls /sys/fs/cgroup/
> container:/ # mount -t cgroup none /sys/fs/cgroup/

You need to pass the "-o __DEVEL__sane_behavior" option to mount.
Inside the container, only the unified hierarchy can be mounted, so
for now that option is needed. I will fix the documentation too.
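
For anyone retrying this, here is a minimal sketch of that mount step as a
shell helper. The function name mount_unified is made up for illustration.
The option is written __DEVEL__sane_behavior (two underscores after DEVEL),
which is the spelling used in the kernel's unified-hierarchy documentation
of this period. It assumes root inside the container and a kernel that still
accepts the development option.

```shell
#!/bin/sh
# mount_unified is a hypothetical helper, not part of the patch set.
# It mounts the unified cgroup hierarchy at the given directory using the
# development-time mount option discussed in this thread.
mount_unified() {
    target=$1
    # Fail early with a distinct status if the mount point is missing.
    if [ ! -d "$target" ]; then
        echo "mount_unified: no such directory: $target" >&2
        return 2
    fi
    mount -t cgroup -o __DEVEL__sane_behavior none "$target"
}

# Example (as root, inside the container):
#   mount_unified /sys/fs/cgroup
```

A plain "mount -t cgroup none <dir>" without the option asks for a legacy
hierarchy, which per the explanation above is not mountable from inside the
container.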

> mount: wrong fs type, bad option, bad superblock on none,
>        missing codepage or helper program, or other error
>
>        In some cases useful info is found in syslog - try
>        dmesg | tail or so.
> container:/ # cat /proc/self/cgroup
> 8:memory:/machine/test00.libvirt-lxc
> 7:devices:/machine/test00.libvirt-lxc
> 6:hugetlb:/
> 5:cpuset:/machine/test00.libvirt-lxc
> 4:blkio:/machine/test00.libvirt-lxc
> 3:cpu,cpuacct:/machine/test00.libvirt-lxc
> 2:freezer:/machine/test00.libvirt-lxc
> 1:name=systemd:/user.slice/user-0.slice/session-c2.scope
> container:/ # ls -la /proc/self/ns
> total 0
> dr-x--x--x 2 root root 0 Dec 14 23:02 .
> dr-xr-xr-x 8 root root 0 Dec 14 23:02 ..
> lrwxrwxrwx 1 root root 0 Dec 14 23:02 cgroup -> cgroup:[4026532240]
> lrwxrwxrwx 1 root root 0 Dec 14 23:02 ipc -> ipc:[4026532238]
> lrwxrwxrwx 1 root root 0 Dec 14 23:02 mnt -> mnt:[4026532235]
> lrwxrwxrwx 1 root root 0 Dec 14 23:02 net -> net:[4026532242]
> lrwxrwxrwx 1 root root 0 Dec 14 23:02 pid -> pid:[4026532239]
> lrwxrwxrwx 1 root root 0 Dec 14 23:02 user -> user:[4026532234]
> lrwxrwxrwx 1 root root 0 Dec 14 23:02 uts -> uts:[4026532236]
> container:/ #
>
> #host side
> lxc-os132:~ # ls -la /proc/self/ns
> total 0
> dr-x--x--x 2 root root 0 Dec 14 23:56 .
> dr-xr-xr-x 8 root root 0 Dec 14 23:56 ..
> lrwxrwxrwx 1 root root 0 Dec 14 23:56 cgroup -> cgroup:[4026531835]
> lrwxrwxrwx 1 root root 0 Dec 14 23:56 ipc -> ipc:[4026531839]
> lrwxrwxrwx 1 root root 0 Dec 14 23:56 mnt -> mnt:[4026531840]
> lrwxrwxrwx 1 root root 0 Dec 14 23:56 net -> net:[4026531957]
> lrwxrwxrwx 1 root root 0 Dec 14 23:56 pid -> pid:[4026531836]
> lrwxrwxrwx 1 root root 0 Dec 14 23:56 user -> user:[4026531837]
> lrwxrwxrwx 1 root root 0 Dec 14 23:56 uts -> uts:[4026531838]
> ---cut---
>
> Any ideas?
>

Please try passing "-o __DEVEL__sane_behavior" to the mount command.
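
After retrying, one way to see whether /proc/self/cgroup is being
virtualized is to list the entries whose path is still something other than
"/". This is only a sketch: non_root_cgroups is an illustrative name, and it
assumes the "hierarchy-ID:controllers:path" line format shown in the output
above.

```shell
#!/bin/sh
# non_root_cgroups is a hypothetical helper for illustration only.
# It reads /proc/<pid>/cgroup-style lines on stdin and prints every line
# whose cgroup path is not "/".
non_root_cgroups() {
    awk '{
        path = $0
        # Drop the first two colon-separated fields (hierarchy ID, controllers).
        sub(/^[^:]*:[^:]*:/, "", path)
        if (path != "/") print
    }'
}

# Example:
#   non_root_cgroups < /proc/self/cgroup
```

For a task whose cgroup is the cgroupns-root, this should print nothing for
the hierarchies that the namespace covers.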

> Thanks,
> //richard


Thanks,
-- 
Aditya

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
  2015-01-05 22:48               ` Aditya Kali
@ 2015-01-05 22:52                   ` Richard Weinberger
  -1 siblings, 0 replies; 384+ messages in thread
From: Richard Weinberger @ 2015-01-05 22:52 UTC (permalink / raw)
  To: Aditya Kali
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel, Andy Lutomirski,
	Eric W. Biederman, Tejun Heo, cgroups mailinglist, Ingo Molnar

On 05.01.2015 at 23:48, Aditya Kali wrote:
> On Sun, Dec 14, 2014 at 3:05 PM, Richard Weinberger <richard@nod.at> wrote:
>> Aditya,
>>
>> I gave your patch set a try but it does not work for me.
>> Maybe you can bring some light into the issues I'm facing.
>> Sadly I still had no time to dig into your code.
>>
>> On 05.12.2014 at 02:55, Aditya Kali wrote:
>>> Signed-off-by: Aditya Kali <adityakali@google.com>
>>> ---
>>>  Documentation/cgroups/namespace.txt | 147 ++++++++++++++++++++++++++++++++++++
>>>  1 file changed, 147 insertions(+)
>>>  create mode 100644 Documentation/cgroups/namespace.txt
>>>
>>> diff --git a/Documentation/cgroups/namespace.txt b/Documentation/cgroups/namespace.txt
>>> new file mode 100644
>>> index 0000000..6480379
>>> --- /dev/null
>>> +++ b/Documentation/cgroups/namespace.txt
>>> @@ -0,0 +1,147 @@
>>> +                     CGroup Namespaces
>>> +
>>> +CGroup Namespace provides a mechanism to virtualize the view of the
>>> +/proc/<pid>/cgroup file. The CLONE_NEWCGROUP clone-flag can be used with
>>> +clone() and unshare() syscalls to create a new cgroup namespace.
>>> +The process running inside the cgroup namespace will have its /proc/<pid>/cgroup
>>> +output restricted to cgroupns-root. cgroupns-root is the cgroup of the process
>>> +at the time of creation of the cgroup namespace.
>>> +
>>> +Prior to CGroup Namespace, the /proc/<pid>/cgroup file used to show complete
>>> +path of the cgroup of a process. In a container setup (where a set of cgroups
>>> +and namespaces are intended to isolate processes), the /proc/<pid>/cgroup file
>>> +may leak potential system level information to the isolated processes.
>>> +
>>> +For Example:
>>> +  $ cat /proc/self/cgroup
>>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
>>> +
>>> +The path '/batchjobs/container_id1' can generally be considered system data,
>>> +and it is desirable not to expose it to the isolated process.
>>> +
>>> +CGroup Namespaces can be used to restrict visibility of this path.
>>> +For Example:
>>> +  # Before creating cgroup namespace
>>> +  $ ls -l /proc/self/ns/cgroup
>>> +  lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
>>> +  $ cat /proc/self/cgroup
>>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
>>> +
>>> +  # unshare(CLONE_NEWCGROUP) and exec /bin/bash
>>> +  $ ~/unshare -c
>>> +  [ns]$ ls -l /proc/self/ns/cgroup
>>> +  lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
>>> +  # From within new cgroupns, process sees that it's in the root cgroup
>>> +  [ns]$ cat /proc/self/cgroup
>>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
>>> +
>>> +  # From global cgroupns:
>>> +  $ cat /proc/<pid>/cgroup
>>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
>>> +
>>> +  # Unshare cgroupns along with userns and mountns
>>> +  # Following calls unshare(CLONE_NEWCGROUP|CLONE_NEWUSER|CLONE_NEWNS), then
>>> +  # sets up uid/gid map and execs /bin/bash
>>> +  $ ~/unshare -c -u -m
>>
>> This command does not issue CLONE_NEWUSER, -U does.
>>
> I was using a custom unshare binary. But I will update the command
> line to be similar to the one in util-linux.
> 
>>> +  # Originally, we were in /batchjobs/container_id1 cgroup. Mount our own cgroup
>>> +  # hierarchy.
>>> +  [ns]$ mount -t cgroup cgroup /tmp/cgroup
>>> +  [ns]$ ls -l /tmp/cgroup
>>> +  total 0
>>> +  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.controllers
>>> +  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.populated
>>> +  -rw-r--r-- 1 root root 0 2014-10-13 09:25 cgroup.procs
>>> +  -rw-r--r-- 1 root root 0 2014-10-13 09:32 cgroup.subtree_control
>>
>> I've patched libvirt-lxc to issue CLONE_NEWCGROUP and not bind mount cgroupfs into a container.
>> But I'm unable to mount cgroupfs within the container, mount(2) is failing with EINVAL.
>> And /proc/self/cgroup still shows the cgroup from outside.
>>
>> ---cut---
>> container:/ # ls /sys/fs/cgroup/
>> container:/ # mount -t cgroup none /sys/fs/cgroup/
> 
> You need to provide "-o __DEVEL__sane_behavior" flag. Inside the
> container, only unified hierarchy can be mounted. So, for now, that
> flag is needed. I will fix the documentation too.
> 
>> mount: wrong fs type, bad option, bad superblock on none,
>>        missing codepage or helper program, or other error
>>
>>        In some cases useful info is found in syslog - try
>>        dmesg | tail or so.
>> container:/ # cat /proc/self/cgroup
>> 8:memory:/machine/test00.libvirt-lxc
>> 7:devices:/machine/test00.libvirt-lxc
>> 6:hugetlb:/
>> 5:cpuset:/machine/test00.libvirt-lxc
>> 4:blkio:/machine/test00.libvirt-lxc
>> 3:cpu,cpuacct:/machine/test00.libvirt-lxc
>> 2:freezer:/machine/test00.libvirt-lxc
>> 1:name=systemd:/user.slice/user-0.slice/session-c2.scope
>> container:/ # ls -la /proc/self/ns
>> total 0
>> dr-x--x--x 2 root root 0 Dec 14 23:02 .
>> dr-xr-xr-x 8 root root 0 Dec 14 23:02 ..
>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 cgroup -> cgroup:[4026532240]
>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 ipc -> ipc:[4026532238]
>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 mnt -> mnt:[4026532235]
>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 net -> net:[4026532242]
>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 pid -> pid:[4026532239]
>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 user -> user:[4026532234]
>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 uts -> uts:[4026532236]
>> container:/ #
>>
>> #host side
>> lxc-os132:~ # ls -la /proc/self/ns
>> total 0
>> dr-x--x--x 2 root root 0 Dec 14 23:56 .
>> dr-xr-xr-x 8 root root 0 Dec 14 23:56 ..
>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 cgroup -> cgroup:[4026531835]
>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 ipc -> ipc:[4026531839]
>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 mnt -> mnt:[4026531840]
>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 net -> net:[4026531957]
>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 pid -> pid:[4026531836]
>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 user -> user:[4026531837]
>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 uts -> uts:[4026531838]
>> ---cut---
>>
>> Any ideas?
>>
> 
> Please try with "-o __DEVEL__sane_behavior" flag to the mount command.

Ohh, this renders the whole patch set useless for me, as systemd needs the "old/default" behavior of cgroups. :-(
I had really hoped that cgroup namespaces would help me run systemd in a sane way within Linux containers.

Thanks,
//richard

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
@ 2015-01-05 22:52                   ` Richard Weinberger
  0 siblings, 0 replies; 384+ messages in thread
From: Richard Weinberger @ 2015-01-05 22:52 UTC (permalink / raw)
  To: Aditya Kali
  Cc: Tejun Heo, Li Zefan, Serge Hallyn, Andy Lutomirski,
	Eric W. Biederman, cgroups mailinglist, linux-kernel, Linux API,
	Ingo Molnar, Linux Containers, Rohit Jnagal, Vivek Goyal

Am 05.01.2015 um 23:48 schrieb Aditya Kali:
> On Sun, Dec 14, 2014 at 3:05 PM, Richard Weinberger <richard@nod.at> wrote:
>> Aditya,
>>
>> I gave your patch set a try but it does not work for me.
>> Maybe you can bring some light into the issues I'm facing.
>> Sadly I still had no time to dig into your code.
>>
>> Am 05.12.2014 um 02:55 schrieb Aditya Kali:
>>> Signed-off-by: Aditya Kali <adityakali@google.com>
>>> ---
>>>  Documentation/cgroups/namespace.txt | 147 ++++++++++++++++++++++++++++++++++++
>>>  1 file changed, 147 insertions(+)
>>>  create mode 100644 Documentation/cgroups/namespace.txt
>>>
>>> diff --git a/Documentation/cgroups/namespace.txt b/Documentation/cgroups/namespace.txt
>>> new file mode 100644
>>> index 0000000..6480379
>>> --- /dev/null
>>> +++ b/Documentation/cgroups/namespace.txt
>>> @@ -0,0 +1,147 @@
>>> +                     CGroup Namespaces
>>> +
>>> +CGroup Namespace provides a mechanism to virtualize the view of the
>>> +/proc/<pid>/cgroup file. The CLONE_NEWCGROUP clone-flag can be used with
>>> +clone() and unshare() syscalls to create a new cgroup namespace.
>>> +The process running inside the cgroup namespace will have its /proc/<pid>/cgroup
>>> +output restricted to cgroupns-root. cgroupns-root is the cgroup of the process
>>> +at the time of creation of the cgroup namespace.
>>> +
>>> +Prior to CGroup Namespace, the /proc/<pid>/cgroup file used to show complete
>>> +path of the cgroup of a process. In a container setup (where a set of cgroups
>>> +and namespaces are intended to isolate processes), the /proc/<pid>/cgroup file
>>> +may leak potential system level information to the isolated processes.
>>> +
>>> +For Example:
>>> +  $ cat /proc/self/cgroup
>>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
>>> +
>>> +The path '/batchjobs/container_id1' can generally be considered as system-data
>>> +and its desirable to not expose it to the isolated process.
>>> +
>>> +CGroup Namespaces can be used to restrict visibility of this path.
>>> +For Example:
>>> +  # Before creating cgroup namespace
>>> +  $ ls -l /proc/self/ns/cgroup
>>> +  lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
>>> +  $ cat /proc/self/cgroup
>>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
>>> +
>>> +  # unshare(CLONE_NEWCGROUP) and exec /bin/bash
>>> +  $ ~/unshare -c
>>> +  [ns]$ ls -l /proc/self/ns/cgroup
>>> +  lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
>>> +  # From within new cgroupns, process sees that its in the root cgroup
>>> +  [ns]$ cat /proc/self/cgroup
>>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
>>> +
>>> +  # From global cgroupns:
>>> +  $ cat /proc/<pid>/cgroup
>>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
>>> +
>>> +  # Unshare cgroupns along with userns and mountns
>>> +  # Following calls unshare(CLONE_NEWCGROUP|CLONE_NEWUSER|CLONE_NEWNS), then
>>> +  # sets up uid/gid map and execs /bin/bash
>>> +  $ ~/unshare -c -u -m
>>
>> This command does not issue CLONE_NEWUSER, -U does.
>>
> I was using a custom unshare binary. But I will update the command
> line to be similar to the one in util-linux.
> 
>>> +  # Originally, we were in /batchjobs/container_id1 cgroup. Mount our own cgroup
>>> +  # hierarchy.
>>> +  [ns]$ mount -t cgroup cgroup /tmp/cgroup
>>> +  [ns]$ ls -l /tmp/cgroup
>>> +  total 0
>>> +  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.controllers
>>> +  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.populated
>>> +  -rw-r--r-- 1 root root 0 2014-10-13 09:25 cgroup.procs
>>> +  -rw-r--r-- 1 root root 0 2014-10-13 09:32 cgroup.subtree_control
>>
>> I've patched libvirt-lxc to issue CLONE_NEWCGROUP and not bind mount cgroupfs into a container.
>> But I'm unable to mount cgroupfs within the container, mount(2) is failing with EINVAL.
>> And /proc/self/cgroup still shows the cgroup from outside.
>>
>> ---cut---
>> container:/ # ls /sys/fs/cgroup/
>> container:/ # mount -t cgroup none /sys/fs/cgroup/
> 
> You need to provide "-o __DEVEL_sane_behavior" flag. Inside the
> container, only unified hierarchy can be mounted. So, for now, that
> flag is needed. I will fix the documentation too.
> 
>> mount: wrong fs type, bad option, bad superblock on none,
>>        missing codepage or helper program, or other error
>>
>>        In some cases useful info is found in syslog - try
>>        dmesg | tail or so.
>> container:/ # cat /proc/self/cgroup
>> 8:memory:/machine/test00.libvirt-lxc
>> 7:devices:/machine/test00.libvirt-lxc
>> 6:hugetlb:/
>> 5:cpuset:/machine/test00.libvirt-lxc
>> 4:blkio:/machine/test00.libvirt-lxc
>> 3:cpu,cpuacct:/machine/test00.libvirt-lxc
>> 2:freezer:/machine/test00.libvirt-lxc
>> 1:name=systemd:/user.slice/user-0.slice/session-c2.scope
>> container:/ # ls -la /proc/self/ns
>> total 0
>> dr-x--x--x 2 root root 0 Dec 14 23:02 .
>> dr-xr-xr-x 8 root root 0 Dec 14 23:02 ..
>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 cgroup -> cgroup:[4026532240]
>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 ipc -> ipc:[4026532238]
>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 mnt -> mnt:[4026532235]
>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 net -> net:[4026532242]
>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 pid -> pid:[4026532239]
>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 user -> user:[4026532234]
>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 uts -> uts:[4026532236]
>> container:/ #
>>
>> #host side
>> lxc-os132:~ # ls -la /proc/self/ns
>> total 0
>> dr-x--x--x 2 root root 0 Dec 14 23:56 .
>> dr-xr-xr-x 8 root root 0 Dec 14 23:56 ..
>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 cgroup -> cgroup:[4026531835]
>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 ipc -> ipc:[4026531839]
>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 mnt -> mnt:[4026531840]
>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 net -> net:[4026531957]
>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 pid -> pid:[4026531836]
>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 user -> user:[4026531837]
>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 uts -> uts:[4026531838]
>> ---cut---
>>
>> Any ideas?
>>
> 
> Please try adding the "-o __DEVEL__sane_behavior" flag to the mount command.

Ohh, this renders the whole patch useless for me, as systemd needs the "old/default" behavior of cgroups. :-(
I really hoped that cgroup namespaces would help me run systemd in a sane way within Linux containers.
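
The two /proc/self/ns listings quoted above differ in the cgroup entry
(cgroup:[4026532240] in the container, cgroup:[4026531835] on the host),
which suggests the new namespace was created even though
/proc/self/cgroup still shows the outside paths. A minimal sketch of
that comparison, assuming Python 3 on Linux; the helper names ns_inode
and read_ns are hypothetical and not part of the patch set:

```python
import os
import re

# Matches the symlink target format shown in the listings,
# e.g. "cgroup:[4026532240]".
_NS_LINK = re.compile(r"^(?P<kind>[a-z_]+):\[(?P<inode>\d+)\]$")


def ns_inode(link_target):
    """Return (kind, inode) parsed from a /proc/<pid>/ns/* symlink target."""
    match = _NS_LINK.match(link_target)
    if match is None:
        raise ValueError("unexpected namespace link: %r" % link_target)
    return match.group("kind"), int(match.group("inode"))


def read_ns(pid="self", kind="cgroup"):
    """Read and parse /proc/<pid>/ns/<kind>; the kernel must expose the entry."""
    return ns_inode(os.readlink("/proc/%s/ns/%s" % (pid, kind)))


# Values copied from the container and host listings in this thread.
container = ns_inode("cgroup:[4026532240]")
host = ns_inode("cgroup:[4026531835]")
print(container != host)  # True: the two shells are in different cgroup namespaces
```

Comparing the parsed inode numbers avoids depending on the exact text
of the link when checking whether two processes share a namespace.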

Thanks,
//richard

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
  2014-12-12  8:54           ` Zefan Li
@ 2015-01-05 22:54               ` Aditya Kali
  -1 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2015-01-05 22:54 UTC (permalink / raw)
  To: Zefan Li
  Cc: Richard Weinberger, Linux API, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andy Lutomirski,
	Ingo Molnar, Eric W. Biederman, Tejun Heo, cgroups mailinglist

Thanks for the review. I have made the suggested fixes. Regarding the
relative path, please see inline.

On Fri, Dec 12, 2014 at 12:54 AM, Zefan Li <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org> wrote:
>> +In its current form, the cgroup namespaces patcheset provides following
>> +behavior:
>> +
>> +(1) The 'cgroupns-root' for a cgroup namespace is the cgroup in which
>> +    the process calling unshare is running.
>> +    For ex. if a process in /batchjobs/container_id1 cgroup calls unshare,
>> +    cgroup /batchjobs/container_id1 becomes the cgroupns-root.
>> +    For the init_cgroup_ns, this is the real root ('/') cgroup
>> +    (identified in code as cgrp_dfl_root.cgrp).
>> +
>> +(2) The cgroupns-root cgroup does not change even if the namespace
>> +    creator process later moves to a different cgroup.
>> +    $ ~/unshare -c # unshare cgroupns in some cgroup
>> +    [ns]$ cat /proc/self/cgroup
>> +    0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
>> +    [ns]$ mkdir sub_cgrp_1
>> +    [ns]$ echo 0 > sub_cgrp_1/cgroup.procs
>> +    [ns]$ cat /proc/self/cgroup
>> +    0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
>> +
>> +(3) Each process gets its CGROUPNS specific view of /proc/<pid>/cgroup
>> +(a) Processes running inside the cgroup namespace will be able to see
>> +    cgroup paths (in /proc/self/cgroup) only inside their root cgroup
>> +    [ns]$ sleep 100000 &  # From within unshared cgroupns
>> +    [1] 7353
>> +    [ns]$ echo 7353 > sub_cgrp_1/cgroup.procs
>> +    [ns]$ cat /proc/7353/cgroup
>> +    0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
>> +
>> +(b) From global cgroupns, the real cgroup path will be visible:
>> +    $ cat /proc/7353/cgroup
>> +    0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1/sub_cgrp_1
>> +
>> +(c) From a sibling cgroupns (cgroupns root-ed at a different cgroup), cgroup
>> +    path relative to its own cgroupns-root will be shown:
>> +    # ns2's cgroupns-root is at '/batchjobs/container_id2'
>> +    [ns2]$ cat /proc/7353/cgroup
>> +    0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/../container_id2/sub_cgrp_1
>
> Should be ../container_id1/sub_cgrp_1 ?
>

Starting with '/' was deliberate.
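
The three views in (3)(a)-(c) can be modelled with a few lines of
Python. This is only a sketch of the behavior described in the quoted
documentation, not the kernel's implementation; the function name
ns_view is hypothetical.

```python
import posixpath


def ns_view(real_path, ns_root):
    """Model the path shown in /proc/<pid>/cgroup for a reader rooted at ns_root."""
    rel = posixpath.relpath(real_path, ns_root)
    # relpath() yields "." when the cgroup is the cgroupns-root itself.
    return "/" if rel == "." else "/" + rel


real = "/batchjobs/container_id1/sub_cgrp_1"
print(ns_view(real, "/batchjobs/container_id1"))  # reader inside the same cgroupns
print(ns_view(real, "/"))                         # reader in the global cgroupns
print(ns_view(real, "/batchjobs/container_id2"))  # reader in the sibling cgroupns
```

Under this model the sibling view of PID 7353 comes out as
/../container_id1/sub_cgrp_1, which agrees with the container_id1 part
of the remark above while keeping the leading '/'.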

>> +
>> +    Note that the relative path always starts with '/' to indicate that its
>> +    relative to the cgroupns-root of the caller.
>
> If a path doesn't start with '/', then it's a relative path, so why make it start with '/'?
>

This is so as not to surprise apps that parse /proc/<pid>/cgroup
files and use the path in them as an absolute path. The existing paths
there always start with '/' right now. Retaining the '/' means paths
generated by userspace continue to work. Does this make sense?
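
A minimal sketch of the kind of userspace parser this is meant to keep
working, assuming Python 3. The names parse_cgroup_line and
outside_ns_root are hypothetical; the sample line is taken from the
documentation quoted in this mail.

```python
def parse_cgroup_line(line):
    """Split one /proc/<pid>/cgroup line into (hierarchy_id, controllers, path)."""
    # Limit to two splits so any ':' inside the path stays in the path.
    hierarchy_id, controllers, path = line.rstrip("\n").split(":", 2)
    names = controllers.split(",") if controllers else []
    return int(hierarchy_id), names, path


def outside_ns_root(path):
    """True when the path climbs above the reader's cgroupns-root via '/..'."""
    return path == "/.." or path.startswith("/../")


line = "0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/../container_id2"
hid, names, path = parse_cgroup_line(line)
print(hid, names[0], path, outside_ns_root(path))
```

Because the path field still begins with '/', a parser written for the
pre-namespace format needs no change; only callers that care about
namespace boundaries have to look for the '/..' prefix.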

>> +
>> +(4) Processes inside a cgroupns can move in-and-out of the cgroupns-root
>> +    (if they have proper access to external cgroups).
>> +    # From inside cgroupns (with cgroupns-root at /batchjobs/container_id1), and
>> +    # assuming that the global hierarchy is still accessible inside cgroupns:
>> +    $ cat /proc/7353/cgroup
>> +    0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
>> +    $ echo 7353 > batchjobs/container_id2/cgroup.procs
>> +    $ cat /proc/7353/cgroup
>> +    0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/../container_id2
>> +
>> +    Note that this kind of setup is not encouraged. A task inside cgroupns
>> +    should only be exposed to its own cgroupns hierarchy. Otherwise it makes
>> +    the virtualization of /proc/<pid>/cgroup less useful.
>> +
>> +(5) Setns to another cgroup namespace is allowed when:
>> +    (a) the process has CAP_SYS_ADMIN in its current userns
>> +    (b) the process has CAP_SYS_ADMIN in the target cgroupns' userns
>> +    No implicit cgroup changes happen with attaching to another cgroupns. It
>> +    is expected that the somone moves the attaching process under the target
>> +    cgroupns-root.
>> +
>
> s/the somone/someone
>
fixed.

>> +(6) When some thread from a multi-threaded process unshares its
>> +    cgroup-namespace, the new cgroupns gets applied to the entire
>> +    process (all the threads). This should be OK since
>> +    unified-hierarchy only allows process-level containerization. So
>> +    all the threads in the process will have the same cgroup.
>> +
>> +(7) The cgroup namespace is alive as long as there is atleast 1
>
> s/atelast/at least
>
fixed.

>> +    process inside it. When the last process exits, the cgroup
>> +    namespace is destroyed. The cgroupns-root and the actual cgroups
>> +    remain though.
>> +
>> +(8) Namespace specific cgroup hierarchy can be mounted by a process running
>> +    inside cgroupns:
>> +    $ mount -t cgroup -o __DEVEL__sane_behavior cgroup $MOUNT_POINT
>> +
>> +    This will mount the unified cgroup hierarchy with cgroupns-root as the
>> +    filesystem root. The process needs CAP_SYS_ADMIN in its userns and mntns.
>> +
>>
>
> --
> To unsubscribe from this list: send the line "unsubscribe cgroups" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Thanks!
-- 
Aditya

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
       [not found]                   ` <54AB15BD.8020007-/L3Ra7n9ekc@public.gmane.org>
@ 2015-01-05 23:53                     ` Eric W. Biederman
  0 siblings, 0 replies; 384+ messages in thread
From: Eric W. Biederman @ 2015-01-05 23:53 UTC (permalink / raw)
  To: Richard Weinberger
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andy Lutomirski, Tejun Heo,
	cgroups mailinglist, Ingo Molnar

Richard Weinberger <richard-/L3Ra7n9ekc@public.gmane.org> writes:

> Am 05.01.2015 um 23:48 schrieb Aditya Kali:
>> On Sun, Dec 14, 2014 at 3:05 PM, Richard Weinberger <richard-/L3Ra7n9ekc@public.gmane.org> wrote:
>>> Aditya,
>>>
>>> I gave your patch set a try but it does not work for me.
>>> Maybe you can bring some light into the issues I'm facing.
>>> Sadly I still had no time to dig into your code.
>>>
>>> Am 05.12.2014 um 02:55 schrieb Aditya Kali:
>>>> Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
>>>> ---
>>>>  Documentation/cgroups/namespace.txt | 147 ++++++++++++++++++++++++++++++++++++
>>>>  1 file changed, 147 insertions(+)
>>>>  create mode 100644 Documentation/cgroups/namespace.txt
>>>>
>>>> diff --git a/Documentation/cgroups/namespace.txt b/Documentation/cgroups/namespace.txt
>>>> new file mode 100644
>>>> index 0000000..6480379
>>>> --- /dev/null
>>>> +++ b/Documentation/cgroups/namespace.txt
>>>> @@ -0,0 +1,147 @@
>>>> +                     CGroup Namespaces
>>>> +
>>>> +CGroup Namespace provides a mechanism to virtualize the view of the
>>>> +/proc/<pid>/cgroup file. The CLONE_NEWCGROUP clone-flag can be used with
>>>> +clone() and unshare() syscalls to create a new cgroup namespace.
>>>> +The process running inside the cgroup namespace will have its /proc/<pid>/cgroup
>>>> +output restricted to cgroupns-root. cgroupns-root is the cgroup of the process
>>>> +at the time of creation of the cgroup namespace.
>>>> +
>>>> +Prior to CGroup Namespace, the /proc/<pid>/cgroup file used to show complete
>>>> +path of the cgroup of a process. In a container setup (where a set of cgroups
>>>> +and namespaces are intended to isolate processes), the /proc/<pid>/cgroup file
>>>> +may leak potential system level information to the isolated processes.
>>>> +
>>>> +For Example:
>>>> +  $ cat /proc/self/cgroup
>>>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
>>>> +
>>>> +The path '/batchjobs/container_id1' can generally be considered as system-data
>>>> +and its desirable to not expose it to the isolated process.
>>>> +
>>>> +CGroup Namespaces can be used to restrict visibility of this path.
>>>> +For Example:
>>>> +  # Before creating cgroup namespace
>>>> +  $ ls -l /proc/self/ns/cgroup
>>>> +  lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
>>>> +  $ cat /proc/self/cgroup
>>>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
>>>> +
>>>> +  # unshare(CLONE_NEWCGROUP) and exec /bin/bash
>>>> +  $ ~/unshare -c
>>>> +  [ns]$ ls -l /proc/self/ns/cgroup
>>>> +  lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
>>>> +  # From within new cgroupns, process sees that its in the root cgroup
>>>> +  [ns]$ cat /proc/self/cgroup
>>>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
>>>> +
>>>> +  # From global cgroupns:
>>>> +  $ cat /proc/<pid>/cgroup
>>>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
>>>> +
>>>> +  # Unshare cgroupns along with userns and mountns
>>>> +  # Following calls unshare(CLONE_NEWCGROUP|CLONE_NEWUSER|CLONE_NEWNS), then
>>>> +  # sets up uid/gid map and execs /bin/bash
>>>> +  $ ~/unshare -c -u -m
>>>
>>> This command does not issue CLONE_NEWUSER, -U does.
>>>
>> I was using a custom unshare binary. But I will update the command
>> line to be similar to the one in util-linux.
>> 
>>>> +  # Originally, we were in /batchjobs/container_id1 cgroup. Mount our own cgroup
>>>> +  # hierarchy.
>>>> +  [ns]$ mount -t cgroup cgroup /tmp/cgroup
>>>> +  [ns]$ ls -l /tmp/cgroup
>>>> +  total 0
>>>> +  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.controllers
>>>> +  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.populated
>>>> +  -rw-r--r-- 1 root root 0 2014-10-13 09:25 cgroup.procs
>>>> +  -rw-r--r-- 1 root root 0 2014-10-13 09:32 cgroup.subtree_control
>>>
>>> I've patched libvirt-lxc to issue CLONE_NEWCGROUP and not bind mount cgroupfs into a container.
>>> But I'm unable to mount cgroupfs within the container, mount(2) is failing with EINVAL.
>>> And /proc/self/cgroup still shows the cgroup from outside.
>>>
>>> ---cut---
>>> container:/ # ls /sys/fs/cgroup/
>>> container:/ # mount -t cgroup none /sys/fs/cgroup/
>> 
>> You need to provide the "-o __DEVEL__sane_behavior" flag. Inside the
>> container, only unified hierarchy can be mounted. So, for now, that
>> flag is needed. I will fix the documentation too.
>> 
>>> mount: wrong fs type, bad option, bad superblock on none,
>>>        missing codepage or helper program, or other error
>>>
>>>        In some cases useful info is found in syslog - try
>>>        dmesg | tail or so.
>>> container:/ # cat /proc/self/cgroup
>>> 8:memory:/machine/test00.libvirt-lxc
>>> 7:devices:/machine/test00.libvirt-lxc
>>> 6:hugetlb:/
>>> 5:cpuset:/machine/test00.libvirt-lxc
>>> 4:blkio:/machine/test00.libvirt-lxc
>>> 3:cpu,cpuacct:/machine/test00.libvirt-lxc
>>> 2:freezer:/machine/test00.libvirt-lxc
>>> 1:name=systemd:/user.slice/user-0.slice/session-c2.scope
>>> container:/ # ls -la /proc/self/ns
>>> total 0
>>> dr-x--x--x 2 root root 0 Dec 14 23:02 .
>>> dr-xr-xr-x 8 root root 0 Dec 14 23:02 ..
>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 cgroup -> cgroup:[4026532240]
>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 ipc -> ipc:[4026532238]
>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 mnt -> mnt:[4026532235]
>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 net -> net:[4026532242]
>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 pid -> pid:[4026532239]
>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 user -> user:[4026532234]
>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 uts -> uts:[4026532236]
>>> container:/ #
>>>
>>> #host side
>>> lxc-os132:~ # ls -la /proc/self/ns
>>> total 0
>>> dr-x--x--x 2 root root 0 Dec 14 23:56 .
>>> dr-xr-xr-x 8 root root 0 Dec 14 23:56 ..
>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 cgroup -> cgroup:[4026531835]
>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 ipc -> ipc:[4026531839]
>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 mnt -> mnt:[4026531840]
>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 net -> net:[4026531957]
>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 pid -> pid:[4026531836]
>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 user -> user:[4026531837]
>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 uts -> uts:[4026531838]
>>> ---cut---
>>>
>>> Any ideas?
>>>
>> 
>> Please try adding the "-o __DEVEL__sane_behavior" flag to the mount command.
>
> Ohh, this renders the whole patch useless for me as systemd needs the "old/default" behavior of cgroups. :-(
> I really hoped that cgroup namespaces will help me running systemd in a sane way within Linux containers.

Ugh.  It sounds like there is a real mess here.  At the very least there
is a misunderstanding.

I have a memory that systemd should have been able to use a unified
hierarchy, as you could still mount the different controllers
independently (they just use the same directory structure on each
mount).

That said, from a practical standpoint I am not certain that a cgroup
namespace is viable if it cannot support the behavior of cgroupfs
that everyone is using.

Eric

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
@ 2015-01-05 23:53                     ` Eric W. Biederman
  0 siblings, 0 replies; 384+ messages in thread
From: Eric W. Biederman @ 2015-01-05 23:53 UTC (permalink / raw)
  To: Richard Weinberger
  Cc: Aditya Kali, Tejun Heo, Li Zefan, Serge Hallyn, Andy Lutomirski,
	cgroups mailinglist, linux-kernel@vger.kernel.org, Linux API,
	Ingo Molnar, Linux Containers, Rohit Jnagal, Vivek Goyal

Richard Weinberger <richard-/L3Ra7n9ekc@public.gmane.org> writes:

> Am 05.01.2015 um 23:48 schrieb Aditya Kali:
>> On Sun, Dec 14, 2014 at 3:05 PM, Richard Weinberger <richard-/L3Ra7n9ekc@public.gmane.org> wrote:
>>> Aditya,
>>>
>>> I gave your patch set a try but it does not work for me.
>>> Maybe you can bring some light into the issues I'm facing.
>>> Sadly I still had no time to dig into your code.
>>>
>>> Am 05.12.2014 um 02:55 schrieb Aditya Kali:
>>>> Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
>>>> ---
>>>>  Documentation/cgroups/namespace.txt | 147 ++++++++++++++++++++++++++++++++++++
>>>>  1 file changed, 147 insertions(+)
>>>>  create mode 100644 Documentation/cgroups/namespace.txt
>>>>
>>>> diff --git a/Documentation/cgroups/namespace.txt b/Documentation/cgroups/namespace.txt
>>>> new file mode 100644
>>>> index 0000000..6480379
>>>> --- /dev/null
>>>> +++ b/Documentation/cgroups/namespace.txt
>>>> @@ -0,0 +1,147 @@
>>>> +                     CGroup Namespaces
>>>> +
>>>> +CGroup Namespace provides a mechanism to virtualize the view of the
>>>> +/proc/<pid>/cgroup file. The CLONE_NEWCGROUP clone-flag can be used with
>>>> +clone() and unshare() syscalls to create a new cgroup namespace.
>>>> +The process running inside the cgroup namespace will have its /proc/<pid>/cgroup
>>>> +output restricted to cgroupns-root. cgroupns-root is the cgroup of the process
>>>> +at the time of creation of the cgroup namespace.
>>>> +
>>>> +Prior to CGroup Namespace, the /proc/<pid>/cgroup file used to show complete
>>>> +path of the cgroup of a process. In a container setup (where a set of cgroups
>>>> +and namespaces are intended to isolate processes), the /proc/<pid>/cgroup file
>>>> +may leak potential system level information to the isolated processes.
>>>> +
>>>> +For Example:
>>>> +  $ cat /proc/self/cgroup
>>>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
>>>> +
>>>> +The path '/batchjobs/container_id1' can generally be considered as system-data
>>>> +and its desirable to not expose it to the isolated process.
>>>> +
>>>> +CGroup Namespaces can be used to restrict visibility of this path.
>>>> +For Example:
>>>> +  # Before creating cgroup namespace
>>>> +  $ ls -l /proc/self/ns/cgroup
>>>> +  lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
>>>> +  $ cat /proc/self/cgroup
>>>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
>>>> +
>>>> +  # unshare(CLONE_NEWCGROUP) and exec /bin/bash
>>>> +  $ ~/unshare -c
>>>> +  [ns]$ ls -l /proc/self/ns/cgroup
>>>> +  lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
>>>> +  # From within new cgroupns, process sees that its in the root cgroup
>>>> +  [ns]$ cat /proc/self/cgroup
>>>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
>>>> +
>>>> +  # From global cgroupns:
>>>> +  $ cat /proc/<pid>/cgroup
>>>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
>>>> +
>>>> +  # Unshare cgroupns along with userns and mountns
>>>> +  # Following calls unshare(CLONE_NEWCGROUP|CLONE_NEWUSER|CLONE_NEWNS), then
>>>> +  # sets up uid/gid map and execs /bin/bash
>>>> +  $ ~/unshare -c -u -m
>>>
>>> This command does not issue CLONE_NEWUSER, -U does.
>>>
>> I was using a custom unshare binary. But I will update the command
>> line to be similar to the one in util-linux.
>> 
>>>> +  # Originally, we were in /batchjobs/container_id1 cgroup. Mount our own cgroup
>>>> +  # hierarchy.
>>>> +  [ns]$ mount -t cgroup cgroup /tmp/cgroup
>>>> +  [ns]$ ls -l /tmp/cgroup
>>>> +  total 0
>>>> +  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.controllers
>>>> +  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.populated
>>>> +  -rw-r--r-- 1 root root 0 2014-10-13 09:25 cgroup.procs
>>>> +  -rw-r--r-- 1 root root 0 2014-10-13 09:32 cgroup.subtree_control
>>>
>>> I've patched libvirt-lxc to issue CLONE_NEWCGROUP and not bind mount cgroupfs into a container.
>>> But I'm unable to mount cgroupfs within the container, mount(2) is failing with EINVAL.
>>> And /proc/self/cgroup still shows the cgroup from outside.
>>>
>>> ---cut---
>>> container:/ # ls /sys/fs/cgroup/
>>> container:/ # mount -t cgroup none /sys/fs/cgroup/
>> 
>> You need to provide "-o __DEVEL_sane_behavior" flag. Inside the
>> container, only unified hierarchy can be mounted. So, for now, that
>> flag is needed. I will fix the documentation too.
>> 
>>> mount: wrong fs type, bad option, bad superblock on none,
>>>        missing codepage or helper program, or other error
>>>
>>>        In some cases useful info is found in syslog - try
>>>        dmesg | tail or so.
>>> container:/ # cat /proc/self/cgroup
>>> 8:memory:/machine/test00.libvirt-lxc
>>> 7:devices:/machine/test00.libvirt-lxc
>>> 6:hugetlb:/
>>> 5:cpuset:/machine/test00.libvirt-lxc
>>> 4:blkio:/machine/test00.libvirt-lxc
>>> 3:cpu,cpuacct:/machine/test00.libvirt-lxc
>>> 2:freezer:/machine/test00.libvirt-lxc
>>> 1:name=systemd:/user.slice/user-0.slice/session-c2.scope
>>> container:/ # ls -la /proc/self/ns
>>> total 0
>>> dr-x--x--x 2 root root 0 Dec 14 23:02 .
>>> dr-xr-xr-x 8 root root 0 Dec 14 23:02 ..
>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 cgroup -> cgroup:[4026532240]
>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 ipc -> ipc:[4026532238]
>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 mnt -> mnt:[4026532235]
>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 net -> net:[4026532242]
>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 pid -> pid:[4026532239]
>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 user -> user:[4026532234]
>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 uts -> uts:[4026532236]
>>> container:/ #
>>>
>>> #host side
>>> lxc-os132:~ # ls -la /proc/self/ns
>>> total 0
>>> dr-x--x--x 2 root root 0 Dec 14 23:56 .
>>> dr-xr-xr-x 8 root root 0 Dec 14 23:56 ..
>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 cgroup -> cgroup:[4026531835]
>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 ipc -> ipc:[4026531839]
>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 mnt -> mnt:[4026531840]
>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 net -> net:[4026531957]
>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 pid -> pid:[4026531836]
>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 user -> user:[4026531837]
>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 uts -> uts:[4026531838]
>>> ---cut---
>>>
>>> Any ideas?
>>>
>> 
>> Please try with "-o __DEVEL_sane_behavior" flag to the mount command.
>
> Ohh, this renders the whole patch useless for me as systemd needs the "old/default" behavior of cgroups. :-(
> I really hoped that cgroup namespaces will help me running systemd in a sane way within Linux containers.

Ugh.  It sounds like there is a real mess here.  At the very least there
is a misunderstanding.

My recollection is that systemd should be able to use a unified
hierarchy, since you can still mount the different controllers
independently (they just use the same directory structure on each
mount).

That said, from a practical standpoint I am not certain that a cgroup
namespace is viable if it cannot support the cgroupfs behavior
that everyone is using.

Eric


* Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
  2015-01-05 23:53                     ` Eric W. Biederman
@ 2015-01-06  0:07                         ` Richard Weinberger
  -1 siblings, 0 replies; 384+ messages in thread
From: Richard Weinberger @ 2015-01-06  0:07 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andy Lutomirski, Tejun Heo,
	cgroups mailinglist, Ingo Molnar

On 06.01.2015 at 00:53, Eric W. Biederman wrote:
> Richard Weinberger <richard-/L3Ra7n9ekc@public.gmane.org> writes:
> 
>> On 05.01.2015 at 23:48, Aditya Kali wrote:
>>> On Sun, Dec 14, 2014 at 3:05 PM, Richard Weinberger <richard-/L3Ra7n9ekc@public.gmane.org> wrote:
>>>> Aditya,
>>>>
>>>> I gave your patch set a try but it does not work for me.
>>>> Maybe you can bring some light into the issues I'm facing.
>>>> Sadly I still had no time to dig into your code.
>>>>
>>>> On 05.12.2014 at 02:55, Aditya Kali wrote:
>>>>> Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
>>>>> ---
>>>>>  Documentation/cgroups/namespace.txt | 147 ++++++++++++++++++++++++++++++++++++
>>>>>  1 file changed, 147 insertions(+)
>>>>>  create mode 100644 Documentation/cgroups/namespace.txt
>>>>>
>>>>> diff --git a/Documentation/cgroups/namespace.txt b/Documentation/cgroups/namespace.txt
>>>>> new file mode 100644
>>>>> index 0000000..6480379
>>>>> --- /dev/null
>>>>> +++ b/Documentation/cgroups/namespace.txt
>>>>> @@ -0,0 +1,147 @@
>>>>> +                     CGroup Namespaces
>>>>> +
>>>>> +CGroup Namespace provides a mechanism to virtualize the view of the
>>>>> +/proc/<pid>/cgroup file. The CLONE_NEWCGROUP clone-flag can be used with
>>>>> +clone() and unshare() syscalls to create a new cgroup namespace.
>>>>> +The process running inside the cgroup namespace will have its /proc/<pid>/cgroup
>>>>> +output restricted to cgroupns-root. cgroupns-root is the cgroup of the process
>>>>> +at the time of creation of the cgroup namespace.
>>>>> +
>>>>> +Prior to CGroup Namespace, the /proc/<pid>/cgroup file used to show complete
>>>>> +path of the cgroup of a process. In a container setup (where a set of cgroups
>>>>> +and namespaces are intended to isolate processes), the /proc/<pid>/cgroup file
>>>>> +may leak potential system level information to the isolated processes.
>>>>> +
>>>>> +For Example:
>>>>> +  $ cat /proc/self/cgroup
>>>>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
>>>>> +
>>>>> +The path '/batchjobs/container_id1' can generally be considered as system-data
>>>>> +and its desirable to not expose it to the isolated process.
>>>>> +
>>>>> +CGroup Namespaces can be used to restrict visibility of this path.
>>>>> +For Example:
>>>>> +  # Before creating cgroup namespace
>>>>> +  $ ls -l /proc/self/ns/cgroup
>>>>> +  lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
>>>>> +  $ cat /proc/self/cgroup
>>>>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
>>>>> +
>>>>> +  # unshare(CLONE_NEWCGROUP) and exec /bin/bash
>>>>> +  $ ~/unshare -c
>>>>> +  [ns]$ ls -l /proc/self/ns/cgroup
>>>>> +  lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
>>>>> +  # From within new cgroupns, process sees that its in the root cgroup
>>>>> +  [ns]$ cat /proc/self/cgroup
>>>>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
>>>>> +
>>>>> +  # From global cgroupns:
>>>>> +  $ cat /proc/<pid>/cgroup
>>>>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
>>>>> +
>>>>> +  # Unshare cgroupns along with userns and mountns
>>>>> +  # Following calls unshare(CLONE_NEWCGROUP|CLONE_NEWUSER|CLONE_NEWNS), then
>>>>> +  # sets up uid/gid map and execs /bin/bash
>>>>> +  $ ~/unshare -c -u -m
>>>>
>>>> This command does not issue CLONE_NEWUSER, -U does.
>>>>
>>> I was using a custom unshare binary. But I will update the command
>>> line to be similar to the one in util-linux.
>>>
>>>>> +  # Originally, we were in /batchjobs/container_id1 cgroup. Mount our own cgroup
>>>>> +  # hierarchy.
>>>>> +  [ns]$ mount -t cgroup cgroup /tmp/cgroup
>>>>> +  [ns]$ ls -l /tmp/cgroup
>>>>> +  total 0
>>>>> +  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.controllers
>>>>> +  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.populated
>>>>> +  -rw-r--r-- 1 root root 0 2014-10-13 09:25 cgroup.procs
>>>>> +  -rw-r--r-- 1 root root 0 2014-10-13 09:32 cgroup.subtree_control
>>>>
>>>> I've patched libvirt-lxc to issue CLONE_NEWCGROUP and not bind mount cgroupfs into a container.
>>>> But I'm unable to mount cgroupfs within the container, mount(2) is failing with EINVAL.
>>>> And /proc/self/cgroup still shows the cgroup from outside.
>>>>
>>>> ---cut---
>>>> container:/ # ls /sys/fs/cgroup/
>>>> container:/ # mount -t cgroup none /sys/fs/cgroup/
>>>
>>> You need to provide "-o __DEVEL_sane_behavior" flag. Inside the
>>> container, only unified hierarchy can be mounted. So, for now, that
>>> flag is needed. I will fix the documentation too.
>>>
>>>> mount: wrong fs type, bad option, bad superblock on none,
>>>>        missing codepage or helper program, or other error
>>>>
>>>>        In some cases useful info is found in syslog - try
>>>>        dmesg | tail or so.
>>>> container:/ # cat /proc/self/cgroup
>>>> 8:memory:/machine/test00.libvirt-lxc
>>>> 7:devices:/machine/test00.libvirt-lxc
>>>> 6:hugetlb:/
>>>> 5:cpuset:/machine/test00.libvirt-lxc
>>>> 4:blkio:/machine/test00.libvirt-lxc
>>>> 3:cpu,cpuacct:/machine/test00.libvirt-lxc
>>>> 2:freezer:/machine/test00.libvirt-lxc
>>>> 1:name=systemd:/user.slice/user-0.slice/session-c2.scope
>>>> container:/ # ls -la /proc/self/ns
>>>> total 0
>>>> dr-x--x--x 2 root root 0 Dec 14 23:02 .
>>>> dr-xr-xr-x 8 root root 0 Dec 14 23:02 ..
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 cgroup -> cgroup:[4026532240]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 ipc -> ipc:[4026532238]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 mnt -> mnt:[4026532235]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 net -> net:[4026532242]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 pid -> pid:[4026532239]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 user -> user:[4026532234]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 uts -> uts:[4026532236]
>>>> container:/ #
>>>>
>>>> #host side
>>>> lxc-os132:~ # ls -la /proc/self/ns
>>>> total 0
>>>> dr-x--x--x 2 root root 0 Dec 14 23:56 .
>>>> dr-xr-xr-x 8 root root 0 Dec 14 23:56 ..
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 cgroup -> cgroup:[4026531835]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 ipc -> ipc:[4026531839]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 mnt -> mnt:[4026531840]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 net -> net:[4026531957]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 pid -> pid:[4026531836]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 user -> user:[4026531837]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 uts -> uts:[4026531838]
>>>> ---cut---
>>>>
>>>> Any ideas?
>>>>
>>>
>>> Please try with "-o __DEVEL_sane_behavior" flag to the mount command.
>>
>> Ohh, this renders the whole patch useless for me as systemd needs the "old/default" behavior of cgroups. :-(
>> I really hoped that cgroup namespaces will help me running systemd in a sane way within Linux containers.
> 
> Ugh.  It sounds like there is a real mess here.  At the very least there
> is misunderstanding.
> 
> I have a memory that systemd should have been able to use a unified
> hierarchy.  As you could still mount the different controllers
> independently (they just use the same directory structure on each
> mount).

Luckily the systemd folks want to move to the unified hierarchy, but as of now that does not work.
Please see this mail from Lennart:
https://www.redhat.com/archives/libvir-list/2014-November/msg01090.html

Maybe the porting is easy. I don't know.
I have not had time to look into that yet.

> That said from a practical standpoint I am not certain that a cgroup
> namespace is viable if it can not support the behavior of cgroupsfs
> that everyone is using.

Yep.

systemd *really* wants to own cgroupfs, so it has to mount it within the container.
Currently libvirt does nasty hacks using bind mounts, which are also problematic.
My hope was that with cgroup namespaces I could simply trick systemd and give it
a cgroupfs to mess with.

Thanks,
//richard


* Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
@ 2015-01-06  0:07                         ` Richard Weinberger
  0 siblings, 0 replies; 384+ messages in thread
From: Richard Weinberger @ 2015-01-06  0:07 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Aditya Kali, Tejun Heo, Li Zefan, Serge Hallyn, Andy Lutomirski,
	cgroups mailinglist, linux-kernel, Linux API, Ingo Molnar,
	Linux Containers, Rohit Jnagal, Vivek Goyal

On 06.01.2015 at 00:53, Eric W. Biederman wrote:
> Richard Weinberger <richard@nod.at> writes:
> 
>> On 05.01.2015 at 23:48, Aditya Kali wrote:
>>> On Sun, Dec 14, 2014 at 3:05 PM, Richard Weinberger <richard@nod.at> wrote:
>>>> Aditya,
>>>>
>>>> I gave your patch set a try but it does not work for me.
>>>> Maybe you can bring some light into the issues I'm facing.
>>>> Sadly I still had no time to dig into your code.
>>>>
>>>> On 05.12.2014 at 02:55, Aditya Kali wrote:
>>>>> Signed-off-by: Aditya Kali <adityakali@google.com>
>>>>> ---
>>>>>  Documentation/cgroups/namespace.txt | 147 ++++++++++++++++++++++++++++++++++++
>>>>>  1 file changed, 147 insertions(+)
>>>>>  create mode 100644 Documentation/cgroups/namespace.txt
>>>>>
>>>>> diff --git a/Documentation/cgroups/namespace.txt b/Documentation/cgroups/namespace.txt
>>>>> new file mode 100644
>>>>> index 0000000..6480379
>>>>> --- /dev/null
>>>>> +++ b/Documentation/cgroups/namespace.txt
>>>>> @@ -0,0 +1,147 @@
>>>>> +                     CGroup Namespaces
>>>>> +
>>>>> +CGroup Namespace provides a mechanism to virtualize the view of the
>>>>> +/proc/<pid>/cgroup file. The CLONE_NEWCGROUP clone-flag can be used with
>>>>> +clone() and unshare() syscalls to create a new cgroup namespace.
>>>>> +The process running inside the cgroup namespace will have its /proc/<pid>/cgroup
>>>>> +output restricted to cgroupns-root. cgroupns-root is the cgroup of the process
>>>>> +at the time of creation of the cgroup namespace.
>>>>> +
>>>>> +Prior to CGroup Namespace, the /proc/<pid>/cgroup file used to show complete
>>>>> +path of the cgroup of a process. In a container setup (where a set of cgroups
>>>>> +and namespaces are intended to isolate processes), the /proc/<pid>/cgroup file
>>>>> +may leak potential system level information to the isolated processes.
>>>>> +
>>>>> +For Example:
>>>>> +  $ cat /proc/self/cgroup
>>>>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
>>>>> +
>>>>> +The path '/batchjobs/container_id1' can generally be considered as system-data
>>>>> +and its desirable to not expose it to the isolated process.
>>>>> +
>>>>> +CGroup Namespaces can be used to restrict visibility of this path.
>>>>> +For Example:
>>>>> +  # Before creating cgroup namespace
>>>>> +  $ ls -l /proc/self/ns/cgroup
>>>>> +  lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
>>>>> +  $ cat /proc/self/cgroup
>>>>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
>>>>> +
>>>>> +  # unshare(CLONE_NEWCGROUP) and exec /bin/bash
>>>>> +  $ ~/unshare -c
>>>>> +  [ns]$ ls -l /proc/self/ns/cgroup
>>>>> +  lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
>>>>> +  # From within new cgroupns, process sees that its in the root cgroup
>>>>> +  [ns]$ cat /proc/self/cgroup
>>>>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
>>>>> +
>>>>> +  # From global cgroupns:
>>>>> +  $ cat /proc/<pid>/cgroup
>>>>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
>>>>> +
>>>>> +  # Unshare cgroupns along with userns and mountns
>>>>> +  # Following calls unshare(CLONE_NEWCGROUP|CLONE_NEWUSER|CLONE_NEWNS), then
>>>>> +  # sets up uid/gid map and execs /bin/bash
>>>>> +  $ ~/unshare -c -u -m
>>>>
>>>> This command does not issue CLONE_NEWUSER, -U does.
>>>>
>>> I was using a custom unshare binary. But I will update the command
>>> line to be similar to the one in util-linux.
>>>
>>>>> +  # Originally, we were in /batchjobs/container_id1 cgroup. Mount our own cgroup
>>>>> +  # hierarchy.
>>>>> +  [ns]$ mount -t cgroup cgroup /tmp/cgroup
>>>>> +  [ns]$ ls -l /tmp/cgroup
>>>>> +  total 0
>>>>> +  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.controllers
>>>>> +  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.populated
>>>>> +  -rw-r--r-- 1 root root 0 2014-10-13 09:25 cgroup.procs
>>>>> +  -rw-r--r-- 1 root root 0 2014-10-13 09:32 cgroup.subtree_control
>>>>
>>>> I've patched libvirt-lxc to issue CLONE_NEWCGROUP and not bind mount cgroupfs into a container.
>>>> But I'm unable to mount cgroupfs within the container, mount(2) is failing with EINVAL.
>>>> And /proc/self/cgroup still shows the cgroup from outside.
>>>>
>>>> ---cut---
>>>> container:/ # ls /sys/fs/cgroup/
>>>> container:/ # mount -t cgroup none /sys/fs/cgroup/
>>>
>>> You need to provide "-o __DEVEL_sane_behavior" flag. Inside the
>>> container, only unified hierarchy can be mounted. So, for now, that
>>> flag is needed. I will fix the documentation too.
>>>
>>>> mount: wrong fs type, bad option, bad superblock on none,
>>>>        missing codepage or helper program, or other error
>>>>
>>>>        In some cases useful info is found in syslog - try
>>>>        dmesg | tail or so.
>>>> container:/ # cat /proc/self/cgroup
>>>> 8:memory:/machine/test00.libvirt-lxc
>>>> 7:devices:/machine/test00.libvirt-lxc
>>>> 6:hugetlb:/
>>>> 5:cpuset:/machine/test00.libvirt-lxc
>>>> 4:blkio:/machine/test00.libvirt-lxc
>>>> 3:cpu,cpuacct:/machine/test00.libvirt-lxc
>>>> 2:freezer:/machine/test00.libvirt-lxc
>>>> 1:name=systemd:/user.slice/user-0.slice/session-c2.scope
>>>> container:/ # ls -la /proc/self/ns
>>>> total 0
>>>> dr-x--x--x 2 root root 0 Dec 14 23:02 .
>>>> dr-xr-xr-x 8 root root 0 Dec 14 23:02 ..
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 cgroup -> cgroup:[4026532240]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 ipc -> ipc:[4026532238]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 mnt -> mnt:[4026532235]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 net -> net:[4026532242]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 pid -> pid:[4026532239]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 user -> user:[4026532234]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 uts -> uts:[4026532236]
>>>> container:/ #
>>>>
>>>> #host side
>>>> lxc-os132:~ # ls -la /proc/self/ns
>>>> total 0
>>>> dr-x--x--x 2 root root 0 Dec 14 23:56 .
>>>> dr-xr-xr-x 8 root root 0 Dec 14 23:56 ..
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 cgroup -> cgroup:[4026531835]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 ipc -> ipc:[4026531839]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 mnt -> mnt:[4026531840]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 net -> net:[4026531957]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 pid -> pid:[4026531836]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 user -> user:[4026531837]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 uts -> uts:[4026531838]
>>>> ---cut---
>>>>
>>>> Any ideas?
>>>>
>>>
>>> Please try with "-o __DEVEL_sane_behavior" flag to the mount command.
>>
>> Ohh, this renders the whole patch useless for me as systemd needs the "old/default" behavior of cgroups. :-(
>> I really hoped that cgroup namespaces will help me running systemd in a sane way within Linux containers.
> 
> Ugh.  It sounds like there is a real mess here.  At the very least there
> is misunderstanding.
> 
> I have a memory that systemd should have been able to use a unified
> hierarchy.  As you could still mount the different controllers
> independently (they just use the same directory structure on each
> mount).

Luckily the systemd folks want to move to the unified hierarchy, but as of now that does not work.
Please see this mail from Lennart:
https://www.redhat.com/archives/libvir-list/2014-November/msg01090.html

Maybe the porting is easy. I don't know.
I have not had time to look into that yet.

> That said from a practical standpoint I am not certain that a cgroup
> namespace is viable if it can not support the behavior of cgroupsfs
> that everyone is using.

Yep.

systemd *really* wants to own cgroupfs, so it has to mount it within the container.
Currently libvirt does nasty hacks using bind mounts, which are also problematic.
My hope was that with cgroup namespaces I could simply trick systemd and give it
a cgroupfs to mess with.

Thanks,
//richard


* Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
       [not found]                     ` <87lhlgpyxk.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  2015-01-06  0:07                         ` Richard Weinberger
@ 2015-01-06  0:10                       ` Aditya Kali
  1 sibling, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2015-01-06  0:10 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Richard Weinberger, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andy Lutomirski, Linux API,
	Tejun Heo, cgroups mailinglist, Ingo Molnar

On Mon, Jan 5, 2015 at 3:53 PM, Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
> Richard Weinberger <richard-/L3Ra7n9ekc@public.gmane.org> writes:
>
>> On 05.01.2015 at 23:48, Aditya Kali wrote:
>>> On Sun, Dec 14, 2014 at 3:05 PM, Richard Weinberger <richard-/L3Ra7n9ekc@public.gmane.org> wrote:
>>>> Aditya,
>>>>
>>>> I gave your patch set a try but it does not work for me.
>>>> Maybe you can bring some light into the issues I'm facing.
>>>> Sadly I still had no time to dig into your code.
>>>>
>>>> On 05.12.2014 at 02:55, Aditya Kali wrote:
>>>>> Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
>>>>> ---
>>>>>  Documentation/cgroups/namespace.txt | 147 ++++++++++++++++++++++++++++++++++++
>>>>>  1 file changed, 147 insertions(+)
>>>>>  create mode 100644 Documentation/cgroups/namespace.txt
>>>>>
>>>>> diff --git a/Documentation/cgroups/namespace.txt b/Documentation/cgroups/namespace.txt
>>>>> new file mode 100644
>>>>> index 0000000..6480379
>>>>> --- /dev/null
>>>>> +++ b/Documentation/cgroups/namespace.txt
>>>>> @@ -0,0 +1,147 @@
>>>>> +                     CGroup Namespaces
>>>>> +
>>>>> +CGroup Namespace provides a mechanism to virtualize the view of the
>>>>> +/proc/<pid>/cgroup file. The CLONE_NEWCGROUP clone-flag can be used with
>>>>> +clone() and unshare() syscalls to create a new cgroup namespace.
>>>>> +The process running inside the cgroup namespace will have its /proc/<pid>/cgroup
>>>>> +output restricted to cgroupns-root. cgroupns-root is the cgroup of the process
>>>>> +at the time of creation of the cgroup namespace.
>>>>> +
>>>>> +Prior to CGroup Namespace, the /proc/<pid>/cgroup file used to show complete
>>>>> +path of the cgroup of a process. In a container setup (where a set of cgroups
>>>>> +and namespaces are intended to isolate processes), the /proc/<pid>/cgroup file
>>>>> +may leak potential system level information to the isolated processes.
>>>>> +
>>>>> +For Example:
>>>>> +  $ cat /proc/self/cgroup
>>>>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
>>>>> +
>>>>> +The path '/batchjobs/container_id1' can generally be considered as system-data
>>>>> +and its desirable to not expose it to the isolated process.
>>>>> +
>>>>> +CGroup Namespaces can be used to restrict visibility of this path.
>>>>> +For Example:
>>>>> +  # Before creating cgroup namespace
>>>>> +  $ ls -l /proc/self/ns/cgroup
>>>>> +  lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
>>>>> +  $ cat /proc/self/cgroup
>>>>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
>>>>> +
>>>>> +  # unshare(CLONE_NEWCGROUP) and exec /bin/bash
>>>>> +  $ ~/unshare -c
>>>>> +  [ns]$ ls -l /proc/self/ns/cgroup
>>>>> +  lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
>>>>> +  # From within new cgroupns, process sees that its in the root cgroup
>>>>> +  [ns]$ cat /proc/self/cgroup
>>>>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
>>>>> +
>>>>> +  # From global cgroupns:
>>>>> +  $ cat /proc/<pid>/cgroup
>>>>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
>>>>> +
>>>>> +  # Unshare cgroupns along with userns and mountns
>>>>> +  # Following calls unshare(CLONE_NEWCGROUP|CLONE_NEWUSER|CLONE_NEWNS), then
>>>>> +  # sets up uid/gid map and execs /bin/bash
>>>>> +  $ ~/unshare -c -u -m
>>>>
>>>> This command does not issue CLONE_NEWUSER, -U does.
>>>>
>>> I was using a custom unshare binary. But I will update the command
>>> line to be similar to the one in util-linux.
>>>
>>>>> +  # Originally, we were in /batchjobs/container_id1 cgroup. Mount our own cgroup
>>>>> +  # hierarchy.
>>>>> +  [ns]$ mount -t cgroup cgroup /tmp/cgroup
>>>>> +  [ns]$ ls -l /tmp/cgroup
>>>>> +  total 0
>>>>> +  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.controllers
>>>>> +  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.populated
>>>>> +  -rw-r--r-- 1 root root 0 2014-10-13 09:25 cgroup.procs
>>>>> +  -rw-r--r-- 1 root root 0 2014-10-13 09:32 cgroup.subtree_control
>>>>
>>>> I've patched libvirt-lxc to issue CLONE_NEWCGROUP and not bind mount cgroupfs into a container.
>>>> But I'm unable to mount cgroupfs within the container, mount(2) is failing with EINVAL.
>>>> And /proc/self/cgroup still shows the cgroup from outside.
>>>>
>>>> ---cut---
>>>> container:/ # ls /sys/fs/cgroup/
>>>> container:/ # mount -t cgroup none /sys/fs/cgroup/
>>>
>>> You need to provide "-o __DEVEL_sane_behavior" flag. Inside the
>>> container, only unified hierarchy can be mounted. So, for now, that
>>> flag is needed. I will fix the documentation too.
>>>
>>>> mount: wrong fs type, bad option, bad superblock on none,
>>>>        missing codepage or helper program, or other error
>>>>
>>>>        In some cases useful info is found in syslog - try
>>>>        dmesg | tail or so.
>>>> container:/ # cat /proc/self/cgroup
>>>> 8:memory:/machine/test00.libvirt-lxc
>>>> 7:devices:/machine/test00.libvirt-lxc
>>>> 6:hugetlb:/
>>>> 5:cpuset:/machine/test00.libvirt-lxc
>>>> 4:blkio:/machine/test00.libvirt-lxc
>>>> 3:cpu,cpuacct:/machine/test00.libvirt-lxc
>>>> 2:freezer:/machine/test00.libvirt-lxc
>>>> 1:name=systemd:/user.slice/user-0.slice/session-c2.scope
>>>> container:/ # ls -la /proc/self/ns
>>>> total 0
>>>> dr-x--x--x 2 root root 0 Dec 14 23:02 .
>>>> dr-xr-xr-x 8 root root 0 Dec 14 23:02 ..
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 cgroup -> cgroup:[4026532240]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 ipc -> ipc:[4026532238]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 mnt -> mnt:[4026532235]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 net -> net:[4026532242]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 pid -> pid:[4026532239]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 user -> user:[4026532234]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 uts -> uts:[4026532236]
>>>> container:/ #
>>>>
>>>> #host side
>>>> lxc-os132:~ # ls -la /proc/self/ns
>>>> total 0
>>>> dr-x--x--x 2 root root 0 Dec 14 23:56 .
>>>> dr-xr-xr-x 8 root root 0 Dec 14 23:56 ..
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 cgroup -> cgroup:[4026531835]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 ipc -> ipc:[4026531839]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 mnt -> mnt:[4026531840]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 net -> net:[4026531957]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 pid -> pid:[4026531836]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 user -> user:[4026531837]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 uts -> uts:[4026531838]
>>>> ---cut---
>>>>
>>>> Any ideas?
>>>>
>>>
>>> Please try with "-o __DEVEL_sane_behavior" flag to the mount command.
>>
>> Ohh, this renders the whole patch useless for me as systemd needs the "old/default" behavior of cgroups. :-(
>> I really hoped that cgroup namespaces will help me running systemd in a sane way within Linux containers.
>
> Ugh.  It sounds like there is a real mess here.  At the very least there
> is misunderstanding.
>
> I have a memory that systemd should have been able to use a unified
> hierarchy.  As you could still mount the different controllers
> independently (they just use the same directory structure on each
> mount).
>
In theory, if you boot the kernel with the
"cgroup__DEVEL__legacy_files_on_dfl" command-line parameter and mount
cgroups with the sane-behavior flag
(mount -t cgroup -o __DEVEL_sane_behavior none $mntpt), the result
should be more or less similar to mounting all hierarchies together at
the same mount point. I haven't tried this, but systemd should be able
to work with it, and you can enable cgroup namespaces too.
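
One rough way to check, from inside the container, which of the two
layouts a process actually sees is to parse /proc/self/cgroup. The
Python below is only a sketch: parse_proc_cgroup() and looks_unified()
are names made up for this illustration, and "exactly one hierarchy
line" is an assumed heuristic based on the outputs quoted in this
thread, not a kernel-defined test.

```python
from typing import List, NamedTuple


class CgroupEntry(NamedTuple):
    hierarchy_id: int
    controllers: List[str]
    path: str


def parse_proc_cgroup(text: str) -> List[CgroupEntry]:
    """Parse 'id:controllers:path' lines as found in /proc/<pid>/cgroup."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first two colons only, so a path containing ':' survives.
        hierarchy_id, controllers, path = line.split(":", 2)
        entries.append(
            CgroupEntry(
                int(hierarchy_id),
                controllers.split(",") if controllers else [],
                path,
            )
        )
    return entries


def looks_unified(entries: List[CgroupEntry]) -> bool:
    """Assumed heuristic: all controllers sit on one single hierarchy."""
    return len(entries) == 1


# Sample from the documentation patch (output inside the new cgroupns).
UNIFIED_SAMPLE = "0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/\n"

# Sample from the libvirt-lxc container report quoted in this thread.
LEGACY_SAMPLE = """\
8:memory:/machine/test00.libvirt-lxc
7:devices:/machine/test00.libvirt-lxc
6:hugetlb:/
5:cpuset:/machine/test00.libvirt-lxc
4:blkio:/machine/test00.libvirt-lxc
3:cpu,cpuacct:/machine/test00.libvirt-lxc
2:freezer:/machine/test00.libvirt-lxc
1:name=systemd:/user.slice/user-0.slice/session-c2.scope
"""

if __name__ == "__main__":
    for name, sample in (("unified", UNIFIED_SAMPLE), ("legacy", LEGACY_SAMPLE)):
        entries = parse_proc_cgroup(sample)
        print(name, len(entries), "hierarchies, unified:", looks_unified(entries))
```

On a live system the same functions could be fed the contents of
/proc/self/cgroup instead of the canned samples.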

> That said from a practical standpoint I am not certain that a cgroup
> namespace is viable if it can not support the behavior of cgroupsfs
> that everyone is using.
>

Since the old/default behavior is on its way out, I didn't invest time
in fixing that. Also, some of the properties that make cgroup
namespaces simpler are only provided by the unified hierarchy (for
example, a single root cgroup per container).
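
To illustrate the single-root point, here is a small Python sketch of
how a path could be shown relative to a cgroupns-root. It is not the
kernel implementation: ns_relative_path() is a hypothetical helper, and
it assumes the cgroup is at or below the namespace root (anything else
raises an error). The sample paths are the ones from the outputs quoted
earlier in this thread.

```python
def ns_relative_path(cgroup_path: str, ns_root: str) -> str:
    """Return cgroup_path as seen from a namespace rooted at ns_root.

    Assumption for this sketch: cgroup_path is ns_root or a descendant.
    """
    root = ns_root.rstrip("/")      # "/" becomes "" so prefixes compose
    path = cgroup_path.rstrip("/")
    if path == root:
        return "/"
    if path.startswith(root + "/"):
        return path[len(root):]
    raise ValueError(f"{cgroup_path!r} is not under {ns_root!r}")


# Unified hierarchy: one path, hence one root for the whole container.
UNIFIED_PATH = "/batchjobs/container_id1"

# Legacy layout from the libvirt-lxc report: one path per hierarchy.
LEGACY_PATHS = {
    "memory": "/machine/test00.libvirt-lxc",
    "hugetlb": "/",
    "name=systemd": "/user.slice/user-0.slice/session-c2.scope",
}

if __name__ == "__main__":
    # The process sits in its own cgroupns-root, so it sees "/".
    print(ns_relative_path(UNIFIED_PATH, UNIFIED_PATH))
    # A child cgroup shows up relative to that root.
    print(ns_relative_path(UNIFIED_PATH + "/child", UNIFIED_PATH))
    # With separate hierarchies there is no single root to pick.
    print(len(set(LEGACY_PATHS.values())), "distinct roots would be needed")
```

With the legacy layout, each hierarchy would need its own root, which
is the extra bookkeeping the unified hierarchy avoids.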


> Eric

-- 
Aditya


* Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
       [not found]                     ` <87lhlgpyxk.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-01-06  0:10                       ` Aditya Kali
  2015-01-06  0:10                       ` Aditya Kali
  1 sibling, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2015-01-06  0:10 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Richard Weinberger, Tejun Heo, Li Zefan, Serge Hallyn,
	Andy Lutomirski, cgroups mailinglist, linux-kernel, Linux API,
	Ingo Molnar, Linux Containers, Rohit Jnagal, Vivek Goyal

On Mon, Jan 5, 2015 at 3:53 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
> Richard Weinberger <richard@nod.at> writes:
>
>> On 05.01.2015 at 23:48, Aditya Kali wrote:
>>> On Sun, Dec 14, 2014 at 3:05 PM, Richard Weinberger <richard@nod.at> wrote:
>>>> Aditya,
>>>>
>>>> I gave your patch set a try but it does not work for me.
>>>> Maybe you can bring some light into the issues I'm facing.
>>>> Sadly I still had no time to dig into your code.
>>>>
>>>> On 05.12.2014 at 02:55, Aditya Kali wrote:
>>>>> Signed-off-by: Aditya Kali <adityakali@google.com>
>>>>> ---
>>>>>  Documentation/cgroups/namespace.txt | 147 ++++++++++++++++++++++++++++++++++++
>>>>>  1 file changed, 147 insertions(+)
>>>>>  create mode 100644 Documentation/cgroups/namespace.txt
>>>>>
>>>>> diff --git a/Documentation/cgroups/namespace.txt b/Documentation/cgroups/namespace.txt
>>>>> new file mode 100644
>>>>> index 0000000..6480379
>>>>> --- /dev/null
>>>>> +++ b/Documentation/cgroups/namespace.txt
>>>>> @@ -0,0 +1,147 @@
>>>>> +                     CGroup Namespaces
>>>>> +
>>>>> +CGroup Namespace provides a mechanism to virtualize the view of the
>>>>> +/proc/<pid>/cgroup file. The CLONE_NEWCGROUP clone-flag can be used with
>>>>> +clone() and unshare() syscalls to create a new cgroup namespace.
>>>>> +The process running inside the cgroup namespace will have its /proc/<pid>/cgroup
>>>>> +output restricted to cgroupns-root. cgroupns-root is the cgroup of the process
>>>>> +at the time of creation of the cgroup namespace.
>>>>> +
>>>>> +Prior to CGroup Namespace, the /proc/<pid>/cgroup file used to show complete
>>>>> +path of the cgroup of a process. In a container setup (where a set of cgroups
>>>>> +and namespaces are intended to isolate processes), the /proc/<pid>/cgroup file
>>>>> +may leak potential system level information to the isolated processes.
>>>>> +
>>>>> +For Example:
>>>>> +  $ cat /proc/self/cgroup
>>>>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
>>>>> +
>>>>> +The path '/batchjobs/container_id1' can generally be considered as system-data
>>>>> +and its desirable to not expose it to the isolated process.
>>>>> +
>>>>> +CGroup Namespaces can be used to restrict visibility of this path.
>>>>> +For Example:
>>>>> +  # Before creating cgroup namespace
>>>>> +  $ ls -l /proc/self/ns/cgroup
>>>>> +  lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
>>>>> +  $ cat /proc/self/cgroup
>>>>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
>>>>> +
>>>>> +  # unshare(CLONE_NEWCGROUP) and exec /bin/bash
>>>>> +  $ ~/unshare -c
>>>>> +  [ns]$ ls -l /proc/self/ns/cgroup
>>>>> +  lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
>>>>> +  # From within new cgroupns, process sees that it's in the root cgroup
>>>>> +  [ns]$ cat /proc/self/cgroup
>>>>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
>>>>> +
>>>>> +  # From global cgroupns:
>>>>> +  $ cat /proc/<pid>/cgroup
>>>>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
>>>>> +
>>>>> +  # Unshare cgroupns along with userns and mountns
>>>>> +  # Following calls unshare(CLONE_NEWCGROUP|CLONE_NEWUSER|CLONE_NEWNS), then
>>>>> +  # sets up uid/gid map and execs /bin/bash
>>>>> +  $ ~/unshare -c -u -m
>>>>
>>>> This command does not issue CLONE_NEWUSER, -U does.
>>>>
>>> I was using a custom unshare binary. But I will update the command
>>> line to be similar to the one in util-linux.
>>>
>>>>> +  # Originally, we were in /batchjobs/container_id1 cgroup. Mount our own cgroup
>>>>> +  # hierarchy.
>>>>> +  [ns]$ mount -t cgroup cgroup /tmp/cgroup
>>>>> +  [ns]$ ls -l /tmp/cgroup
>>>>> +  total 0
>>>>> +  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.controllers
>>>>> +  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.populated
>>>>> +  -rw-r--r-- 1 root root 0 2014-10-13 09:25 cgroup.procs
>>>>> +  -rw-r--r-- 1 root root 0 2014-10-13 09:32 cgroup.subtree_control
>>>>
>>>> I've patched libvirt-lxc to issue CLONE_NEWCGROUP and not bind mount cgroupfs into a container.
>>>> But I'm unable to mount cgroupfs within the container, mount(2) is failing with EINVAL.
>>>> And /proc/self/cgroup still shows the cgroup from outside.
>>>>
>>>> ---cut---
>>>> container:/ # ls /sys/fs/cgroup/
>>>> container:/ # mount -t cgroup none /sys/fs/cgroup/
>>>
>>> You need to provide "-o __DEVEL_sane_behavior" flag. Inside the
>>> container, only unified hierarchy can be mounted. So, for now, that
>>> flag is needed. I will fix the documentation too.
>>>
>>>> mount: wrong fs type, bad option, bad superblock on none,
>>>>        missing codepage or helper program, or other error
>>>>
>>>>        In some cases useful info is found in syslog - try
>>>>        dmesg | tail or so.
>>>> container:/ # cat /proc/self/cgroup
>>>> 8:memory:/machine/test00.libvirt-lxc
>>>> 7:devices:/machine/test00.libvirt-lxc
>>>> 6:hugetlb:/
>>>> 5:cpuset:/machine/test00.libvirt-lxc
>>>> 4:blkio:/machine/test00.libvirt-lxc
>>>> 3:cpu,cpuacct:/machine/test00.libvirt-lxc
>>>> 2:freezer:/machine/test00.libvirt-lxc
>>>> 1:name=systemd:/user.slice/user-0.slice/session-c2.scope
>>>> container:/ # ls -la /proc/self/ns
>>>> total 0
>>>> dr-x--x--x 2 root root 0 Dec 14 23:02 .
>>>> dr-xr-xr-x 8 root root 0 Dec 14 23:02 ..
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 cgroup -> cgroup:[4026532240]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 ipc -> ipc:[4026532238]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 mnt -> mnt:[4026532235]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 net -> net:[4026532242]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 pid -> pid:[4026532239]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 user -> user:[4026532234]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 uts -> uts:[4026532236]
>>>> container:/ #
>>>>
>>>> #host side
>>>> lxc-os132:~ # ls -la /proc/self/ns
>>>> total 0
>>>> dr-x--x--x 2 root root 0 Dec 14 23:56 .
>>>> dr-xr-xr-x 8 root root 0 Dec 14 23:56 ..
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 cgroup -> cgroup:[4026531835]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 ipc -> ipc:[4026531839]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 mnt -> mnt:[4026531840]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 net -> net:[4026531957]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 pid -> pid:[4026531836]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 user -> user:[4026531837]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 uts -> uts:[4026531838]
>>>> ---cut---
>>>>
>>>> Any ideas?
>>>>
>>>
>>> Please try with "-o __DEVEL_sane_behavior" flag to the mount command.
>>
>> Ohh, this renders the whole patch useless for me as systemd needs the "old/default" behavior of cgroups. :-(
>> I really hoped that cgroup namespaces would help me run systemd in a sane way within Linux containers.
>
> Ugh.  It sounds like there is a real mess here.  At the very least there
> is misunderstanding.
>
> I have a memory that systemd should have been able to use a unified
> hierarchy.  As you could still mount the different controllers
> independently (they just use the same directory structure on each
> mount).
>
In theory, if you boot the kernel with the
"cgroup__DEVEL__legacy_files_on_dfl" command-line parameter and mount
cgroups with the sane-behavior flag, then it should be more or less
similar to mounting all hierarchies together at the same mount-point
(mount -t cgroup -o __DEVEL_sane_behavior none $mntpt). I haven't
tried this, but systemd should be able to work with it, and you can
enable cgroup-namespace too.
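
As a rough illustration of the setup suggested above, the following Python
sketch checks a kernel command line for the boot parameter and assembles the
mount command. The parameter and option names are copied from this message;
the helper names (has_boot_param, sane_mount_argv) and the example command
line are hypothetical, and like the suggestion itself this has not been tried
against a real kernel.

```python
# Development-only knobs, spelled exactly as in the message above.
BOOT_PARAM = "cgroup__DEVEL__legacy_files_on_dfl"
MOUNT_OPTION = "__DEVEL_sane_behavior"


def has_boot_param(cmdline, name=BOOT_PARAM):
    """Return True if `name` appears as a whole word in a kernel command line."""
    return name in cmdline.split()


def sane_mount_argv(mntpt):
    """Build argv for: mount -t cgroup -o __DEVEL_sane_behavior none <mntpt>."""
    return ["mount", "-t", "cgroup", "-o", MOUNT_OPTION, "none", mntpt]


# Hypothetical command line; on a live system it would come from /proc/cmdline.
example = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 cgroup__DEVEL__legacy_files_on_dfl"
print(has_boot_param(example))                      # True
print(" ".join(sane_mount_argv("/sys/fs/cgroup")))
# mount -t cgroup -o __DEVEL_sane_behavior none /sys/fs/cgroup
```

The sketch only builds the argument list; actually running the mount needs
the appropriate privileges inside the container.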

> That said, from a practical standpoint I am not certain that a cgroup
> namespace is viable if it cannot support the behavior of cgroupfs
> that everyone is using.
>

Since the old/default behavior is on its way out, I didn't invest time
in fixing that. Also, some of the properties that make
cgroup-namespace simpler are only provided by unified hierarchy (for
example: a single root-cgroup per container).
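
To make the "single root-cgroup per container" property concrete, here is a
minimal Python sketch of the path rewriting that the documentation describes
for /proc/<pid>/cgroup. It is a userspace model only, not the kernel code from
the patch set; the function names are hypothetical, and what is reported for a
cgroup outside the namespace root is deliberately not modeled.

```python
def cgroupns_view(cgroup_path, ns_root):
    """Return cgroup_path as seen from a namespace rooted at ns_root.

    Only the documented case is handled: the cgroup is the namespace root
    or one of its descendants. Anything else raises ValueError, because
    this sketch does not model cgroups outside the namespace root.
    """
    root = ns_root.rstrip("/")
    if root == "":
        # Namespace rooted at "/", i.e. the global view: nothing to strip.
        return cgroup_path
    if cgroup_path == root:
        return "/"
    if cgroup_path.startswith(root + "/"):
        return cgroup_path[len(root):]
    raise ValueError(f"{cgroup_path!r} is not under {ns_root!r}")


def rewrite_proc_cgroup_line(line, ns_root):
    """Rewrite one 'id:controllers:path' line from /proc/<pid>/cgroup."""
    hier_id, controllers, path = line.rstrip("\n").split(":", 2)
    return f"{hier_id}:{controllers}:{cgroupns_view(path, ns_root)}"


line = "0:cpuset,cpu,cpuacct,memory:/batchjobs/container_id1"
print(rewrite_proc_cgroup_line(line, "/batchjobs/container_id1"))
# 0:cpuset,cpu,cpuacct,memory:/
```

With one unified hierarchy there is exactly one such root to track per
namespace, which is what keeps the rewriting this simple.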


> Eric

-- 
Aditya

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
       [not found]                       ` <CAGr1F2HSi_D07r2c5CKOsjSR1+58k9G2MrtACsd+HV6XKvJ7cA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-01-06  0:17                         ` Richard Weinberger
  0 siblings, 0 replies; 384+ messages in thread
From: Richard Weinberger @ 2015-01-06  0:17 UTC (permalink / raw)
  To: Aditya Kali, Eric W. Biederman
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andy Lutomirski, Tejun Heo,
	cgroups mailinglist, Ingo Molnar

Am 06.01.2015 um 01:10 schrieb Aditya Kali:
> Since the old/default behavior is on its way out, I didn't invest time
> in fixing that. Also, some of the properties that make
> cgroup-namespace simpler are only provided by unified hierarchy (for
> example: a single root-cgroup per container).

Does the new sane cgroupfs behavior even have a single real world user?
I always thought it isn't stable yet.

Linux distros currently use systemd v210. They don't dare to use a newer one.
Even *if* the most recent systemd version supported the sane cgroupfs behavior,
it would take 1-2 years until it hit a recent distro.

So please also support the old and nasty behavior, so that one day we can run current
systemd distros in Linux containers.

Thanks,
//richard

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
       [not found]                         ` <54AB2992.6060707-/L3Ra7n9ekc@public.gmane.org>
@ 2015-01-06 23:20                           ` Aditya Kali
  0 siblings, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2015-01-06 23:20 UTC (permalink / raw)
  To: Richard Weinberger
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andy Lutomirski,
	Eric W. Biederman, Tejun Heo, cgroups mailinglist, Ingo Molnar

I understand your point. But it will add some complexity to the code.

Before trying to make it work for non-unified hierarchy cases, I would
like to get a clearer idea.
What do you expect to be mounted when you run:
  container:/ # mount -t cgroup none /sys/fs/cgroup/
from inside the container?

Note that cgroup-namespace won't be able to change the way cgroups are
mounted, i.e., if say cpu and cpuacct subsystems are mounted
together at a single mount-point, then we cannot mount them any other
way (inside a container or outside). This restriction exists today and
cgroup-namespaces won't change that.

So, if on the host we have:
root@adityakali-vm2:/sys/fs/cgroup# cat /proc/mounts | grep cgroup
tmpfs /sys/fs/cgroup tmpfs rw,relatime 0 0
cgroup /sys/fs/cgroup/cpu cgroup rw,relatime,cpuset,cpu,cpuacct 0 0
cgroup /sys/fs/cgroup/mem cgroup rw,relatime,memory,hugetlb 0 0
cgroup /sys/fs/cgroup/rest cgroup rw,relatime,devices,freezer,net_cls,blkio,perf_event,net_prio 0 0

And inside the container we want each subsystem to be on its own
mount-point, then it will fail. Do you think even then it's useful to
support virtualizing paths for non-unified hierarchies?
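
The restriction can be modeled in a few lines of Python. This is only a sketch
under stated assumptions: it parses /proc/mounts-style text like the listing
above, the controller list is hard-coded from names seen in this thread (a
real tool would read /proc/cgroups), and the helper names are hypothetical.

```python
# Controller names that appear in this thread (assumption for this sketch).
KNOWN_CONTROLLERS = {
    "cpuset", "cpu", "cpuacct", "memory", "hugetlb", "devices",
    "freezer", "net_cls", "blkio", "perf_event", "net_prio",
}


def cgroup_hierarchies(mounts_text):
    """Map each cgroup mount point to the set of controllers bound to it."""
    hierarchies = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mount point, fs type, options, dump, pass
        if len(fields) < 4 or fields[2] != "cgroup":
            continue
        hierarchies[fields[1]] = set(fields[3].split(",")) & KNOWN_CONTROLLERS
    return hierarchies


def can_mount_alone(controller, hierarchies):
    """True if `controller` is unbound or already sits alone in a hierarchy."""
    for controllers in hierarchies.values():
        if controller in controllers:
            return controllers == {controller}
    return True


HOST_MOUNTS = """\
tmpfs /sys/fs/cgroup tmpfs rw,relatime 0 0
cgroup /sys/fs/cgroup/cpu cgroup rw,relatime,cpuset,cpu,cpuacct 0 0
cgroup /sys/fs/cgroup/mem cgroup rw,relatime,memory,hugetlb 0 0
cgroup /sys/fs/cgroup/rest cgroup rw,relatime,devices,freezer,net_cls,blkio,perf_event,net_prio 0 0
"""

hier = cgroup_hierarchies(HOST_MOUNTS)
print(can_mount_alone("cpu", hier))     # False
print(can_mount_alone("memory", hier))  # False
```

With the host layout above, no controller can be given its own mount-point,
whether the mount is attempted inside a container or outside.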

Thanks,


On Mon, Jan 5, 2015 at 4:17 PM, Richard Weinberger <richard-/L3Ra7n9ekc@public.gmane.org> wrote:
> Am 06.01.2015 um 01:10 schrieb Aditya Kali:
>> Since the old/default behavior is on its way out, I didn't invest time
>> in fixing that. Also, some of the properties that make
>> cgroup-namespace simpler are only provided by unified hierarchy (for
>> example: a single root-cgroup per container).
>
> Does the new sane cgroupfs behavior even have a single real world user?
> I always thought it isn't stable yet.
>
> Linux distros currently use systemd v210. They don't dare to use a newer one.
> Even *if* the most recent systemd version supported the sane cgroupfs behavior,
> it would take 1-2 years until it hit a recent distro.
>
> So please also support the old and nasty behavior, so that one day we can run current
> systemd distros in Linux containers.
>
> Thanks,
> //richard



-- 
Aditya

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
       [not found]                           ` <CAGr1F2EGOUSEd3-G4PS0mq=9kU1nWG4CwHUOQaNUATepc11_Sw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-01-06 23:39                             ` Richard Weinberger
  2015-01-07  9:28                               ` Richard Weinberger
  1 sibling, 0 replies; 384+ messages in thread
From: Richard Weinberger @ 2015-01-06 23:39 UTC (permalink / raw)
  To: Aditya Kali
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andy Lutomirski,
	Eric W. Biederman, Tejun Heo, cgroups mailinglist, Ingo Molnar

Am 07.01.2015 um 00:20 schrieb Aditya Kali:
> I understand your point. But it will add some complexity to the code.
> 
> Before trying to make it work for non-unified hierarchy cases, I would
> like to get a clearer idea.
> What do you expect to be mounted when you run:
>   container:/ # mount -t cgroup none /sys/fs/cgroup/
> from inside the container?

I expect cgroupfs to behave exactly as it would in the initial namespace.
Such that the container can do with it whatever it wants.
systemd mounts and manages cgroups on its own.
Like for CONFIG_DEVPTS_MULTIPLE_INSTANCES.

If a new cgroup namespace cannot provide a clean and autonomous cgroupfs
instance it is fundamentally flawed.
You cannot provide a namespace mechanism which depends on the host side
that much.
This will also horribly break container migrations between hosts,
e.g. migrating a container from an Ubuntu host to a Fedora (systemd!) host.

> Note that cgroup-namespace won't be able to change the way cgroups are
> mounted, i.e., if say cpu and cpuacct subsystems are mounted
> together at a single mount-point, then we cannot mount them any other
> way (inside a container or outside). This restriction exists today and
> cgroup-namespaces won't change that.

Why can't cgroup namespace change this?
I think of cgroup namespace as a new and clean cgroupfs instance which inherits
all limits from the outside.

> So, if on the host we have:
> root@adityakali-vm2:/sys/fs/cgroup# cat /proc/mounts | grep cgroup
> tmpfs /sys/fs/cgroup tmpfs rw,relatime 0 0
> cgroup /sys/fs/cgroup/cpu cgroup rw,relatime,cpuset,cpu,cpuacct 0 0
> cgroup /sys/fs/cgroup/mem cgroup rw,relatime,memory,hugetlb 0 0
> cgroup /sys/fs/cgroup/rest cgroup rw,relatime,devices,freezer,net_cls,blkio,perf_event,net_prio 0 0
> 
> And inside the container we want each subsystem to be on its own
> mount-point, then it will fail. Do you think even then it's useful to
> support virtualizing paths for non-unified hierarchies?

As stated above, I expect cgroup namespaces to provide a clean and sane
cgroupfs instance no matter how the outer mounts are set up.

Thanks,
//richard

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
  2015-01-06 23:20                           ` Aditya Kali
@ 2015-01-07  9:28                               ` Richard Weinberger
  -1 siblings, 0 replies; 384+ messages in thread
From: Richard Weinberger @ 2015-01-07  9:28 UTC (permalink / raw)
  To: Aditya Kali
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel, Andy Lutomirski,
	Eric W. Biederman, Tejun Heo, cgroups mailinglist, Ingo Molnar

Am 07.01.2015 um 00:20 schrieb Aditya Kali:
> I understand your point. But it will add some complexity to the code.
> 
> Before trying to make it work for non-unified hierarchy cases, I would
> like to get a clearer idea.
> What do you expect to be mounted when you run:
>   container:/ # mount -t cgroup none /sys/fs/cgroup/
> from inside the container?
> 
> Note that cgroup-namespace won't be able to change the way cgroups are
> mounted .. i.e., if say cpu and cpuacct subsystems are mounted
> together at a single mount-point, then we cannot mount them any other
> way (inside a container or outside). This restriction exists today and
> cgroup-namespaces won't change that.

I wondered why cgroup namespaces won't change that and looked at your patches
in more detail.
What you propose as a cgroup namespace is much more a cgroup chroot() than
a namespace.
Because you pass relative paths into the namespace, you depend on the mount
structure of the host side.
Hence, the abstraction between namespaces happens on the mount paths of the initial
cgroupfs. But we really want a new cgroupfs instance within a container, not just
a cutout of the initial cgroupfs mount.

I fear your approach is oversimplified and won't work for all cases. It may work
for your specific use case at Google, but we really want something generic.
Eric, what do you think?

Thanks,
//richard

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
  2015-01-07  9:28                               ` Richard Weinberger
@ 2015-01-07 14:45                                   ` Eric W. Biederman
  -1 siblings, 0 replies; 384+ messages in thread
From: Eric W. Biederman @ 2015-01-07 14:45 UTC (permalink / raw)
  To: Richard Weinberger
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel, Andy Lutomirski, Tejun Heo,
	cgroups mailinglist, Ingo Molnar

Richard Weinberger <richard@nod.at> writes:

> Am 07.01.2015 um 00:20 schrieb Aditya Kali:
>> I understand your point. But it will add some complexity to the code.
>> 
>> Before trying to make it work for non-unified hierarchy cases, I would
>> like to get a clearer idea.
>> What do you expect to be mounted when you run:
>>   container:/ # mount -t cgroup none /sys/fs/cgroup/
>> from inside the container?
>> 
>> Note that cgroup-namespace won't be able to change the way cgroups are
>> mounted .. i.e., if say cpu and cpuacct subsystems are mounted
>> together at a single mount-point, then we cannot mount them any other
>> way (inside a container or outside). This restriction exists today and
>> cgroup-namespaces won't change that.
>
> I wondered why cgroup namespaces won't change that and looked at your patches
> in more detail.
> What you propose as cgroup namespace is much more a cgroup chroot() than
> a namespace.
> As you pass relative paths into the namespace you depend on the mount structure
> of the host side.
> Hence, the abstraction between namespaces happens on the mount paths of the initial
> cgroupfs. But we really want a new cgroupfs instance within a container and not just
> a cut out of the initial cgroupfs mount.
>
> I fear your approach is oversimplified and won't work for all cases. It may work
> for your specific use case at Google, but we really want something generic.
> Eric, what do you think?

I think I probably need to go back upthread and read the patches.

I think it is a reasonable practical requirement that a widely used,
long-term-supported distribution like RHEL 7 needs to be able to run in a Linux
container, bizarre init system and all, and that the abstractions
should be such that we are able to migrate such a beast.

There are a couple of issues in play and I think we need actual testing
rather than reports that something shouldn't work before we reject a set
of patches.  Aditya in one of his replies to me has reported a
configuration that he expects will work.  So I think that configuration
needs to be tested.

cgroups is a weird beast and the problems tend not to lie where a person
would first expect.

I suspect no one strongly cares if the cgroup hierarchy is unified or
not.  By unified hierarchy I mean that every mount of cgroupfs has the
same directories with the same processes in each directory.

I do think people will care which controllers show up in different
mounts of cgroupfs, and I think that is relevant to process migration.

I am going to segue into the scope of what is achievable with a cgroup namespace.

- If there are files in cgroupfs that are not safe to delegate, we
  cannot support those files in a container.

  Last I looked there were such files and systemd used them.

- Which controllers share hierarchies of processes to track resources is
  a core cgroup issue and not a cgroup namespace issue.

  If we find problems with using unified hierarchy support, we need to
  go fix cgroups in general, not cgroupfs.

Eric

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
       [not found]                               ` <54ACFC38.5070007-/L3Ra7n9ekc@public.gmane.org>
  2015-01-07 14:45                                   ` Eric W. Biederman
@ 2015-01-07 18:57                                 ` Aditya Kali
  1 sibling, 0 replies; 384+ messages in thread
From: Aditya Kali @ 2015-01-07 18:57 UTC (permalink / raw)
  To: Richard Weinberger
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel, Andy Lutomirski,
	Eric W. Biederman, Tejun Heo, cgroups mailinglist, Ingo Molnar

On Wed, Jan 7, 2015 at 1:28 AM, Richard Weinberger <richard@nod.at> wrote:
> Am 07.01.2015 um 00:20 schrieb Aditya Kali:
>> I understand your point. But it will add some complexity to the code.
>>
>> Before trying to make it work for non-unified hierarchy cases, I would
>> like to get a clearer idea.
>> What do you expect to be mounted when you run:
>>   container:/ # mount -t cgroup none /sys/fs/cgroup/
>> from inside the container?
>>
>> Note that cgroup-namespace won't be able to change the way cgroups are
>> mounted .. i.e., if say cpu and cpuacct subsystems are mounted
>> together at a single mount-point, then we cannot mount them any other
>> way (inside a container or outside). This restriction exists today and
>> cgroup-namespaces won't change that.
>
> I wondered why cgroup namespaces won't change that and looked at your patches
> in more detail.
> What you propose as cgroup namespace is much more a cgroup chroot() than
> a namespace.
> As you pass relative paths into the namespace you depend on the mount structure
> of the host side.
> Hence, the abstraction between namespaces happens on the mount paths of the initial
> cgroupfs. But we really want a new cgroupfs instance within a container and not just
> a cut out of the initial cgroupfs mount.
>

What you describe would be useful at Google too; I just found it
difficult/infeasible to include in the scope of cgroup namespaces.
The scope of the cgroup namespace was deliberately limited to virtualizing
the /proc/<pid>/cgroup file, and to doing so in a way that doesn't need major
changes to the cgroup code itself. (It was also limited to the unified
hierarchy to keep things simple, but that can be changed.)
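
As a rough illustration of what virtualizing /proc/<pid>/cgroup means, the
following Python sketch rewrites a host-side cgroup path so that it is shown
relative to a namespace root. It is not code from the patches: the function
names and the ns_root parameter are hypothetical, the sample line is the one
from the cover letter, and showing paths outside the root with leading ".."
components is an assumption of this sketch.

```python
import posixpath


def virtualize_cgroup_path(host_path, ns_root):
    """Return host_path as it would appear from a namespace rooted at ns_root."""
    host_path = posixpath.normpath(host_path)
    ns_root = posixpath.normpath(ns_root)
    if host_path == ns_root:
        return "/"
    prefix = ns_root.rstrip("/") + "/"
    if host_path.startswith(prefix):
        # Inside the namespace root: strip the root prefix.
        return "/" + host_path[len(prefix):]
    # Outside the namespace root: express the path with ".." components
    # (assumed presentation, see the note above).
    return "/" + posixpath.relpath(host_path, ns_root)


def rewrite_proc_cgroup_line(line, ns_root):
    """Rewrite one 'hierarchy-id:controllers:path' line for a namespace."""
    hier_id, controllers, path = line.rstrip("\n").split(":", 2)
    return f"{hier_id}:{controllers}:{virtualize_cgroup_path(path, ns_root)}"


host_line = "0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1"
print(rewrite_proc_cgroup_line(host_line, "/batchjobs/c_job_id1"))
```

Only the path field changes; the controller list and therefore the mount
structure are taken from the host unchanged, which is the dependency being
discussed in this thread.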

Many of the cgroup subsystems (memory, cpu, etc.) rely on the fact that
they can see the entire cgroup view. For example, in a memcg-OOM scenario,
the memory controller would need to look at all sub-cgroups inside the
OOMing cgroup. A per-namespace cgroupfs instance (if I understand
correctly) would mean that sub-cgroups created inside the namespace
won't be visible outside. I expect this would break the functionality
of the subsystem.

Illustration: memcg A is under OOM; [B] and [C] are cgroup namespace
roots with possibly namespace-private sub-cgroups.

          +------ [B]
A --------|
          +------ [C]
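
The same point as a toy Python model (not kernel code; the Cgroup class, the
private flag and the PIDs are made up for this sketch): handling an OOM in A
means walking every descendant of A, and a walk that cannot see
namespace-private sub-cgroups misses tasks that are charged to A.

```python
from dataclasses import dataclass, field


@dataclass
class Cgroup:
    name: str
    tasks: list = field(default_factory=list)
    children: list = field(default_factory=list)
    private: bool = False  # hypothetical: created inside a cgroup namespace


def tasks_in_subtree(cg, include_private=True):
    """Collect the tasks of cg and its descendants in pre-order."""
    found = list(cg.tasks)
    for child in cg.children:
        if child.private and not include_private:
            continue  # simulate a view that cannot see private sub-cgroups
        found.extend(tasks_in_subtree(child, include_private))
    return found


# A is under OOM; B and C are namespace roots, B has a private sub-cgroup.
b = Cgroup("B", children=[Cgroup("B/private", tasks=[101], private=True)])
c = Cgroup("C", tasks=[201, 202])
a = Cgroup("A", tasks=[1], children=[b, c])

print(tasks_in_subtree(a))                         # full, namespace-agnostic view
print(tasks_in_subtree(a, include_private=False))  # view without private groups
```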

Cgroups are heavily used inside the kernel for various purposes which
need a namespace-agnostic view. An inherent limitation of running
containers on a machine is that they share the same kernel.
Perhaps what you need is something like kexec to be supported inside a
container.

> I fear your approach is oversimplified and won't work for all cases. It may work
> for your specific use case at Google, but we really want something generic.
> Eric, what do you think?
>
> Thanks,
> //richard


-- 
Aditya

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
       [not found]                                   ` <87fvbmir9q.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-01-07 19:30                                     ` Serge E. Hallyn
  0 siblings, 0 replies; 384+ messages in thread
From: Serge E. Hallyn @ 2015-01-07 19:30 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Richard Weinberger, Linux API, Linux Containers, Serge Hallyn,
	linux-kernel, Andy Lutomirski, Tejun Heo, cgroups mailinglist,
	Ingo Molnar

Quoting Eric W. Biederman (ebiederm@xmission.com):
> Richard Weinberger <richard@nod.at> writes:
> 
> > Am 07.01.2015 um 00:20 schrieb Aditya Kali:
> >> I understand your point. But it will add some complexity to the code.
> >> 
> >> Before trying to make it work for non-unified hierarchy cases, I would
> >> like to get a clearer idea.
> >> What do you expect to be mounted when you run:
> >>   container:/ # mount -t cgroup none /sys/fs/cgroup/
> >> from inside the container?
> >> 
> >> Note that cgroup-namespace won't be able to change the way cgroups are
> >> mounted .. i.e., if say cpu and cpuacct subsystems are mounted
> >> together at a single mount-point, then we cannot mount them any other
> >> way (inside a container or outside). This restriction exists today and
> >> cgroup-namespaces won't change that.
> >
> > I wondered why cgroup namespaces won't change that and looked at your patches
> > in more detail.
> > What you propose as cgroup namespace is much more a cgroup chroot() than
> > a namespace.
> > As you pass relative paths into the namespace you depend on the mount structure
> > of the host side.
> > Hence, the abstraction between namespaces happens on the mount paths of the initial
> > cgroupfs. But we really want a new cgroupfs instance within a container and not just
> > a cut out of the initial cgroupfs mount.
> >
> > I fear your approach is oversimplified and won't work for all cases. It may work
> > for your specific use case at Google, but we really want something generic.
> > Eric, what do you think?
> 
> I think I probably need to go back upthread and read the patches.
> 
> I think it is a reasonable practical requirement that a widely used,
> long-term-supported distribution like RHEL 7 needs to be able to run in a Linux
> container, bizarre init system and all, and that the abstractions
> should be such that we are able to migrate such a beast.

Userspace should be able to deal with however cgroups are mounted for
it.  The only case I've heard of where it really made a meaningful
difference was Google's advanced grid usage.  In fact, the whole
justification for the unified cgroup work was the claim (argued
against by Google) that it suffices for all users.

Now yes, until now userspace could cache its info on how cgroups were
mounted and assume that wouldn't change (because the kernel wouldn't
let it), and migration will break that.  But if the cgroup roadmap
is to obsolete anything but unified hierarchy, then this was going
to happen regardless of what the cgroupns patchset did.

I agree with Aditya.  So long as the proclaimed direction of cgroups is
to only support unified cgroup hierarchy, there's no point in having
cgroupns do anything more than the chrooting.
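
To picture what "nothing more than the chrooting" would mean for the path
shown in /proc/self/cgroup, here is a minimal sketch.  It assumes the
namespace root is simply a prefix of the host-side cgroup path;
cgroupns_view is a hypothetical helper, not a kernel or libcgroup
interface, and failing for paths outside the root is an assumption of this
sketch, not a statement about what the patchset does.

```shell
#!/bin/sh
# Hypothetical helper: rewrite a host-side cgroup path as seen from a
# cgroup namespace rooted at $1.  Pure string handling; touches no cgroupfs.
cgroupns_view() {
    ns_root=${1%/}      # "/" becomes "", so the patterns below still work
    path=$2
    case "$path" in
        "$ns_root")   printf '/\n' ;;                        # task sits at the root
        "$ns_root"/*) printf '%s\n' "${path#"$ns_root"}" ;;  # strip the root prefix
        *)            return 1 ;;                            # outside the root (assumption)
    esac
}

cgroupns_view /batchjobs/c_job_id1 /batchjobs/c_job_id1/sub
```

With a namespace rooted at /batchjobs/c_job_id1, the sketch reports a task
in /batchjobs/c_job_id1/sub as living in /sub; how the hierarchy itself is
mounted is left entirely to the host.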

> There are a couple of issues in play and I think we need actual testing
> rather than reports that something shouldn't work before we reject a set
> of patches.    Aditya in one of his replies to me has reported a
> configuration that he expects will work.  So I think that configuration
> needs to be tested.
> 
> cgroups is a weird beast and the problems tend not to lie where a person
> would first expect.
> 
> I suspect no one strongly cares if the cgroup hierarchy is unified or
> not.

Well, Google does.  There are cases that were either much more complicated
or impossible to represent with unified hierarchy.  But complicating cgroupns
to support something which Tejun has said is explicitly not going to be
supported in the future would be ill-advised.

>   By unified hierarchy I mean that  every mount of cgroupfs has the
> same directories with the same processes in each directory.

No, my reading of Documentation/cgroups/unified-hierarchy.txt is that
unified hierarchy means that all (sane) controllers are co-mounted into
one hierarchy.

> I do think people will care which controllers will show up in different
> mounts of cgroupfs, and I think that is relevant to process migration.
> 
> 
> 
> 
> I am going to segue into the scope of what is achievable with a cgroup namespace.
> 
> - If there are files in cgroupfs that are not safe to delegate, we
>   cannot support those files in a container.
> 
>   Last I looked there were such files and systemd used them.
> 
> - Which controllers share hierarchies of processes to track resources is
>   a core cgroup issue and not a cgroup namespace issue.
> 
>   If we find problems with unified hierarchy support we need to
>   go fix cgroups in general, not cgroupfs.
> 
> Eric

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
       [not found]                                     ` <20150107193059.GA1857-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
@ 2015-01-07 22:14                                       ` Eric W. Biederman
  0 siblings, 0 replies; 384+ messages in thread
From: Eric W. Biederman @ 2015-01-07 22:14 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Richard Weinberger, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andy Lutomirski,
	Ingo Molnar, Linux API, Tejun Heo, cgroups mailinglist

"Serge E. Hallyn" <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> writes:

>>   By unified hierarchy I mean that  every mount of cgroupfs has the
>> same directories with the same processes in each directory.
>
> No, my reading of Documentation/cgroups/unified-hierarchy.txt is that
> unified hierarchy means that all (sane) controllers are co-mounted into
> one hierarchy.

I see what you mean.  If it is indeed the case that a mount of cgroupfs
using the unified hierarchy cannot specify which controllers are
present under that mount, that is a very significant bug and presents a
very significant regression in user space flexibility.

I think you can still mount the unified hierarchy and select which
controllers you want to see.  If you cannot, that is a change significantly
past what was agreed to and a regression fix needs to be applied.

With a unified hierarchy and separate controllers per mount, many
cgroup-using applications will continue to work as before without changes,
or with minimal changes.  That is what was agreed to, what I expect has
actually been implemented, and what needs to be implemented in any
case.

I will see about making time to see where things are really at.

Eric

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
  2015-01-07 22:14                                       ` Eric W. Biederman
@ 2015-01-07 22:45                                           ` Tejun Heo
  -1 siblings, 0 replies; 384+ messages in thread
From: Tejun Heo @ 2015-01-07 22:45 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Richard Weinberger, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andy Lutomirski,
	Ingo Molnar, Linux API, cgroups mailinglist

On Wed, Jan 07, 2015 at 04:14:40PM -0600, Eric W. Biederman wrote:
> I see what you mean.  If it is indeed the case that a mount of cgroupfs
> using the unified hierarchy cannot specify which controllers are
> present under that mount, that is a very significant bug and presents a
> very significant regression in user space flexibility.

The parent always controls which controllers are made available at the
children level.  Only if the parent enables a controller can its
children, whether they're namespaces or not, choose to further
distribute resources using that controller.  It's a straightforward
top-down thing.
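
As a rough sketch of that top-down rule, assuming the unified hierarchy
interface files described in Documentation/cgroups/unified-hierarchy.txt
(cgroup.subtree_control in the parent), the check below reports whether a
parent has enabled a controller for its children.  The function name and
the directory argument are illustrative only.

```shell
#!/bin/sh
# Succeeds if controller $2 is listed in the cgroup.subtree_control file of
# the cgroup directory $1, i.e. the parent has enabled it for its children.
controller_enabled_for_children() {
    tr ' ' '\n' < "$1/cgroup.subtree_control" | grep -Fqx -- "$2"
}

# Example (the path is an assumption about where the hierarchy is mounted):
#   controller_enabled_for_children /sys/fs/cgroup memory \
#       && echo "children may distribute memory"
```

A child that does not find a controller enabled this way has nothing to
distribute further, whatever it would like to mount or enable itself.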

-- 
tejun

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
       [not found]                                           ` <20150107224430.GA28414-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
@ 2015-01-07 23:02                                             ` Eric W. Biederman
  0 siblings, 0 replies; 384+ messages in thread
From: Eric W. Biederman @ 2015-01-07 23:02 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Richard Weinberger, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andy Lutomirski,
	Ingo Molnar, Linux API, cgroups mailinglist

Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> writes:

> On Wed, Jan 07, 2015 at 04:14:40PM -0600, Eric W. Biederman wrote:
>> I see what you mean.  If it is indeed the case that a mount of cgroupfs
>> using the unified hierarchy cannot specify which controllers are
>> present under that mount, that is a very significant bug and presents a
>> very significant regression in user space flexibility.
>
> The parent always controls which controllers are made available at the
> children level.  Only if the parent enables a controller can its
> children, whether they're namespaces or not, choose to further
> distribute resources using that controller.  It's a straightforward
> top-down thing.

Ignoring namespace details for a moment. The following should be
possible with a unified hierarchy.  If it is not it is a show stopper
of a regression.

mount -t tmpfs none /sys/fs/cgroup
(cd /sys/fs/cgroup ; mkdir cpu cpuacct devices memory)
mount -t cgroupfs -o cpu /sys/fs/cgroup/cpu
mount -t cgroupfs -o cpuacct /sys/fs/cgroup/cpuacct
mount -t cgroupfs -o devices /sys/fs/cgroup/devices
mount -t cgroupfs -o memory /sys/fs/cgroup/memory

With the expectation that only the control files for the specified
controllers show up in those mounts.
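
For reference, the commands as typed above would most likely be rejected
by mount(8): the filesystem type registered by the kernel is "cgroup"
rather than "cgroupfs", and a source argument is expected before the
mount point.  Assuming the legacy (non-unified) hierarchies are what is
meant, a dry-run sketch of the per-controller layout could look like the
following.  The helper name and the "cgroup" source string are
illustrative; the function only prints the commands and mounts nothing.

```shell
#!/bin/sh
# Dry run: print (do not execute) one legacy cgroup mount per controller.
#   $1   = base directory, e.g. /sys/fs/cgroup
#   $2.. = controller names
cgroup_v1_mount_cmds() {
    base=$1
    shift
    for ctrl in "$@"; do
        printf 'mount -t cgroup -o %s cgroup %s/%s\n' "$ctrl" "$base" "$ctrl"
    done
}

cgroup_v1_mount_cmds /sys/fs/cgroup cpu cpuacct devices memory
```

Whether the same per-controller selection is available once the unified
hierarchy is in use is exactly the question under dispute here.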

That is, a unified hierarchy is fine.  Requiring that there be only one
mount point and that everyone use it is not OK and is actively a problem.

It is absolutely required to be able to avoid b0rked controllers, and
to my knowledge the only way to do that is to have multiple mounts where
we pick the controller on each mount.   Even if there is now a way that
doesn't require multiple mounts to keep b0rked controllers from being
enabled, multiple mounts still need to work to support the existing
userspace programs.

This discussion is happening because Documentation/cgroups/unified-hierarchy.txt
implies the configuration I have just described will not work with
unified hierarchies enabled.

Eric

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
  2015-01-07 23:02                                             ` Eric W. Biederman
@ 2015-01-07 23:06                                                 ` Tejun Heo
  -1 siblings, 0 replies; 384+ messages in thread
From: Tejun Heo @ 2015-01-07 23:06 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Richard Weinberger, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andy Lutomirski,
	Ingo Molnar, Linux API, cgroups mailinglist

On Wed, Jan 07, 2015 at 05:02:17PM -0600, Eric W. Biederman wrote:
> Ignoring namespace details for a moment. The following should be
> possible with a unified hierarchy.  If it is not it is a show stopper
> of a regression.

The -o SUBSYS option doesn't exist.  Jesus, at least get yourself
familiar with the basics before claiming random stuff.

-- 
tejun

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
       [not found]                                                 ` <20150107230615.GA28630-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
@ 2015-01-07 23:09                                                   ` Eric W. Biederman
  0 siblings, 0 replies; 384+ messages in thread
From: Eric W. Biederman @ 2015-01-07 23:09 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Richard Weinberger, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andy Lutomirski,
	Ingo Molnar, Linux API, cgroups mailinglist

Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> writes:

> On Wed, Jan 07, 2015 at 05:02:17PM -0600, Eric W. Biederman wrote:
>> Ignoring namespace details for a moment. The following should be
>> possible with a unified hierarchy.  If it is not it is a show stopper
>> of a regression.
>
> The -o SUBSYS option doesn't exist.  Jesus, at least get yourself
> familiar with the basics before claiming random stuff.

Not random, and I am familiar, thank you very much.

That I may have mistyped the manual command line configuration for
specifying which controllers appear on a mount point does not alter my
point.

The old options to enable selecting controllers need to continue and
need to continue to work with a unified hierarchy.

Anything else is a gratuitous regression.

Eric

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
  2015-01-07 23:09                                                   ` Eric W. Biederman
@ 2015-01-07 23:16                                                       ` Tejun Heo
  -1 siblings, 0 replies; 384+ messages in thread
From: Tejun Heo @ 2015-01-07 23:16 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Richard Weinberger, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andy Lutomirski,
	Ingo Molnar, Linux API, cgroups mailinglist

On Wed, Jan 07, 2015 at 05:09:53PM -0600, Eric W. Biederman wrote:
> That I may have mistyped the manual command line configuration for
> specifying which controllers appear on a mount point does not alter my
> point.

Hmmm?  You were talking about the old hierarchies?

> The old options to enable selecting controllers need to continue and
> need to continue to work with a unified hierarchy.
> 
> Anything else is a gratuitous regression.

I have no idea what you're on about.  If the outer system uses unified
hierarchy, the inner system should use that too.  If the outer system
doesn't use unified hierarchy, namespace support has never existed,
and even if it did, the inside could never pick and choose controllers
independent from the outside.  If the outside is co-mounting cpu and
cpuacct, the inside is either also doing that or not mounting either.

-- 
tejun

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
       [not found]                                                   ` <87fvbm2nni.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  2015-01-07 23:16                                                       ` Tejun Heo
@ 2015-01-07 23:27                                                     ` Eric W. Biederman
  1 sibling, 0 replies; 384+ messages in thread
From: Eric W. Biederman @ 2015-01-07 23:27 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Richard Weinberger, Linux Containers, Serge Hallyn,
	linux-kernel, Andy Lutomirski,
	Ingo Molnar, Linux API, cgroups mailinglist

ebiederm@xmission.com (Eric W. Biederman) writes:

> Tejun Heo <tj@kernel.org> writes:
>
>> On Wed, Jan 07, 2015 at 05:02:17PM -0600, Eric W. Biederman wrote:
>>> Ignoring namespace details for a moment. The following should be
>>> possible with a unified hierarchy.  If it is not it is a show stopper
>>> of a regression.
>>
>> The -o SUBSYS option doesn't exist.  Jesus, at least get yourself
>> familiar with the basics before claiming random stuff.

Oh, let's see: I got that command line option out of /proc/mounts, and
yes, it works.  Perhaps it doesn't if I invoke unified hierarchies, but
the option does in fact exist and work.

Now I really do need to test, report regressions, and probably send
regression fixes.  If I understand your strange ranting, I think you just
told me that the -o SUBSYS option does work with unified hierarchies.

Tejun.  I asked you specifically about this case 2 years ago at plumbers
and you personally told me this would continue to work.  I am going to
hold you to that.

Fixing bugs is one thing.  Gratuitous regressions that make supporting
existing user space applications insane is another.

Eric

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
  2015-01-07 23:27                                                     ` Eric W. Biederman
@ 2015-01-07 23:35                                                         ` Tejun Heo
  -1 siblings, 0 replies; 384+ messages in thread
From: Tejun Heo @ 2015-01-07 23:35 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Richard Weinberger, Linux Containers, Serge Hallyn,
	linux-kernel, Andy Lutomirski,
	Ingo Molnar, Linux API, cgroups mailinglist

On Wed, Jan 07, 2015 at 05:27:38PM -0600, Eric W. Biederman wrote:
> >> The -o SUBSYS option doesn't exist.  Jesus, at least get yourself
> >> familiar with the basics before claiming random stuff.
> 
> Oh, let's see: I got that command line option out of /proc/mounts, and
> yes, it works.  Perhaps it doesn't if I invoke unified hierarchies, but
> the option does in fact exist and work.

I meant that the -o SUBSYS option doesn't exist for unified hierarchy.

> Now I really do need to test, report regressions, and probably send
> regression fixes.  If I understand your strange ranting, I think you just
> told me that the -o SUBSYS option does work with unified hierarchies.

What?  Why would -o SUBSYS exist for unified hierarchy?  It's unified
for all controllers.
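
To make the distinction concrete, here is a minimal Python sketch that
reads /proc/mounts-style text and reports, for each cgroup mount, whether
it is a legacy hierarchy (controllers selected through mount options) or
a unified one (no controller options).  The helper name, the
KNOWN_CONTROLLERS set, and the "cgroup2" filesystem-type check are
assumptions made for illustration: "cgroup2" is what later mainline
kernels report, while kernels contemporary with this thread exposed the
unified hierarchy differently.

```python
from typing import List, Tuple

# Assumption: an illustrative, non-exhaustive set of legacy controller names.
KNOWN_CONTROLLERS = {
    "cpuset", "cpu", "cpuacct", "memory", "devices", "freezer",
    "hugetlb", "blkio", "net_cls", "net_prio", "perf_event", "pids",
}


def cgroup_mounts(mounts_text: str) -> List[Tuple[str, str, List[str]]]:
    """Return (mountpoint, kind, controllers) for each cgroup mount.

    kind is "legacy" for per-controller hierarchies and "unified" for
    the single hierarchy that carries no controller mount options.
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _device, mountpoint, fstype, options = fields[:4]
        if fstype == "cgroup2":
            # Unified hierarchy: controllers are not chosen at mount time.
            result.append((mountpoint, "unified", []))
        elif fstype == "cgroup":
            # Legacy hierarchy: controller names appear among the options.
            ctrls = [o for o in options.split(",") if o in KNOWN_CONTROLLERS]
            result.append((mountpoint, "legacy", ctrls))
    return result
```

With such a helper, a tool can decide per mount whether controller
options are meaningful at all before it tries to reproduce them.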

> Tejun.  I asked you specifically about this case 2 years ago at plumbers
> and you personally told me this would continue to work.  I am going to
> hold you to that.

I have no idea what you're talking about in *THIS* thread.  I'm fully
aware of what was discussed *THEN*.

> Fixing bugs is one thing.  Gratuitous regressions that make supporting
> existing user space applications insane is another.

Can you explain what problem you're actually trying to talk about
without spouting random claims about regressions?

-- 
tejun

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
       [not found]                                                         ` <20150107233553.GC28630-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
@ 2015-02-11  3:46                                                           ` Serge E. Hallyn
  0 siblings, 0 replies; 384+ messages in thread
From: Serge E. Hallyn @ 2015-02-11  3:46 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Richard Weinberger, Linux Containers, Serge Hallyn,
	linux-kernel, Andy Lutomirski,
	Ingo Molnar, Eric W. Biederman, Linux API, cgroups mailinglist

Quoting Tejun Heo (tj@kernel.org):
> On Wed, Jan 07, 2015 at 05:27:38PM -0600, Eric W. Biederman wrote:
> > >> The -o SUBSYS option doesn't exist.  Jesus, at least get yourself
> > >> familiar with the basics before claiming random stuff.
> > 
> > Oh, let's see: I got that command line option out of /proc/mounts, and
> > yes, it works.  Perhaps it doesn't if I invoke unified hierarchies, but
> > the option does in fact exist and work.
> 
> I meant that the -o SUBSYS option doesn't exist for unified hierarchy.
> 
> > Now I really do need to test, report regressions, and probably send
> > regression fixes.  If I understand your strange ranting, I think you just
> > told me that the -o SUBSYS option does work with unified hierarchies.
> 
> What?  Why would -o SUBSYS exist for unified hierarchy?  It's unified
> for all controllers.
> 
> > Tejun.  I asked you specifically about this case 2 years ago at plumbers
> > and you personally told me this would continue to work.  I am going to
> > hold you to that.
> 
> I have no idea what you're talking about in *THIS* thread.  I'm fully
> aware of what was discussed *THEN*.
> 
> > Fixing bugs is one thing.  Gratuitous regressions that make supporting
> > existing user space applications insane is another.
> 
> Can you explain what problem you're actually trying to talk about
> without spouting random claims about regressions?

A few weeks ago, in order to test the cgroup namespace patchset with lxc,
I went through the motions of getting lxc to work with unified hierarchy.
A few of the things I had to change:

1. The hierarchy number in /proc/cgroups and /proc/self/cgroup starts at
0.  It used to start at 1.  I expect many userspace parsers to be broken
by this.

2. After creating every non-leaf cgroup, we must fill in its
cgroup.subtree_control file.  This is extra work which userspace
doesn't have to do right now.

3. Let's say we want to create a freezer cgroup /foo/bar for some set of
tasks, which they will administer.  In fact let's assume we are going to
use cgroup namespaces.  We have to put the tasks into /foo/bar, unshare
the cgroup ns, then create /foo/bar/leaf, move the tasks into /foo/bar/leaf,
and then write 'freezer' into /foo/bar's subtree_control file.  (If we're
not using cgroup namespaces, then we have to do a similar thing to let the
tasks administer /foo/bar while placing them under /foo/bar/leaf.)  The
oddness I'm pointing to is that the tasks have to know that they can
create cgroups in "..".

For containers this becomes odd.  We tend to group containers by the
tasks in and under a cgroup.  We now will have to assume a convention
where we know to check for tasks in and under "..", since by definition
pid 1's cgroup (in a container) cannot have children.

4. The per-cgroup "tasks" file not existing seems odd, and is certainly
unexpected by much current software.
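
The sequence in item 3 can be sketched roughly as follows.  This is only
an illustration under stated assumptions: the function name and the
"leaf" directory name are hypothetical, the cgroup namespace unshare step
is left out, and 'freezer' is simply the controller name used in the
example above.  On a real cgroup filesystem these writes need suitable
privileges; the sketch works on any directory so it can be tried safely.

```python
from pathlib import Path
from typing import Iterable


def move_to_leaf_and_enable(parent: Path, pids: Iterable[int],
                            controllers: Iterable[str],
                            leaf_name: str = "leaf") -> Path:
    """Move processes into a leaf child, then enable controllers in parent.

    Order matters in this sketch: the processes leave the parent cgroup
    first, and only then are controllers enabled for its children.
    """
    leaf = parent / leaf_name
    leaf.mkdir(exist_ok=True)

    # One PID per write: each write_text call opens, writes, and closes.
    for pid in pids:
        (leaf / "cgroup.procs").write_text(f"{pid}\n")

    # Enable the requested controllers for the parent's children,
    # using the "+name" request syntax.
    request = " ".join(f"+{c}" for c in controllers)
    (parent / "cgroup.subtree_control").write_text(request + "\n")
    return leaf
```

A caller would pass the cgroup directory as seen from inside its own
view, for example Path("/sys/fs/cgroup/foo/bar") on a system where the
hierarchy is mounted there (an assumed mount point).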

So, if the unified hierarchy is not going to cause undue pain, existing
software really needs to start working now to use it.  It's going to be
a sizeable task for lxc.

-serge

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
  2015-02-11  3:46                                                           ` Serge E. Hallyn
@ 2015-02-11  4:09                                                               ` Tejun Heo
  -1 siblings, 0 replies; 384+ messages in thread
From: Tejun Heo @ 2015-02-11  4:09 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Richard Weinberger, Linux Containers, Serge Hallyn,
	linux-kernel, Andy Lutomirski,
	Ingo Molnar, Eric W. Biederman, Linux API, cgroups mailinglist

On Wed, Feb 11, 2015 at 04:46:16AM +0100, Serge E. Hallyn wrote:
> 1. The hierarchy number in /proc/cgroups and /proc/self/cgroup starts at
> 0.  It used to start at 1.  I expect many userspace parsers to be broken
> by this.

This is intentional.  The unified hierarchy will always have the
hierarchy number zero.  Userland needs to be updated anyway and the
unified hierarchy won't show up unless explicitly enabled.
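
For userland that needs updating, a parser that tolerates hierarchy
number zero might look like the following minimal Python sketch.  The
type and function names are hypothetical, and the only format assumed is
the colon-separated "hierarchy-ID:controller-list:path" layout shown in
the /proc/self/cgroup example at the start of this thread.

```python
from typing import List, NamedTuple, Optional


class CgroupEntry(NamedTuple):
    hierarchy_id: int
    controllers: List[str]
    path: str


def parse_proc_cgroup(text: str) -> List[CgroupEntry]:
    """Parse the contents of /proc/<pid>/cgroup into entries."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split at most twice so a ':' inside the path is preserved.
        hier, ctrls, path = line.split(":", 2)
        entries.append(
            CgroupEntry(
                hierarchy_id=int(hier),  # 0 is a valid ID, not an error
                # An empty controller field yields an empty list.
                controllers=[c for c in ctrls.split(",") if c],
                path=path,
            )
        )
    return entries


def unified_entry(entries: List[CgroupEntry]) -> Optional[CgroupEntry]:
    """Return the entry with hierarchy number 0, or None if absent."""
    for entry in entries:
        if entry.hierarchy_id == 0:
            return entry
    return None
```

The point of the sketch is that the ID is treated as plain data: nothing
assumes numbering starts at 1, and an empty controller list is accepted.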

> 2. After creating every non-leaf cgroup, we must fill in its
> cgroup.subtree_control file.  This is extra work which userspace
> doesn't have to do right now.

Again, by design.  This is how organization and control are separated
and how the differing levels of granularity are achieved.

> 3. Let's say we want to create a freezer cgroup /foo/bar for some set of

There shouldn't be a "freezer" cgroup.  The processes are categorized
according to their logical structure and controllers are applied to
the hierarchy as necessary.

> tasks, which they will administer.  In fact let's assume we are going to
> use cgroup namespaces.  We have to put the tasks into /foo/bar, unshare
> the cgroup ns, then create /foo/bar/leaf, move the tasks into /foo/bar/leaf,
> and then write 'freezer' into /foo/bar's subtree_control file.  (If we're
> not using cgroup namespaces, then we have to do a similar thing to let the
> tasks administer /foo/bar while placing them under /foo/bar/leaf.)  The
> oddness I'm pointing to is that the tasks have to know that they can
> create cgroups in "..".
> 
> For containers this becomes odd.  We tend to group containers by the
> tasks in and under a cgroup.  We now will have to assume a convention
> where we know to check for tasks in and under "..", since by definition
> pid 1's cgroup (in a container) cannot have children.

The semantics are that the parent enables distribution of a given
type of resource by enabling the controller in its subtree_control.
This scoping isn't necessary for freezer, and I'm debating whether
controllers which don't need granularity control should be enabled
unconditionally.  Right now, I'm leaning against it, mostly for
consistency.
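
The parent-enables-for-children rule can be modelled in a few lines of
Python.  This is a simplified model, not kernel code: the function names
and the dictionary representation (cgroup path mapped to the set of
controllers enabled in that cgroup's subtree_control) are assumptions
made for illustration.

```python
import posixpath
from typing import Dict, Set


def controllers_available(path: str,
                          subtree_control: Dict[str, Set[str]],
                          root_controllers: Set[str]) -> Set[str]:
    """Controllers a cgroup can use: those its parent enabled for children."""
    if path == "/":
        return set(root_controllers)
    parent = posixpath.dirname(path)
    return set(subtree_control.get(parent, set()))


def invalid_requests(subtree_control: Dict[str, Set[str]],
                     root_controllers: Set[str]) -> Dict[str, Set[str]]:
    """Map each cgroup to controllers it enables for its children
    without having received them from its own parent."""
    bad = {}
    for path, enabled in subtree_control.items():
        available = controllers_available(path, subtree_control,
                                          root_controllers)
        extra = set(enabled) - available
        if extra:
            bad[path] = extra
    return bad
```

In this model a controller reaches /foo/bar only if both the root and
/foo enable it, which is the top-down scoping described above.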

> 4. The per-cgroup "tasks" file not existing seems odd, and is certainly
> unexpected by much current software.

And, yes, everything is per-process for reasons described in
unified-hierarchy.txt.

> So, if the unified hierarchy is not going to cause undue pain, existing
> software really needs to start working now to use it.  It's going to be
> a sizeable task for lxc.

Yes, this isn't gonna be a trivial conversion.  The usage model
changes and so will a lot of controller knobs and behaviors.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
       [not found]                                                               ` <20150211040957.GC21356-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>
@ 2015-02-11  4:29                                                                 ` Serge E. Hallyn
  0 siblings, 0 replies; 384+ messages in thread
From: Serge E. Hallyn @ 2015-02-11  4:29 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Richard Weinberger, Linux Containers, Serge Hallyn,
	linux-kernel, Andy Lutomirski,
	Ingo Molnar, Eric W. Biederman, Linux API, cgroups mailinglist

Quoting Tejun Heo (tj@kernel.org):
> On Wed, Feb 11, 2015 at 04:46:16AM +0100, Serge E. Hallyn wrote:
> > 1. The hierarchy number in /proc/cgroups and /proc/self/cgroup starts at
> > 0.  It used to start at 1.  I expect many userspace parsers to be broken
> > by this.
> 
> This is intentional.  The unified hierarchy will always have the
> hierarchy number zero.  Userland needs to be updated anyway and the
> unified hierarchy won't show up unless explicitly enabled.
> 
> > 2. After creating every non-leaf cgroup, we must fill in its
> > cgroup.subtree_control file.  This is extra work which userspace
> > doesn't have to do right now.
> 
> Again, by design.  This is how organization and control are separated
> and how the differing levels of granularity are achieved.
> 
> > 3. Let's say we want to create a freezer cgroup /foo/bar for some set of
> 
> There shouldn't be a "freezer" cgroup.  The processes are categorized
> according to their logical structure and controllers are applied to
> the hierarchy as necessary.

But there can well be cgroups for which only freezer is enabled.  If
I'm wrong about that, then I am suffering a fundamental misunderstanding.

> > tasks, which they will administer.  In fact let's assume we are going to
> > use cgroup namespaces.  We have to put the tasks into /foo/bar, unshare
> > the cgroup ns, then create /foo/bar/leaf, move the tasks into /foo/bar/leaf,
> > and then write 'freezer' into /foo/bar's subtree_control file.  (If we're
> > not using cgroup namespaces, then we have to do a similar thing to let the
> > tasks administer /foo/bar while placing them under /foo/bar/leaf.)  The
> > oddness I'm pointing to is that the tasks have to know that they can
> > create cgroups in "..".
> > 
> > For containers this becomes odd.  We tend to group containers by the
> > tasks in and under a cgroup.  We now will have to assume a convention
> > where we know to check for tasks in and under "..", since by definition
> > pid 1's cgroup (in a container) cannot have children.
> 
> The semantics are that the parent enables distribution of a given
> type of resource by enabling the controller in its subtree_control.
> This scoping isn't necessary for freezer, and I'm debating whether
> controllers which don't need granularity control should be enabled
> unconditionally.  Right now, I'm leaning against it, mostly for
> consistency.

Yeah, IIUC (i.e. freezer would always be enabled?) that would be
even more confusing.

> > 4. The per-cgroup "tasks" file not existing seems odd, and is certainly
> > unexpected by much current software.
> 
> And, yes, everything is per-process for reasons described in
> unified-hierarchy.txt.
> 
> > So, if the unified hierarchy is not going to cause undue pain, existing
> > software really needs to start working now to use it.  It's going to be
> > a sizeable task for lxc.
> 
> Yes, this isn't gonna be a trivial conversion.  The usage model
> changes and so will a lot of controller knobs and behaviors.
> 
> Thanks.
> 
> -- 
> tejun

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
       [not found]                                                               ` <20150211040957.GC21356-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>
@ 2015-02-11  4:29                                                                 ` Serge E. Hallyn
  0 siblings, 0 replies; 384+ messages in thread
From: Serge E. Hallyn @ 2015-02-11  4:29 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Serge E. Hallyn, Eric W. Biederman, Richard Weinberger,
	Linux API, Linux Containers, Serge Hallyn, linux-kernel,
	Andy Lutomirski, cgroups mailinglist, Ingo Molnar

Quoting Tejun Heo (tj@kernel.org):
> On Wed, Feb 11, 2015 at 04:46:16AM +0100, Serge E. Hallyn wrote:
> > 1. Hierarchy_num in /proc/cgroups and /proc/self/cgroup start at 0.  Used
> > to start with 1.  I expect many userspace parsers to be broken by this.
> 
> This is intentional.  The unified hierarchy will always have the
> hierarchy number zero.  Userland needs to be updated anyway and the
> unified hierarchy won't show up unless explicitly enabled.
> 
> > 2. After creating every non-leaf cgroup, we must fill in the
> > cgroup.subtree_cgroups file.  This is extra work which userspace
> > doesn't have to do right now.
> 
> Again, by design.  This is how organization and control are separated
> and the differing levels of granularity is achieved.
> 
> > 3. Let's say we want to create a freezer cgroup /foo/bar for some set of
> 
> There shouldn't be a "freezer" cgroup.  The processes are categorized
> according to their logical structure and controllers are applied to
> the hierarchy as necessary.

But there can well be cgroups for which only freezer is enabled.  If
I'm wrong about that, then I am suffering a fundamental misunderstanding.

> > tasks, which they will administer.  In fact let's assume we are going to
> > use cgroup namespaces.  We have to put the tasks into /foo/bar, unshare
> > the cgroup ns, then create /foo/bar/leaf, move the tasks into /foo/bar/leaf,
> > and then write 'freezer' into /foo/bar.  (If we're not using cgroup
> > namespaces, then we have to do a similar thing to let the tasks administer
> > /foo/bar while placing them under /foo/bar/leaf).  The oddness I'm pointing
> > to is where the tasks have to know that they can create cgroups in "..".
> > 
> > For containers this becomes odd.  We tend to group containers by the
> > tasks in and under a cgroup.  We now will have to assume a convention
> > where we know to check for tasks in and under "..", since by definition
> > pid 1's cgroup (in a container) cannot have children.
> 
> The semantics is that the parent enables distribution of its given
> type of resource by enabling the controller in its subtree_control.
> This scoping isn't necessary for freezer and I'm debating whether to
> enable controllers which don't need granularity control
> unconditionally.  Right now, I'm leaning against it mostly for
> consistency.

Yeah, IIUC (i.e. freezer would always be enabled?) that would be
even-more-confusing.
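
A minimal sketch of the sequence from point 3, written against the
unified-hierarchy interface files cgroup.procs and
cgroup.subtree_control.  The helper name, the "leaf" default and the
controller passed as a parameter are assumptions made for illustration.
The cgroup namespace unshare is left out, and whether a given controller
(freezer included) can be enabled this way depends on the kernel.

```python
import os


def delegate_with_leaf(cgroup_dir, pids, controller, leaf_name="leaf"):
    """Move pids into <cgroup_dir>/<leaf_name>, then enable controller in cgroup_dir.

    Hypothetical helper: cgroup_dir stands for /foo/bar as seen through a
    mounted unified hierarchy; none of this is taken from the patchset.
    """
    leaf = os.path.join(cgroup_dir, leaf_name)
    os.makedirs(leaf, exist_ok=True)

    # Follow the order given in point 3: the tasks go into the leaf first,
    # one pid per write.
    for pid in pids:
        with open(os.path.join(leaf, "cgroup.procs"), "w") as f:
            f.write(f"{pid}\n")

    # The controller is then enabled for the children from the parent.
    with open(os.path.join(cgroup_dir, "cgroup.subtree_control"), "w") as f:
        f.write(f"+{controller}\n")
    return leaf
```

Called as delegate_with_leaf("/sys/fs/cgroup/foo/bar", [pid], "freezer"),
where the mount point is again an assumption, this leaves the tasks in
/foo/bar/leaf while /foo/bar itself is what they administer.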

> > 4. The per-cgroup "tasks" file not existing seems odd, and is certainly
> > unexpected by much current software.
> 
> And, yes, everything is per-process for reasons described in
> unified-hierarchy.txt.
> 
> > So, if the unified hierarchy is going to not cause undue pain, existing
> > software really needs to start working now to use it.  It's going to be
> > a sizeable task for lxc.
> 
> Yes, this isn't gonna be a trivial conversion.  The usage model
> changes and so will a lot of controller knobs and behaviors.
> 
> Thanks.
> 
> -- 
> tejun
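
On point 1, a parser only breaks if it assumes hierarchy numbers start
at 1.  A minimal sketch of a /proc/self/cgroup parser that treats 0 like
any other number, assuming the hierarchy-ID:controller-list:path line
format shown in the cover letter (the function name is made up for
illustration):

```python
def parse_proc_cgroup(text):
    """Parse the contents of /proc/<pid>/cgroup into (id, controllers, path) tuples."""
    entries = []
    for line in text.splitlines():
        if not line:
            continue
        # Split on the first two colons only, so the path is kept whole
        # even if it contains a colon.
        hier, controllers, path = line.split(":", 2)
        names = controllers.split(",") if controllers else []
        # Hierarchy number 0 is accepted like any other value.
        entries.append((int(hier), names, path))
    return entries
```

Feeding it the line from the cover letter gives hierarchy number 0, the
seven controller names and the path /batchjobs/c_job_id1.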

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
       [not found]                                                                 ` <20150211042942.GA27931-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
@ 2015-02-11  5:02                                                                   ` Eric W. Biederman
  2015-02-11  5:10                                                                   ` Tejun Heo
  1 sibling, 0 replies; 384+ messages in thread
From: Eric W. Biederman @ 2015-02-11  5:02 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Richard Weinberger, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andy Lutomirski,
	Ingo Molnar, Linux API, Tejun Heo, cgroups mailinglist


A comment that is slightly off topic for where this thread has gone, but
relevant if we are talking about cgroup namespaces.

If they don't implement compatibility with existing userspace, they get a
nack.  A backwards-incompatible change should figure out how to remove
the need for any namespaces.

Because that is what namespaces are about: backwards compatibility.

Eric

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
       [not found]                                                                 ` <20150211042942.GA27931-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
  2015-02-11  5:02                                                                   ` Eric W. Biederman
@ 2015-02-11  5:10                                                                   ` Tejun Heo
  1 sibling, 0 replies; 384+ messages in thread
From: Tejun Heo @ 2015-02-11  5:10 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Richard Weinberger, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andy Lutomirski,
	Ingo Molnar, Eric W. Biederman, Linux API, cgroups mailinglist

Hello,

On Wed, Feb 11, 2015 at 05:29:42AM +0100, Serge E. Hallyn wrote:
> > There shouldn't be a "freezer" cgroup.  The processes are categorized
> > according to their logical structure and controllers are applied to
> > the hierarchy as necessary.
> 
> But there can well be cgroups for which only freezer is enabled.  If
> I'm wrong about that, then I am suffering a fundamental misunderstanding.

Ah, sure, I was mostly arguing semantics.  It's just weird to call it
"freezer" cgroup.

> > The semantics is that the parent enables distribution of its given
> > type of resource by enabling the controller in its subtree_control.
> > This scoping isn't necessary for freezer and I'm debating whether to
> > enable controllers which don't need granularity control
> > unconditionally.  Right now, I'm leaning against it mostly for
> > consistency.
> 
> Yeah, IIUC (i.e. freezer would always be enabled?) that would be
> even-more-confusing.

Right, freezer is kinda weird tho.  Its feature can almost be
considered a utility feature of cgroups core rather than a separate
controller.  That said, it's most likely that it'll remain in its
current form although how it blocks tasks should definitely be
reimplemented.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
       [not found]                                                                   ` <87oap1qbv3.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-02-11  5:17                                                                     ` Tejun Heo
  0 siblings, 0 replies; 384+ messages in thread
From: Tejun Heo @ 2015-02-11  5:17 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Richard Weinberger, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andy Lutomirski,
	Ingo Molnar, Linux API, cgroups mailinglist

Hey,

On Tue, Feb 10, 2015 at 11:02:40PM -0600, Eric W. Biederman wrote:
> A slightly off topic comment, for where this thread has gone but
> relevant if we are talking about cgroup namespaces.
> 
> If they don't implement compatibility with existing userspace, they get a
> nack.  A backwards-incompatible change should figure out how to remove
> the need for any namespaces.
>
> Because that is what namespaces are about: backwards compatibility.

Are you claiming that namespaces are solely about backwards
compatibility?  i.e. to trick userland into scoping without letting it
notice?  That's a very restricted view.  Namespaces do provide
further isolation capabilities in addition to what can be achieved
otherwise, and it is logical to collect similar functionalities there.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
  2015-02-11  5:17                                                                     ` Tejun Heo
@ 2015-02-11  6:29                                                                         ` Eric W. Biederman
  -1 siblings, 0 replies; 384+ messages in thread
From: Eric W. Biederman @ 2015-02-11  6:29 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Richard Weinberger, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andy Lutomirski,
	Ingo Molnar, Linux API, cgroups mailinglist

Tejun Heo <tj@kernel.org> writes:

> Hey,
>
> On Tue, Feb 10, 2015 at 11:02:40PM -0600, Eric W. Biederman wrote:
>> A slightly off topic comment, for where this thread has gone but
>> relevant if we are talking about cgroup namespaces.
>> 
>> If they don't implement compatibility with existing userspace, they get a
>> nack.  A backwards-incompatible change should figure out how to remove
>> the need for any namespaces.
>>
>> Because that is what namespaces are about: backwards compatibility.
>
> Are you claiming that namespaces are solely about backwards
> compatibility?  i.e. to trick userland into scoping without letting it
> notice?  That's a very restricted view.  Namespaces do provide
> further isolation capabilities in addition to what can be achieved
> otherwise, and it is logical to collect similar functionalities there.

In principle a namespace is an additional layer of indirection from
names to objects.  A namespace does not invent new kinds of objects.
A namespace takes things that were previously global and gives them a
scope.

In principle, after name resolution a namespace should impose no overhead.

In general namespaces are not necessary if your scope of names
already has hierarchy.  Which means that new interfaces can almost
always be designed in such a way that you can support containers without
needing to add any special namespace support.  Which typically results
in more flexible and useful APIs for everyone, with no real code cost.



Further, in the cgroup namespace patchset I looked at a while ago, the
only reason for having a cgroup namespace was to provide a measure of
backwards compatibility with existing userspace.  I expect removing the
/proc/<pid>/cgroup file and replacing it with something in cgroupfs
itself would serve just as well if backwards compatibility is not the
objective.  Or possibly turning /proc/<pid>/cgroup into a magic
symlink onto somewhere in the unified cgroupfs itself.


I just don't see any point in doing weird silly namespace things to keep
existing userspace working when the existing userspace won't work.

As such if a namespace doesn't implement compatibility with the existing
userspace it gets my nack.

Eric

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
       [not found]                                                                         ` <87twytklkv.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-02-11 14:36                                                                           ` Tejun Heo
  0 siblings, 0 replies; 384+ messages in thread
From: Tejun Heo @ 2015-02-11 14:36 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Richard Weinberger, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andy Lutomirski,
	Ingo Molnar, Linux API, cgroups mailinglist

Hey,

On Wed, Feb 11, 2015 at 12:29:20AM -0600, Eric W. Biederman wrote:
> In general namespaces are not necessary if your scope of names
> already has hierarchy.  Which means that new interfaces can almost
> always be designed in such a way that you can support containers without
> needing to add any special namespace support.  Which typically results
> in more flexible and useful APIs for everyone, with no real code cost.

Sure, and cgroup ns support isn't doing anything weird there.  Just
bind mounting a subhierarchy is enough for the core features.  The ns
part is dealing with things which can't easily be tied to such
hierarchical scoping, like the path reported through proc, and even
handling that can be achieved by, for example, marking delegation
points in cgroup proper and forcing tasks beyond that point to
consider that as their origin when determining the path to report.
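
To make the "origin" idea concrete, here is a small sketch of reporting
a cgroup path relative to a delegation point.  It only illustrates the
scoping described above and is not how the patchset computes the path;
the function name and the example paths are assumptions.

```python
import posixpath


def scoped_cgroup_path(global_path, origin):
    """Return global_path as it would read with origin treated as "/"."""
    # relpath() yields "." when both paths are the same cgroup, and a
    # "../"-prefixed result when global_path lies outside the origin.
    rel = posixpath.relpath(global_path, origin)
    return "/" if rel == "." else "/" + rel
```

With /batchjobs/c_job_id1 as the origin, a task in
/batchjobs/c_job_id1/leaf would be reported as /leaf, and a task in the
origin itself as /.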

However, note that something like that is inherently similar to what's
being provided by other namespaces.  It is true that it can be
implemented outside namespace facility proper but that doesn't
automatically make that the right choice and it's more likely to be
worse - we'd be introducing a different way to perform about the same
thing.

So, the argument against adding a namespace interface except for backward
compatibility doesn't seem to hold water.  Like it or not, namespace
is serving as a platform for certain types of features and we'd be
foolish not to consider putting a related feature together there,
and I fail to see a valid technical argument as of yet.

> Further in the cgroup namespace patchset I looked at a while ago, the
> only reason for having a cgroup namespace was to provide a measure of
> backwards compatibility with existing userspace.  I expect removing the
> /proc/<pid>/cgroup file and replacing it with something in cgroupfs
> itself would serve just as well if backwards compatibility is not the
> objective.  Or possibly turning /proc/<pid>/cgroup into a magic
> symlink onto somewhere in the unified cgroupfs itself.

No matter what we do, we'd still need to mark the delegation point
somehow; otherwise, there's no way to produce a scoped identifier.
This isn't really about backward compatibility but rather the feature
to scope a subhierarchy properly.

> I just don't see any point in doing weird silly namespace things to keep
> existing userspace working when the existing userspace won't work.

If it's too different from existing namespaces, sure, doing something
else is definitely an option, but is it?

> As such if a namespace doesn't implement compatibility with the existing
> userspace it gets my nack.

Hmmm.... I don't think making the proposed NS support work across
all hierarchies including the traditional multiple ones would be too
difficult.  That should work then, right?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
@ 2015-02-11 14:36                                                                           ` Tejun Heo
  0 siblings, 0 replies; 384+ messages in thread
From: Tejun Heo @ 2015-02-11 14:36 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Serge E. Hallyn, Richard Weinberger, Linux API, Linux Containers,
	Serge Hallyn, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Andy Lutomirski, cgroups mailinglist, Ingo Molnar

Hey,

On Wed, Feb 11, 2015 at 12:29:20AM -0600, Eric W. Biederman wrote:
> In general namespaces are not necessary if your scope of names
> already has hierarchy.  Which means that new interfaces can almost
> always be designed in such a way that you can support containers without
> needing to add any special namespace support.  Which typically results
> in more flexible and useful APIs for everyone, with no real code cost.

Sure, and cgroup ns support isn't doing anything weird there.  Just
bind mounting a subhierarchy is enough for the core features.  The ns
part is dealing with things which can't easily be tied to such
hierarchical scoping, like the path reported through proc, and even
handling that can be achieved by, for example, marking delegation
points in cgroup proper and forcing tasks beyond that point to
consider that as their origin when determining the path to report.

However, note that something like that is inherently similar to what's
being provided by other namespaces.  It is true that it can be
implemented outside namespace facility proper but that doesn't
automatically make that the right choice and it's more likely to be
worse - we'd be introducing a different way to perform about the same
thing.

So, the argument that adding namespace interface except for backward
compatibility doesn't seem to hold water.  Like it or not, namespace
is serving as a platform for certain type of features and we'd be
foolish to not to consider putting a related feature together there
and I fail to see a valid technical argument as of yet.

> Further in the cgroup namespace patchset I looked at a while ago, the
> only reason for having a cgroup namespace was to provide a measure of
> backwards compatibility with existing userspace.  I expect removing the
> /proc/<pid>/cgroup file and replacing it with something in cgroupfs
> itself would serve just as well if backwards compatibility is not the
> objective.  Or possibly replacincg /proc/<pid>/cgroup into a magic
> symlink onto somewhere in the unified cgroupfs itself.

No matter what we do, we'd still need to mark the delegation point
somehow; otherwise, there's no way to produce a scoped identifier.
This isn't really about backward compatibility but rather the feature
to scope a subhierarcy properly.

> I just don't see any point in doing weird silly namespace things to keep
> existing userspace working when the existing userspace won't work.

If it's too different from existing namespaces, sure, doing something
is definitely an option but is it?

> As such if a namespace doesn't implement compatibility with the existing
> userspace it gets my nack.

Hmmm.... I don't think making the proposed NS support to work across
all hierarchies including the traditional multiple ones would be too
difficult.  That should work then, right?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
       [not found]                                                                     ` <20150211051704.GB24897-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
  2015-02-11  6:29                                                                         ` Eric W. Biederman
@ 2015-02-11 16:00                                                                       ` Serge E. Hallyn
  1 sibling, 0 replies; 384+ messages in thread
From: Serge E. Hallyn @ 2015-02-11 16:00 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Richard Weinberger, Linux Containers, Serge Hallyn,
	linux-kernel, Andy Lutomirski,
	Ingo Molnar, Eric W. Biederman, Linux API, cgroups mailinglist

Quoting Tejun Heo (tj@kernel.org):
> Hey,
> 
> On Tue, Feb 10, 2015 at 11:02:40PM -0600, Eric W. Biederman wrote:
> > A slightly off topic comment, for where this thread has gone but
> > relevant if we are talking about cgroup namespaces.
> > 
> > If they don't implement compatibility with existing userspace, they get a
> > nack.  A backwards-incompatible change should figure out how to remove
> > the need for any namespaces.
> >
> > Because that is what namespaces are about: backwards compatibility.
> 
> Are you claiming that namespaces are solely about backwards
> compatibility?  i.e. to trick userland into scoping without letting it
> notice?  That's a very restricted view and namespaces do provide
> further isolation capabilities in addition to what can be achieved
> otherwise and it is logical to collect similar functionalities there.

We absolutely would love to use cgroup namespaces to run older
userspace in containers.  I don't know that it's actually possible
to do both that and use unified hierarchy at the same time though,
which is unfortunate.  So an Ubuntu 12.04 container will never, afaics,
be able to run inside an Ubuntu 16.04 host that is using unified
hierarchy, without using backported newer versions of lxc (etc) in
the container.

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
       [not found]                                                                       ` <20150211160023.GA1579-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
@ 2015-02-11 16:03                                                                         ` Tejun Heo
  0 siblings, 0 replies; 384+ messages in thread
From: Tejun Heo @ 2015-02-11 16:03 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Richard Weinberger, Linux Containers, Serge Hallyn,
	linux-kernel, Andy Lutomirski,
	Ingo Molnar, Eric W. Biederman, Linux API, cgroups mailinglist

On Wed, Feb 11, 2015 at 05:00:23PM +0100, Serge E. Hallyn wrote:
> We absolutely would love to use cgroup namespaces to run older
> userspace in containers.  I don't know that it's actually possible
> to do both that and use unified hierarchy at the same time though,
> which is unfortunate.  So an Ubuntu 12.04 container will never, afaics,
> be able to run inside an ubuntu 16.04 host that is using unified
> hierarchy, without using backported newer versions of lxc (etc) in
> the container.

So, the constraint there is the controllers.  A controller can't be
attached to two hierarchies at the same time for obvious reasons, so
regardless of NS, you can't use the same controller on a unified
hierarchy *and* a traditional hierarchy.  NS doesn't add to or
subtract from the situation.  If you decide to attach a controller
to a traditional hierarchy, that's where it's gonna be available.  If
you attach it to the unified hierarchy, the same story.
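
As a rough illustration of that constraint (a sketch, not anything from
the patchset), the snippet below parses /proc/<pid>/cgroup-style text,
where each line is "hierarchy-ID:controller-list:path", and records the
hierarchy on which each controller is reported.  The function name
controller_locations is made up for this example, and the sample line
is the one from the cover letter.

```python
from typing import Dict, Tuple


def controller_locations(proc_cgroup_text: str) -> Dict[str, Tuple[int, str]]:
    """Map each controller to the (hierarchy ID, cgroup path) it is listed on."""
    locations: Dict[str, Tuple[int, str]] = {}
    for line in proc_cgroup_text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split only twice: the path itself may contain ':'.
        hierarchy_id, controllers, path = line.split(":", 2)
        for name in controllers.split(","):
            if not name:
                continue  # hierarchy with no controller attached
            if name in locations:
                # Would mean one controller on two hierarchies at once.
                raise ValueError(f"{name} reported on two hierarchies")
            locations[name] = (int(hierarchy_id), path)
    return locations


sample = "0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1"
for name, (hid, path) in controller_locations(sample).items():
    print(name, hid, path)
```

Each controller shows up on exactly one hierarchy; with or without a
cgroup namespace, only the reported path would change, not where the
controller is attached.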

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
       [not found]                                                                         ` <20150211160347.GE21356-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>
@ 2015-02-11 16:18                                                                           ` Serge E. Hallyn
  0 siblings, 0 replies; 384+ messages in thread
From: Serge E. Hallyn @ 2015-02-11 16:18 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Richard Weinberger, Linux Containers, Serge Hallyn,
	linux-kernel, Andy Lutomirski,
	Ingo Molnar, Eric W. Biederman, Linux API, cgroups mailinglist

Quoting Tejun Heo (tj@kernel.org):
> On Wed, Feb 11, 2015 at 05:00:23PM +0100, Serge E. Hallyn wrote:
> > We absolutely would love to use cgroup namespaces to run older
> > userspace in containers.  I don't know that it's actually possible
> > to do both that and use unified hierarchy at the same time though,
> > which is unfortunate.  So an Ubuntu 12.04 container will never, afaics,
> > be able to run inside an ubuntu 16.04 host that is using unified
> > hierarchy, without using backported newer versions of lxc (etc) in
> > the container.
> 
> So, the constraint there is the controllers.  A controller can't be
> attached to two hierarchies at the same time for obvious reasons, so
> regardless of NS, you can't use the same controller on a unified
> hierarchy *and* a traditional hierarchy.  NS doesn't add to or
> subtract from the situation.  If you decide to attach a controller
> to a traditional hierarchy, that's where it's gonna be available.  If
> you attach it to the unified hierarchy, the same story.

Right, exactly.

thanks,
-serge

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv1 0/8] CGroup Namespaces
       [not found]       ` <87k33wpsl3.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-07-22 18:10         ` Vincent Batts
  0 siblings, 0 replies; 384+ messages in thread
From: Vincent Batts @ 2015-07-22 18:10 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-api, Linux Containers, serge.hallyn, linux-kernel, luto,
	mingo, tj, cgroups

Has there been further movement on CLONE_NEWCGROUP outside of this?


vb

On Sun, Oct 19, 2014 at 12:54 AM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> Aditya Kali <adityakali@google.com> writes:
>
>> Second take at the Cgroup Namespace patch-set.
>>
>> Major changes from RFC (V0):
>> 1. setns support for cgroupns
>> 2. 'mount -t cgroup cgroup <mntpt>' from inside a cgroupns now
>>    mounts the cgroup hierarchy with cgroupns-root as the filesystem root.
>> 3. writes to cgroup files outside of cgroupns-root are not allowed
>> 4. visibility of /proc/<pid>/cgroup is further restricted by not showing
>>    anything if the <pid> is in a sibling cgroupns and its cgroup falls outside
>>    your cgroupns-root.
>>
>> More details in the writeup below.
>
> This definitely looks like the right direction to go, and something that
> in some form or another I had been asking for since cgroups were merged.
> So I am very glad to see this work moving forward.
>
> I had hoped that we might just be able to be clever with remounting
> cgroupfs but 2 things stand in the way.
> 1) /proc/<pid>/cgroup (but proc could capture that).
> 2) providing a hard guarantee that tasks stay within a subset of the
>    cgroup hierarchy.
>
> So I think this clearly meets the requirements for a new namespace.
>
> We need to have the discussion on chmod of files on cgroupfs.  There is
> a notion that has floated around that only systemd or only root (with
> the appropriate capabilities) should be allowed to set resource limits
> in cgroupfs.  In a practical reality that is nonsense.  If an attribute
> is properly bound in its hierarchy it should be safe to change.
>
> Not all attributes are properly bound to hierarchy and some are or at
> least were dangerous for anyone except root to set.  So I suggest that a
> CFTYPE flag perhaps CFTYPE_UNPRIV be added for attributes that are safe
> to allow anyone to set, and require CFTYPE_UNPRIV be set before we chmod
> a cgroup attribute from root.
>
> That would be complementary work, and not strictly tied to the cgroup
> namespaces but unprivileged cgroup namespaces don't make much sense
> without that work.
>
> Eric
>
>> Background
>>   Cgroups and Namespaces are used together to create “virtual”
>>   containers that isolate the host environment from the processes
>>   running in the container. But since cgroups themselves are not
>>   “virtualized”, the task is always able to see the global cgroup view
>>   through the cgroupfs mount and via the /proc/self/cgroup file.
>>
>>   $ cat /proc/self/cgroup
>>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>>
>>   This exposure of cgroup names to the processes running inside a
>>   container results in some problems:
>>   (1) The container names are typically host-container-management-agent
>>       (systemd, docker/libcontainer, etc.) data and leaking its name (or
>>       leaking the hierarchy) reveals too much information about the host
>>       system.
>>   (2) It makes the container migration across machines (CRIU) more
>>       difficult as the container names need to be unique across the
>>       machines in the migration domain.
>>   (3) It makes it difficult to run container management tools (like
>>       docker/libcontainer, lmctfy, etc.) within virtual containers
>>       without adding dependency on some state/agent present outside the
>>       container.
>>
>>   Note that the feature proposed here is completely different than the
>>   “ns cgroup” feature which existed in the linux kernel until recently.
>>   The ns cgroup also attempted to connect cgroups and namespaces by
>>   creating a new cgroup every time a new namespace was created. It did
>>   not solve any of the above mentioned problems and was later dropped
>>   from the kernel. Incidentally though, it used the same config option
>>   name CONFIG_CGROUP_NS as used in my prototype!
>>
>> Introducing CGroup Namespaces
>>   With unified cgroup hierarchy
>>   (Documentation/cgroups/unified-hierarchy.txt), the containers can now
>>   have a much more coherent cgroup view and it's easy to associate a
>>   container with a single cgroup. This also allows us to virtualize the
>>   cgroup view for tasks inside the container.
>>
>>   The new CGroup Namespace allows a process to “unshare” its cgroup
>>   hierarchy starting from the cgroup it's currently in.
>>   For example:
>>   $ cat /proc/self/cgroup
>>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>>   $ ls -l /proc/self/ns/cgroup
>>   lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
>>   $ ~/unshare -c  # calls unshare(CLONE_NEWCGROUP) and exec’s /bin/bash
>>   [ns]$ ls -l /proc/self/ns/cgroup
>>   lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
>>   # From within new cgroupns, process sees that it's in the root cgroup
>>   [ns]$ cat /proc/self/cgroup
>>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
>>
>>   # From global cgroupns:
>>   $ cat /proc/<pid>/cgroup
>>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>>
>>   # Unshare cgroupns along with userns and mountns
>>   # Following calls unshare(CLONE_NEWCGROUP|CLONE_NEWUSER|CLONE_NEWNS), then
>>   # sets up uid/gid map and exec’s /bin/bash
>>   $ ~/unshare -c -u -m
>>
>>   # Originally, we were in /batchjobs/c_job_id1 cgroup. Mount our own cgroup
>>   # hierarchy.
>>   [ns]$ mount -t cgroup cgroup /tmp/cgroup
>>   [ns]$ ls -l /tmp/cgroup
>>   total 0
>>   -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.controllers
>>   -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.populated
>>   -rw-r--r-- 1 root root 0 2014-10-13 09:25 cgroup.procs
>>   -rw-r--r-- 1 root root 0 2014-10-13 09:32 cgroup.subtree_control
>>
>>   The cgroupns-root (/batchjobs/c_job_id1 in above example) becomes the
>>   filesystem root for the namespace specific cgroupfs mount.
>>
>>   The virtualization of /proc/self/cgroup file combined with restricting
>>   the view of cgroup hierarchy by namespace-private cgroupfs mount
>>   should provide a completely isolated cgroup view inside the container.
>>
>>   In its current form, the cgroup namespaces patchset provides the following
>>   behavior:
>>
>>   (1) The “root” cgroup for a cgroup namespace is the cgroup in which
>>       the process calling unshare is running.
>>       For ex. if a process in /batchjobs/c_job_id1 cgroup calls unshare,
>>       cgroup /batchjobs/c_job_id1 becomes the cgroupns-root.
>>       For the init_cgroup_ns, this is the real root (“/”) cgroup
>>       (identified in code as cgrp_dfl_root.cgrp).
>>
>>   (2) The cgroupns-root cgroup does not change even if the namespace
>>       creator process later moves to a different cgroup.
>>       $ ~/unshare -c # unshare cgroupns in some cgroup
>>       [ns]$ cat /proc/self/cgroup
>>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
>>       [ns]$ mkdir sub_cgrp_1
>>       [ns]$ echo 0 > sub_cgrp_1/cgroup.procs
>>       [ns]$ cat /proc/self/cgroup
>>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
>>
>>   (3) Each process gets its CGROUPNS specific view of
>>       /proc/<pid>/cgroup.
>>   (a) Processes running inside the cgroup namespace will be able to see
>>       cgroup paths (in /proc/self/cgroup) only inside their root cgroup
>>       [ns]$ sleep 100000 &  # From within unshared cgroupns
>>       [1] 7353
>>       [ns]$ echo 7353 > sub_cgrp_1/cgroup.procs
>>       [ns]$ cat /proc/7353/cgroup
>>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
>>
>>   (b) From global cgroupns, the real cgroup path will be visible:
>>       $ cat /proc/7353/cgroup
>>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
>>
>>   (c) From a sibling cgroupns (cgroupns rooted at a sibling cgroup), no cgroup
>>       path will be visible:
>>       # ns2's cgroupns-root is at '/batchjobs/c_job_id2'
>>       [ns2]$ cat /proc/7353/cgroup
>>       [ns2]$
>>       This is the same as when the cgroup hierarchy is not mounted at all.
>>       (In a correct container setup though, it should not be possible to
>>        access PIDs in another container in the first place.)
>>
>>   (4) Processes inside a cgroupns are not allowed to move out of the
>>       cgroupns-root. This is true even if a privileged process in global
>>       cgroupns tries to move the process out of its cgroupns-root.
>>
>>       # From global cgroupns
>>       $ cat /proc/7353/cgroup
>>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
>>       # cgroupns-root for 7353 is /batchjobs/c_job_id1
>>       $ echo 7353 > batchjobs/c_job_id2/cgroup.procs
>>       -bash: echo: write error: Operation not permitted
>>
>>   (5) Setns to another cgroup namespace is allowed only when:
>>       (a) process has CAP_SYS_ADMIN in its current userns
>>       (b) process has CAP_SYS_ADMIN in the target cgroupns' userns
>>       (c) the process's current cgroup is a descendant of the cgroupns-root
>>           of the target namespace.
>>       (d) the target cgroupns-root is a descendant of the current cgroupns-root.
>>       The last check (d) prevents processes from escaping their cgroupns-root by
>>       attaching to parent cgroupns. Thus, setns is allowed only when the process
>>       is trying to restrict itself to a deeper cgroup hierarchy.
>>
>>   (6) When some thread from a multi-threaded process unshares its
>>       cgroup-namespace, the new cgroupns gets applied to the entire
>>       process (all the threads). This should be OK since
>>       unified-hierarchy only allows process-level containerization. So
>>       all the threads in the process will have the same cgroup. And both
>>       - changing cgroups and unsharing namespaces - are protected under
>>       threadgroup_lock(task).
>>
>>   (7) The cgroup namespace is alive as long as there is at least one
>>       process inside it. When the last process exits, the cgroup
>>       namespace is destroyed. The cgroupns-root and the actual cgroups
>>       remain though.
>>
>>   (8) 'mount -t cgroup cgroup <mntpt>' when called from within cgroupns mounts
>>       the unified cgroup hierarchy with cgroupns-root as the filesystem root.
>>       The process needs CAP_SYS_ADMIN in its userns and mntns. This allows the
>>       container management tools to be run inside the containers transparently.
>>
>> Implementation
>>   The current patch-set is based on top of Tejun Heo's cgroup tree (for-next
>>   branch). It's fairly non-intrusive and provides the above-mentioned
>>   features.
>>
>> Possible extensions of CGROUPNS:
>>   (1) The Documentation/cgroups/unified-hierarchy.txt mentions use of
>>       capabilities to restrict cgroups to administrative users. CGroup
>>       namespaces could be of help here. With cgroup namespaces, it might
>>       be possible to delegate administration of sub-cgroups under a
>>       cgroupns-root to the cgroupns owner.
>
>
>
>
>> ---
>>  fs/kernfs/dir.c                  |  53 +++++++++---
>>  fs/kernfs/mount.c                |  48 +++++++++++
>>  fs/proc/namespaces.c             |   3 +
>>  include/linux/cgroup.h           |  41 +++++++++-
>>  include/linux/cgroup_namespace.h |  62 +++++++++++++++
>>  include/linux/kernfs.h           |   5 ++
>>  include/linux/nsproxy.h          |   2 +
>>  include/linux/proc_ns.h          |   4 +
>>  include/uapi/linux/sched.h       |   3 +-
>>  init/Kconfig                     |   9 +++
>>  kernel/Makefile                  |   1 +
>>  kernel/cgroup.c                  | 139 ++++++++++++++++++++++++++------
>>  kernel/cgroup_namespace.c        | 168 +++++++++++++++++++++++++++++++++++++++
>>  kernel/fork.c                    |   2 +-
>>  kernel/nsproxy.c                 |  19 ++++-
>>  15 files changed, 518 insertions(+), 41 deletions(-)
>>  create mode 100644 include/linux/cgroup_namespace.h
>>  create mode 100644 kernel/cgroup_namespace.c
>>
>> [PATCHv1 1/8] kernfs: Add API to generate relative kernfs path
>> [PATCHv1 2/8] sched: new clone flag CLONE_NEWCGROUP for cgroup
>> [PATCHv1 3/8] cgroup: add function to get task's cgroup on default
>> [PATCHv1 4/8] cgroup: export cgroup_get() and cgroup_put()
>> [PATCHv1 5/8] cgroup: introduce cgroup namespaces
>> [PATCHv1 6/8] cgroup: restrict cgroup operations within task's cgroupns
>> [PATCHv1 7/8] cgroup: cgroup namespace setns support
>> [PATCHv1 8/8] cgroup: mount cgroupns-root when inside non-init cgroupns
>> _______________________________________________
>> Containers mailing list
>> Containers@lists.linux-foundation.org
>> https://lists.linuxfoundation.org/mailman/listinfo/containers
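
For reference, the setns conditions in item (5) of the quoted writeup
can be sketched as a small predicate.  This is only a model of the
rules as worded there, not the kernel code: the function names are
hypothetical, capabilities are reduced to two booleans, and it assumes
a cgroup counts as a descendant of itself, which the writeup does not
spell out.

```python
from typing import List


def _parts(path: str) -> List[str]:
    """Split a cgroup path into its non-empty components."""
    return [p for p in path.split("/") if p]


def is_descendant(path: str, ancestor: str) -> bool:
    """True if path equals ancestor or lies below it (assumption: inclusive)."""
    ancestor_parts = _parts(ancestor)
    return _parts(path)[: len(ancestor_parts)] == ancestor_parts


def may_setns(current_cgroup: str, current_root: str, target_root: str,
              cap_in_current_userns: bool, cap_in_target_userns: bool) -> bool:
    """Model of conditions (a)-(d) for joining another cgroup namespace."""
    return (
        cap_in_current_userns                               # (a)
        and cap_in_target_userns                            # (b)
        and is_descendant(current_cgroup, target_root)      # (c)
        and is_descendant(target_root, current_root)        # (d)
    )


# Restricting to a deeper root is allowed.
print(may_setns("/batchjobs/c_job_id1/sub_cgrp_1", "/batchjobs",
                "/batchjobs/c_job_id1", True, True))   # True
# Attaching to a parent namespace is refused by (d).
print(may_setns("/batchjobs/c_job_id1", "/batchjobs/c_job_id1",
                "/batchjobs", True, True))             # False
```

Under this reading, (d) is what keeps a process from escaping upwards,
and (c) keeps it from joining a namespace whose root it is not already
inside.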

^ permalink raw reply	[flat|nested] 384+ messages in thread

* Re: [PATCHv1 0/8] CGroup Namespaces
       [not found]       ` <87k33wpsl3.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-07-22 18:10         ` Vincent Batts
  0 siblings, 0 replies; 384+ messages in thread
From: Vincent Batts @ 2015-07-22 18:10 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Aditya Kali, linux-api, Linux Containers, serge.hallyn,
	linux-kernel, luto, tj, cgroups, mingo

Has there been further movement on CLONE_NEWCGROUP outside of this?


vb

On Sun, Oct 19, 2014 at 12:54 AM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> Aditya Kali <adityakali@google.com> writes:
>
>> Second take at the Cgroup Namespace patch-set.
>>
>> Major changes from RFC (V0):
>> 1. setns support for cgroupns
>> 2. 'mount -t cgroup cgroup <mntpt>' from inside a cgroupns now
>>    mounts the cgroup hierarchy with cgroupns-root as the filesystem root.
>> 3. writes to cgroup files outside of cgroupns-root are not allowed
>> 4. visibility of /proc/<pid>/cgroup is further restricted by not showing
>>    anything if the <pid> is in a sibling cgroupns and its cgroup falls outside
>>    your cgroupns-root.
>>
>> More details in the writeup below.
>
> This definitely looks like the right direction to go, and something that
> in some form or another I had been asking for since cgroups were merged.
> So I am very glad to see this work moving forward.
>
> I had hoped that we might just be able to be clever with remounting
> cgroupfs but 2 things stand in the way.
> 1) /proc/<pid>/cgroup (but proc could capture that).
> 2) providing a hard guarantee that tasks stay within a subset of the
>    cgroup hierarchy.
>
> So I think this clearly meets the requirements for a new namespace.
>
> We need to have the discussion on chmod of files on cgroupfs.  There is
> a notion that has floated around that only systemd or only root (with
> the appropriate capabilities) should be allowed to set resource limits
> in cgroupfs.  In practical reality that is nonsense.  If an attribute
> is properly bound in its hierarchy it should be safe to change.
>
> Not all attributes are properly bound to hierarchy and some are or at
> least were dangerous for anyone except root to set.  So I suggest that a
> CFTYPE flag, perhaps CFTYPE_UNPRIV, be added for attributes that are safe
> to allow anyone to set, and require CFTYPE_UNPRIV be set before we chmod
> a cgroup attribute from root.
>
> That would be complementary work, and not strictly tied to the cgroup
> namespaces, but unprivileged cgroup namespaces don't make much sense
> without that work.
>
> Eric
>
>> Background
>>   Cgroups and Namespaces are used together to create “virtual”
>>   containers that isolate the host environment from the processes
>>   running in the container. But since cgroups themselves are not
>>   “virtualized”, a task is always able to see the global cgroup view
>>   through the cgroupfs mount and via the /proc/self/cgroup file.
>>
>>   $ cat /proc/self/cgroup
>>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
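Each line of /proc/<pid>/cgroup, like the one quoted above, has three
colon-separated fields: hierarchy ID, controller list, and cgroup path.
As a rough sketch only (the struct and function names below are made up
for illustration and are not part of the patch set), such a line can be
split like this:

```c
#include <stdlib.h>
#include <string.h>

/* One parsed line of /proc/<pid>/cgroup (illustrative type, not kernel API). */
struct cg_entry {
    int  hierarchy_id;
    char controllers[128];
    char path[256];
};

/*
 * Parse one "id:controllers:path" line. Returns 0 on success and -1 on
 * malformed or oversized input. A trailing newline, if present, is dropped.
 * Only the first two colons are treated as separators, so the path field
 * is kept intact even if it contains a colon.
 */
int cg_parse_line(const char *line, struct cg_entry *e)
{
    const char *c1 = strchr(line, ':');
    const char *c2 = c1 ? strchr(c1 + 1, ':') : NULL;
    size_t clen, plen;

    if (!c1 || !c2 || c1 == line)
        return -1;

    e->hierarchy_id = atoi(line);

    clen = (size_t)(c2 - c1 - 1);
    plen = strcspn(c2 + 1, "\n");
    if (clen >= sizeof(e->controllers) || plen >= sizeof(e->path))
        return -1;

    memcpy(e->controllers, c1 + 1, clen);
    e->controllers[clen] = '\0';
    memcpy(e->path, c2 + 1, plen);
    e->path[plen] = '\0';
    return 0;
}
```

In the examples in this writeup, only the path field differs between the
view from inside a cgroup namespace and the view from the global one.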
>>
>>   This exposure of cgroup names to the processes running inside a
>>   container results in some problems:
>>   (1) The container names are typically data belonging to the host's
>>       container management agent (systemd, docker/libcontainer, etc.),
>>       and leaking these names (or the hierarchy) reveals too much
>>       information about the host system.
>>   (2) It makes container migration across machines (CRIU) more
>>       difficult, as the container names need to be unique across the
>>       machines in the migration domain.
>>   (3) It makes it difficult to run container management tools (like
>>       docker/libcontainer, lmctfy, etc.) within virtual containers
>>       without adding dependency on some state/agent present outside the
>>       container.
>>
>>   Note that the feature proposed here is completely different from the
>>   “ns cgroup” feature which existed in the Linux kernel until recently.
>>   The ns cgroup also attempted to connect cgroups and namespaces by
>>   creating a new cgroup every time a new namespace was created. It did
>>   not solve any of the above-mentioned problems and was later dropped
>>   from the kernel. Incidentally though, it used the same config option
>>   name CONFIG_CGROUP_NS as used in my prototype!
>>
>> Introducing CGroup Namespaces
>>   With the unified cgroup hierarchy
>>   (Documentation/cgroups/unified-hierarchy.txt), containers can now
>>   have a much more coherent cgroup view and it is easy to associate a
>>   container with a single cgroup. This also allows us to virtualize the
>>   cgroup view for tasks inside the container.
>>
>>   The new CGroup Namespace allows a process to “unshare” its cgroup
>>   hierarchy starting from the cgroup it is currently in.
>>   For example:
>>   $ cat /proc/self/cgroup
>>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>>   $ ls -l /proc/self/ns/cgroup
>>   lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
>>   $ ~/unshare -c  # calls unshare(CLONE_NEWCGROUP) and exec’s /bin/bash
>>   [ns]$ ls -l /proc/self/ns/cgroup
>>   lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
>>   # From within the new cgroupns, the process sees that it is in the root cgroup
>>   [ns]$ cat /proc/self/cgroup
>>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
>>
>>   # From global cgroupns:
>>   $ cat /proc/<pid>/cgroup
>>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>>
>>   # Unshare cgroupns along with userns and mountns.
>>   # The following calls unshare(CLONE_NEWCGROUP|CLONE_NEWUSER|CLONE_NEWNS), then
>>   # sets up the uid/gid map and exec’s /bin/bash
>>   $ ~/unshare -c -u -m
>>
>>   # Originally, we were in /batchjobs/c_job_id1 cgroup. Mount our own cgroup
>>   # hierarchy.
>>   [ns]$ mount -t cgroup cgroup /tmp/cgroup
>>   [ns]$ ls -l /tmp/cgroup
>>   total 0
>>   -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.controllers
>>   -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.populated
>>   -rw-r--r-- 1 root root 0 2014-10-13 09:25 cgroup.procs
>>   -rw-r--r-- 1 root root 0 2014-10-13 09:32 cgroup.subtree_control
>>
>>   The cgroupns-root (/batchjobs/c_job_id1 in the above example) becomes the
>>   filesystem root for the namespace-specific cgroupfs mount.
>>
>>   The virtualization of the /proc/self/cgroup file, combined with restricting
>>   the view of the cgroup hierarchy through a namespace-private cgroupfs mount,
>>   should provide a completely isolated cgroup view inside the container.
>>
>>   In its current form, the cgroup namespaces patch set provides the following
>>   behavior:
>>
>>   (1) The “root” cgroup for a cgroup namespace is the cgroup in which
>>       the process calling unshare is running.
>>       For example, if a process in the /batchjobs/c_job_id1 cgroup calls
>>       unshare, cgroup /batchjobs/c_job_id1 becomes the cgroupns-root.
>>       For the init_cgroup_ns, this is the real root (“/”) cgroup
>>       (identified in code as cgrp_dfl_root.cgrp).
>>
>>   (2) The cgroupns-root cgroup does not change even if the namespace
>>       creator process later moves to a different cgroup.
>>       $ ~/unshare -c # unshare cgroupns in some cgroup
>>       [ns]$ cat /proc/self/cgroup
>>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
>>       [ns]$ mkdir sub_cgrp_1
>>       [ns]$ echo 0 > sub_cgrp_1/cgroup.procs
>>       [ns]$ cat /proc/self/cgroup
>>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
>>
>>   (3) Each process gets its cgroupns-specific view of
>>       /proc/<pid>/cgroup.
>>   (a) Processes running inside the cgroup namespace will be able to see
>>       cgroup paths (in /proc/self/cgroup) only inside their root cgroup:
>>       [ns]$ sleep 100000 &  # From within unshared cgroupns
>>       [1] 7353
>>       [ns]$ echo 7353 > sub_cgrp_1/cgroup.procs
>>       [ns]$ cat /proc/7353/cgroup
>>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
>>
>>   (b) From global cgroupns, the real cgroup path will be visible:
>>       $ cat /proc/7353/cgroup
>>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
>>
>>   (c) From a sibling cgroupns (a cgroupns rooted at a sibling cgroup), no cgroup
>>       path will be visible:
>>       # ns2's cgroupns-root is at '/batchjobs/c_job_id2'
>>       [ns2]$ cat /proc/7353/cgroup
>>       [ns2]$
>>       This is the same as when the cgroup hierarchy is not mounted at all.
>>       (In a correct container setup though, it should not be possible to
>>        access PIDs in another container in the first place.)
>>
>>   (4) Processes inside a cgroupns are not allowed to move out of the
>>       cgroupns-root. This is true even if a privileged process in the global
>>       cgroupns tries to move the process out of its cgroupns-root.
>>
>>       # From global cgroupns
>>       $ cat /proc/7353/cgroup
>>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
>>       # cgroupns-root for 7353 is /batchjobs/c_job_id1
>>       $ echo 7353 > batchjobs/c_job_id2/cgroup.procs
>>       -bash: echo: write error: Operation not permitted
>>
>>   (5) Setns to another cgroup namespace is allowed only when:
>>       (a) the process has CAP_SYS_ADMIN in its current userns
>>       (b) the process has CAP_SYS_ADMIN in the target cgroupns' userns
>>       (c) the process's current cgroup is a descendant of the cgroupns-root
>>           of the target namespace
>>       (d) the target cgroupns-root is a descendant of the current cgroupns-root
>>       The last check (d) prevents processes from escaping their cgroupns-root by
>>       attaching to a parent cgroupns. Thus, setns is allowed only when the process
>>       is trying to restrict itself to a deeper cgroup hierarchy.
>>
>>   (6) When some thread from a multi-threaded process unshares its
>>       cgroup namespace, the new cgroupns gets applied to the entire
>>       process (all the threads). This should be OK since the
>>       unified hierarchy only allows process-level containerization, so
>>       all the threads in the process will have the same cgroup. Both
>>       changing cgroups and unsharing namespaces are protected under
>>       threadgroup_lock(task).
>>
>>   (7) The cgroup namespace is alive as long as there is at least one
>>       process inside it. When the last process exits, the cgroup
>>       namespace is destroyed. The cgroupns-root and the actual cgroups
>>       remain though.
>>
>>   (8) 'mount -t cgroup cgroup <mntpt>', when called from within a cgroupns, mounts
>>       the unified cgroup hierarchy with cgroupns-root as the filesystem root.
>>       The process needs CAP_SYS_ADMIN in its userns and mntns. This allows
>>       container management tools to be run inside the containers transparently.
>>
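Rules (3), (4) and (5) above all reduce to prefix checks on cgroup paths.
A minimal userspace sketch of that logic follows, assuming absolute,
'/'-separated paths with no trailing slash. The function and struct names
are hypothetical, and this is not the code from the patches, which works
on cgroup/kernfs objects rather than strings. Check (c) follows one
reading of the wording in (5)(c): the caller's cgroup must lie under the
target namespace's cgroupns-root.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* True if cgroup path `path` equals `root` or lies somewhere below it. */
bool cg_is_descendant(const char *root, const char *path)
{
    size_t n = strlen(root);

    if (strcmp(root, "/") == 0)
        return path[0] == '/';
    /* Require a component boundary so "/a/b" does not match "/a/bc". */
    return strncmp(root, path, n) == 0 &&
           (path[n] == '\0' || path[n] == '/');
}

/*
 * Rule (3): the path that a reader whose cgroupns-root is `ns_root` sees
 * for a task whose real cgroup is `real`. Returns false (nothing shown)
 * when the task's cgroup is outside the reader's cgroupns-root.
 */
bool cg_ns_view(const char *ns_root, const char *real,
                char *out, size_t outlen)
{
    size_t n = strlen(ns_root);

    if (!cg_is_descendant(ns_root, real))
        return false;
    if (strcmp(ns_root, "/") == 0)
        snprintf(out, outlen, "%s", real);      /* global view: full path */
    else if (real[n] == '\0')
        snprintf(out, outlen, "/");             /* task sits at the ns root */
    else
        snprintf(out, outlen, "%s", real + n);  /* strip the ns root prefix */
    return true;
}

/*
 * Rule (4): a task may only be attached to cgroups under its own
 * cgroupns-root, regardless of who performs the write.
 */
bool cg_may_attach(const char *task_ns_root, const char *dst)
{
    return cg_is_descendant(task_ns_root, dst);
}

/* Rule (5): inputs to the setns decision, as listed in the writeup. */
struct setns_req {
    bool cap_in_current_userns;   /* (a) */
    bool cap_in_target_userns;    /* (b) */
    const char *task_cgroup;      /* real path of the caller's cgroup */
    const char *current_ns_root;  /* caller's cgroupns-root */
    const char *target_ns_root;   /* cgroupns-root of the target namespace */
};

bool cg_setns_allowed(const struct setns_req *r)
{
    return r->cap_in_current_userns &&
           r->cap_in_target_userns &&
           cg_is_descendant(r->target_ns_root, r->task_cgroup) &&   /* (c) */
           cg_is_descendant(r->current_ns_root, r->target_ns_root); /* (d) */
}
```

With the paths used in the writeup, cg_ns_view("/batchjobs/c_job_id1",
"/batchjobs/c_job_id1/sub_cgrp_1", ...) yields "/sub_cgrp_1", the same
call with "/batchjobs/c_job_id2" as the reader's root shows nothing, and
cg_may_attach("/batchjobs/c_job_id1", "/batchjobs/c_job_id2") is false.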
>> Implementation
>>   The current patch-set is based on top of Tejun Heo's cgroup tree (for-next
>>   branch). It is fairly non-intrusive and provides the above-mentioned
>>   features.
>>
>> Possible extensions of CGROUPNS:
>>   (1) The Documentation/cgroups/unified-hierarchy.txt mentions use of
>>       capabilities to restrict cgroups to administrative users. CGroup
>>       namespaces could be of help here. With cgroup namespaces, it might
>>       be possible to delegate administration of sub-cgroups under a
>>       cgroupns-root to the cgroupns owner.
>
>
>
>
>> ---
>>  fs/kernfs/dir.c                  |  53 +++++++++---
>>  fs/kernfs/mount.c                |  48 +++++++++++
>>  fs/proc/namespaces.c             |   3 +
>>  include/linux/cgroup.h           |  41 +++++++++-
>>  include/linux/cgroup_namespace.h |  62 +++++++++++++++
>>  include/linux/kernfs.h           |   5 ++
>>  include/linux/nsproxy.h          |   2 +
>>  include/linux/proc_ns.h          |   4 +
>>  include/uapi/linux/sched.h       |   3 +-
>>  init/Kconfig                     |   9 +++
>>  kernel/Makefile                  |   1 +
>>  kernel/cgroup.c                  | 139 ++++++++++++++++++++++++++------
>>  kernel/cgroup_namespace.c        | 168 +++++++++++++++++++++++++++++++++++++++
>>  kernel/fork.c                    |   2 +-
>>  kernel/nsproxy.c                 |  19 ++++-
>>  15 files changed, 518 insertions(+), 41 deletions(-)
>>  create mode 100644 include/linux/cgroup_namespace.h
>>  create mode 100644 kernel/cgroup_namespace.c
>>
>> [PATCHv1 1/8] kernfs: Add API to generate relative kernfs path
>> [PATCHv1 2/8] sched: new clone flag CLONE_NEWCGROUP for cgroup
>> [PATCHv1 3/8] cgroup: add function to get task's cgroup on default
>> [PATCHv1 4/8] cgroup: export cgroup_get() and cgroup_put()
>> [PATCHv1 5/8] cgroup: introduce cgroup namespaces
>> [PATCHv1 6/8] cgroup: restrict cgroup operations within task's cgroupns
>> [PATCHv1 7/8] cgroup: cgroup namespace setns support
>> [PATCHv1 8/8] cgroup: mount cgroupns-root when inside non-init cgroupns
>> _______________________________________________
>> Containers mailing list
>> Containers@lists.linux-foundation.org
>> https://lists.linuxfoundation.org/mailman/listinfo/containers
> _______________________________________________
> Containers mailing list
> Containers@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers


end of thread, other threads:[~2015-07-22 18:10 UTC | newest]

Thread overview: 384+ messages
     [not found] <adityakali-cgroupns>
2014-07-17 19:52 ` [PATCH 0/5] RFC: CGroup Namespaces Aditya Kali
2014-07-17 19:52   ` Aditya Kali
2014-07-17 19:52   ` [PATCH 2/5] sched: new clone flag CLONE_NEWCGROUP for cgroup namespace Aditya Kali
2014-07-17 19:52     ` Aditya Kali
2014-07-24 17:01     ` Serge Hallyn
2014-07-24 17:01       ` Serge Hallyn
2014-07-31 19:48       ` Aditya Kali
2014-07-31 19:48         ` Aditya Kali
2014-08-04 23:12         ` Serge Hallyn
2014-08-04 23:12           ` Serge Hallyn
     [not found]         ` <CAGr1F2FAiSFR_Y3t1=eBVoAtJvh4m=cNUi+vG146nDkgtBjisQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-08-04 23:12           ` Serge Hallyn
2014-07-31 19:48       ` Aditya Kali
     [not found]     ` <1405626731-12220-3-git-send-email-adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2014-07-24 17:01       ` Serge Hallyn
2014-07-18 16:00   ` [PATCH 0/5] RFC: CGroup Namespaces Serge Hallyn
2014-07-18 16:00     ` Serge Hallyn
2014-07-24 16:10   ` Serge Hallyn
2014-07-24 16:10     ` Serge Hallyn
2014-07-24 16:36   ` Serge Hallyn
2014-07-24 16:36     ` Serge Hallyn
2014-07-25 19:29     ` Aditya Kali
2014-07-25 19:29       ` Aditya Kali
2014-07-25 20:27       ` Andy Lutomirski
2014-07-25 20:27         ` Andy Lutomirski
     [not found]       ` <CAGr1F2GcAema-E2q6PFj=R0Z505iD7JshrMuMdfPTJ95wMiQMA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-07-25 20:27         ` Andy Lutomirski
2014-07-29  4:51         ` Serge E. Hallyn
2014-07-29  4:51       ` Serge E. Hallyn
2014-07-29  4:51         ` Serge E. Hallyn
     [not found]         ` <20140729045159.GB31047-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
2014-07-29 15:08           ` Andy Lutomirski
2014-07-29 15:08             ` Andy Lutomirski
     [not found]             ` <CALCETrW5yQLo-SvDgqjt881OD1GnuxMmGKjoohYT4nwtYw=9+w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-07-29 16:06               ` Serge E. Hallyn
2014-07-29 16:06                 ` Serge E. Hallyn
2014-07-25 19:29     ` Aditya Kali
     [not found]   ` <1405626731-12220-1-git-send-email-adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2014-07-17 19:52     ` [PATCH 1/5] kernfs: Add API to get generate relative kernfs path Aditya Kali
2014-07-17 19:52       ` Aditya Kali
2014-07-24 15:10       ` Serge Hallyn
2014-07-24 15:10         ` Serge Hallyn
     [not found]       ` <1405626731-12220-2-git-send-email-adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2014-07-24 15:10         ` Serge Hallyn
2014-07-17 19:52     ` [PATCH 2/5] sched: new clone flag CLONE_NEWCGROUP for cgroup namespace Aditya Kali
2014-07-17 19:52     ` [PATCH 3/5] cgroup: add function to get task's cgroup on default hierarchy Aditya Kali
2014-07-17 19:52       ` Aditya Kali
     [not found]       ` <1405626731-12220-4-git-send-email-adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2014-07-24 16:59         ` Serge Hallyn
2014-07-24 16:59       ` Serge Hallyn
2014-07-24 16:59         ` Serge Hallyn
2014-07-17 19:52     ` [PATCH 4/5] cgroup: export cgroup_get() and cgroup_put() Aditya Kali
2014-07-17 19:52       ` Aditya Kali
2014-07-24 17:03       ` Serge Hallyn
2014-07-24 17:03         ` Serge Hallyn
     [not found]       ` <1405626731-12220-5-git-send-email-adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2014-07-24 17:03         ` Serge Hallyn
2014-07-17 19:52     ` [PATCH 5/5] cgroup: introduce cgroup namespaces Aditya Kali
2014-07-17 19:52       ` Aditya Kali
     [not found]       ` <1405626731-12220-6-git-send-email-adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2014-07-17 19:57         ` Andy Lutomirski
2014-07-17 19:57           ` Andy Lutomirski
     [not found]           ` <CALCETrWXMMGzptvEu6TfzTjBou4t==W39_nNB5FJwSk2Zy8uCQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-07-17 20:55             ` Aditya Kali
2014-07-17 20:55               ` Aditya Kali
2014-07-18 16:51               ` Andy Lutomirski
2014-07-18 16:51                 ` Andy Lutomirski
     [not found]                 ` <CALCETrW6YpyJBmr3sZC6KL03GP4dcGYavQF5DFZfys6Cok-vpw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-07-18 18:51                   ` Aditya Kali
2014-07-18 18:51                     ` Aditya Kali
     [not found]                     ` <CAGr1F2GwZvZLPGLWKPPOt3vREwwVNbVPrgE6YJ01bACKejbc4Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-07-18 18:57                       ` Andy Lutomirski
2014-07-18 18:57                         ` Andy Lutomirski
2014-07-21 22:11                         ` Aditya Kali
2014-07-21 22:11                           ` Aditya Kali
     [not found]                           ` <CAGr1F2Fd_4=WUm4STPd4kdd5tNLO6aQ1OOQMKnRqyOKZSGvCpg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-07-21 22:16                             ` Andy Lutomirski
2014-07-21 22:16                           ` Andy Lutomirski
2014-07-21 22:16                             ` Andy Lutomirski
     [not found]                             ` <CALCETrUhd41LFfF9epbVYJSOwqBq308Z8RZG9tzyPfx+Joe15Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-07-23 19:52                               ` Aditya Kali
2014-07-23 19:52                                 ` Aditya Kali
     [not found]                         ` <CALCETrVeeL71sfVdbzRx0FpGrvQKbviEmUcMEosbUU+UJNQu9w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-07-21 22:11                           ` Aditya Kali
     [not found]               ` <CAGr1F2Ht1q_nYGJwmQvEEyj8r3R1stgD=g3s8_5zYOTogjz-UQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-07-18 16:51                 ` Andy Lutomirski
2014-07-18 16:00     ` [PATCH 0/5] RFC: CGroup Namespaces Serge Hallyn
2014-07-24 16:10     ` Serge Hallyn
2014-07-24 16:36     ` Serge Hallyn
2014-07-17 19:52 ` Aditya Kali
2014-10-13 21:23 ` [PATCHv1 0/8] " Aditya Kali
2014-10-13 21:23   ` Aditya Kali
2014-10-13 21:23   ` [PATCHv1 1/8] kernfs: Add API to generate relative kernfs path Aditya Kali
2014-10-13 21:23     ` Aditya Kali
2014-10-16 16:07     ` Serge E. Hallyn
2014-10-16 16:07       ` Serge E. Hallyn
     [not found]     ` <1413235430-22944-2-git-send-email-adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2014-10-16 16:07       ` Serge E. Hallyn
2014-10-13 21:23   ` [PATCHv1 3/8] cgroup: add function to get task's cgroup on default hierarchy Aditya Kali
2014-10-16 16:13     ` Serge E. Hallyn
2014-10-16 16:13       ` Serge E. Hallyn
     [not found]     ` <1413235430-22944-4-git-send-email-adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2014-10-16 16:13       ` Serge E. Hallyn
2014-10-13 21:23   ` [PATCHv1 7/8] cgroup: cgroup namespace setns support Aditya Kali
2014-10-13 21:23     ` Aditya Kali
     [not found]     ` <1413235430-22944-8-git-send-email-adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2014-10-16 21:12       ` Serge E. Hallyn
2014-10-16 21:12         ` Serge E. Hallyn
     [not found]         ` <20141016211236.GA4308-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
2014-10-16 21:17           ` Andy Lutomirski
2014-10-16 21:17             ` Andy Lutomirski
2014-10-16 21:22           ` Aditya Kali
2014-10-16 21:22             ` Aditya Kali
     [not found]             ` <CAGr1F2EH0ynfFihTh1dv=n1faxUh0zS3ggk303bwGnDnW2PUCw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-10-16 21:47               ` Serge E. Hallyn
2014-10-16 21:47                 ` Serge E. Hallyn
     [not found]                 ` <20141016214710.GA4759-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
2014-10-19  5:23                   ` Eric W. Biederman
2014-10-19  5:23                     ` Eric W. Biederman
     [not found]                     ` <87iojgmy3o.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2014-10-19 18:26                       ` Andy Lutomirski
2014-10-19 18:26                     ` Andy Lutomirski
2014-10-19 18:26                       ` Andy Lutomirski
2014-10-20  4:55                       ` Eric W.Biederman
2014-10-20  4:55                         ` Eric W.Biederman
2014-10-20  4:55                         ` Eric W.Biederman
     [not found]                         ` <44072106-c0f3-46b8-b2b5-9b1cbd1b7d88-2ueSQiBKiTY7tOexoI0I+QC/G2K4zDHf@public.gmane.org>
2014-10-21  0:20                           ` Andy Lutomirski
2014-10-21  0:20                         ` Andy Lutomirski
2014-10-21  0:20                           ` Andy Lutomirski
     [not found]                           ` <CALCETrXhGnBM_xx=Auz3WRQXkqhGGTWuZN=PU+A9HZ7Ek27FLA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-10-21  4:49                             ` Eric W. Biederman
2014-10-21  4:49                           ` Eric W. Biederman
2014-10-21  4:49                             ` Eric W. Biederman
     [not found]                             ` <87zjcq10ya.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2014-10-21  5:03                               ` Andy Lutomirski
2014-10-21  5:03                                 ` Andy Lutomirski
2014-10-21  5:42                                 ` Eric W. Biederman
2014-10-21  5:42                                   ` Eric W. Biederman
     [not found]                                   ` <87lhoayo59.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2014-10-21  5:49                                     ` Andy Lutomirski
2014-10-21  5:49                                       ` Andy Lutomirski
2014-10-21 18:49                                       ` Aditya Kali
2014-10-21 18:49                                         ` Aditya Kali
     [not found]                                         ` <CAGr1F2Ee2MCKOwALR2YV7ppDmyHxO6+EsHqSc1+WcwKFPPQB0w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-10-21 19:02                                           ` Andy Lutomirski
2014-10-21 19:02                                             ` Andy Lutomirski
     [not found]                                             ` <CALCETrWXDMRsexfvmh2CiMW4WX0ZLJ4pJvzHU55PEBk=NmnyZg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-10-21 22:33                                               ` Aditya Kali
2014-10-21 22:33                                                 ` Aditya Kali
     [not found]                                                 ` <CAGr1F2FdQ4VF1_o7mdybZ-WhLLhFxdgkNnzotHOwnhLU8W+YCw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-10-21 22:42                                                   ` Andy Lutomirski
2014-10-21 22:42                                                 ` Andy Lutomirski
2014-10-21 22:42                                                   ` Andy Lutomirski
     [not found]                                                   ` <CALCETrXEAegFmSs2LnfSJR0tQmqZudnESDER8CoqKxOCBFMwdA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-10-22  0:46                                                     ` Aditya Kali
2014-10-22  0:46                                                       ` Aditya Kali
     [not found]                                                       ` <CAGr1F2HYGG9=jwugywD8tUdB+dOjN4z+3BSpqL_m2aaM+3Rz1A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-10-22  0:58                                                         ` Andy Lutomirski
2014-10-22  0:58                                                           ` Andy Lutomirski
     [not found]                                                           ` <CALCETrUtqozUE=Lr5d2dBKd_vaLzfVvVv8g6ZALz1MWqVzj9dQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-10-22 18:37                                                             ` Aditya Kali
2014-10-22 18:37                                                               ` Aditya Kali
     [not found]                                                               ` <CAGr1F2EBDCVrXZd7fOdffQ2C0c25T8co4wfxRc8P0Jb18yq2uQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-10-22 18:50                                                                 ` Andy Lutomirski
2014-10-22 18:50                                                                   ` Andy Lutomirski
2014-10-22 19:42                                                                 ` Tejun Heo
2014-10-22 19:42                                                                   ` Tejun Heo
     [not found]                                       ` <CALCETrVFKvtHpTfY3kuE5ZTrwQAzuDmk6dm-mbQffDHAZmq-KQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-10-21 18:49                                         ` Aditya Kali
     [not found]                                 ` <CALCETrVkMtsnEh57jFZrdx5vHbz97BdO7OuupT+xVNnWpJjxng-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-10-21  5:42                                   ` Eric W. Biederman
     [not found]                       ` <CALCETrUC=yW72d2hDzjESmZAt85x1WcGz4L-DrtY5YXAQxbpMA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-10-20  4:55                         ` Eric W.Biederman
2014-10-17  9:52       ` Serge E. Hallyn
2014-10-17  9:52         ` Serge E. Hallyn
     [not found]   ` <1413235430-22944-1-git-send-email-adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2014-10-13 21:23     ` [PATCHv1 1/8] kernfs: Add API to generate relative kernfs path Aditya Kali
2014-10-13 21:23     ` [PATCHv1 2/8] sched: new clone flag CLONE_NEWCGROUP for cgroup namespace Aditya Kali
2014-10-13 21:23       ` Aditya Kali
     [not found]       ` <1413235430-22944-3-git-send-email-adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2014-10-16 16:08         ` Serge E. Hallyn
2014-10-16 16:08       ` Serge E. Hallyn
2014-10-16 16:08         ` Serge E. Hallyn
2014-10-13 21:23     ` [PATCHv1 3/8] cgroup: add function to get task's cgroup on default hierarchy Aditya Kali
2014-10-13 21:23     ` [PATCHv1 4/8] cgroup: export cgroup_get() and cgroup_put() Aditya Kali
2014-10-13 21:23       ` Aditya Kali
     [not found]       ` <1413235430-22944-5-git-send-email-adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2014-10-16 16:14         ` Serge E. Hallyn
2014-10-16 16:14       ` Serge E. Hallyn
2014-10-16 16:14         ` Serge E. Hallyn
2014-10-13 21:23     ` [PATCHv1 5/8] cgroup: introduce cgroup namespaces Aditya Kali
2014-10-13 21:23       ` Aditya Kali
     [not found]       ` <1413235430-22944-6-git-send-email-adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2014-10-16 16:37         ` Serge E. Hallyn
2014-10-16 16:37       ` Serge E. Hallyn
2014-10-16 16:37         ` Serge E. Hallyn
     [not found]         ` <20141016163703.GE1392-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
2014-10-24  1:03           ` Aditya Kali
2014-10-24  1:03             ` Aditya Kali
     [not found]             ` <CAGr1F2E0VdBafZg6P2yeP6bgxsMEm53fEuT29HTLygTKobgi-w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-10-25  3:16               ` Serge E. Hallyn
2014-10-25  3:16                 ` Serge E. Hallyn
2014-10-13 21:23     ` [PATCHv1 6/8] cgroup: restrict cgroup operations within task's cgroupns Aditya Kali
2014-10-13 21:23       ` Aditya Kali
     [not found]       ` <1413235430-22944-7-git-send-email-adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2014-10-17  9:28         ` Serge E. Hallyn
2014-10-17  9:28           ` Serge E. Hallyn
     [not found]           ` <20141017092814.GA8848-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
2014-10-22 19:06             ` Aditya Kali
2014-10-22 19:06           ` Aditya Kali
2014-10-22 19:06             ` Aditya Kali
2014-10-19  4:57         ` Eric W. Biederman
2014-10-19  4:57           ` Eric W. Biederman
2014-10-13 21:23     ` [PATCHv1 7/8] cgroup: cgroup namespace setns support Aditya Kali
2014-10-13 21:23     ` [PATCHv1 8/8] cgroup: mount cgroupns-root when inside non-init cgroupns Aditya Kali
2014-10-14 22:42     ` [PATCHv1 0/8] CGroup Namespaces Andy Lutomirski
2014-10-14 22:42       ` Andy Lutomirski
     [not found]       ` <CALCETrVnjrBt3odufhAirf45_REq-S9T=HpoEWqmFef2M6PucA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-10-14 23:33         ` Aditya Kali
2014-10-14 23:33           ` Aditya Kali
2014-10-19  4:54     ` Eric W. Biederman
2014-10-19  4:54       ` Eric W. Biederman
2015-07-22 18:10       ` Vincent Batts
2015-07-22 18:10         ` Vincent Batts
     [not found]       ` <87k33wpsl3.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2015-07-22 18:10         ` Vincent Batts
2014-10-13 21:23   ` [PATCHv1 8/8] cgroup: mount cgroupns-root when inside non-init cgroupns Aditya Kali
     [not found]     ` <1413235430-22944-9-git-send-email-adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2014-10-17 12:19       ` Serge E. Hallyn
2014-10-17 12:19         ` Serge E. Hallyn
2014-10-31 19:18 ` [PATCHv2 0/7] CGroup Namespaces Aditya Kali
2014-10-31 19:18   ` Aditya Kali
2014-10-31 19:19   ` [PATCHv2 6/7] cgroup: cgroup namespace setns support Aditya Kali
2014-10-31 19:19     ` Aditya Kali
2014-10-31 19:19   ` [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns Aditya Kali
2014-11-01  0:07     ` Andy Lutomirski
2014-11-01  0:07       ` Andy Lutomirski
2014-11-01  2:59       ` Eric W. Biederman
2014-11-01  2:59         ` Eric W. Biederman
     [not found]         ` <87a94blj6m.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2014-11-01  3:29           ` Andy Lutomirski
2014-11-01  3:29             ` Andy Lutomirski
     [not found]       ` <CALCETrXTaZ3SJ_t-gnbc93BVZXg-912NqO78kFd0Tpi-5-dZoQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-11-01  2:59         ` Eric W. Biederman
2014-11-03 23:12         ` Aditya Kali
2014-11-03 23:12           ` Aditya Kali
2014-11-03 23:15           ` Andy Lutomirski
2014-11-03 23:15             ` Andy Lutomirski
2014-11-03 23:23             ` Aditya Kali
2014-11-03 23:23               ` Aditya Kali
     [not found]               ` <CAGr1F2GX45gC-V7kEzVjp-EiYfdPDVBRs+99nASpgFVAdYX+1w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-11-03 23:48                 ` Andy Lutomirski
2014-11-03 23:48                   ` Andy Lutomirski
     [not found]                   ` <CALCETrUB_xx5zno26k5UjAFt77nZTpgyndD4AuBSZxiZBNjXSw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-11-04  0:12                     ` Aditya Kali
2014-11-04  0:12                   ` Aditya Kali
2014-11-04  0:12                     ` Aditya Kali
     [not found]                     ` <CAGr1F2EV4p_nJP_oMe3N8pBPedAZHbdB=XCMPjSEZTC9jmZoAg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-11-04  0:17                       ` Andy Lutomirski
2014-11-04  0:17                         ` Andy Lutomirski
     [not found]                         ` <CALCETrXeG2t=fW9HbkirDZudw9pbDwoqDq5ygJBkBMbqqoDAvw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-11-04  0:49                           ` Aditya Kali
2014-11-04  0:49                             ` Aditya Kali
     [not found]             ` <CALCETrW64-6xC6psP-8k0H-1GfVnWBTeEBNSrE_sH+-DFtuZQQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-11-03 23:23               ` Aditya Kali
2014-11-04 13:57           ` Tejun Heo
2014-11-04 13:57             ` Tejun Heo
2014-11-06 17:28             ` Aditya Kali
2014-11-06 17:28               ` Aditya Kali
     [not found]             ` <20141104135726.GB14014-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
2014-11-06 17:28               ` Aditya Kali
     [not found]           ` <CAGr1F2FuPQxLraYv7PstJ9c8H-XQsgawaAtj4AS77B+_0k2o+A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-11-03 23:15             ` Andy Lutomirski
2014-11-04 13:57             ` Tejun Heo
2014-11-04  1:59     ` Aditya Kali
2014-11-04  1:59       ` Aditya Kali
     [not found]     ` <1414783141-6947-8-git-send-email-adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2014-11-01  0:07       ` Andy Lutomirski
2014-11-01  1:09       ` Eric W. Biederman
2014-11-01  1:09         ` Eric W. Biederman
2014-11-03 22:46         ` Aditya Kali
2014-11-03 22:46           ` Aditya Kali
     [not found]         ` <87y4rvrakn.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2014-11-03 22:43           ` Aditya Kali
     [not found]             ` <CAGr1F2Hd_PS_AscBGMXdZC9qkHGRUp-MeQvJksDOQkRBB3RGoA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-11-03 22:56               ` Andy Lutomirski
2014-11-03 22:56                 ` Andy Lutomirski
2014-11-04 13:46               ` Tejun Heo
2014-11-04 13:46             ` Tejun Heo
2014-11-04 13:46               ` Tejun Heo
2014-11-04 15:00               ` Andy Lutomirski
2014-11-04 15:00                 ` Andy Lutomirski
     [not found]                 ` <CALCETrUggQCJyxsTWRNrjt3GM=R0VMU6RjMkU1aw3YUNMx1xEw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-11-04 15:50                   ` Serge E. Hallyn
2014-11-04 15:50                     ` Serge E. Hallyn
     [not found]                     ` <20141104155052.GA7027-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
2014-11-12 17:48                       ` Aditya Kali
2014-11-12 17:48                     ` Aditya Kali
2014-11-12 17:48                       ` Aditya Kali
     [not found]               ` <20141104134633.GA14014-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
2014-11-04 15:00                 ` Andy Lutomirski
2014-11-03 22:46           ` Aditya Kali
2014-11-04  1:59       ` Aditya Kali
     [not found]   ` <1414783141-6947-1-git-send-email-adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2014-10-31 19:18     ` [PATCHv2 1/7] kernfs: Add API to generate relative kernfs path Aditya Kali
2014-10-31 19:18       ` Aditya Kali
2014-10-31 19:18     ` [PATCHv2 2/7] sched: new clone flag CLONE_NEWCGROUP for cgroup namespace Aditya Kali
2014-10-31 19:18       ` Aditya Kali
2014-10-31 19:18     ` [PATCHv2 3/7] cgroup: add function to get task's cgroup on default hierarchy Aditya Kali
2014-10-31 19:18       ` Aditya Kali
2014-10-31 19:18     ` [PATCHv2 4/7] cgroup: export cgroup_get() and cgroup_put() Aditya Kali
2014-10-31 19:18       ` Aditya Kali
2014-10-31 19:18     ` [PATCHv2 5/7] cgroup: introduce cgroup namespaces Aditya Kali
2014-10-31 19:18       ` Aditya Kali
     [not found]       ` <1414783141-6947-6-git-send-email-adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2014-11-01  0:02         ` Andy Lutomirski
2014-11-01  0:02           ` Andy Lutomirski
     [not found]           ` <CALCETrWzYPngmWPMWnSFyiTPDwNJYPpXUj1C-294uQgjvp9wcA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-11-01  0:58             ` Eric W. Biederman
2014-11-01  0:58               ` Eric W. Biederman
     [not found]               ` <87y4rvspnd.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2014-11-03 23:42                 ` Aditya Kali
2014-11-03 23:42                   ` Aditya Kali
2014-11-03 23:40             ` Aditya Kali
2014-11-03 23:40               ` Aditya Kali
2014-11-04  1:56         ` Aditya Kali
2014-11-04  1:56           ` Aditya Kali
2014-10-31 19:19     ` [PATCHv2 6/7] cgroup: cgroup namespace setns support Aditya Kali
2014-10-31 19:19     ` [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns Aditya Kali
2014-11-04 13:10     ` [PATCHv2 0/7] CGroup Namespaces Vivek Goyal
2014-11-04 13:10       ` Vivek Goyal
     [not found]       ` <20141104131030.GA2937-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2014-11-06 17:33         ` Aditya Kali
2014-11-06 17:33           ` Aditya Kali
     [not found]           ` <CAGr1F2Hm4+aCUz3RqkgUhbJAQtWvUbb2CRDkW5rJSZkwLM_huw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-11-26 22:58             ` Richard Weinberger
2014-11-26 22:58               ` Richard Weinberger
     [not found]               ` <CAFLxGvybiem34J3zrtVhW=4itSdczassNt9RcuxnpJQeAz-JVA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-12-02 19:14                 ` Aditya Kali
2014-12-02 19:14                   ` Aditya Kali
2014-12-05  1:55 ` [PATCHv3 0/8] " Aditya Kali
2014-12-05  1:55   ` Aditya Kali
2014-12-05  1:55   ` [PATCHv3 1/8] kernfs: Add API to generate relative kernfs path Aditya Kali
2014-12-05  1:55     ` Aditya Kali
2014-12-05  1:55   ` [PATCHv3 7/8] cgroup: mount cgroupns-root when inside non-init cgroupns Aditya Kali
2014-12-05  1:55     ` Aditya Kali
     [not found]     ` <1417744550-6461-8-git-send-email-adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2014-12-12  8:55       ` Zefan Li
2014-12-12  8:55         ` Zefan Li
     [not found]   ` <1417744550-6461-1-git-send-email-adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2014-12-05  1:55     ` [PATCHv3 1/8] kernfs: Add API to generate relative kernfs path Aditya Kali
2014-12-05  1:55     ` [PATCHv3 2/8] sched: new clone flag CLONE_NEWCGROUP for cgroup namespace Aditya Kali
2014-12-05  1:55       ` Aditya Kali
2014-12-05  1:55     ` [PATCHv3 3/8] cgroup: add function to get task's cgroup on default hierarchy Aditya Kali
2014-12-05  1:55       ` Aditya Kali
2014-12-05  1:55     ` [PATCHv3 4/8] cgroup: export cgroup_get() and cgroup_put() Aditya Kali
2014-12-05  1:55       ` Aditya Kali
2014-12-05  1:55     ` [PATCHv3 5/8] cgroup: introduce cgroup namespaces Aditya Kali
2014-12-05  1:55       ` Aditya Kali
     [not found]       ` <1417744550-6461-6-git-send-email-adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2014-12-12  8:54         ` Zefan Li
2014-12-12  8:54           ` Zefan Li
2014-12-05  1:55     ` [PATCHv3 6/8] cgroup: cgroup namespace setns support Aditya Kali
2014-12-05  1:55       ` Aditya Kali
2014-12-05  1:55     ` [PATCHv3 7/8] cgroup: mount cgroupns-root when inside non-init cgroupns Aditya Kali
2014-12-05  1:55     ` [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces Aditya Kali
2014-12-05  1:55       ` Aditya Kali
     [not found]       ` <1417744550-6461-9-git-send-email-adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2014-12-12  8:54         ` Zefan Li
2014-12-12  8:54           ` Zefan Li
     [not found]           ` <548AAD42.5010002-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
2015-01-05 22:54             ` Aditya Kali
2015-01-05 22:54               ` Aditya Kali
2014-12-14 23:05         ` Richard Weinberger
2014-12-14 23:05           ` Richard Weinberger
     [not found]           ` <548E17CE.8010704-/L3Ra7n9ekc@public.gmane.org>
2015-01-05 22:48             ` Aditya Kali
2015-01-05 22:48               ` Aditya Kali
     [not found]               ` <CAGr1F2HA6mzFwgp5ngX8P7=198-5CmCjLmuCJ8j3eQ08J2d9Qw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-01-05 22:52                 ` Richard Weinberger
2015-01-05 22:52                   ` Richard Weinberger
2015-01-05 23:53                   ` Eric W. Biederman
2015-01-05 23:53                     ` Eric W. Biederman
     [not found]                     ` <87lhlgpyxk.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2015-01-06  0:07                       ` Richard Weinberger
2015-01-06  0:07                         ` Richard Weinberger
2015-01-06  0:10                       ` Aditya Kali
2015-01-06  0:10                     ` Aditya Kali
2015-01-06  0:10                       ` Aditya Kali
2015-01-06  0:17                       ` Richard Weinberger
2015-01-06  0:17                         ` Richard Weinberger
2015-01-06 23:20                         ` Aditya Kali
2015-01-06 23:20                           ` Aditya Kali
2015-01-06 23:39                           ` Richard Weinberger
2015-01-06 23:39                             ` Richard Weinberger
     [not found]                           ` <CAGr1F2EGOUSEd3-G4PS0mq=9kU1nWG4CwHUOQaNUATepc11_Sw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-01-06 23:39                             ` Richard Weinberger
2015-01-07  9:28                             ` Richard Weinberger
2015-01-07  9:28                               ` Richard Weinberger
2015-01-07 18:57                               ` Aditya Kali
2015-01-07 18:57                                 ` Aditya Kali
     [not found]                               ` <54ACFC38.5070007-/L3Ra7n9ekc@public.gmane.org>
2015-01-07 14:45                                 ` Eric W. Biederman
2015-01-07 14:45                                   ` Eric W. Biederman
2015-01-07 19:30                                   ` Serge E. Hallyn
2015-01-07 19:30                                     ` Serge E. Hallyn
2015-01-07 22:14                                     ` Eric W. Biederman
2015-01-07 22:14                                       ` Eric W. Biederman
     [not found]                                       ` <87bnma6xwv.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2015-01-07 22:45                                         ` Tejun Heo
2015-01-07 22:45                                           ` Tejun Heo
2015-01-07 23:02                                           ` Eric W. Biederman
2015-01-07 23:02                                             ` Eric W. Biederman
     [not found]                                             ` <878uhe42km.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2015-01-07 23:06                                               ` Tejun Heo
2015-01-07 23:06                                                 ` Tejun Heo
     [not found]                                                 ` <20150107230615.GA28630-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
2015-01-07 23:09                                                   ` Eric W. Biederman
2015-01-07 23:09                                                 ` Eric W. Biederman
2015-01-07 23:09                                                   ` Eric W. Biederman
2015-01-07 23:27                                                   ` Eric W. Biederman
2015-01-07 23:27                                                     ` Eric W. Biederman
     [not found]                                                     ` <87y4peyxw5.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2015-01-07 23:35                                                       ` Tejun Heo
2015-01-07 23:35                                                         ` Tejun Heo
     [not found]                                                         ` <20150107233553.GC28630-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
2015-02-11  3:46                                                           ` Serge E. Hallyn
2015-02-11  3:46                                                         ` Serge E. Hallyn
2015-02-11  3:46                                                           ` Serge E. Hallyn
     [not found]                                                           ` <20150211034616.GA25022-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
2015-02-11  4:09                                                             ` Tejun Heo
2015-02-11  4:09                                                               ` Tejun Heo
     [not found]                                                               ` <20150211040957.GC21356-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>
2015-02-11  4:29                                                                 ` Serge E. Hallyn
2015-02-11  4:29                                                               ` Serge E. Hallyn
2015-02-11  4:29                                                                 ` Serge E. Hallyn
2015-02-11  5:02                                                                 ` Eric W. Biederman
2015-02-11  5:02                                                                   ` Eric W. Biederman
     [not found]                                                                   ` <87oap1qbv3.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2015-02-11  5:17                                                                     ` Tejun Heo
2015-02-11  5:17                                                                   ` Tejun Heo
2015-02-11  5:17                                                                     ` Tejun Heo
2015-02-11 16:00                                                                     ` Serge E. Hallyn
2015-02-11 16:00                                                                       ` Serge E. Hallyn
     [not found]                                                                       ` <20150211160023.GA1579-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
2015-02-11 16:03                                                                         ` Tejun Heo
2015-02-11 16:03                                                                       ` Tejun Heo
2015-02-11 16:03                                                                         ` Tejun Heo
     [not found]                                                                         ` <20150211160347.GE21356-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>
2015-02-11 16:18                                                                           ` Serge E. Hallyn
2015-02-11 16:18                                                                         ` Serge E. Hallyn
     [not found]                                                                     ` <20150211051704.GB24897-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
2015-02-11  6:29                                                                       ` Eric W. Biederman
2015-02-11  6:29                                                                         ` Eric W. Biederman
     [not found]                                                                         ` <87twytklkv.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2015-02-11 14:36                                                                           ` Tejun Heo
2015-02-11 14:36                                                                         ` Tejun Heo
2015-02-11 14:36                                                                           ` Tejun Heo
2015-02-11 16:00                                                                       ` Serge E. Hallyn
     [not found]                                                                 ` <20150211042942.GA27931-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
2015-02-11  5:02                                                                   ` Eric W. Biederman
2015-02-11  5:10                                                                   ` Tejun Heo
2015-02-11  5:10                                                                 ` Tejun Heo
2015-02-11  5:10                                                                   ` Tejun Heo
     [not found]                                                   ` <87fvbm2nni.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2015-01-07 23:16                                                     ` Tejun Heo
2015-01-07 23:16                                                       ` Tejun Heo
2015-01-07 23:27                                                     ` Eric W. Biederman
     [not found]                                           ` <20150107224430.GA28414-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
2015-01-07 23:02                                             ` Eric W. Biederman
     [not found]                                     ` <20150107193059.GA1857-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
2015-01-07 22:14                                       ` Eric W. Biederman
     [not found]                                   ` <87fvbmir9q.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2015-01-07 19:30                                     ` Serge E. Hallyn
2015-01-07 18:57                                 ` Aditya Kali
     [not found]                         ` <54AB2992.6060707-/L3Ra7n9ekc@public.gmane.org>
2015-01-06 23:20                           ` Aditya Kali
     [not found]                       ` <CAGr1F2HSi_D07r2c5CKOsjSR1+58k9G2MrtACsd+HV6XKvJ7cA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-01-06  0:17                         ` Richard Weinberger
     [not found]                   ` <54AB15BD.8020007-/L3Ra7n9ekc@public.gmane.org>
2015-01-05 23:53                     ` Eric W. Biederman
2014-12-05  3:20     ` [PATCHv3 0/8] CGroup Namespaces Aditya Kali
2014-12-05  3:20   ` Aditya Kali
2014-12-05  3:20     ` Aditya Kali
