All of lore.kernel.org
 help / color / mirror / Atom feed
* [GIT PULL] cgroup changes for v3.15-rc1
@ 2014-04-03 16:49 ` Tejun Heo
  0 siblings, 0 replies; 36+ messages in thread
From: Tejun Heo @ 2014-04-03 16:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

Hello, Linus.

A lot updates for cgroup.

* The biggest one is cgroup's conversion to kernfs.  cgroup took after
  the long abandoned vfs-entangled sysfs implementation and made it
  even more convoluted over time.  cgroup's internal objects were
  fused with vfs objects which also brought in vfs locking and object
  lifetime rules.  Naturally, there are places where vfs rules don't
  fit and nasty hacks, such as credential switching or lock dance
  interleaving inode mutex and cgroup_mutex with object serial number
  comparison thrown in to decide whether the operation is actually
  necessary, needed to be employed.

  After conversion to kernfs, internal object lifetime and locking
  rules are mostly isolated from vfs interactions allowing shedding of
  several nasty hacks and overall simplification.  This will also
  allow implmentation of operations which may affect multiple cgroups
  which weren't possible before as it would have required nesting
  i_mutexes.

* Various simplifications including dropping of module support, easier
  cgroup name/path handling, simplified cgroup file type handling and
  task_cg_lists optimization.

* Prepatory changes for the planned unified hierarchy, which is still
  a patchset away from being actually operational.  The dummy
  hierarchy is updated to serve as the default unified hierarchy.
  Controllers which aren't claimed by other hierarchies are associated
  with it, which BTW was what the dummy hierarchy was for anyway.

* Various fixes from Li and others.  This pull request includes some
  patches to add missing slab.h to various subsystems.  This was
  triggered xattr.h include removal from cgroup.h.  cgroup.h
  indirectly got included a lot of files which brought in xattr.h
  which brought in slab.h.

There are several merge commits - one to pull in kernfs updates
necessary for converting cgroup (already in upstream through
driver-core), others for interfering changes in the fixes branch.

Pulling in this branch creates four conflicts - two detected during
merge and two silent.

C1. e61734c55c24 ("cgroup: remove cgroup->name") dropped a bunch of
    vars from mem_cgroup_print_oom_info and 08088cb9ac0a ("memcg:
    change oom_info_lock to mutex") changed one of them.  Can be
    combined by dropping what's dropped in the former and updating
    oom_info_lock according to the latter.

void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
{
<<<<<<< HEAD
	/*
	 * protects memcg_name and makes sure that parallel ooms do not
	 * interleave
	 */
	static DEFINE_MUTEX(oom_info_lock);
	struct cgroup *task_cgrp;
	struct cgroup *mem_cgrp;
	static char memcg_name[PATH_MAX];
	int ret;
=======
	/* oom_info_lock ensures that parallel ooms do not interleave */
	static DEFINE_SPINLOCK(oom_info_lock);
>>>>>>> 1ec41830e087cda1f62dda4182c2b62811eb0ffc
	struct mem_cgroup *iter;
	unsigned int i;

******* RESOLUTION *******

void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
{
	/* oom_info_lock ensures that parallel ooms do not interleave */
	static DEFINE_MUTEX(oom_info_lock);
	struct mem_cgroup *iter;
	unsigned int i;


C2. b8dadcb58d54 ("cpuset: use rcu_read_lock() to protect task_cs()")
    replaced task_lock() with rcu_read_lock() and 99afb0fd5f05
    ("cpuset: fix a race condition in
    __cpuset_node_allowed_softwall()") relocated a line into the
    locked region from after.  The resolution is straight-forward.

	rcu_read_lock();
	cs = nearest_hardwall_ancestor(task_cs(current));
<<<<<<< HEAD
	allowed = node_isset(node, cs->mems_allowed);
	task_unlock(current);
=======
	rcu_read_unlock();
>>>>>>> 1ec41830e087cda1f62dda4182c2b62811eb0ffc

	mutex_unlock(&callback_mutex);
	return allowed;

******* RESOLUTION *******

	rcu_read_lock();
	cs = nearest_hardwall_ancestor(task_cs(current));
	allowed = node_isset(node, cs->mems_allowed);
	rcu_read_unlock();

	mutex_unlock(&callback_mutex);
	return allowed;


C3. fed95bab8d29 ("sysfs: fix namespace refcnt leak") added an
    argument to kernfs_mount() and was routed through driver-core-next
    after cgroup pulled the tree.  cgroup's usage needs to be updated
    accordingly.


C4. 3eb59ec64fc7 ("cgroup: fix a failure path in create_css()") routed
    through cgroup/for-3.14-fixes added new usage of ss->subsys_id
    which was renamed to ss->id in the devel branch.  The new usage
    needs to be updated accordingly.


The following patch resolves C3 and C4.

index 432bdec..fede3d3 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1603,7 +1603,7 @@ out_unlock:
        if (ret)
                return ERR_PTR(ret);

-       dentry = kernfs_mount(fs_type, flags, root->kf_root);
+       dentry = kernfs_mount(fs_type, flags, root->kf_root, NULL);
        if (IS_ERR(dentry))
                cgroup_put(&root->cgrp);
        return dentry;
@@ -3654,7 +3654,7 @@ static int create_css(struct cgroup *cgrp, struct cgroup_subsys *ss)
        return 0;

 err_clear_dir:
-       cgroup_clear_dir(css->cgroup, 1 << css->ss->subsys_id);
+       cgroup_clear_dir(css->cgroup, 1 << css->ss->id);
 err_free_percpu_ref:
        percpu_ref_cancel_init(&css->refcnt);
 err_free_css:


Due to the intermediate merges, git-request-pull's diffstat didn't
match the changes made by pulling in the tree.  diffstat is generated
by comparing pre and post merge trees.  The test merge is available in
test-merge-3.15 branch.

The following changes since commit 532de3fc72adc2a6525c4d53c07bf81e1732083d:

  cgroup: update cgroup_enable_task_cg_lists() to grab siglock (2014-02-18 18:23:18 -0500)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-3.15

for you to fetch changes up to 1ec41830e087cda1f62dda4182c2b62811eb0ffc:

  cgroup: remove useless argument from cgroup_exit() (2014-03-29 09:15:54 -0400)

Thanks.

----------------------------------------------------------------
Fengguang Wu (1):
      cgroup: fix coccinelle warnings

Li Zefan (7):
      cgroup: fix locking in cgroupstats_build()
      cgroup: fix memory leak in cgroup_mount()
      cgroup: deal with dummp_top in cgroup_name() and cgroup_path()
      cgroup: add a validation check to cgroup_add_cftyps()
      cpuset: use rcu_read_lock() to protect task_cs()
      cgroup: fix spurious lockdep warning in cgroup_exit()
      cgroup: remove useless argument from cgroup_exit()

Monam Agarwal (1):
      cgroup: Use RCU_INIT_POINTER(x, NULL) in cgroup.c

Paul Gortmaker (1):
      sparc: fix implicit include of slab.h in leon_pci_grpci2.c

Stephen Rothwell (1):
      sun4M: add include of slab.h for kzalloc

Tejun Heo (67):
      cgroup: make CONFIG_CGROUP_NET_PRIO bool and drop unnecessary init_netclassid_cgroup()
      cgroup: drop module support
      cgroup: clean up cgroup_subsys names and initialization
      cgroup: rename cgroup_subsys->subsys_id to ->id
      cgroup: update locking in cgroup_show_options()
      cgroup: remove cgroup_root_mutex
      Merge branch 'for-3.14-fixes' into for-3.15
      Merge branch 'driver-core-next' into cgroup/for-3.15
      Merge branch 'cgroup/for-3.14-fixes' into cgroup/for-3.15
      cgroup: improve css_from_dir() into css_tryget_from_dir()
      cgroup: introduce cgroup_tree_mutex
      cgroup: release cgroup_mutex over file removals
      cgroup: restructure locking and error handling in cgroup_mount()
      cgroup: factor out cgroup_setup_root() from cgroup_mount()
      cgroup: update cgroup name handling
      cgroup: make cgroup_subsys->base_cftypes use cgroup_add_cftypes()
      cgroup: update the meaning of cftype->max_write_len
      cgroup: introduce cgroup_init/exit_cftypes()
      cgroup: introduce cgroup_ino()
      cgroup: misc preps for kernfs conversion
      cgroup: relocate functions in preparation of kernfs conversion
      cgroup: convert to kernfs
      cgroup: warn if "xattr" is specified with "sane_behavior"
      cgroup: relocate cgroup_rm_cftypes()
      cgroup: remove cftype_set
      cgroup: simplify dynamic cftype addition and removal
      cgroup: make cgroup hold onto its kernfs_node
      cgroup: remove cgroup->name
      cgroup: rename cgroupfs_root->number_of_cgroups to ->nr_cgrps and make it atomic_t
      cgroup: remove cgroupfs_root->refcnt
      cgroup: disallow xattr, release_agent and name if sane_behavior
      cgroup: drop CGRP_ROOT_SUBSYS_BOUND
      cgroup: enable task_cg_lists on the first cgroup mount
      cgroup: relocate cgroup_enable_task_cg_lists()
      cgroup: implement cgroup_has_tasks() and unexport cgroup_task_count()
      cgroup: reimplement cgroup_transfer_tasks() without using css_scan_tasks()
      cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem
      cpuset: use css_task_iter_start/next/end() instead of css_scan_tasks()
      cgroup: remove css_scan_tasks()
      cgroup: separate out put_css_set_locked() and remove put_css_set_taskexit()
      cgroup: move css_set_rwsem locking outside of cgroup_task_migrate()
      cgroup: drop @skip_css from cgroup_taskset_for_each()
      cpuset: don't use cgroup_taskset_cur_css()
      cgroup: remove cgroup_taskset_cur_css() and cgroup_taskset_size()
      cgroup: cosmetic updates to cgroup_attach_task()
      cgroup: unexport functions
      Merge branch 'cgroup/for-3.14-fixes' into cgroup/for-3.15
      cgroup: add css_set->mg_tasks
      cgroup: use css_set->mg_tasks to track target tasks during migration
      cgroup: separate out cset_group_from_root() from task_cgroup_from_root()
      cgroup: split process / task migration into four steps
      cgroup: update how a newly forked task gets associated with css_set
      cgroup: drop task_lock() protection around task->cgroups
      cgroup: update cgroup_transfer_tasks() to either succeed or fail
      cgroup_freezer: document freezer_fork() subtleties
      cgroup: relocate setting of CGRP_DEAD
      cgroup: reorganize cgroup bootstrapping
      cgroup: use cgroup_setup_root() to initialize cgroup_dummy_root
      cgroup: remove NULL checks from [pr_cont_]cgroup_{name|path}()
      cgroup: treat cgroup_dummy_root as an equivalent hierarchy during rebinding
      cgroup: move ->subsys_mask from cgroupfs_root to cgroup
      cgroup: rename cgroup_dummy_root and related names
      cgroup: drop const from @buffer of cftype->write_string()
      cgroup: make cgrp_dfl_root mountable
      cgroup: implement CFTYPE_ONLY_ON_DFL
      cgroup: fix cgroup_taskset walking order
      cgroup: break kernfs active_ref protection in cgroup directory operations

 arch/sparc/kernel/leon_pci_grpci2.c |    1 
 arch/sparc/kernel/sun4m_irq.c       |    2 
 block/blk-cgroup.c                  |   11 
 block/blk-cgroup.h                  |   14 
 block/blk-throttle.c                |    8 
 block/cfq-iosched.c                 |    7 
 fs/bio.c                            |    2 
 fs/kernfs/dir.c                     |    1 
 include/linux/cgroup.h              |  275 +-
 include/linux/cgroup_subsys.h       |   30 
 include/linux/hugetlb_cgroup.h      |    2 
 include/linux/memcontrol.h          |    2 
 include/net/cls_cgroup.h            |    2 
 include/net/netprio_cgroup.h        |   17 
 init/Kconfig                        |    1 
 kernel/cgroup.c                     | 3721 ++++++++++++++----------------------
 kernel/cgroup_freezer.c             |   40 
 kernel/cpuset.c                     |  262 --
 kernel/events/core.c                |   25 
 kernel/exit.c                       |    2 
 kernel/fork.c                       |    5 
 kernel/sched/core.c                 |   10 
 kernel/sched/cpuacct.c              |    6 
 kernel/sched/debug.c                |    3 
 mm/hugetlb_cgroup.c                 |   11 
 mm/memcontrol.c                     |  110 -
 mm/memory-failure.c                 |    8 
 net/Kconfig                         |    2 
 net/core/netclassid_cgroup.c        |   15 
 net/core/netprio_cgroup.c           |   41 
 net/ipv4/tcp_memcontrol.c           |    4 
 security/device_cgroup.c            |   12 
 32 files changed, 1913 insertions(+), 2739 deletions(-)

--
tejun

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [GIT PULL] cgroup changes for v3.15-rc1
@ 2014-04-03 16:49 ` Tejun Heo
  0 siblings, 0 replies; 36+ messages in thread
From: Tejun Heo @ 2014-04-03 16:49 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, Li Zefan, containers, cgroups

Hello, Linus.

A lot updates for cgroup.

* The biggest one is cgroup's conversion to kernfs.  cgroup took after
  the long abandoned vfs-entangled sysfs implementation and made it
  even more convoluted over time.  cgroup's internal objects were
  fused with vfs objects which also brought in vfs locking and object
  lifetime rules.  Naturally, there are places where vfs rules don't
  fit and nasty hacks, such as credential switching or lock dance
  interleaving inode mutex and cgroup_mutex with object serial number
  comparison thrown in to decide whether the operation is actually
  necessary, needed to be employed.

  After conversion to kernfs, internal object lifetime and locking
  rules are mostly isolated from vfs interactions allowing shedding of
  several nasty hacks and overall simplification.  This will also
  allow implmentation of operations which may affect multiple cgroups
  which weren't possible before as it would have required nesting
  i_mutexes.

* Various simplifications including dropping of module support, easier
  cgroup name/path handling, simplified cgroup file type handling and
  task_cg_lists optimization.

* Prepatory changes for the planned unified hierarchy, which is still
  a patchset away from being actually operational.  The dummy
  hierarchy is updated to serve as the default unified hierarchy.
  Controllers which aren't claimed by other hierarchies are associated
  with it, which BTW was what the dummy hierarchy was for anyway.

* Various fixes from Li and others.  This pull request includes some
  patches to add missing slab.h to various subsystems.  This was
  triggered xattr.h include removal from cgroup.h.  cgroup.h
  indirectly got included a lot of files which brought in xattr.h
  which brought in slab.h.

There are several merge commits - one to pull in kernfs updates
necessary for converting cgroup (already in upstream through
driver-core), others for interfering changes in the fixes branch.

Pulling in this branch creates four conflicts - two detected during
merge and two silent.

C1. e61734c55c24 ("cgroup: remove cgroup->name") dropped a bunch of
    vars from mem_cgroup_print_oom_info and 08088cb9ac0a ("memcg:
    change oom_info_lock to mutex") changed one of them.  Can be
    combined by dropping what's dropped in the former and updating
    oom_info_lock according to the latter.

void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
{
<<<<<<< HEAD
	/*
	 * protects memcg_name and makes sure that parallel ooms do not
	 * interleave
	 */
	static DEFINE_MUTEX(oom_info_lock);
	struct cgroup *task_cgrp;
	struct cgroup *mem_cgrp;
	static char memcg_name[PATH_MAX];
	int ret;
=======
	/* oom_info_lock ensures that parallel ooms do not interleave */
	static DEFINE_SPINLOCK(oom_info_lock);
>>>>>>> 1ec41830e087cda1f62dda4182c2b62811eb0ffc
	struct mem_cgroup *iter;
	unsigned int i;

******* RESOLUTION *******

void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
{
	/* oom_info_lock ensures that parallel ooms do not interleave */
	static DEFINE_MUTEX(oom_info_lock);
	struct mem_cgroup *iter;
	unsigned int i;


C2. b8dadcb58d54 ("cpuset: use rcu_read_lock() to protect task_cs()")
    replaced task_lock() with rcu_read_lock() and 99afb0fd5f05
    ("cpuset: fix a race condition in
    __cpuset_node_allowed_softwall()") relocated a line into the
    locked region from after.  The resolution is straight-forward.

	rcu_read_lock();
	cs = nearest_hardwall_ancestor(task_cs(current));
<<<<<<< HEAD
	allowed = node_isset(node, cs->mems_allowed);
	task_unlock(current);
=======
	rcu_read_unlock();
>>>>>>> 1ec41830e087cda1f62dda4182c2b62811eb0ffc

	mutex_unlock(&callback_mutex);
	return allowed;

******* RESOLUTION *******

	rcu_read_lock();
	cs = nearest_hardwall_ancestor(task_cs(current));
	allowed = node_isset(node, cs->mems_allowed);
	rcu_read_unlock();

	mutex_unlock(&callback_mutex);
	return allowed;


C3. fed95bab8d29 ("sysfs: fix namespace refcnt leak") added an
    argument to kernfs_mount() and was routed through driver-core-next
    after cgroup pulled the tree.  cgroup's usage needs to be updated
    accordingly.


C4. 3eb59ec64fc7 ("cgroup: fix a failure path in create_css()") routed
    through cgroup/for-3.14-fixes added new usage of ss->subsys_id
    which was renamed to ss->id in the devel branch.  The new usage
    needs to be updated accordingly.


The following patch resolves C3 and C4.

index 432bdec..fede3d3 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1603,7 +1603,7 @@ out_unlock:
        if (ret)
                return ERR_PTR(ret);

-       dentry = kernfs_mount(fs_type, flags, root->kf_root);
+       dentry = kernfs_mount(fs_type, flags, root->kf_root, NULL);
        if (IS_ERR(dentry))
                cgroup_put(&root->cgrp);
        return dentry;
@@ -3654,7 +3654,7 @@ static int create_css(struct cgroup *cgrp, struct cgroup_subsys *ss)
        return 0;

 err_clear_dir:
-       cgroup_clear_dir(css->cgroup, 1 << css->ss->subsys_id);
+       cgroup_clear_dir(css->cgroup, 1 << css->ss->id);
 err_free_percpu_ref:
        percpu_ref_cancel_init(&css->refcnt);
 err_free_css:


Due to the intermediate merges, git-request-pull's diffstat didn't
match the changes made by pulling in the tree.  diffstat is generated
by comparing pre and post merge trees.  The test merge is available in
test-merge-3.15 branch.

The following changes since commit 532de3fc72adc2a6525c4d53c07bf81e1732083d:

  cgroup: update cgroup_enable_task_cg_lists() to grab siglock (2014-02-18 18:23:18 -0500)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-3.15

for you to fetch changes up to 1ec41830e087cda1f62dda4182c2b62811eb0ffc:

  cgroup: remove useless argument from cgroup_exit() (2014-03-29 09:15:54 -0400)

Thanks.

----------------------------------------------------------------
Fengguang Wu (1):
      cgroup: fix coccinelle warnings

Li Zefan (7):
      cgroup: fix locking in cgroupstats_build()
      cgroup: fix memory leak in cgroup_mount()
      cgroup: deal with dummp_top in cgroup_name() and cgroup_path()
      cgroup: add a validation check to cgroup_add_cftyps()
      cpuset: use rcu_read_lock() to protect task_cs()
      cgroup: fix spurious lockdep warning in cgroup_exit()
      cgroup: remove useless argument from cgroup_exit()

Monam Agarwal (1):
      cgroup: Use RCU_INIT_POINTER(x, NULL) in cgroup.c

Paul Gortmaker (1):
      sparc: fix implicit include of slab.h in leon_pci_grpci2.c

Stephen Rothwell (1):
      sun4M: add include of slab.h for kzalloc

Tejun Heo (67):
      cgroup: make CONFIG_CGROUP_NET_PRIO bool and drop unnecessary init_netclassid_cgroup()
      cgroup: drop module support
      cgroup: clean up cgroup_subsys names and initialization
      cgroup: rename cgroup_subsys->subsys_id to ->id
      cgroup: update locking in cgroup_show_options()
      cgroup: remove cgroup_root_mutex
      Merge branch 'for-3.14-fixes' into for-3.15
      Merge branch 'driver-core-next' into cgroup/for-3.15
      Merge branch 'cgroup/for-3.14-fixes' into cgroup/for-3.15
      cgroup: improve css_from_dir() into css_tryget_from_dir()
      cgroup: introduce cgroup_tree_mutex
      cgroup: release cgroup_mutex over file removals
      cgroup: restructure locking and error handling in cgroup_mount()
      cgroup: factor out cgroup_setup_root() from cgroup_mount()
      cgroup: update cgroup name handling
      cgroup: make cgroup_subsys->base_cftypes use cgroup_add_cftypes()
      cgroup: update the meaning of cftype->max_write_len
      cgroup: introduce cgroup_init/exit_cftypes()
      cgroup: introduce cgroup_ino()
      cgroup: misc preps for kernfs conversion
      cgroup: relocate functions in preparation of kernfs conversion
      cgroup: convert to kernfs
      cgroup: warn if "xattr" is specified with "sane_behavior"
      cgroup: relocate cgroup_rm_cftypes()
      cgroup: remove cftype_set
      cgroup: simplify dynamic cftype addition and removal
      cgroup: make cgroup hold onto its kernfs_node
      cgroup: remove cgroup->name
      cgroup: rename cgroupfs_root->number_of_cgroups to ->nr_cgrps and make it atomic_t
      cgroup: remove cgroupfs_root->refcnt
      cgroup: disallow xattr, release_agent and name if sane_behavior
      cgroup: drop CGRP_ROOT_SUBSYS_BOUND
      cgroup: enable task_cg_lists on the first cgroup mount
      cgroup: relocate cgroup_enable_task_cg_lists()
      cgroup: implement cgroup_has_tasks() and unexport cgroup_task_count()
      cgroup: reimplement cgroup_transfer_tasks() without using css_scan_tasks()
      cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem
      cpuset: use css_task_iter_start/next/end() instead of css_scan_tasks()
      cgroup: remove css_scan_tasks()
      cgroup: separate out put_css_set_locked() and remove put_css_set_taskexit()
      cgroup: move css_set_rwsem locking outside of cgroup_task_migrate()
      cgroup: drop @skip_css from cgroup_taskset_for_each()
      cpuset: don't use cgroup_taskset_cur_css()
      cgroup: remove cgroup_taskset_cur_css() and cgroup_taskset_size()
      cgroup: cosmetic updates to cgroup_attach_task()
      cgroup: unexport functions
      Merge branch 'cgroup/for-3.14-fixes' into cgroup/for-3.15
      cgroup: add css_set->mg_tasks
      cgroup: use css_set->mg_tasks to track target tasks during migration
      cgroup: separate out cset_group_from_root() from task_cgroup_from_root()
      cgroup: split process / task migration into four steps
      cgroup: update how a newly forked task gets associated with css_set
      cgroup: drop task_lock() protection around task->cgroups
      cgroup: update cgroup_transfer_tasks() to either succeed or fail
      cgroup_freezer: document freezer_fork() subtleties
      cgroup: relocate setting of CGRP_DEAD
      cgroup: reorganize cgroup bootstrapping
      cgroup: use cgroup_setup_root() to initialize cgroup_dummy_root
      cgroup: remove NULL checks from [pr_cont_]cgroup_{name|path}()
      cgroup: treat cgroup_dummy_root as an equivalent hierarchy during rebinding
      cgroup: move ->subsys_mask from cgroupfs_root to cgroup
      cgroup: rename cgroup_dummy_root and related names
      cgroup: drop const from @buffer of cftype->write_string()
      cgroup: make cgrp_dfl_root mountable
      cgroup: implement CFTYPE_ONLY_ON_DFL
      cgroup: fix cgroup_taskset walking order
      cgroup: break kernfs active_ref protection in cgroup directory operations

 arch/sparc/kernel/leon_pci_grpci2.c |    1 
 arch/sparc/kernel/sun4m_irq.c       |    2 
 block/blk-cgroup.c                  |   11 
 block/blk-cgroup.h                  |   14 
 block/blk-throttle.c                |    8 
 block/cfq-iosched.c                 |    7 
 fs/bio.c                            |    2 
 fs/kernfs/dir.c                     |    1 
 include/linux/cgroup.h              |  275 +-
 include/linux/cgroup_subsys.h       |   30 
 include/linux/hugetlb_cgroup.h      |    2 
 include/linux/memcontrol.h          |    2 
 include/net/cls_cgroup.h            |    2 
 include/net/netprio_cgroup.h        |   17 
 init/Kconfig                        |    1 
 kernel/cgroup.c                     | 3721 ++++++++++++++----------------------
 kernel/cgroup_freezer.c             |   40 
 kernel/cpuset.c                     |  262 --
 kernel/events/core.c                |   25 
 kernel/exit.c                       |    2 
 kernel/fork.c                       |    5 
 kernel/sched/core.c                 |   10 
 kernel/sched/cpuacct.c              |    6 
 kernel/sched/debug.c                |    3 
 mm/hugetlb_cgroup.c                 |   11 
 mm/memcontrol.c                     |  110 -
 mm/memory-failure.c                 |    8 
 net/Kconfig                         |    2 
 net/core/netclassid_cgroup.c        |   15 
 net/core/netprio_cgroup.c           |   41 
 net/ipv4/tcp_memcontrol.c           |    4 
 security/device_cgroup.c            |   12 
 32 files changed, 1913 insertions(+), 2739 deletions(-)

--
tejun

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [GIT PULL] cgroup changes for v3.15-rc1
       [not found] ` <20140403164911.GE24119-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
@ 2014-04-03 18:11   ` Linus Torvalds
  2014-04-05  1:06     ` Linus Torvalds
  1 sibling, 0 replies; 36+ messages in thread
From: Linus Torvalds @ 2014-04-03 18:11 UTC (permalink / raw)
  To: Tejun Heo
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, Linux Containers,
	Linux Kernel Mailing List

On Thu, Apr 3, 2014 at 9:49 AM, Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
>
> C3. fed95bab8d29 ("sysfs: fix namespace refcnt leak") added an
>     argument to kernfs_mount() and was routed through driver-core-next
>     after cgroup pulled the tree.  cgroup's usage needs to be updated
>     accordingly.

This one I find very annoying. There's exactly *one* user of that
interface (back in fed95bab8d29 there were none), and it doesn't want
that dummy argument.

WTF? Why did the driver-core-next development add that argument at
all, rather than just pass in NULL directly? Do we expect to grow more
users of kernfs_mount?

And the "bool *new_sb_created" argument really makes *zero* sense to
kernfs_mount(). It was added to fix a namespace refcount leak, BUT
kernfs_mount() DOES NOT TAKE A NAMESPACE PARAMETER!

So quite frankly, that kernfs_mount() calling convention change looks
completely f*cked up.

I think I'm going to resolve that conflict differently from your
suggested one - I'm going to just remove the broken stupid
"new_sb_created" argument from the kernfs_mount() helper function.

(It obviously stays around for kernfs_mount_ns() that *does* take a
namespace pointer).

People, holler if you disagree, I'll hold off pushing out my
resolution for a while anyway (do all the normal build tests and
reboot into it).

                      Linus

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [GIT PULL] cgroup changes for v3.15-rc1
       [not found] ` <20140403164911.GE24119-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
@ 2014-04-03 18:11   ` Linus Torvalds
  2014-04-05  1:06     ` Linus Torvalds
  1 sibling, 0 replies; 36+ messages in thread
From: Linus Torvalds @ 2014-04-03 18:11 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Linux Kernel Mailing List, Li Zefan, Linux Containers, cgroups

On Thu, Apr 3, 2014 at 9:49 AM, Tejun Heo <tj@kernel.org> wrote:
>
> C3. fed95bab8d29 ("sysfs: fix namespace refcnt leak") added an
>     argument to kernfs_mount() and was routed through driver-core-next
>     after cgroup pulled the tree.  cgroup's usage needs to be updated
>     accordingly.

This one I find very annoying. There's exactly *one* user of that
interface (back in fed95bab8d29 there were none), and it doesn't want
that dummy argument.

WTF? Why did the driver-core-next development add that argument at
all, rather than just pass in NULL directly? Do we expect to grow more
users of kernfs_mount?

And the "bool *new_sb_created" argument really makes *zero* sense to
kernfs_mount(). It was added to fix a namespace refcount leak, BUT
kernfs_mount() DOES NOT TAKE A NAMESPACE PARAMETER!

So quite frankly, that kernfs_mount() calling convention change looks
completely f*cked up.

I think I'm going to resolve that conflict differently from your
suggested one - I'm going to just remove the broken stupid
"new_sb_created" argument from the kernfs_mount() helper function.

(It obviously stays around for kernfs_mount_ns() that *does* take a
namespace pointer).

People, holler if you disagree, I'll hold off pushing out my
resolution for a while anyway (do all the normal build tests and
reboot into it).

                      Linus

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [GIT PULL] cgroup changes for v3.15-rc1
@ 2014-04-03 18:11   ` Linus Torvalds
  0 siblings, 0 replies; 36+ messages in thread
From: Linus Torvalds @ 2014-04-03 18:11 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Linux Kernel Mailing List, Li Zefan, Linux Containers,
	cgroups-u79uwXL29TY76Z2rM5mHXA

On Thu, Apr 3, 2014 at 9:49 AM, Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
>
> C3. fed95bab8d29 ("sysfs: fix namespace refcnt leak") added an
>     argument to kernfs_mount() and was routed through driver-core-next
>     after cgroup pulled the tree.  cgroup's usage needs to be updated
>     accordingly.

This one I find very annoying. There's exactly *one* user of that
interface (back in fed95bab8d29 there were none), and it doesn't want
that dummy argument.

WTF? Why did the driver-core-next development add that argument at
all, rather than just pass in NULL directly? Do we expect to grow more
users of kernfs_mount?

And the "bool *new_sb_created" argument really makes *zero* sense to
kernfs_mount(). It was added to fix a namespace refcount leak, BUT
kernfs_mount() DOES NOT TAKE A NAMESPACE PARAMETER!

So quite frankly, that kernfs_mount() calling convention change looks
completely f*cked up.

I think I'm going to resolve that conflict differently from your
suggested one - I'm going to just remove the broken stupid
"new_sb_created" argument from the kernfs_mount() helper function.

(It obviously stays around for kernfs_mount_ns() that *does* take a
namespace pointer).

People, holler if you disagree, I'll hold off pushing out my
resolution for a while anyway (do all the normal build tests and
reboot into it).

                      Linus

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [GIT PULL] cgroup changes for v3.15-rc1
  2014-04-03 18:11   ` Linus Torvalds
@ 2014-04-03 18:24       ` Linus Torvalds
  -1 siblings, 0 replies; 36+ messages in thread
From: Linus Torvalds @ 2014-04-03 18:24 UTC (permalink / raw)
  To: Tejun Heo
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, Linux Containers,
	Linux Kernel Mailing List

On Thu, Apr 3, 2014 at 11:11 AM, Linus Torvalds
<torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> wrote:
>
> And the "bool *new_sb_created" argument really makes *zero* sense to
> kernfs_mount(). It was added to fix a namespace refcount leak, BUT
> kernfs_mount() DOES NOT TAKE A NAMESPACE PARAMETER!

Let me clarify that: the only reason I can see for why you'd care
about whether a new sb has been created is because the new sb needs a
namespace refcount.

Now, *if* there is some other reason to care, then the
"new_sb_created" argument may make sense even for kernfs_mount(). I
just don't see it. But if I'm wrong about that, then my alternate
resolution is wrong. However, as far as I can tell, anything else
should be properly refcounted by the mounting logic, and the "ns"
parameter is the only thing that needs to be handled by the caller
because it has that stupid opaque pointer that doesn't know its own
type.

Could that perhaps be fixed? If kernfs_mount_ns() could just do the
proper "grab" on the "void *ns" directly, all of the stupid
"new_sb_created" crud could just go away, even for kernfs_mount_ns().

                  Linus

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [GIT PULL] cgroup changes for v3.15-rc1
@ 2014-04-03 18:24       ` Linus Torvalds
  0 siblings, 0 replies; 36+ messages in thread
From: Linus Torvalds @ 2014-04-03 18:24 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Linux Kernel Mailing List, Li Zefan, Linux Containers, cgroups

On Thu, Apr 3, 2014 at 11:11 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> And the "bool *new_sb_created" argument really makes *zero* sense to
> kernfs_mount(). It was added to fix a namespace refcount leak, BUT
> kernfs_mount() DOES NOT TAKE A NAMESPACE PARAMETER!

Let me clarify that: the only reason I can see for why you'd care
about whether a new sb has been created is because the new sb needs a
namespace refcount.

Now, *if* there is some other reason to care, then the
"new_sb_created" argument may make sense even for kernfs_mount(). I
just don't see it. But if I'm wrong about that, then my alternate
resolution is wrong. However, as far as I can tell, anything else
should be properly refcounted by the mounting logic, and the "ns"
parameter is the only thing that needs to be handled by the caller
because it has that stupid opaque pointer that doesn't know its own
type.

Could that perhaps be fixed? If kernfs_mount_ns() could just do the
proper "grab" on the "void *ns" directly, all of the stupid
"new_sb_created" crud could just go away, even for kernfs_mount_ns().

                  Linus

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [GIT PULL] cgroup changes for v3.15-rc1
       [not found]       ` <CAOS58YMckmoCocguf9BC_Wxbn3D2Rx3MArhgozO9qCj4g=5aDw@mail.gmail.com>
@ 2014-04-03 19:01             ` Linus Torvalds
  0 siblings, 0 replies; 36+ messages in thread
From: Linus Torvalds @ 2014-04-03 19:01 UTC (permalink / raw)
  To: Tejun Heo, Eric W. Biederman, David Miller
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, Linux Containers,
	Linux Kernel Mailing List

[ Extending the participants list a bit ]

On Thu, Apr 3, 2014 at 11:34 AM, Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> On the road so sending from phone. Iirc the param is necessary to
> distinguishe when a new sb is created so that it can be put properly later.
> I think cgroup is leaking super ref now and li was planning to send a fix
> once things are merged.

So as far as I can tell, cgroup is fine, because the superblock itself
is properly refcounted by the mounting code. It's the magic hidden
"namespace" pointer that isn't properly refcounted, because that
stupid sucker is typeless and depends on the namespace kind (so to
drop a namespace pointer you need to pass in not just the pointer, but
also the right "KOBJ_NS_TYPE_xxx" type, which right now is always
KOBJ_NS_TYPE_NET).

Quite frankly, the whole "let's make the 'ns' pointer be a 'const void
*' and rely on people passing the right type along" is some seriously
crazy sh*t. Couldn't we force it to be a pointer to a union where the
first member is the type, so that

 (a) we don't use "const void *" that casts silently a bit too well to
*anything* (the "const" helps a bit, thankfully)

 (b) we don't have to pass both pointer and type along, we can just
pass the pointer, and the type can be looked up from it?

Hmm? It *looks* like the only thing that generates namespace pointers
right now is "net_namespace()", which just returns a "struct net *".
Could we just add that "type" as the first member of that struct, and
make the rule be (for any future namespaces) that the first member has
to be an "int type" field?

I might well have missed some other source that generates namespace
pointers, but it really looks like right now all those namespace()
things end up calling ->class->namespace(), and the only class that
then implements a namespace pointer function is networking and that
"net_namespace()". And I don't *think* that the networking code would
mind having that "int type" at the head of the "struct net" it uses.

The reason I'd like to have the type accessible from the pointer is
that then we *could* use that type to index into the namespace ops,
and grab reference counts etc. Instead of having that "const void *ns"
pointer that you can't do anything about, and have to expect that the
caller (that knows the type of it) will do the right thing wrt
refcounting etc.

Is there something I'm missing?

            Linus

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [GIT PULL] cgroup changes for v3.15-rc1
@ 2014-04-03 19:01             ` Linus Torvalds
  0 siblings, 0 replies; 36+ messages in thread
From: Linus Torvalds @ 2014-04-03 19:01 UTC (permalink / raw)
  To: Tejun Heo, Eric W. Biederman, David Miller
  Cc: Linux Kernel Mailing List, Li Zefan, Linux Containers, cgroups

[ Extending the participants list a bit ]

On Thu, Apr 3, 2014 at 11:34 AM, Tejun Heo <tj@kernel.org> wrote:
> On the road so sending from phone. Iirc the param is necessary to
> distinguishe when a new sb is created so that it can be put properly later.
> I think cgroup is leaking super ref now and li was planning to send a fix
> once things are merged.

So as far as I can tell, cgroup is fine, because the superblock itself
is properly refcounted by the mounting code. It's the magic hidden
"namespace" pointer that isn't properly refcounted, because that
stupid sucker is typeless and depends on the namespace kind (so to
drop a namespace pointer you need to pass in not just the pointer, but
also the right "KOBJ_NS_TYPE_xxx" type, which right now is always
KOBJ_NS_TYPE_NET).

Quite frankly, the whole "let's make the 'ns' pointer be a 'const void
*' and rely on people passing the right type along" is some seriously
crazy sh*t. Couldn't we force it to be a pointer to a union where the
first member is the type, so that

 (a) we don't use "const void *" that casts silently a bit too well to
*anything* (the "const" helps a bit, thankfully)

 (b) we don't have to pass both pointer and type along, we can just
pass the pointer, and the type can be looked up from it?

Hmm? It *looks* like the only thing that generates namespace pointers
right now is "net_namespace()", which just returns a "struct net *".
Could we just add that "type" as the first member of that struct, and
make the rule be (for any future namespaces) that the first member has
to be an "int type" field?

I might well have missed some other source that generates namespace
pointers, but it really looks like right now all those namespace()
things end up calling ->class->namespace(), and the only class that
then implements a namespace pointer function is networking and that
"net_namespace()". And I don't *think* that the networking code would
mind having that "int type" at the head of the "struct net" it uses.

The reason I'd like to have the type accessible from the pointer is
that then we *could* use that type to index into the namespace ops,
and grab reference counts etc. Instead of having that "const void *ns"
pointer that you can't do anything about, and have to expect that the
caller (that knows the type of it) will do the right thing wrt
refcounting etc.

Is there something I'm missing?

            Linus

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [GIT PULL] cgroup changes for v3.15-rc1
       [not found]             ` <CA+55aFw2+VKojzoF2aDaruOMKHikTO8J0HeK9DoDLZr+HqTWbg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2014-04-03 19:43               ` Tejun Heo
  0 siblings, 0 replies; 36+ messages in thread
From: Tejun Heo @ 2014-04-03 19:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Containers, Linux Kernel Mailing List, Eric W. Biederman,
	cgroups-u79uwXL29TY76Z2rM5mHXA, David Miller

Hello,

On Thu, Apr 03, 2014 at 12:01:23PM -0700, Linus Torvalds wrote:
> [ Extending the participants list a bit ]
> 
> On Thu, Apr 3, 2014 at 11:34 AM, Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> > On the road so sending from phone. Iirc the param is necessary to
> > distinguishe when a new sb is created so that it can be put properly later.
> > I think cgroup is leaking super ref now and li was planning to send a fix
> > once things are merged.
> 
> So as far as I can tell, cgroup is fine, because the superblock itself
> is properly refcounted by the mounting code. It's the magic hidden

Ah, I remembered the other way around.  We could leak cgroup_root
reference, not the other way around.  cgroup_mount() can be called
multiple times for the same sb and we inc cgroup_root's ref each time
but cgroup_kill_sb() only happens when the sb is released, so if we do
the following,

 # mkdir cpuset cpuset1
 # mount -t cgroup -o cpuset cgroup cpuset
 # mount -t cgroup -o cpuset cgroup cpuset1
 # umount cpuset
 # umount cpuset1

The cgroup_root should be destroyed but it isn't, I think.  We'd need
to bump cgroup_root's refcnt only when a new sb is created.  It's
kinda ugly.  Hmmm...

As for using specific type for ns tag, yeah, that'd be better
regardless of this.  The opaqueness is a bit extreme now.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [GIT PULL] cgroup changes for v3.15-rc1
       [not found]             ` <CA+55aFw2+VKojzoF2aDaruOMKHikTO8J0HeK9DoDLZr+HqTWbg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2014-04-03 19:43               ` Tejun Heo
  0 siblings, 0 replies; 36+ messages in thread
From: Tejun Heo @ 2014-04-03 19:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, David Miller, Linux Kernel Mailing List,
	Li Zefan, Linux Containers, cgroups

Hello,

On Thu, Apr 03, 2014 at 12:01:23PM -0700, Linus Torvalds wrote:
> [ Extending the participants list a bit ]
> 
> On Thu, Apr 3, 2014 at 11:34 AM, Tejun Heo <tj@kernel.org> wrote:
> > On the road so sending from phone. Iirc the param is necessary to
> > distinguishe when a new sb is created so that it can be put properly later.
> > I think cgroup is leaking super ref now and li was planning to send a fix
> > once things are merged.
> 
> So as far as I can tell, cgroup is fine, because the superblock itself
> is properly refcounted by the mounting code. It's the magic hidden

Ah, I remembered the other way around.  We could leak cgroup_root
reference, not the other way around.  cgroup_mount() can be called
multiple times for the same sb and we inc cgroup_root's ref each time
but cgroup_kill_sb() only happens when the sb is released, so if we do
the following,

 # mkdir cpuset cpuset1
 # mount -t cgroup -o cpuset cgroup cpuset
 # mount -t cgroup -o cpuset cgroup cpuset1
 # umount cpuset
 # umount cpuset1

The cgroup_root should be destroyed but it isn't, I think.  We'd need
to bump cgroup_root's refcnt only when a new sb is created.  It's
kinda ugly.  Hmmm...

As for using specific type for ns tag, yeah, that'd be better
regardless of this.  The opaqueness is a bit extreme now.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [GIT PULL] cgroup changes for v3.15-rc1
@ 2014-04-03 19:43               ` Tejun Heo
  0 siblings, 0 replies; 36+ messages in thread
From: Tejun Heo @ 2014-04-03 19:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, David Miller, Linux Kernel Mailing List,
	Li Zefan, Linux Containers, cgroups-u79uwXL29TY76Z2rM5mHXA

Hello,

On Thu, Apr 03, 2014 at 12:01:23PM -0700, Linus Torvalds wrote:
> [ Extending the participants list a bit ]
> 
> On Thu, Apr 3, 2014 at 11:34 AM, Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> > On the road so sending from phone. Iirc the param is necessary to
> > distinguishe when a new sb is created so that it can be put properly later.
> > I think cgroup is leaking super ref now and li was planning to send a fix
> > once things are merged.
> 
> So as far as I can tell, cgroup is fine, because the superblock itself
> is properly refcounted by the mounting code. It's the magic hidden

Ah, I remembered the other way around.  We could leak cgroup_root
reference, not the other way around.  cgroup_mount() can be called
multiple times for the same sb and we inc cgroup_root's ref each time
but cgroup_kill_sb() only happens when the sb is released, so if we do
the following,

 # mkdir cpuset cpuset1
 # mount -t cgroup -o cpuset cgroup cpuset
 # mount -t cgroup -o cpuset cgroup cpuset1
 # umount cpuset
 # umount cpuset1

The cgroup_root should be destroyed but it isn't, I think.  We'd need
to bump cgroup_root's refcnt only when a new sb is created.  It's
kinda ugly.  Hmmm...

As for using specific type for ns tag, yeah, that'd be better
regardless of this.  The opaqueness is a bit extreme now.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [GIT PULL] cgroup changes for v3.15-rc1
       [not found]               ` <20140403194335.GC2472-9pTldWuhBndy/B6EtB590w@public.gmane.org>
@ 2014-04-03 20:02                 ` Linus Torvalds
  2014-04-03 23:18                 ` Eric W. Biederman
  2014-04-04  9:14                   ` Li Zefan
  2 siblings, 0 replies; 36+ messages in thread
From: Linus Torvalds @ 2014-04-03 20:02 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Linux Containers, Linux Kernel Mailing List, Eric W. Biederman,
	cgroups-u79uwXL29TY76Z2rM5mHXA, David Miller

On Thu, Apr 3, 2014 at 12:43 PM, Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
>
> Ah, I remembered the other way around.  We could leak cgroup_root
> reference, not the other way around.  cgroup_mount() can be called
> multiple times for the same sb and we inc cgroup_root's ref each time
> but cgroup_kill_sb() only happens when the sb is released, so if we do
> the following,

Oh, Christ, I see what you are talking about.

That interface is all kinds of crazy.

> The cgroup_root should be destroyed but it isn't, I think.  We'd need
> to bump cgroup_root's refcnt only when a new sb is created.  It's
> kinda ugly.  Hmmm...

Ok, so I guess we can use that "new_sb_created" thing, and I'll redo
my merge resolution to reflect that. I do find this incredibly ugly.

              Linus

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [GIT PULL] cgroup changes for v3.15-rc1
       [not found]               ` <20140403194335.GC2472-9pTldWuhBndy/B6EtB590w@public.gmane.org>
@ 2014-04-03 20:02                 ` Linus Torvalds
  2014-04-03 23:18                 ` Eric W. Biederman
  2014-04-04  9:14                   ` Li Zefan
  2 siblings, 0 replies; 36+ messages in thread
From: Linus Torvalds @ 2014-04-03 20:02 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Eric W. Biederman, David Miller, Linux Kernel Mailing List,
	Li Zefan, Linux Containers, cgroups

On Thu, Apr 3, 2014 at 12:43 PM, Tejun Heo <tj@kernel.org> wrote:
>
> Ah, I remembered the other way around.  We could leak cgroup_root
> reference, not the other way around.  cgroup_mount() can be called
> multiple times for the same sb and we inc cgroup_root's ref each time
> but cgroup_kill_sb() only happens when the sb is released, so if we do
> the following,

Oh, Christ, I see what you are talking about.

That interface is all kinds of crazy.

> The cgroup_root should be destroyed but it isn't, I think.  We'd need
> to bump cgroup_root's refcnt only when a new sb is created.  It's
> kinda ugly.  Hmmm...

Ok, so I guess we can use that "new_sb_created" thing, and I'll redo
my merge resolution to reflect that. I do find this incredibly ugly.

              Linus

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [GIT PULL] cgroup changes for v3.15-rc1
@ 2014-04-03 20:02                 ` Linus Torvalds
  0 siblings, 0 replies; 36+ messages in thread
From: Linus Torvalds @ 2014-04-03 20:02 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Eric W. Biederman, David Miller, Linux Kernel Mailing List,
	Li Zefan, Linux Containers, cgroups-u79uwXL29TY76Z2rM5mHXA

On Thu, Apr 3, 2014 at 12:43 PM, Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
>
> Ah, I remembered the other way around.  We could leak cgroup_root
> reference, not the other way around.  cgroup_mount() can be called
> multiple times for the same sb and we inc cgroup_root's ref each time
> but cgroup_kill_sb() only happens when the sb is released, so if we do
> the following,

Oh, Christ, I see what you are talking about.

That interface is all kinds of crazy.

> The cgroup_root should be destroyed but it isn't, I think.  We'd need
> to bump cgroup_root's refcnt only when a new sb is created.  It's
> kinda ugly.  Hmmm...

Ok, so I guess we can use that "new_sb_created" thing, and I'll redo
my merge resolution to reflect that. I do find this incredibly ugly.

              Linus

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [GIT PULL] cgroup changes for v3.15-rc1
       [not found]               ` <20140403194335.GC2472-9pTldWuhBndy/B6EtB590w@public.gmane.org>
  2014-04-03 20:02                 ` Linus Torvalds
@ 2014-04-03 23:18                 ` Eric W. Biederman
  2014-04-04  9:14                   ` Li Zefan
  2 siblings, 0 replies; 36+ messages in thread
From: Eric W. Biederman @ 2014-04-03 23:18 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Linux Containers, Linux Kernel Mailing List,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Linus Torvalds, David Miller

Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> writes:

> Hello,
>
> On Thu, Apr 03, 2014 at 12:01:23PM -0700, Linus Torvalds wrote:
>> [ Extending the participants list a bit ]
>> 
> As for using specific type for ns tag, yeah, that'd be better
> regardless of this.  The opaqueness is a bit extreme now.

(The opaqueness has alwasy been silly but it was the compromise
 required to get the code merged.  Sigh)

With respect to cleaning up the namespace opaqueue pointers while
working on the sysctls I came up with a design that is quite a bit
more maintainable.

The basic idea in terms of kernfs is to have an array (sized 1 for now)
in the superblock of pointers to root kernfs_nodes.  Then you have a
nslink node type that holds the index of the kernfs_node pointer it is
associated with.  When following a nslink (which is unconditional
because they would not be visible to userspace) the appropriate root
kernfs_node is selected from the superblock and traversed to the same
path as the nslink.

The major benefit is that except for special cases in mounting (to
populate the extra root), and following the nslink there is no extra
logic to worry about or deal with.

Since that winds up using ordinary kernfs_nodes, which are already
reference counted in a well understood way we should be able to
completely remove passive refcount to the network namespace.

Since it has been clear what is needed for 3.15 and for resolving the
merge conflict I am not suggesting we do anything this merge window but
since we are talking about how to clean up warts in the existing design,
I figured I should mention it.

Eric

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [GIT PULL] cgroup changes for v3.15-rc1
       [not found]               ` <20140403194335.GC2472-9pTldWuhBndy/B6EtB590w@public.gmane.org>
@ 2014-04-03 23:18                 ` Eric W. Biederman
  2014-04-03 23:18                 ` Eric W. Biederman
  2014-04-04  9:14                   ` Li Zefan
  2 siblings, 0 replies; 36+ messages in thread
From: Eric W. Biederman @ 2014-04-03 23:18 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Linus Torvalds, David Miller, Linux Kernel Mailing List,
	Li Zefan, Linux Containers, cgroups

Tejun Heo <tj@kernel.org> writes:

> Hello,
>
> On Thu, Apr 03, 2014 at 12:01:23PM -0700, Linus Torvalds wrote:
>> [ Extending the participants list a bit ]
>> 
> As for using specific type for ns tag, yeah, that'd be better
> regardless of this.  The opaqueness is a bit extreme now.

(The opaqueness has alwasy been silly but it was the compromise
 required to get the code merged.  Sigh)

With respect to cleaning up the namespace opaqueue pointers while
working on the sysctls I came up with a design that is quite a bit
more maintainable.

The basic idea in terms of kernfs is to have an array (sized 1 for now)
in the superblock of pointers to root kernfs_nodes.  Then you have a
nslink node type that holds the index of the kernfs_node pointer it is
associated with.  When following a nslink (which is unconditional
because they would not be visible to userspace) the appropriate root
kernfs_node is selected from the superblock and traversed to the same
path as the nslink.

The major benefit is that except for special cases in mounting (to
populate the extra root), and following the nslink there is no extra
logic to worry about or deal with.

Since that winds up using ordinary kernfs_nodes, which are already
reference counted in a well understood way we should be able to
completely remove passive refcount to the network namespace.

Since it has been clear what is needed for 3.15 and for resolving the
merge conflict I am not suggesting we do anything this merge window but
since we are talking about how to clean up warts in the existing design,
I figured I should mention it.

Eric


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [GIT PULL] cgroup changes for v3.15-rc1
@ 2014-04-03 23:18                 ` Eric W. Biederman
  0 siblings, 0 replies; 36+ messages in thread
From: Eric W. Biederman @ 2014-04-03 23:18 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Linus Torvalds, David Miller, Linux Kernel Mailing List,
	Li Zefan, Linux Containers, cgroups-u79uwXL29TY76Z2rM5mHXA

Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> writes:

> Hello,
>
> On Thu, Apr 03, 2014 at 12:01:23PM -0700, Linus Torvalds wrote:
>> [ Extending the participants list a bit ]
>> 
> As for using specific type for ns tag, yeah, that'd be better
> regardless of this.  The opaqueness is a bit extreme now.

(The opaqueness has alwasy been silly but it was the compromise
 required to get the code merged.  Sigh)

With respect to cleaning up the namespace opaqueue pointers while
working on the sysctls I came up with a design that is quite a bit
more maintainable.

The basic idea in terms of kernfs is to have an array (sized 1 for now)
in the superblock of pointers to root kernfs_nodes.  Then you have a
nslink node type that holds the index of the kernfs_node pointer it is
associated with.  When following a nslink (which is unconditional
because they would not be visible to userspace) the appropriate root
kernfs_node is selected from the superblock and traversed to the same
path as the nslink.

The major benefit is that except for special cases in mounting (to
populate the extra root), and following the nslink there is no extra
logic to worry about or deal with.

Since that winds up using ordinary kernfs_nodes, which are already
reference counted in a well understood way we should be able to
completely remove passive refcount to the network namespace.

Since it has been clear what is needed for 3.15 and for resolving the
merge conflict I am not suggesting we do anything this merge window but
since we are talking about how to clean up warts in the existing design,
I figured I should mention it.

Eric

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [GIT PULL] cgroup changes for v3.15-rc1
  2014-04-03 19:43               ` Tejun Heo
@ 2014-04-04  9:14                   ` Li Zefan
  -1 siblings, 0 replies; 36+ messages in thread
From: Li Zefan @ 2014-04-04  9:14 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Linux Containers, Linux Kernel Mailing List, Eric W. Biederman,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Linus Torvalds, David Miller

On 2014/4/4 3:43, Tejun Heo wrote:
> Hello,
> 
> On Thu, Apr 03, 2014 at 12:01:23PM -0700, Linus Torvalds wrote:
>> [ Extending the participants list a bit ]
>>
>> On Thu, Apr 3, 2014 at 11:34 AM, Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
>>> On the road so sending from phone. Iirc the param is necessary to
>>> distinguishe when a new sb is created so that it can be put properly later.
>>> I think cgroup is leaking super ref now and li was planning to send a fix
>>> once things are merged.
>>
>> So as far as I can tell, cgroup is fine, because the superblock itself
>> is properly refcounted by the mounting code. It's the magic hidden
> 
> Ah, I remembered the other way around.  We could leak cgroup_root
> reference, not the other way around.  cgroup_mount() can be called
> multiple times for the same sb and we inc cgroup_root's ref each time
> but cgroup_kill_sb() only happens when the sb is released, so if we do
> the following,
> 
>  # mkdir cpuset cpuset1
>  # mount -t cgroup -o cpuset cgroup cpuset
>  # mount -t cgroup -o cpuset cgroup cpuset1
>  # umount cpuset
>  # umount cpuset1
> 
> The cgroup_root should be destroyed but it isn't, I think.  We'd need
> to bump cgroup_root's refcnt only when a new sb is created.

Yeah, I had been waiting for the kernfs change to be merged into mainline,
so I can fix this cgroup refcnting, and here is the fix.

====================

[PATCH] cgroup: fix top cgroup refcnt leak

As mount() and kill_sb() is not a one-to-one match, If we mount the same
cgroupfs in serveral mount points, and then umount all of them, kill_sb()
will be called only once.

Try:
        # mount -t cgroup -o cpuacct xxx /cgroup
        # mount -t cgroup -o cpuacct xxx /cgroup2
        # cat /proc/cgroups | grep cpuacct
        cpuacct 2       1       1
        # umount /cgroup
        # umount /cgroup2
        # cat /proc/cgroups | grep cpuacct
        cpuacct 2       1       1

You'll see cgroupfs will never be freed.

Signed-off-by: Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
---
 kernel/cgroup.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 2189462..48bd9c9 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1487,6 +1487,7 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
 	struct cgroup_sb_opts opts;
 	struct dentry *dentry;
 	int ret;
+	bool new_sb;
 
 	/*
 	 * The first time anyone tries to mount a cgroup, enable the list
@@ -1603,8 +1604,8 @@ out_unlock:
 	if (ret)
 		return ERR_PTR(ret);
 
-	dentry = kernfs_mount(fs_type, flags, root->kf_root, NULL);
-	if (IS_ERR(dentry))
+	dentry = kernfs_mount(fs_type, flags, root->kf_root, &new_sb);
+	if (IS_ERR(dentry) || !new_sb)
 		cgroup_put(&root->cgrp);
 	return dentry;
 }
-- 
1.8.0.2

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [GIT PULL] cgroup changes for v3.15-rc1
@ 2014-04-04  9:14                   ` Li Zefan
  0 siblings, 0 replies; 36+ messages in thread
From: Li Zefan @ 2014-04-04  9:14 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Linus Torvalds, Eric W. Biederman, David Miller,
	Linux Kernel Mailing List, Linux Containers, cgroups

On 2014/4/4 3:43, Tejun Heo wrote:
> Hello,
> 
> On Thu, Apr 03, 2014 at 12:01:23PM -0700, Linus Torvalds wrote:
>> [ Extending the participants list a bit ]
>>
>> On Thu, Apr 3, 2014 at 11:34 AM, Tejun Heo <tj@kernel.org> wrote:
>>> On the road so sending from phone. Iirc the param is necessary to
>>> distinguishe when a new sb is created so that it can be put properly later.
>>> I think cgroup is leaking super ref now and li was planning to send a fix
>>> once things are merged.
>>
>> So as far as I can tell, cgroup is fine, because the superblock itself
>> is properly refcounted by the mounting code. It's the magic hidden
> 
> Ah, I remembered the other way around.  We could leak cgroup_root
> reference, not the other way around.  cgroup_mount() can be called
> multiple times for the same sb and we inc cgroup_root's ref each time
> but cgroup_kill_sb() only happens when the sb is released, so if we do
> the following,
> 
>  # mkdir cpuset cpuset1
>  # mount -t cgroup -o cpuset cgroup cpuset
>  # mount -t cgroup -o cpuset cgroup cpuset1
>  # umount cpuset
>  # umount cpuset1
> 
> The cgroup_root should be destroyed but it isn't, I think.  We'd need
> to bump cgroup_root's refcnt only when a new sb is created.

Yeah, I had been waiting for the kernfs change to be merged into mainline,
so I can fix this cgroup refcnting, and here is the fix.

====================

[PATCH] cgroup: fix top cgroup refcnt leak

As mount() and kill_sb() is not a one-to-one match, If we mount the same
cgroupfs in serveral mount points, and then umount all of them, kill_sb()
will be called only once.

Try:
        # mount -t cgroup -o cpuacct xxx /cgroup
        # mount -t cgroup -o cpuacct xxx /cgroup2
        # cat /proc/cgroups | grep cpuacct
        cpuacct 2       1       1
        # umount /cgroup
        # umount /cgroup2
        # cat /proc/cgroups | grep cpuacct
        cpuacct 2       1       1

You'll see cgroupfs will never be freed.

Signed-off-by: Li Zefan <lizefan@huawei.com>
---
 kernel/cgroup.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 2189462..48bd9c9 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1487,6 +1487,7 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
 	struct cgroup_sb_opts opts;
 	struct dentry *dentry;
 	int ret;
+	bool new_sb;
 
 	/*
 	 * The first time anyone tries to mount a cgroup, enable the list
@@ -1603,8 +1604,8 @@ out_unlock:
 	if (ret)
 		return ERR_PTR(ret);
 
-	dentry = kernfs_mount(fs_type, flags, root->kf_root, NULL);
-	if (IS_ERR(dentry))
+	dentry = kernfs_mount(fs_type, flags, root->kf_root, &new_sb);
+	if (IS_ERR(dentry) || !new_sb)
 		cgroup_put(&root->cgrp);
 	return dentry;
 }
-- 
1.8.0.2



^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [GIT PULL] cgroup changes for v3.15-rc1
  2014-04-03 20:02                 ` Linus Torvalds
@ 2014-04-04 12:03                     ` Tejun Heo
  -1 siblings, 0 replies; 36+ messages in thread
From: Tejun Heo @ 2014-04-04 12:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Containers, Linux Kernel Mailing List, Eric W. Biederman,
	cgroups-u79uwXL29TY76Z2rM5mHXA, David Miller

On Thu, Apr 03, 2014 at 01:02:45PM -0700, Linus Torvalds wrote:
> > The cgroup_root should be destroyed but it isn't, I think.  We'd need
> > to bump cgroup_root's refcnt only when a new sb is created.  It's
> > kinda ugly.  Hmmm...
> 
> Ok, so I guess we can use that "new_sb_created" thing, and I'll redo
> my merge resolution to reflect that. I do find this incredibly ugly.

I apparently missed the issue and designed the interface without
considering this ugliness.  Maybe kernfs could be made to wrap rather
than providing mount/kill_sb() functions and hide details about sb or
we can simply add fstype->umount() so that there's symmetry; however,
the problem is kernfs-specific and other kernfs users would have
single static backing store and won't need to care about this, so, for
now, I think what it's a ugly but acceptable compromise.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [GIT PULL] cgroup changes for v3.15-rc1
@ 2014-04-04 12:03                     ` Tejun Heo
  0 siblings, 0 replies; 36+ messages in thread
From: Tejun Heo @ 2014-04-04 12:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, David Miller, Linux Kernel Mailing List,
	Li Zefan, Linux Containers, cgroups

On Thu, Apr 03, 2014 at 01:02:45PM -0700, Linus Torvalds wrote:
> > The cgroup_root should be destroyed but it isn't, I think.  We'd need
> > to bump cgroup_root's refcnt only when a new sb is created.  It's
> > kinda ugly.  Hmmm...
> 
> Ok, so I guess we can use that "new_sb_created" thing, and I'll redo
> my merge resolution to reflect that. I do find this incredibly ugly.

I apparently missed the issue and designed the interface without
considering this ugliness.  Maybe kernfs could be made to wrap rather
than providing mount/kill_sb() functions and hide details about sb or
we can simply add fstype->umount() so that there's symmetry; however,
the problem is kernfs-specific and other kernfs users would have
single static backing store and won't need to care about this, so, for
now, I think what it's a ugly but acceptable compromise.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [GIT PULL] cgroup changes for v3.15-rc1
  2014-04-04  9:14                   ` Li Zefan
@ 2014-04-04 12:22                       ` Tejun Heo
  -1 siblings, 0 replies; 36+ messages in thread
From: Tejun Heo @ 2014-04-04 12:22 UTC (permalink / raw)
  To: Li Zefan
  Cc: Linux Containers, Linux Kernel Mailing List, Eric W. Biederman,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Linus Torvalds, David Miller

On Fri, Apr 04, 2014 at 05:14:41PM +0800, Li Zefan wrote:
> [PATCH] cgroup: fix top cgroup refcnt leak
> 
> As mount() and kill_sb() is not a one-to-one match, If we mount the same
> cgroupfs in serveral mount points, and then umount all of them, kill_sb()
> will be called only once.
> 
> Try:
>         # mount -t cgroup -o cpuacct xxx /cgroup
>         # mount -t cgroup -o cpuacct xxx /cgroup2
>         # cat /proc/cgroups | grep cpuacct
>         cpuacct 2       1       1
>         # umount /cgroup
>         # umount /cgroup2
>         # cat /proc/cgroups | grep cpuacct
>         cpuacct 2       1       1
> 
> You'll see cgroupfs will never be freed.
> 
> Signed-off-by: Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>

Applied to cgroup/for-3.15-fixes.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [GIT PULL] cgroup changes for v3.15-rc1
@ 2014-04-04 12:22                       ` Tejun Heo
  0 siblings, 0 replies; 36+ messages in thread
From: Tejun Heo @ 2014-04-04 12:22 UTC (permalink / raw)
  To: Li Zefan
  Cc: Linus Torvalds, Eric W. Biederman, David Miller,
	Linux Kernel Mailing List, Linux Containers, cgroups

On Fri, Apr 04, 2014 at 05:14:41PM +0800, Li Zefan wrote:
> [PATCH] cgroup: fix top cgroup refcnt leak
> 
> As mount() and kill_sb() is not a one-to-one match, If we mount the same
> cgroupfs in serveral mount points, and then umount all of them, kill_sb()
> will be called only once.
> 
> Try:
>         # mount -t cgroup -o cpuacct xxx /cgroup
>         # mount -t cgroup -o cpuacct xxx /cgroup2
>         # cat /proc/cgroups | grep cpuacct
>         cpuacct 2       1       1
>         # umount /cgroup
>         # umount /cgroup2
>         # cat /proc/cgroups | grep cpuacct
>         cpuacct 2       1       1
> 
> You'll see cgroupfs will never be freed.
> 
> Signed-off-by: Li Zefan <lizefan@huawei.com>

Applied to cgroup/for-3.15-fixes.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [GIT PULL] cgroup changes for v3.15-rc1
  2014-04-03 16:49 ` Tejun Heo
@ 2014-04-05  1:06     ` Linus Torvalds
  -1 siblings, 0 replies; 36+ messages in thread
From: Linus Torvalds @ 2014-04-05  1:06 UTC (permalink / raw)
  To: Tejun Heo
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, Linux Containers,
	Linux Kernel Mailing List

On Thu, Apr 3, 2014 at 9:49 AM, Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
>
> A lot updates for cgroup [...]

Btw, just a heads up to let you know: I'm in the middle of bisecting
why my machine no longer reboots cleanly, and while I'm a few boots
away from pinpointing it exactly, it has now drilled down to the point
that the only remaining commits are from your cgroup pull that I
pulled yesterday.

Maybe I screwed something up in the bisection and it's not 100%
reliable, but I don't think so.

Will report back soon when I've narrowed it down more.

               Linus

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [GIT PULL] cgroup changes for v3.15-rc1
@ 2014-04-05  1:06     ` Linus Torvalds
  0 siblings, 0 replies; 36+ messages in thread
From: Linus Torvalds @ 2014-04-05  1:06 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Linux Kernel Mailing List, Li Zefan, Linux Containers, cgroups

On Thu, Apr 3, 2014 at 9:49 AM, Tejun Heo <tj@kernel.org> wrote:
>
> A lot updates for cgroup [...]

Btw, just a heads up to let you know: I'm in the middle of bisecting
why my machine no longer reboots cleanly, and while I'm a few boots
away from pinpointing it exactly, it has now drilled down to the point
that the only remaining commits are from your cgroup pull that I
pulled yesterday.

Maybe I screwed something up in the bisection and it's not 100%
reliable, but I don't think so.

Will report back soon when I've narrowed it down more.

               Linus

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [GIT PULL] cgroup changes for v3.15-rc1
       [not found]     ` <CA+55aFyGz2q4iDEt56Bw+x_ri_-DQYLAKPJvk3Uxj9z-y+0uPw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2014-04-05  1:11       ` Linus Torvalds
  0 siblings, 0 replies; 36+ messages in thread
From: Linus Torvalds @ 2014-04-05  1:11 UTC (permalink / raw)
  To: Tejun Heo
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, Linux Containers,
	Linux Kernel Mailing List

On Fri, Apr 4, 2014 at 6:06 PM, Linus Torvalds
<torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> wrote:
>
> Will report back soon when I've narrowed it down more.

Ok, that may end up being harder than it looked. Commit
a755180bab81c038a6989d7ab746c702f1b3ec03 doesn't boot for me at all,
so apparently there are some really broken points in that whole
development chain.

I'll try to pick kernels away from that, but that makes bisection much
less effective.

                Linus

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [GIT PULL] cgroup changes for v3.15-rc1
       [not found]     ` <CA+55aFyGz2q4iDEt56Bw+x_ri_-DQYLAKPJvk3Uxj9z-y+0uPw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2014-04-05  1:11       ` Linus Torvalds
  0 siblings, 0 replies; 36+ messages in thread
From: Linus Torvalds @ 2014-04-05  1:11 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Linux Kernel Mailing List, Li Zefan, Linux Containers, cgroups

On Fri, Apr 4, 2014 at 6:06 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Will report back soon when I've narrowed it down more.

Ok, that may end up being harder than it looked. Commit
a755180bab81c038a6989d7ab746c702f1b3ec03 doesn't boot for me at all,
so apparently there are some really broken points in that whole
development chain.

I'll try to pick kernels away from that, but that makes bisection much
less effective.

                Linus

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [GIT PULL] cgroup changes for v3.15-rc1
@ 2014-04-05  1:11       ` Linus Torvalds
  0 siblings, 0 replies; 36+ messages in thread
From: Linus Torvalds @ 2014-04-05  1:11 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Linux Kernel Mailing List, Li Zefan, Linux Containers,
	cgroups-u79uwXL29TY76Z2rM5mHXA

On Fri, Apr 4, 2014 at 6:06 PM, Linus Torvalds
<torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> wrote:
>
> Will report back soon when I've narrowed it down more.

Ok, that may end up being harder than it looked. Commit
a755180bab81c038a6989d7ab746c702f1b3ec03 doesn't boot for me at all,
so apparently there are some really broken points in that whole
development chain.

I'll try to pick kernels away from that, but that makes bisection much
less effective.

                Linus

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [GIT PULL] cgroup changes for v3.15-rc1
  2014-04-05  1:11       ` Linus Torvalds
@ 2014-04-05  1:34           ` Linus Torvalds
  -1 siblings, 0 replies; 36+ messages in thread
From: Linus Torvalds @ 2014-04-05  1:34 UTC (permalink / raw)
  To: Tejun Heo
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, Linux Containers,
	Linux Kernel Mailing List

On Fri, Apr 4, 2014 at 6:11 PM, Linus Torvalds
<torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> wrote:
>
> I'll try to pick kernels away from that, but that makes bisection much
> less effective.

Ok, I think I'm away from the broken region. The problem seems to be in between

good: 8e30e2b8ba0e ("cgroup: restructure locking and error handling in
cgroup_mount()")

bad: 6f30558f37bf ("cgroup: make cgroup hold onto its kernfs_node")

and let's see what bisect says about the rest.. Does that give you any ideas?

            Linus

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [GIT PULL] cgroup changes for v3.15-rc1
@ 2014-04-05  1:34           ` Linus Torvalds
  0 siblings, 0 replies; 36+ messages in thread
From: Linus Torvalds @ 2014-04-05  1:34 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Linux Kernel Mailing List, Li Zefan, Linux Containers, cgroups

On Fri, Apr 4, 2014 at 6:11 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I'll try to pick kernels away from that, but that makes bisection much
> less effective.

Ok, I think I'm away from the broken region. The problem seems to be in between

good: 8e30e2b8ba0e ("cgroup: restructure locking and error handling in
cgroup_mount()")

bad: 6f30558f37bf ("cgroup: make cgroup hold onto its kernfs_node")

and let's see what bisect says about the rest.. Does that give you any ideas?

            Linus

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [GIT PULL] cgroup changes for v3.15-rc1
       [not found]       ` <CA+55aFy+Jt0B32kj=Ldz9Krp4DXcXSP4r-o3f-2LPcVs4udGVA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2014-04-05  1:34           ` Linus Torvalds
@ 2014-04-06 14:31         ` Markus Trippelsdorf
  1 sibling, 0 replies; 36+ messages in thread
From: Markus Trippelsdorf @ 2014-04-06 14:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Containers, Linux Kernel Mailing List, Andy Lutomirski,
	Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA

On 2014.04.04 at 18:11 -0700, Linus Torvalds wrote:
> On Fri, Apr 4, 2014 at 6:06 PM, Linus Torvalds
> <torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> wrote:
> >
> > Will report back soon when I've narrowed it down more.
> 
> Ok, that may end up being harder than it looked. Commit
> a755180bab81c038a6989d7ab746c702f1b3ec03 doesn't boot for me at all,
> so apparently there are some really broken points in that whole
> development chain.

FWIW I've seen similar boot failures, that started during this merge
window, on my machine. I first thought they might be linker related so I
opened a bug on: https://sourceware.org/bugzilla/show_bug.cgi?id=16788
But if I remove e.g. a single printk the bug goes away...
Unfortunately the same thing happens if I enable more debugging options,
so I haven't tracked the issue down yet.

-- 
Markus

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [GIT PULL] cgroup changes for v3.15-rc1
       [not found]       ` <CA+55aFy+Jt0B32kj=Ldz9Krp4DXcXSP4r-o3f-2LPcVs4udGVA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2014-04-06 14:31         ` Markus Trippelsdorf
  2014-04-06 14:31         ` [GIT PULL] cgroup changes for v3.15-rc1 Markus Trippelsdorf
  1 sibling, 0 replies; 36+ messages in thread
From: Markus Trippelsdorf @ 2014-04-06 14:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Tejun Heo, Linux Kernel Mailing List, Li Zefan, Linux Containers,
	cgroups, Andy Lutomirski

On 2014.04.04 at 18:11 -0700, Linus Torvalds wrote:
> On Fri, Apr 4, 2014 at 6:06 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Will report back soon when I've narrowed it down more.
> 
> Ok, that may end up being harder than it looked. Commit
> a755180bab81c038a6989d7ab746c702f1b3ec03 doesn't boot for me at all,
> so apparently there are some really broken points in that whole
> development chain.

FWIW I've seen similar boot failures, that started during this merge
window, on my machine. I first thought they might be linker related so I
opened a bug on: https://sourceware.org/bugzilla/show_bug.cgi?id=16788
But if I remove e.g. a single printk the bug goes away...
Unfortunately the same thing happens if I enable more debugging options,
so I haven't tracked the issue down yet.

-- 
Markus

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [GIT PULL] cgroup changes for v3.15-rc1
@ 2014-04-06 14:31         ` Markus Trippelsdorf
  0 siblings, 0 replies; 36+ messages in thread
From: Markus Trippelsdorf @ 2014-04-06 14:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Tejun Heo, Linux Kernel Mailing List, Li Zefan, Linux Containers,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Andy Lutomirski

On 2014.04.04 at 18:11 -0700, Linus Torvalds wrote:
> On Fri, Apr 4, 2014 at 6:06 PM, Linus Torvalds
> <torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> wrote:
> >
> > Will report back soon when I've narrowed it down more.
> 
> Ok, that may end up being harder than it looked. Commit
> a755180bab81c038a6989d7ab746c702f1b3ec03 doesn't boot for me at all,
> so apparently there are some really broken points in that whole
> development chain.

FWIW I've seen similar boot failures, that started during this merge
window, on my machine. I first thought they might be linker related so I
opened a bug on: https://sourceware.org/bugzilla/show_bug.cgi?id=16788
But if I remove e.g. a single printk the bug goes away...
Unfortunately the same thing happens if I enable more debugging options,
so I haven't tracked the issue down yet.

-- 
Markus

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH cgroup/for-3.15-fixes] cgroup: newly created dirs and files should be owned by the creator
  2014-04-05  1:34           ` Linus Torvalds
@ 2014-04-07 20:59               ` Tejun Heo
  -1 siblings, 0 replies; 36+ messages in thread
From: Tejun Heo @ 2014-04-07 20:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, Linux Containers,
	Linux Kernel Mailing List

Applied to cgroup/for-3.15.  Will soon send pull request for this one
and the cgroup_root refcnt fix from Li.

Thanks.

------ 8< ------
From 49957f8e2a43035a97d05bddefa394492a969c0d Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Date: Mon, 7 Apr 2014 16:44:47 -0400

While converting cgroup to kernfs, 2bd59d48ebfb ("cgroup: convert to
kernfs") accidentally dropped the logic which makes newly created
cgroup dirs and files owned by the current uid / gid.  This broke
cases where cgroup subtree management is delegated to !root as the sub
manager wouldn't be able to create more than single level of hierarchy
or put tasks into child cgroups it created.

Among other things, this breaks user session management in systemd and
one of the symptoms was 90s hang during shutdown.  User session
systemd running as the user creates a sub-service to initiate shutdown
and tries to put kill(1) into it but fails because cgroup.procs is
owned by root.  This leads to 90s hang during shutdown.

Implement cgroup_kn_set_ugid() which sets a kn's uid and gid to those
of the caller and use it from file and dir creation paths.

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Reported-by: Linus Torvalds <torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
---
 kernel/cgroup.c | 27 ++++++++++++++++++++++++++-
 1 file changed, 26 insertions(+), 1 deletion(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 0dfc732..9fcdaa7 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2346,11 +2346,26 @@ static int cgroup_rename(struct kernfs_node *kn, struct kernfs_node *new_parent,
 	return ret;
 }
 
+/* set uid and gid of cgroup dirs and files to that of the creator */
+static int cgroup_kn_set_ugid(struct kernfs_node *kn)
+{
+	struct iattr iattr = { .ia_valid = ATTR_UID | ATTR_GID,
+			       .ia_uid = current_fsuid(),
+			       .ia_gid = current_fsgid(), };
+
+	if (uid_eq(iattr.ia_uid, GLOBAL_ROOT_UID) &&
+	    gid_eq(iattr.ia_gid, GLOBAL_ROOT_GID))
+		return 0;
+
+	return kernfs_setattr(kn, &iattr);
+}
+
 static int cgroup_add_file(struct cgroup *cgrp, struct cftype *cft)
 {
 	char name[CGROUP_FILE_NAME_MAX];
 	struct kernfs_node *kn;
 	struct lock_class_key *key = NULL;
+	int ret;
 
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 	key = &cft->lockdep_key;
@@ -2358,7 +2373,13 @@ static int cgroup_add_file(struct cgroup *cgrp, struct cftype *cft)
 	kn = __kernfs_create_file(cgrp->kn, cgroup_file_name(cgrp, cft, name),
 				  cgroup_file_mode(cft), 0, cft->kf_ops, cft,
 				  NULL, false, key);
-	return PTR_ERR_OR_ZERO(kn);
+	if (IS_ERR(kn))
+		return PTR_ERR(kn);
+
+	ret = cgroup_kn_set_ugid(kn);
+	if (ret)
+		kernfs_remove(kn);
+	return ret;
 }
 
 /**
@@ -3753,6 +3774,10 @@ static long cgroup_create(struct cgroup *parent, const char *name,
 	 */
 	idr_replace(&root->cgroup_idr, cgrp, cgrp->id);
 
+	err = cgroup_kn_set_ugid(kn);
+	if (err)
+		goto err_destroy;
+
 	err = cgroup_addrm_files(cgrp, cgroup_base_files, true);
 	if (err)
 		goto err_destroy;
-- 
1.9.0

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH cgroup/for-3.15-fixes] cgroup: newly created dirs and files should be owned by the creator
@ 2014-04-07 20:59               ` Tejun Heo
  0 siblings, 0 replies; 36+ messages in thread
From: Tejun Heo @ 2014-04-07 20:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, Li Zefan, Linux Containers, cgroups

Applied to cgroup/for-3.15.  Will soon send pull request for this one
and the cgroup_root refcnt fix from Li.

Thanks.

------ 8< ------
>From 49957f8e2a43035a97d05bddefa394492a969c0d Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Mon, 7 Apr 2014 16:44:47 -0400

While converting cgroup to kernfs, 2bd59d48ebfb ("cgroup: convert to
kernfs") accidentally dropped the logic which makes newly created
cgroup dirs and files owned by the current uid / gid.  This broke
cases where cgroup subtree management is delegated to !root as the sub
manager wouldn't be able to create more than single level of hierarchy
or put tasks into child cgroups it created.

Among other things, this breaks user session management in systemd and
one of the symptoms was 90s hang during shutdown.  User session
systemd running as the user creates a sub-service to initiate shutdown
and tries to put kill(1) into it but fails because cgroup.procs is
owned by root.  This leads to 90s hang during shutdown.

Implement cgroup_kn_set_ugid() which sets a kn's uid and gid to those
of the caller and use it from file and dir creation paths.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 kernel/cgroup.c | 27 ++++++++++++++++++++++++++-
 1 file changed, 26 insertions(+), 1 deletion(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 0dfc732..9fcdaa7 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2346,11 +2346,26 @@ static int cgroup_rename(struct kernfs_node *kn, struct kernfs_node *new_parent,
 	return ret;
 }
 
+/* set uid and gid of cgroup dirs and files to that of the creator */
+static int cgroup_kn_set_ugid(struct kernfs_node *kn)
+{
+	struct iattr iattr = { .ia_valid = ATTR_UID | ATTR_GID,
+			       .ia_uid = current_fsuid(),
+			       .ia_gid = current_fsgid(), };
+
+	if (uid_eq(iattr.ia_uid, GLOBAL_ROOT_UID) &&
+	    gid_eq(iattr.ia_gid, GLOBAL_ROOT_GID))
+		return 0;
+
+	return kernfs_setattr(kn, &iattr);
+}
+
 static int cgroup_add_file(struct cgroup *cgrp, struct cftype *cft)
 {
 	char name[CGROUP_FILE_NAME_MAX];
 	struct kernfs_node *kn;
 	struct lock_class_key *key = NULL;
+	int ret;
 
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 	key = &cft->lockdep_key;
@@ -2358,7 +2373,13 @@ static int cgroup_add_file(struct cgroup *cgrp, struct cftype *cft)
 	kn = __kernfs_create_file(cgrp->kn, cgroup_file_name(cgrp, cft, name),
 				  cgroup_file_mode(cft), 0, cft->kf_ops, cft,
 				  NULL, false, key);
-	return PTR_ERR_OR_ZERO(kn);
+	if (IS_ERR(kn))
+		return PTR_ERR(kn);
+
+	ret = cgroup_kn_set_ugid(kn);
+	if (ret)
+		kernfs_remove(kn);
+	return ret;
 }
 
 /**
@@ -3753,6 +3774,10 @@ static long cgroup_create(struct cgroup *parent, const char *name,
 	 */
 	idr_replace(&root->cgroup_idr, cgrp, cgrp->id);
 
+	err = cgroup_kn_set_ugid(kn);
+	if (err)
+		goto err_destroy;
+
 	err = cgroup_addrm_files(cgrp, cgroup_base_files, true);
 	if (err)
 		goto err_destroy;
-- 
1.9.0


^ permalink raw reply related	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2014-04-07 20:59 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-04-03 16:49 [GIT PULL] cgroup changes for v3.15-rc1 Tejun Heo
2014-04-03 16:49 ` Tejun Heo
2014-04-03 18:11 ` Linus Torvalds
2014-04-03 18:11   ` Linus Torvalds
     [not found]   ` <CA+55aFy1dm_CoTv+tihQ+m54G3gR9siOngurHsuKf9W3k1sKKw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-04-03 18:24     ` Linus Torvalds
2014-04-03 18:24       ` Linus Torvalds
     [not found]       ` <CAOS58YMckmoCocguf9BC_Wxbn3D2Rx3MArhgozO9qCj4g=5aDw@mail.gmail.com>
     [not found]         ` <CAOS58YMckmoCocguf9BC_Wxbn3D2Rx3MArhgozO9qCj4g=5aDw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-04-03 19:01           ` Linus Torvalds
2014-04-03 19:01             ` Linus Torvalds
     [not found]             ` <CA+55aFw2+VKojzoF2aDaruOMKHikTO8J0HeK9DoDLZr+HqTWbg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-04-03 19:43               ` Tejun Heo
2014-04-03 19:43             ` Tejun Heo
2014-04-03 19:43               ` Tejun Heo
2014-04-03 20:02               ` Linus Torvalds
2014-04-03 20:02                 ` Linus Torvalds
     [not found]                 ` <CA+55aFw+jFdmmi6eZ6+RpLM9heyowfXsZxb6Y91GNkPU677PmA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-04-04 12:03                   ` Tejun Heo
2014-04-04 12:03                     ` Tejun Heo
2014-04-03 23:18               ` Eric W. Biederman
2014-04-03 23:18                 ` Eric W. Biederman
     [not found]               ` <20140403194335.GC2472-9pTldWuhBndy/B6EtB590w@public.gmane.org>
2014-04-03 20:02                 ` Linus Torvalds
2014-04-03 23:18                 ` Eric W. Biederman
2014-04-04  9:14                 ` Li Zefan
2014-04-04  9:14                   ` Li Zefan
     [not found]                   ` <533E7801.8090604-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
2014-04-04 12:22                     ` Tejun Heo
2014-04-04 12:22                       ` Tejun Heo
     [not found] ` <20140403164911.GE24119-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
2014-04-03 18:11   ` Linus Torvalds
2014-04-05  1:06   ` Linus Torvalds
2014-04-05  1:06     ` Linus Torvalds
2014-04-05  1:11     ` Linus Torvalds
2014-04-05  1:11       ` Linus Torvalds
     [not found]       ` <CA+55aFy+Jt0B32kj=Ldz9Krp4DXcXSP4r-o3f-2LPcVs4udGVA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-04-05  1:34         ` Linus Torvalds
2014-04-05  1:34           ` Linus Torvalds
     [not found]           ` <CA+55aFwM0aG1PHY7xOQbsFG+Ot4mL4=7yRnT7dssUaOyaLQ3GQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-04-07 20:59             ` [PATCH cgroup/for-3.15-fixes] cgroup: newly created dirs and files should be owned by the creator Tejun Heo
2014-04-07 20:59               ` Tejun Heo
2014-04-06 14:31         ` [GIT PULL] cgroup changes for v3.15-rc1 Markus Trippelsdorf
2014-04-06 14:31       ` Markus Trippelsdorf
2014-04-06 14:31         ` Markus Trippelsdorf
     [not found]     ` <CA+55aFyGz2q4iDEt56Bw+x_ri_-DQYLAKPJvk3Uxj9z-y+0uPw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-04-05  1:11       ` Linus Torvalds

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.