All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC 0/6] tracking fd counts per cgroup
@ 2023-11-08  0:26 Tycho Andersen
  2023-11-08  0:26 ` [RFC 1/6] fs: count_open_files() -> count_possible_open_files() Tycho Andersen
                   ` (6 more replies)
  0 siblings, 7 replies; 13+ messages in thread
From: Tycho Andersen @ 2023-11-08  0:26 UTC (permalink / raw)
  To: cgroups, linux-fsdevel
  Cc: Christian Brauner, Tejun Heo, Zefan Li, Johannes Weiner,
	Haitao Huang, Kamalesh Babulal, Tycho Andersen

Hi all,

At Netflix, we have a "canary" framework that will run test versions of
an application and automatically detect anomalies in various metrics. We
also have two "fleets", one full of virtual machines, and one that is a
multi-tenant container environment.

On our full-VM fleet, one of the metrics we analyze is the number of open file
descriptors, from /proc/sys/fs/file-nr. However, no equivalent exists for the
multi-tenant container fleet, which has lead to several production issues.

This idea is not new of course [1], but hopefully the existence of the new misc
cgroup will make it more tenable.

I'm not really tied to any of the semantics in this series (e.g. threads could
be double counted even with a shared table), and am open to implementing this
in other ways if it makes more sense.

Thoughts welcome,

Tycho

[1]: https://lore.kernel.org/all/1404311407-4851-1-git-send-email-merimus@google.com/

Tycho Andersen (6):
  fs: count_open_files() -> count_possible_open_files()
  fs: introduce count_open_files()
  misc: introduce misc_cg_charge()
  misc cgroup: introduce an fd counter
  selftests/cgroup: add a flags arg to clone_into_cgroup()
  selftests/cgroup: add a test for misc cgroup

 fs/file.c                                    |  82 +++-
 include/linux/fdtable.h                      |   6 +
 include/linux/misc_cgroup.h                  |   2 +
 kernel/cgroup/misc.c                         | 232 ++++++++++-
 tools/testing/selftests/cgroup/.gitignore    |   1 +
 tools/testing/selftests/cgroup/Makefile      |   2 +
 tools/testing/selftests/cgroup/cgroup_util.c |   8 +-
 tools/testing/selftests/cgroup/cgroup_util.h |   2 +-
 tools/testing/selftests/cgroup/test_core.c   |   4 +-
 tools/testing/selftests/cgroup/test_misc.c   | 385 +++++++++++++++++++
 10 files changed, 712 insertions(+), 12 deletions(-)
 create mode 100644 tools/testing/selftests/cgroup/test_misc.c


base-commit: 13d88ac54ddd1011b6e94443958e798aa06eb835
-- 
2.34.1


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [RFC 1/6] fs: count_open_files() -> count_possible_open_files()
  2023-11-08  0:26 [RFC 0/6] tracking fd counts per cgroup Tycho Andersen
@ 2023-11-08  0:26 ` Tycho Andersen
  2023-11-08  0:26 ` [RFC 2/6] fs: introduce count_open_files() Tycho Andersen
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 13+ messages in thread
From: Tycho Andersen @ 2023-11-08  0:26 UTC (permalink / raw)
  To: cgroups, linux-fsdevel
  Cc: Christian Brauner, Tejun Heo, Zefan Li, Johannes Weiner,
	Haitao Huang, Kamalesh Babulal, Tycho Andersen, Tycho Andersen

From: Tycho Andersen <tandersen@netflix.com>

This doesn't really count the number of open files, but the max possible
number of open files in an fd table.

In the next patch, we're going to introduce a helper to compute this number
exactly, so rename this one. (It is only called in one place, maybe we
should just inline it?)

Signed-off-by: Tycho Andersen <tandersen@netflix.com>
---
 fs/file.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index 5fb0b146e79e..b1633c00bd3c 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -271,7 +271,7 @@ static inline void __clear_open_fd(unsigned int fd, struct fdtable *fdt)
 	__clear_bit(fd / BITS_PER_LONG, fdt->full_fds_bits);
 }
 
-static unsigned int count_open_files(struct fdtable *fdt)
+static unsigned int count_possible_open_files(struct fdtable *fdt)
 {
 	unsigned int size = fdt->max_fds;
 	unsigned int i;
@@ -302,7 +302,7 @@ static unsigned int sane_fdtable_size(struct fdtable *fdt, unsigned int max_fds)
 {
 	unsigned int count;
 
-	count = count_open_files(fdt);
+	count = count_possible_open_files(fdt);
 	if (max_fds < NR_OPEN_DEFAULT)
 		max_fds = NR_OPEN_DEFAULT;
 	return ALIGN(min(count, max_fds), BITS_PER_LONG);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [RFC 2/6] fs: introduce count_open_files()
  2023-11-08  0:26 [RFC 0/6] tracking fd counts per cgroup Tycho Andersen
  2023-11-08  0:26 ` [RFC 1/6] fs: count_open_files() -> count_possible_open_files() Tycho Andersen
@ 2023-11-08  0:26 ` Tycho Andersen
  2023-11-08  0:26 ` [RFC 3/6] misc: introduce misc_cg_charge() Tycho Andersen
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 13+ messages in thread
From: Tycho Andersen @ 2023-11-08  0:26 UTC (permalink / raw)
  To: cgroups, linux-fsdevel
  Cc: Christian Brauner, Tejun Heo, Zefan Li, Johannes Weiner,
	Haitao Huang, Kamalesh Babulal, Tycho Andersen, Tycho Andersen

From: Tycho Andersen <tandersen@netflix.com>

In future patches, we'll need a count of the number of open file
descriptors for misc NOFILE cgroup migration, so introduce a helper to do
this.

Signed-off-by: Tycho Andersen <tandersen@netflix.com>
---
 fs/file.c               | 10 ++++++++++
 include/linux/fdtable.h |  2 ++
 2 files changed, 12 insertions(+)

diff --git a/fs/file.c b/fs/file.c
index b1633c00bd3c..539bead2364e 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -285,6 +285,16 @@ static unsigned int count_possible_open_files(struct fdtable *fdt)
 	return i;
 }
 
+u64 count_open_files(struct fdtable *fdt)
+{
+	int i;
+	u64 retval = 0;
+
+	for (i = 0; i < DIV_ROUND_UP(fdt->max_fds, BITS_PER_LONG); i++)
+		retval += hweight64((__u64)fdt->open_fds[i]);
+	return retval;
+}
+
 /*
  * Note that a sane fdtable size always has to be a multiple of
  * BITS_PER_LONG, since we have bitmaps that are sized by this.
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index bc4c3287a65e..d74234c5d4e9 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -77,6 +77,8 @@ struct dentry;
 #define files_fdtable(files) \
 	rcu_dereference_check_fdtable((files), (files)->fdt)
 
+u64 count_open_files(struct fdtable *fdt);
+
 /*
  * The caller must ensure that fd table isn't shared or hold rcu or file lock
  */
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [RFC 3/6] misc: introduce misc_cg_charge()
  2023-11-08  0:26 [RFC 0/6] tracking fd counts per cgroup Tycho Andersen
  2023-11-08  0:26 ` [RFC 1/6] fs: count_open_files() -> count_possible_open_files() Tycho Andersen
  2023-11-08  0:26 ` [RFC 2/6] fs: introduce count_open_files() Tycho Andersen
@ 2023-11-08  0:26 ` Tycho Andersen
  2023-11-09 18:04   ` kernel test robot
  2023-11-08  0:26 ` [RFC 4/6] misc cgroup: introduce an fd counter Tycho Andersen
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 13+ messages in thread
From: Tycho Andersen @ 2023-11-08  0:26 UTC (permalink / raw)
  To: cgroups, linux-fsdevel
  Cc: Christian Brauner, Tejun Heo, Zefan Li, Johannes Weiner,
	Haitao Huang, Kamalesh Babulal, Tycho Andersen, Tycho Andersen

From: Tycho Andersen <tandersen@netflix.com>

Similar to cases in e.g. the pids cgroup with pids_charge(), if a migration
fails we will need to force-unwind it, which may put misc cgroups over
their limits. We need to charge them anyway, which is what this helper is
for.

Signed-off-by: Tycho Andersen <tandersen@netflix.com>
---
 include/linux/misc_cgroup.h |  1 +
 kernel/cgroup/misc.c        | 32 ++++++++++++++++++++++++++++++++
 2 files changed, 33 insertions(+)

diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h
index e799b1f8d05b..6ddffeeb6f97 100644
--- a/include/linux/misc_cgroup.h
+++ b/include/linux/misc_cgroup.h
@@ -57,6 +57,7 @@ struct misc_cg {
 u64 misc_cg_res_total_usage(enum misc_res_type type);
 int misc_cg_set_capacity(enum misc_res_type type, u64 capacity);
 int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg, u64 amount);
+void misc_cg_charge(enum misc_res_type type, struct misc_cg *cg, u64 amount);
 void misc_cg_uncharge(enum misc_res_type type, struct misc_cg *cg, u64 amount);
 
 /**
diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c
index 79a3717a5803..bbce097270cf 100644
--- a/kernel/cgroup/misc.c
+++ b/kernel/cgroup/misc.c
@@ -121,6 +121,38 @@ static void misc_cg_cancel_charge(enum misc_res_type type, struct misc_cg *cg,
 		  misc_res_name[type]);
 }
 
+/**
+ * misc_cg_charge() - Charge the cgroup, ignoring limits/capacity.
+ * @type: Misc res type to charge.
+ * @cg: Misc cgroup which will be charged.
+ * @amount: Amount to charge.
+ *
+ * Charge @amount to the misc cgroup. Caller must use the same cgroup during
+ * the uncharge call.
+ *
+ * Context: Any context.
+ */
+void misc_cg_charge(enum misc_res_type type, struct misc_cg *cg, u64 amount)
+{
+	struct misc_res *res;
+	struct misc_cg *i;
+
+	if (!(valid_type(type) && cg && READ_ONCE(misc_res_capacity[type]))) {
+		WARN_ON_ONCE(!valid_type(type));
+		return;
+	}
+
+	if (!amount)
+		return;
+
+	for (i = cg; i; i = parent_misc(i)) {
+		res = &i->res[type];
+
+		atomic_long_add(amount, &res->usage);
+	}
+}
+EXPORT_SYMBOL_GPL(misc_cg_charge);
+
 /**
  * misc_cg_try_charge() - Try charging the misc cgroup.
  * @type: Misc res type to charge.
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [RFC 4/6] misc cgroup: introduce an fd counter
  2023-11-08  0:26 [RFC 0/6] tracking fd counts per cgroup Tycho Andersen
                   ` (2 preceding siblings ...)
  2023-11-08  0:26 ` [RFC 3/6] misc: introduce misc_cg_charge() Tycho Andersen
@ 2023-11-08  0:26 ` Tycho Andersen
  2023-11-08 16:57   ` Al Viro
  2023-11-09  9:53   ` Christian Brauner
  2023-11-08  0:26 ` [RFC 5/6] selftests/cgroup: add a flags arg to clone_into_cgroup() Tycho Andersen
                   ` (2 subsequent siblings)
  6 siblings, 2 replies; 13+ messages in thread
From: Tycho Andersen @ 2023-11-08  0:26 UTC (permalink / raw)
  To: cgroups, linux-fsdevel
  Cc: Christian Brauner, Tejun Heo, Zefan Li, Johannes Weiner,
	Haitao Huang, Kamalesh Babulal, Tycho Andersen, Tycho Andersen

From: Tycho Andersen <tandersen@netflix.com>

This idea is not new [1], but I'm hoping using the "new" misc controller to
avoid having to introduce a completely new controller will make it more
tenable.

This patch introduces cgroup migration primitives to the misc cgroup, which
didn't have them before. I didn't know what to do in the face of kvm
migrations, so I left those as no-ops for now. I also did not do any
abstraction of the migration primitives. We are interested in adding other
rlimit-y things to the misc cgroup if this approach looks reasonable, for
the same reasons described above. I tried writing both dynamic-dispatch and
static-dispatch versions, but they introduced a lot of noise that seemed
unnecessary for this first draft. I'm happy to take suggestions here.

One oddity here is when the migration happens, which is in the
can_attach/can_fork() family vs. doing it in the actual attach/fork()
functions. This saves us from having to track some delta, or walk the fd
table twice, at the expense of a more costly revert.

As a result, the cancel_fork/cancel_attach functions do nontrivial things,
which are hard to test form userspace as far as I can tell. I did some
manual hacky fault injection, but if there is an official way to test these
I'm happy to add that.

Finally, this exposes misc.current at the root level. This was useful for
testing, but perhaps is not something we need/want in the final version.

[1]: https://lore.kernel.org/all/1404311407-4851-1-git-send-email-merimus@google.com/

Signed-off-by: Tycho Andersen <tandersen@netflix.com>
---
 fs/file.c                   |  68 +++++++++++-
 include/linux/fdtable.h     |   4 +
 include/linux/misc_cgroup.h |   1 +
 kernel/cgroup/misc.c        | 200 +++++++++++++++++++++++++++++++++++-
 4 files changed, 270 insertions(+), 3 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index 539bead2364e..b27ec5c9d77e 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -20,6 +20,7 @@
 #include <linux/spinlock.h>
 #include <linux/rcupdate.h>
 #include <linux/close_range.h>
+#include <linux/misc_cgroup.h>
 #include <net/sock.h>
 
 #include "internal.h"
@@ -318,6 +319,45 @@ static unsigned int sane_fdtable_size(struct fdtable *fdt, unsigned int max_fds)
 	return ALIGN(min(count, max_fds), BITS_PER_LONG);
 }
 
+#ifdef CONFIG_CGROUP_MISC
+static int charge_current_fds(struct files_struct *files, unsigned int count)
+{
+	return misc_cg_try_charge(MISC_CG_RES_NOFILE, files->mcg, count);
+}
+
+static void uncharge_current_fds(struct files_struct *files, unsigned int count)
+{
+	misc_cg_uncharge(MISC_CG_RES_NOFILE, files->mcg, count);
+}
+
+static void files_get_misc_cg(struct files_struct *newf)
+{
+	newf->mcg = get_current_misc_cg();
+}
+
+static void files_put_misc_cg(struct files_struct *newf)
+{
+	put_misc_cg(newf->mcg);
+}
+#else
+static int charge_current_fds(struct files_struct *files, unsigned int count)
+{
+	return 0;
+}
+
+static void uncharge_current_fds(struct files_struct *files, unsigned int count)
+{
+}
+
+static void files_get_misc_cg(struct files_struct *newf)
+{
+}
+
+static void files_put_misc_cg(struct files_struct *newf)
+{
+}
+#endif
+
 /*
  * Allocate a new files structure and copy contents from the
  * passed in files structure.
@@ -341,6 +381,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, unsigned int max_fds, int
 	newf->resize_in_progress = false;
 	init_waitqueue_head(&newf->resize_wait);
 	newf->next_fd = 0;
+	files_get_misc_cg(newf);
 	new_fdt = &newf->fdtab;
 	new_fdt->max_fds = NR_OPEN_DEFAULT;
 	new_fdt->close_on_exec = newf->close_on_exec_init;
@@ -350,6 +391,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, unsigned int max_fds, int
 
 	spin_lock(&oldf->file_lock);
 	old_fdt = files_fdtable(oldf);
+
 	open_files = sane_fdtable_size(old_fdt, max_fds);
 
 	/*
@@ -411,9 +453,22 @@ struct files_struct *dup_fd(struct files_struct *oldf, unsigned int max_fds, int
 
 	rcu_assign_pointer(newf->fdt, new_fdt);
 
-	return newf;
+	if (!charge_current_fds(newf, count_open_files(new_fdt)))
+		return newf;
+
+	new_fds = new_fdt->fd;
+	for (i = open_files; i != 0; i--) {
+		struct file *f = *new_fds++;
+
+		if (f)
+			fput(f);
+	}
+	if (new_fdt != &newf->fdtab)
+		__free_fdtable(new_fdt);
+	*errorp = -EMFILE;
 
 out_release:
+	files_put_misc_cg(newf);
 	kmem_cache_free(files_cachep, newf);
 out:
 	return NULL;
@@ -439,6 +494,7 @@ static struct fdtable *close_files(struct files_struct * files)
 			if (set & 1) {
 				struct file * file = xchg(&fdt->fd[i], NULL);
 				if (file) {
+					uncharge_current_fds(files, 1);
 					filp_close(file, files);
 					cond_resched();
 				}
@@ -448,6 +504,8 @@ static struct fdtable *close_files(struct files_struct * files)
 		}
 	}
 
+	files_put_misc_cg(files);
+
 	return fdt;
 }
 
@@ -542,6 +600,10 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
 	if (error)
 		goto repeat;
 
+	error = -EMFILE;
+	if (charge_current_fds(files, 1) < 0)
+		goto out;
+
 	if (start <= files->next_fd)
 		files->next_fd = fd + 1;
 
@@ -578,6 +640,8 @@ EXPORT_SYMBOL(get_unused_fd_flags);
 static void __put_unused_fd(struct files_struct *files, unsigned int fd)
 {
 	struct fdtable *fdt = files_fdtable(files);
+	if (test_bit(fd, fdt->open_fds))
+		uncharge_current_fds(files, 1);
 	__clear_open_fd(fd, fdt);
 	if (fd < files->next_fd)
 		files->next_fd = fd;
@@ -1248,7 +1312,7 @@ __releases(&files->file_lock)
 	 */
 	fdt = files_fdtable(files);
 	tofree = fdt->fd[fd];
-	if (!tofree && fd_is_open(fd, fdt))
+	if (!tofree && (fd_is_open(fd, fdt) || charge_current_fds(files, 1) < 0))
 		goto Ebusy;
 	get_file(file);
 	rcu_assign_pointer(fdt->fd[fd], file);
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index d74234c5d4e9..b8783fa0f63f 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -14,6 +14,7 @@
 #include <linux/types.h>
 #include <linux/init.h>
 #include <linux/fs.h>
+#include <linux/misc_cgroup.h>
 
 #include <linux/atomic.h>
 
@@ -65,6 +66,9 @@ struct files_struct {
 	unsigned long open_fds_init[1];
 	unsigned long full_fds_bits_init[1];
 	struct file __rcu * fd_array[NR_OPEN_DEFAULT];
+#ifdef CONFIG_CGROUP_MISC
+	struct misc_cg *mcg;
+#endif
 };
 
 struct file_operations;
diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h
index 6ddffeeb6f97..a675be53c875 100644
--- a/include/linux/misc_cgroup.h
+++ b/include/linux/misc_cgroup.h
@@ -18,6 +18,7 @@ enum misc_res_type {
 	/* AMD SEV-ES ASIDs resource */
 	MISC_CG_RES_SEV_ES,
 #endif
+	MISC_CG_RES_NOFILE,
 	MISC_CG_RES_TYPES
 };
 
diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c
index bbce097270cf..a28f97307b3e 100644
--- a/kernel/cgroup/misc.c
+++ b/kernel/cgroup/misc.c
@@ -12,6 +12,8 @@
 #include <linux/atomic.h>
 #include <linux/slab.h>
 #include <linux/misc_cgroup.h>
+#include <linux/mm.h>
+#include <linux/fdtable.h>
 
 #define MAX_STR "max"
 #define MAX_NUM U64_MAX
@@ -24,6 +26,7 @@ static const char *const misc_res_name[] = {
 	/* AMD SEV-ES ASIDs resource */
 	"sev_es",
 #endif
+	"nofile",
 };
 
 /* Root misc cgroup */
@@ -37,7 +40,9 @@ static struct misc_cg root_cg;
  * more than the actual capacity. We are using Limits resource distribution
  * model of cgroup for miscellaneous controller.
  */
-static u64 misc_res_capacity[MISC_CG_RES_TYPES];
+static u64 misc_res_capacity[MISC_CG_RES_TYPES] = {
+	[MISC_CG_RES_NOFILE] = MAX_NUM,
+};
 
 /**
  * parent_misc() - Get the parent of the passed misc cgroup.
@@ -445,10 +450,203 @@ static void misc_cg_free(struct cgroup_subsys_state *css)
 	kfree(css_misc(css));
 }
 
+static void revert_attach_until(struct cgroup_taskset *tset, struct task_struct *stop)
+{
+	struct task_struct *task;
+	struct cgroup_subsys_state *dst_css;
+
+	cgroup_taskset_for_each(task, dst_css, tset) {
+		struct misc_cg *misc, *old_misc;
+		struct cgroup_subsys_state *old_css;
+		struct files_struct *files;
+		struct fdtable *fdt;
+		unsigned long nofile;
+
+		if (task == stop)
+			break;
+
+		misc = css_misc(dst_css);
+		old_css = task_css(task, misc_cgrp_id);
+		old_misc = css_misc(old_css);
+
+		if (misc == old_misc)
+			continue;
+
+		task_lock(task);
+		files = task->files;
+		spin_lock(&files->file_lock);
+		fdt = files_fdtable(files);
+
+		if (old_misc == files->mcg)
+			goto done;
+
+		WARN_ON_ONCE(misc != files->mcg);
+
+		nofile = count_open_files(fdt);
+		misc_cg_charge(MISC_CG_RES_NOFILE, old_misc, nofile);
+		misc_cg_uncharge(MISC_CG_RES_NOFILE, misc, nofile);
+
+		put_misc_cg(files->mcg);
+		css_get(old_css);
+		files->mcg = old_misc;
+
+done:
+		spin_unlock(&files->file_lock);
+		task_unlock(task);
+	}
+}
+
+static int misc_cg_can_attach(struct cgroup_taskset *tset)
+{
+	struct task_struct *task;
+	struct cgroup_subsys_state *dst_css;
+
+	cgroup_taskset_for_each(task, dst_css, tset) {
+		struct misc_cg *misc, *old_misc;
+		struct cgroup_subsys_state *old_css;
+		unsigned long nofile;
+		struct files_struct *files;
+		struct fdtable *fdt;
+		int ret;
+
+		misc = css_misc(dst_css);
+		old_css = task_css(task, misc_cgrp_id);
+		old_misc = css_misc(old_css);
+
+		if (misc == old_misc)
+			continue;
+
+		task_lock(task);
+		files = task->files;
+		spin_lock(&files->file_lock);
+		fdt = files_fdtable(files);
+
+		/*
+		 * If this task->files was already in the right place (either
+		 * because of dup_fd() or because some other thread had already
+		 * migrated it), we don't need to do anything.
+		 */
+		if (misc == files->mcg)
+			goto done;
+
+		WARN_ON_ONCE(old_misc != files->mcg);
+
+		nofile = count_open_files(fdt);
+		ret = misc_cg_try_charge(MISC_CG_RES_NOFILE, misc, nofile);
+		if (ret < 0) {
+			spin_unlock(&files->file_lock);
+			task_unlock(task);
+			revert_attach_until(tset, task);
+			return ret;
+		}
+		misc_cg_uncharge(MISC_CG_RES_NOFILE, old_misc, nofile);
+
+		/*
+		 * let's ref the new table, install it, and
+		 * deref the old one.
+		 */
+		put_misc_cg(files->mcg);
+		css_get(dst_css);
+		files->mcg = misc;
+
+done:
+		spin_unlock(&files->file_lock);
+		task_unlock(task);
+
+	}
+
+	return 0;
+}
+
+static void misc_cg_cancel_attach(struct cgroup_taskset *tset)
+{
+	revert_attach_until(tset, NULL);
+}
+
+static int misc_cg_can_fork(struct task_struct *task, struct css_set *cset)
+{
+	struct misc_cg *dst_misc, *init_misc;
+	struct files_struct *files;
+	struct fdtable *fdt;
+	unsigned long nofile;
+	struct cgroup_subsys_state *dst_css, *cur_css;
+	int ret;
+
+	init_misc = css_misc(init_css_set.subsys[misc_cgrp_id]);
+	cur_css = task_get_css(task, misc_cgrp_id);
+
+	WARN_ON_ONCE(init_misc != css_misc(cur_css));
+
+	dst_css = cset->subsys[misc_cgrp_id];
+	dst_misc = css_misc(dst_css);
+
+	/*
+	 * When forking, tasks are initially put into the init_css_set (see
+	 * cgroup_fork()). Then, we do a dup_fd() and charge init_css_set for
+	 * the new task's fds. We need to migrate from the init_css_set to the
+	 * target one so we can charge the right place.
+	 */
+	task_lock(task);
+	files = task->files;
+	spin_lock(&files->file_lock);
+	fdt = files_fdtable(files);
+
+	ret = 0;
+	if (files->mcg == dst_misc)
+		goto out;
+
+	nofile = count_open_files(fdt);
+	ret = misc_cg_try_charge(MISC_CG_RES_NOFILE, dst_misc, nofile);
+	if (ret < 0)
+		goto out;
+
+	misc_cg_uncharge(MISC_CG_RES_NOFILE, init_misc, nofile);
+
+	put_misc_cg(files->mcg);
+	css_get(dst_css);
+	files->mcg = dst_misc;
+	ret = 0;
+
+out:
+	spin_unlock(&files->file_lock);
+	task_unlock(task);
+
+	return ret;
+}
+
+static void misc_cg_cancel_fork(struct task_struct *task, struct css_set *cset)
+{
+	struct misc_cg *dst_misc;
+	struct files_struct *files;
+	struct fdtable *fdt;
+	unsigned long nofile;
+	struct cgroup_subsys_state *dst_css;
+
+	dst_css = cset->subsys[misc_cgrp_id];
+	dst_misc = css_misc(dst_css);
+
+	task_lock(task);
+	files = task->files;
+	spin_lock(&files->file_lock);
+	fdt = files_fdtable(files);
+
+	/*
+	 * we don't need to re-charge anyone, since this fork is going away.
+	 */
+	nofile = count_open_files(fdt);
+	misc_cg_uncharge(MISC_CG_RES_NOFILE, dst_misc, nofile);
+	spin_unlock(&files->file_lock);
+	task_unlock(task);
+}
+
 /* Cgroup controller callbacks */
 struct cgroup_subsys misc_cgrp_subsys = {
 	.css_alloc = misc_cg_alloc,
 	.css_free = misc_cg_free,
 	.legacy_cftypes = misc_cg_files,
 	.dfl_cftypes = misc_cg_files,
+	.can_attach = misc_cg_can_attach,
+	.cancel_attach = misc_cg_cancel_attach,
+	.can_fork = misc_cg_can_fork,
+	.cancel_fork = misc_cg_cancel_fork,
 };
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [RFC 5/6] selftests/cgroup: add a flags arg to clone_into_cgroup()
  2023-11-08  0:26 [RFC 0/6] tracking fd counts per cgroup Tycho Andersen
                   ` (3 preceding siblings ...)
  2023-11-08  0:26 ` [RFC 4/6] misc cgroup: introduce an fd counter Tycho Andersen
@ 2023-11-08  0:26 ` Tycho Andersen
  2023-11-08  0:26 ` [RFC 6/6] selftests/cgroup: add a test for misc cgroup Tycho Andersen
  2023-11-09 18:44 ` [RFC 0/6] tracking fd counts per cgroup Tejun Heo
  6 siblings, 0 replies; 13+ messages in thread
From: Tycho Andersen @ 2023-11-08  0:26 UTC (permalink / raw)
  To: cgroups, linux-fsdevel
  Cc: Christian Brauner, Tejun Heo, Zefan Li, Johannes Weiner,
	Haitao Huang, Kamalesh Babulal, Tycho Andersen, Tycho Andersen

From: Tycho Andersen <tandersen@netflix.com>

We'll use this to test some extra scenarios in the nofile misc cg test
suite.

Signed-off-by: Tycho Andersen <tandersen@netflix.com>
---
 tools/testing/selftests/cgroup/cgroup_util.c | 8 ++++----
 tools/testing/selftests/cgroup/cgroup_util.h | 2 +-
 tools/testing/selftests/cgroup/test_core.c   | 4 ++--
 3 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/tools/testing/selftests/cgroup/cgroup_util.c b/tools/testing/selftests/cgroup/cgroup_util.c
index 0340d4ca8f51..c65b9f41fdd2 100644
--- a/tools/testing/selftests/cgroup/cgroup_util.c
+++ b/tools/testing/selftests/cgroup/cgroup_util.c
@@ -345,13 +345,13 @@ int cg_run(const char *cgroup,
 	}
 }
 
-pid_t clone_into_cgroup(int cgroup_fd)
+pid_t clone_into_cgroup(int cgroup_fd, unsigned long extra_flags)
 {
 #ifdef CLONE_ARGS_SIZE_VER2
 	pid_t pid;
 
 	struct __clone_args args = {
-		.flags = CLONE_INTO_CGROUP,
+		.flags = CLONE_INTO_CGROUP | extra_flags,
 		.exit_signal = SIGCHLD,
 		.cgroup = cgroup_fd,
 	};
@@ -429,7 +429,7 @@ static int clone_into_cgroup_run_nowait(const char *cgroup,
 	if (cgroup_fd < 0)
 		return -1;
 
-	pid = clone_into_cgroup(cgroup_fd);
+	pid = clone_into_cgroup(cgroup_fd, 0);
 	close_prot_errno(cgroup_fd);
 	if (pid == 0)
 		exit(fn(cgroup, arg));
@@ -588,7 +588,7 @@ int clone_into_cgroup_run_wait(const char *cgroup)
 	if (cgroup_fd < 0)
 		return -1;
 
-	pid = clone_into_cgroup(cgroup_fd);
+	pid = clone_into_cgroup(cgroup_fd, 0);
 	close_prot_errno(cgroup_fd);
 	if (pid < 0)
 		return -1;
diff --git a/tools/testing/selftests/cgroup/cgroup_util.h b/tools/testing/selftests/cgroup/cgroup_util.h
index 1df7f202214a..e2190202e8c3 100644
--- a/tools/testing/selftests/cgroup/cgroup_util.h
+++ b/tools/testing/selftests/cgroup/cgroup_util.h
@@ -57,7 +57,7 @@ extern int cg_killall(const char *cgroup);
 int proc_mount_contains(const char *option);
 extern ssize_t proc_read_text(int pid, bool thread, const char *item, char *buf, size_t size);
 extern int proc_read_strstr(int pid, bool thread, const char *item, const char *needle);
-extern pid_t clone_into_cgroup(int cgroup_fd);
+extern pid_t clone_into_cgroup(int cgroup_fd, unsigned long extra_flags);
 extern int clone_reap(pid_t pid, int options);
 extern int clone_into_cgroup_run_wait(const char *cgroup);
 extern int dirfd_open_opath(const char *dir);
diff --git a/tools/testing/selftests/cgroup/test_core.c b/tools/testing/selftests/cgroup/test_core.c
index 80aa6b2373b9..1b2ec7b6825c 100644
--- a/tools/testing/selftests/cgroup/test_core.c
+++ b/tools/testing/selftests/cgroup/test_core.c
@@ -201,7 +201,7 @@ static int test_cgcore_populated(const char *root)
 	if (cgroup_fd < 0)
 		goto cleanup;
 
-	pid = clone_into_cgroup(cgroup_fd);
+	pid = clone_into_cgroup(cgroup_fd, 0);
 	if (pid < 0) {
 		if (errno == ENOSYS)
 			goto cleanup_pass;
@@ -233,7 +233,7 @@ static int test_cgcore_populated(const char *root)
 		cg_test_d = NULL;
 	}
 
-	pid = clone_into_cgroup(cgroup_fd);
+	pid = clone_into_cgroup(cgroup_fd, 0);
 	if (pid < 0)
 		goto cleanup_pass;
 	if (pid == 0)
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [RFC 6/6] selftests/cgroup: add a test for misc cgroup
  2023-11-08  0:26 [RFC 0/6] tracking fd counts per cgroup Tycho Andersen
                   ` (4 preceding siblings ...)
  2023-11-08  0:26 ` [RFC 5/6] selftests/cgroup: add a flags arg to clone_into_cgroup() Tycho Andersen
@ 2023-11-08  0:26 ` Tycho Andersen
  2023-11-09 18:44 ` [RFC 0/6] tracking fd counts per cgroup Tejun Heo
  6 siblings, 0 replies; 13+ messages in thread
From: Tycho Andersen @ 2023-11-08  0:26 UTC (permalink / raw)
  To: cgroups, linux-fsdevel
  Cc: Christian Brauner, Tejun Heo, Zefan Li, Johannes Weiner,
	Haitao Huang, Kamalesh Babulal, Tycho Andersen, Tycho Andersen

From: Tycho Andersen <tandersen@netflix.com>

There's four tests here: a basic smoke test, and tests for clone/fork.
Ideally there'd be a test for the cancel_attach() path too.

Signed-off-by: Tycho Andersen <tandersen@netflix.com>
---
 tools/testing/selftests/cgroup/.gitignore  |   1 +
 tools/testing/selftests/cgroup/Makefile    |   2 +
 tools/testing/selftests/cgroup/test_misc.c | 385 +++++++++++++++++++++
 3 files changed, 388 insertions(+)

diff --git a/tools/testing/selftests/cgroup/.gitignore b/tools/testing/selftests/cgroup/.gitignore
index 2732e0b29271..7e57580ed363 100644
--- a/tools/testing/selftests/cgroup/.gitignore
+++ b/tools/testing/selftests/cgroup/.gitignore
@@ -9,3 +9,4 @@ test_cpuset
 test_zswap
 test_hugetlb_memcg
 wait_inotify
+test_misc
diff --git a/tools/testing/selftests/cgroup/Makefile b/tools/testing/selftests/cgroup/Makefile
index 00b441928909..2e5b72947134 100644
--- a/tools/testing/selftests/cgroup/Makefile
+++ b/tools/testing/selftests/cgroup/Makefile
@@ -15,6 +15,7 @@ TEST_GEN_PROGS += test_cpu
 TEST_GEN_PROGS += test_cpuset
 TEST_GEN_PROGS += test_zswap
 TEST_GEN_PROGS += test_hugetlb_memcg
+TEST_GEN_PROGS += test_misc
 
 LOCAL_HDRS += $(selfdir)/clone3/clone3_selftests.h $(selfdir)/pidfd/pidfd.h
 
@@ -29,3 +30,4 @@ $(OUTPUT)/test_cpu: cgroup_util.c
 $(OUTPUT)/test_cpuset: cgroup_util.c
 $(OUTPUT)/test_zswap: cgroup_util.c
 $(OUTPUT)/test_hugetlb_memcg: cgroup_util.c
+$(OUTPUT)/test_misc: cgroup_util.c
diff --git a/tools/testing/selftests/cgroup/test_misc.c b/tools/testing/selftests/cgroup/test_misc.c
new file mode 100644
index 000000000000..8f15d899ed4a
--- /dev/null
+++ b/tools/testing/selftests/cgroup/test_misc.c
@@ -0,0 +1,385 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include <sys/socket.h>
+#include <limits.h>
+#include <string.h>
+#include <signal.h>
+#include <syscall.h>
+#include <sched.h>
+#include <sys/wait.h>
+
+#include "../kselftest.h"
+#include "cgroup_util.h"
+
+#define N 100
+
+static int open_N_fds(const char *cgroup, void *arg)
+{
+	int i;
+	long nofile;
+
+	for (i = 0; i < N; i++) {
+		int fd;
+
+		fd = socket(AF_UNIX, SOCK_SEQPACKET, 0);
+		if (fd < 0) {
+			ksft_print_msg("%d socket: %s\n", i, strerror(errno));
+			return 1;
+		}
+	}
+
+	/*
+	 * N+3 std fds + 1 fd for "misc.current"
+	 */
+	nofile = cg_read_key_long(cgroup, "misc.current", "nofile ");
+	if (nofile != N+3+1) {
+		ksft_print_msg("bad open files count: %ld\n", nofile);
+		return 1;
+	}
+
+	return 0;
+}
+
+static int test_misc_cg_basic(const char *root)
+{
+	int ret = KSFT_FAIL;
+	char *foo;
+
+	foo = cg_name(root, "foo");
+	if (!foo)
+		goto cleanup;
+
+	if (cg_create(foo)) {
+		perror("cg_create");
+		ksft_print_msg("cg_create failed\n");
+		goto cleanup;
+	}
+
+	if (cg_write(root, "cgroup.subtree_control", "+misc")) {
+		ksft_print_msg("cg_write failed\n");
+		goto cleanup;
+	}
+
+	ret = cg_run(foo, open_N_fds, NULL);
+	if (ret < 0) {
+		ksft_print_msg("cg_run failed\n");
+		goto cleanup;
+	}
+
+	if (ret == 0)
+		ret = KSFT_PASS;
+
+cleanup:
+	cg_destroy(foo);
+	free(foo);
+	return ret;
+}
+
+static int open_N_fds_and_sleep(const char *root, void *arg)
+{
+	int i, *sock = arg;
+
+	for (i = 0; i < N; i++) {
+		int fd;
+
+		fd = socket(AF_UNIX, SOCK_SEQPACKET, 0);
+		if (fd < 0) {
+			ksft_print_msg("%d socket: %s\n", i, strerror(errno));
+			return 1;
+		}
+	}
+
+	if (write(*sock, "c", 1) != 1) {
+		ksft_print_msg("%d write: %s\n", i, strerror(errno));
+		return 1;
+	}
+
+	while (1)
+		sleep(1000);
+}
+
+#define COPIES 5
+static int test_misc_cg_threads(const char *root)
+{
+	int ret = KSFT_FAIL, i;
+	char *foo;
+	int pids[COPIES] = {};
+	long nofile;
+
+	foo = cg_name(root, "foo");
+	if (!foo)
+		goto cleanup;
+
+	if (cg_create(foo)) {
+		ksft_print_msg("cg_create failed\n");
+		goto cleanup;
+	}
+
+	if (cg_write(root, "cgroup.subtree_control", "+misc")) {
+		ksft_print_msg("cg_write failed\n");
+		goto cleanup;
+	}
+
+	for (i = 0; i < COPIES; i++) {
+		char c;
+		int sk_pair[2];
+
+		if (socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair) < 0) {
+			ksft_print_msg("socketpair failed %s\n", strerror(errno));
+			goto cleanup;
+		}
+
+		pids[i] = cg_run_nowait(foo, open_N_fds_and_sleep, sk_pair+1);
+		if (pids[i] < 0) {
+			perror("cg_run_nowait");
+			ksft_print_msg("cg_run failed\n");
+			goto cleanup;
+		}
+		close(sk_pair[1]);
+
+		if (read(sk_pair[0], &c, 1) != 1) {
+			ksft_print_msg("%d read: %s\n", i, strerror(errno));
+			goto cleanup;
+		}
+		close(sk_pair[0]);
+	}
+
+	/*
+	 * We expect COPIES * (N + 3 stdfs + 2 socketpair fds).
+	 */
+	nofile = cg_read_key_long(foo, "misc.current", "nofile ");
+	if (nofile != COPIES*(N+3+2)) {
+		ksft_print_msg("bad open files count: %ld != %d\n", nofile, COPIES*(N+3+1));
+		goto cleanup;
+	}
+
+	ret = KSFT_PASS;
+cleanup:
+	for (i = 0; i < COPIES; i++) {
+		if (pids[i] >= 0) {
+			kill(pids[i], SIGKILL);
+			waitpid(pids[i], NULL, 0);
+		}
+	}
+	cg_destroy(foo);
+	free(foo);
+	return ret;
+}
+
+static int test_shared_files_count(const char *root)
+{
+	char *foo, c;
+	int dfd, ret = KSFT_FAIL, sk_pair[2];
+	pid_t pid;
+	long nofile;
+
+	if (socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair) < 0) {
+		ksft_print_msg("socketpair failed %s\n", strerror(errno));
+		return ret;
+	}
+
+	foo = cg_name(root, "foo");
+	if (!foo)
+		goto cleanup;
+
+	if (cg_write(root, "cgroup.subtree_control", "+misc")) {
+		ksft_print_msg("cg_write failed\n");
+		goto cleanup;
+	}
+
+	if (cg_create(foo)) {
+		ksft_print_msg("cg_create failed\n");
+		goto cleanup;
+	}
+
+	dfd = dirfd_open_opath(foo);
+	if (dfd < 0) {
+		perror("cgroup dir open");
+		goto cleanup;
+	}
+
+	pid = clone_into_cgroup(dfd, CLONE_FILES);
+	if (pid < 0) {
+		perror("clone");
+		goto cleanup;
+	}
+
+	if (pid == 0) {
+		close(sk_pair[0]);
+		exit(open_N_fds_and_sleep(foo, sk_pair+1));
+	}
+
+	errno = 0;
+	nofile = read(sk_pair[0], &c, 1);
+	if (nofile != 1) {
+		ksft_print_msg("read: %s\n", strerror(errno));
+		goto cleanup;
+	}
+	close(sk_pair[0]);
+
+	/*
+	 * We have two threads with a shared fd table, so the fds should be
+	 * counted only once.
+	 * We expect N + 3 stdfs + 2 socketpair fds.
+	 */
+	nofile = cg_read_key_long(foo, "misc.current", "nofile ");
+	if (nofile != (N+3+2)) {
+		ksft_print_msg("bad open files count: %ld != %d\n", nofile, N+3+1);
+		goto cleanup;
+	}
+
+	ret = KSFT_PASS;
+cleanup:
+	close(sk_pair[0]);
+	close(sk_pair[1]);
+	close(dfd);
+	kill(pid, SIGKILL);
+	waitpid(pid, NULL, 0);
+	cg_destroy(foo);
+	free(foo);
+	return ret;
+}
+
+static int test_misc_cg_threads_shared_files(const char *root)
+{
+	pid_t pid;
+	int status;
+
+	/*
+	 * get a fresh process to share fd tables so we don't pollute the test
+	 * suite's fd table in the case of failure.
+	 */
+	pid = fork();
+	if (pid < 0) {
+		perror("fork");
+		return KSFT_FAIL;
+	}
+
+	if (pid == 0)
+		exit(test_shared_files_count(root));
+
+	if (waitpid(pid, &status, 0) != pid) {
+		ksft_print_msg("wait failed\n");
+		return KSFT_FAIL;
+	}
+
+	if (!WIFEXITED(status)) {
+		ksft_print_msg("died with %x\n", status);
+		return KSFT_FAIL;
+	}
+
+	return WEXITSTATUS(status);
+}
+
+#define EXTRA 5
+static int open_more_than_N_fds(const char *cgroup, void *arg)
+{
+	int emfiles = 0, i;
+
+	for (i = 0; i < N+EXTRA; i++) {
+		int fd;
+
+		fd = socket(AF_UNIX, SOCK_SEQPACKET, 0);
+		if (fd < 0) {
+			if (errno != EMFILE) {
+				ksft_print_msg("%d socket: %s\n", i, strerror(errno));
+				return 1;
+			}
+
+			emfiles++;
+		}
+	}
+
+	/*
+	 * We have 3 existing stdfds open, plus the 100 that we tried to open,
+	 * plus the five extra.
+	 */
+	if (emfiles != EXTRA+3) {
+		ksft_print_msg("got %d EMFILEs\n", emfiles);
+		return 1;
+	}
+	return 0;
+}
+
+static int test_misc_cg_emfile_count(const char *root)
+{
+	int ret = KSFT_FAIL;
+	char *foo;
+	char nofile[128];
+	long nofile_events;
+
+	foo = cg_name(root, "foo");
+	if (!foo)
+		goto cleanup;
+
+	if (cg_create(foo)) {
+		ksft_print_msg("cg_create failed\n");
+		goto cleanup;
+	}
+
+	if (cg_write(root, "cgroup.subtree_control", "+misc")) {
+		ksft_print_msg("cg_write failed\n");
+		goto cleanup;
+	}
+
+	snprintf(nofile, sizeof(nofile), "nofile %d", N);
+	if (cg_write(foo, "misc.max", nofile)) {
+		ksft_print_msg("cg_write failed\n");
+		goto cleanup;
+	}
+
+	if (cg_run(foo, open_more_than_N_fds, NULL)) {
+		perror("cg_run");
+		ksft_print_msg("cg_run failed\n");
+		goto cleanup;
+	}
+
+	nofile_events = cg_read_key_long(foo, "misc.events", "nofile.max ");
+	if (nofile_events != EXTRA+3) {
+		ksft_print_msg("bad nofile events: %ld\n", nofile_events);
+		goto cleanup;
+	}
+
+	ret = KSFT_PASS;
+cleanup:
+	cg_destroy(foo);
+	free(foo);
+	return ret;
+}
+
+#define T(x) { x, #x }
+struct misccg_test {
+	int (*fn)(const char *root);
+	const char *name;
+} tests[] = {
+	T(test_misc_cg_basic),
+	T(test_misc_cg_threads),
+	T(test_misc_cg_threads_shared_files),
+	T(test_misc_cg_emfile_count),
+};
+#undef T
+
+int main(int argc, char *argv[])
+{
+	char root[PATH_MAX];
+	int i, ret = EXIT_SUCCESS;
+
+	if (cg_find_unified_root(root, sizeof(root)))
+		ksft_exit_skip("cgroup v2 isn't mounted\n");
+	for (i = 0; i < ARRAY_SIZE(tests); i++) {
+		switch (tests[i].fn(root)) {
+		case KSFT_PASS:
+			ksft_test_result_pass("%s\n", tests[i].name);
+			break;
+		case KSFT_SKIP:
+			ksft_test_result_skip("%s\n", tests[i].name);
+			break;
+		default:
+			ret = EXIT_FAILURE;
+			ksft_test_result_fail("%s\n", tests[i].name);
+			break;
+		}
+	}
+
+	return ret;
+}
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [RFC 4/6] misc cgroup: introduce an fd counter
  2023-11-08  0:26 ` [RFC 4/6] misc cgroup: introduce an fd counter Tycho Andersen
@ 2023-11-08 16:57   ` Al Viro
  2023-11-08 21:01     ` Tycho Andersen
  2023-11-09  9:53   ` Christian Brauner
  1 sibling, 1 reply; 13+ messages in thread
From: Al Viro @ 2023-11-08 16:57 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: cgroups, linux-fsdevel, Christian Brauner, Tejun Heo, Zefan Li,
	Johannes Weiner, Haitao Huang, Kamalesh Babulal, Tycho Andersen

On Tue, Nov 07, 2023 at 05:26:45PM -0700, Tycho Andersen wrote:

> +	if (!charge_current_fds(newf, count_open_files(new_fdt)))
> +		return newf;

Are you sure that on configs that are not cgroup-infested compiler
will figure out that count_open_files() would have no side effects
and doesn't need to be evaluated?

Incidentally, since you are adding your charge/uncharge stuff on each
allocation/freeing, why not simply maintain an accurate counter, cgroup or
no cgroup?  IDGI...  Make it an inlined helper right there in fs/file.c,
doing increment/decrement and, conditional upon config, calling
the cgroup side of things.  No need to look at fdt, etc. outside
of fs/file.c either - the counter can be picked right from the
files_struct...

>  static void __put_unused_fd(struct files_struct *files, unsigned int fd)
>  {
>  	struct fdtable *fdt = files_fdtable(files);
> +	if (test_bit(fd, fdt->open_fds))
> +		uncharge_current_fds(files, 1);

Umm...  Just where do we call it without the bit in ->open_fds set?
Any such caller would be a serious bug; suppose you are trying to
call __put_unused_fd(files, N) while N is not in open_fds.  Just before
your call another thread grabs a descriptor and picks N.  Resulting
state won't be pretty, especially if right *after* your call the
third thread also asks for a descriptor - and also gets N.

Sure, you have an exclusion on ->file_lock, but AFAICS all callers
are under it and in all callers except for put_unused_fd() we
have just observed a non-NULL file reference in ->fd[N]; that
would *definitely* be a hard constraint violation if it ever
happened with N not in ->open_fds at that moment.

So the only possibility would be a broken caller of put_unused_fd(),
and any such would be a serious bug.

Details, please - have you ever observed that?

BTW, what about the locking hierarchy?  In the current tree ->files_lock
nests inside of everything; what happens with your patches in place?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC 4/6] misc cgroup: introduce an fd counter
  2023-11-08 16:57   ` Al Viro
@ 2023-11-08 21:01     ` Tycho Andersen
  0 siblings, 0 replies; 13+ messages in thread
From: Tycho Andersen @ 2023-11-08 21:01 UTC (permalink / raw)
  To: Al Viro
  Cc: cgroups, linux-fsdevel, Christian Brauner, Tejun Heo, Zefan Li,
	Johannes Weiner, Haitao Huang, Kamalesh Babulal, Tycho Andersen

Hi Al,

Thanks for looking. Somehow I also missed CCing you, whoops,

On Wed, Nov 08, 2023 at 04:57:49PM +0000, Al Viro wrote:
> On Tue, Nov 07, 2023 at 05:26:45PM -0700, Tycho Andersen wrote:
> 
> > +	if (!charge_current_fds(newf, count_open_files(new_fdt)))
> > +		return newf;
> 
> Are you sure that on configs that are not cgroup-infested compiler
> will figure out that count_open_files() would have no side effects
> and doesn't need to be evaluated?
> 
> Incidentally, since you are adding your charge/uncharge stuff on each
> allocation/freeing, why not simply maintain an accurate counter, cgroup or
> no cgroup?  IDGI...  Make it an inlined helper right there in fs/file.c,
> doing increment/decrement and, conditional upon config, calling
> the cgroup side of things.  No need to look at fdt, etc. outside
> of fs/file.c either - the counter can be picked right from the
> files_struct...

Thanks, I can re-work it to look like this.

> >  static void __put_unused_fd(struct files_struct *files, unsigned int fd)
> >  {
> >  	struct fdtable *fdt = files_fdtable(files);
> > +	if (test_bit(fd, fdt->open_fds))
> > +		uncharge_current_fds(files, 1);
> 
> Umm...  Just where do we call it without the bit in ->open_fds set?
> Any such caller would be a serious bug; suppose you are trying to
> call __put_unused_fd(files, N) while N is not in open_fds.  Just before
> your call another thread grabs a descriptor and picks N.  Resulting
> state won't be pretty, especially if right *after* your call the
> third thread also asks for a descriptor - and also gets N.
> 
> Sure, you have an exclusion on ->file_lock, but AFAICS all callers
> are under it and in all callers except for put_unused_fd() we
> have just observed a non-NULL file reference in ->fd[N]; that
> would *definitely* be a hard constraint violation if it ever
> happened with N not in ->open_fds at that moment.
> 
> So the only possibility would be a broken caller of put_unused_fd(),
> and any such would be a serious bug.
> 
> Details, please - have you ever observed that?

No, I just kept it from the original series. I agree that it should be
safe to drop.

> BTW, what about the locking hierarchy?  In the current tree ->files_lock
> nests inside of everything; what happens with your patches in place?

If I understand correctly you're asking about ->files_lock nesting
inside of task_lock()? I tried to make the cgroup side in this patch
do the same thing in the same order. Or am I misunderstanding?

I did test this with some production container traffic and didn't see
anything too strange, but no doubt there are complicated edge cases
here.

Thanks,

Tycho

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC 4/6] misc cgroup: introduce an fd counter
  2023-11-08  0:26 ` [RFC 4/6] misc cgroup: introduce an fd counter Tycho Andersen
  2023-11-08 16:57   ` Al Viro
@ 2023-11-09  9:53   ` Christian Brauner
  2023-11-09 14:58     ` Tycho Andersen
  1 sibling, 1 reply; 13+ messages in thread
From: Christian Brauner @ 2023-11-09  9:53 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: cgroups, linux-fsdevel, Tejun Heo, Zefan Li, Johannes Weiner,
	Haitao Huang, Kamalesh Babulal, Tycho Andersen

> @@ -411,9 +453,22 @@ struct files_struct *dup_fd(struct files_struct *oldf, unsigned int max_fds, int
>  
>  	rcu_assign_pointer(newf->fdt, new_fdt);
>  
> -	return newf;
> +	if (!charge_current_fds(newf, count_open_files(new_fdt)))
> +		return newf;


> @@ -542,6 +600,10 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
>  	if (error)
>  		goto repeat;
>  
> +	error = -EMFILE;
> +	if (charge_current_fds(files, 1) < 0)
> +		goto out;

Whoops, I had that message ready to fire but didn't send it.

This may have a noticeable performance impact as charge_current_fds()
calls misc_cg_try_charge() which looks pretty expensive in this
codepath.

We're constantly getting patches to tweak performance during file open
and closing and adding a function that does require multiple atomics and
spinlocks won't exactly improve this.

On top of that I really dislike that we're pulling cgroups into this
code here at all.

Can you get a similar effect through a bpf program somehow that you
don't even tie this to cgroups?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC 4/6] misc cgroup: introduce an fd counter
  2023-11-09  9:53   ` Christian Brauner
@ 2023-11-09 14:58     ` Tycho Andersen
  0 siblings, 0 replies; 13+ messages in thread
From: Tycho Andersen @ 2023-11-09 14:58 UTC (permalink / raw)
  To: Christian Brauner
  Cc: cgroups, linux-fsdevel, Tejun Heo, Zefan Li, Johannes Weiner,
	Haitao Huang, Kamalesh Babulal, Tycho Andersen

On Thu, Nov 09, 2023 at 10:53:17AM +0100, Christian Brauner wrote:
> > @@ -411,9 +453,22 @@ struct files_struct *dup_fd(struct files_struct *oldf, unsigned int max_fds, int
> >  
> >  	rcu_assign_pointer(newf->fdt, new_fdt);
> >  
> > -	return newf;
> > +	if (!charge_current_fds(newf, count_open_files(new_fdt)))
> > +		return newf;
> 
> 
> > @@ -542,6 +600,10 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
> >  	if (error)
> >  		goto repeat;
> >  
> > +	error = -EMFILE;
> > +	if (charge_current_fds(files, 1) < 0)
> > +		goto out;
> 
> Whoops, I had that message ready to fire but didn't send it.
> 
> This may have a noticeable performance impact as charge_current_fds()
> calls misc_cg_try_charge() which looks pretty expensive in this
> codepath.
> 
> We're constantly getting patches to tweak performance during file open
> and closing and adding a function that does require multiple atomics and
> spinlocks won't exactly improve this.

I don't see any spin locks in misc_cg_try_charge(), but it does walk
up the tree, resulting in multiple atomic writes.
If we didn't walk up the tree it would change the semantics, but
Netflix probably wouldn't delegate this, so at least for our purposes
having only one atomic would be sufficient. Is that more tenable?

> On top of that I really dislike that we're pulling cgroups into this
> code here at all.
> 
> Can you get a similar effect through a bpf program somehow that you
> don't even tie this to cgroups?

Possibly, I can look into it.

Tycho

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC 3/6] misc: introduce misc_cg_charge()
  2023-11-08  0:26 ` [RFC 3/6] misc: introduce misc_cg_charge() Tycho Andersen
@ 2023-11-09 18:04   ` kernel test robot
  0 siblings, 0 replies; 13+ messages in thread
From: kernel test robot @ 2023-11-09 18:04 UTC (permalink / raw)
  To: Tycho Andersen; +Cc: oe-kbuild-all

Hi Tycho,

[This is a private test report for your RFC patch.]
kernel test robot noticed the following build errors:

[auto build test ERROR on 13d88ac54ddd1011b6e94443958e798aa06eb835]

url:    https://github.com/intel-lab-lkp/linux/commits/Tycho-Andersen/fs-count_open_files-count_possible_open_files/20231108-083733
base:   13d88ac54ddd1011b6e94443958e798aa06eb835
patch link:    https://lore.kernel.org/r/20231108002647.73784-4-tycho%40tycho.pizza
patch subject: [RFC 3/6] misc: introduce misc_cg_charge()
config: arc-allmodconfig (https://download.01.org/0day-ci/archive/20231110/202311100146.ydtzwawu-lkp@intel.com/config)
compiler: arceb-elf-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231110/202311100146.ydtzwawu-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202311100146.ydtzwawu-lkp@intel.com/

All errors (new ones prefixed by >>):

   kernel/cgroup/misc.c: In function 'misc_cg_charge':
>> kernel/cgroup/misc.c:151:41: error: passing argument 2 of 'atomic_long_add' from incompatible pointer type [-Werror=incompatible-pointer-types]
     151 |                 atomic_long_add(amount, &res->usage);
         |                                         ^~~~~~~~~~~
         |                                         |
         |                                         atomic64_t *
   In file included from include/linux/atomic.h:82,
                    from include/linux/rcupdate.h:25,
                    from include/linux/rculist.h:11,
                    from include/linux/pid.h:5,
                    from include/linux/sched.h:14,
                    from include/linux/cgroup.h:12,
                    from kernel/cgroup/misc.c:10:
   include/linux/atomic/atomic-instrumented.h:3230:40: note: expected 'atomic_long_t *' {aka 'atomic_t *'} but argument is of type 'atomic64_t *'
    3230 | atomic_long_add(long i, atomic_long_t *v)
         |                         ~~~~~~~~~~~~~~~^
   cc1: some warnings being treated as errors


vim +/atomic_long_add +151 kernel/cgroup/misc.c

   123	
   124	/**
   125	 * misc_cg_charge() - Charge the cgroup, ignoring limits/capacity.
   126	 * @type: Misc res type to charge.
   127	 * @cg: Misc cgroup which will be charged.
   128	 * @amount: Amount to charge.
   129	 *
   130	 * Charge @amount to the misc cgroup. Caller must use the same cgroup during
   131	 * the uncharge call.
   132	 *
   133	 * Context: Any context.
   134	 */
   135	void misc_cg_charge(enum misc_res_type type, struct misc_cg *cg, u64 amount)
   136	{
   137		struct misc_res *res;
   138		struct misc_cg *i;
   139	
   140		if (!(valid_type(type) && cg && READ_ONCE(misc_res_capacity[type]))) {
   141			WARN_ON_ONCE(!valid_type(type));
   142			return;
   143		}
   144	
   145		if (!amount)
   146			return;
   147	
   148		for (i = cg; i; i = parent_misc(i)) {
   149			res = &i->res[type];
   150	
 > 151			atomic_long_add(amount, &res->usage);
   152		}
   153	}
   154	EXPORT_SYMBOL_GPL(misc_cg_charge);
   155	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC 0/6] tracking fd counts per cgroup
  2023-11-08  0:26 [RFC 0/6] tracking fd counts per cgroup Tycho Andersen
                   ` (5 preceding siblings ...)
  2023-11-08  0:26 ` [RFC 6/6] selftests/cgroup: add a test for misc cgroup Tycho Andersen
@ 2023-11-09 18:44 ` Tejun Heo
  6 siblings, 0 replies; 13+ messages in thread
From: Tejun Heo @ 2023-11-09 18:44 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: cgroups, linux-fsdevel, Christian Brauner, Zefan Li,
	Johannes Weiner, Haitao Huang, Kamalesh Babulal

Hello,

On Tue, Nov 07, 2023 at 05:26:41PM -0700, Tycho Andersen wrote:
> Hi all,
> 
> At Netflix, we have a "canary" framework that will run test versions of
> an application and automatically detect anomalies in various metrics. We
> also have two "fleets", one full of virtual machines, and one that is a
> multi-tenant container environment.
> 
> On our full-VM fleet, one of the metrics we analyze is the number of open file
> descriptors, from /proc/sys/fs/file-nr. However, no equivalent exists for the
> multi-tenant container fleet, which has lead to several production issues.
> 
> This idea is not new of course [1], but hopefully the existence of the new misc
> cgroup will make it more tenable.
> 
> I'm not really tied to any of the semantics in this series (e.g. threads could
> be double counted even with a shared table), and am open to implementing this
> in other ways if it makes more sense.

As already raised by Christian, if the goal is monitoring, you should be
able to do that easily with BPF.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2023-11-09 18:44 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-11-08  0:26 [RFC 0/6] tracking fd counts per cgroup Tycho Andersen
2023-11-08  0:26 ` [RFC 1/6] fs: count_open_files() -> count_possible_open_files() Tycho Andersen
2023-11-08  0:26 ` [RFC 2/6] fs: introduce count_open_files() Tycho Andersen
2023-11-08  0:26 ` [RFC 3/6] misc: introduce misc_cg_charge() Tycho Andersen
2023-11-09 18:04   ` kernel test robot
2023-11-08  0:26 ` [RFC 4/6] misc cgroup: introduce an fd counter Tycho Andersen
2023-11-08 16:57   ` Al Viro
2023-11-08 21:01     ` Tycho Andersen
2023-11-09  9:53   ` Christian Brauner
2023-11-09 14:58     ` Tycho Andersen
2023-11-08  0:26 ` [RFC 5/6] selftests/cgroup: add a flags arg to clone_into_cgroup() Tycho Andersen
2023-11-08  0:26 ` [RFC 6/6] selftests/cgroup: add a test for misc cgroup Tycho Andersen
2023-11-09 18:44 ` [RFC 0/6] tracking fd counts per cgroup Tejun Heo

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.