linux-kernel.vger.kernel.org archive mirror
* [RFC 00/18] Present useful limits to user
@ 2016-06-13 19:44 Topi Miettinen
  2016-06-13 19:44 ` [RFC 01/18] capabilities: track actually used capabilities Topi Miettinen
                   ` (17 more replies)
  0 siblings, 18 replies; 56+ messages in thread
From: Topi Miettinen @ 2016-06-13 19:44 UTC (permalink / raw)
  To: linux-kernel

Hello,

There are many basic ways to control processes, including capabilities,
cgroups and resource limits. However, there are far fewer ways to find out
useful values for the limits, except blind trial and error.

This patch series attempts to fix that by recording the maximum values
actually reached, which at least give a good starting point. I looked at
where each limit is checked and added a call nearby to bump the recorded
maximum.


Capabilities
[RFC 01/18] capabilities: track actually used capabilities

Currently, there is no way to know which capabilities are actually used. Even
the source code shows this only implicitly: in-depth knowledge of each
capability is needed when analyzing a program to judge which capabilities it
will exercise.
 
Cgroups
[RFC 02/18] cgroup_pids: track maximum pids
[RFC 03/18] memcontrol: present maximum used memory also for cgroup-v2
[RFC 04/18] device_cgroup: track and present accessed devices

For tasks and memory cgroup limits the situation is somewhat better as the
current tasks and memory status can be easily seen with ps(1). However, any
transient tasks or temporary higher memory use might slip from the view.
Device use may be seen with advanced MAC tools, like TOMOYO, but there is no
universal method. Program sources typically give no useful indication about
memory use or how many tasks there could be.
 
Resource limits
[RFC 05/18] limits: track and present RLIMIT_NOFILE actual max
[RFC 06/18] limits: present RLIMIT_CPU and RLIMIT_RTTIMER current status
[RFC 07/18] limits: track RLIMIT_FSIZE actual max
[RFC 08/18] limits: track RLIMIT_DATA actual max
[RFC 09/18] limits: track RLIMIT_CORE actual max
[RFC 10/18] limits: track RLIMIT_STACK actual max
[RFC 11/18] limits: track and present RLIMIT_NPROC actual max
[RFC 12/18] limits: track RLIMIT_MEMLOCK actual max
[RFC 13/18] limits: track RLIMIT_AS actual max
[RFC 14/18] limits: track RLIMIT_SIGPENDING actual max
[RFC 15/18] limits: track RLIMIT_MSGQUEUE actual max
[RFC 16/18] limits: track RLIMIT_NICE actual max
[RFC 17/18] limits: track RLIMIT_RTPRIO actual max
[RFC 18/18] proc: present VM_LOCKED memory in /proc/self/maps

The current number of files and current VM usage (data pages, address space
size) could be calculated from available /proc files. Again, any temporarily
higher values could easily be missed. For many limits, there is no way to see
the current situation, and the source code is of little help.

As a side note, the resource limits seem to be in bad shape. For example,
RLIMIT_MEMLOCK is used inconsistently and I think VM statistics can miss
some changes. Adding RLIMIT_CODE could be useful.

The current maximum values for the resource limits are now shown in
/proc/self/limits. If this is deemed too confusing for existing programs
that rely on the exact format, I can move it to a new file.


Finally, the patches work in my testing but I have probably missed finer
lock/RCU details.

-Topi


* [RFC 01/18] capabilities: track actually used capabilities
  2016-06-13 19:44 [RFC 00/18] Present useful limits to user Topi Miettinen
@ 2016-06-13 19:44 ` Topi Miettinen
  2016-06-13 20:32   ` Andy Lutomirski
  2016-06-13 19:44 ` [RFC 02/18] cgroup_pids: track maximum pids Topi Miettinen
                   ` (16 subsequent siblings)
  17 siblings, 1 reply; 56+ messages in thread
From: Topi Miettinen @ 2016-06-13 19:44 UTC (permalink / raw)
  To: linux-kernel
  Cc: Topi Miettinen, Alexander Viro, Ingo Molnar, Peter Zijlstra,
	Serge Hallyn, Andrew Morton, Kees Cook, Christoph Lameter,
	Serge E. Hallyn, Andy Shevchenko, Richard W.M. Jones,
	Iago López Galeiras, Chris Metcalf, Andy Lutomirski,
	Jann Horn, open list:FILESYSTEMS (VFS and infrastructure),
	open list:CAPABILITIES

Track what capabilities are actually used and present the current
situation in /proc/self/status.

Signed-off-by: Topi Miettinen <toiwoton@gmail.com>
---
 fs/exec.c             | 1 +
 fs/proc/array.c       | 1 +
 include/linux/sched.h | 1 +
 kernel/capability.c   | 1 +
 4 files changed, 4 insertions(+)

diff --git a/fs/exec.c b/fs/exec.c
index 887c1c9..ff6f644 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1269,6 +1269,7 @@ void setup_new_exec(struct linux_binprm * bprm)
 		if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP)
 			set_dumpable(current->mm, suid_dumpable);
 	}
+	cap_clear(current->cap_used);
 
 	/* An exec changes our domain. We are no longer part of the thread
 	   group */
diff --git a/fs/proc/array.c b/fs/proc/array.c
index 88c7de1..cccc9ee 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -343,6 +343,7 @@ static inline void task_cap(struct seq_file *m, struct task_struct *p)
 	render_cap_t(m, "CapEff:\t", &cap_effective);
 	render_cap_t(m, "CapBnd:\t", &cap_bset);
 	render_cap_t(m, "CapAmb:\t", &cap_ambient);
+	render_cap_t(m, "CapUsd:\t", &p->cap_used);
 }
 
 static inline void task_seccomp(struct seq_file *m, struct task_struct *p)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6e42ada..9c48a08 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1918,6 +1918,7 @@ struct task_struct {
 #ifdef CONFIG_MMU
 	struct task_struct *oom_reaper_list;
 #endif
+	kernel_cap_t	cap_used;	/* Capabilities actually used */
 /* CPU-specific state of this task */
 	struct thread_struct thread;
 /*
diff --git a/kernel/capability.c b/kernel/capability.c
index 45432b5..aad8854 100644
--- a/kernel/capability.c
+++ b/kernel/capability.c
@@ -380,6 +380,7 @@ bool ns_capable(struct user_namespace *ns, int cap)
 	}
 
 	if (security_capable(current_cred(), ns, cap) == 0) {
+		cap_raise(current->cap_used, cap);
 		current->flags |= PF_SUPERPRIV;
 		return true;
 	}
-- 
2.8.1
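
The bookkeeping in this patch can be illustrated outside the kernel. The
following is a minimal userspace sketch, not kernel code: struct sk_task,
sk_capable() and sk_exec() are hypothetical names, and a plain 64-bit mask
stands in for kernel_cap_t. It records a capability only when the check
succeeds and clears the record on exec, as the hunks above appear to do.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for per-task state; not the kernel's task_struct. */
struct sk_task {
	uint64_t cap_permitted;	/* capabilities the task may use */
	uint64_t cap_used;	/* capabilities that have passed a check */
};

/* Record the capability only when the check succeeds. */
static bool sk_capable(struct sk_task *t, unsigned int cap)
{
	assert(cap < 64);
	if (!(t->cap_permitted & (UINT64_C(1) << cap)))
		return false;
	t->cap_used |= UINT64_C(1) << cap;
	return true;
}

/* A new program starts with an empty "used" set. */
static void sk_exec(struct sk_task *t)
{
	t->cap_used = 0;
}
```

A failed check leaves the "used" mask untouched, so the mask reflects what
the program needed and was granted, not everything it attempted.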


* [RFC 02/18] cgroup_pids: track maximum pids
  2016-06-13 19:44 [RFC 00/18] Present useful limits to user Topi Miettinen
  2016-06-13 19:44 ` [RFC 01/18] capabilities: track actually used capabilities Topi Miettinen
@ 2016-06-13 19:44 ` Topi Miettinen
  2016-06-13 21:12   ` Tejun Heo
  2016-06-13 19:44 ` [RFC 03/18] memcontrol: present maximum used memory also for cgroup-v2 Topi Miettinen
                   ` (15 subsequent siblings)
  17 siblings, 1 reply; 56+ messages in thread
From: Topi Miettinen @ 2016-06-13 19:44 UTC (permalink / raw)
  To: linux-kernel
  Cc: Topi Miettinen, Tejun Heo, Li Zefan, Johannes Weiner,
	open list:CONTROL GROUP (CGROUP)

Track maximum pids in the cgroup, present it in cgroup pids.current_max.

Signed-off-by: Topi Miettinen <toiwoton@gmail.com>
---
 kernel/cgroup_pids.c | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/kernel/cgroup_pids.c b/kernel/cgroup_pids.c
index 303097b..53fb21d 100644
--- a/kernel/cgroup_pids.c
+++ b/kernel/cgroup_pids.c
@@ -48,6 +48,7 @@ struct pids_cgroup {
 	 * %PIDS_MAX = (%PID_MAX_LIMIT + 1).
 	 */
 	atomic64_t			counter;
+	atomic64_t			cur_max;
 	int64_t				limit;
 };
 
@@ -72,6 +73,7 @@ pids_css_alloc(struct cgroup_subsys_state *parent)
 
 	pids->limit = PIDS_MAX;
 	atomic64_set(&pids->counter, 0);
+	atomic64_set(&pids->cur_max, 0);
 	return &pids->css;
 }
 
@@ -182,6 +184,10 @@ static int pids_can_attach(struct cgroup_taskset *tset)
 
 		pids_charge(pids, 1);
 		pids_uncharge(old_pids, 1);
+		if (atomic64_read(&pids->cur_max) <
+		    atomic64_read(&pids->counter))
+			atomic64_set(&pids->cur_max,
+				     atomic64_read(&pids->counter));
 	}
 
 	return 0;
@@ -202,6 +208,10 @@ static void pids_cancel_attach(struct cgroup_taskset *tset)
 
 		pids_charge(old_pids, 1);
 		pids_uncharge(pids, 1);
+		if (atomic64_read(&old_pids->cur_max) <
+		    atomic64_read(&old_pids->counter))
+			atomic64_set(&old_pids->cur_max,
+				     atomic64_read(&old_pids->counter));
 	}
 }
 
@@ -236,6 +246,14 @@ static void pids_free(struct task_struct *task)
 	pids_uncharge(pids, 1);
 }
 
+static void pids_fork(struct task_struct *task)
+{
+	struct pids_cgroup *pids = css_pids(task_css(task, pids_cgrp_id));
+
+	if (atomic64_read(&pids->cur_max) < atomic64_read(&pids->counter))
+		atomic64_set(&pids->cur_max, atomic64_read(&pids->counter));
+}
+
 static ssize_t pids_max_write(struct kernfs_open_file *of, char *buf,
 			      size_t nbytes, loff_t off)
 {
@@ -288,6 +306,14 @@ static s64 pids_current_read(struct cgroup_subsys_state *css,
 	return atomic64_read(&pids->counter);
 }
 
+static s64 pids_current_max_read(struct cgroup_subsys_state *css,
+				 struct cftype *cft)
+{
+	struct pids_cgroup *pids = css_pids(css);
+
+	return atomic64_read(&pids->cur_max);
+}
+
 static struct cftype pids_files[] = {
 	{
 		.name = "max",
@@ -300,6 +326,11 @@ static struct cftype pids_files[] = {
 		.read_s64 = pids_current_read,
 		.flags = CFTYPE_NOT_ON_ROOT,
 	},
+	{
+		.name = "current_max",
+		.read_s64 = pids_current_max_read,
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
 	{ }	/* terminate */
 };
 
@@ -313,4 +344,5 @@ struct cgroup_subsys pids_cgrp_subsys = {
 	.free		= pids_free,
 	.legacy_cftypes	= pids_files,
 	.dfl_cftypes	= pids_files,
+	.fork		= pids_fork,
 };
-- 
2.8.1
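
The read-compare-set sequence used above to update cur_max is not a single
atomic step, so two tasks racing through it may be able to overwrite a
higher maximum with a lower one. One common alternative is a
compare-and-swap loop. The sketch below uses C11 atomics in userspace
rather than the kernel's atomic64 API, and sk_atomic_max() is a
hypothetical name.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Raise *max to val unless it is already at least val. */
static void sk_atomic_max(_Atomic int64_t *max, int64_t val)
{
	int64_t old;

	assert(max != NULL);
	old = atomic_load(max);
	/*
	 * On failure the compare-exchange stores the current value in old,
	 * so the loop re-checks against what another thread just wrote.
	 */
	while (old < val && !atomic_compare_exchange_weak(max, &old, val))
		;
}
```

The loop ends either when the stored maximum is already large enough or
when this caller's value has been installed, so the maximum never goes
down.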


* [RFC 03/18] memcontrol: present maximum used memory also for cgroup-v2
  2016-06-13 19:44 [RFC 00/18] Present useful limits to user Topi Miettinen
  2016-06-13 19:44 ` [RFC 01/18] capabilities: track actually used capabilities Topi Miettinen
  2016-06-13 19:44 ` [RFC 02/18] cgroup_pids: track maximum pids Topi Miettinen
@ 2016-06-13 19:44 ` Topi Miettinen
  2016-06-14  7:01   ` Michal Hocko
  2016-06-13 19:44 ` [RFC 04/18] device_cgroup: track and present accessed devices Topi Miettinen
                   ` (14 subsequent siblings)
  17 siblings, 1 reply; 56+ messages in thread
From: Topi Miettinen @ 2016-06-13 19:44 UTC (permalink / raw)
  To: linux-kernel
  Cc: Topi Miettinen, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	Andrew Morton,
	open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG),
	open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG)

Present maximum used memory in cgroup memory.current_max.

Signed-off-by: Topi Miettinen <toiwoton@gmail.com>
---
 include/linux/page_counter.h |  7 ++++++-
 mm/memcontrol.c              | 13 +++++++++++++
 2 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
index 7e62920..be4de17 100644
--- a/include/linux/page_counter.h
+++ b/include/linux/page_counter.h
@@ -9,9 +9,9 @@ struct page_counter {
 	atomic_long_t count;
 	unsigned long limit;
 	struct page_counter *parent;
+	unsigned long watermark;
 
 	/* legacy */
-	unsigned long watermark;
 	unsigned long failcnt;
 };
 
@@ -34,6 +34,11 @@ static inline unsigned long page_counter_read(struct page_counter *counter)
 	return atomic_long_read(&counter->count);
 }
 
+static inline unsigned long page_counter_read_watermark(struct page_counter *counter)
+{
+	return counter->watermark;
+}
+
 void page_counter_cancel(struct page_counter *counter, unsigned long nr_pages);
 void page_counter_charge(struct page_counter *counter, unsigned long nr_pages);
 bool page_counter_try_charge(struct page_counter *counter,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 75e7440..5513771 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4966,6 +4966,14 @@ static u64 memory_current_read(struct cgroup_subsys_state *css,
 	return (u64)page_counter_read(&memcg->memory) * PAGE_SIZE;
 }
 
+static u64 memory_current_max_read(struct cgroup_subsys_state *css,
+				   struct cftype *cft)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+
+	return (u64)page_counter_read_watermark(&memcg->memory) * PAGE_SIZE;
+}
+
 static int memory_low_show(struct seq_file *m, void *v)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
@@ -5179,6 +5187,11 @@ static struct cftype memory_files[] = {
 		.read_u64 = memory_current_read,
 	},
 	{
+		.name = "current_max",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_u64 = memory_current_max_read,
+	},
+	{
 		.name = "low",
 		.flags = CFTYPE_NOT_ON_ROOT,
 		.seq_show = memory_low_show,
-- 
2.8.1
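
For readers unfamiliar with the watermark field, the following userspace
sketch shows the general idea of a counter with a high-water mark that is
reported in bytes. It is an illustration under stated assumptions, not the
kernel's page_counter: struct sk_counter and its helpers are hypothetical,
the counter is not atomic, and a 4096-byte page size is assumed.

```c
#include <assert.h>
#include <stdint.h>

#define SK_PAGE_SIZE 4096UL	/* assumed page size for the sketch */

struct sk_counter {
	unsigned long count;		/* pages currently charged */
	unsigned long watermark;	/* highest count seen so far */
};

static void sk_charge(struct sk_counter *c, unsigned long nr_pages)
{
	c->count += nr_pages;
	if (c->count > c->watermark)
		c->watermark = c->count;
}

static void sk_uncharge(struct sk_counter *c, unsigned long nr_pages)
{
	assert(nr_pages <= c->count);
	c->count -= nr_pages;	/* the watermark is deliberately left alone */
}

/* Same conversion as memory_current_max_read(): pages to bytes. */
static uint64_t sk_watermark_bytes(const struct sk_counter *c)
{
	return (uint64_t)c->watermark * SK_PAGE_SIZE;
}
```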


* [RFC 04/18] device_cgroup: track and present accessed devices
  2016-06-13 19:44 [RFC 00/18] Present useful limits to user Topi Miettinen
                   ` (2 preceding siblings ...)
  2016-06-13 19:44 ` [RFC 03/18] memcontrol: present maximum used memory also for cgroup-v2 Topi Miettinen
@ 2016-06-13 19:44 ` Topi Miettinen
  2016-06-17 15:22   ` Serge E. Hallyn
  2016-06-13 19:44 ` [RFC 05/18] limits: track and present RLIMIT_NOFILE actual max Topi Miettinen
                   ` (13 subsequent siblings)
  17 siblings, 1 reply; 56+ messages in thread
From: Topi Miettinen @ 2016-06-13 19:44 UTC (permalink / raw)
  To: linux-kernel
  Cc: Topi Miettinen, James Morris, Serge E. Hallyn,
	open list:SECURITY SUBSYSTEM

Track which devices are accessed and present them in cgroup devices.accessed.

Signed-off-by: Topi Miettinen <toiwoton@gmail.com>
---
 security/device_cgroup.c | 70 +++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 60 insertions(+), 10 deletions(-)

diff --git a/security/device_cgroup.c b/security/device_cgroup.c
index 03c1652..45aa730 100644
--- a/security/device_cgroup.c
+++ b/security/device_cgroup.c
@@ -48,6 +48,7 @@ struct dev_exception_item {
 struct dev_cgroup {
 	struct cgroup_subsys_state css;
 	struct list_head exceptions;
+	struct list_head accessed;
 	enum devcg_behavior behavior;
 };
 
@@ -90,7 +91,7 @@ free_and_exit:
 /*
  * called under devcgroup_mutex
  */
-static int dev_exception_add(struct dev_cgroup *dev_cgroup,
+static int dev_exception_add(struct list_head *exceptions,
 			     struct dev_exception_item *ex)
 {
 	struct dev_exception_item *excopy, *walk;
@@ -101,7 +102,7 @@ static int dev_exception_add(struct dev_cgroup *dev_cgroup,
 	if (!excopy)
 		return -ENOMEM;
 
-	list_for_each_entry(walk, &dev_cgroup->exceptions, list) {
+	list_for_each_entry(walk, exceptions, list) {
 		if (walk->type != ex->type)
 			continue;
 		if (walk->major != ex->major)
@@ -115,7 +116,7 @@ static int dev_exception_add(struct dev_cgroup *dev_cgroup,
 	}
 
 	if (excopy != NULL)
-		list_add_tail_rcu(&excopy->list, &dev_cgroup->exceptions);
+		list_add_tail_rcu(&excopy->list, exceptions);
 	return 0;
 }
 
@@ -155,6 +156,16 @@ static void __dev_exception_clean(struct dev_cgroup *dev_cgroup)
 	}
 }
 
+static void dev_accessed_clean(struct dev_cgroup *dev_cgroup)
+{
+	struct dev_exception_item *ex, *tmp;
+
+	list_for_each_entry_safe(ex, tmp, &dev_cgroup->accessed, list) {
+		list_del_rcu(&ex->list);
+		kfree_rcu(ex, rcu);
+	}
+}
+
 /**
  * dev_exception_clean - frees all entries of the exception list
  * @dev_cgroup: dev_cgroup with the exception list to be cleaned
@@ -221,6 +232,7 @@ devcgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 	if (!dev_cgroup)
 		return ERR_PTR(-ENOMEM);
 	INIT_LIST_HEAD(&dev_cgroup->exceptions);
+	INIT_LIST_HEAD(&dev_cgroup->accessed);
 	dev_cgroup->behavior = DEVCG_DEFAULT_NONE;
 
 	return &dev_cgroup->css;
@@ -231,6 +243,7 @@ static void devcgroup_css_free(struct cgroup_subsys_state *css)
 	struct dev_cgroup *dev_cgroup = css_to_devcgroup(css);
 
 	__dev_exception_clean(dev_cgroup);
+	dev_accessed_clean(dev_cgroup);
 	kfree(dev_cgroup);
 }
 
@@ -272,9 +285,9 @@ static void set_majmin(char *str, unsigned m)
 		sprintf(str, "%u", m);
 }
 
-static int devcgroup_seq_show(struct seq_file *m, void *v)
+static int devcgroup_seq_show_list(struct seq_file *m, struct dev_cgroup *devcgroup,
+				   struct list_head *exceptions, bool allow)
 {
-	struct dev_cgroup *devcgroup = css_to_devcgroup(seq_css(m));
 	struct dev_exception_item *ex;
 	char maj[MAJMINLEN], min[MAJMINLEN], acc[ACCLEN];
 
@@ -285,14 +298,14 @@ static int devcgroup_seq_show(struct seq_file *m, void *v)
 	 * - List the exceptions in case the default policy is to deny
 	 * This way, the file remains as a "whitelist of devices"
 	 */
-	if (devcgroup->behavior == DEVCG_DEFAULT_ALLOW) {
+	if (allow) {
 		set_access(acc, ACC_MASK);
 		set_majmin(maj, ~0);
 		set_majmin(min, ~0);
 		seq_printf(m, "%c %s:%s %s\n", type_to_char(DEV_ALL),
 			   maj, min, acc);
 	} else {
-		list_for_each_entry_rcu(ex, &devcgroup->exceptions, list) {
+		list_for_each_entry_rcu(ex, exceptions, list) {
 			set_access(acc, ex->access);
 			set_majmin(maj, ex->major);
 			set_majmin(min, ex->minor);
@@ -305,6 +318,36 @@ static int devcgroup_seq_show(struct seq_file *m, void *v)
 	return 0;
 }
 
+static int devcgroup_seq_show(struct seq_file *m, void *v)
+{
+	struct dev_cgroup *devcgroup = css_to_devcgroup(seq_css(m));
+
+	return devcgroup_seq_show_list(m, devcgroup, &devcgroup->exceptions,
+				       devcgroup->behavior == DEVCG_DEFAULT_ALLOW);
+}
+
+static int devcgroup_seq_show_accessed(struct seq_file *m, void *v)
+{
+	struct dev_cgroup *devcgroup = css_to_devcgroup(seq_css(m));
+
+	return devcgroup_seq_show_list(m, devcgroup, &devcgroup->accessed, false);
+}
+
+static void devcgroup_add_accessed(struct dev_cgroup *dev_cgroup, short type,
+				   u32 major, u32 minor, short access)
+{
+	struct dev_exception_item ex;
+
+	ex.type = type;
+	ex.major = major;
+	ex.minor = minor;
+	ex.access = access;
+
+	mutex_lock(&devcgroup_mutex);
+	dev_exception_add(&dev_cgroup->accessed, &ex);
+	mutex_unlock(&devcgroup_mutex);
+}
+
 /**
  * match_exception	- iterates the exception list trying to find a complete match
  * @exceptions: list of exceptions
@@ -566,7 +609,7 @@ static int propagate_exception(struct dev_cgroup *devcg_root,
 		 */
 		if (devcg_root->behavior == DEVCG_DEFAULT_ALLOW &&
 		    devcg->behavior == DEVCG_DEFAULT_ALLOW) {
-			rc = dev_exception_add(devcg, ex);
+			rc = dev_exception_add(&devcg->exceptions, ex);
 			if (rc)
 				break;
 		} else {
@@ -736,7 +779,7 @@ static int devcgroup_update_access(struct dev_cgroup *devcgroup,
 
 		if (!parent_has_perm(devcgroup, &ex))
 			return -EPERM;
-		rc = dev_exception_add(devcgroup, &ex);
+		rc = dev_exception_add(&devcgroup->exceptions, &ex);
 		break;
 	case DEVCG_DENY:
 		/*
@@ -747,7 +790,7 @@ static int devcgroup_update_access(struct dev_cgroup *devcgroup,
 		if (devcgroup->behavior == DEVCG_DEFAULT_DENY)
 			dev_exception_rm(devcgroup, &ex);
 		else
-			rc = dev_exception_add(devcgroup, &ex);
+			rc = dev_exception_add(&devcgroup->exceptions, &ex);
 
 		if (rc)
 			break;
@@ -788,6 +831,11 @@ static struct cftype dev_cgroup_files[] = {
 		.seq_show = devcgroup_seq_show,
 		.private = DEVCG_LIST,
 	},
+	{
+		.name = "accessed",
+		.seq_show = devcgroup_seq_show_accessed,
+		.private = DEVCG_LIST,
+	},
 	{ }	/* terminate */
 };
 
@@ -830,6 +878,8 @@ static int __devcgroup_check_permission(short type, u32 major, u32 minor,
 	if (!rc)
 		return -EPERM;
 
+	devcgroup_add_accessed(dev_cgroup, type, major, minor, access);
+
 	return 0;
 }
 
-- 
2.8.1
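
Because the accessed list is filled through dev_exception_add(), repeated
accesses to the same device are presumably folded into one entry rather
than appended each time. The sketch below illustrates that kind of
merge-on-insert in userspace. It is an assumption-laden illustration, not
the kernel code: the fixed-size table, the SK_* access bits and all names
are hypothetical, and there is no locking or RCU.

```c
#include <stdbool.h>
#include <stddef.h>

#define SK_MAX_ACCESSED 16	/* arbitrary bound for the sketch */
#define SK_ACC_READ  2u		/* hypothetical access bits */
#define SK_ACC_WRITE 4u

struct sk_dev_access {
	char type;		/* 'c' or 'b' */
	unsigned int major, minor;
	unsigned int access;	/* OR of SK_ACC_* bits seen so far */
};

struct sk_accessed {
	struct sk_dev_access items[SK_MAX_ACCESSED];
	size_t n;
};

/* Merge into an existing entry if present; returns false if the table is full. */
static bool sk_accessed_add(struct sk_accessed *s, char type,
			    unsigned int major, unsigned int minor,
			    unsigned int access)
{
	for (size_t i = 0; i < s->n; i++) {
		struct sk_dev_access *d = &s->items[i];

		if (d->type == type && d->major == major && d->minor == minor) {
			d->access |= access;
			return true;
		}
	}
	if (s->n == SK_MAX_ACCESSED)
		return false;
	s->items[s->n++] = (struct sk_dev_access){ type, major, minor, access };
	return true;
}
```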


* [RFC 05/18] limits: track and present RLIMIT_NOFILE actual max
  2016-06-13 19:44 [RFC 00/18] Present useful limits to user Topi Miettinen
                   ` (3 preceding siblings ...)
  2016-06-13 19:44 ` [RFC 04/18] device_cgroup: track and present accessed devices Topi Miettinen
@ 2016-06-13 19:44 ` Topi Miettinen
  2016-06-13 20:40   ` Andy Lutomirski
  2016-06-13 19:44 ` [RFC 06/18] limits: present RLIMIT_CPU and RLIMIT_RTTIMER current status Topi Miettinen
                   ` (12 subsequent siblings)
  17 siblings, 1 reply; 56+ messages in thread
From: Topi Miettinen @ 2016-06-13 19:44 UTC (permalink / raw)
  To: linux-kernel
  Cc: Topi Miettinen, Alexander Viro, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Kees Cook, Cyrill Gorcunov, Alexey Dobriyan,
	John Stultz, Janis Danisevskis, Calvin Owens, Jann Horn,
	open list:FILESYSTEMS (VFS and infrastructure)

Track maximum number of files for the process, present current maximum
in /proc/self/limits.

Signed-off-by: Topi Miettinen <toiwoton@gmail.com>
---
 fs/file.c             |  4 ++++
 fs/proc/base.c        | 10 ++++++----
 include/linux/sched.h |  7 +++++++
 3 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index 6b1acdf..2d0d206 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -547,6 +547,8 @@ repeat:
 	}
 #endif
 
+	bump_rlimit(RLIMIT_NOFILE, fd);
+
 out:
 	spin_unlock(&files->file_lock);
 	return error;
@@ -857,6 +859,8 @@ __releases(&files->file_lock)
 	if (tofree)
 		filp_close(tofree, files);
 
+	bump_rlimit(RLIMIT_NOFILE, fd);
+
 	return fd;
 
 Ebusy:
diff --git a/fs/proc/base.c b/fs/proc/base.c
index a11eb71..227997b 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -630,8 +630,8 @@ static int proc_pid_limits(struct seq_file *m, struct pid_namespace *ns,
 	/*
 	 * print the file header
 	 */
-       seq_printf(m, "%-25s %-20s %-20s %-10s\n",
-		  "Limit", "Soft Limit", "Hard Limit", "Units");
+	seq_printf(m, "%-25s %-20s %-20s %-10s %-20s\n",
+		   "Limit", "Soft Limit", "Hard Limit", "Units", "Max");
 
 	for (i = 0; i < RLIM_NLIMITS; i++) {
 		if (rlim[i].rlim_cur == RLIM_INFINITY)
@@ -647,9 +647,11 @@ static int proc_pid_limits(struct seq_file *m, struct pid_namespace *ns,
 			seq_printf(m, "%-20lu ", rlim[i].rlim_max);
 
 		if (lnames[i].unit)
-			seq_printf(m, "%-10s\n", lnames[i].unit);
+			seq_printf(m, "%-10s", lnames[i].unit);
 		else
-			seq_putc(m, '\n');
+			seq_printf(m, "%-10s", "");
+		seq_printf(m, "%-20lu\n",
+			   task->signal->rlim_curmax[i]);
 	}
 
 	return 0;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9c48a08..0150380 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -782,6 +782,7 @@ struct signal_struct {
 	 * have no need to disable irqs.
 	 */
 	struct rlimit rlim[RLIM_NLIMITS];
+	unsigned long rlim_curmax[RLIM_NLIMITS];
 
 #ifdef CONFIG_BSD_PROCESS_ACCT
 	struct pacct_struct pacct;	/* per-process accounting information */
@@ -3376,6 +3377,12 @@ static inline unsigned long rlimit_max(unsigned int limit)
 	return task_rlimit_max(current, limit);
 }
 
+static inline void bump_rlimit(unsigned int limit, unsigned long r)
+{
+	if (READ_ONCE(current->signal->rlim_curmax[limit]) < r)
+		current->signal->rlim_curmax[limit] = r;
+}
+
 #ifdef CONFIG_CPU_FREQ
 struct update_util_data {
 	void (*func)(struct update_util_data *data,
-- 
2.8.1
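
A note on interpreting the value recorded for RLIMIT_NOFILE: the patch
bumps the maximum with the descriptor number itself, and RLIMIT_NOFILE is
defined as one greater than the highest descriptor number a process may
open. The userspace sketch below shows the same high-water-mark update and
how a limit could be derived from it. All sk_* names are hypothetical, and
there is no locking.

```c
#include <assert.h>

/* Hypothetical limit indices; the kernel uses the RLIMIT_* constants. */
enum { SK_NOFILE, SK_FSIZE, SK_NLIMITS };

static unsigned long sk_curmax[SK_NLIMITS];

/* Same shape as bump_rlimit(): keep the largest value seen per limit. */
static void sk_bump(unsigned int limit, unsigned long val)
{
	assert(limit < SK_NLIMITS);
	if (sk_curmax[limit] < val)
		sk_curmax[limit] = val;
}

/*
 * The smallest RLIMIT_NOFILE that would have allowed the observed run is
 * the highest descriptor number plus one.
 */
static unsigned long sk_suggest_nofile(void)
{
	return sk_curmax[SK_NOFILE] + 1;
}
```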


* [RFC 06/18] limits: present RLIMIT_CPU and RLIMIT_RTTIMER current status
  2016-06-13 19:44 [RFC 00/18] Present useful limits to user Topi Miettinen
                   ` (4 preceding siblings ...)
  2016-06-13 19:44 ` [RFC 05/18] limits: track and present RLIMIT_NOFILE actual max Topi Miettinen
@ 2016-06-13 19:44 ` Topi Miettinen
  2016-06-14  9:14   ` Alexey Dobriyan
  2016-06-13 19:44 ` [RFC 07/18] limits: track RLIMIT_FSIZE actual max Topi Miettinen
                   ` (11 subsequent siblings)
  17 siblings, 1 reply; 56+ messages in thread
From: Topi Miettinen @ 2016-06-13 19:44 UTC (permalink / raw)
  To: linux-kernel
  Cc: Topi Miettinen, Andrew Morton, Kees Cook, Al Viro,
	Alexey Dobriyan, John Stultz, Janis Danisevskis, Calvin Owens,
	Jann Horn

Present current cputimer status in /proc/self/limits.

Signed-off-by: Topi Miettinen <toiwoton@gmail.com>
---
 fs/proc/base.c | 26 ++++++++++++++++++++++++--
 1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 227997b..1df4fc8 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -650,8 +650,30 @@ static int proc_pid_limits(struct seq_file *m, struct pid_namespace *ns,
 			seq_printf(m, "%-10s", lnames[i].unit);
 		else
 			seq_printf(m, "%-10s", "");
-		seq_printf(m, "%-20lu\n",
-			   task->signal->rlim_curmax[i]);
+
+		switch (i) {
+		case RLIMIT_RTTIME:
+		case RLIMIT_CPU:
+			if (rlim[i].rlim_max == RLIM_INFINITY)
+				seq_printf(m, "%-20s\n", "-");
+			else {
+				unsigned long long utime, ptime;
+				unsigned long psecs;
+				struct task_cputime cputime;
+
+				thread_group_cputimer(task, &cputime);
+				utime = cputime_to_expires(cputime.utime);
+				ptime = utime + cputime_to_expires(cputime.stime);
+				psecs = cputime_to_secs(ptime);
+				if (i == RLIMIT_RTTIME)
+					psecs *= USEC_PER_SEC;
+				seq_printf(m, "%-20lu\n", psecs);
+			}
+			break;
+		default:
+			seq_printf(m, "%-20lu\n",
+				   task->signal->rlim_curmax[i]);
+		}
 	}
 
 	return 0;
-- 
2.8.1
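
The hunk above converts the summed CPU time to whole seconds and, for
RLIMIT_RTTIME, multiplies that by USEC_PER_SEC, so the microsecond figure
can only be a multiple of one second. The userspace sketch below shows the
two conversions done directly from a finer-grained total. It assumes the
times are available in nanoseconds, which is an assumption of the sketch
and not how cputime is represented in the patch; the sk_* names are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SK_NSEC_PER_SEC  UINT64_C(1000000000)
#define SK_NSEC_PER_USEC UINT64_C(1000)

/* Total CPU time in whole seconds, the unit of RLIMIT_CPU. */
static uint64_t sk_cpu_secs(uint64_t utime_ns, uint64_t stime_ns)
{
	assert(utime_ns <= UINT64_MAX - stime_ns);	/* no overflow */
	return (utime_ns + stime_ns) / SK_NSEC_PER_SEC;
}

/* Total CPU time in microseconds, the unit of RLIMIT_RTTIME. */
static uint64_t sk_cpu_usecs(uint64_t utime_ns, uint64_t stime_ns)
{
	assert(utime_ns <= UINT64_MAX - stime_ns);	/* no overflow */
	return (utime_ns + stime_ns) / SK_NSEC_PER_USEC;
}
```

Converting to microseconds directly keeps the sub-second part that is lost
when whole seconds are scaled up afterwards.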


* [RFC 07/18] limits: track RLIMIT_FSIZE actual max
  2016-06-13 19:44 [RFC 00/18] Present useful limits to user Topi Miettinen
                   ` (5 preceding siblings ...)
  2016-06-13 19:44 ` [RFC 06/18] limits: present RLIMIT_CPU and RLIMIT_RTTIMER current status Topi Miettinen
@ 2016-06-13 19:44 ` Topi Miettinen
  2016-06-13 19:44 ` [RFC 08/18] limits: track RLIMIT_DATA " Topi Miettinen
                   ` (10 subsequent siblings)
  17 siblings, 0 replies; 56+ messages in thread
From: Topi Miettinen @ 2016-06-13 19:44 UTC (permalink / raw)
  To: linux-kernel
  Cc: Topi Miettinen, Alexander Viro, Andrew Morton, Jan Kara,
	Johannes Weiner, Michal Hocko, Ross Zwisler, Kirill A. Shutemov,
	Mel Gorman, Junichi Nomura, Matthew Wilcox,
	open list:FILESYSTEMS (VFS and infrastructure),
	open list:MEMORY MANAGEMENT

Track maximum file size, presented in /proc/self/limits.

Signed-off-by: Topi Miettinen <toiwoton@gmail.com>
---
 fs/attr.c    | 2 ++
 mm/filemap.c | 1 +
 2 files changed, 3 insertions(+)

diff --git a/fs/attr.c b/fs/attr.c
index 25b24d0..1b620f7 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -116,6 +116,8 @@ int inode_newsize_ok(const struct inode *inode, loff_t offset)
 			return -ETXTBSY;
 	}
 
+	bump_rlimit(RLIMIT_FSIZE, offset);
+
 	return 0;
 out_sig:
 	send_sig(SIGXFSZ, current, 0);
diff --git a/mm/filemap.c b/mm/filemap.c
index 00ae878..1fa9864 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2447,6 +2447,7 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
 			send_sig(SIGXFSZ, current, 0);
 			return -EFBIG;
 		}
+		bump_rlimit(RLIMIT_FSIZE, iocb->ki_pos);
 		iov_iter_truncate(from, limit - (unsigned long)pos);
 	}
 
-- 
2.8.1


* [RFC 08/18] limits: track RLIMIT_DATA actual max
  2016-06-13 19:44 [RFC 00/18] Present useful limits to user Topi Miettinen
                   ` (6 preceding siblings ...)
  2016-06-13 19:44 ` [RFC 07/18] limits: track RLIMIT_FSIZE actual max Topi Miettinen
@ 2016-06-13 19:44 ` Topi Miettinen
  2016-06-13 19:44 ` [RFC 09/18] limits: track RLIMIT_CORE " Topi Miettinen
                   ` (9 subsequent siblings)
  17 siblings, 0 replies; 56+ messages in thread
From: Topi Miettinen @ 2016-06-13 19:44 UTC (permalink / raw)
  To: linux-kernel
  Cc: Topi Miettinen, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Alexander Viro, Michal Hocko, Andrew Morton, Vlastimil Babka,
	Cyrill Gorcunov, Eric W. Biederman, Mateusz Guzik, John Stultz,
	Ben Segall, Alexey Dobriyan, Kirill A. Shutemov, Oleg Nesterov,
	Chen Gang, Konstantin Khlebnikov, Andrea Arcangeli,
	Andrey Ryabinin, open list:FILESYSTEMS (VFS and infrastructure),
	open list:MEMORY MANAGEMENT

Track maximum size of data VM, presented in /proc/self/limits.

Signed-off-by: Topi Miettinen <toiwoton@gmail.com>
---
 arch/x86/ia32/ia32_aout.c | 1 +
 fs/binfmt_aout.c          | 1 +
 fs/binfmt_flat.c          | 1 +
 kernel/sys.c              | 2 ++
 mm/mmap.c                 | 6 +++++-
 5 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/x86/ia32/ia32_aout.c b/arch/x86/ia32/ia32_aout.c
index cb26f18..8a7d502 100644
--- a/arch/x86/ia32/ia32_aout.c
+++ b/arch/x86/ia32/ia32_aout.c
@@ -398,6 +398,7 @@ beyond_if:
 	regs->r8 = regs->r9 = regs->r10 = regs->r11 =
 	regs->r12 = regs->r13 = regs->r14 = regs->r15 = 0;
 	set_fs(USER_DS);
+	bump_rlimit(RLIMIT_DATA, ex.a_data + ex.a_bss);
 	return 0;
 }
 
diff --git a/fs/binfmt_aout.c b/fs/binfmt_aout.c
index ae1b540..86c6548 100644
--- a/fs/binfmt_aout.c
+++ b/fs/binfmt_aout.c
@@ -330,6 +330,7 @@ beyond_if:
 	regs->gp = ex.a_gpvalue;
 #endif
 	start_thread(regs, ex.a_entry, current->mm->start_stack);
+	bump_rlimit(RLIMIT_DATA, ex.a_data + ex.a_bss);
 	return 0;
 }
 
diff --git a/fs/binfmt_flat.c b/fs/binfmt_flat.c
index caf9e39..e309dad 100644
--- a/fs/binfmt_flat.c
+++ b/fs/binfmt_flat.c
@@ -792,6 +792,7 @@ static int load_flat_file(struct linux_binprm * bprm,
 			libinfo->lib_list[id].start_brk) +	/* start brk */
 			stack_len);
 
+	bump_rlimit(RLIMIT_DATA, data_len + bss_len);
 	return 0;
 err:
 	return ret;
diff --git a/kernel/sys.c b/kernel/sys.c
index 89d5be4..6629f6f 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1896,6 +1896,8 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data
 	if (prctl_map.auxv_size)
 		memcpy(mm->saved_auxv, user_auxv, sizeof(user_auxv));
 
+	bump_rlimit(RLIMIT_DATA, mm->end_data - mm->start_data);
+
 	up_write(&mm->mmap_sem);
 	return 0;
 }
diff --git a/mm/mmap.c b/mm/mmap.c
index de2c176..61867de 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -228,6 +228,8 @@ SYSCALL_DEFINE1(brk, unsigned long, brk)
 		goto out;
 
 set_brk:
+	bump_rlimit(RLIMIT_DATA, (brk - mm->start_brk) +
+		    (mm->end_data - mm->start_data));
 	mm->brk = brk;
 	populate = newbrk > oldbrk && (mm->def_flags & VM_LOCKED) != 0;
 	up_write(&mm->mmap_sem);
@@ -2924,8 +2926,10 @@ void vm_stat_account(struct mm_struct *mm, vm_flags_t flags, long npages)
 		mm->exec_vm += npages;
 	else if (is_stack_mapping(flags))
 		mm->stack_vm += npages;
-	else if (is_data_mapping(flags))
+	else if (is_data_mapping(flags)) {
 		mm->data_vm += npages;
+		bump_rlimit(RLIMIT_DATA, mm->data_vm << PAGE_SHIFT);
+	}
 }
 
 static int special_mapping_fault(struct vm_area_struct *vma,
-- 
2.8.1


* [RFC 09/18] limits: track RLIMIT_CORE actual max
  2016-06-13 19:44 [RFC 00/18] Present useful limits to user Topi Miettinen
                   ` (7 preceding siblings ...)
  2016-06-13 19:44 ` [RFC 08/18] limits: track RLIMIT_DATA " Topi Miettinen
@ 2016-06-13 19:44 ` Topi Miettinen
  2016-06-13 19:44 ` [RFC 10/18] limits: track RLIMIT_STACK " Topi Miettinen
                   ` (8 subsequent siblings)
  17 siblings, 0 replies; 56+ messages in thread
From: Topi Miettinen @ 2016-06-13 19:44 UTC (permalink / raw)
  To: linux-kernel
  Cc: Topi Miettinen, Alexander Viro,
	open list:FILESYSTEMS (VFS and infrastructure)

Track maximum size of core dump written, presented in /proc/self/limits.

Signed-off-by: Topi Miettinen <toiwoton@gmail.com>
---
 fs/coredump.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index 281b768..abedc99 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -784,20 +784,25 @@ int dump_emit(struct coredump_params *cprm, const void *addr, int nr)
 	struct file *file = cprm->file;
 	loff_t pos = file->f_pos;
 	ssize_t n;
+	int r = 0;
+
 	if (cprm->written + nr > cprm->limit)
 		return 0;
 	while (nr) {
 		if (dump_interrupted())
-			return 0;
+			goto err;
 		n = __kernel_write(file, addr, nr, &pos);
 		if (n <= 0)
-			return 0;
+			goto err;
 		file->f_pos = pos;
 		cprm->written += n;
 		cprm->pos += n;
 		nr -= n;
 	}
-	return 1;
+	r = 1;
+ err:
+	bump_rlimit(RLIMIT_CORE, cprm->written);
+	return r;
 }
 EXPORT_SYMBOL(dump_emit);
 
-- 
2.8.1


* [RFC 10/18] limits: track RLIMIT_STACK actual max
  2016-06-13 19:44 [RFC 00/18] Present useful limits to user Topi Miettinen
                   ` (8 preceding siblings ...)
  2016-06-13 19:44 ` [RFC 09/18] limits: track RLIMIT_CORE " Topi Miettinen
@ 2016-06-13 19:44 ` Topi Miettinen
  2016-06-13 19:44 ` [RFC 11/18] limits: track and present RLIMIT_NPROC " Topi Miettinen
                   ` (7 subsequent siblings)
  17 siblings, 0 replies; 56+ messages in thread
From: Topi Miettinen @ 2016-06-13 19:44 UTC (permalink / raw)
  To: linux-kernel
  Cc: Topi Miettinen, Andrew Morton, Oleg Nesterov, Kirill A. Shutemov,
	Chen Gang, Michal Hocko, Konstantin Khlebnikov, Andrea Arcangeli,
	Andrey Ryabinin, open list:MEMORY MANAGEMENT

Track maximum stack size, presented in /proc/self/limits.

Signed-off-by: Topi Miettinen <toiwoton@gmail.com>
---
 mm/mmap.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/mmap.c b/mm/mmap.c
index 61867de..0963e7f 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2019,6 +2019,8 @@ static int acct_stack_growth(struct vm_area_struct *vma, unsigned long size, uns
 	if (security_vm_enough_memory_mm(mm, grow))
 		return -ENOMEM;
 
+	bump_rlimit(RLIMIT_STACK, actual_size);
+
 	return 0;
 }
 
-- 
2.8.1

^ permalink raw reply	[flat|nested] 56+ messages in thread

* [RFC 11/18] limits: track and present RLIMIT_NPROC actual max
  2016-06-13 19:44 [RFC 00/18] Present useful limits to user Topi Miettinen
                   ` (9 preceding siblings ...)
  2016-06-13 19:44 ` [RFC 10/18] limits: track RLIMIT_STACK " Topi Miettinen
@ 2016-06-13 19:44 ` Topi Miettinen
  2016-06-13 22:27   ` Jann Horn
  2016-06-13 19:44 ` [RFC 13/18] limits: track RLIMIT_AS " Topi Miettinen
                   ` (6 subsequent siblings)
  17 siblings, 1 reply; 56+ messages in thread
From: Topi Miettinen @ 2016-06-13 19:44 UTC (permalink / raw)
  To: linux-kernel
  Cc: Topi Miettinen, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Kees Cook, Al Viro, Alexey Dobriyan, John Stultz,
	Janis Danisevskis, Calvin Owens, Jann Horn, Tejun Heo,
	Michal Hocko, Oleg Nesterov, Vladimir Davydov, Andrea Arcangeli,
	Josh Triplett, Eric W. Biederman, Aleksa Sarai, Cyrill Gorcunov,
	Ben Segall, Mateusz Guzik

Track maximum number of processes per user and present it
in /proc/self/limits.

Signed-off-by: Topi Miettinen <toiwoton@gmail.com>
---
 fs/proc/base.c        | 4 ++++
 include/linux/sched.h | 1 +
 kernel/fork.c         | 5 +++++
 kernel/sys.c          | 5 +++++
 4 files changed, 15 insertions(+)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 1df4fc8..02576c6 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -670,6 +670,10 @@ static int proc_pid_limits(struct seq_file *m, struct pid_namespace *ns,
 				seq_printf(m, "%-20lu\n", psecs);
 			}
 			break;
+		case RLIMIT_NPROC:
+			seq_printf(m, "%-20d\n",
+				   atomic_read(&task->real_cred->user->max_processes));
+			break;
 		default:
 			seq_printf(m, "%-20lu\n",
 				   task->signal->rlim_curmax[i]);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0150380..feb9bb7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -838,6 +838,7 @@ static inline int signal_group_exit(const struct signal_struct *sig)
 struct user_struct {
 	atomic_t __count;	/* reference count */
 	atomic_t processes;	/* How many processes does this user have? */
+	atomic_t max_processes;	/* How many processes has this user had at the same time? */
 	atomic_t sigpending;	/* How many pending signals does this user have? */
 #ifdef CONFIG_INOTIFY_USER
 	atomic_t inotify_watches; /* How many inotify watches does this user have? */
diff --git a/kernel/fork.c b/kernel/fork.c
index 5c2c355..667290f 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1653,6 +1653,11 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	trace_task_newtask(p, clone_flags);
 	uprobe_copy_process(p, clone_flags);
 
+	if (atomic_read(&p->real_cred->user->max_processes) <
+	    atomic_read(&p->real_cred->user->processes))
+		atomic_set(&p->real_cred->user->max_processes,
+			   atomic_read(&p->real_cred->user->processes));
+
 	return p;
 
 bad_fork_cancel_cgroup:
diff --git a/kernel/sys.c b/kernel/sys.c
index 6629f6f..955cf21 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -439,6 +439,11 @@ static int set_user(struct cred *new)
 	else
 		current->flags &= ~PF_NPROC_EXCEEDED;
 
+	if (atomic_read(&new_user->max_processes) <
+	    atomic_read(&new_user->processes))
+		atomic_set(&new_user->max_processes,
+			   atomic_read(&new_user->processes));
+
 	free_uid(new->user);
 	new->user = new_user;
 	return 0;
-- 
2.8.1

^ permalink raw reply	[flat|nested] 56+ messages in thread

* [RFC 13/18] limits: track RLIMIT_AS actual max
  2016-06-13 19:44 [RFC 00/18] Present useful limits to user Topi Miettinen
                   ` (10 preceding siblings ...)
  2016-06-13 19:44 ` [RFC 11/18] limits: track and present RLIMIT_NPROC " Topi Miettinen
@ 2016-06-13 19:44 ` Topi Miettinen
  2016-06-13 19:44 ` [RFC 14/18] limits: track RLIMIT_SIGPENDING " Topi Miettinen
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 56+ messages in thread
From: Topi Miettinen @ 2016-06-13 19:44 UTC (permalink / raw)
  To: linux-kernel
  Cc: Topi Miettinen, Andrew Morton, Oleg Nesterov, Kirill A. Shutemov,
	Chen Gang, Michal Hocko, Konstantin Khlebnikov, Andrea Arcangeli,
	Andrey Ryabinin, David Rientjes, Vlastimil Babka, Hugh Dickins,
	Laurent Dufour, Alexander Kuleshov, open list:MEMORY MANAGEMENT

Track maximum size of address space, presented in /proc/self/limits.

Signed-off-by: Topi Miettinen <toiwoton@gmail.com>
---
 mm/mmap.c   | 4 ++++
 mm/mremap.c | 3 +++
 2 files changed, 7 insertions(+)

diff --git a/mm/mmap.c b/mm/mmap.c
index 4e683dd..4876c21 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2706,6 +2706,9 @@ static int do_brk(unsigned long addr, unsigned long len)
 out:
 	perf_event_mmap(vma);
 	mm->total_vm += len >> PAGE_SHIFT;
+
+	bump_rlimit(RLIMIT_AS, mm->total_vm << PAGE_SHIFT);
+
 	mm->data_vm += len >> PAGE_SHIFT;
 	if (flags & VM_LOCKED)
 		mm->locked_vm += (len >> PAGE_SHIFT);
@@ -2926,6 +2929,7 @@ bool may_expand_vm(struct mm_struct *mm, vm_flags_t flags, unsigned long npages)
 void vm_stat_account(struct mm_struct *mm, vm_flags_t flags, long npages)
 {
 	mm->total_vm += npages;
+	bump_rlimit(RLIMIT_AS, mm->total_vm << PAGE_SHIFT);
 
 	if (is_exec_mapping(flags))
 		mm->exec_vm += npages;
diff --git a/mm/mremap.c b/mm/mremap.c
index ade3e13..6be3c01 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -397,6 +397,9 @@ static struct vm_area_struct *vma_to_resize(unsigned long addr,
 	if (vma->vm_flags & VM_LOCKED)
 		bump_rlimit(RLIMIT_MEMLOCK, (mm->locked_vm << PAGE_SHIFT) +
 			    new_len - old_len);
+	bump_rlimit(RLIMIT_AS, (mm->total_vm << PAGE_SHIFT) +
+		    new_len - old_len);
+
 	return vma;
 }
 
-- 
2.8.1

^ permalink raw reply	[flat|nested] 56+ messages in thread

* [RFC 14/18] limits: track RLIMIT_SIGPENDING actual max
  2016-06-13 19:44 [RFC 00/18] Present useful limits to user Topi Miettinen
                   ` (11 preceding siblings ...)
  2016-06-13 19:44 ` [RFC 13/18] limits: track RLIMIT_AS " Topi Miettinen
@ 2016-06-13 19:44 ` Topi Miettinen
  2016-06-14 14:50   ` Oleg Nesterov
  2016-06-13 19:44 ` [RFC 15/18] limits: track RLIMIT_MSGQUEUE " Topi Miettinen
                   ` (4 subsequent siblings)
  17 siblings, 1 reply; 56+ messages in thread
From: Topi Miettinen @ 2016-06-13 19:44 UTC (permalink / raw)
  To: linux-kernel
  Cc: Topi Miettinen, Andrew Morton, Oleg Nesterov, Ingo Molnar,
	Amanieu d'Antras, Stas Sergeev, Dave Hansen, Wang Xiaoqiang,
	Helge Deller, Sasha Levin

Track maximum number of pending signals, presented in /proc/self/limits.

Signed-off-by: Topi Miettinen <toiwoton@gmail.com>
---
 kernel/signal.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/signal.c b/kernel/signal.c
index 96e9bc4..c8fbccd 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -387,6 +387,8 @@ __sigqueue_alloc(int sig, struct task_struct *t, gfp_t flags, int override_rlimi
 		INIT_LIST_HEAD(&q->list);
 		q->flags = 0;
 		q->user = user;
+		/* XXX resource limits apply per task, not per user */
+		bump_rlimit(RLIMIT_SIGPENDING, atomic_read(&user->sigpending));
 	}
 
 	return q;
-- 
2.8.1

^ permalink raw reply	[flat|nested] 56+ messages in thread

* [RFC 15/18] limits: track RLIMIT_MSGQUEUE actual max
  2016-06-13 19:44 [RFC 00/18] Present useful limits to user Topi Miettinen
                   ` (12 preceding siblings ...)
  2016-06-13 19:44 ` [RFC 14/18] limits: track RLIMIT_SIGPENDING " Topi Miettinen
@ 2016-06-13 19:44 ` Topi Miettinen
  2016-06-17 19:52   ` Doug Ledford
  2016-06-13 19:44 ` [RFC 16/18] limits: track RLIMIT_NICE " Topi Miettinen
                   ` (3 subsequent siblings)
  17 siblings, 1 reply; 56+ messages in thread
From: Topi Miettinen @ 2016-06-13 19:44 UTC (permalink / raw)
  To: linux-kernel
  Cc: Topi Miettinen, Andrew Morton, Michal Hocko, Al Viro,
	Doug Ledford, Vladimir Davydov, Marcus Gelderie,
	Kirill A. Shutemov

Track maximum size of message queues, presented in /proc/self/limits.

Signed-off-by: Topi Miettinen <toiwoton@gmail.com>
---
 ipc/mqueue.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index ade739f..edccf55 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -287,6 +287,8 @@ static struct inode *mqueue_get_inode(struct super_block *sb,
 
 		/* all is ok */
 		info->user = get_uid(u);
+		/* XXX resource limits apply per task, not per user */
+		bump_rlimit(RLIMIT_MSGQUEUE, u->mq_bytes);
 	} else if (S_ISDIR(mode)) {
 		inc_nlink(inode);
 		/* Some things misbehave if size == 0 on a directory */
-- 
2.8.1

^ permalink raw reply	[flat|nested] 56+ messages in thread

* [RFC 16/18] limits: track RLIMIT_NICE actual max
  2016-06-13 19:44 [RFC 00/18] Present useful limits to user Topi Miettinen
                   ` (13 preceding siblings ...)
  2016-06-13 19:44 ` [RFC 15/18] limits: track RLIMIT_MSGQUEUE " Topi Miettinen
@ 2016-06-13 19:44 ` Topi Miettinen
  2016-06-13 19:44 ` [RFC 17/18] limits: track RLIMIT_RTPRIO " Topi Miettinen
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 56+ messages in thread
From: Topi Miettinen @ 2016-06-13 19:44 UTC (permalink / raw)
  To: linux-kernel; +Cc: Topi Miettinen, Ingo Molnar, Peter Zijlstra

Track maximum nice priority, presented in /proc/self/limits.

Signed-off-by: Topi Miettinen <toiwoton@gmail.com>
---
 kernel/sched/core.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 017d539..817d720 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3692,6 +3692,8 @@ void set_user_nice(struct task_struct *p, long nice)
 		if (delta < 0 || (delta > 0 && task_running(rq, p)))
 			resched_curr(rq);
 	}
+	task_bump_rlimit(p, RLIMIT_NICE, nice_to_rlimit(nice));
+
 out_unlock:
 	task_rq_unlock(rq, p, &rf);
 }
-- 
2.8.1

^ permalink raw reply	[flat|nested] 56+ messages in thread

* [RFC 17/18] limits: track RLIMIT_RTPRIO actual max
  2016-06-13 19:44 [RFC 00/18] Present useful limits to user Topi Miettinen
                   ` (14 preceding siblings ...)
  2016-06-13 19:44 ` [RFC 16/18] limits: track RLIMIT_NICE " Topi Miettinen
@ 2016-06-13 19:44 ` Topi Miettinen
  2016-06-13 19:44 ` [RFC 18/18] proc: present VM_LOCKED memory in /proc/self/maps Topi Miettinen
  2016-06-14 19:03 ` [RFC 00/18] Present useful limits to user Konstantin Khlebnikov
  17 siblings, 0 replies; 56+ messages in thread
From: Topi Miettinen @ 2016-06-13 19:44 UTC (permalink / raw)
  To: linux-kernel; +Cc: Topi Miettinen, Ingo Molnar, Peter Zijlstra

Track maximum RT priority, presented in /proc/self/limits.

Signed-off-by: Topi Miettinen <toiwoton@gmail.com>
---
 kernel/sched/core.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 817d720..d31a06a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4219,6 +4219,8 @@ change:
 	balance_callback(rq);
 	preempt_enable();
 
+	task_bump_rlimit(p, RLIMIT_RTPRIO, attr->sched_priority);
+
 	return 0;
 }
 
-- 
2.8.1

^ permalink raw reply	[flat|nested] 56+ messages in thread

* [RFC 18/18] proc: present VM_LOCKED memory in /proc/self/maps
  2016-06-13 19:44 [RFC 00/18] Present useful limits to user Topi Miettinen
                   ` (15 preceding siblings ...)
  2016-06-13 19:44 ` [RFC 17/18] limits: track RLIMIT_RTPRIO " Topi Miettinen
@ 2016-06-13 19:44 ` Topi Miettinen
  2016-06-13 20:43   ` Kees Cook
  2016-06-14 19:03 ` [RFC 00/18] Present useful limits to user Konstantin Khlebnikov
  17 siblings, 1 reply; 56+ messages in thread
From: Topi Miettinen @ 2016-06-13 19:44 UTC (permalink / raw)
  To: linux-kernel
  Cc: Topi Miettinen, Andrew Morton, Michal Hocko,
	Konstantin Khlebnikov, Vlastimil Babka, Kirill A. Shutemov,
	Jerome Marchand, Laurent Dufour, Naoya Horiguchi,
	Gerald Schaefer, Johannes Weiner

Add a flag to /proc/self/maps to show that the memory area is locked.

Signed-off-by: Topi Miettinen <toiwoton@gmail.com>
---
 fs/proc/task_mmu.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 4648c7f..8229509 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -313,13 +313,14 @@ show_map_vma(struct seq_file *m, struct vm_area_struct *vma, int is_pid)
 		end -= PAGE_SIZE;
 
 	seq_setwidth(m, 25 + sizeof(void *) * 6 - 1);
-	seq_printf(m, "%08lx-%08lx %c%c%c%c %08llx %02x:%02x %lu ",
+	seq_printf(m, "%08lx-%08lx %c%c%c%c%c %08llx %02x:%02x %lu ",
 			start,
 			end,
 			flags & VM_READ ? 'r' : '-',
 			flags & VM_WRITE ? 'w' : '-',
 			flags & VM_EXEC ? 'x' : '-',
 			flags & VM_MAYSHARE ? 's' : 'p',
+			flags & VM_LOCKED ? 'l' : '-',
 			pgoff,
 			MAJOR(dev), MINOR(dev), ino);
 
-- 
2.8.1

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC 01/18] capabilities: track actually used capabilities
  2016-06-13 19:44 ` [RFC 01/18] capabilities: track actually used capabilities Topi Miettinen
@ 2016-06-13 20:32   ` Andy Lutomirski
  2016-06-13 20:45     ` Topi Miettinen
  0 siblings, 1 reply; 56+ messages in thread
From: Andy Lutomirski @ 2016-06-13 20:32 UTC (permalink / raw)
  To: Topi Miettinen
  Cc: linux-kernel, Alexander Viro, Ingo Molnar, Peter Zijlstra,
	Serge Hallyn, Andrew Morton, Kees Cook, Christoph Lameter,
	Serge E. Hallyn, Andy Shevchenko, Richard W.M. Jones,
	Iago López Galeiras, Chris Metcalf, Andy Lutomirski,
	Jann Horn, open list:FILESYSTEMS (VFS and infrastructure),
	open list:CAPABILITIES

On Mon, Jun 13, 2016 at 12:44 PM, Topi Miettinen <toiwoton@gmail.com> wrote:
> Track what capabilities are actually used and present the current
> situation in /proc/self/status.

What for?

What is the intended behavior on fork()?  Whatever the intended
behavior is, there should IMO be a selftest for it.

--Andy

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC 05/18] limits: track and present RLIMIT_NOFILE actual max
  2016-06-13 19:44 ` [RFC 05/18] limits: track and present RLIMIT_NOFILE actual max Topi Miettinen
@ 2016-06-13 20:40   ` Andy Lutomirski
  2016-06-13 21:13     ` Topi Miettinen
  0 siblings, 1 reply; 56+ messages in thread
From: Andy Lutomirski @ 2016-06-13 20:40 UTC (permalink / raw)
  To: Topi Miettinen, linux-kernel
  Cc: Alexander Viro, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Kees Cook, Cyrill Gorcunov, Alexey Dobriyan, John Stultz,
	Janis Danisevskis, Calvin Owens, Jann Horn,
	open list:FILESYSTEMS (VFS and infrastructure)

On 06/13/2016 12:44 PM, Topi Miettinen wrote:
> Track maximum number of files for the process, present current maximum
> in /proc/self/limits.

The core part should be its own patch.

Also, you have this weirdly named (and racy!) function bump_rlimit. 
Wouldn't this be nicer if you taught the rlimit code to track the 
*current* usage generically and to derive the max usage from that?

> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index a11eb71..227997b 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -630,8 +630,8 @@ static int proc_pid_limits(struct seq_file *m, struct pid_namespace *ns,
>  	/*
>  	 * print the file header
>  	 */
> -       seq_printf(m, "%-25s %-20s %-20s %-10s\n",
> -		  "Limit", "Soft Limit", "Hard Limit", "Units");
> +	seq_printf(m, "%-25s %-20s %-20s %-10s %-20s\n",
> +		   "Limit", "Soft Limit", "Hard Limit", "Units", "Max");

What existing programs, if any, does this break?

>
>  	for (i = 0; i < RLIM_NLIMITS; i++) {
>  		if (rlim[i].rlim_cur == RLIM_INFINITY)
> @@ -647,9 +647,11 @@ static int proc_pid_limits(struct seq_file *m, struct pid_namespace *ns,
>  			seq_printf(m, "%-20lu ", rlim[i].rlim_max);
>
>  		if (lnames[i].unit)
> -			seq_printf(m, "%-10s\n", lnames[i].unit);
> +			seq_printf(m, "%-10s", lnames[i].unit);
>  		else
> -			seq_putc(m, '\n');
> +			seq_printf(m, "%-10s", "");
> +		seq_printf(m, "%-20lu\n",
> +			   task->signal->rlim_curmax[i]);
>  	}
>
>  	return 0;
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 9c48a08..0150380 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -782,6 +782,7 @@ struct signal_struct {
>  	 * have no need to disable irqs.
>  	 */
>  	struct rlimit rlim[RLIM_NLIMITS];
> +	unsigned long rlim_curmax[RLIM_NLIMITS];
>
>  #ifdef CONFIG_BSD_PROCESS_ACCT
>  	struct pacct_struct pacct;	/* per-process accounting information */
> @@ -3376,6 +3377,12 @@ static inline unsigned long rlimit_max(unsigned int limit)
>  	return task_rlimit_max(current, limit);
>  }
>
> +static inline void bump_rlimit(unsigned int limit, unsigned long r)
> +{
> +	if (READ_ONCE(current->signal->rlim_curmax[limit]) < r)
> +		current->signal->rlim_curmax[limit] = r;
> +}
> +
>  #ifdef CONFIG_CPU_FREQ
>  struct update_util_data {
>  	void (*func)(struct update_util_data *data,
>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC 18/18] proc: present VM_LOCKED memory in /proc/self/maps
  2016-06-13 19:44 ` [RFC 18/18] proc: present VM_LOCKED memory in /proc/self/maps Topi Miettinen
@ 2016-06-13 20:43   ` Kees Cook
  2016-06-13 20:52     ` Topi Miettinen
  0 siblings, 1 reply; 56+ messages in thread
From: Kees Cook @ 2016-06-13 20:43 UTC (permalink / raw)
  To: Topi Miettinen
  Cc: linux-kernel, Andrew Morton, Michal Hocko, Konstantin Khlebnikov,
	Vlastimil Babka, Kirill A. Shutemov, Jerome Marchand,
	Laurent Dufour, Naoya Horiguchi, Gerald Schaefer,
	Johannes Weiner

On Mon, Jun 13, 2016 at 10:44:25PM +0300, Topi Miettinen wrote:
> Add a flag to /proc/self/maps to show that the memory area is locked.
> 
> Signed-off-by: Topi Miettinen <toiwoton@gmail.com>
> ---
>  fs/proc/task_mmu.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 4648c7f..8229509 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c

If you change the maps format, you'll need to update task_nommu.c too.

> @@ -313,13 +313,14 @@ show_map_vma(struct seq_file *m, struct vm_area_struct *vma, int is_pid)
>  		end -= PAGE_SIZE;
>  
>  	seq_setwidth(m, 25 + sizeof(void *) * 6 - 1);

I think the width needs to be adjusted for the new character.

> -	seq_printf(m, "%08lx-%08lx %c%c%c%c %08llx %02x:%02x %lu ",
> +	seq_printf(m, "%08lx-%08lx %c%c%c%c%c %08llx %02x:%02x %lu ",

Have you checked that no userspace tools that parse "maps" will break with
this flag addition?

>  			start,
>  			end,
>  			flags & VM_READ ? 'r' : '-',
>  			flags & VM_WRITE ? 'w' : '-',
>  			flags & VM_EXEC ? 'x' : '-',
>  			flags & VM_MAYSHARE ? 's' : 'p',
> +			flags & VM_LOCKED ? 'l' : '-',

IIUC, the smaps file already includes the locked information in VmFlags as
"lo" (see show_smap_vma_flags), so I think you probably don't want this
patch at all.

-Kees

>  			pgoff,
>  			MAJOR(dev), MINOR(dev), ino);
>  
> -- 
> 2.8.1

-- 
Kees Cook                                            @outflux.net

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC 01/18] capabilities: track actually used capabilities
  2016-06-13 20:32   ` Andy Lutomirski
@ 2016-06-13 20:45     ` Topi Miettinen
  2016-06-13 21:12       ` Andy Lutomirski
  0 siblings, 1 reply; 56+ messages in thread
From: Topi Miettinen @ 2016-06-13 20:45 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, Alexander Viro, Ingo Molnar, Peter Zijlstra,
	Serge Hallyn, Andrew Morton, Kees Cook, Christoph Lameter,
	Serge E. Hallyn, Andy Shevchenko, Richard W.M. Jones,
	Iago López Galeiras, Chris Metcalf, Andy Lutomirski,
	Jann Horn, open list:FILESYSTEMS (VFS and infrastructure),
	open list:CAPABILITIES

On 06/13/16 20:32, Andy Lutomirski wrote:
> On Mon, Jun 13, 2016 at 12:44 PM, Topi Miettinen <toiwoton@gmail.com> wrote:
>> Track what capabilities are actually used and present the current
>> situation in /proc/self/status.
> 
> What for?

Excerpt from the cover letter:

"There are many basic ways to control processes, including capabilities,
cgroups and resource limits. However, there are far fewer ways to find out
useful values for the limits, except blind trial and error.

This patch series attempts to fix that by giving at least a nice starting
point from the actual maximum values. I looked where each limit is checked
and added a call to limit bump nearby.


Capabilities
[RFC 01/18] capabilities: track actually used capabilities

Currently, there is no way to know which capabilities are actually used. Even
the source code is only implicit, in-depth knowledge of each capability must
be used when analyzing a program to judge which capabilities the program will
exercise."

Should I perhaps cite some of this in the commit?

>
> What is the intended behavior on fork()?  Whatever the intended
> behavior is, there should IMO be a selftest for it.
>
> --Andy
>

The capabilities could be tracked from one of three points in the daemon
initialization sequence onwards:
fork()
setpcap()
exec()

fork() case would be logical as the /proc entry is per task. But if you
consider the tools to set the capabilities (for example systemd unit
files), there can be between fork() and exec() further preparations
which need more capabilities than the program itself needs.

setpcap() is probably the real point after which we are interested if
the capabilities are enough.

The amount of setup between setpcap() and exec() is probably very low.

-Topi

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC 18/18] proc: present VM_LOCKED memory in /proc/self/maps
  2016-06-13 20:43   ` Kees Cook
@ 2016-06-13 20:52     ` Topi Miettinen
  0 siblings, 0 replies; 56+ messages in thread
From: Topi Miettinen @ 2016-06-13 20:52 UTC (permalink / raw)
  To: Kees Cook
  Cc: linux-kernel, Andrew Morton, Michal Hocko, Konstantin Khlebnikov,
	Vlastimil Babka, Kirill A. Shutemov, Jerome Marchand,
	Laurent Dufour, Naoya Horiguchi, Gerald Schaefer,
	Johannes Weiner

On 06/13/16 20:43, Kees Cook wrote:
> On Mon, Jun 13, 2016 at 10:44:25PM +0300, Topi Miettinen wrote:
>> Add a flag to /proc/self/maps to show that the memory area is locked.
>>
>> Signed-off-by: Topi Miettinen <toiwoton@gmail.com>
>> ---
>>  fs/proc/task_mmu.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>> index 4648c7f..8229509 100644
>> --- a/fs/proc/task_mmu.c
>> +++ b/fs/proc/task_mmu.c
> 
> If you change the maps format, you'll need to update task_nommu.c too.
> 
>> @@ -313,13 +313,14 @@ show_map_vma(struct seq_file *m, struct vm_area_struct *vma, int is_pid)
>>  		end -= PAGE_SIZE;
>>  
>>  	seq_setwidth(m, 25 + sizeof(void *) * 6 - 1);
> 
> I think the width needs to be adjusted for the new character.
> 
>> -	seq_printf(m, "%08lx-%08lx %c%c%c%c %08llx %02x:%02x %lu ",
>> +	seq_printf(m, "%08lx-%08lx %c%c%c%c%c %08llx %02x:%02x %lu ",
> 
> Have you checked that no userspace tools that parse "maps" will break with
> this flag addition?
> 
>>  			start,
>>  			end,
>>  			flags & VM_READ ? 'r' : '-',
>>  			flags & VM_WRITE ? 'w' : '-',
>>  			flags & VM_EXEC ? 'x' : '-',
>>  			flags & VM_MAYSHARE ? 's' : 'p',
>> +			flags & VM_LOCKED ? 'l' : '-',
> 
> IIUC, the smaps file already includes the locked information in VmFlags as
> "lo" (see show_smap_vma_flags), so I think you probably don't want this
> patch at all.

Yes, the amount of locked memory is also shown:
Locked:                8 kB
VmFlags: rd wr mr mw me lo ac sd

Sorry, I didn't notice that. I'll drop the patch.
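
For reference, a small userspace sketch of how a tool could check a
VmFlags line from smaps for the "lo" flag. vmflags_has() is a hypothetical
helper written only for this illustration, not an existing API:

```c
#include <stdbool.h>
#include <string.h>

/*
 * Return true if 'line' is a "VmFlags:" line from /proc/<pid>/smaps and
 * lists 'flag' (e.g. "lo") as one of its whitespace-separated tokens.
 * Hypothetical helper for illustration only.
 */
static bool vmflags_has(const char *line, const char *flag)
{
	static const char prefix[] = "VmFlags:";
	size_t n = strlen(flag);
	const char *p;

	if (strncmp(line, prefix, sizeof(prefix) - 1) != 0)
		return false;

	p = line + sizeof(prefix) - 1;
	while (*p) {
		size_t len;

		p += strspn(p, " \t\n");	/* skip separators */
		len = strcspn(p, " \t\n");	/* length of this token */
		if (len > 0 && len == n && strncmp(p, flag, n) == 0)
			return true;
		p += len;
	}
	return false;
}
```

Feeding it each line read from /proc/self/smaps and asking for "lo" would
identify the locked areas without any change to the maps format.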

-Topi

> 
> -Kees
> 
>>  			pgoff,
>>  			MAJOR(dev), MINOR(dev), ino);
>>  
>> -- 
>> 2.8.1
> 

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC 02/18] cgroup_pids: track maximum pids
  2016-06-13 19:44 ` [RFC 02/18] cgroup_pids: track maximum pids Topi Miettinen
@ 2016-06-13 21:12   ` Tejun Heo
  2016-06-13 21:29     ` Topi Miettinen
  0 siblings, 1 reply; 56+ messages in thread
From: Tejun Heo @ 2016-06-13 21:12 UTC (permalink / raw)
  To: Topi Miettinen
  Cc: linux-kernel, Li Zefan, Johannes Weiner,
	open list:CONTROL GROUP (CGROUP)

Hello,

On Mon, Jun 13, 2016 at 10:44:09PM +0300, Topi Miettinen wrote:
> Track maximum pids in the cgroup, present it in cgroup pids.current_max.

"max" is often used for maximum limits in cgroup.  I think "watermark"
or "high_watermark" would be a lot clearer.

> @@ -236,6 +246,14 @@ static void pids_free(struct task_struct *task)
>  	pids_uncharge(pids, 1);
>  }
>  
> +static void pids_fork(struct task_struct *task)
> +{
> +	struct pids_cgroup *pids = css_pids(task_css(task, pids_cgrp_id));
> +
> +	if (atomic64_read(&pids->cur_max) < atomic64_read(&pids->counter))
> +		atomic64_set(&pids->cur_max, atomic64_read(&pids->counter));
> +}

Wouldn't it make more sense to track the high watermark from the charge
functions instead?  I don't get why this requires a separate fork
callback.  Also, the unsynchronized atomic64_read()/atomic64_set() pairs
are racy: with concurrent updates, cur_max can end up lower than it
should be.
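
As a rough userspace sketch of that idea (C11 atomics standing in for the
kernel's atomic64_t; struct watermark and all function names below are
made up for illustration and are not the pids controller code), the
watermark can be raised from the charge path with a compare-and-swap loop
so that a smaller value never overwrites a larger one:

```c
#include <stdatomic.h>

/* Hypothetical stand-in for a per-cgroup counter plus high watermark. */
struct watermark {
	_Atomic long counter;	/* current usage */
	_Atomic long high;	/* highest usage seen so far */
};

/* Raise w->high to val unless a concurrent updater already stored more. */
static void watermark_update(struct watermark *w, long val)
{
	long old = atomic_load(&w->high);

	/* A failed CAS reloads 'old'; retry only while val is still larger. */
	while (old < val &&
	       !atomic_compare_exchange_weak(&w->high, &old, val))
		;
}

/* Charge n units and record the resulting level in the watermark. */
static long watermark_charge(struct watermark *w, long n)
{
	long now = atomic_fetch_add(&w->counter, n) + n;

	watermark_update(w, now);
	return now;
}

/* Uncharging never touches the watermark. */
static void watermark_uncharge(struct watermark *w, long n)
{
	atomic_fetch_sub(&w->counter, n);
}
```

Because the update sits in the charge function, no separate fork callback
is needed in this sketch.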

> @@ -300,6 +326,11 @@ static struct cftype pids_files[] = {
>  		.read_s64 = pids_current_read,
>  		.flags = CFTYPE_NOT_ON_ROOT,
>  	},
> +	{
> +		.name = "current_max",

Please make this "high_watermark" field in pids.stats file.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC 01/18] capabilities: track actually used capabilities
  2016-06-13 20:45     ` Topi Miettinen
@ 2016-06-13 21:12       ` Andy Lutomirski
  2016-06-13 21:48         ` Topi Miettinen
  0 siblings, 1 reply; 56+ messages in thread
From: Andy Lutomirski @ 2016-06-13 21:12 UTC (permalink / raw)
  To: Topi Miettinen
  Cc: linux-kernel, Alexander Viro, Ingo Molnar, Peter Zijlstra,
	Serge Hallyn, Andrew Morton, Kees Cook, Christoph Lameter,
	Serge E. Hallyn, Andy Shevchenko, Richard W.M. Jones,
	Iago López Galeiras, Chris Metcalf, Andy Lutomirski,
	Jann Horn, open list:FILESYSTEMS (VFS and infrastructure),
	open list:CAPABILITIES

On Mon, Jun 13, 2016 at 1:45 PM, Topi Miettinen <toiwoton@gmail.com> wrote:
> On 06/13/16 20:32, Andy Lutomirski wrote:
>> On Mon, Jun 13, 2016 at 12:44 PM, Topi Miettinen <toiwoton@gmail.com> wrote:
>>> Track what capabilities are actually used and present the current
>>> situation in /proc/self/status.
>>
>> What for?
>

>
> Capabilities
> [RFC 01/18] capabilities: track actually used capabilities
>
> Currently, there is no way to know which capabilities are actually used. Even
> the source code is only implicit, in-depth knowledge of each capability must
> be used when analyzing a program to judge which capabilities the program will
> exercise."
>
> Should I perhaps cite some of this in the commit?

Yes, but you should also clarify what users are supposed to do with
this.  Given ambient capabilities, I suspect that you'll find that
your patch doesn't actually work very well.  For example, if you run a
shell script with ambient caps, then you won't notice caps used by
short-lived helper processes.

>
>>
>> What is the intended behavior on fork()?  Whatever the intended
>> behavior is, there should IMO be a selftest for it.
>>
>> --Andy
>>
>
> The capabilities could be tracked from three points of daemon
> initialization sequence onwards:
> fork()
> setpcap()
> exec()
>
> fork() case would be logical as the /proc entry is per task. But if you
> consider the tools to set the capabilities (for example systemd unit
> files), there can be between fork() and exec() further preparations
> which need more capabilities than the program itself needs.
>
> setpcap() is probably the real point after which we are interested if
> the capabilities are enough.
>
> The amount of setup between setpcap() and exec() is probably very low.

When I asked "what is the intended behavior on fork()?", I mean "what
should CapUsed be after fork()?".  The answer should be about four
words long and should have a test case.  There should maybe also be an
explanation of why the intended behavior is useful.

But, as I said above, I think that you may need to rethink this
entirely to make it useful.  You might need to do it per process tree
or per cgroup or something.

--Andy

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC 05/18] limits: track and present RLIMIT_NOFILE actual max
  2016-06-13 20:40   ` Andy Lutomirski
@ 2016-06-13 21:13     ` Topi Miettinen
  2016-06-13 21:16       ` Andy Lutomirski
  0 siblings, 1 reply; 56+ messages in thread
From: Topi Miettinen @ 2016-06-13 21:13 UTC (permalink / raw)
  To: Andy Lutomirski, linux-kernel
  Cc: Alexander Viro, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Kees Cook, Cyrill Gorcunov, Alexey Dobriyan, John Stultz,
	Janis Danisevskis, Calvin Owens, Jann Horn,
	open list:FILESYSTEMS (VFS and infrastructure)

On 06/13/16 20:40, Andy Lutomirski wrote:
> On 06/13/2016 12:44 PM, Topi Miettinen wrote:
>> Track maximum number of files for the process, present current maximum
>> in /proc/self/limits.
> 
> The core part should be its own patch.
> 
> Also, you have this weirdly named (and racy!) function bump_rlimit.

I can change the name if you have better suggestions. rlimit_track_max?

The max value is written often but read seldom, if ever. What kind of
locking should I use then?

> Wouldn't this be nicer if you taught the rlimit code to track the
> *current* usage generically and to derive the max usage from that?

Current rlimit code performs checks against current limits. These are
typically done early in the calling function and further checks could
also fail. Thus max should not be updated until much later. Maybe these
could be combined, but not easily if at all.

> 
>> diff --git a/fs/proc/base.c b/fs/proc/base.c
>> index a11eb71..227997b 100644
>> --- a/fs/proc/base.c
>> +++ b/fs/proc/base.c
>> @@ -630,8 +630,8 @@ static int proc_pid_limits(struct seq_file *m,
>> struct pid_namespace *ns,
>>      /*
>>       * print the file header
>>       */
>> -       seq_printf(m, "%-25s %-20s %-20s %-10s\n",
>> -          "Limit", "Soft Limit", "Hard Limit", "Units");
>> +    seq_printf(m, "%-25s %-20s %-20s %-10s %-20s\n",
>> +           "Limit", "Soft Limit", "Hard Limit", "Units", "Max");
> 
> What existing programs, if any, does this break?

Using Debian codesearch for /limits" string, I'd check pam_limits and
rtkit. The max values could be put into a new file if you prefer.
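
To illustrate the compatibility question, here is a userspace sketch of a
parser that reads only the name, soft and hard fields by position, based
on the "%-25s %-20s %-20s" layout in the patch. Such a parser would not
notice a trailing "Max" column. struct limit_row and parse_limit_row() are
hypothetical and do not describe how pam_limits or rtkit actually parse
the file:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical holder for the first three columns of one limits row. */
struct limit_row {
	char name[26];	/* "%-25s" column, trailing spaces stripped */
	char soft[21];	/* soft limit as text, e.g. "1024" or "unlimited" */
	char hard[21];	/* hard limit as text */
};

/* Parse one row; returns 0 on success, -1 if the row is malformed. */
static int parse_limit_row(const char *line, struct limit_row *row)
{
	size_t i;

	if (strlen(line) < 26)
		return -1;

	/* The name may contain spaces, so take it by width, not by token. */
	memcpy(row->name, line, 25);
	row->name[25] = '\0';
	for (i = 25; i > 0 && row->name[i - 1] == ' '; i--)
		row->name[i - 1] = '\0';

	/* Anything after the hard limit (units, extra columns) is ignored. */
	if (sscanf(line + 26, "%20s %20s", row->soft, row->hard) != 2)
		return -1;
	return 0;
}
```

A parser that instead splits each row on whitespace and counts fields from
the end of the line would be affected by the extra column, which is the
case worth checking in the tools found by the search.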

> 
>>
>>      for (i = 0; i < RLIM_NLIMITS; i++) {
>>          if (rlim[i].rlim_cur == RLIM_INFINITY)
>> @@ -647,9 +647,11 @@ static int proc_pid_limits(struct seq_file *m,
>> struct pid_namespace *ns,
>>              seq_printf(m, "%-20lu ", rlim[i].rlim_max);
>>
>>          if (lnames[i].unit)
>> -            seq_printf(m, "%-10s\n", lnames[i].unit);
>> +            seq_printf(m, "%-10s", lnames[i].unit);
>>          else
>> -            seq_putc(m, '\n');
>> +            seq_printf(m, "%-10s", "");
>> +        seq_printf(m, "%-20lu\n",
>> +               task->signal->rlim_curmax[i]);
>>      }
>>
>>      return 0;
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index 9c48a08..0150380 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -782,6 +782,7 @@ struct signal_struct {
>>       * have no need to disable irqs.
>>       */
>>      struct rlimit rlim[RLIM_NLIMITS];
>> +    unsigned long rlim_curmax[RLIM_NLIMITS];
>>
>>  #ifdef CONFIG_BSD_PROCESS_ACCT
>>      struct pacct_struct pacct;    /* per-process accounting information */
>> @@ -3376,6 +3377,12 @@ static inline unsigned long rlimit_max(unsigned int limit)
>>      return task_rlimit_max(current, limit);
>>  }
>>
>> +static inline void bump_rlimit(unsigned int limit, unsigned long r)
>> +{
>> +    if (READ_ONCE(current->signal->rlim_curmax[limit]) < r)
>> +        current->signal->rlim_curmax[limit] = r;
>> +}
>> +
>>  #ifdef CONFIG_CPU_FREQ
>>  struct update_util_data {
>>      void (*func)(struct update_util_data *data,
>>
> 

* Re: [RFC 05/18] limits: track and present RLIMIT_NOFILE actual max
  2016-06-13 21:13     ` Topi Miettinen
@ 2016-06-13 21:16       ` Andy Lutomirski
  2016-06-14 15:21         ` Topi Miettinen
  0 siblings, 1 reply; 56+ messages in thread
From: Andy Lutomirski @ 2016-06-13 21:16 UTC (permalink / raw)
  To: Topi Miettinen
  Cc: Andy Lutomirski, linux-kernel, Alexander Viro, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Kees Cook, Cyrill Gorcunov,
	Alexey Dobriyan, John Stultz, Janis Danisevskis, Calvin Owens,
	Jann Horn, open list:FILESYSTEMS (VFS and infrastructure)

On Mon, Jun 13, 2016 at 2:13 PM, Topi Miettinen <toiwoton@gmail.com> wrote:
> On 06/13/16 20:40, Andy Lutomirski wrote:
>> On 06/13/2016 12:44 PM, Topi Miettinen wrote:
>>> Track maximum number of files for the process, present current maximum
>>> in /proc/self/limits.
>>
>> The core part should be its own patch.
>>
>> Also, you have this weirdly named (and racy!) function bump_rlimit.
>
> I can change the name if you have better suggestions. rlimit_track_max?
>
> The max value is written often but read seldom, if ever. What kind of
> locking should I use then?

Possibly none, but WRITE_ONCE would be good as would a comment
indicating that your code is intentionally racy.  Or you could use
atomic_cmpxchg if that won't kill performance.

rlimit_track_max sounds like a better name to me.
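
The "no locking, but annotate it" option can be sketched in userspace C11.
This is only an illustration under stated assumptions, not the patch's code:
NLIMITS and the rlim_curmax array are hypothetical stand-ins for
RLIM_NLIMITS and signal_struct.rlim_curmax[], and relaxed atomic loads and
stores play roughly the role that READ_ONCE/WRITE_ONCE play in the kernel.

```c
#include <stdatomic.h>

#define NLIMITS 16	/* hypothetical stand-in for RLIM_NLIMITS */

/* Hypothetical userspace stand-in for signal_struct.rlim_curmax[]. */
static _Atomic unsigned long rlim_curmax[NLIMITS];

/*
 * Intentionally racy: two threads may both pass the check, and the
 * smaller store may land last, so the recorded maximum can end up
 * slightly too low. The relaxed accesses only prevent torn or repeated
 * loads and stores; they give no ordering and no atomic check-and-set.
 */
static void rlimit_track_max(unsigned int limit, unsigned long r)
{
	if (atomic_load_explicit(&rlim_curmax[limit], memory_order_relaxed) < r)
		atomic_store_explicit(&rlim_curmax[limit], r,
				      memory_order_relaxed);
}
```

The comment is the important part of this variant: it tells the reader that
the lost-update window is accepted because the value is only advisory.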

>
>> Wouldn't this be nicer if you taught the rlimit code to track the
>> *current* usage generically and to derive the max usage from that?
>
> Current rlimit code performs checks against current limits. These are
> typically done early in the calling function and further checks could
> also fail. Thus max should not be updated until much later. Maybe these
> could be combined, but not easily if at all.

I mean:  why not actually show the current value in /proc/pid/limits
and track the max via whatever teaches proc about the current value?

>
>>
>>> diff --git a/fs/proc/base.c b/fs/proc/base.c
>>> index a11eb71..227997b 100644
>>> --- a/fs/proc/base.c
>>> +++ b/fs/proc/base.c
>>> @@ -630,8 +630,8 @@ static int proc_pid_limits(struct seq_file *m, struct pid_namespace *ns,
>>>      /*
>>>       * print the file header
>>>       */
>>> -       seq_printf(m, "%-25s %-20s %-20s %-10s\n",
>>> -          "Limit", "Soft Limit", "Hard Limit", "Units");
>>> +    seq_printf(m, "%-25s %-20s %-20s %-10s %-20s\n",
>>> +           "Limit", "Soft Limit", "Hard Limit", "Units", "Max");
>>
>> What existing programs, if any, does this break?
>
> Using Debian codesearch for /limits" string, I'd check pam_limits and
> rtkit. The max values could be put into a new file if you prefer.

If it actually breaks them, then you need to change the patch so you
don't break them.

* Re: [RFC 02/18] cgroup_pids: track maximum pids
  2016-06-13 21:12   ` Tejun Heo
@ 2016-06-13 21:29     ` Topi Miettinen
  2016-06-13 21:33       ` Tejun Heo
  0 siblings, 1 reply; 56+ messages in thread
From: Topi Miettinen @ 2016-06-13 21:29 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kernel, Li Zefan, Johannes Weiner,
	open list:CONTROL GROUP (CGROUP)

On 06/13/16 21:12, Tejun Heo wrote:
> Hello,
> 
> On Mon, Jun 13, 2016 at 10:44:09PM +0300, Topi Miettinen wrote:
>> Track maximum pids in the cgroup, present it in cgroup pids.current_max.
> 
> "max" is often used for maximum limits in cgroup.  I think "watermark"
> or "high_watermark" would be a lot clearer.

OK, I have no preference.

> 
>> @@ -236,6 +246,14 @@ static void pids_free(struct task_struct *task)
>>  	pids_uncharge(pids, 1);
>>  }
>>  
>> +static void pids_fork(struct task_struct *task)
>> +{
>> +	struct pids_cgroup *pids = css_pids(task_css(task, pids_cgrp_id));
>> +
>> +	if (atomic64_read(&pids->cur_max) < atomic64_read(&pids->counter))
>> +		atomic64_set(&pids->cur_max, atomic64_read(&pids->counter));
>> +}
> 
> Wouldn't it make more sense to track high watermark from the charge
> functions instead?  I don't get why this requires a separate fork
> callback.  Also, racing atomic64_set's are racy.  The counter can end
> up with a lower number than it should be.
> 

I used fork callback as I don't want to lower the watermark in all cases
where the charge can be lowered, so I'd update the watermark only when
the fork really happens.

Is there a better way to compare and set? I don't think atomic_cmpxchg()
does what's needed.

>> @@ -300,6 +326,11 @@ static struct cftype pids_files[] = {
>>  		.read_s64 = pids_current_read,
>>  		.flags = CFTYPE_NOT_ON_ROOT,
>>  	},
>> +	{
>> +		.name = "current_max",
> 
> Please make this "high_watermark" field in pids.stats file.
> 
> Thanks.
> 

OK.

-Topi

* Re: [RFC 02/18] cgroup_pids: track maximum pids
  2016-06-13 21:29     ` Topi Miettinen
@ 2016-06-13 21:33       ` Tejun Heo
  2016-06-13 21:59         ` Topi Miettinen
  2016-07-17 20:11         ` Topi Miettinen
  0 siblings, 2 replies; 56+ messages in thread
From: Tejun Heo @ 2016-06-13 21:33 UTC (permalink / raw)
  To: Topi Miettinen
  Cc: linux-kernel, Li Zefan, Johannes Weiner,
	open list:CONTROL GROUP (CGROUP)

Hello,

On Mon, Jun 13, 2016 at 09:29:32PM +0000, Topi Miettinen wrote:
> I used fork callback as I don't want to lower the watermark in all cases
> where the charge can be lowered, so I'd update the watermark only when
> the fork really happens.

I don't think that would make a noticeable difference.  That's where
we decide whether to grant fork or not after all and thus where the
actual usage is.

> Is there a better way to compare and set? I don't think atomic_cmpxchg()
> does what's needed.

cmpxchg loop should do what's necessary although I'm not sure how much
being strictly correct matters here.
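
For reference, a compare-and-exchange loop that only ever raises a stored
maximum might look like the following userspace C11 sketch. It assumes
<stdatomic.h> in place of the kernel's atomic64 API, and the function name
track_high_watermark is hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Raise *max to val unless it is already at least val. */
static void track_high_watermark(_Atomic int64_t *max, int64_t val)
{
	int64_t old = atomic_load(max);

	/*
	 * On failure, atomic_compare_exchange_weak() reloads 'old' with the
	 * value currently stored, so the loop re-checks against whatever a
	 * concurrent updater wrote and stops as soon as the stored maximum
	 * is already high enough.
	 */
	while (old < val && !atomic_compare_exchange_weak(max, &old, val))
		;
}
```

Unlike a separate read followed by a set, a store here can only succeed
against the exact value that was compared, so a smaller value cannot
overwrite a larger one.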

Thanks.

-- 
tejun

* Re: [RFC 01/18] capabilities: track actually used capabilities
  2016-06-13 21:12       ` Andy Lutomirski
@ 2016-06-13 21:48         ` Topi Miettinen
  0 siblings, 0 replies; 56+ messages in thread
From: Topi Miettinen @ 2016-06-13 21:48 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, Alexander Viro, Ingo Molnar, Peter Zijlstra,
	Serge Hallyn, Andrew Morton, Kees Cook, Christoph Lameter,
	Serge E. Hallyn, Andy Shevchenko, Richard W.M. Jones,
	Iago López Galeiras, Chris Metcalf, Andy Lutomirski,
	Jann Horn, open list:FILESYSTEMS (VFS and infrastructure),
	open list:CAPABILITIES

On 06/13/16 21:12, Andy Lutomirski wrote:
> On Mon, Jun 13, 2016 at 1:45 PM, Topi Miettinen <toiwoton@gmail.com> wrote:
>> On 06/13/16 20:32, Andy Lutomirski wrote:
>>> On Mon, Jun 13, 2016 at 12:44 PM, Topi Miettinen <toiwoton@gmail.com> wrote:
>>>> Track what capabilities are actually used and present the current
>>>> situation in /proc/self/status.
>>>
>>> What for?
>>
> 
>>
>> Capabilities
>> [RFC 01/18] capabilities: track actually used capabilities
>>
>> Currently, there is no way to know which capabilities are actually used.
>> Even the source code is only implicit, in-depth knowledge of each
>> capability must be used when analyzing a program to judge which
>> capabilities the program will exercise."
>>
>> Should I perhaps cite some of this in the commit?
> 
> Yes, but you should also clarify what users are supposed to do with
> this.  Given ambient capabilities, I suspect that you'll find that
> your patch doesn't actually work very well.  For example, if you run a
> shell script with ambient caps, then you won't notice caps used by
> short-lived helper processes.
> 

Right, I suppose this model works well only within a single process, or
where the helper processes are always unprivileged (like Xorg runs
xkbcomp) or less privileged.

>>
>>>
>>> What is the intended behavior on fork()?  Whatever the intended
>>> behavior is, there should IMO be a selftest for it.
>>>
>>> --Andy
>>>
>>
>> The capabilities could be tracked from three points of daemon
>> initialization sequence onwards:
>> fork()
>> setpcap()
>> exec()
>>
>> fork() case would be logical as the /proc entry is per task. But if you
>> consider the tools to set the capabilities (for example systemd unit
>> files), there can be between fork() and exec() further preparations
>> which need more capabilities than the program itself needs.
>>
>> setpcap() is probably the real point after which we are interested if
>> the capabilities are enough.
>>
>> The amount of setup between setpcap() and exec() is probably very low.
> 
> When I asked "what is the intended behavior on fork()?", I mean "what
> should CapUsed be after fork()?".  The answer should be about four
> words long and should have a test case.  There should maybe also be an
> explanation of why the intended behavior is useful.

In this model:
fork: no change
setpcap: no change
exec: reset

But I hadn't thought that much where the reset happens.
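
As a toy illustration of the model just listed, and a possible starting
point for the requested test case, the semantics could be written down like
this. Everything here is hypothetical: struct toy_task, toy_cap_use(),
toy_fork() and toy_exec() are not kernel interfaces, and "fork: no change"
is read as "the child starts with the parent's mask".

```c
#include <stdint.h>

/* Hypothetical toy model of a task; only the field needed here. */
struct toy_task {
	uint64_t cap_used;	/* bitmask of capabilities exercised so far */
};

/* Record that capability number 'cap' was exercised. */
static void toy_cap_use(struct toy_task *t, unsigned int cap)
{
	t->cap_used |= UINT64_C(1) << cap;
}

/* fork: no change, the child starts with a copy of the parent's mask. */
static struct toy_task toy_fork(const struct toy_task *parent)
{
	return *parent;
}

/* exec: reset, the new program starts with an empty mask. */
static void toy_exec(struct toy_task *t)
{
	t->cap_used = 0;
}
```

Under this reading, capabilities used by a helper after its own exec never
show up in the parent's mask, which is the limitation raised above.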

> 
> But, as I said above, I think that you may need to rethink this
> entirely to make it useful.  You might need to do it per process tree
> or per cgroup or something.
> 
> --Andy
> 

I'd actually prefer the cgroup approach. Though that's much more work
than this simple patch which already gives somewhat useful information
in limited cases (once the logic is correct).

-Topi

* Re: [RFC 02/18] cgroup_pids: track maximum pids
  2016-06-13 21:33       ` Tejun Heo
@ 2016-06-13 21:59         ` Topi Miettinen
  2016-06-13 22:09           ` Tejun Heo
  2016-07-17 20:11         ` Topi Miettinen
  1 sibling, 1 reply; 56+ messages in thread
From: Topi Miettinen @ 2016-06-13 21:59 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kernel, Li Zefan, Johannes Weiner,
	open list:CONTROL GROUP (CGROUP)

On 06/13/16 21:33, Tejun Heo wrote:
> Hello,
> 
> On Mon, Jun 13, 2016 at 09:29:32PM +0000, Topi Miettinen wrote:
>> I used fork callback as I don't want to lower the watermark in all cases
>> where the charge can be lowered, so I'd update the watermark only when
>> the fork really happens.
> 
> I don't think that would make a noticeable difference.  That's where
> we decide whether to grant fork or not after all and thus where the
> actual usage is.
> 

You mean, increment count on cgroup_can_fork()? But what if the fork()
fails after that (signal_pending case)?

>> Is there a better way to compare and set? I don't think atomic_cmpxchg()
>> does what's needed.
> 
> cmpxchg loop should do what's necessary although I'm not sure how much
> being strictly correct matters here.
> 
> Thanks.
> 

These are not used for any decisions taken by kernel, but by the user. I
have to say I don't know where's the line between strict correctness and
less strict.

-Topi

* Re: [RFC 02/18] cgroup_pids: track maximum pids
  2016-06-13 21:59         ` Topi Miettinen
@ 2016-06-13 22:09           ` Tejun Heo
  0 siblings, 0 replies; 56+ messages in thread
From: Tejun Heo @ 2016-06-13 22:09 UTC (permalink / raw)
  To: Topi Miettinen
  Cc: linux-kernel, Li Zefan, Johannes Weiner,
	open list:CONTROL GROUP (CGROUP)

On Mon, Jun 13, 2016 at 09:59:32PM +0000, Topi Miettinen wrote:
> On 06/13/16 21:33, Tejun Heo wrote:
> > Hello,
> > 
> > On Mon, Jun 13, 2016 at 09:29:32PM +0000, Topi Miettinen wrote:
> >> I used fork callback as I don't want to lower the watermark in all cases
> >> where the charge can be lowered, so I'd update the watermark only when
> >> the fork really happens.
> > 
> > I don't think that would make a noticeable difference.  That's where
> > we decide whether to grant fork or not after all and thus where the
> > actual usage is.
> > 
> 
> You mean, increment count on cgroup_can_fork()? But what if the fork()
> fails after that (signal_pending case)?

That number isn't gonna deviate by any significant amount and the
counter is to estimate what the limit should be set to to begin with.
It's logical to collect how close the usage got to can_attach failure
due to limit breach.
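
A userspace sketch of tracking the watermark in the charge path, rather
than in a separate fork callback, could look like this. It is a model under
assumptions, not the pids controller's code: struct toy_counter and the
toy_* functions are hypothetical, there is no cgroup hierarchy, and C11
atomics stand in for atomic64_t.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model of a pids-style counter. */
struct toy_counter {
	_Atomic int64_t count;		/* current usage */
	_Atomic int64_t watermark;	/* highest granted usage seen */
	int64_t limit;			/* maximum allowed usage */
};

/* Try to charge 'num' units; record the watermark where it is granted. */
static bool toy_try_charge(struct toy_counter *c, int64_t num)
{
	int64_t usage = atomic_fetch_add(&c->count, num) + num;
	int64_t old;

	if (usage > c->limit) {
		/* Over the limit: undo the charge and refuse. */
		atomic_fetch_sub(&c->count, num);
		return false;
	}

	/* Only ever raise the watermark, even against concurrent chargers. */
	old = atomic_load(&c->watermark);
	while (old < usage &&
	       !atomic_compare_exchange_weak(&c->watermark, &old, usage))
		;
	return true;
}

/* Uncharging lowers the usage but never the watermark. */
static void toy_uncharge(struct toy_counter *c, int64_t num)
{
	atomic_fetch_sub(&c->count, num);
}
```

If the operation fails after a successful charge, the caller uncharges and
the watermark stays where it was, which matches the point that the number
records how close usage came to the limit.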

> >> Is there a better way to compare and set? I don't think atomic_cmpxchg()
> >> does what's needed.
> > 
> > cmpxchg loop should do what's necessary although I'm not sure how much
> > being strictly correct matters here.
> 
> These are not used for any decisions taken by kernel, but by the user. I
> have to say I don't know where's the line between strict correctness and
> less strict.

Provided that cmpxchg is done only when the counter needs to be
actually updated, it's not gonna be noticeably expensive.  Might as
well make it correct.

Thanks.

-- 
tejun

* Re: [RFC 11/18] limits: track and present RLIMIT_NPROC actual max
  2016-06-13 19:44 ` [RFC 11/18] limits: track and present RLIMIT_NPROC " Topi Miettinen
@ 2016-06-13 22:27   ` Jann Horn
  2016-06-14 15:40     ` Topi Miettinen
  0 siblings, 1 reply; 56+ messages in thread
From: Jann Horn @ 2016-06-13 22:27 UTC (permalink / raw)
  To: Topi Miettinen
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Kees Cook, Al Viro, Alexey Dobriyan, John Stultz,
	Janis Danisevskis, Calvin Owens, Tejun Heo, Michal Hocko,
	Oleg Nesterov, Vladimir Davydov, Andrea Arcangeli, Josh Triplett,
	Eric W. Biederman, Aleksa Sarai, Cyrill Gorcunov, Ben Segall,
	Mateusz Guzik

On Mon, Jun 13, 2016 at 10:44:18PM +0300, Topi Miettinen wrote:
> Track maximum number of processes per user and present it
> in /proc/self/limits.
> 
> Signed-off-by: Topi Miettinen <toiwoton@gmail.com>
> ---
>  fs/proc/base.c        | 4 ++++
>  include/linux/sched.h | 1 +
>  kernel/fork.c         | 5 +++++
>  kernel/sys.c          | 5 +++++
>  4 files changed, 15 insertions(+)
> 
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 1df4fc8..02576c6 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -670,6 +670,10 @@ static int proc_pid_limits(struct seq_file *m, struct pid_namespace *ns,
>  				seq_printf(m, "%-20lu\n", psecs);
>  			}
>  			break;
> +		case RLIMIT_NPROC:
> +			seq_printf(m, "%-20d\n",
> +				   atomic_read(&task->real_cred->user->max_processes));

Don't you have to take an RCU read lock before dereferencing task->real_cred?
And shouldn't this be done with __task_cred(task) instead of task->real_cred?


> +			break;
>  		default:
>  			seq_printf(m, "%-20lu\n",
>  				   task->signal->rlim_curmax[i]);
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 0150380..feb9bb7 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -838,6 +838,7 @@ static inline int signal_group_exit(const struct signal_struct *sig)
>  struct user_struct {
>  	atomic_t __count;	/* reference count */
>  	atomic_t processes;	/* How many processes does this user have? */
> +	atomic_t max_processes;	/* How many processes has this user had at the same time? */
>  	atomic_t sigpending;	/* How many pending signals does this user have? */
>  #ifdef CONFIG_INOTIFY_USER
>  	atomic_t inotify_watches; /* How many inotify watches does this user have? */
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 5c2c355..667290f 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1653,6 +1653,11 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>  	trace_task_newtask(p, clone_flags);
>  	uprobe_copy_process(p, clone_flags);
>  
> +	if (atomic_read(&p->real_cred->user->max_processes) <
> +	    atomic_read(&p->real_cred->user->processes))
> +		atomic_set(&p->real_cred->user->max_processes,
> +			   atomic_read(&p->real_cred->user->processes));
> +
>  	return p;
>  
>  bad_fork_cancel_cgroup:
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 6629f6f..955cf21 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -439,6 +439,11 @@ static int set_user(struct cred *new)
>  	else
>  		current->flags &= ~PF_NPROC_EXCEEDED;
>  
> +	if (atomic_read(&new_user->max_processes) <
> +	    atomic_read(&new_user->processes))
> +		atomic_set(&new_user->max_processes,
> +			   atomic_read(&new_user->processes));
> +

Is this intentionally slightly racy? If so, it might be nice to have a comment
here that documents that.

* Re: [RFC 03/18] memcontrol: present maximum used memory also for cgroup-v2
  2016-06-13 19:44 ` [RFC 03/18] memcontrol: present maximum used memory also for cgroup-v2 Topi Miettinen
@ 2016-06-14  7:01   ` Michal Hocko
  2016-06-14 15:47     ` Topi Miettinen
  0 siblings, 1 reply; 56+ messages in thread
From: Michal Hocko @ 2016-06-14  7:01 UTC (permalink / raw)
  To: Topi Miettinen
  Cc: linux-kernel, Johannes Weiner, Vladimir Davydov, Andrew Morton,
	open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG),
	open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG)

On Mon 13-06-16 22:44:10, Topi Miettinen wrote:
> Present maximum used memory in cgroup memory.current_max.

It would be really much more preferable to present the usecase in the
patch description. It is true that this information is presented in the
v1 API but the current policy is to export new knobs only when there is
a reasonable usecase for it.

> Signed-off-by: Topi Miettinen <toiwoton@gmail.com>
> ---
>  include/linux/page_counter.h |  7 ++++++-
>  mm/memcontrol.c              | 13 +++++++++++++
>  2 files changed, 19 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
> index 7e62920..be4de17 100644
> --- a/include/linux/page_counter.h
> +++ b/include/linux/page_counter.h
> @@ -9,9 +9,9 @@ struct page_counter {
>  	atomic_long_t count;
>  	unsigned long limit;
>  	struct page_counter *parent;
> +	unsigned long watermark;
>  
>  	/* legacy */
> -	unsigned long watermark;
>  	unsigned long failcnt;
>  };
>  
> @@ -34,6 +34,11 @@ static inline unsigned long page_counter_read(struct page_counter *counter)
>  	return atomic_long_read(&counter->count);
>  }
>  
> +static inline unsigned long page_counter_read_watermark(struct page_counter *counter)
> +{
> +	return counter->watermark;
> +}
> +
>  void page_counter_cancel(struct page_counter *counter, unsigned long nr_pages);
>  void page_counter_charge(struct page_counter *counter, unsigned long nr_pages);
>  bool page_counter_try_charge(struct page_counter *counter,
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 75e7440..5513771 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4966,6 +4966,14 @@ static u64 memory_current_read(struct cgroup_subsys_state *css,
>  	return (u64)page_counter_read(&memcg->memory) * PAGE_SIZE;
>  }
>  
> +static u64 memory_current_max_read(struct cgroup_subsys_state *css,
> +				   struct cftype *cft)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> +
> +	return (u64)page_counter_read_watermark(&memcg->memory) * PAGE_SIZE;
> +}
> +
>  static int memory_low_show(struct seq_file *m, void *v)
>  {
>  	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
> @@ -5179,6 +5187,11 @@ static struct cftype memory_files[] = {
>  		.read_u64 = memory_current_read,
>  	},
>  	{
> +		.name = "current_max",
> +		.flags = CFTYPE_NOT_ON_ROOT,
> +		.read_u64 = memory_current_max_read,
> +	},
> +	{
>  		.name = "low",
>  		.flags = CFTYPE_NOT_ON_ROOT,
>  		.seq_show = memory_low_show,
> -- 
> 2.8.1

-- 
Michal Hocko
SUSE Labs

* Re: [RFC 06/18] limits: present RLIMIT_CPU and RLIMIT_RTTIMER current status
  2016-06-13 19:44 ` [RFC 06/18] limits: present RLIMIT_CPU and RLIMIT_RTTIMER current status Topi Miettinen
@ 2016-06-14  9:14   ` Alexey Dobriyan
  0 siblings, 0 replies; 56+ messages in thread
From: Alexey Dobriyan @ 2016-06-14  9:14 UTC (permalink / raw)
  To: Topi Miettinen
  Cc: Linux Kernel, Andrew Morton, Kees Cook, Al Viro, John Stultz,
	Janis Danisevskis, Calvin Owens, Jann Horn

On Mon, Jun 13, 2016 at 10:44 PM, Topi Miettinen <toiwoton@gmail.com> wrote:
> Present current cputimer status in /proc/self/limits.

> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -650,8 +650,30 @@ static int proc_pid_limits(struct seq_file *m, struct pid_namespace *ns,
> +               switch (i) {
> +               case RLIMIT_RTTIME:
> +               case RLIMIT_CPU:
> +                       if (rlim[i].rlim_max == RLIM_INFINITY)
> +                               seq_printf(m, "%-20s\n", "-");
> +                       else {
> +                               unsigned long long utime, ptime;
> +                               unsigned long psecs;
> +                               struct task_cputime cputime;
> +
> +                               thread_group_cputimer(task, &cputime);
> +                               utime = cputime_to_expires(cputime.utime);
> +                               ptime = utime + cputime_to_expires(cputime.stime);
> +                               psecs = cputime_to_secs(ptime);
> +                               if (i == RLIMIT_RTTIME)
> +                                       psecs *= USEC_PER_SEC;
> +                               seq_printf(m, "%-20lu\n", psecs);
> +                       }
> +                       break;

Let's keep rlimits file for rlimits.

* Re: [RFC 14/18] limits: track RLIMIT_SIGPENDING actual max
  2016-06-13 19:44 ` [RFC 14/18] limits: track RLIMIT_SIGPENDING " Topi Miettinen
@ 2016-06-14 14:50   ` Oleg Nesterov
  2016-06-14 15:51     ` Topi Miettinen
  0 siblings, 1 reply; 56+ messages in thread
From: Oleg Nesterov @ 2016-06-14 14:50 UTC (permalink / raw)
  To: Topi Miettinen
  Cc: linux-kernel, Andrew Morton, Ingo Molnar, Amanieu d'Antras,
	Stas Sergeev, Dave Hansen, Wang Xiaoqiang, Helge Deller,
	Sasha Levin

On 06/13, Topi Miettinen wrote:
>
> Track maximum number of pending signals, presented in /proc/self/limits.
>
> Signed-off-by: Topi Miettinen <toiwoton@gmail.com>
> ---
>  kernel/signal.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 96e9bc4..c8fbccd 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -387,6 +387,8 @@ __sigqueue_alloc(int sig, struct task_struct *t, gfp_t flags, int override_rlimi
>  		INIT_LIST_HEAD(&q->list);
>  		q->flags = 0;
>  		q->user = user;
> +		/* XXX resource limits apply per task, not per user */
> +		bump_rlimit(RLIMIT_SIGPENDING, atomic_read(&user->sigpending));

Well, I have to admit that I too dislike the very idea of these changes...

But this particular patch looks wrong in any case. I wasn't cc'ed on the
previous patches which add bump_rlimit(), but I have found

	"[RFC 05/18] limits: track and present RLIMIT_NOFILE actual max"
	http://marc.info/?l=linux-fsdevel&m=146584742331072&w=2

and bump_rlimit() changes current->signal->rlim_curmax, while in this case
you need to bump t->signal->rlim_curmax.
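
To make the distinction concrete, here is a small userspace model in which
the tracking helper takes the target task explicitly instead of implicitly
using the caller. The types and names (struct model_task, struct
model_signal, track_max_for(), MODEL_NLIMITS) are hypothetical and only
mirror the shape of task->signal->rlim_curmax[].

```c
#define MODEL_NLIMITS 16	/* hypothetical stand-in for RLIM_NLIMITS */

/* Hypothetical models of signal_struct and task_struct. */
struct model_signal {
	unsigned long rlim_curmax[MODEL_NLIMITS];
};

struct model_task {
	struct model_signal *signal;
};

/*
 * Record a new maximum for task 't', which need not be the caller.
 * When one task queues a signal for another, 't' is the receiver.
 */
static void track_max_for(struct model_task *t, unsigned int limit,
			  unsigned long r)
{
	if (t->signal->rlim_curmax[limit] < r)
		t->signal->rlim_curmax[limit] = r;
}
```

With a helper of this shape, the sender's own record is left untouched when
it queues a signal for a different task.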

Oleg.

* Re: [RFC 05/18] limits: track and present RLIMIT_NOFILE actual max
  2016-06-13 21:16       ` Andy Lutomirski
@ 2016-06-14 15:21         ` Topi Miettinen
  0 siblings, 0 replies; 56+ messages in thread
From: Topi Miettinen @ 2016-06-14 15:21 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, linux-kernel, Alexander Viro, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Kees Cook, Cyrill Gorcunov,
	Alexey Dobriyan, John Stultz, Janis Danisevskis, Calvin Owens,
	Jann Horn, open list:FILESYSTEMS (VFS and infrastructure)

On 06/13/16 21:16, Andy Lutomirski wrote:
> On Mon, Jun 13, 2016 at 2:13 PM, Topi Miettinen <toiwoton@gmail.com> wrote:
>> On 06/13/16 20:40, Andy Lutomirski wrote:
>>> On 06/13/2016 12:44 PM, Topi Miettinen wrote:
>>>> Track maximum number of files for the process, present current maximum
>>>> in /proc/self/limits.
>>>
>>> The core part should be its own patch.
>>>
>>> Also, you have this weirdly named (and racy!) function bump_rlimit.
>>
>> I can change the name if you have better suggestions. rlimit_track_max?
>>
>> The max value is written often but read seldom, if ever. What kind of
>> locking should I use then?
> 
> Possibly none, but WRITE_ONCE would be good as would a comment
> indicating that your code is intentionally racy.  Or you could use
> atomic_cmpxchg if that won't kill performance.
> 
> rlimit_track_max sounds like a better name to me.
> 
>>
>>> Wouldn't this be nicer if you taught the rlimit code to track the
>>> *current* usage generically and to derive the max usage from that?
>>
>> Current rlimit code performs checks against current limits. These are
>> typically done early in the calling function and further checks could
>> also fail. Thus max should not be updated until much later. Maybe these
>> could be combined, but not easily if at all.
> 
> I mean:  why not actually show the current value in /proc/pid/limits
> and track the max via whatever teaches proc about the current value?
> 

That could be interesting data too. In other comments, a new file was
proposed, and then your model would be a good choice.

>>
>>>
>>>> diff --git a/fs/proc/base.c b/fs/proc/base.c
>>>> index a11eb71..227997b 100644
>>>> --- a/fs/proc/base.c
>>>> +++ b/fs/proc/base.c
>>>> @@ -630,8 +630,8 @@ static int proc_pid_limits(struct seq_file *m, struct pid_namespace *ns,
>>>>      /*
>>>>       * print the file header
>>>>       */
>>>> -       seq_printf(m, "%-25s %-20s %-20s %-10s\n",
>>>> -          "Limit", "Soft Limit", "Hard Limit", "Units");
>>>> +    seq_printf(m, "%-25s %-20s %-20s %-10s %-20s\n",
>>>> +           "Limit", "Soft Limit", "Hard Limit", "Units", "Max");
>>>
>>> What existing programs, if any, does this break?
>>
>> Using Debian codesearch for /limits" string, I'd check pam_limits and
>> rtkit. The max values could be put into a new file if you prefer.
> 
> If it actually breaks them, then you need to change the patch so you
> don't break them.
> 

* Re: [RFC 11/18] limits: track and present RLIMIT_NPROC actual max
  2016-06-13 22:27   ` Jann Horn
@ 2016-06-14 15:40     ` Topi Miettinen
  2016-06-14 23:15       ` Jann Horn
  0 siblings, 1 reply; 56+ messages in thread
From: Topi Miettinen @ 2016-06-14 15:40 UTC (permalink / raw)
  To: Jann Horn
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Kees Cook, Al Viro, Alexey Dobriyan, John Stultz,
	Janis Danisevskis, Calvin Owens, Tejun Heo, Michal Hocko,
	Oleg Nesterov, Vladimir Davydov, Andrea Arcangeli, Josh Triplett,
	Eric W. Biederman, Aleksa Sarai, Cyrill Gorcunov, Ben Segall,
	Mateusz Guzik

On 06/13/16 22:27, Jann Horn wrote:
> On Mon, Jun 13, 2016 at 10:44:18PM +0300, Topi Miettinen wrote:
>> Track maximum number of processes per user and present it
>> in /proc/self/limits.
>>
>> Signed-off-by: Topi Miettinen <toiwoton@gmail.com>
>> ---
>>  fs/proc/base.c        | 4 ++++
>>  include/linux/sched.h | 1 +
>>  kernel/fork.c         | 5 +++++
>>  kernel/sys.c          | 5 +++++
>>  4 files changed, 15 insertions(+)
>>
>> diff --git a/fs/proc/base.c b/fs/proc/base.c
>> index 1df4fc8..02576c6 100644
>> --- a/fs/proc/base.c
>> +++ b/fs/proc/base.c
>> @@ -670,6 +670,10 @@ static int proc_pid_limits(struct seq_file *m, struct pid_namespace *ns,
>>  				seq_printf(m, "%-20lu\n", psecs);
>>  			}
>>  			break;
>> +		case RLIMIT_NPROC:
>> +			seq_printf(m, "%-20d\n",
>> +				   atomic_read(&task->real_cred->user->max_processes));
> 
> Don't you have to take an RCU read lock before dereferencing task->real_cred?

In other comments in the series, cmpxchg loop was suggested, would that
work here?

> And shouldn't this be done with __task_cred(task) instead of task->real_cred?

How about atomic_read(&task_cred_xxx(task, user)->max_processes)?

> 
> 
>> +			break;
>>  		default:
>>  			seq_printf(m, "%-20lu\n",
>>  				   task->signal->rlim_curmax[i]);
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index 0150380..feb9bb7 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -838,6 +838,7 @@ static inline int signal_group_exit(const struct signal_struct *sig)
>>  struct user_struct {
>>  	atomic_t __count;	/* reference count */
>>  	atomic_t processes;	/* How many processes does this user have? */
>> +	atomic_t max_processes;	/* How many processes has this user had at the same time? */
>>  	atomic_t sigpending;	/* How many pending signals does this user have? */
>>  #ifdef CONFIG_INOTIFY_USER
>>  	atomic_t inotify_watches; /* How many inotify watches does this user have? */
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index 5c2c355..667290f 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -1653,6 +1653,11 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>>  	trace_task_newtask(p, clone_flags);
>>  	uprobe_copy_process(p, clone_flags);
>>  
>> +	if (atomic_read(&p->real_cred->user->max_processes) <
>> +	    atomic_read(&p->real_cred->user->processes))
>> +		atomic_set(&p->real_cred->user->max_processes,
>> +			   atomic_read(&p->real_cred->user->processes));
>> +
>>  	return p;
>>  
>>  bad_fork_cancel_cgroup:
>> diff --git a/kernel/sys.c b/kernel/sys.c
>> index 6629f6f..955cf21 100644
>> --- a/kernel/sys.c
>> +++ b/kernel/sys.c
>> @@ -439,6 +439,11 @@ static int set_user(struct cred *new)
>>  	else
>>  		current->flags &= ~PF_NPROC_EXCEEDED;
>>  
>> +	if (atomic_read(&new_user->max_processes) <
>> +	    atomic_read(&new_user->processes))
>> +		atomic_set(&new_user->max_processes,
>> +			   atomic_read(&new_user->processes));
>> +
> 
> Is this intentionally slightly racy? If so, it might be nice to have a comment
> here that documents that.
> 

I'd suppose cmpxchg loop could be used to avoid races.

-Topi

* Re: [RFC 03/18] memcontrol: present maximum used memory also for cgroup-v2
  2016-06-14  7:01   ` Michal Hocko
@ 2016-06-14 15:47     ` Topi Miettinen
  2016-06-14 16:04       ` Johannes Weiner
  0 siblings, 1 reply; 56+ messages in thread
From: Topi Miettinen @ 2016-06-14 15:47 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, Johannes Weiner, Vladimir Davydov, Andrew Morton,
	open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG),
	open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG)

On 06/14/16 07:01, Michal Hocko wrote:
> On Mon 13-06-16 22:44:10, Topi Miettinen wrote:
>> Present maximum used memory in cgroup memory.current_max.
> 
> It would be really much more preferable to present the usecase in the
> patch description. It is true that this information is presented in the
> v1 API but the current policy is to export new knobs only when there is
> a reasonable usecase for it.
> 

This was stated in the cover letter:
https://lkml.org/lkml/2016/6/13/857

"There are many basic ways to control processes, including capabilities,
cgroups and resource limits. However, there are far fewer ways to find out
useful values for the limits, except blind trial and error.

This patch series attempts to fix that by giving at least a nice starting
point from the actual maximum values. I looked where each limit is checked
and added a call to limit bump nearby."

"Cgroups
[RFC 02/18] cgroup_pids: track maximum pids
[RFC 03/18] memcontrol: present maximum used memory also for
[RFC 04/18] device_cgroup: track and present accessed devices

For tasks and memory cgroup limits the situation is somewhat better as the
current tasks and memory status can be easily seen with ps(1). However, any
transient tasks or temporary higher memory use might slip from the view.
Device use may be seen with advanced MAC tools, like TOMOYO, but there is no
universal method. Program sources typically give no useful indication about
memory use or how many tasks there could be."

I can add some of this to the commit message, is that sufficient for you?

>> Signed-off-by: Topi Miettinen <toiwoton@gmail.com>
>> ---
>>  include/linux/page_counter.h |  7 ++++++-
>>  mm/memcontrol.c              | 13 +++++++++++++
>>  2 files changed, 19 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
>> index 7e62920..be4de17 100644
>> --- a/include/linux/page_counter.h
>> +++ b/include/linux/page_counter.h
>> @@ -9,9 +9,9 @@ struct page_counter {
>>  	atomic_long_t count;
>>  	unsigned long limit;
>>  	struct page_counter *parent;
>> +	unsigned long watermark;
>>  
>>  	/* legacy */
>> -	unsigned long watermark;
>>  	unsigned long failcnt;
>>  };
>>  
>> @@ -34,6 +34,11 @@ static inline unsigned long page_counter_read(struct page_counter *counter)
>>  	return atomic_long_read(&counter->count);
>>  }
>>  
>> +static inline unsigned long page_counter_read_watermark(struct page_counter *counter)
>> +{
>> +	return counter->watermark;
>> +}
>> +
>>  void page_counter_cancel(struct page_counter *counter, unsigned long nr_pages);
>>  void page_counter_charge(struct page_counter *counter, unsigned long nr_pages);
>>  bool page_counter_try_charge(struct page_counter *counter,
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 75e7440..5513771 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -4966,6 +4966,14 @@ static u64 memory_current_read(struct cgroup_subsys_state *css,
>>  	return (u64)page_counter_read(&memcg->memory) * PAGE_SIZE;
>>  }
>>  
>> +static u64 memory_current_max_read(struct cgroup_subsys_state *css,
>> +				   struct cftype *cft)
>> +{
>> +	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
>> +
>> +	return (u64)page_counter_read_watermark(&memcg->memory) * PAGE_SIZE;
>> +}
>> +
>>  static int memory_low_show(struct seq_file *m, void *v)
>>  {
>>  	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
>> @@ -5179,6 +5187,11 @@ static struct cftype memory_files[] = {
>>  		.read_u64 = memory_current_read,
>>  	},
>>  	{
>> +		.name = "current_max",
>> +		.flags = CFTYPE_NOT_ON_ROOT,
>> +		.read_u64 = memory_current_max_read,
>> +	},
>> +	{
>>  		.name = "low",
>>  		.flags = CFTYPE_NOT_ON_ROOT,
>>  		.seq_show = memory_low_show,
>> -- 
>> 2.8.1
> 


* Re: [RFC 14/18] limits: track RLIMIT_SIGPENDING actual max
  2016-06-14 14:50   ` Oleg Nesterov
@ 2016-06-14 15:51     ` Topi Miettinen
  0 siblings, 0 replies; 56+ messages in thread
From: Topi Miettinen @ 2016-06-14 15:51 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, Andrew Morton, Ingo Molnar, Amanieu d'Antras,
	Stas Sergeev, Dave Hansen, Wang Xiaoqiang, Helge Deller,
	Sasha Levin

On 06/14/16 14:50, Oleg Nesterov wrote:
> On 06/13, Topi Miettinen wrote:
>>
>> Track maximum number of pending signals, presented in /proc/self/limits.
>>
>> Signed-off-by: Topi Miettinen <toiwoton@gmail.com>
>> ---
>>  kernel/signal.c | 2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/kernel/signal.c b/kernel/signal.c
>> index 96e9bc4..c8fbccd 100644
>> --- a/kernel/signal.c
>> +++ b/kernel/signal.c
>> @@ -387,6 +387,8 @@ __sigqueue_alloc(int sig, struct task_struct *t, gfp_t flags, int override_rlimi
>>  		INIT_LIST_HEAD(&q->list);
>>  		q->flags = 0;
>>  		q->user = user;
>> +		/* XXX resource limits apply per task, not per user */
>> +		bump_rlimit(RLIMIT_SIGPENDING, atomic_read(&user->sigpending));
> 
> Well, I have to admit that I too dislike the very idea of these changes...
> 
> But this particular patch looks wrong in any case. I wasn't cc'ed on the
> previous patches which add bump_rlimit(), but I have found
> 
> 	"[RFC 05/18] limits: track and present RLIMIT_NOFILE actual max"
> 	http://marc.info/?l=linux-fsdevel&m=146584742331072&w=2
> 

I used git send-email --cc-cmd=scripts/get_maintainer.pl to generate the
CC lists. Is there a better way?

> and bump_rlimit() changes current->signal->rlim_curmax, while in this case
> you need to bump t->signal->rlim_curmax.
> 
> Oleg.
> 

Yes, I also added task_bump_rlimit(), which would be a better choice here.
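
To illustrate the difference Oleg points out, here is a simplified
userspace model. The types, the array size, the index and the function
signatures are assumptions for illustration only; the real helpers in
the series may look different:

```c
#define NLIMITS		16	/* illustrative array size */
#define SIGPENDING_IDX	11	/* illustrative slot for RLIMIT_SIGPENDING */

/* Simplified stand-ins for signal_struct and task_struct. */
struct signal_model {
	unsigned long rlim_curmax[NLIMITS];
};

struct task_model {
	struct signal_model *signal;
};

/* Stands in for the kernel's "current" task pointer. */
static struct task_model *current_task;

/* Record a new maximum for the given task. */
static void task_bump_rlimit(struct task_model *t, unsigned int limit,
			     unsigned long value)
{
	if (t->signal->rlim_curmax[limit] < value)
		t->signal->rlim_curmax[limit] = value;
}

/* Record a new maximum for the calling task only. */
static void bump_rlimit(unsigned int limit, unsigned long value)
{
	task_bump_rlimit(current_task, limit, value);
}
```

When a signal is queued for another task t, the sender is "current", so
bump_rlimit() would update the sender's maximum. Passing t explicitly
updates the receiver instead.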

-Topi


* Re: [RFC 03/18] memcontrol: present maximum used memory also for cgroup-v2
  2016-06-14 15:47     ` Topi Miettinen
@ 2016-06-14 16:04       ` Johannes Weiner
  2016-06-14 17:15         ` Topi Miettinen
  0 siblings, 1 reply; 56+ messages in thread
From: Johannes Weiner @ 2016-06-14 16:04 UTC (permalink / raw)
  To: Topi Miettinen
  Cc: Michal Hocko, linux-kernel, Vladimir Davydov, Andrew Morton,
	open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG),
	open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG)

On Tue, Jun 14, 2016 at 03:47:20PM +0000, Topi Miettinen wrote:
> On 06/14/16 07:01, Michal Hocko wrote:
> > On Mon 13-06-16 22:44:10, Topi Miettinen wrote:
> >> Present maximum used memory in cgroup memory.current_max.
> > 
> > It would be really much more preferable to present the usecase in the
> > patch description. It is true that this information is presented in the
> > v1 API but the current policy is to export new knobs only when there is
> > a reasonable usecase for it.
> > 
> 
> This was stated in the cover letter:
> https://lkml.org/lkml/2016/6/13/857
> 
> "There are many basic ways to control processes, including capabilities,
> cgroups and resource limits. However, there are far fewer ways to find out
> useful values for the limits, except blind trial and error.
> 
> This patch series attempts to fix that by giving at least a nice starting
> point from the actual maximum values. I looked where each limit is checked
> and added a call to limit bump nearby."
> 
> "Cgroups
> [RFC 02/18] cgroup_pids: track maximum pids
> [RFC 03/18] memcontrol: present maximum used memory also for
> [RFC 04/18] device_cgroup: track and present accessed devices
> 
> For tasks and memory cgroup limits the situation is somewhat better as the
> current tasks and memory status can be easily seen with ps(1). However, any
> transient tasks or temporary higher memory use might slip from the view.
> Device use may be seen with advanced MAC tools, like TOMOYO, but there is no
> universal method. Program sources typically give no useful indication about
> memory use or how many tasks there could be."
> 
> I can add some of this to the commit message, is that sufficient for you?

It's useful to have a short summary of the justification in each patch
as well. Other than that it's fine to be broader and more detailed
about your motivation in the coverletter.

I didn't catch the coverletter, though. It makes sense to CC
recipients of any of those patches on the full series, including the
cover, since even though we are specialized in certain areas of the
code, many of us are interested in the whole picture of addressing a
problem, and not just the few bits in our area without more context.

As far as the memcg part of this series goes, one concern is that page
cache is trimmed back only when there is pressure, so in all but very
few cases the high watermark you are introducing will be pegged to the
configured limit. It doesn't give a whole lot of insight.

But there are consumers that are less/not compressible than cache,
such as anonymous memory, unreclaimable slab, maybe socket buffers
etc. Having spikes in those slip through two sampling points is an
issue, indeed. Adding consumer-specific watermarks might be useful.
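
A toy model of the two points above, not actual memcg code: the struct,
the two consumer classes and the "reclaim cache first" policy are
simplifying assumptions made for illustration.

```c
enum consumer { CACHE, ANON, NR_CONSUMERS };

struct memcg_model {
	unsigned long limit;			/* configured limit, in pages */
	unsigned long usage[NR_CONSUMERS];	/* current usage per consumer */
	unsigned long watermark[NR_CONSUMERS];	/* peak usage per consumer */
	unsigned long total_watermark;		/* peak of the summed usage */
};

static unsigned long total_usage(const struct memcg_model *m)
{
	return m->usage[CACHE] + m->usage[ANON];
}

/*
 * Charge pages to one consumer. If the limit would be exceeded, cache is
 * trimmed first, and only then. Returns 0 on success and -1 if even
 * dropping all cache would not make room.
 */
static int charge(struct memcg_model *m, enum consumer c, unsigned long pages)
{
	unsigned long total = total_usage(m);

	if (total + pages > m->limit) {
		unsigned long excess = total + pages - m->limit;

		if (excess > m->usage[CACHE])
			return -1;
		m->usage[CACHE] -= excess;
	}

	m->usage[c] += pages;
	if (m->usage[c] > m->watermark[c])
		m->watermark[c] = m->usage[c];
	if (total_usage(m) > m->total_watermark)
		m->total_watermark = total_usage(m);
	return 0;
}

static void uncharge(struct memcg_model *m, enum consumer c,
		     unsigned long pages)
{
	m->usage[c] -= pages;
}
```

With a limit of 100 pages, 10 pages of anon and enough file I/O to fill
the rest with cache, the total watermark sits at 100 no matter what the
workload needs. A short anon spike to 30 pages that is gone again by the
next sample only shows up in the anon watermark.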

Thanks


* Re: [RFC 03/18] memcontrol: present maximum used memory also for cgroup-v2
  2016-06-14 16:04       ` Johannes Weiner
@ 2016-06-14 17:15         ` Topi Miettinen
  2016-06-16 10:27           ` Michal Hocko
  0 siblings, 1 reply; 56+ messages in thread
From: Topi Miettinen @ 2016-06-14 17:15 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, linux-kernel, Vladimir Davydov, Andrew Morton,
	open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG),
	open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG)

On 06/14/16 16:04, Johannes Weiner wrote:
> On Tue, Jun 14, 2016 at 03:47:20PM +0000, Topi Miettinen wrote:
>> On 06/14/16 07:01, Michal Hocko wrote:
>>> On Mon 13-06-16 22:44:10, Topi Miettinen wrote:
>>>> Present maximum used memory in cgroup memory.current_max.
>>>
>>> It would be really much more preferable to present the usecase in the
>>> patch description. It is true that this information is presented in the
>>> v1 API but the current policy is to export new knobs only when there is
>>> a reasonable usecase for it.
>>>
>>
>> This was stated in the cover letter:
>> https://lkml.org/lkml/2016/6/13/857
>>
>> "There are many basic ways to control processes, including capabilities,
>> cgroups and resource limits. However, there are far fewer ways to find out
>> useful values for the limits, except blind trial and error.
>>
>> This patch series attempts to fix that by giving at least a nice starting
>> point from the actual maximum values. I looked where each limit is checked
>> and added a call to limit bump nearby."
>>
>> "Cgroups
>> [RFC 02/18] cgroup_pids: track maximum pids
>> [RFC 03/18] memcontrol: present maximum used memory also for
>> [RFC 04/18] device_cgroup: track and present accessed devices
>>
>> For tasks and memory cgroup limits the situation is somewhat better as the
>> current tasks and memory status can be easily seen with ps(1). However, any
>> transient tasks or temporary higher memory use might slip from the view.
>> Device use may be seen with advanced MAC tools, like TOMOYO, but there is no
>> universal method. Program sources typically give no useful indication about
>> memory use or how many tasks there could be."
>>
>> I can add some of this to the commit message, is that sufficient for you?
> 
> It's useful to have a short summary of the justification in each patch
> as well. Other than that it's fine to be broader and more detailed
> about your motivation in the coverletter.
> 
> I didn't catch the coverletter, though. It makes sense to CC
> recipients of any of those patches on the full series, including the
> cover, since even though we are specialized in certain areas of the
> code, many of us are interested in the whole picture of addressing a
> problem, and not just the few bits in our area without more context.
> 

Thank you for this nice explanation. I suppose "git send-email
--cc-cmd=scripts/get_maintainer.pl" doesn't do this.

> As far as the memcg part of this series goes, one concern is that page
> cache is trimmed back only when there is pressure, so in all but very
> few cases the high watermark you are introducing will be pegged to the
> configured limit. It doesn't give a whole lot of insight.
> 

So using the high watermark would not give a very useful starting point
for the user who wished to configure the memory limit? What else could
be used instead?

> But there are consumers that are less/not compressible than cache,
> such as anonymous memory, unreclaimable slab, maybe socket buffers
> etc. Having spikes in those slip through two sampling points is an
> issue, indeed. Adding consumer-specific watermarks might be useful.
> 
> Thanks
> 

OK, but there's no limiting or tuning mechanism in place for now for
those, or is there? How could the results be used?

-Topi


* Re: [RFC 00/18] Present useful limits to user
  2016-06-13 19:44 [RFC 00/18] Present useful limits to user Topi Miettinen
                   ` (16 preceding siblings ...)
  2016-06-13 19:44 ` [RFC 18/18] proc: present VM_LOCKED memory in /proc/self/maps Topi Miettinen
@ 2016-06-14 19:03 ` Konstantin Khlebnikov
  2016-06-14 19:46   ` Topi Miettinen
  2016-06-15 14:47   ` Austin S. Hemmelgarn
  17 siblings, 2 replies; 56+ messages in thread
From: Konstantin Khlebnikov @ 2016-06-14 19:03 UTC (permalink / raw)
  To: Topi Miettinen; +Cc: Linux Kernel Mailing List

I don't like the idea of this patchset.

All limitations are context dependent and that context changes rapidly.
You'll never dump enough information for predicting future errors or
investigating the reason for errors in the past. You could try to reproduce
all kernel logic but the model will always be approximate.

If you want to track origin of failures in user space applications when it hits
some limit you should track errors. For example rlimits and other limitation
subsystems could provide a reasonable amount of tracepoints which could
tell what exactly happened before error. If you need highwater of some
values you could track it in userspace, or maybe tracing subsystem could
provide postprocessing for tracepoint parameters. Anyway, systemtap and
other monsters can do this right now.

On Mon, Jun 13, 2016 at 10:44 PM, Topi Miettinen <toiwoton@gmail.com> wrote:
> Hello,
>
> There are many basic ways to control processes, including capabilities,
> cgroups and resource limits. However, there are far fewer ways to find out
> useful values for the limits, except blind trial and error.
>
> This patch series attempts to fix that by giving at least a nice starting
> point from the actual maximum values. I looked where each limit is checked
> and added a call to limit bump nearby.
>
>
> Capabilities
> [RFC 01/18] capabilities: track actually used capabilities
>
> Currently, there is no way to know which capabilities are actually used. Even
> the source code is only implicit, in-depth knowledge of each capability must
> be used when analyzing a program to judge which capabilities the program will
> exercise.
>
> Cgroups
> [RFC 02/18] cgroup_pids: track maximum pids
> [RFC 03/18] memcontrol: present maximum used memory also for
> [RFC 04/18] device_cgroup: track and present accessed devices
>
> For tasks and memory cgroup limits the situation is somewhat better as the
> current tasks and memory status can be easily seen with ps(1). However, any
> transient tasks or temporary higher memory use might slip from the view.
> Device use may be seen with advanced MAC tools, like TOMOYO, but there is no
> universal method. Program sources typically give no useful indication about
> memory use or how many tasks there could be.
>
> Resource limits
> [RFC 05/18] limits: track and present RLIMIT_NOFILE actual max
> [RFC 06/18] limits: present RLIMIT_CPU and RLIMIT_RTTIMER current
> [RFC 07/18] limits: track RLIMIT_FSIZE actual max
> [RFC 08/18] limits: track RLIMIT_DATA actual max
> [RFC 09/18] limits: track RLIMIT_CORE actual max
> [RFC 10/18] limits: track RLIMIT_STACK actual max
> [RFC 11/18] limits: track and present RLIMIT_NPROC actual max
> [RFC 12/18] limits: track RLIMIT_MEMLOCK actual max
> [RFC 13/18] limits: track RLIMIT_AS actual max
> [RFC 14/18] limits: track RLIMIT_SIGPENDING actual max
> [RFC 15/18] limits: track RLIMIT_MSGQUEUE actual max
> [RFC 16/18] limits: track RLIMIT_NICE actual max
> [RFC 17/18] limits: track RLIMIT_RTPRIO actual max
> [RFC 18/18] proc: present VM_LOCKED memory in /proc/self/maps
>
> Current number of files and current VM usage (data pages, address space size)
> could be calculated from available /proc files. Again, any temporarily higher
> values could be easily missed. For many limits, there is no way to see what
> is the current situation and source code is mostly useless.
>
> As a side note, the resource limits seem to be in bad shape. For example,
> RLIMIT_MEMLOCK is used incoherently and I think VM statistics can miss
> some changes. Adding RLIMIT_CODE could be useful.
>
> The current maximum values for the resource limits are now shown in
> /proc/task/limits. If this is deemed too confusing for the existing
> programs which rely on the exact format, I can change that to a new file.
>
>
> Finally, the patches work in my testing but I have probably missed finer
> lock/RCU details.
>
> -Topi
>


* Re: [RFC 00/18] Present useful limits to user
  2016-06-14 19:03 ` [RFC 00/18] Present useful limits to user Konstantin Khlebnikov
@ 2016-06-14 19:46   ` Topi Miettinen
  2016-06-15 14:47   ` Austin S. Hemmelgarn
  1 sibling, 0 replies; 56+ messages in thread
From: Topi Miettinen @ 2016-06-14 19:46 UTC (permalink / raw)
  To: Konstantin Khlebnikov; +Cc: Linux Kernel Mailing List

On 06/14/16 19:03, Konstantin Khlebnikov wrote:
> I don't like the idea of this patchset.
> 
> All limitations are context dependent and that context changes rapidly.
> You'll never dump enough information for predicting future errors or
> investigating the reason for errors in the past. You could try to reproduce
> all kernel logic but the model will always be approximate.
> 

But that is true regardless of how the starting point for the limits was
determined. There will be always a possibility of setting too tight
limits which may work for a couple of test runs but which could also
eventually fail. The opposite is also possible, to use too loose limits
which are not effective. That's the way with limits in any case.

> If you want to track origin of failures in user space applications when it hits
> some limit you should track errors. For example rlimits and other limitation
> subsystems could provide a reasonable amount of tracepoints which could
> tell what exactly happened before error. If you need highwater of some
> values you could track it in userspace, or maybe tracing subsystem could
> provide postprocessing for tracepoint parameters. Anyway, systemtap and
> other monsters can do this right now.
> 

Those tools could help improving the starting point. But how could they
give exact value for that?

With this patch set, the user can just look at files in /proc and simply
copy the values to a config file as a starting point. What would be the
work flow with the tracepoint approach?

> On Mon, Jun 13, 2016 at 10:44 PM, Topi Miettinen <toiwoton@gmail.com> wrote:
>> Hello,
>>
>> There are many basic ways to control processes, including capabilities,
>> cgroups and resource limits. However, there are far fewer ways to find out
>> useful values for the limits, except blind trial and error.
>>
>> This patch series attempts to fix that by giving at least a nice starting
>> point from the actual maximum values. I looked where each limit is checked
>> and added a call to limit bump nearby.
>>
>>
>> Capabilities
>> [RFC 01/18] capabilities: track actually used capabilities
>>
>> Currently, there is no way to know which capabilities are actually used. Even
>> the source code is only implicit, in-depth knowledge of each capability must
>> be used when analyzing a program to judge which capabilities the program will
>> exercise.
>>
>> Cgroups
>> [RFC 02/18] cgroup_pids: track maximum pids
>> [RFC 03/18] memcontrol: present maximum used memory also for
>> [RFC 04/18] device_cgroup: track and present accessed devices
>>
>> For tasks and memory cgroup limits the situation is somewhat better as the
>> current tasks and memory status can be easily seen with ps(1). However, any
>> transient tasks or temporary higher memory use might slip from the view.
>> Device use may be seen with advanced MAC tools, like TOMOYO, but there is no
>> universal method. Program sources typically give no useful indication about
>> memory use or how many tasks there could be.
>>
>> Resource limits
>> [RFC 05/18] limits: track and present RLIMIT_NOFILE actual max
>> [RFC 06/18] limits: present RLIMIT_CPU and RLIMIT_RTTIMER current
>> [RFC 07/18] limits: track RLIMIT_FSIZE actual max
>> [RFC 08/18] limits: track RLIMIT_DATA actual max
>> [RFC 09/18] limits: track RLIMIT_CORE actual max
>> [RFC 10/18] limits: track RLIMIT_STACK actual max
>> [RFC 11/18] limits: track and present RLIMIT_NPROC actual max
>> [RFC 12/18] limits: track RLIMIT_MEMLOCK actual max
>> [RFC 13/18] limits: track RLIMIT_AS actual max
>> [RFC 14/18] limits: track RLIMIT_SIGPENDING actual max
>> [RFC 15/18] limits: track RLIMIT_MSGQUEUE actual max
>> [RFC 16/18] limits: track RLIMIT_NICE actual max
>> [RFC 17/18] limits: track RLIMIT_RTPRIO actual max
>> [RFC 18/18] proc: present VM_LOCKED memory in /proc/self/maps
>>
>> Current number of files and current VM usage (data pages, address space size)
>> could be calculated from available /proc files. Again, any temporarily higher
>> values could be easily missed. For many limits, there is no way to see what
>> is the current situation and source code is mostly useless.
>>
>> As a side note, the resource limits seem to be in bad shape. For example,
>> RLIMIT_MEMLOCK is used incoherently and I think VM statistics can miss
>> some changes. Adding RLIMIT_CODE could be useful.
>>
>> The current maximum values for the resource limits are now shown in
>> /proc/task/limits. If this is deemed too confusing for the existing
>> programs which rely on the exact format, I can change that to a new file.
>>
>>
>> Finally, the patches work in my testing but I have probably missed finer
>> lock/RCU details.
>>
>> -Topi
>>


* Re: [RFC 11/18] limits: track and present RLIMIT_NPROC actual max
  2016-06-14 15:40     ` Topi Miettinen
@ 2016-06-14 23:15       ` Jann Horn
  0 siblings, 0 replies; 56+ messages in thread
From: Jann Horn @ 2016-06-14 23:15 UTC (permalink / raw)
  To: Topi Miettinen
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Kees Cook, Al Viro, Alexey Dobriyan, John Stultz,
	Janis Danisevskis, Calvin Owens, Tejun Heo, Michal Hocko,
	Oleg Nesterov, Vladimir Davydov, Andrea Arcangeli, Josh Triplett,
	Eric W. Biederman, Aleksa Sarai, Cyrill Gorcunov, Ben Segall,
	Mateusz Guzik

[-- Attachment #1: Type: text/plain, Size: 1746 bytes --]

On Tue, Jun 14, 2016 at 03:40:35PM +0000, Topi Miettinen wrote:
> On 06/13/16 22:27, Jann Horn wrote:
> > On Mon, Jun 13, 2016 at 10:44:18PM +0300, Topi Miettinen wrote:
> >> Track maximum number of processes per user and present it
> >> in /proc/self/limits.
> >>
> >> Signed-off-by: Topi Miettinen <toiwoton@gmail.com>
> >> ---
> >>  fs/proc/base.c        | 4 ++++
> >>  include/linux/sched.h | 1 +
> >>  kernel/fork.c         | 5 +++++
> >>  kernel/sys.c          | 5 +++++
> >>  4 files changed, 15 insertions(+)
> >>
> >> diff --git a/fs/proc/base.c b/fs/proc/base.c
> >> index 1df4fc8..02576c6 100644
> >> --- a/fs/proc/base.c
> >> +++ b/fs/proc/base.c
> >> @@ -670,6 +670,10 @@ static int proc_pid_limits(struct seq_file *m, struct pid_namespace *ns,
> >>  				seq_printf(m, "%-20lu\n", psecs);
> >>  			}
> >>  			break;
> >> +		case RLIMIT_NPROC:
> >> +			seq_printf(m, "%-20d\n",
> >> +				   atomic_read(&task->real_cred->user->max_processes));
> > 
> > Don't you have to take an RCU read lock before dereferencing task->real_cred?
> 
> In other comments in the series, cmpxchg loop was suggested, would that
> work here?

What would a cmpxchg loop have to do with missing RCU locking?

> > And shouldn't this be done with __task_cred(task) instead of task->real_cred?
> 
> How about atomic_read(task_cred_xxx(task, user)->max_processes)?

No. You'd still end up dereferencing max_processes in the user_struct without
any guarantee that it hasn't been freed. I think the code should look this way:

    case RLIMIT_NPROC:
        rcu_read_lock();
        seq_printf(m, "%-20d\n",
            atomic_read(&__task_cred(task)->user->max_processes));
        rcu_read_unlock();
        break;

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]


* Re: [RFC 00/18] Present useful limits to user
  2016-06-14 19:03 ` [RFC 00/18] Present useful limits to user Konstantin Khlebnikov
  2016-06-14 19:46   ` Topi Miettinen
@ 2016-06-15 14:47   ` Austin S. Hemmelgarn
  2016-06-18 14:45     ` Konstantin Khlebnikov
  1 sibling, 1 reply; 56+ messages in thread
From: Austin S. Hemmelgarn @ 2016-06-15 14:47 UTC (permalink / raw)
  To: Konstantin Khlebnikov, Topi Miettinen; +Cc: Linux Kernel Mailing List

On 2016-06-14 15:03, Konstantin Khlebnikov wrote:
> I don't like the idea of this patchset.
>
> All limitations are context dependent and that context changes rapidly.
> You'll never dump enough information for predicting future errors or
> investigating the reason for errors in the past. You could try to reproduce
> all kernel logic but the model will always be approximate.
It's still better than what we have now, and there is one particular use 
for the cgroup stuff that I find intriguing: you can create a cgroup,
populate it, set no limits, and then run a simulated workload against it 
and see how it reacts.  This in general will probably provide a better 
starting point for what to actually set the limits to than just making 
an arbitrary guess.  Certain applications in particular come to mind 
which will just hang when they can't start a new thread or process 
(Dropbox is particularly guilty of this).  In such cases, setting the 
limit too low doesn't result in a crash, it results in the program just 
not appearing to work yet still running otherwise normally.

In general, I could see the rlimit stuff being in the same situation, 
it's not for figuring out why something failed (good software will tell 
you somewhere), but figuring out limits so it doesn't fail but still is 
reasonably contained.  A lot of things that seem at face value like they 
shouldn't need specific exceptions to limits do.  Most normal users 
probably wouldn't guess that acpid needs an RLIMIT_NPROC count of at
least 4 to work with the default rules.  Similarly, there's
probably not many normal users who know that the Dropbox daemon spawns 
an insanely large thread pool and preallocates significant amounts of 
memory and will just hang if either of these fail.  By having a way to 
get running max counts of resource usage, it makes it easier for people 
to know what the minimum limit they need to put on something is.
> If you want to track origin of failures in user space applications when it hits
> some limit you should track errors. For example rlimits and other limitation
> subsystems could provide a reasonable amount of tracepoints which could
> tell what exactly happened before error. If you need highwater of some
> values you could track it in userspace, or maybe tracing subsystem could
> provide postprocessing for tracepoint parameters. Anyway, systemtap and
> other monsters can do this right now.
Userspace tracking of some things just isn't practical.  Take 
RLIMIT_NPROC for example.  There's not really any reliable way to track 
this from userspace without modifying the process which is being 
tracked, which is not a user friendly way of doing things, and in some 
cases is functionally impossible for an end user to do.
>
> On Mon, Jun 13, 2016 at 10:44 PM, Topi Miettinen <toiwoton@gmail.com> wrote:
>> Hello,
>>
>> There are many basic ways to control processes, including capabilities,
>> cgroups and resource limits. However, there are far fewer ways to find out
>> useful values for the limits, except blind trial and error.
>>
>> This patch series attempts to fix that by giving at least a nice starting
>> point from the actual maximum values. I looked where each limit is checked
>> and added a call to limit bump nearby.
>>
>>
>> Capabilities
>> [RFC 01/18] capabilities: track actually used capabilities
>>
>> Currently, there is no way to know which capabilities are actually used. Even
>> the source code is only implicit, in-depth knowledge of each capability must
>> be used when analyzing a program to judge which capabilities the program will
>> exercise.
>>
>> Cgroups
>> [RFC 02/18] cgroup_pids: track maximum pids
>> [RFC 03/18] memcontrol: present maximum used memory also for
>> [RFC 04/18] device_cgroup: track and present accessed devices
>>
>> For tasks and memory cgroup limits the situation is somewhat better as the
>> current tasks and memory status can be easily seen with ps(1). However, any
>> transient tasks or temporary higher memory use might slip from the view.
>> Device use may be seen with advanced MAC tools, like TOMOYO, but there is no
>> universal method. Program sources typically give no useful indication about
>> memory use or how many tasks there could be.
>>
>> Resource limits
>> [RFC 05/18] limits: track and present RLIMIT_NOFILE actual max
>> [RFC 06/18] limits: present RLIMIT_CPU and RLIMIT_RTTIMER current
>> [RFC 07/18] limits: track RLIMIT_FSIZE actual max
>> [RFC 08/18] limits: track RLIMIT_DATA actual max
>> [RFC 09/18] limits: track RLIMIT_CORE actual max
>> [RFC 10/18] limits: track RLIMIT_STACK actual max
>> [RFC 11/18] limits: track and present RLIMIT_NPROC actual max
>> [RFC 12/18] limits: track RLIMIT_MEMLOCK actual max
>> [RFC 13/18] limits: track RLIMIT_AS actual max
>> [RFC 14/18] limits: track RLIMIT_SIGPENDING actual max
>> [RFC 15/18] limits: track RLIMIT_MSGQUEUE actual max
>> [RFC 16/18] limits: track RLIMIT_NICE actual max
>> [RFC 17/18] limits: track RLIMIT_RTPRIO actual max
>> [RFC 18/18] proc: present VM_LOCKED memory in /proc/self/maps
>>
>> Current number of files and current VM usage (data pages, address space size)
>> could be calculated from available /proc files. Again, any temporarily higher
>> values could be easily missed. For many limits, there is no way to see what
>> is the current situation and source code is mostly useless.
>>
>> As a side note, the resource limits seem to be in bad shape. For example,
>> RLIMIT_MEMLOCK is used incoherently and I think VM statistics can miss
>> some changes. Adding RLIMIT_CODE could be useful.
>>
>> The current maximum values for the resource limits are now shown in
>> /proc/task/limits. If this is deemed too confusing for the existing
>> programs which rely on the exact format, I can change that to a new file.
>>
>>
>> Finally, the patches work in my testing but I have probably missed finer
>> lock/RCU details.
>>
>> -Topi
>>


* Re: [RFC 03/18] memcontrol: present maximum used memory also for cgroup-v2
  2016-06-14 17:15         ` Topi Miettinen
@ 2016-06-16 10:27           ` Michal Hocko
  0 siblings, 0 replies; 56+ messages in thread
From: Michal Hocko @ 2016-06-16 10:27 UTC (permalink / raw)
  To: Topi Miettinen
  Cc: Johannes Weiner, linux-kernel, Vladimir Davydov, Andrew Morton,
	open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG),
	open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG)

On Tue 14-06-16 17:15:06, Topi Miettinen wrote:
> On 06/14/16 16:04, Johannes Weiner wrote:
[...]
> > I didn't catch the coverletter, though. It makes sense to CC
> > recipients of any of those patches on the full series, including the
> > cover, since even though we are specialized in certain areas of the
> > code, many of us are interested in the whole picture of addressing a
> > problem, and not just the few bits in our area without more context.
> > 
> 
> Thank you for this nice explanation. I suppose "git send-email
> --cc-cmd=scripts/get_maintainer.pl" doesn't do this.

No, it doesn't. What I do for this kind of series is the following: put
an explicit CC (acked, reviews etc.) on each patch, run git format-patch
$RANGE, and then
$ git send-email --cc-cmd=./cc-cmd-only-cover.sh $DEFAULT_TO_CC --compose *.patch

$ cat cc-cmd-only-cover.sh
#!/bin/bash

if [[ $1 == *gitsendemail.msg* || $1 == *cover-letter* ]]; then
        grep '<.*@.*>' -h *.patch | sed 's/^.*: //' | sort | uniq
fi

A bit error prone, because you have to clean up any previous patch files
from the directory, but it works more or less well for me.

> > As far as the memcg part of this series goes, one concern is that page
> > cache is trimmed back only when there is pressure, so in all but very
> > few cases the high watermark you are introducing will be pegged to the
> > configured limit. It doesn't give a whole lot of insight.
> > 
> 
> So using the high watermark would not give a very useful starting point
> for the user who wished to configure the memory limit? What else could
> be used instead?

We have an event notification mechanism. In v1 it is vmpressure, and in v2
you will get a notification when the high/max limit is hit or when we
hit the OOM.
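
For illustration only, here is a minimal userspace sketch of reading the v2
counters. It assumes the cgroup's memory.events file holds one "name value"
pair per line (low, high, max, oom). memcg_event_count() is a hypothetical
helper, not a kernel or library API, and reading the file and waiting for it
to change is left to the caller.

```c
#include <stdio.h>
#include <string.h>

/*
 * Look up one counter in text formatted like cgroup v2 "memory.events":
 * one "name value" pair per line. Returns 0 and stores the value on
 * success, -1 if the key is absent.
 */
static int memcg_event_count(const char *text, const char *key,
                             unsigned long long *value)
{
    size_t keylen = strlen(key);
    const char *line = text;

    while (line && *line) {
        /* Match "key " at the start of a line, then parse the number. */
        if (strncmp(line, key, keylen) == 0 && line[keylen] == ' ' &&
            sscanf(line + keylen + 1, "%llu", value) == 1)
            return 0;
        /* Advance to the character after the next newline, if any. */
        line = strchr(line, '\n');
        if (line)
            line++;
    }
    return -1;
}
```

A monitor would re-read the file after each notification and compare the
"high" and "max" counts with the previous values to see which limit was hit.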
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC 04/18] device_cgroup: track and present accessed devices
  2016-06-13 19:44 ` [RFC 04/18] device_cgroup: track and present accessed devices Topi Miettinen
@ 2016-06-17 15:22   ` Serge E. Hallyn
  0 siblings, 0 replies; 56+ messages in thread
From: Serge E. Hallyn @ 2016-06-17 15:22 UTC (permalink / raw)
  To: Topi Miettinen
  Cc: linux-kernel, James Morris, Serge E. Hallyn,
	open list:SECURITY SUBSYSTEM

Quoting Topi Miettinen (toiwoton@gmail.com):
> Track what devices are accessed and present them cgroup devices.accessed.
> 
> Signed-off-by: Topi Miettinen <toiwoton@gmail.com>
> ---
>  security/device_cgroup.c | 70 +++++++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 60 insertions(+), 10 deletions(-)
> 
> diff --git a/security/device_cgroup.c b/security/device_cgroup.c
> index 03c1652..45aa730 100644
> --- a/security/device_cgroup.c
> +++ b/security/device_cgroup.c
> @@ -48,6 +48,7 @@ struct dev_exception_item {
>  struct dev_cgroup {
>  	struct cgroup_subsys_state css;
>  	struct list_head exceptions;
> +	struct list_head accessed;
>  	enum devcg_behavior behavior;
>  };
>  
> @@ -90,7 +91,7 @@ free_and_exit:
>  /*
>   * called under devcgroup_mutex
>   */
> -static int dev_exception_add(struct dev_cgroup *dev_cgroup,
> +static int dev_exception_add(struct list_head *exceptions,
>  			     struct dev_exception_item *ex)

If you're going to reuse this function for the accessed list, then it
should be renamed, because as it stands the name is misleading.

It also should be restructured.  The add-exceptions case was rare, so
doing kmemdup before checking for duplicates was ok.  But for the
accessed list I think we want to check for duplicates before we kmemdup.
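
To make the ordering concrete, here is a rough userspace model of "check
first, allocate only on a miss". It is a sketch under assumptions, not the
kernel code: struct item and accessed_add() are hypothetical names, malloc()
stands in for kmemdup(), a plain singly linked list stands in for the RCU
list, new entries go at the head rather than the tail, and locking is
omitted entirely.

```c
#include <stdlib.h>

/* Simplified stand-in for struct dev_exception_item. */
struct item {
    short type;
    unsigned int major, minor;
    short access;
    struct item *next;
};

/*
 * Record an access. Walk the list first and allocate only when no entry
 * for the same device exists, so the common repeated-access case costs
 * no allocation. Returns 0 on success, -1 on allocation failure.
 */
static int accessed_add(struct item **head, const struct item *ex)
{
    struct item *walk, *copy;

    for (walk = *head; walk; walk = walk->next) {
        if (walk->type == ex->type && walk->major == ex->major &&
            walk->minor == ex->minor) {
            walk->access |= ex->access;   /* merge access bits */
            return 0;
        }
    }

    /* No match: only now pay for the copy. */
    copy = malloc(sizeof(*copy));
    if (!copy)
        return -1;
    *copy = *ex;
    copy->next = *head;
    *head = copy;
    return 0;
}
```

Since the permission check runs on every device access, the duplicate hit is
the hot path here, which is why the allocation is moved behind the walk.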

>  {
>  	struct dev_exception_item *excopy, *walk;
> @@ -101,7 +102,7 @@ static int dev_exception_add(struct dev_cgroup *dev_cgroup,
>  	if (!excopy)
>  		return -ENOMEM;
>  
> -	list_for_each_entry(walk, &dev_cgroup->exceptions, list) {
> +	list_for_each_entry(walk, exceptions, list) {
>  		if (walk->type != ex->type)
>  			continue;
>  		if (walk->major != ex->major)
> @@ -115,7 +116,7 @@ static int dev_exception_add(struct dev_cgroup *dev_cgroup,
>  	}
>  
>  	if (excopy != NULL)
> -		list_add_tail_rcu(&excopy->list, &dev_cgroup->exceptions);
> +		list_add_tail_rcu(&excopy->list, exceptions);
>  	return 0;
>  }
>  
> @@ -155,6 +156,16 @@ static void __dev_exception_clean(struct dev_cgroup *dev_cgroup)
>  	}
>  }
>  
> +static void dev_accessed_clean(struct dev_cgroup *dev_cgroup)
> +{
> +	struct dev_exception_item *ex, *tmp;
> +
> +	list_for_each_entry_safe(ex, tmp, &dev_cgroup->accessed, list) {
> +		list_del_rcu(&ex->list);
> +		kfree_rcu(ex, rcu);
> +	}
> +}
> +
>  /**
>   * dev_exception_clean - frees all entries of the exception list
>   * @dev_cgroup: dev_cgroup with the exception list to be cleaned
> @@ -221,6 +232,7 @@ devcgroup_css_alloc(struct cgroup_subsys_state *parent_css)
>  	if (!dev_cgroup)
>  		return ERR_PTR(-ENOMEM);
>  	INIT_LIST_HEAD(&dev_cgroup->exceptions);
> +	INIT_LIST_HEAD(&dev_cgroup->accessed);
>  	dev_cgroup->behavior = DEVCG_DEFAULT_NONE;
>  
>  	return &dev_cgroup->css;
> @@ -231,6 +243,7 @@ static void devcgroup_css_free(struct cgroup_subsys_state *css)
>  	struct dev_cgroup *dev_cgroup = css_to_devcgroup(css);
>  
>  	__dev_exception_clean(dev_cgroup);
> +	dev_accessed_clean(dev_cgroup);
>  	kfree(dev_cgroup);
>  }
>  
> @@ -272,9 +285,9 @@ static void set_majmin(char *str, unsigned m)
>  		sprintf(str, "%u", m);
>  }
>  
> -static int devcgroup_seq_show(struct seq_file *m, void *v)
> +static int devcgroup_seq_show_list(struct seq_file *m, struct dev_cgroup *devcgroup,
> +				   struct list_head *exceptions, bool allow)
>  {
> -	struct dev_cgroup *devcgroup = css_to_devcgroup(seq_css(m));
>  	struct dev_exception_item *ex;
>  	char maj[MAJMINLEN], min[MAJMINLEN], acc[ACCLEN];
>  
> @@ -285,14 +298,14 @@ static int devcgroup_seq_show(struct seq_file *m, void *v)
>  	 * - List the exceptions in case the default policy is to deny
>  	 * This way, the file remains as a "whitelist of devices"
>  	 */
> -	if (devcgroup->behavior == DEVCG_DEFAULT_ALLOW) {
> +	if (allow) {
>  		set_access(acc, ACC_MASK);
>  		set_majmin(maj, ~0);
>  		set_majmin(min, ~0);
>  		seq_printf(m, "%c %s:%s %s\n", type_to_char(DEV_ALL),
>  			   maj, min, acc);
>  	} else {
> -		list_for_each_entry_rcu(ex, &devcgroup->exceptions, list) {
> +		list_for_each_entry_rcu(ex, exceptions, list) {
>  			set_access(acc, ex->access);
>  			set_majmin(maj, ex->major);
>  			set_majmin(min, ex->minor);
> @@ -305,6 +318,36 @@ static int devcgroup_seq_show(struct seq_file *m, void *v)
>  	return 0;
>  }
>  
> +static int devcgroup_seq_show(struct seq_file *m, void *v)
> +{
> +	struct dev_cgroup *devcgroup = css_to_devcgroup(seq_css(m));
> +
> +	return devcgroup_seq_show_list(m, devcgroup, &devcgroup->exceptions,
> +				       devcgroup->behavior == DEVCG_DEFAULT_ALLOW);
> +}
> +
> +static int devcgroup_seq_show_accessed(struct seq_file *m, void *v)
> +{
> +	struct dev_cgroup *devcgroup = css_to_devcgroup(seq_css(m));
> +
> +	return devcgroup_seq_show_list(m, devcgroup, &devcgroup->accessed, false);
> +}
> +
> +static void devcgroup_add_accessed(struct dev_cgroup *dev_cgroup, short type,
> +				   u32 major, u32 minor, short access)
> +{
> +	struct dev_exception_item ex;
> +
> +	ex.type = type;
> +	ex.major = major;
> +	ex.minor = minor;
> +	ex.access = access;
> +
> +	mutex_lock(&devcgroup_mutex);
> +	dev_exception_add(&dev_cgroup->accessed, &ex);
> +	mutex_unlock(&devcgroup_mutex);
> +}
> +
>  /**
>   * match_exception	- iterates the exception list trying to find a complete match
>   * @exceptions: list of exceptions
> @@ -566,7 +609,7 @@ static int propagate_exception(struct dev_cgroup *devcg_root,
>  		 */
>  		if (devcg_root->behavior == DEVCG_DEFAULT_ALLOW &&
>  		    devcg->behavior == DEVCG_DEFAULT_ALLOW) {
> -			rc = dev_exception_add(devcg, ex);
> +			rc = dev_exception_add(&devcg->exceptions, ex);
>  			if (rc)
>  				break;
>  		} else {
> @@ -736,7 +779,7 @@ static int devcgroup_update_access(struct dev_cgroup *devcgroup,
>  
>  		if (!parent_has_perm(devcgroup, &ex))
>  			return -EPERM;
> -		rc = dev_exception_add(devcgroup, &ex);
> +		rc = dev_exception_add(&devcgroup->exceptions, &ex);
>  		break;
>  	case DEVCG_DENY:
>  		/*
> @@ -747,7 +790,7 @@ static int devcgroup_update_access(struct dev_cgroup *devcgroup,
>  		if (devcgroup->behavior == DEVCG_DEFAULT_DENY)
>  			dev_exception_rm(devcgroup, &ex);
>  		else
> -			rc = dev_exception_add(devcgroup, &ex);
> +			rc = dev_exception_add(&devcgroup->exceptions, &ex);
>  
>  		if (rc)
>  			break;
> @@ -788,6 +831,11 @@ static struct cftype dev_cgroup_files[] = {
>  		.seq_show = devcgroup_seq_show,
>  		.private = DEVCG_LIST,
>  	},
> +	{
> +		.name = "accessed",
> +		.seq_show = devcgroup_seq_show_accessed,
> +		.private = DEVCG_LIST,
> +	},
>  	{ }	/* terminate */
>  };
>  
> @@ -830,6 +878,8 @@ static int __devcgroup_check_permission(short type, u32 major, u32 minor,
>  	if (!rc)
>  		return -EPERM;
>  
> +	devcgroup_add_accessed(dev_cgroup, type, major, minor, access);
> +
>  	return 0;
>  }
>  
> -- 
> 2.8.1

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC 15/18] limits: track RLIMIT_MSGQUEUE actual max
  2016-06-13 19:44 ` [RFC 15/18] limits: track RLIMIT_MSGQUEUE " Topi Miettinen
@ 2016-06-17 19:52   ` Doug Ledford
  0 siblings, 0 replies; 56+ messages in thread
From: Doug Ledford @ 2016-06-17 19:52 UTC (permalink / raw)
  To: Topi Miettinen, linux-kernel
  Cc: Andrew Morton, Michal Hocko, Al Viro, Vladimir Davydov,
	Marcus Gelderie, Kirill A. Shutemov


[-- Attachment #1.1: Type: text/plain, Size: 1767 bytes --]

On 6/13/2016 3:44 PM, Topi Miettinen wrote:
> Track maximum size of message queues, presented in /proc/self/limits.
> 
> Signed-off-by: Topi Miettinen <toiwoton@gmail.com>
> ---
>  ipc/mqueue.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/ipc/mqueue.c b/ipc/mqueue.c
> index ade739f..edccf55 100644
> --- a/ipc/mqueue.c
> +++ b/ipc/mqueue.c
> @@ -287,6 +287,8 @@ static struct inode *mqueue_get_inode(struct super_block *sb,
>  
>  		/* all is ok */
>  		info->user = get_uid(u);
> +		/* XXX resource limits apply per task, not per user */
> +		bump_rlimit(RLIMIT_MSGQUEUE, u->mq_bytes);
>  	} else if (S_ISDIR(mode)) {
>  		inc_nlink(inode);
>  		/* Some things misbehave if size == 0 on a directory */
> 

This patch looks all sorts of wrong to me.

In a current Linus tree I can't find a single instance of bump_rlimit.
Where is this magical function coming from?

Second, u->mq_bytes is the current size of all message queues for a
given user.  It is not per-task.  So your message about limits being
per-task is wrong (at least partially, the actual byte count is per-user
not per-task, but the limit we check when we create a new queue is
per-task and not per-user).  So your comment is wrong, the one
functional line you added appears to be a non-existent function, and
even if those two things are resolved, why in the world would the fact
that we created a new message queue mean we should bump our rlimit?
That makes no sense, because we would *never* have a working rlimit any
more; we would simply increase our rlimit by the size of our existing
queues every time we made a queue.

This is just a totally broken patch.  Major NAK.

-- 
Doug Ledford <dledford@redhat.com>
    GPG Key ID: 0E572FDD


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 884 bytes --]

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC 00/18] Present useful limits to user
  2016-06-15 14:47   ` Austin S. Hemmelgarn
@ 2016-06-18 14:45     ` Konstantin Khlebnikov
  2016-06-19  6:38       ` Topi Miettinen
  2016-06-20 17:37       ` Austin S. Hemmelgarn
  0 siblings, 2 replies; 56+ messages in thread
From: Konstantin Khlebnikov @ 2016-06-18 14:45 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Topi Miettinen, Linux Kernel Mailing List

On Wed, Jun 15, 2016 at 5:47 PM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:
> On 2016-06-14 15:03, Konstantin Khlebnikov wrote:
>>
>> I don't like the idea of this patchset.
>>
>> All limitations are context dependent and that context changes rapidly.
>> You'll never dump enough information for predicting future errors or
>> investigating reason of errors in past. You could try to reproduce all
>> kernel logic but model always will be approximate.
>
> It's still better than what we have now, and there is one particular use for
> the cgroup stuff that I find intriguing, you can create a cgroup, populate
> it, set no limits, and then run a simulated workload against it and see how
> it reacts.  This in general will probably provide a better starting point
> for what to actually set the limits to than just making an arbitrary guess.
> Certain applications in particular come to mind which will just hang when
> they can't start a new thread or process (Dropbox is particularly guilty of
> this).  In such cases, setting the limit too low doesn't result in a crash,
> it results in the program just not appearing to work yet still running
> otherwise normally.
>
> In general, I could see the rlimit stuff being in the same situation, it's
> not for figuring out why something failed (good software will tell you
> somewhere), but figuring out limits so it doesn't fail but still is
> reasonably contained.  A lot of things that seem at face value like they
> shouldn't need specific exceptions to limits do.  Most normal users probably
> wouldn't guess that acpid needs a RLIMIT_NPROC count of at least 4 or more
> to work with the default rules.  Similarly, there's probably not many normal
> users who know that the Dropbox daemon spawns an insanely large thread pool
> and preallocates significant amounts of memory and will just hang if either
> of these fail.  By having a way to get running max counts of resource usage,
> it makes it easier for people to know what the minimum limit they need to
> put on something is.

Rlimits work only if resource usage could be estimated a priori.
They allow an app to limit itself to prevent failures if something goes wrong.

Rlimits are useless for controlling resource distribution: just use
cgroups for that.

>>
>> If you want to track origin of failures in user space applications when it
>> hits
>> some limit you should track errors. For example rlimits and other
>> limitation
>> subsystems could provide reasonable amount of tracepoints which could
>> tell what exactly happened before error. If you need highwater of some
>> values you could track it in userspace, or maybe tracing subsystem could
>> provide postprocessing for tracepoint parameters. Anyway, systemtap and
>> other monsters can do this right now.
>
> Userspace tracking of some things just isn't practical.  Take RLIMIT_NPROC
> for example.  There's not really any reliable way to track this from
> userspace without modifying the process which is being tracked, which is not
> a user friendly way of doing things, and in some cases is functionally
> impossible for an end user to do.

You cannot get reliable upper bound for nr-proc from black box observations.
Highwater mark is very racy - tiny timing shifts can change it dramatically.

>
>>
>> On Mon, Jun 13, 2016 at 10:44 PM, Topi Miettinen <toiwoton@gmail.com>
>> wrote:
>>>
>>> Hello,
>>>
>>> There are many basic ways to control processes, including capabilities,
>>> cgroups and resource limits. However, there are far fewer ways to find
>>> out
>>> useful values for the limits, except blind trial and error.
>>>
>>> This patch series attempts to fix that by giving at least a nice starting
>>> point from the actual maximum values. I looked where each limit is
>>> checked
>>> and added a call to limit bump nearby.
>>>
>>>
>>> Capabilities
>>> [RFC 01/18] capabilities: track actually used capabilities
>>>
>>> Currently, there is no way to know which capabilities are actually used.
>>> Even
>>> the source code is only implicit, in-depth knowledge of each capability
>>> must
>>> be used when analyzing a program to judge which capabilities the program
>>> will
>>> exercise.
>>>
>>> Cgroups
>>> [RFC 02/18] cgroup_pids: track maximum pids
>>> [RFC 03/18] memcontrol: present maximum used memory also for
>>> [RFC 04/18] device_cgroup: track and present accessed devices
>>>
>>> For tasks and memory cgroup limits the situation is somewhat better as
>>> the
>>> current tasks and memory status can be easily seen with ps(1). However,
>>> any
>>> transient tasks or temporary higher memory use might slip from the view.
>>> Device use may be seen with advanced MAC tools, like TOMOYO, but there is
>>> no
>>> universal method. Program sources typically give no useful indication
>>> about
>>> memory use or how many tasks there could be.
>>>
>>> Resource limits
>>> [RFC 05/18] limits: track and present RLIMIT_NOFILE actual max
>>> [RFC 06/18] limits: present RLIMIT_CPU and RLIMIT_RTTIMER current
>>> [RFC 07/18] limits: track RLIMIT_FSIZE actual max
>>> [RFC 08/18] limits: track RLIMIT_DATA actual max
>>> [RFC 09/18] limits: track RLIMIT_CORE actual max
>>> [RFC 10/18] limits: track RLIMIT_STACK actual max
>>> [RFC 11/18] limits: track and present RLIMIT_NPROC actual max
>>> [RFC 12/18] limits: track RLIMIT_MEMLOCK actual max
>>> [RFC 13/18] limits: track RLIMIT_AS actual max
>>> [RFC 14/18] limits: track RLIMIT_SIGPENDING actual max
>>> [RFC 15/18] limits: track RLIMIT_MSGQUEUE actual max
>>> [RFC 16/18] limits: track RLIMIT_NICE actual max
>>> [RFC 17/18] limits: track RLIMIT_RTPRIO actual max
>>> [RFC 18/18] proc: present VM_LOCKED memory in /proc/self/maps
>>>
>>> Current number of files and current VM usage (data pages, address space
>>> size)
>>> could be calculated from available /proc files. Again, any temporarily
>>> higher
>>> values could be easily missed. For many limits, there is no way to see
>>> what
>>> is the current situation and source code is mostly useless.
>>>
>>> As a side note, the resource limits seem to be in bad shape. For example,
>>> RLIMIT_MEMLOCK is used incoherently and I think VM statistics can miss
>>> some changes. Adding RLIMIT_CODE could be useful.
>>>
>>> The current maximum values for the resource limits are now shown in
>>> /proc/task/limits. If this is deemed too confusing for the existing
>>> programs which rely on the exact format, I can change that to a new file.
>>>
>>>
>>> Finally, the patches work in my testing but I have probably missed finer
>>> lock/RCU details.
>>>
>>> -Topi
>>>
>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC 00/18] Present useful limits to user
  2016-06-18 14:45     ` Konstantin Khlebnikov
@ 2016-06-19  6:38       ` Topi Miettinen
  2016-06-20 17:37       ` Austin S. Hemmelgarn
  1 sibling, 0 replies; 56+ messages in thread
From: Topi Miettinen @ 2016-06-19  6:38 UTC (permalink / raw)
  To: Konstantin Khlebnikov, Austin S. Hemmelgarn; +Cc: Linux Kernel Mailing List

On 06/18/16 14:45, Konstantin Khlebnikov wrote:
> On Wed, Jun 15, 2016 at 5:47 PM, Austin S. Hemmelgarn
> <ahferroin7@gmail.com> wrote:
>> On 2016-06-14 15:03, Konstantin Khlebnikov wrote:
>>>
>>> I don't like the idea of this patchset.
>>>
>>> All limitations are context dependent and that context changes rapidly.
>>> You'll never dump enough information for predicting future errors or
>>> investigating reason of errors in past. You could try to reproduce all
>>> kernel logic but model always will be approximate.
>>
>> It's still better than what we have now, and there is one particular use for
>> the cgroup stuff that I find intriguing, you can create a cgroup, populate
>> it, set no limits, and then run a simulated workload against it and see how
>> it reacts.  This in general will probably provide a better starting point
>> for what to actually set the limits to than just making an arbitrary guess.
>> Certain applications in particular come to mind which will just hang when
>> they can't start a new thread or process (Dropbox is particularly guilty of
>> this).  In such cases, setting the limit too low doesn't result in a crash,
>> it results in the program just not appearing to work yet still running
>> otherwise normally.
>>
>> In general, I could see the rlimit stuff being in the same situation, it's
>> not for figuring out why something failed (good software will tell you
>> somewhere), but figuring out limits so it doesn't fail but still is
>> reasonably contained.  A lot of things that seem at face value like they
>> shouldn't need specific exceptions to limits do.  Most normal users probably
>> wouldn't guess that acpid needs a RLIMIT_NPROC count of at least 4 or more
>> to work with the default rules.  Similarly, there's probably not many normal
>> users who know that the Dropbox daemon spawns an insanely large thread pool
>> and preallocates significant amounts of memory and will just hang if either
>> of these fail.  By having a way to get running max counts of resource usage,
>> it makes it easier for people to know what the minimum limit they need to
>> put on something is.
> 
> Rlimits work only if resource usage could be estimated a priori.

Now there's no way to estimate resource usage except analyzing source
code, and even that does not give any estimate for memory limits. With the
patches there's an easy way. Another way could be trial and error with
binary search. Anyway, even those estimates can be too small.

Perhaps the problem is that the limiting mechanisms are too discrete,
either everything is OK or system calls fail (or the process is killed),
there's no throttling in between. Also, the applications could in some
cases tolerate the failures better.

> They allow an app to limit itself to prevent failures if something goes wrong.
> 

The application is not in any better position to estimate the limits.
Typically the limits are set elsewhere.

> Rlimits are useless for controlling resource distribution: just use
> cgroups for that.
> 

There are no direct cgroup equivalents for most rlimits. I think the
memory rlimits also make more sense per task. But otherwise cgroup
approach would be more flexible and adding further cgroups doesn't look
very difficult.

>>>
>>> If you want to track origin of failures in user space applications when it
>>> hits
>>> some limit you should track errors. For example rlimits and other
>>> limitation
>>> subsystems could provide reasonable amount of tracepoints which could
>>> tell what exactly happened before error. If you need highwater of some
>>> values you could track it in userspace, or maybe tracing subsystem could
>>> provide postprocessing for tracepoint parameters. Anyway, systemtap and
>>> other monsters can do this right now.
>>
>> Userspace tracking of some things just isn't practical.  Take RLIMIT_NPROC
>> for example.  There's not really any reliable way to track this from
>> userspace without modifying the process which is being tracked, which is not
>> a user friendly way of doing things, and in some cases is functionally
>> impossible for an end user to do.
> 
> You cannot get reliable upper bound for nr-proc from black box observations.
> Highwater mark is very racy - tiny timing shifts can change it dramatically.
> 

The estimates are imperfect, but does this make the highwater mark
tracking any less valuable? How would you make a less imperfect
estimate with tracepoints?

>>
>>>
>>> On Mon, Jun 13, 2016 at 10:44 PM, Topi Miettinen <toiwoton@gmail.com>
>>> wrote:
>>>>
>>>> Hello,
>>>>
>>>> There are many basic ways to control processes, including capabilities,
>>>> cgroups and resource limits. However, there are far fewer ways to find
>>>> out
>>>> useful values for the limits, except blind trial and error.
>>>>
>>>> This patch series attempts to fix that by giving at least a nice starting
>>>> point from the actual maximum values. I looked where each limit is
>>>> checked
>>>> and added a call to limit bump nearby.
>>>>
>>>>
>>>> Capabilities
>>>> [RFC 01/18] capabilities: track actually used capabilities
>>>>
>>>> Currently, there is no way to know which capabilities are actually used.
>>>> Even
>>>> the source code is only implicit, in-depth knowledge of each capability
>>>> must
>>>> be used when analyzing a program to judge which capabilities the program
>>>> will
>>>> exercise.
>>>>
>>>> Cgroups
>>>> [RFC 02/18] cgroup_pids: track maximum pids
>>>> [RFC 03/18] memcontrol: present maximum used memory also for
>>>> [RFC 04/18] device_cgroup: track and present accessed devices
>>>>
>>>> For tasks and memory cgroup limits the situation is somewhat better as
>>>> the
>>>> current tasks and memory status can be easily seen with ps(1). However,
>>>> any
>>>> transient tasks or temporary higher memory use might slip from the view.
>>>> Device use may be seen with advanced MAC tools, like TOMOYO, but there is
>>>> no
>>>> universal method. Program sources typically give no useful indication
>>>> about
>>>> memory use or how many tasks there could be.
>>>>
>>>> Resource limits
>>>> [RFC 05/18] limits: track and present RLIMIT_NOFILE actual max
>>>> [RFC 06/18] limits: present RLIMIT_CPU and RLIMIT_RTTIMER current
>>>> [RFC 07/18] limits: track RLIMIT_FSIZE actual max
>>>> [RFC 08/18] limits: track RLIMIT_DATA actual max
>>>> [RFC 09/18] limits: track RLIMIT_CORE actual max
>>>> [RFC 10/18] limits: track RLIMIT_STACK actual max
>>>> [RFC 11/18] limits: track and present RLIMIT_NPROC actual max
>>>> [RFC 12/18] limits: track RLIMIT_MEMLOCK actual max
>>>> [RFC 13/18] limits: track RLIMIT_AS actual max
>>>> [RFC 14/18] limits: track RLIMIT_SIGPENDING actual max
>>>> [RFC 15/18] limits: track RLIMIT_MSGQUEUE actual max
>>>> [RFC 16/18] limits: track RLIMIT_NICE actual max
>>>> [RFC 17/18] limits: track RLIMIT_RTPRIO actual max
>>>> [RFC 18/18] proc: present VM_LOCKED memory in /proc/self/maps
>>>>
>>>> Current number of files and current VM usage (data pages, address space
>>>> size)
>>>> could be calculated from available /proc files. Again, any temporarily
>>>> higher
>>>> values could be easily missed. For many limits, there is no way to see
>>>> what
>>>> is the current situation and source code is mostly useless.
>>>>
>>>> As a side note, the resource limits seem to be in bad shape. For example,
>>>> RLIMIT_MEMLOCK is used incoherently and I think VM statistics can miss
>>>> some changes. Adding RLIMIT_CODE could be useful.
>>>>
>>>> The current maximum values for the resource limits are now shown in
>>>> /proc/task/limits. If this is deemed too confusing for the existing
>>>> programs which rely on the exact format, I can change that to a new file.
>>>>
>>>>
>>>> Finally, the patches work in my testing but I have probably missed finer
>>>> lock/RCU details.
>>>>
>>>> -Topi
>>>>
>>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC 00/18] Present useful limits to user
  2016-06-18 14:45     ` Konstantin Khlebnikov
  2016-06-19  6:38       ` Topi Miettinen
@ 2016-06-20 17:37       ` Austin S. Hemmelgarn
  1 sibling, 0 replies; 56+ messages in thread
From: Austin S. Hemmelgarn @ 2016-06-20 17:37 UTC (permalink / raw)
  To: Konstantin Khlebnikov; +Cc: Topi Miettinen, Linux Kernel Mailing List

On 2016-06-18 10:45, Konstantin Khlebnikov wrote:
> On Wed, Jun 15, 2016 at 5:47 PM, Austin S. Hemmelgarn
> <ahferroin7@gmail.com> wrote:
>> On 2016-06-14 15:03, Konstantin Khlebnikov wrote:
>>>
>>> I don't like the idea of this patchset.
>>>
>>> All limitations are context dependent and that context changes rapidly.
>>> You'll never dump enough information for predicting future errors or
>>> investigating reason of errors in past. You could try to reproduce all
>>> kernel logic but model always will be approximate.
>>
>> It's still better than what we have now, and there is one particular use for
>> the cgroup stuff that I find intriguing, you can create a cgroup, populate
>> it, set no limits, and then run a simulated workload against it and see how
>> it reacts.  This in general will probably provide a better starting point
>> for what to actually set the limits to than just making an arbitrary guess.
>> Certain applications in particular come to mind which will just hang when
>> they can't start a new thread or process (Dropbox is particularly guilty of
>> this).  In such cases, setting the limit too low doesn't result in a crash,
>> it results in the program just not appearing to work yet still running
>> otherwise normally.
>>
>> In general, I could see the rlimit stuff being in the same situation, it's
>> not for figuring out why something failed (good software will tell you
>> somewhere), but figuring out limits so it doesn't fail but still is
>> reasonably contained.  A lot of things that seem at face value like they
>> shouldn't need specific exceptions to limits do.  Most normal users probably
>> wouldn't guess that acpid needs a RLIMIT_NPROC count of at least 4 or more
>> to work with the default rules.  Similarly, there's probably not many normal
>> users who know that the Dropbox daemon spawns an insanely large thread pool
>> and preallocates significant amounts of memory and will just hang if either
>> of these fail.  By having a way to get running max counts of resource usage,
>> it makes it easier for people to know what the minimum limit they need to
>> put on something is.
>
> Rlimits work only if resource usage could be estimated a priori.
> They allow an app to limit itself to prevent failures if something goes wrong.
And yet many apps allow the _user_ to specify rlimits.  Avahi has the 
option for the user to set every single rlimit, ntpd (the reference 
implementation) lets the user configure MEMLOCK, and quite a few other 
daemons I've seen that are very widely used allow similar manual 
configuration of limits.  Most of these are network service daemons, 
which _can't_ reasonably limit themselves, because they can't know what 
type of workload they'll run against.
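
As a rough sketch of what such a daemon does with a user-supplied value
(set_soft_limit() is a hypothetical helper name; getrlimit() and setrlimit()
are the standard POSIX calls), it lowers or raises only the soft limit and
leaves the hard limit as it found it:

```c
#include <sys/resource.h>

/*
 * Set only the soft limit for a resource, leaving the hard limit alone.
 * Returns 0 on success, -1 if the value exceeds the hard limit or if
 * getrlimit()/setrlimit() fails.
 */
static int set_soft_limit(int resource, rlim_t soft)
{
    struct rlimit rl;

    if (getrlimit(resource, &rl) != 0)
        return -1;
    /* Refuse values an unprivileged process could not apply anyway. */
    if (rl.rlim_max != RLIM_INFINITY && soft > rl.rlim_max)
        return -1;
    rl.rlim_cur = soft;
    return setrlimit(resource, &rl);
}
```

The value passed in is exactly the number the administrator has to guess
today, for example set_soft_limit(RLIMIT_MEMLOCK, bytes_from_config).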
>
> Rlimits are useless for controlling resource distribution: just use
> cgroups for that.
The only rlimit that has a cgroup specifically for managing it is NPROC.
There's a bunch of memory ones that can't be individually controlled
in the memcg.  MEMLOCK is actually pretty widely used from what I've 
seen, but there is no way to control it at all with cgroups right now. 
NOFILE, LOCKS, FSIZE, and CORE all deal with the filesystem and have no 
cgroup that controls such resources (the only two that might be useful 
this way are NOFILE and LOCKS, but I doubt that those will get in, 
because they technically tie in with the kernel memory accounting in 
memcg).  NICE and RTPRIO are nonsensical in a cgroup context, although I 
don't think I've ever talked to anyone who actually uses them.  CPU and 
RTTIME have no equivalent in cgroups, but could in theory be tacked onto 
the cpu controller, but they haven't been and until that happens, people 
still have to use them instead of cgroups.
>
>>>
>>> If you want to track origin of failures in user space applications when it
>>> hits
>>> some limit you should track errors. For example rlimits and other
>>> limitation
>>> subsystems could provide reasonable amount of tracepoints which could
>>> tell what exactly happened before error. If you need highwater of some
>>> values you could track it in userspace, or maybe tracing subsystem could
>>> provide postprocessing for tracepoint parameters. Anyway, systemtap and
>>> other monsters can do this right now.
>>
>> Userspace tracking of some things just isn't practical.  Take RLIMIT_NPROC
>> for example.  There's not really any reliable way to track this from
>> userspace without modifying the process which is being tracked, which is not
>> a user friendly way of doing things, and in some cases is functionally
>> impossible for an end user to do.
>
> You cannot get reliable upper bound for nr-proc from black box observations.
> Highwater mark is very racy - tiny timing shifts can change it dramatically.
You can't get a perfectly reliable upper bound for any type of resource 
usage with just black box observations, period.  You also can't do so 
with tracing without some significant secondary work either for _exactly 
the same reason_.  The thing to remember though is that in a majority of 
cases, what most people need is simply a reasonable estimate which is 
guaranteed to not be below the actual usage.  They don't care exactly 
how many processes application Y uses at most, they just care that it 
uses fewer than some reasonable limit under normal usage.  To go back to 
the NPROC example, most people want to be able to set a limit that will 
catch things if they start to get out of hand, but absolutely have to 
estimate high because almost nothing handles a fork failure gracefully 
without completely shutting down.  In such a situation, it doesn't 
matter if it's a bit racy, as long as they have some reasonable lower 
bound to base the estimate off of and the specifics of it not being 100% 
reliable are properly documented.
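
A minimal userspace sketch of that idea, assuming a "bump" means recording a
running maximum rather than raising the limit itself. struct highwater,
highwater_bump() and highwater_suggest() are hypothetical names for
illustration, not anything from the patches, and overflow of the headroom
arithmetic is ignored.

```c
#include <stdatomic.h>

/* Highest value observed so far; only ever increases. */
struct highwater {
    _Atomic unsigned long max;
};

/* Record one observation, raising the mark if it is a new maximum. */
static void highwater_bump(struct highwater *hw, unsigned long observed)
{
    unsigned long cur = atomic_load(&hw->max);

    /* On a failed exchange, cur is reloaded and the test is repeated. */
    while (observed > cur &&
           !atomic_compare_exchange_weak(&hw->max, &cur, observed))
        ;
}

/* Suggest a limit: the observed peak plus a percentage of headroom. */
static unsigned long highwater_suggest(struct highwater *hw,
                                       unsigned int headroom_pct)
{
    unsigned long peak = atomic_load(&hw->max);

    return peak + peak * headroom_pct / 100;
}
```

The recorded peak is only a lower bound on what the workload can need, so
the headroom percentage is where the "estimate high" part comes in.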
>
>>
>>>
>>> On Mon, Jun 13, 2016 at 10:44 PM, Topi Miettinen <toiwoton@gmail.com>
>>> wrote:
>>>>
>>>> Hello,
>>>>
>>>> There are many basic ways to control processes, including capabilities,
>>>> cgroups and resource limits. However, there are far fewer ways to find
>>>> out
>>>> useful values for the limits, except blind trial and error.
>>>>
>>>> This patch series attempts to fix that by giving at least a nice starting
>>>> point from the actual maximum values. I looked where each limit is
>>>> checked
>>>> and added a call to limit bump nearby.
>>>>
>>>>
>>>> Capabilities
>>>> [RFC 01/18] capabilities: track actually used capabilities
>>>>
>>>> Currently, there is no way to know which capabilities are actually used.
>>>> Even
>>>> the source code is only implicit, in-depth knowledge of each capability
>>>> must
>>>> be used when analyzing a program to judge which capabilities the program
>>>> will
>>>> exercise.
>>>>
>>>> Cgroups
>>>> [RFC 02/18] cgroup_pids: track maximum pids
>>>> [RFC 03/18] memcontrol: present maximum used memory also for
>>>> [RFC 04/18] device_cgroup: track and present accessed devices
>>>>
>>>> For tasks and memory cgroup limits the situation is somewhat better as
>>>> the
>>>> current tasks and memory status can be easily seen with ps(1). However,
>>>> any
>>>> transient tasks or temporary higher memory use might slip from the view.
>>>> Device use may be seen with advanced MAC tools, like TOMOYO, but there is
>>>> no
>>>> universal method. Program sources typically give no useful indication
>>>> about
>>>> memory use or how many tasks there could be.
>>>>
>>>> Resource limits
>>>> [RFC 05/18] limits: track and present RLIMIT_NOFILE actual max
>>>> [RFC 06/18] limits: present RLIMIT_CPU and RLIMIT_RTTIMER current
>>>> [RFC 07/18] limits: track RLIMIT_FSIZE actual max
>>>> [RFC 08/18] limits: track RLIMIT_DATA actual max
>>>> [RFC 09/18] limits: track RLIMIT_CORE actual max
>>>> [RFC 10/18] limits: track RLIMIT_STACK actual max
>>>> [RFC 11/18] limits: track and present RLIMIT_NPROC actual max
>>>> [RFC 12/18] limits: track RLIMIT_MEMLOCK actual max
>>>> [RFC 13/18] limits: track RLIMIT_AS actual max
>>>> [RFC 14/18] limits: track RLIMIT_SIGPENDING actual max
>>>> [RFC 15/18] limits: track RLIMIT_MSGQUEUE actual max
>>>> [RFC 16/18] limits: track RLIMIT_NICE actual max
>>>> [RFC 17/18] limits: track RLIMIT_RTPRIO actual max
>>>> [RFC 18/18] proc: present VM_LOCKED memory in /proc/self/maps
>>>>
>>>> Current number of files and current VM usage (data pages, address space
>>>> size)
>>>> could be calculated from available /proc files. Again, any temporarily
>>>> higher
>>>> values could be easily missed. For many limits, there is no way to see
>>>> what
>>>> is the current situation and source code is mostly useless.
>>>>
>>>> As a side note, the resource limits seem to be in bad shape. For example,
>>>> RLIMIT_MEMLOCK is used incoherently and I think VM statistics can miss
>>>> some changes. Adding RLIMIT_CODE could be useful.
>>>>
>>>> The current maximum values for the resource limits are now shown in
>>>> /proc/task/limits. If this is deemed too confusing for the existing
>>>> programs which rely on the exact format, I can change that to a new file.
>>>>
>>>>
>>>> Finally, the patches work in my testing but I have probably missed finer
>>>> lock/RCU details.
>>>>
>>>> -Topi
>>>>
>>

* Re: [RFC 02/18] cgroup_pids: track maximum pids
  2016-06-13 21:33       ` Tejun Heo
  2016-06-13 21:59         ` Topi Miettinen
@ 2016-07-17 20:11         ` Topi Miettinen
  2016-07-19  1:09           ` Tejun Heo
  1 sibling, 1 reply; 56+ messages in thread
From: Topi Miettinen @ 2016-07-17 20:11 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kernel, Li Zefan, Johannes Weiner,
	open list:CONTROL GROUP (CGROUP)

On 06/13/16 21:33, Tejun Heo wrote:
> Hello,
> 
> On Mon, Jun 13, 2016 at 09:29:32PM +0000, Topi Miettinen wrote:
>> I used fork callback as I don't want to lower the watermark in all cases
>> where the charge can be lowered, so I'd update the watermark only when
>> the fork really happens.
> 
> I don't think that would make a noticeable difference.  That's where
> we decide whether to grant fork or not after all and thus where the
> actual usage is.

I tried using only charge functions, but then the result was too low.
With fork callback, the result was as expected.

-Topi

> 
>> Is there a better way to compare and set? I don't think atomic_cmpxchg()
>> does what's needed,
> 
> cmpxchg loop should do what's necessary although I'm not sure how much
> being strictly correct matters here.
> 
> Thanks.
> 
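
As an illustration of the cmpxchg loop mentioned above, here is a minimal user-space sketch using C11 atomics. It is an analogue only, not the code from the patch: the function name watermark_update() is hypothetical, and kernel code would use the kernel's own atomic_t helpers rather than <stdatomic.h>.

```c
#include <stdatomic.h>

/*
 * Hypothetical helper: raise *watermark to value if value is larger,
 * without ever lowering it, even when several threads race.
 */
void watermark_update(atomic_long *watermark, long value)
{
    long old = atomic_load(watermark);

    /*
     * When the compare-exchange fails, old is refreshed with the
     * value currently stored, so the loop re-checks whether an
     * update is still needed before trying again.
     */
    while (old < value &&
           !atomic_compare_exchange_weak(watermark, &old, value))
        ;
}
```

The loop exits either because the stored value is already at least as large as the new one, or because the exchange succeeded, so a concurrent larger update is never overwritten by a smaller one.
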

* Re: [RFC 02/18] cgroup_pids: track maximum pids
  2016-07-17 20:11         ` Topi Miettinen
@ 2016-07-19  1:09           ` Tejun Heo
  2016-07-19 16:59             ` Topi Miettinen
  0 siblings, 1 reply; 56+ messages in thread
From: Tejun Heo @ 2016-07-19  1:09 UTC (permalink / raw)
  To: Topi Miettinen
  Cc: linux-kernel, Li Zefan, Johannes Weiner,
	open list:CONTROL GROUP (CGROUP)

On Sun, Jul 17, 2016 at 08:11:31PM +0000, Topi Miettinen wrote:
> On 06/13/16 21:33, Tejun Heo wrote:
> > Hello,
> > 
> > On Mon, Jun 13, 2016 at 09:29:32PM +0000, Topi Miettinen wrote:
> >> I used fork callback as I don't want to lower the watermark in all cases
> >> where the charge can be lowered, so I'd update the watermark only when
> >> the fork really happens.
> > 
> > I don't think that would make a noticeable difference.  That's where
> > we decide whether to grant fork or not after all and thus where the
> > actual usage is.
> 
> I tried using only charge functions, but then the result was too low.
> With fork callback, the result was as expected.

Can you please elaborate in more detail?

Thanks.

-- 
tejun

* Re: [RFC 02/18] cgroup_pids: track maximum pids
  2016-07-19  1:09           ` Tejun Heo
@ 2016-07-19 16:59             ` Topi Miettinen
  2016-07-19 18:13               ` Tejun Heo
  0 siblings, 1 reply; 56+ messages in thread
From: Topi Miettinen @ 2016-07-19 16:59 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kernel, Li Zefan, Johannes Weiner,
	open list:CONTROL GROUP (CGROUP)

On 07/19/16 01:09, Tejun Heo wrote:
> On Sun, Jul 17, 2016 at 08:11:31PM +0000, Topi Miettinen wrote:
>> On 06/13/16 21:33, Tejun Heo wrote:
>>> Hello,
>>>
>>> On Mon, Jun 13, 2016 at 09:29:32PM +0000, Topi Miettinen wrote:
>>>> I used fork callback as I don't want to lower the watermark in all cases
>>>> where the charge can be lowered, so I'd update the watermark only when
>>>> the fork really happens.
>>>
>>> I don't think that would make a noticeable difference.  That's where
>>> we decide whether to grant fork or not after all and thus where the
>>> actual usage is.
>>
>> I tried using only charge functions, but then the result was too low.
>> With fork callback, the result was as expected.
> 
> Can you please elaborate in more detail?

With the example systemd-timesyncd case, I was only getting 1 as the
highwatermark, but there were already two tasks.

-Topi

* Re: [RFC 02/18] cgroup_pids: track maximum pids
  2016-07-19 16:59             ` Topi Miettinen
@ 2016-07-19 18:13               ` Tejun Heo
  0 siblings, 0 replies; 56+ messages in thread
From: Tejun Heo @ 2016-07-19 18:13 UTC (permalink / raw)
  To: Topi Miettinen
  Cc: linux-kernel, Li Zefan, Johannes Weiner,
	open list:CONTROL GROUP (CGROUP)

On Tue, Jul 19, 2016 at 04:59:18PM +0000, Topi Miettinen wrote:
> With the example systemd-timesyncd case, I was only getting 1 as the
> highwatermark, but there were already two tasks.

Can you please find out why that is so?  Given that that's where we
charge pid usage, it doesn't make sense to me that you're getting
lower numbers than actual usage there.

Thanks.

-- 
tejun

Thread overview: 56+ messages
2016-06-13 19:44 [RFC 00/18] Present useful limits to user Topi Miettinen
2016-06-13 19:44 ` [RFC 01/18] capabilities: track actually used capabilities Topi Miettinen
2016-06-13 20:32   ` Andy Lutomirski
2016-06-13 20:45     ` Topi Miettinen
2016-06-13 21:12       ` Andy Lutomirski
2016-06-13 21:48         ` Topi Miettinen
2016-06-13 19:44 ` [RFC 02/18] cgroup_pids: track maximum pids Topi Miettinen
2016-06-13 21:12   ` Tejun Heo
2016-06-13 21:29     ` Topi Miettinen
2016-06-13 21:33       ` Tejun Heo
2016-06-13 21:59         ` Topi Miettinen
2016-06-13 22:09           ` Tejun Heo
2016-07-17 20:11         ` Topi Miettinen
2016-07-19  1:09           ` Tejun Heo
2016-07-19 16:59             ` Topi Miettinen
2016-07-19 18:13               ` Tejun Heo
2016-06-13 19:44 ` [RFC 03/18] memcontrol: present maximum used memory also for cgroup-v2 Topi Miettinen
2016-06-14  7:01   ` Michal Hocko
2016-06-14 15:47     ` Topi Miettinen
2016-06-14 16:04       ` Johannes Weiner
2016-06-14 17:15         ` Topi Miettinen
2016-06-16 10:27           ` Michal Hocko
2016-06-13 19:44 ` [RFC 04/18] device_cgroup: track and present accessed devices Topi Miettinen
2016-06-17 15:22   ` Serge E. Hallyn
2016-06-13 19:44 ` [RFC 05/18] limits: track and present RLIMIT_NOFILE actual max Topi Miettinen
2016-06-13 20:40   ` Andy Lutomirski
2016-06-13 21:13     ` Topi Miettinen
2016-06-13 21:16       ` Andy Lutomirski
2016-06-14 15:21         ` Topi Miettinen
2016-06-13 19:44 ` [RFC 06/18] limits: present RLIMIT_CPU and RLIMIT_RTTIMER current status Topi Miettinen
2016-06-14  9:14   ` Alexey Dobriyan
2016-06-13 19:44 ` [RFC 07/18] limits: track RLIMIT_FSIZE actual max Topi Miettinen
2016-06-13 19:44 ` [RFC 08/18] limits: track RLIMIT_DATA " Topi Miettinen
2016-06-13 19:44 ` [RFC 09/18] limits: track RLIMIT_CORE " Topi Miettinen
2016-06-13 19:44 ` [RFC 10/18] limits: track RLIMIT_STACK " Topi Miettinen
2016-06-13 19:44 ` [RFC 11/18] limits: track and present RLIMIT_NPROC " Topi Miettinen
2016-06-13 22:27   ` Jann Horn
2016-06-14 15:40     ` Topi Miettinen
2016-06-14 23:15       ` Jann Horn
2016-06-13 19:44 ` [RFC 13/18] limits: track RLIMIT_AS " Topi Miettinen
2016-06-13 19:44 ` [RFC 14/18] limits: track RLIMIT_SIGPENDING " Topi Miettinen
2016-06-14 14:50   ` Oleg Nesterov
2016-06-14 15:51     ` Topi Miettinen
2016-06-13 19:44 ` [RFC 15/18] limits: track RLIMIT_MSGQUEUE " Topi Miettinen
2016-06-17 19:52   ` Doug Ledford
2016-06-13 19:44 ` [RFC 16/18] limits: track RLIMIT_NICE " Topi Miettinen
2016-06-13 19:44 ` [RFC 17/18] limits: track RLIMIT_RTPRIO " Topi Miettinen
2016-06-13 19:44 ` [RFC 18/18] proc: present VM_LOCKED memory in /proc/self/maps Topi Miettinen
2016-06-13 20:43   ` Kees Cook
2016-06-13 20:52     ` Topi Miettinen
2016-06-14 19:03 ` [RFC 00/18] Present useful limits to user Konstantin Khlebnikov
2016-06-14 19:46   ` Topi Miettinen
2016-06-15 14:47   ` Austin S. Hemmelgarn
2016-06-18 14:45     ` Konstantin Khlebnikov
2016-06-19  6:38       ` Topi Miettinen
2016-06-20 17:37       ` Austin S. Hemmelgarn
