linux-kernel.vger.kernel.org archive mirror
* [PATCH v5 0/4] cgroup notifications API and memory thresholds
@ 2009-12-30 15:57 Kirill A. Shutemov
  2009-12-30 15:57 ` [PATCH v5 1/4] cgroup: implement eventfd-based generic API for notifications Kirill A. Shutemov
  2010-01-04  0:36 ` [PATCH v5 0/4] cgroup notifications API and memory thresholds Balbir Singh
  0 siblings, 2 replies; 13+ messages in thread
From: Kirill A. Shutemov @ 2009-12-30 15:57 UTC (permalink / raw)
  To: containers, linux-mm
  Cc: Paul Menage, Li Zefan, Andrew Morton, KAMEZAWA Hiroyuki,
	Balbir Singh, Pavel Emelyanov, Dan Malek, Vladislav Buzov,
	Daisuke Nishimura, Alexander Shishkin, linux-kernel,
	Kirill A. Shutemov

This patchset introduces an eventfd-based API for notifications in cgroups
and implements memory notifications on top of it.

It uses statistics in the memory controller to track memory usage.

Output of time(1) for building a kernel on tmpfs:

Root cgroup before changes:
	make -j2  506.37 user 60.93s system 193% cpu 4:52.77 total
Non-root cgroup before changes:
	make -j2  507.14 user 62.66s system 193% cpu 4:54.74 total
Root cgroup after changes (0 thresholds):
	make -j2  507.13 user 62.20s system 193% cpu 4:53.55 total
Non-root cgroup after changes (0 thresholds):
	make -j2  507.70 user 64.20s system 193% cpu 4:55.70 total
Root cgroup after changes (1 threshold, never crossed):
	make -j2  506.97 user 62.20s system 193% cpu 4:53.90 total
Non-root cgroup after changes (1 threshold, never crossed):
	make -j2  507.55 user 64.08s system 193% cpu 4:55.63 total

Any comments?

v4 -> v5:
 - rework  __mem_cgroup_threshold() a bit;
 - drop __mem_cgroup_stat_reset_safe();
 - fixes based on comments.

v3 -> v4:
 - documentation.

v2 -> v3:
 - remove [RFC];
 - rebased to 2.6.33-rc2;
 - fixes based on comments;
 - fixed potential race on event removing;
 - use RCU-protected arrays to track thresholds.

v1 -> v2:
 - use statistics instead of res_counter to track resource usage;
 - fix bugs with locking.

v0 -> v1:
 - memsw support implemented.

Kirill A. Shutemov (4):
  cgroup: implement eventfd-based generic API for notifications
  memcg: extract mem_group_usage() from mem_cgroup_read()
  memcg: rework usage of stats by soft limit
  memcg: implement memory thresholds

 Documentation/cgroups/cgroups.txt |   20 ++
 Documentation/cgroups/memory.txt  |   19 ++-
 include/linux/cgroup.h            |   24 +++
 kernel/cgroup.c                   |  208 ++++++++++++++++++++-
 mm/memcontrol.c                   |  384 ++++++++++++++++++++++++++++++++++---
 5 files changed, 623 insertions(+), 32 deletions(-)



* [PATCH v5 1/4] cgroup: implement eventfd-based generic API for notifications
  2009-12-30 15:57 [PATCH v5 0/4] cgroup notifications API and memory thresholds Kirill A. Shutemov
@ 2009-12-30 15:57 ` Kirill A. Shutemov
  2009-12-30 15:57   ` [PATCH v5 2/4] memcg: extract mem_group_usage() from mem_cgroup_read() Kirill A. Shutemov
  2010-01-07  1:01   ` [PATCH v5 1/4] cgroup: implement eventfd-based generic API for notifications Paul Menage
  2010-01-04  0:36 ` [PATCH v5 0/4] cgroup notifications API and memory thresholds Balbir Singh
  1 sibling, 2 replies; 13+ messages in thread
From: Kirill A. Shutemov @ 2009-12-30 15:57 UTC (permalink / raw)
  To: containers, linux-mm
  Cc: Paul Menage, Li Zefan, Andrew Morton, KAMEZAWA Hiroyuki,
	Balbir Singh, Pavel Emelyanov, Dan Malek, Vladislav Buzov,
	Daisuke Nishimura, Alexander Shishkin, linux-kernel,
	Kirill A. Shutemov

This patch introduces write-only file "cgroup.event_control" in every
cgroup.

To register a new notification handler you need to:
- create an eventfd;
- open a control file to be monitored. The callbacks register_event() and
  unregister_event() must be defined for the control file;
- write "<event_fd> <control_fd> <args>" to cgroup.event_control.
  Interpretation of args is defined by the control file implementation.

The eventfd will be signalled by the control file implementation or when
the cgroup is removed.

To unregister a notification handler, just close the eventfd.

If you need notification functionality for a control file, you have to
implement the register_event() and unregister_event() callbacks in its
struct cftype.

Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 Documentation/cgroups/cgroups.txt |   20 ++++
 include/linux/cgroup.h            |   24 +++++
 kernel/cgroup.c                   |  208 ++++++++++++++++++++++++++++++++++++-
 3 files changed, 251 insertions(+), 1 deletions(-)

diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index 3bda601..b73c306 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -23,6 +23,7 @@ CONTENTS:
   2.1 Basic Usage
   2.2 Attaching processes
   2.3 Mounting hierarchies by name
+  2.4 Notification API
 3. Kernel API
   3.1 Overview
   3.2 Synchronization
@@ -435,6 +436,25 @@ you give a subsystem a name.
 The name of the subsystem appears as part of the hierarchy description
 in /proc/mounts and /proc/<pid>/cgroups.
 
+2.4 Notification API
+--------------------
+
+There is a mechanism which allows you to get notifications about the
+changing status of a cgroup.
+
+To register a new notification handler you need to:
+ - create a file descriptor for event notification using eventfd(2);
+ - open a control file to be monitored (e.g. memory.usage_in_bytes);
+ - write "<event_fd> <control_fd> <args>" to cgroup.event_control.
+   Interpretation of args is defined by the control file implementation.
+
+The eventfd will be signalled by the control file implementation or when
+the cgroup is removed.
+
+To unregister a notification handler, just close the eventfd.
+
+NOTE: Notification support must be implemented by the control file.
+See the documentation for the subsystem.
 
 3. Kernel API
 =============
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 0008dee..b078409 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -220,6 +220,10 @@ struct cgroup {
 
 	/* For RCU-protected deletion */
 	struct rcu_head rcu_head;
+
+	/* List of events which userspace wants to receive */
+	struct list_head event_list;
+	spinlock_t event_list_lock;
 };
 
 /*
@@ -362,6 +366,26 @@ struct cftype {
 	int (*trigger)(struct cgroup *cgrp, unsigned int event);
 
 	int (*release)(struct inode *inode, struct file *file);
+
+	/*
+	 * register_event() callback will be used to add a new userspace
+	 * waiter for changes related to the cftype. Implement it if
+	 * you want to provide this functionality. Use eventfd_signal()
+	 * on the eventfd to send a notification to userspace.
+	 */
+	int (*register_event)(struct cgroup *cgrp, struct cftype *cft,
+			struct eventfd_ctx *eventfd, const char *args);
+	/*
+	 * unregister_event() callback will be called when userspace
+	 * closes the eventfd or when the cgroup is removed.
+	 * This callback must be implemented if you want to provide
+	 * notification functionality.
+	 *
+	 * Be careful: it can be called after destroy(), so you have
+	 * to keep all necessary data until all events are removed.
+	 */
+	int (*unregister_event)(struct cgroup *cgrp, struct cftype *cft,
+			struct eventfd_ctx *eventfd);
 };
 
 struct cgroup_scanner {
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 0249f4b..6017e7a 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -4,6 +4,10 @@
  *  Based originally on the cpuset system, extracted by Paul Menage
  *  Copyright (C) 2006 Google, Inc
  *
+ *  Notifications support
+ *  Copyright (C) 2009 Nokia Corporation
+ *  Author: Kirill A. Shutemov
+ *
  *  Copyright notices from the original cpuset code:
  *  --------------------------------------------------
  *  Copyright (C) 2003 BULL SA.
@@ -51,6 +55,8 @@
 #include <linux/pid_namespace.h>
 #include <linux/idr.h>
 #include <linux/vmalloc.h> /* TODO: replace with more sophisticated array */
+#include <linux/eventfd.h>
+#include <linux/poll.h>
 
 #include <asm/atomic.h>
 
@@ -146,6 +152,35 @@ struct css_id {
 	unsigned short stack[0]; /* Array of Length (depth+1) */
 };
 
+/*
+ * cgroup_event represents events which userspace wants to receive.
+ */
+struct cgroup_event {
+	/*
+	 * Cgroup which the event belongs to.
+	 */
+	struct cgroup *cgrp;
+	/*
+	 * Control file with which the event is associated.
+	 */
+	struct cftype *cft;
+	/*
+	 * eventfd to signal userspace about the event.
+	 */
+	struct eventfd_ctx *eventfd;
+	/*
+	 * Each of these is stored in a list by the cgroup.
+	 */
+	struct list_head list;
+	/*
+	 * All of the fields below are needed to unregister the event
+	 * when userspace closes the eventfd.
+	 */
+	poll_table pt;
+	wait_queue_head_t *wqh;
+	wait_queue_t wait;
+	struct work_struct remove;
+};
 
 /* The list of hierarchy roots */
 
@@ -734,14 +769,28 @@ static struct inode *cgroup_new_inode(mode_t mode, struct super_block *sb)
 static int cgroup_call_pre_destroy(struct cgroup *cgrp)
 {
 	struct cgroup_subsys *ss;
+	struct cgroup_event *event, *tmp;
 	int ret = 0;
 
 	for_each_subsys(cgrp->root, ss)
 		if (ss->pre_destroy) {
 			ret = ss->pre_destroy(ss, cgrp);
 			if (ret)
-				break;
+				goto out;
 		}
+
+	/*
+	 * Unregister events and notify userspace.
+	 */
+	spin_lock(&cgrp->event_list_lock);
+	list_for_each_entry_safe(event, tmp, &cgrp->event_list, list) {
+		list_del(&event->list);
+		eventfd_signal(event->eventfd, 1);
+		schedule_work(&event->remove);
+	}
+	spin_unlock(&cgrp->event_list_lock);
+
+out:
 	return ret;
 }
 
@@ -1136,6 +1185,8 @@ static void init_cgroup_housekeeping(struct cgroup *cgrp)
 	INIT_LIST_HEAD(&cgrp->release_list);
 	INIT_LIST_HEAD(&cgrp->pidlists);
 	mutex_init(&cgrp->pidlist_mutex);
+	INIT_LIST_HEAD(&cgrp->event_list);
+	spin_lock_init(&cgrp->event_list_lock);
 }
 
 static void init_cgroup_root(struct cgroupfs_root *root)
@@ -1935,6 +1986,16 @@ static const struct inode_operations cgroup_dir_inode_operations = {
 	.rename = cgroup_rename,
 };
 
+/*
+ * Check if a file is a control file
+ */
+static inline struct cftype *__file_cft(struct file *file)
+{
+	if (file->f_dentry->d_inode->i_fop != &cgroup_file_operations)
+		return ERR_PTR(-EINVAL);
+	return __d_cft(file->f_dentry);
+}
+
 static int cgroup_create_file(struct dentry *dentry, mode_t mode,
 				struct super_block *sb)
 {
@@ -2789,6 +2850,146 @@ static int cgroup_write_notify_on_release(struct cgroup *cgrp,
 	return 0;
 }
 
+static void cgroup_event_remove(struct work_struct *work)
+{
+	struct cgroup_event *event = container_of(work, struct cgroup_event,
+			remove);
+	struct cgroup *cgrp = event->cgrp;
+
+	/* TODO: check return code*/
+	event->cft->unregister_event(cgrp, event->cft, event->eventfd);
+
+	eventfd_ctx_put(event->eventfd);
+	remove_wait_queue(event->wqh, &event->wait);
+	kfree(event);
+}
+
+static int cgroup_event_wake(wait_queue_t *wait, unsigned mode,
+		int sync, void *key)
+{
+	struct cgroup_event *event = container_of(wait,
+			struct cgroup_event, wait);
+	struct cgroup *cgrp = event->cgrp;
+	unsigned long flags = (unsigned long)key;
+
+	if (flags & POLLHUP) {
+		spin_lock(&cgrp->event_list_lock);
+		list_del(&event->list);
+		spin_unlock(&cgrp->event_list_lock);
+		schedule_work(&event->remove);
+	}
+
+	return 0;
+}
+
+static void cgroup_event_ptable_queue_proc(struct file *file,
+		wait_queue_head_t *wqh, poll_table *pt)
+{
+	struct cgroup_event *event = container_of(pt,
+			struct cgroup_event, pt);
+
+	event->wqh = wqh;
+	add_wait_queue(wqh, &event->wait);
+}
+
+static int cgroup_write_event_control(struct cgroup *cgrp, struct cftype *cft,
+				      const char *buffer)
+{
+	struct cgroup_event *event = NULL;
+	unsigned int efd, cfd;
+	struct file *efile = NULL;
+	struct file *cfile = NULL;
+	char *endp;
+	int ret;
+
+	efd = simple_strtoul(buffer, &endp, 10);
+	if (*endp != ' ')
+		return -EINVAL;
+	buffer = endp + 1;
+
+	cfd = simple_strtoul(buffer, &endp, 10);
+	if ((*endp != ' ') && (*endp != '\0'))
+		return -EINVAL;
+	buffer = endp + 1;
+
+	event = kzalloc(sizeof(*event), GFP_KERNEL);
+	if (!event)
+		return -ENOMEM;
+	event->cgrp = cgrp;
+	INIT_LIST_HEAD(&event->list);
+	init_poll_funcptr(&event->pt, cgroup_event_ptable_queue_proc);
+	init_waitqueue_func_entry(&event->wait, cgroup_event_wake);
+	INIT_WORK(&event->remove, cgroup_event_remove);
+
+	efile = eventfd_fget(efd);
+	if (IS_ERR(efile)) {
+		ret = PTR_ERR(efile);
+		goto fail;
+	}
+
+	event->eventfd = eventfd_ctx_fileget(efile);
+	if (IS_ERR(event->eventfd)) {
+		ret = PTR_ERR(event->eventfd);
+		goto fail;
+	}
+
+	cfile = fget(cfd);
+	if (!cfile) {
+		ret = -EBADF;
+		goto fail;
+	}
+
+	/* the process needs read permission on the control file */
+	ret = file_permission(cfile, MAY_READ);
+	if (ret < 0)
+		goto fail;
+
+	event->cft = __file_cft(cfile);
+	if (IS_ERR(event->cft)) {
+		ret = PTR_ERR(event->cft);
+		goto fail;
+	}
+
+	if (!event->cft->register_event || !event->cft->unregister_event) {
+		ret = -EINVAL;
+		goto fail;
+	}
+
+	ret = event->cft->register_event(cgrp, event->cft,
+			event->eventfd, buffer);
+	if (ret)
+		goto fail;
+
+	if (efile->f_op->poll(efile, &event->pt) & POLLHUP) {
+		event->cft->unregister_event(cgrp, event->cft, event->eventfd);
+		ret = 0;
+		goto fail;
+	}
+
+	spin_lock(&cgrp->event_list_lock);
+	list_add(&event->list, &cgrp->event_list);
+	spin_unlock(&cgrp->event_list_lock);
+
+	fput(cfile);
+	fput(efile);
+
+	return 0;
+
+fail:
+	if (!IS_ERR(cfile))
+		fput(cfile);
+
+	if (event && event->eventfd && !IS_ERR(event->eventfd))
+		eventfd_ctx_put(event->eventfd);
+
+	if (!IS_ERR(efile))
+		fput(efile);
+
+	kfree(event);
+
+	return ret;
+}
+
 /*
  * for the common functions, 'private' gives the type of file
  */
@@ -2814,6 +3015,11 @@ static struct cftype files[] = {
 		.read_u64 = cgroup_read_notify_on_release,
 		.write_u64 = cgroup_write_notify_on_release,
 	},
+	{
+		.name = CGROUP_FILE_GENERIC_PREFIX "event_control",
+		.write_string = cgroup_write_event_control,
+		.mode = S_IWUGO,
+	},
 };
 
 static struct cftype cft_release_agent = {
-- 
1.6.5.7



* [PATCH v5 2/4] memcg: extract mem_group_usage() from mem_cgroup_read()
  2009-12-30 15:57 ` [PATCH v5 1/4] cgroup: implement eventfd-based generic API for notifications Kirill A. Shutemov
@ 2009-12-30 15:57   ` Kirill A. Shutemov
  2009-12-30 15:57     ` [PATCH v5 3/4] memcg: rework usage of stats by soft limit Kirill A. Shutemov
  2010-01-07  1:01   ` [PATCH v5 1/4] cgroup: implement eventfd-based generic API for notifications Paul Menage
  1 sibling, 1 reply; 13+ messages in thread
From: Kirill A. Shutemov @ 2009-12-30 15:57 UTC (permalink / raw)
  To: containers, linux-mm
  Cc: Paul Menage, Li Zefan, Andrew Morton, KAMEZAWA Hiroyuki,
	Balbir Singh, Pavel Emelyanov, Dan Malek, Vladislav Buzov,
	Daisuke Nishimura, Alexander Shishkin, linux-kernel,
	Kirill A. Shutemov

Helper to get memory or mem+swap usage of the cgroup.

Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |   54 ++++++++++++++++++++++++++++++++----------------------
 1 files changed, 32 insertions(+), 22 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 488b644..1d71cb4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2722,40 +2722,50 @@ mem_cgroup_get_recursive_idx_stat(struct mem_cgroup *mem,
 	*val = d.val;
 }
 
+static inline u64 mem_cgroup_usage(struct mem_cgroup *mem, bool swap)
+{
+	u64 idx_val, val;
+
+	if (!mem_cgroup_is_root(mem)) {
+		if (!swap)
+			return res_counter_read_u64(&mem->res, RES_USAGE);
+		else
+			return res_counter_read_u64(&mem->memsw, RES_USAGE);
+	}
+
+	mem_cgroup_get_recursive_idx_stat(mem, MEM_CGROUP_STAT_CACHE, &idx_val);
+	val = idx_val;
+	mem_cgroup_get_recursive_idx_stat(mem, MEM_CGROUP_STAT_RSS, &idx_val);
+	val += idx_val;
+
+	if (swap) {
+		mem_cgroup_get_recursive_idx_stat(mem,
+				MEM_CGROUP_STAT_SWAPOUT, &idx_val);
+		val += idx_val;
+	}
+
+	return val << PAGE_SHIFT;
+}
+
 static u64 mem_cgroup_read(struct cgroup *cont, struct cftype *cft)
 {
 	struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
-	u64 idx_val, val;
+	u64 val;
 	int type, name;
 
 	type = MEMFILE_TYPE(cft->private);
 	name = MEMFILE_ATTR(cft->private);
 	switch (type) {
 	case _MEM:
-		if (name == RES_USAGE && mem_cgroup_is_root(mem)) {
-			mem_cgroup_get_recursive_idx_stat(mem,
-				MEM_CGROUP_STAT_CACHE, &idx_val);
-			val = idx_val;
-			mem_cgroup_get_recursive_idx_stat(mem,
-				MEM_CGROUP_STAT_RSS, &idx_val);
-			val += idx_val;
-			val <<= PAGE_SHIFT;
-		} else
+		if (name == RES_USAGE)
+			val = mem_cgroup_usage(mem, false);
+		else
 			val = res_counter_read_u64(&mem->res, name);
 		break;
 	case _MEMSWAP:
-		if (name == RES_USAGE && mem_cgroup_is_root(mem)) {
-			mem_cgroup_get_recursive_idx_stat(mem,
-				MEM_CGROUP_STAT_CACHE, &idx_val);
-			val = idx_val;
-			mem_cgroup_get_recursive_idx_stat(mem,
-				MEM_CGROUP_STAT_RSS, &idx_val);
-			val += idx_val;
-			mem_cgroup_get_recursive_idx_stat(mem,
-				MEM_CGROUP_STAT_SWAPOUT, &idx_val);
-			val += idx_val;
-			val <<= PAGE_SHIFT;
-		} else
+		if (name == RES_USAGE)
+			val = mem_cgroup_usage(mem, true);
+		else
 			val = res_counter_read_u64(&mem->memsw, name);
 		break;
 	default:
-- 
1.6.5.7



* [PATCH v5 3/4] memcg: rework usage of stats by soft limit
  2009-12-30 15:57   ` [PATCH v5 2/4] memcg: extract mem_group_usage() from mem_cgroup_read() Kirill A. Shutemov
@ 2009-12-30 15:57     ` Kirill A. Shutemov
  2009-12-30 15:57       ` [PATCH v5 4/4] memcg: implement memory thresholds Kirill A. Shutemov
  2010-01-03 23:56       ` [PATCH v5 3/4] memcg: rework usage of stats by soft limit KAMEZAWA Hiroyuki
  0 siblings, 2 replies; 13+ messages in thread
From: Kirill A. Shutemov @ 2009-12-30 15:57 UTC (permalink / raw)
  To: containers, linux-mm
  Cc: Paul Menage, Li Zefan, Andrew Morton, KAMEZAWA Hiroyuki,
	Balbir Singh, Pavel Emelyanov, Dan Malek, Vladislav Buzov,
	Daisuke Nishimura, Alexander Shishkin, linux-kernel,
	Kirill A. Shutemov

Instead of incrementing a counter on each page in/out and comparing it
with a constant, we set the counter to the constant, decrement it on
each page in/out, and compare it with zero. We want to make the
comparison as fast as possible. On many RISC systems (and probably not
only RISC), comparing with zero is more efficient than comparing with a
constant, since not every constant can be an immediate operand of the
compare instruction.

Also, I've renamed MEM_CGROUP_STAT_EVENTS to MEM_CGROUP_STAT_SOFTLIMIT,
since it's really not a generic counter.

Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>

---

KAMEZAWA-san, I've changed the patch a bit. Can I reuse your Acked-by?

---
 mm/memcontrol.c |   18 ++++++++++--------
 1 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1d71cb4..c36d4f3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -69,8 +69,9 @@ enum mem_cgroup_stat_index {
 	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
 	MEM_CGROUP_STAT_PGPGIN_COUNT,	/* # of pages paged in */
 	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
-	MEM_CGROUP_STAT_EVENTS,	/* sum of pagein + pageout for internal use */
 	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
+	MEM_CGROUP_STAT_SOFTLIMIT, /* decrements on each page in/out.
+					used by soft limit implementation */
 
 	MEM_CGROUP_STAT_NSTATS,
 };
@@ -84,10 +85,10 @@ struct mem_cgroup_stat {
 };
 
 static inline void
-__mem_cgroup_stat_reset_safe(struct mem_cgroup_stat_cpu *stat,
-				enum mem_cgroup_stat_index idx)
+__mem_cgroup_stat_set_safe(struct mem_cgroup_stat_cpu *stat,
+				enum mem_cgroup_stat_index idx, s64 val)
 {
-	stat->count[idx] = 0;
+	stat->count[idx] = val;
 }
 
 static inline s64
@@ -380,9 +381,10 @@ static bool mem_cgroup_soft_limit_check(struct mem_cgroup *mem)
 
 	cpu = get_cpu();
 	cpustat = &mem->stat.cpustat[cpu];
-	val = __mem_cgroup_stat_read_local(cpustat, MEM_CGROUP_STAT_EVENTS);
-	if (unlikely(val > SOFTLIMIT_EVENTS_THRESH)) {
-		__mem_cgroup_stat_reset_safe(cpustat, MEM_CGROUP_STAT_EVENTS);
+	val = __mem_cgroup_stat_read_local(cpustat, MEM_CGROUP_STAT_SOFTLIMIT);
+	if (unlikely(val < 0)) {
+		__mem_cgroup_stat_set_safe(cpustat, MEM_CGROUP_STAT_SOFTLIMIT,
+				SOFTLIMIT_EVENTS_THRESH);
 		ret = true;
 	}
 	put_cpu();
@@ -515,7 +517,7 @@ static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
 	else
 		__mem_cgroup_stat_add_safe(cpustat,
 				MEM_CGROUP_STAT_PGPGOUT_COUNT, 1);
-	__mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_EVENTS, 1);
+	__mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_SOFTLIMIT, -1);
 	put_cpu();
 }
 
-- 
1.6.5.7



* [PATCH v5 4/4] memcg: implement memory thresholds
  2009-12-30 15:57     ` [PATCH v5 3/4] memcg: rework usage of stats by soft limit Kirill A. Shutemov
@ 2009-12-30 15:57       ` Kirill A. Shutemov
  2010-01-04  0:00         ` KAMEZAWA Hiroyuki
  2010-01-03 23:56       ` [PATCH v5 3/4] memcg: rework usage of stats by soft limit KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 13+ messages in thread
From: Kirill A. Shutemov @ 2009-12-30 15:57 UTC (permalink / raw)
  To: containers, linux-mm
  Cc: Paul Menage, Li Zefan, Andrew Morton, KAMEZAWA Hiroyuki,
	Balbir Singh, Pavel Emelyanov, Dan Malek, Vladislav Buzov,
	Daisuke Nishimura, Alexander Shishkin, linux-kernel,
	Kirill A. Shutemov

It allows registering multiple memory and memsw thresholds and getting
notifications when a threshold is crossed.

To register a threshold, an application needs to:
- create an eventfd;
- open memory.usage_in_bytes or memory.memsw.usage_in_bytes;
- write a string like "<event_fd> <fd of memory.usage_in_bytes>
  <threshold>" to cgroup.event_control.

The application will be notified through the eventfd when memory usage
crosses the threshold in either direction.

This is applicable to both root and non-root cgroups.

It uses statistics to track memory usage, similar to soft limits. It
checks whether an event needs to be sent to userspace on every 100th
page in/out. I guess that's a good compromise between performance and
the accuracy of thresholds.

Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
---
 Documentation/cgroups/memory.txt |   19 +++-
 mm/memcontrol.c                  |  312 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 330 insertions(+), 1 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index b871f25..195af07 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -414,7 +414,24 @@ NOTE1: Soft limits take effect over a long period of time, since they involve
 NOTE2: It is recommended to set the soft limit always below the hard limit,
        otherwise the hard limit will take precedence.
 
-8. TODO
+8. Memory thresholds
+
+The memory controller implements memory thresholds using the cgroups
+notification API (see cgroups.txt). It allows registering multiple memory
+and memsw thresholds and getting notifications when a threshold is crossed.
+
+To register a threshold, an application needs to:
+ - create an eventfd using eventfd(2);
+ - open memory.usage_in_bytes or memory.memsw.usage_in_bytes;
+ - write a string like "<event_fd> <fd of memory.usage_in_bytes>
+   <threshold>" to cgroup.event_control.
+
+The application will be notified through the eventfd when memory usage
+crosses the threshold in either direction.
+
+This is applicable to both root and non-root cgroups.
+
+9. TODO
 
 1. Add support for accounting huge pages (as a separate controller)
 2. Make per-cgroup scanner reclaim not-shared pages first
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c36d4f3..5d4bd0b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6,6 +6,10 @@
  * Copyright 2007 OpenVZ SWsoft Inc
  * Author: Pavel Emelianov <xemul@openvz.org>
  *
+ * Memory thresholds
+ * Copyright (C) 2009 Nokia Corporation
+ * Author: Kirill A. Shutemov
+ *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License as published by
  * the Free Software Foundation; either version 2 of the License, or
@@ -39,6 +43,8 @@
 #include <linux/mm_inline.h>
 #include <linux/page_cgroup.h>
 #include <linux/cpu.h>
+#include <linux/eventfd.h>
+#include <linux/sort.h>
 #include "internal.h"
 
 #include <asm/uaccess.h>
@@ -56,6 +62,7 @@ static int really_do_swap_account __initdata = 1; /* for remember boot option*/
 #endif
 
 #define SOFTLIMIT_EVENTS_THRESH (1000)
+#define THRESHOLDS_EVENTS_THRESH (100)
 
 /*
  * Statistics for memory cgroup.
@@ -72,6 +79,8 @@ enum mem_cgroup_stat_index {
 	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
 	MEM_CGROUP_STAT_SOFTLIMIT, /* decrements on each page in/out.
 					used by soft limit implementation */
+	MEM_CGROUP_STAT_THRESHOLDS, /* decrements on each page in/out.
+					used by threshold implementation */
 
 	MEM_CGROUP_STAT_NSTATS,
 };
@@ -175,6 +184,23 @@ struct mem_cgroup_tree {
 
 static struct mem_cgroup_tree soft_limit_tree __read_mostly;
 
+struct mem_cgroup_threshold {
+	struct eventfd_ctx *eventfd;
+	u64 threshold;
+};
+
+struct mem_cgroup_threshold_ary {
+	/* An array index points to threshold just below usage. */
+	atomic_t current_threshold;
+	/* Size of entries[] */
+	unsigned int size;
+	/* Array of thresholds */
+	struct mem_cgroup_threshold entries[0];
+};
+
+static bool mem_cgroup_threshold_check(struct mem_cgroup* mem);
+static void mem_cgroup_threshold(struct mem_cgroup* mem);
+
 /*
  * The memory controller data structure. The memory controller controls both
  * page cache and RSS per cgroup. We would eventually like to provide
@@ -226,6 +252,15 @@ struct mem_cgroup {
 	/* set when res.limit == memsw.limit */
 	bool		memsw_is_minimum;
 
+	/* protect arrays of thresholds */
+	struct mutex thresholds_lock;
+
+	/* thresholds for memory usage. RCU-protected */
+	struct mem_cgroup_threshold_ary *thresholds;
+
+	/* thresholds for mem+swap usage. RCU-protected */
+	struct mem_cgroup_threshold_ary *memsw_thresholds;
+
 	/*
 	 * statistics. This must be placed at the end of memcg.
 	 */
@@ -518,6 +553,8 @@ static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
 		__mem_cgroup_stat_add_safe(cpustat,
 				MEM_CGROUP_STAT_PGPGOUT_COUNT, 1);
 	__mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_SOFTLIMIT, -1);
+	__mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_THRESHOLDS, -1);
+
 	put_cpu();
 }
 
@@ -1503,6 +1540,8 @@ charged:
 	if (mem_cgroup_soft_limit_check(mem))
 		mem_cgroup_update_tree(mem, page);
 done:
+	if (mem_cgroup_threshold_check(mem))
+		mem_cgroup_threshold(mem);
 	return 0;
 nomem:
 	css_put(&mem->css);
@@ -2068,6 +2107,8 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
 
 	if (mem_cgroup_soft_limit_check(mem))
 		mem_cgroup_update_tree(mem, page);
+	if (mem_cgroup_threshold_check(mem))
+		mem_cgroup_threshold(mem);
 	/* at swapout, this memcg will be accessed to record to swap */
 	if (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
 		css_put(&mem->css);
@@ -3064,12 +3105,280 @@ static int mem_cgroup_swappiness_write(struct cgroup *cgrp, struct cftype *cft,
 	return 0;
 }
 
+static bool mem_cgroup_threshold_check(struct mem_cgroup *mem)
+{
+	bool ret = false;
+	int cpu;
+	s64 val;
+	struct mem_cgroup_stat_cpu *cpustat;
+
+	cpu = get_cpu();
+	cpustat = &mem->stat.cpustat[cpu];
+	val = __mem_cgroup_stat_read_local(cpustat, MEM_CGROUP_STAT_THRESHOLDS);
+	if (unlikely(val < 0)) {
+		__mem_cgroup_stat_set_safe(cpustat, MEM_CGROUP_STAT_THRESHOLDS,
+				THRESHOLDS_EVENTS_THRESH);
+		ret = true;
+	}
+	put_cpu();
+	return ret;
+}
+
+static void __mem_cgroup_threshold(struct mem_cgroup *memcg, bool swap)
+{
+	struct mem_cgroup_threshold_ary *t;
+	u64 usage;
+	int i;
+
+	rcu_read_lock();
+	if (!swap) {
+		t = rcu_dereference(memcg->thresholds);
+	} else {
+		t = rcu_dereference(memcg->memsw_thresholds);
+	}
+
+	if (!t)
+		goto unlock;
+
+	usage = mem_cgroup_usage(memcg, swap);
+
+	/*
+	 * current_threshold points to the threshold just below usage.
+	 * If that is not the case, a threshold was crossed after the
+	 * last call of __mem_cgroup_threshold().
+	 */
+	i = atomic_read(&t->current_threshold);
+
+	/*
+	 * Iterate backward over array of thresholds starting from
+	 * current_threshold and check if a threshold is crossed.
+	 * If none of thresholds below usage is crossed, we read
+	 * only one element of the array here.
+	 */
+	for(; i > 0 && unlikely(t->entries[i].threshold > usage); i--) {
+		eventfd_signal(t->entries[i].eventfd, 1);
+	}
+
+	/* i = current_threshold + 1 */
+	i++;
+
+	/*
+	 * Iterate forward over array of thresholds starting from
+	 * current_threshold+1 and check if a threshold is crossed.
+	 * If none of thresholds above usage is crossed, we read
+	 * only one element of the array here.
+	 */
+	for(; i < t->size && unlikely(t->entries[i].threshold <= usage); i++) {
+		eventfd_signal(t->entries[i].eventfd, 1);
+	}
+
+	/* Update current_threshold */
+	atomic_set(&t->current_threshold, i - 1);
+unlock:
+	rcu_read_unlock();
+}
+
+static void mem_cgroup_threshold(struct mem_cgroup *memcg)
+{
+	__mem_cgroup_threshold(memcg, false);
+	if (do_swap_account)
+		__mem_cgroup_threshold(memcg, true);
+}
+
+static int compare_thresholds(const void *a, const void *b)
+{
+	const struct mem_cgroup_threshold *_a = a;
+	const struct mem_cgroup_threshold *_b = b;
+
+	return _a->threshold - _b->threshold;
+}
+
+static int mem_cgroup_register_event(struct cgroup *cgrp, struct cftype *cft,
+		struct eventfd_ctx *eventfd, const char *args)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+	struct mem_cgroup_threshold_ary *thresholds, *thresholds_new;
+	int type = MEMFILE_TYPE(cft->private);
+	u64 threshold, usage;
+	int size;
+	int i, ret;
+
+	ret = res_counter_memparse_write_strategy(args, &threshold);
+	if (ret)
+		return ret;
+
+	mutex_lock(&memcg->thresholds_lock);
+	if (type == _MEM)
+		thresholds = memcg->thresholds;
+	else if (type == _MEMSWAP)
+		thresholds = memcg->memsw_thresholds;
+	else
+		BUG();
+
+	usage = mem_cgroup_usage(memcg, type == _MEMSWAP);
+
+	/* Check if a threshold crossed before adding a new one */
+	if (thresholds)
+		__mem_cgroup_threshold(memcg, type == _MEMSWAP);
+
+	if (thresholds)
+		size = thresholds->size + 1;
+	else
+		size = 1;
+
+	/* Allocate memory for new array of thresholds */
+	thresholds_new = kmalloc(sizeof(*thresholds_new) +
+			size * sizeof(struct mem_cgroup_threshold),
+			GFP_KERNEL);
+	if (!thresholds_new) {
+		ret = -ENOMEM;
+		goto unlock;
+	}
+	thresholds_new->size = size;
+
+	/* Copy thresholds (if any) to new array */
+	if (thresholds)
+		memcpy(thresholds_new->entries, thresholds->entries,
+				thresholds->size *
+				sizeof(struct mem_cgroup_threshold));
+	/* Add new threshold */
+	thresholds_new->entries[size - 1].eventfd = eventfd;
+	thresholds_new->entries[size - 1].threshold = threshold;
+
+	/* Sort thresholds. Registering of new threshold isn't time-critical */
+	sort(thresholds_new->entries, size,
+			sizeof(struct mem_cgroup_threshold),
+			compare_thresholds, NULL);
+
+	/* Find current threshold */
+	atomic_set(&thresholds_new->current_threshold, -1);
+	for(i = 0; i < size; i++) {
+		if (thresholds_new->entries[i].threshold < usage) {
+			/*
+			 * thresholds_new->current_threshold will not be used
+			 * until rcu_assign_pointer(), so it's safe to increment
+			 * it here.
+			 */
+			atomic_inc(&thresholds_new->current_threshold);
+		}
+	}
+
+	/*
+	 * We need to increment refcnt to be sure that all thresholds
+	 * will be unregistered before calling __mem_cgroup_free()
+	 */
+	mem_cgroup_get(memcg);
+
+	if (type == _MEM)
+		rcu_assign_pointer(memcg->thresholds, thresholds_new);
+	else
+		rcu_assign_pointer(memcg->memsw_thresholds, thresholds_new);
+
+	/* To be sure that nobody uses thresholds before freeing it */
+	synchronize_rcu();
+
+	kfree(thresholds);
+unlock:
+	mutex_unlock(&memcg->thresholds_lock);
+
+	return ret;
+}
+
+static int mem_cgroup_unregister_event(struct cgroup *cgrp, struct cftype *cft,
+		struct eventfd_ctx *eventfd)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+	struct mem_cgroup_threshold_ary *thresholds, *thresholds_new;
+	int type = MEMFILE_TYPE(cft->private);
+	u64 usage;
+	int size = 0;
+	int i, j, ret;
+
+	mutex_lock(&memcg->thresholds_lock);
+	if (type == _MEM)
+		thresholds = memcg->thresholds;
+	else if (type == _MEMSWAP)
+		thresholds = memcg->memsw_thresholds;
+	else
+		BUG();
+
+	/*
+	 * Something went wrong if we are trying to unregister a threshold
+	 * when we don't have any thresholds registered.
+	 */
+	BUG_ON(!thresholds);
+
+	usage = mem_cgroup_usage(memcg, type == _MEMSWAP);
+
+	/* Check if a threshold crossed before removing */
+	__mem_cgroup_threshold(memcg, type == _MEMSWAP);
+
+	/* Calculate new number of thresholds */
+	for(i = 0; i < thresholds->size; i++) {
+		if (thresholds->entries[i].eventfd != eventfd)
+			size++;
+	}
+
+	/* Set thresholds array to NULL if we don't have thresholds */
+	if (!size) {
+		thresholds_new = NULL;
+		goto assign;
+	}
+
+	/* Allocate memory for new array of thresholds */
+	thresholds_new = kmalloc(sizeof(*thresholds_new) +
+			size * sizeof(struct mem_cgroup_threshold),
+			GFP_KERNEL);
+	if (!thresholds_new) {
+		ret = -ENOMEM;
+		goto unlock;
+	}
+	thresholds_new->size = size;
+
+	/* Copy thresholds and find current threshold */
+	atomic_set(&thresholds_new->current_threshold, -1);
+	for(i = 0, j = 0; i < thresholds->size; i++) {
+		if (thresholds->entries[i].eventfd == eventfd)
+			continue;
+
+		thresholds_new->entries[j] = thresholds->entries[i];
+		if (thresholds_new->entries[j].threshold < usage) {
+			/*
+			 * thresholds_new->current_threshold will not be used
+			 * until rcu_assign_pointer(), so it's safe to increment
+			 * it here.
+			 */
+			atomic_inc(&thresholds_new->current_threshold);
+		}
+		j++;
+	}
+
+assign:
+	if (type == _MEM)
+		rcu_assign_pointer(memcg->thresholds, thresholds_new);
+	else
+		rcu_assign_pointer(memcg->memsw_thresholds, thresholds_new);
+
+	/* To be sure that nobody uses thresholds before freeing it */
+	synchronize_rcu();
+
+	for(i = 0; i < thresholds->size - size; i++)
+		mem_cgroup_put(memcg);
+
+	kfree(thresholds);
+unlock:
+	mutex_unlock(&memcg->thresholds_lock);
+
+	return ret;
+}
 
 static struct cftype mem_cgroup_files[] = {
 	{
 		.name = "usage_in_bytes",
 		.private = MEMFILE_PRIVATE(_MEM, RES_USAGE),
 		.read_u64 = mem_cgroup_read,
+		.register_event = mem_cgroup_register_event,
+		.unregister_event = mem_cgroup_unregister_event,
 	},
 	{
 		.name = "max_usage_in_bytes",
@@ -3121,6 +3430,8 @@ static struct cftype memsw_cgroup_files[] = {
 		.name = "memsw.usage_in_bytes",
 		.private = MEMFILE_PRIVATE(_MEMSWAP, RES_USAGE),
 		.read_u64 = mem_cgroup_read,
+		.register_event = mem_cgroup_register_event,
+		.unregister_event = mem_cgroup_unregister_event,
 	},
 	{
 		.name = "memsw.max_usage_in_bytes",
@@ -3360,6 +3671,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 	if (parent)
 		mem->swappiness = get_swappiness(parent);
 	atomic_set(&mem->refcnt, 1);
+	mutex_init(&mem->thresholds_lock);
 	return &mem->css;
 free_out:
 	__mem_cgroup_free(mem);
-- 
1.6.5.7


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH v5 3/4] memcg: rework usage of stats by soft limit
  2009-12-30 15:57     ` [PATCH v5 3/4] memcg: rework usage of stats by soft limit Kirill A. Shutemov
  2009-12-30 15:57       ` [PATCH v5 4/4] memcg: implement memory thresholds Kirill A. Shutemov
@ 2010-01-03 23:56       ` KAMEZAWA Hiroyuki
  1 sibling, 0 replies; 13+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-01-03 23:56 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: containers, linux-mm, Paul Menage, Li Zefan, Andrew Morton,
	Balbir Singh, Pavel Emelyanov, Dan Malek, Vladislav Buzov,
	Daisuke Nishimura, Alexander Shishkin, linux-kernel

On Wed, 30 Dec 2009 17:57:58 +0200
"Kirill A. Shutemov" <kirill@shutemov.name> wrote:

> Instead of incrementing counter on each page in/out and comparing it
> with constant, we set counter to constant, decrement counter on each
> page in/out and compare it with zero. We want to make comparing as fast
> as possible. On many RISC systems (probably not only RISC) comparing
> with zero is more effective than comparing with a constant, since not
> every constant can be immediate operand for compare instruction.
> 
> Also, I've renamed MEM_CGROUP_STAT_EVENTS to MEM_CGROUP_STAT_SOFTLIMIT,
> since really it's not a generic counter.
> 
> Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
> 
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

> ---
> 
> KAMEZAWA-san, I've changed the patch a bit. Can I reuse your Acked-by?
> 
> ---
>  mm/memcontrol.c |   18 ++++++++++--------
>  1 files changed, 10 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 1d71cb4..c36d4f3 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -69,8 +69,9 @@ enum mem_cgroup_stat_index {
>  	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
>  	MEM_CGROUP_STAT_PGPGIN_COUNT,	/* # of pages paged in */
>  	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
> -	MEM_CGROUP_STAT_EVENTS,	/* sum of pagein + pageout for internal use */
>  	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> +	MEM_CGROUP_STAT_SOFTLIMIT, /* decrements on each page in/out.
> +					used by soft limit implementation */
>  
>  	MEM_CGROUP_STAT_NSTATS,
>  };
> @@ -84,10 +85,10 @@ struct mem_cgroup_stat {
>  };
>  
>  static inline void
> -__mem_cgroup_stat_reset_safe(struct mem_cgroup_stat_cpu *stat,
> -				enum mem_cgroup_stat_index idx)
> +__mem_cgroup_stat_set_safe(struct mem_cgroup_stat_cpu *stat,
> +				enum mem_cgroup_stat_index idx, s64 val)
>  {
> -	stat->count[idx] = 0;
> +	stat->count[idx] = val;
>  }
>  
>  static inline s64
> @@ -380,9 +381,10 @@ static bool mem_cgroup_soft_limit_check(struct mem_cgroup *mem)
>  
>  	cpu = get_cpu();
>  	cpustat = &mem->stat.cpustat[cpu];
> -	val = __mem_cgroup_stat_read_local(cpustat, MEM_CGROUP_STAT_EVENTS);
> -	if (unlikely(val > SOFTLIMIT_EVENTS_THRESH)) {
> -		__mem_cgroup_stat_reset_safe(cpustat, MEM_CGROUP_STAT_EVENTS);
> +	val = __mem_cgroup_stat_read_local(cpustat, MEM_CGROUP_STAT_SOFTLIMIT);
> +	if (unlikely(val < 0)) {
> +		__mem_cgroup_stat_set_safe(cpustat, MEM_CGROUP_STAT_SOFTLIMIT,
> +				SOFTLIMIT_EVENTS_THRESH);
>  		ret = true;
>  	}
>  	put_cpu();
> @@ -515,7 +517,7 @@ static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
>  	else
>  		__mem_cgroup_stat_add_safe(cpustat,
>  				MEM_CGROUP_STAT_PGPGOUT_COUNT, 1);
> -	__mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_EVENTS, 1);
> +	__mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_SOFTLIMIT, -1);
>  	put_cpu();
>  }
>  
> -- 
> 1.6.5.7
> 
> 


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v5 4/4] memcg: implement memory thresholds
  2009-12-30 15:57       ` [PATCH v5 4/4] memcg: implement memory thresholds Kirill A. Shutemov
@ 2010-01-04  0:00         ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 13+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-01-04  0:00 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: containers, linux-mm, Paul Menage, Li Zefan, Andrew Morton,
	Balbir Singh, Pavel Emelyanov, Dan Malek, Vladislav Buzov,
	Daisuke Nishimura, Alexander Shishkin, linux-kernel

On Wed, 30 Dec 2009 17:57:59 +0200
"Kirill A. Shutemov" <kirill@shutemov.name> wrote:

> It allows to register multiple memory and memsw thresholds and gets
> notifications when it crosses.
> 
> To register a threshold application need:
> - create an eventfd;
> - open memory.usage_in_bytes or memory.memsw.usage_in_bytes;
> - write string like "<event_fd> <memory.usage_in_bytes> <threshold>" to
>   cgroup.event_control.
> 
> Application will be notified through eventfd when memory usage crosses
> threshold in any direction.
> 
> It's applicable for root and non-root cgroup.
> 
> It uses stats to track memory usage, simmilar to soft limits. It checks
> if we need to send event to userspace on every 100 page in/out. I guess
> it's good compromise between performance and accuracy of thresholds.
> 
> Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>

A changelog per patch is recommended, but OK. Much easier to read than
the previous version. Thank you!

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

> ---
>  Documentation/cgroups/memory.txt |   19 +++-
>  mm/memcontrol.c                  |  312 ++++++++++++++++++++++++++++++++++++++
>  2 files changed, 330 insertions(+), 1 deletions(-)
> 
> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
> index b871f25..195af07 100644
> --- a/Documentation/cgroups/memory.txt
> +++ b/Documentation/cgroups/memory.txt
> @@ -414,7 +414,24 @@ NOTE1: Soft limits take effect over a long period of time, since they involve
>  NOTE2: It is recommended to set the soft limit always below the hard limit,
>         otherwise the hard limit will take precedence.
>  
> -8. TODO
> +8. Memory thresholds
> +
> +Memory controler implements memory thresholds using cgroups notification
> +API (see cgroups.txt). It allows to register multiple memory and memsw
> +thresholds and gets notifications when it crosses.
> +
> +To register a threshold application need:
> + - create an eventfd using eventfd(2);
> + - open memory.usage_in_bytes or memory.memsw.usage_in_bytes;
> + - write string like "<event_fd> <memory.usage_in_bytes> <threshold>" to
> +   cgroup.event_control.
> +
> +Application will be notified through eventfd when memory usage crosses
> +threshold in any direction.
> +
> +It's applicable for root and non-root cgroup.
> +
> +9. TODO
>  
>  1. Add support for accounting huge pages (as a separate controller)
>  2. Make per-cgroup scanner reclaim not-shared pages first
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index c36d4f3..5d4bd0b 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6,6 +6,10 @@
>   * Copyright 2007 OpenVZ SWsoft Inc
>   * Author: Pavel Emelianov <xemul@openvz.org>
>   *
> + * Memory thresholds
> + * Copyright (C) 2009 Nokia Corporation
> + * Author: Kirill A. Shutemov
> + *
>   * This program is free software; you can redistribute it and/or modify
>   * it under the terms of the GNU General Public License as published by
>   * the Free Software Foundation; either version 2 of the License, or
> @@ -39,6 +43,8 @@
>  #include <linux/mm_inline.h>
>  #include <linux/page_cgroup.h>
>  #include <linux/cpu.h>
> +#include <linux/eventfd.h>
> +#include <linux/sort.h>
>  #include "internal.h"
>  
>  #include <asm/uaccess.h>
> @@ -56,6 +62,7 @@ static int really_do_swap_account __initdata = 1; /* for remember boot option*/
>  #endif
>  
>  #define SOFTLIMIT_EVENTS_THRESH (1000)
> +#define THRESHOLDS_EVENTS_THRESH (100)
>  
>  /*
>   * Statistics for memory cgroup.
> @@ -72,6 +79,8 @@ enum mem_cgroup_stat_index {
>  	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
>  	MEM_CGROUP_STAT_SOFTLIMIT, /* decrements on each page in/out.
>  					used by soft limit implementation */
> +	MEM_CGROUP_STAT_THRESHOLDS, /* decrements on each page in/out.
> +					used by threshold implementation */
>  
>  	MEM_CGROUP_STAT_NSTATS,
>  };
> @@ -175,6 +184,23 @@ struct mem_cgroup_tree {
>  
>  static struct mem_cgroup_tree soft_limit_tree __read_mostly;
>  
> +struct mem_cgroup_threshold {
> +	struct eventfd_ctx *eventfd;
> +	u64 threshold;
> +};
> +
> +struct mem_cgroup_threshold_ary {
> +	/* An array index points to threshold just below usage. */
> +	atomic_t current_threshold;
> +	/* Size of entries[] */
> +	unsigned int size;
> +	/* Array of thresholds */
> +	struct mem_cgroup_threshold entries[0];
> +};
> +
> +static bool mem_cgroup_threshold_check(struct mem_cgroup* mem);
> +static void mem_cgroup_threshold(struct mem_cgroup* mem);
> +
>  /*
>   * The memory controller data structure. The memory controller controls both
>   * page cache and RSS per cgroup. We would eventually like to provide
> @@ -226,6 +252,15 @@ struct mem_cgroup {
>  	/* set when res.limit == memsw.limit */
>  	bool		memsw_is_minimum;
>  
> +	/* protect arrays of thresholds */
> +	struct mutex thresholds_lock;
> +
> +	/* thresholds for memory usage. RCU-protected */
> +	struct mem_cgroup_threshold_ary *thresholds;
> +
> +	/* thresholds for mem+swap usage. RCU-protected */
> +	struct mem_cgroup_threshold_ary *memsw_thresholds;
> +
>  	/*
>  	 * statistics. This must be placed at the end of memcg.
>  	 */
> @@ -518,6 +553,8 @@ static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
>  		__mem_cgroup_stat_add_safe(cpustat,
>  				MEM_CGROUP_STAT_PGPGOUT_COUNT, 1);
>  	__mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_SOFTLIMIT, -1);
> +	__mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_THRESHOLDS, -1);
> +
>  	put_cpu();
>  }
>  
> @@ -1503,6 +1540,8 @@ charged:
>  	if (mem_cgroup_soft_limit_check(mem))
>  		mem_cgroup_update_tree(mem, page);
>  done:
> +	if (mem_cgroup_threshold_check(mem))
> +		mem_cgroup_threshold(mem);
>  	return 0;
>  nomem:
>  	css_put(&mem->css);
> @@ -2068,6 +2107,8 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
>  
>  	if (mem_cgroup_soft_limit_check(mem))
>  		mem_cgroup_update_tree(mem, page);
> +	if (mem_cgroup_threshold_check(mem))
> +		mem_cgroup_threshold(mem);
>  	/* at swapout, this memcg will be accessed to record to swap */
>  	if (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
>  		css_put(&mem->css);
> @@ -3064,12 +3105,280 @@ static int mem_cgroup_swappiness_write(struct cgroup *cgrp, struct cftype *cft,
>  	return 0;
>  }
>  
> +static bool mem_cgroup_threshold_check(struct mem_cgroup *mem)
> +{
> +	bool ret = false;
> +	int cpu;
> +	s64 val;
> +	struct mem_cgroup_stat_cpu *cpustat;
> +
> +	cpu = get_cpu();
> +	cpustat = &mem->stat.cpustat[cpu];
> +	val = __mem_cgroup_stat_read_local(cpustat, MEM_CGROUP_STAT_THRESHOLDS);
> +	if (unlikely(val < 0)) {
> +		__mem_cgroup_stat_set_safe(cpustat, MEM_CGROUP_STAT_THRESHOLDS,
> +				THRESHOLDS_EVENTS_THRESH);
> +		ret = true;
> +	}
> +	put_cpu();
> +	return ret;
> +}
> +
> +static void __mem_cgroup_threshold(struct mem_cgroup *memcg, bool swap)
> +{
> +	struct mem_cgroup_threshold_ary *t;
> +	u64 usage;
> +	int i;
> +
> +	rcu_read_lock();
> +	if (!swap) {
> +		t = rcu_dereference(memcg->thresholds);
> +	} else {
> +		t = rcu_dereference(memcg->memsw_thresholds);
> +	}
> +
> +	if (!t)
> +		goto unlock;
> +
> +	usage = mem_cgroup_usage(memcg, swap);
> +
> +	/*
> +	 * current_threshold points to threshold just below usage.
> +	 * If it's not true, a threshold was crossed after last
> +	 * call of __mem_cgroup_threshold().
> +	 */
> +	i = atomic_read(&t->current_threshold);
> +
> +	/*
> +	 * Iterate backward over array of thresholds starting from
> +	 * current_threshold and check if a threshold is crossed.
> +	 * If none of thresholds below usage is crossed, we read
> +	 * only one element of the array here.
> +	 */
> +	for(; i > 0 && unlikely(t->entries[i].threshold > usage); i--) {
> +		eventfd_signal(t->entries[i].eventfd, 1);
> +	}
> +
> +	/* i = current_threshold + 1 */
> +	i++;
> +
> +	/*
> +	 * Iterate forward over array of thresholds starting from
> +	 * current_threshold+1 and check if a threshold is crossed.
> +	 * If none of thresholds above usage is crossed, we read
> +	 * only one element of the array here.
> +	 */
> +	for(; i < t->size && unlikely(t->entries[i].threshold <= usage); i++) {
> +		eventfd_signal(t->entries[i].eventfd, 1);
> +	}
> +
> +	/* Update current_threshold */
> +	atomic_set(&t->current_threshold, i - 1);
> +unlock:
> +	rcu_read_unlock();
> +}
> +
> +static void mem_cgroup_threshold(struct mem_cgroup *memcg)
> +{
> +	__mem_cgroup_threshold(memcg, false);
> +	if (do_swap_account)
> +		__mem_cgroup_threshold(memcg, true);
> +}
> +
> +static int compare_thresholds(const void *a, const void *b)
> +{
> +	const struct mem_cgroup_threshold *_a = a;
> +	const struct mem_cgroup_threshold *_b = b;
> +
> +	return _a->threshold - _b->threshold;
> +}
> +
> +static int mem_cgroup_register_event(struct cgroup *cgrp, struct cftype *cft,
> +		struct eventfd_ctx *eventfd, const char *args)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> +	struct mem_cgroup_threshold_ary *thresholds, *thresholds_new;
> +	int type = MEMFILE_TYPE(cft->private);
> +	u64 threshold, usage;
> +	int size;
> +	int i, ret;
> +
> +	ret = res_counter_memparse_write_strategy(args, &threshold);
> +	if (ret)
> +		return ret;
> +
> +	mutex_lock(&memcg->thresholds_lock);
> +	if (type == _MEM)
> +		thresholds = memcg->thresholds;
> +	else if (type == _MEMSWAP)
> +		thresholds = memcg->memsw_thresholds;
> +	else
> +		BUG();
> +
> +	usage = mem_cgroup_usage(memcg, type == _MEMSWAP);
> +
> +	/* Check if a threshold crossed before adding a new one */
> +	if (thresholds)
> +		__mem_cgroup_threshold(memcg, type == _MEMSWAP);
> +
> +	if (thresholds)
> +		size = thresholds->size + 1;
> +	else
> +		size = 1;
> +
> +	/* Allocate memory for new array of thresholds */
> +	thresholds_new = kmalloc(sizeof(*thresholds_new) +
> +			size * sizeof(struct mem_cgroup_threshold),
> +			GFP_KERNEL);
> +	if (!thresholds_new) {
> +		ret = -ENOMEM;
> +		goto unlock;
> +	}
> +	thresholds_new->size = size;
> +
> +	/* Copy thresholds (if any) to new array */
> +	if (thresholds)
> +		memcpy(thresholds_new->entries, thresholds->entries,
> +				thresholds->size *
> +				sizeof(struct mem_cgroup_threshold));
> +	/* Add new threshold */
> +	thresholds_new->entries[size - 1].eventfd = eventfd;
> +	thresholds_new->entries[size - 1].threshold = threshold;
> +
> +	/* Sort thresholds. Registering of new threshold isn't time-critical */
> +	sort(thresholds_new->entries, size,
> +			sizeof(struct mem_cgroup_threshold),
> +			compare_thresholds, NULL);
> +
> +	/* Find current threshold */
> +	atomic_set(&thresholds_new->current_threshold, -1);
> +	for(i = 0; i < size; i++) {
> +		if (thresholds_new->entries[i].threshold < usage) {
> +			/*
> +			 * thresholds_new->current_threshold will not be used
> +			 * until rcu_assign_pointer(), so it's safe to increment
> +			 * it here.
> +			 */
> +			atomic_inc(&thresholds_new->current_threshold);
> +		}
> +	}
> +
> +	/*
> +	 * We need to increment refcnt to be sure that all thresholds
> +	 * will be unregistered before calling __mem_cgroup_free()
> +	 */
> +	mem_cgroup_get(memcg);
> +
> +	if (type == _MEM)
> +		rcu_assign_pointer(memcg->thresholds, thresholds_new);
> +	else
> +		rcu_assign_pointer(memcg->memsw_thresholds, thresholds_new);
> +
> +	/* To be sure that nobody uses thresholds before freeing it */
> +	synchronize_rcu();
> +
> +	kfree(thresholds);
> +unlock:
> +	mutex_unlock(&memcg->thresholds_lock);
> +
> +	return ret;
> +}
> +
> +static int mem_cgroup_unregister_event(struct cgroup *cgrp, struct cftype *cft,
> +		struct eventfd_ctx *eventfd)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> +	struct mem_cgroup_threshold_ary *thresholds, *thresholds_new;
> +	int type = MEMFILE_TYPE(cft->private);
> +	u64 usage;
> +	int size = 0;
> +	int i, j, ret;
> +
> +	mutex_lock(&memcg->thresholds_lock);
> +	if (type == _MEM)
> +		thresholds = memcg->thresholds;
> +	else if (type == _MEMSWAP)
> +		thresholds = memcg->memsw_thresholds;
> +	else
> +		BUG();
> +
> +	/*
> +	 * Something went wrong if we trying to unregister a threshold
> +	 * if we don't have thresholds
> +	 */
> +	BUG_ON(!thresholds);
> +
> +	usage = mem_cgroup_usage(memcg, type == _MEMSWAP);
> +
> +	/* Check if a threshold crossed before removing */
> +	__mem_cgroup_threshold(memcg, type == _MEMSWAP);
> +
> +	/* Calculate new number of threshold */
> +	for(i = 0; i < thresholds->size; i++) {
> +		if (thresholds->entries[i].eventfd != eventfd)
> +			size++;
> +	}
> +
> +	/* Set thresholds array to NULL if we don't have thresholds */
> +	if (!size) {
> +		thresholds_new = NULL;
> +		goto assign;
> +	}
> +
> +	/* Allocate memory for new array of thresholds */
> +	thresholds_new = kmalloc(sizeof(*thresholds_new) +
> +			size * sizeof(struct mem_cgroup_threshold),
> +			GFP_KERNEL);
> +	if (!thresholds_new) {
> +		ret = -ENOMEM;
> +		goto unlock;
> +	}
> +	thresholds_new->size = size;
> +
> +	/* Copy thresholds and find current threshold */
> +	atomic_set(&thresholds_new->current_threshold, -1);
> +	for(i = 0, j = 0; i < thresholds->size; i++) {
> +		if (thresholds->entries[i].eventfd == eventfd)
> +			continue;
> +
> +		thresholds_new->entries[j] = thresholds->entries[i];
> +		if (thresholds_new->entries[j].threshold < usage) {
> +			/*
> +			 * thresholds_new->current_threshold will not be used
> +			 * until rcu_assign_pointer(), so it's safe to increment
> +			 * it here.
> +			 */
> +			atomic_inc(&thresholds_new->current_threshold);
> +		}
> +		j++;
> +	}
> +
> +assign:
> +	if (type == _MEM)
> +		rcu_assign_pointer(memcg->thresholds, thresholds_new);
> +	else
> +		rcu_assign_pointer(memcg->memsw_thresholds, thresholds_new);
> +
> +	/* To be sure that nobody uses thresholds before freeing it */
> +	synchronize_rcu();
> +
> +	for(i = 0; i < thresholds->size - size; i++)
> +		mem_cgroup_put(memcg);
> +
> +	kfree(thresholds);
> +unlock:
> +	mutex_unlock(&memcg->thresholds_lock);
> +
> +	return ret;
> +}
>  
>  static struct cftype mem_cgroup_files[] = {
>  	{
>  		.name = "usage_in_bytes",
>  		.private = MEMFILE_PRIVATE(_MEM, RES_USAGE),
>  		.read_u64 = mem_cgroup_read,
> +		.register_event = mem_cgroup_register_event,
> +		.unregister_event = mem_cgroup_unregister_event,
>  	},
>  	{
>  		.name = "max_usage_in_bytes",
> @@ -3121,6 +3430,8 @@ static struct cftype memsw_cgroup_files[] = {
>  		.name = "memsw.usage_in_bytes",
>  		.private = MEMFILE_PRIVATE(_MEMSWAP, RES_USAGE),
>  		.read_u64 = mem_cgroup_read,
> +		.register_event = mem_cgroup_register_event,
> +		.unregister_event = mem_cgroup_unregister_event,
>  	},
>  	{
>  		.name = "memsw.max_usage_in_bytes",
> @@ -3360,6 +3671,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>  	if (parent)
>  		mem->swappiness = get_swappiness(parent);
>  	atomic_set(&mem->refcnt, 1);
> +	mutex_init(&mem->thresholds_lock);
>  	return &mem->css;
>  free_out:
>  	__mem_cgroup_free(mem);
> -- 
> 1.6.5.7
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v5 0/4] cgroup notifications API and memory thresholds
  2009-12-30 15:57 [PATCH v5 0/4] cgroup notifications API and memory thresholds Kirill A. Shutemov
  2009-12-30 15:57 ` [PATCH v5 1/4] cgroup: implement eventfd-based generic API for notifications Kirill A. Shutemov
@ 2010-01-04  0:36 ` Balbir Singh
  2010-01-04 10:15   ` Kirill A. Shutemov
  1 sibling, 1 reply; 13+ messages in thread
From: Balbir Singh @ 2010-01-04  0:36 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: containers, linux-mm, Paul Menage, Li Zefan, Andrew Morton,
	KAMEZAWA Hiroyuki, Pavel Emelyanov, Dan Malek, Vladislav Buzov,
	Daisuke Nishimura, Alexander Shishkin, linux-kernel

* Kirill A. Shutemov <kirill@shutemov.name> [2009-12-30 17:57:55]:

> This patchset introduces eventfd-based API for notifications in cgroups and
> implements memory notifications on top of it.
> 
> It uses statistics in memory controler to track memory usage.
> 
> Output of time(1) on building kernel on tmpfs:
> 
> Root cgroup before changes:
> 	make -j2  506.37 user 60.93s system 193% cpu 4:52.77 total
> Non-root cgroup before changes:
> 	make -j2  507.14 user 62.66s system 193% cpu 4:54.74 total
> Root cgroup after changes (0 thresholds):
> 	make -j2  507.13 user 62.20s system 193% cpu 4:53.55 total
> Non-root cgroup after changes (0 thresholds):
> 	make -j2  507.70 user 64.20s system 193% cpu 4:55.70 total
> Root cgroup after changes (1 thresholds, never crossed):
> 	make -j2  506.97 user 62.20s system 193% cpu 4:53.90 total
> Non-root cgroup after changes (1 thresholds, never crossed):
> 	make -j2  507.55 user 64.08s system 193% cpu 4:55.63 total
> 
> Any comments?

Hi,

I just saw that the notification works for me using the tool you
supplied. One strange thing was that I got notified even though
the amount of data I was using was decreasing, so I hit the
notification both ways:

        +------------+-----------
                    1G
                ----> (got notified on increase)
                <---- (got notified on decrease)

I am not against the behaviour, but it can be confusing unless the
direction of the crossing is made clear along with the event.

-- 
	Balbir

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v5 0/4] cgroup notifications API and memory thresholds
  2010-01-04  0:36 ` [PATCH v5 0/4] cgroup notifications API and memory thresholds Balbir Singh
@ 2010-01-04 10:15   ` Kirill A. Shutemov
  0 siblings, 0 replies; 13+ messages in thread
From: Kirill A. Shutemov @ 2010-01-04 10:15 UTC (permalink / raw)
  To: balbir
  Cc: containers, linux-mm, Paul Menage, Li Zefan, Andrew Morton,
	KAMEZAWA Hiroyuki, Pavel Emelyanov, Dan Malek, Vladislav Buzov,
	Daisuke Nishimura, Alexander Shishkin, linux-kernel

On Mon, Jan 4, 2010 at 2:36 AM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> * Kirill A. Shutemov <kirill@shutemov.name> [2009-12-30 17:57:55]:
>
>> This patchset introduces eventfd-based API for notifications in cgroups and
>> implements memory notifications on top of it.
>>
>> It uses statistics in memory controler to track memory usage.
>>
>> Output of time(1) on building kernel on tmpfs:
>>
>> Root cgroup before changes:
>>       make -j2  506.37 user 60.93s system 193% cpu 4:52.77 total
>> Non-root cgroup before changes:
>>       make -j2  507.14 user 62.66s system 193% cpu 4:54.74 total
>> Root cgroup after changes (0 thresholds):
>>       make -j2  507.13 user 62.20s system 193% cpu 4:53.55 total
>> Non-root cgroup after changes (0 thresholds):
>>       make -j2  507.70 user 64.20s system 193% cpu 4:55.70 total
>> Root cgroup after changes (1 thresholds, never crossed):
>>       make -j2  506.97 user 62.20s system 193% cpu 4:53.90 total
>> Non-root cgroup after changes (1 thresholds, never crossed):
>>       make -j2  507.55 user 64.08s system 193% cpu 4:55.63 total
>>
>> Any comments?
>
> Hi,
>
> I just saw that the notification work for me using the tool you
> supplied. One strange thing was that I got notified even though
> the amount of data I was using was reducing, so I hit the notification
> two ways
>
>        +------------+-----------
>                    1G
>                ----> (got notified on increase)
>                <---- (got notified on decrease)
>
> I am not against the behaviour, but it can be confusing unless
> clarified with the event.

IIUC, you've got two events: one on crossing the threshold upward and
one on crossing it downward. That's OK.

By design, a notification means that the threshold was probably crossed
(it can be crossed twice -- up and down) or that the cgroup was removed.
To understand what really happened, you need to check whether the cgroup
still exists and read the current usage of the resource.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v5 1/4] cgroup: implement eventfd-based generic API for  notifications
  2009-12-30 15:57 ` [PATCH v5 1/4] cgroup: implement eventfd-based generic API for notifications Kirill A. Shutemov
  2009-12-30 15:57   ` [PATCH v5 2/4] memcg: extract mem_group_usage() from mem_cgroup_read() Kirill A. Shutemov
@ 2010-01-07  1:01   ` Paul Menage
  2010-01-07 12:36     ` Kirill A. Shutemov
  1 sibling, 1 reply; 13+ messages in thread
From: Paul Menage @ 2010-01-07  1:01 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: containers, linux-mm, Li Zefan, Andrew Morton, KAMEZAWA Hiroyuki,
	Balbir Singh, Pavel Emelyanov, Dan Malek, Vladislav Buzov,
	Daisuke Nishimura, Alexander Shishkin, linux-kernel

On Wed, Dec 30, 2009 at 7:57 AM, Kirill A. Shutemov
<kirill@shutemov.name> wrote:
> This patch introduces write-only file "cgroup.event_control" in every
> cgroup.

This looks like a nice generic API for doing event notifications - thanks!

Sorry I hadn't had a chance to review it before now, due to travelling
and day-job pressures.


> +}
> +
> +static int cgroup_event_wake(wait_queue_t *wait, unsigned mode,
> +               int sync, void *key)

Maybe some comments here indicating how/when it gets called? (And more
comments for each function generally?)

> +       if (flags & POLLHUP) {
> +               spin_lock(&cgrp->event_list_lock);
> +               list_del(&event->list);
> +               spin_unlock(&cgrp->event_list_lock);
> +               schedule_work(&event->remove);

Comment saying why we can't do the remove immediately in this context?

> +
> +fail:
> +       if (!IS_ERR(cfile))
> +               fput(cfile);

cfile is either valid or NULL - it never contains an error value.

> +
> +       if (!IS_ERR(efile))
> +               fput(efile);

While this is OK currently, it's a bit fragile. efile starts as NULL,
and IS_ERR(NULL) is false. So if we jump to fail: before trying to do
the eventfd_fget() then we'll try to fput(NULL), which will oops. This
works because we don't currently jump to fail: until after
eventfd_fget(), but someone could add an extra setup step between the
kzalloc() and the eventfd_fget() which could fail.

Paul

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v5 1/4] cgroup: implement eventfd-based generic API for  notifications
  2010-01-07  1:01   ` [PATCH v5 1/4] cgroup: implement eventfd-based generic API for notifications Paul Menage
@ 2010-01-07 12:36     ` Kirill A. Shutemov
  2010-01-08  0:55       ` Li Zefan
  0 siblings, 1 reply; 13+ messages in thread
From: Kirill A. Shutemov @ 2010-01-07 12:36 UTC (permalink / raw)
  To: Paul Menage
  Cc: containers, linux-mm, Li Zefan, Andrew Morton, KAMEZAWA Hiroyuki,
	Balbir Singh, Pavel Emelyanov, Dan Malek, Vladislav Buzov,
	Daisuke Nishimura, Alexander Shishkin, linux-kernel

On Thu, Jan 7, 2010 at 3:01 AM, Paul Menage <menage@google.com> wrote:
> On Wed, Dec 30, 2009 at 7:57 AM, Kirill A. Shutemov
> <kirill@shutemov.name> wrote:
>> +
>> +       if (!IS_ERR(efile))
>> +               fput(efile);
>
> While this is OK currently, it's a bit fragile. efile starts as NULL,
> and IS_ERR(NULL) is false. So if we jump to fail: before trying to do
> the eventfd_fget() then we'll try to fput(NULL), which will oops. This
> works because we don't currently jump to fail: until after
> eventfd_fget(), but someone could add an extra setup step between the
> kzalloc() and the eventfd_fget() which could fail.

So we need to use IS_ERR_OR_NULL() here instead of IS_ERR(), don't we?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v5 1/4] cgroup: implement eventfd-based generic API for notifications
  2010-01-07 12:36     ` Kirill A. Shutemov
@ 2010-01-08  0:55       ` Li Zefan
  2010-01-08  1:05         ` Paul Menage
  0 siblings, 1 reply; 13+ messages in thread
From: Li Zefan @ 2010-01-08  0:55 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Paul Menage, containers, linux-mm, Andrew Morton,
	KAMEZAWA Hiroyuki, Balbir Singh, Pavel Emelyanov, Dan Malek,
	Vladislav Buzov, Daisuke Nishimura, Alexander Shishkin,
	linux-kernel

Kirill A. Shutemov wrote:
> On Thu, Jan 7, 2010 at 3:01 AM, Paul Menage <menage@google.com> wrote:
>> On Wed, Dec 30, 2009 at 7:57 AM, Kirill A. Shutemov
>> <kirill@shutemov.name> wrote:
>>> +
>>> +       if (!IS_ERR(efile))
>>> +               fput(efile);
>> While this is OK currently, it's a bit fragile. efile starts as NULL,
>> and IS_ERR(NULL) is false. So if we jump to fail: before trying to do
>> the eventfd_fget() then we'll try to fput(NULL), which will oops. This
>> works because we don't currently jump to fail: until after
>> eventfd_fget(), but someone could add an extra setup step between the
>> kzalloc() and the eventfd_fget() which could fail.
> 
> So we need to use IS_ERR_OR_NULL here instread of IS_ERR, don't we?
> 

Using multiple labels is much better:

label4:
	fput(cfile);
label3:
	eventfd_ctx_put(event->eventfd);
label2:
	fput(efile);
label1:
	kfree(event);

compared to:

+fail:
+	if (!IS_ERR(cfile))
+		fput(cfile);
+
+	if (event && event->eventfd && !IS_ERR(event->eventfd))
+		eventfd_ctx_put(event->eventfd);
+
+	if (!IS_ERR(efile))
+		fput(efile);
+
+	kfree(event);



* Re: [PATCH v5 1/4] cgroup: implement eventfd-based generic API for notifications
  2010-01-08  0:55       ` Li Zefan
@ 2010-01-08  1:05         ` Paul Menage
  0 siblings, 0 replies; 13+ messages in thread
From: Paul Menage @ 2010-01-08  1:05 UTC (permalink / raw)
  To: Li Zefan
  Cc: Kirill A. Shutemov, containers, linux-mm, Andrew Morton,
	KAMEZAWA Hiroyuki, Balbir Singh, Pavel Emelyanov, Dan Malek,
	Vladislav Buzov, Daisuke Nishimura, Alexander Shishkin,
	linux-kernel

On Thu, Jan 7, 2010 at 4:55 PM, Li Zefan <lizf@cn.fujitsu.com> wrote:
>
> Using multiple labels is much better:

I disagree with that - in the absence of a language that can do proper
destructor-based cleanup (i.e. a strictly controlled subset of C++ :-)
) I think it's clearer to have a single failure path where you can
clean up anything that needs to be cleaned up, without excessive
dependencies on exactly when the failure occurred. Changes then become
less error-prone.

Paul

>
> label4:
>        fput(cfile);
> label3:
>        eventfd_ctx_put(event->eventfd);
> label2:
>        fput(efile);
> label1:
>        kfree(event);
>
> compared to:
>
> +fail:
> +       if (!IS_ERR(cfile))
> +               fput(cfile);
> +
> +       if (event && event->eventfd && !IS_ERR(event->eventfd))
> +               eventfd_ctx_put(event->eventfd);
> +
> +       if (!IS_ERR(efile))
> +               fput(efile);
> +
> +       kfree(event);
>


end of thread, other threads:[~2010-01-08  1:05 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-12-30 15:57 [PATCH v5 0/4] cgroup notifications API and memory thresholds Kirill A. Shutemov
2009-12-30 15:57 ` [PATCH v5 1/4] cgroup: implement eventfd-based generic API for notifications Kirill A. Shutemov
2009-12-30 15:57   ` [PATCH v5 2/4] memcg: extract mem_group_usage() from mem_cgroup_read() Kirill A. Shutemov
2009-12-30 15:57     ` [PATCH v5 3/4] memcg: rework usage of stats by soft limit Kirill A. Shutemov
2009-12-30 15:57       ` [PATCH v5 4/4] memcg: implement memory thresholds Kirill A. Shutemov
2010-01-04  0:00         ` KAMEZAWA Hiroyuki
2010-01-03 23:56       ` [PATCH v5 3/4] memcg: rework usage of stats by soft limit KAMEZAWA Hiroyuki
2010-01-07  1:01   ` [PATCH v5 1/4] cgroup: implement eventfd-based generic API for notifications Paul Menage
2010-01-07 12:36     ` Kirill A. Shutemov
2010-01-08  0:55       ` Li Zefan
2010-01-08  1:05         ` Paul Menage
2010-01-04  0:36 ` [PATCH v5 0/4] cgroup notifications API and memory thresholds Balbir Singh
2010-01-04 10:15   ` Kirill A. Shutemov
