linux-kernel.vger.kernel.org archive mirror
* [RFC][0/3] Virtual address space control for cgroups (v2)
@ 2008-03-26 18:49 Balbir Singh
  2008-03-26 18:50 ` [RFC][1/3] Add user interface for virtual address space control (v2) Balbir Singh
                   ` (3 more replies)
  0 siblings, 4 replies; 23+ messages in thread
From: Balbir Singh @ 2008-03-26 18:49 UTC (permalink / raw)
  To: Andrew Morton, Pavel Emelianov
  Cc: Hugh Dickins, Sudhir Kumar, YAMAMOTO Takashi, Paul Menage, lizf,
	linux-kernel, taka, linux-mm, David Rientjes, Balbir Singh,
	KAMEZAWA Hiroyuki

This is the second version of the virtual address space control patchset
for cgroups.  The patches are against 2.6.25-rc5-mm1 and have been tested on
top of User Mode Linux, both with and without the config enabled.

The first patch adds the user interface. The second patch adds accounting
and control. The third patch updates documentation.

The changelog in each patch documents what has changed in version 2.
The most important change is that virtual address space accounting is
now a config option.

Reviews, Comments?

series

memory-controller-virtual-address-space-control-user-interface
memory-controller-virtual-address-space-accounting-and-control
memory-controller-virtual-address-control-documentation



-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [RFC][1/3] Add user interface for virtual address space control (v2)
  2008-03-26 18:49 [RFC][0/3] Virtual address space control for cgroups (v2) Balbir Singh
@ 2008-03-26 18:50 ` Balbir Singh
  2008-03-27  9:14   ` KAMEZAWA Hiroyuki
  2008-03-26 18:50 ` [RFC][2/3] Account and control virtual address space allocations (v2) Balbir Singh
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 23+ messages in thread
From: Balbir Singh @ 2008-03-26 18:50 UTC (permalink / raw)
  To: Andrew Morton, Pavel Emelianov
  Cc: Hugh Dickins, Sudhir Kumar, YAMAMOTO Takashi, Paul Menage, lizf,
	linux-kernel, taka, linux-mm, David Rientjes, Balbir Singh,
	KAMEZAWA Hiroyuki



Add the as_usage_in_bytes and as_limit_in_bytes interfaces. These provide
control over the total address space that the processes in the cgroup,
taken together, can grow up to. This functionality is analogous to
RLIMIT_AS, as exposed through the getrlimit(2) and setrlimit(2) calls.
An as_res resource counter is added to the mem_cgroup structure; it
handles all the accounting and control associated with the virtual
address space of the cgroup.

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 init/Kconfig    |   10 ++++++++++
 mm/memcontrol.c |   39 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 49 insertions(+)

diff -puN mm/memcontrol.c~memory-controller-virtual-address-space-control-user-interface mm/memcontrol.c
--- linux-2.6.25-rc5/mm/memcontrol.c~memory-controller-virtual-address-space-control-user-interface	2008-03-26 16:00:42.000000000 +0530
+++ linux-2.6.25-rc5-balbir/mm/memcontrol.c	2008-03-26 16:07:56.000000000 +0530
@@ -127,6 +127,12 @@ struct mem_cgroup {
 	 * the counter to account for memory usage
 	 */
 	struct res_counter res;
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_AS
+	/*
+	 * Address space limits
+	 */
+	struct res_counter as_res;
+#endif
 	/*
 	 * Per cgroup active and inactive list, similar to the
 	 * per zone LRU lists.
@@ -870,6 +876,23 @@ static ssize_t mem_cgroup_write(struct c
 				mem_cgroup_write_strategy);
 }
 
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_AS
+static u64 mem_cgroup_as_read(struct cgroup *cont, struct cftype *cft)
+{
+	return res_counter_read_u64(&mem_cgroup_from_cont(cont)->as_res,
+				    cft->private);
+}
+
+static ssize_t mem_cgroup_as_write(struct cgroup *cont, struct cftype *cft,
+				struct file *file, const char __user *userbuf,
+				size_t nbytes, loff_t *ppos)
+{
+	return res_counter_write(&mem_cgroup_from_cont(cont)->as_res,
+				cft->private, userbuf, nbytes, ppos,
+				mem_cgroup_write_strategy);
+}
+#endif
+
 static ssize_t mem_force_empty_write(struct cgroup *cont,
 				struct cftype *cft, struct file *file,
 				const char __user *userbuf,
@@ -943,6 +966,19 @@ static struct cftype mem_cgroup_files[] 
 		.name = "stat",
 		.read_map = mem_control_stat_show,
 	},
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_AS
+	{
+		.name = "as_usage_in_bytes",
+		.private = RES_USAGE,
+		.read_u64 = mem_cgroup_as_read,
+	},
+	{
+		.name = "as_limit_in_bytes",
+		.private = RES_LIMIT,
+		.write = mem_cgroup_as_write,
+		.read_u64 = mem_cgroup_as_read,
+	},
+#endif
 };
 
 static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
@@ -999,6 +1035,9 @@ mem_cgroup_create(struct cgroup_subsys *
 		return ERR_PTR(-ENOMEM);
 
 	res_counter_init(&mem->res);
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_AS
+	res_counter_init(&mem->as_res);
+#endif
 
 	memset(&mem->info, 0, sizeof(mem->info));
 
diff -puN include/linux/memcontrol.h~memory-controller-virtual-address-space-control-user-interface include/linux/memcontrol.h
diff -puN init/Kconfig~memory-controller-virtual-address-space-control-user-interface init/Kconfig
--- linux-2.6.25-rc5/init/Kconfig~memory-controller-virtual-address-space-control-user-interface	2008-03-26 16:06:34.000000000 +0530
+++ linux-2.6.25-rc5-balbir/init/Kconfig	2008-03-26 16:13:06.000000000 +0530
@@ -379,6 +379,16 @@ config CGROUP_MEM_RES_CTLR
 	  Only enable when you're ok with these trade offs and really
 	  sure you need the memory resource controller.
 
+config CGROUP_MEM_RES_CTLR_AS
+	bool "Virtual Address Space Controller for Control Groups"
+	depends on CGROUP_MEM_RES_CTLR
+	help
+	  Provides control over the maximum amount of virtual address space
+	  that can be consumed by the tasks in the cgroup. Setting a reasonable
+	  address limit will allow applications to fail more gracefully and
+	  avoid forceful reclaim or OOM when a cgroup exceeds its memory
+	  limit.
+
 config SYSFS_DEPRECATED
 	bool
 
_

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL


* [RFC][2/3] Account and control virtual address space allocations (v2)
  2008-03-26 18:49 [RFC][0/3] Virtual address space control for cgroups (v2) Balbir Singh
  2008-03-26 18:50 ` [RFC][1/3] Add user interface for virtual address space control (v2) Balbir Singh
@ 2008-03-26 18:50 ` Balbir Singh
  2008-03-26 19:10   ` Balbir Singh
  2008-03-27  7:19   ` Pavel Emelyanov
  2008-03-26 18:50 ` [RFC][3/3] Update documentation for virtual address space control (v2) Balbir Singh
  2008-03-26 22:22 ` [RFC][0/3] Virtual address space control for cgroups (v2) Paul Menage
  3 siblings, 2 replies; 23+ messages in thread
From: Balbir Singh @ 2008-03-26 18:50 UTC (permalink / raw)
  To: Andrew Morton, Pavel Emelianov
  Cc: Hugh Dickins, Sudhir Kumar, YAMAMOTO Takashi, Paul Menage, lizf,
	linux-kernel, taka, linux-mm, David Rientjes, Balbir Singh,
	KAMEZAWA Hiroyuki



Changelog v2
------------
Change the accounting to what is already present in the kernel. Split
the address space accounting into mem_cgroup_charge_as and
mem_cgroup_uncharge_as. At the time of VM expansion, call
mem_cgroup_cannot_expand_as to check if the new allocation will push
us over the limit.

This patch implements accounting and control of virtual address space.
Accounting is done when the virtual address space of any task/mm_struct
belonging to the cgroup is incremented or decremented. This patch
fails the expansion if the cgroup goes over its limit.

TODOs

1. Virtual address space control is enabled only when CONFIG_MMU is
   enabled. Should we do this for the nommu case as well? My suspicion
   is that we don't have to.

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 arch/ia64/kernel/perfmon.c  |    2 +
 arch/x86/kernel/ptrace.c    |    7 +++
 fs/exec.c                   |    2 +
 include/linux/memcontrol.h  |   26 +++++++++++++
 include/linux/res_counter.h |   19 ++++++++--
 init/Kconfig                |    2 -
 kernel/fork.c               |   17 +++++++--
 mm/memcontrol.c             |   83 ++++++++++++++++++++++++++++++++++++++++++++
 mm/mmap.c                   |   11 +++++
 mm/mremap.c                 |    2 +
 10 files changed, 163 insertions(+), 8 deletions(-)

diff -puN mm/memcontrol.c~memory-controller-virtual-address-space-accounting-and-control mm/memcontrol.c
--- linux-2.6.25-rc5/mm/memcontrol.c~memory-controller-virtual-address-space-accounting-and-control	2008-03-26 16:27:59.000000000 +0530
+++ linux-2.6.25-rc5-balbir/mm/memcontrol.c	2008-03-27 00:18:16.000000000 +0530
@@ -526,6 +526,76 @@ unsigned long mem_cgroup_isolate_pages(u
 	return nr_taken;
 }
 
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_AS
+/*
+ * Charge the address space usage for cgroup. This routine is most
+ * likely to be called from places that expand the total_vm of a mm_struct.
+ */
+void mem_cgroup_charge_as(struct mm_struct *mm, long nr_pages)
+{
+	struct mem_cgroup *mem;
+
+	if (mem_cgroup_subsys.disabled)
+		return;
+
+	rcu_read_lock();
+	mem = rcu_dereference(mm->mem_cgroup);
+	css_get(&mem->css);
+	rcu_read_unlock();
+
+	res_counter_charge(&mem->as_res, (nr_pages * PAGE_SIZE));
+	css_put(&mem->css);
+}
+
+/*
+ * Uncharge the address space usage for cgroup. This routine is most
+ * likely to be called from places that shrink the total_vm of a mm_struct.
+ */
+void mem_cgroup_uncharge_as(struct mm_struct *mm, long nr_pages)
+{
+	struct mem_cgroup *mem;
+
+	if (mem_cgroup_subsys.disabled)
+		return;
+
+	rcu_read_lock();
+	mem = rcu_dereference(mm->mem_cgroup);
+	css_get(&mem->css);
+	rcu_read_unlock();
+
+	res_counter_uncharge(&mem->as_res, (nr_pages * PAGE_SIZE));
+	css_put(&mem->css);
+}
+
+/*
+ * Check if the address space of the cgroup can be expanded.
+ * Returns 0 on success, anything else indicates failure
+ */
+int mem_cgroup_cannot_expand_as(struct mm_struct *mm, long nr_pages)
+{
+	int ret = 0;
+	struct mem_cgroup *mem;
+
+	if (mem_cgroup_subsys.disabled)
+		return ret;
+
+	rcu_read_lock();
+	mem = rcu_dereference(mm->mem_cgroup);
+	css_get(&mem->css);
+	rcu_read_unlock();
+
+	if (!res_counter_check_charge(&mem->as_res, (nr_pages * PAGE_SIZE)))
+		ret = -ENOMEM;
+	css_put(&mem->css);
+	if (ret) {
+		printk("cannot expand as %d\n", ret);
+		dump_stack();
+	}
+	return ret;
+}
+
+#endif /* CONFIG_CGROUP_MEM_RES_CTLR_AS */
+
 /*
  * Charge the memory controller for page usage.
  * Return
@@ -1111,6 +1181,19 @@ static void mem_cgroup_move_task(struct 
 		goto out;
 
 	css_get(&mem->css);
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_AS
+	/*
+	 * For address space accounting, the charges are migrated.
+	 * We need to migrate it since all the future uncharge/charge will
+	 * now happen to the new cgroup. For consistency, we need to migrate
+	 * all charges, otherwise we could end up dropping charges from
+	 * the new cgroup (even though they were incurred in the current
+	 * group).
+	 */
+	if (res_counter_charge(&mem->as_res, (mm->total_vm * PAGE_SIZE)))
+		goto out;
+	res_counter_uncharge(&old_mem->as_res, (mm->total_vm * PAGE_SIZE));
+#endif
 	rcu_assign_pointer(mm->mem_cgroup, mem);
 	css_put(&old_mem->css);
 
diff -puN include/linux/memcontrol.h~memory-controller-virtual-address-space-accounting-and-control include/linux/memcontrol.h
--- linux-2.6.25-rc5/include/linux/memcontrol.h~memory-controller-virtual-address-space-accounting-and-control	2008-03-26 16:27:59.000000000 +0530
+++ linux-2.6.25-rc5-balbir/include/linux/memcontrol.h	2008-03-26 18:57:21.000000000 +0530
@@ -54,7 +54,6 @@ int task_in_mem_cgroup(struct task_struc
 extern int mem_cgroup_prepare_migration(struct page *page);
 extern void mem_cgroup_end_migration(struct page *page);
 extern void mem_cgroup_page_migration(struct page *page, struct page *newpage);
-
 /*
  * For memory reclaim.
  */
@@ -172,7 +171,32 @@ static inline long mem_cgroup_calc_recla
 {
 	return 0;
 }
+
 #endif /* CONFIG_CGROUP_MEM_CONT */
 
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_AS
+
+extern void mem_cgroup_charge_as(struct mm_struct *mm, long nr_pages);
+extern void mem_cgroup_uncharge_as(struct mm_struct *mm, long nr_pages);
+extern int mem_cgroup_cannot_expand_as(struct mm_struct *mm, long nr_pages);
+
+#else /* CONFIG_CGROUP_MEM_RES_CTLR_AS */
+
+static inline void mem_cgroup_charge_as(struct mm_struct *mm, long nr_pages)
+{
+}
+
+static inline void mem_cgroup_uncharge_as(struct mm_struct *mm, long nr_pages)
+{
+}
+
+static inline int
+mem_cgroup_cannot_expand_as(struct mm_struct *mm, long nr_pages)
+{
+	return 0;
+}
+
+#endif /* CONFIG_CGROUP_MEM_RES_CTLR_AS */
+
 #endif /* _LINUX_MEMCONTROL_H */
 
diff -puN mm/mmap.c~memory-controller-virtual-address-space-accounting-and-control mm/mmap.c
--- linux-2.6.25-rc5/mm/mmap.c~memory-controller-virtual-address-space-accounting-and-control	2008-03-26 16:27:59.000000000 +0530
+++ linux-2.6.25-rc5-balbir/mm/mmap.c	2008-03-26 22:37:25.000000000 +0530
@@ -1205,6 +1205,7 @@ munmap_back:
 		atomic_inc(&inode->i_writecount);
 out:
 	mm->total_vm += len >> PAGE_SHIFT;
+	mem_cgroup_charge_as(mm, len >> PAGE_SHIFT);
 	vm_stat_account(mm, vm_flags, file, len >> PAGE_SHIFT);
 	if (vm_flags & VM_LOCKED) {
 		mm->locked_vm += len >> PAGE_SHIFT;
@@ -1557,6 +1558,7 @@ static int acct_stack_growth(struct vm_a
 
 	/* Ok, everything looks good - let it rip */
 	mm->total_vm += grow;
+	mem_cgroup_charge_as(mm, grow);
 	if (vma->vm_flags & VM_LOCKED)
 		mm->locked_vm += grow;
 	vm_stat_account(mm, vma->vm_flags, vma->vm_file, grow);
@@ -1730,6 +1732,7 @@ static void remove_vma_list(struct mm_st
 		long nrpages = vma_pages(vma);
 
 		mm->total_vm -= nrpages;
+		mem_cgroup_uncharge_as(mm, nrpages);
 		if (vma->vm_flags & VM_LOCKED)
 			mm->locked_vm -= nrpages;
 		vm_stat_account(mm, vma->vm_flags, vma->vm_file, -nrpages);
@@ -2029,6 +2032,7 @@ unsigned long do_brk(unsigned long addr,
 	vma_link(mm, vma, prev, rb_link, rb_parent);
 out:
 	mm->total_vm += len >> PAGE_SHIFT;
+	mem_cgroup_charge_as(mm, len >> PAGE_SHIFT);
 	if (flags & VM_LOCKED) {
 		mm->locked_vm += len >> PAGE_SHIFT;
 		make_pages_present(addr, addr + len);
@@ -2056,6 +2060,7 @@ void exit_mmap(struct mm_struct *mm)
 	/* Use -1 here to ensure all VMAs in the mm are unmapped */
 	end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL);
 	vm_unacct_memory(nr_accounted);
+	mem_cgroup_uncharge_as(mm, mm->total_vm);
 	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0);
 	tlb_finish_mmu(tlb, 0, end);
 
@@ -2098,7 +2103,8 @@ int insert_vm_struct(struct mm_struct * 
 	if (__vma && __vma->vm_start < vma->vm_end)
 		return -ENOMEM;
 	if ((vma->vm_flags & VM_ACCOUNT) &&
-	     security_vm_enough_memory_mm(mm, vma_pages(vma)))
+	     (security_vm_enough_memory_mm(mm, vma_pages(vma)) ||
+		mem_cgroup_cannot_expand_as(mm, vma_pages(vma))))
 		return -ENOMEM;
 	vma_link(mm, vma, prev, rb_link, rb_parent);
 	return 0;
@@ -2174,6 +2180,8 @@ int may_expand_vm(struct mm_struct *mm, 
 
 	if (cur + npages > lim)
 		return 0;
+	if (mem_cgroup_cannot_expand_as(mm, npages))
+		return 0;
 	return 1;
 }
 
@@ -2252,6 +2260,7 @@ int install_special_mapping(struct mm_st
 	}
 
 	mm->total_vm += len >> PAGE_SHIFT;
+	mem_cgroup_charge_as(mm, len >> PAGE_SHIFT);
 
 	return 0;
 }
diff -puN arch/x86/kernel/ptrace.c~memory-controller-virtual-address-space-accounting-and-control arch/x86/kernel/ptrace.c
--- linux-2.6.25-rc5/arch/x86/kernel/ptrace.c~memory-controller-virtual-address-space-accounting-and-control	2008-03-26 16:27:59.000000000 +0530
+++ linux-2.6.25-rc5-balbir/arch/x86/kernel/ptrace.c	2008-03-26 19:05:51.000000000 +0530
@@ -20,6 +20,7 @@
 #include <linux/audit.h>
 #include <linux/seccomp.h>
 #include <linux/signal.h>
+#include <linux/memcontrol.h>
 
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
@@ -787,6 +788,8 @@ static int ptrace_bts_realloc(struct tas
 	current->mm->total_vm  -= old_size;
 	current->mm->locked_vm -= old_size;
 
+	mem_cgroup_uncharge_as(current->mm, old_size);
+
 	if (size == 0)
 		goto out;
 
@@ -803,6 +806,9 @@ static int ptrace_bts_realloc(struct tas
 			goto out;
 	}
 
+	if (mem_cgroup_cannot_expand_as(current->mm, size))
+		goto out;
+
 	rlim = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur >> PAGE_SHIFT;
 	vm = current->mm->locked_vm  + size;
 	if (rlim < vm) {
@@ -823,6 +829,7 @@ static int ptrace_bts_realloc(struct tas
 
 	current->mm->total_vm  += size;
 	current->mm->locked_vm += size;
+	mem_cgroup_charge_as(current->mm, size);
 
 out:
 	if (child->thread.ds_area_msr)
diff -puN kernel/fork.c~memory-controller-virtual-address-space-accounting-and-control kernel/fork.c
--- linux-2.6.25-rc5/kernel/fork.c~memory-controller-virtual-address-space-accounting-and-control	2008-03-26 16:27:59.000000000 +0530
+++ linux-2.6.25-rc5-balbir/kernel/fork.c	2008-03-26 22:36:17.000000000 +0530
@@ -53,6 +53,7 @@
 #include <linux/tty.h>
 #include <linux/proc_fs.h>
 #include <linux/blkdev.h>
+#include <linux/memcontrol.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -237,17 +238,18 @@ static int dup_mmap(struct mm_struct *mm
 
 	for (mpnt = oldmm->mmap; mpnt; mpnt = mpnt->vm_next) {
 		struct file *file;
+		unsigned int len = vma_pages(mpnt);
 
 		if (mpnt->vm_flags & VM_DONTCOPY) {
 			long pages = vma_pages(mpnt);
 			mm->total_vm -= pages;
+			mem_cgroup_uncharge_as(mm, pages);
 			vm_stat_account(mm, mpnt->vm_flags, mpnt->vm_file,
 								-pages);
 			continue;
 		}
 		charge = 0;
 		if (mpnt->vm_flags & VM_ACCOUNT) {
-			unsigned int len = (mpnt->vm_end - mpnt->vm_start) >> PAGE_SHIFT;
 			if (security_vm_enough_memory(len))
 				goto fail_nomem;
 			charge = len;
@@ -311,8 +313,8 @@ out:
 fail_nomem_policy:
 	kmem_cache_free(vm_area_cachep, tmp);
 fail_nomem:
-	retval = -ENOMEM;
 	vm_unacct_memory(charge);
+	retval = -ENOMEM;
 	goto out;
 }
 
@@ -1047,6 +1049,17 @@ static struct task_struct *copy_process(
 	DEBUG_LOCKS_WARN_ON(!p->hardirqs_enabled);
 	DEBUG_LOCKS_WARN_ON(!p->softirqs_enabled);
 #endif
+
+	/*
+	 * It's OK to duplicate the charges of current->mm on fork
+	 */
+	if (current->mm && !(clone_flags & CLONE_VM)) {
+		if (mem_cgroup_cannot_expand_as(current->mm,
+						current->mm->total_vm))
+			goto bad_fork_free;
+		mem_cgroup_charge_as(current->mm, current->mm->total_vm);
+	}
+
 	retval = -EAGAIN;
 	if (atomic_read(&p->user->processes) >=
 			p->signal->rlim[RLIMIT_NPROC].rlim_cur) {
diff -puN mm/mremap.c~memory-controller-virtual-address-space-accounting-and-control mm/mremap.c
--- linux-2.6.25-rc5/mm/mremap.c~memory-controller-virtual-address-space-accounting-and-control	2008-03-26 16:27:59.000000000 +0530
+++ linux-2.6.25-rc5-balbir/mm/mremap.c	2008-03-26 19:08:18.000000000 +0530
@@ -213,6 +213,7 @@ static unsigned long move_vma(struct vm_
 	 */
 	hiwater_vm = mm->hiwater_vm;
 	mm->total_vm += new_len >> PAGE_SHIFT;
+	mem_cgroup_charge_as(mm, new_len >> PAGE_SHIFT);
 	vm_stat_account(mm, vma->vm_flags, vma->vm_file, new_len>>PAGE_SHIFT);
 
 	if (do_munmap(mm, old_addr, old_len) < 0) {
@@ -370,6 +371,7 @@ unsigned long do_mremap(unsigned long ad
 				addr + new_len, vma->vm_pgoff, NULL);
 
 			mm->total_vm += pages;
+			mem_cgroup_charge_as(mm, pages);
 			vm_stat_account(mm, vma->vm_flags, vma->vm_file, pages);
 			if (vma->vm_flags & VM_LOCKED) {
 				mm->locked_vm += pages;
diff -puN init/Kconfig~memory-controller-virtual-address-space-accounting-and-control init/Kconfig
--- linux-2.6.25-rc5/init/Kconfig~memory-controller-virtual-address-space-accounting-and-control	2008-03-26 16:27:59.000000000 +0530
+++ linux-2.6.25-rc5-balbir/init/Kconfig	2008-03-26 18:06:49.000000000 +0530
@@ -381,7 +381,7 @@ config CGROUP_MEM_RES_CTLR
 
 config CGROUP_MEM_RES_CTLR_AS
 	bool "Virtual Address Space Controller for Control Groups"
-	depends on CGROUP_MEM_RES_CTLR
+	depends on CGROUP_MEM_RES_CTLR && MMU
 	help
 	  Provides control over the maximum amount of virtual address space
 	  that can be consumed by the tasks in the cgroup. Setting a reasonable
diff -puN mm/swapfile.c~memory-controller-virtual-address-space-accounting-and-control mm/swapfile.c
diff -puN mm/memory.c~memory-controller-virtual-address-space-accounting-and-control mm/memory.c
diff -puN arch/ia64/kernel/perfmon.c~memory-controller-virtual-address-space-accounting-and-control arch/ia64/kernel/perfmon.c
--- linux-2.6.25-rc5/arch/ia64/kernel/perfmon.c~memory-controller-virtual-address-space-accounting-and-control	2008-03-26 16:32:02.000000000 +0530
+++ linux-2.6.25-rc5-balbir/arch/ia64/kernel/perfmon.c	2008-03-26 16:41:42.000000000 +0530
@@ -40,6 +40,7 @@
 #include <linux/capability.h>
 #include <linux/rcupdate.h>
 #include <linux/completion.h>
+#include <linux/memcontrol.h>
 
 #include <asm/errno.h>
 #include <asm/intrinsics.h>
@@ -2375,6 +2376,7 @@ pfm_smpl_buffer_alloc(struct task_struct
 	insert_vm_struct(mm, vma);
 
 	mm->total_vm  += size >> PAGE_SHIFT;
+	mem_cgroup_charge_as(mm, (size >> PAGE_SHIFT));
 	vm_stat_account(vma->vm_mm, vma->vm_flags, vma->vm_file,
 							vma_pages(vma));
 	up_write(&task->mm->mmap_sem);
diff -puN include/linux/res_counter.h~memory-controller-virtual-address-space-accounting-and-control include/linux/res_counter.h
--- linux-2.6.25-rc5/include/linux/res_counter.h~memory-controller-virtual-address-space-accounting-and-control	2008-03-26 18:53:22.000000000 +0530
+++ linux-2.6.25-rc5-balbir/include/linux/res_counter.h	2008-03-26 20:09:06.000000000 +0530
@@ -104,9 +104,10 @@ int res_counter_charge(struct res_counte
 void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val);
 void res_counter_uncharge(struct res_counter *counter, unsigned long val);
 
-static inline bool res_counter_limit_check_locked(struct res_counter *cnt)
+static inline bool res_counter_limit_check_locked(struct res_counter *cnt,
+							unsigned long val)
 {
-	if (cnt->usage < cnt->limit)
+	if (cnt->usage + val < cnt->limit)
 		return true;
 
 	return false;
@@ -122,7 +123,19 @@ static inline bool res_counter_check_und
 	unsigned long flags;
 
 	spin_lock_irqsave(&cnt->lock, flags);
-	ret = res_counter_limit_check_locked(cnt);
+	ret = res_counter_limit_check_locked(cnt, 0);
+	spin_unlock_irqrestore(&cnt->lock, flags);
+	return ret;
+}
+
+static inline bool res_counter_check_charge(struct res_counter *cnt,
+						unsigned long val)
+{
+	bool ret;
+	unsigned long flags;
+
+	spin_lock_irqsave(&cnt->lock, flags);
+	ret = res_counter_limit_check_locked(cnt, val);
 	spin_unlock_irqrestore(&cnt->lock, flags);
 	return ret;
 }
diff -puN fs/exec.c~memory-controller-virtual-address-space-accounting-and-control fs/exec.c
--- linux-2.6.25-rc5/fs/exec.c~memory-controller-virtual-address-space-accounting-and-control	2008-03-26 23:32:59.000000000 +0530
+++ linux-2.6.25-rc5-balbir/fs/exec.c	2008-03-26 23:34:02.000000000 +0530
@@ -51,6 +51,7 @@
 #include <linux/tsacct_kern.h>
 #include <linux/cn_proc.h>
 #include <linux/audit.h>
+#include <linux/memcontrol.h>
 
 #include <asm/uaccess.h>
 #include <asm/mmu_context.h>
@@ -250,6 +251,7 @@ static int __bprm_mm_init(struct linux_b
 	}
 
 	mm->stack_vm = mm->total_vm = 1;
+	mem_cgroup_charge_as(mm, 1);
 	up_write(&mm->mmap_sem);
 
 	bprm->p = vma->vm_end - sizeof(void *);
_

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL


* [RFC][3/3] Update documentation for virtual address space control (v2)
  2008-03-26 18:49 [RFC][0/3] Virtual address space control for cgroups (v2) Balbir Singh
  2008-03-26 18:50 ` [RFC][1/3] Add user interface for virtual address space control (v2) Balbir Singh
  2008-03-26 18:50 ` [RFC][2/3] Account and control virtual address space allocations (v2) Balbir Singh
@ 2008-03-26 18:50 ` Balbir Singh
  2008-03-26 22:22 ` [RFC][0/3] Virtual address space control for cgroups (v2) Paul Menage
  3 siblings, 0 replies; 23+ messages in thread
From: Balbir Singh @ 2008-03-26 18:50 UTC (permalink / raw)
  To: Andrew Morton, Pavel Emelianov
  Cc: Hugh Dickins, Sudhir Kumar, YAMAMOTO Takashi, Paul Menage, lizf,
	linux-kernel, taka, linux-mm, David Rientjes, Balbir Singh,
	KAMEZAWA Hiroyuki



Changelog v2
------------
Fix typos and implement review suggestions from Randy

This patch adds documentation for virtual address space control.

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 Documentation/controllers/memory.txt |   28 +++++++++++++++++++++++++++-
 1 file changed, 27 insertions(+), 1 deletion(-)

diff -puN Documentation/controllers/memory.txt~memory-controller-virtual-address-control-documentation Documentation/controllers/memory.txt
--- linux-2.6.25-rc5/Documentation/controllers/memory.txt~memory-controller-virtual-address-control-documentation	2008-03-27 00:18:19.000000000 +0530
+++ linux-2.6.25-rc5-balbir/Documentation/controllers/memory.txt	2008-03-27 00:18:19.000000000 +0530
@@ -237,7 +237,33 @@ cgroup might have some charge associated
 tasks have migrated away from it. Such charges are automatically dropped at
 rmdir() if there are no tasks.
 
-5. TODO
+5. Virtual address space accounting
+
+A new resource counter controls the address space expansion of the tasks in
+the cgroup. Address space control is provided along the same lines as
+RLIMIT_AS control, which is available via getrlimit(2)/setrlimit(2).
+The interface for controlling address space is provided through
+"as_limit_in_bytes". The file is similar to "limit_in_bytes" w.r.t. the user
+interface. Please see section 3 for more details on how to use the user
+interface to get and set values.
+
+The "as_usage_in_bytes" file provides information about the total address
+space usage of the cgroup in bytes.
+
+5.1 Advantages of providing this feature
+
+1. Control over virtual address space allows a cgroup to fail gracefully,
+   i.e., via a malloc or mmap failure, as compared to an OOM kill when no
+   pages can be reclaimed.
+2. It provides better control over how many pages can be swapped out when
+   the cgroup goes over its limit. A badly set up cgroup can cause excessive
+   swapping. Providing control over the address space allocations ensures
+   that the system administrator has control over the total swapping that
+   can take place.
+
+NOTE: This feature is controlled by the CONFIG_CGROUP_MEM_RES_CTLR_AS config option.
+
+6. TODO
 
 1. Add support for accounting huge pages (as a separate controller)
 2. Make per-cgroup scanner reclaim not-shared pages first
_

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL


* Re: [RFC][2/3] Account and control virtual address space allocations (v2)
  2008-03-26 18:50 ` [RFC][2/3] Account and control virtual address space allocations (v2) Balbir Singh
@ 2008-03-26 19:10   ` Balbir Singh
  2008-03-27  7:19   ` Pavel Emelyanov
  1 sibling, 0 replies; 23+ messages in thread
From: Balbir Singh @ 2008-03-26 19:10 UTC (permalink / raw)
  To: Andrew Morton, Pavel Emelianov
  Cc: Hugh Dickins, Sudhir Kumar, YAMAMOTO Takashi, Paul Menage, lizf,
	linux-kernel, taka, linux-mm, David Rientjes, KAMEZAWA Hiroyuki

* Balbir Singh <balbir@linux.vnet.ibm.com> [2008-03-27 00:20:17]:

> +	if (ret) {
> +		printk("cannot expand as %d\n", ret);
> +		dump_stack();
> +	}

I left some debug code behind.

Here's the fixed version without the debug code:


Changelog v2
------------
Change the accounting to what is already present in the kernel. Split
the address space accounting into mem_cgroup_charge_as and
mem_cgroup_uncharge_as. At the time of VM expansion, call
mem_cgroup_cannot_expand_as to check if the new allocation will push
us over the limit.

This patch implements accounting and control of virtual address space.
Accounting is done when the virtual address space of any task/mm_struct
belonging to the cgroup is incremented or decremented. This patch
fails the expansion if the cgroup goes over its limit.

TODOs

1. Virtual address space control is enabled only when CONFIG_MMU is
   enabled. Should we do this for the nommu case as well? My suspicion
   is that we don't have to.

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 arch/ia64/kernel/perfmon.c  |    2 +
 arch/x86/kernel/ptrace.c    |    7 +++
 fs/exec.c                   |    2 +
 include/linux/memcontrol.h  |   26 +++++++++++++-
 include/linux/res_counter.h |   19 ++++++++--
 init/Kconfig                |    2 -
 kernel/fork.c               |   17 ++++++++-
 mm/memcontrol.c             |   79 ++++++++++++++++++++++++++++++++++++++++++++
 mm/mmap.c                   |   11 +++++-
 mm/mremap.c                 |    2 +
 10 files changed, 159 insertions(+), 8 deletions(-)

diff -puN mm/memcontrol.c~memory-controller-virtual-address-space-accounting-and-control mm/memcontrol.c
--- linux-2.6.25-rc5/mm/memcontrol.c~memory-controller-virtual-address-space-accounting-and-control	2008-03-26 16:27:59.000000000 +0530
+++ linux-2.6.25-rc5-balbir/mm/memcontrol.c	2008-03-27 00:38:55.000000000 +0530
@@ -526,6 +526,72 @@ unsigned long mem_cgroup_isolate_pages(u
 	return nr_taken;
 }
 
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_AS
+/*
+ * Charge the address space usage for cgroup. This routine is most
+ * likely to be called from places that expand the total_vm of a mm_struct.
+ */
+void mem_cgroup_charge_as(struct mm_struct *mm, long nr_pages)
+{
+	struct mem_cgroup *mem;
+
+	if (mem_cgroup_subsys.disabled)
+		return;
+
+	rcu_read_lock();
+	mem = rcu_dereference(mm->mem_cgroup);
+	css_get(&mem->css);
+	rcu_read_unlock();
+
+	res_counter_charge(&mem->as_res, (nr_pages * PAGE_SIZE));
+	css_put(&mem->css);
+}
+
+/*
+ * Uncharge the address space usage for cgroup. This routine is most
+ * likely to be called from places that shrink the total_vm of a mm_struct.
+ */
+void mem_cgroup_uncharge_as(struct mm_struct *mm, long nr_pages)
+{
+	struct mem_cgroup *mem;
+
+	if (mem_cgroup_subsys.disabled)
+		return;
+
+	rcu_read_lock();
+	mem = rcu_dereference(mm->mem_cgroup);
+	css_get(&mem->css);
+	rcu_read_unlock();
+
+	res_counter_uncharge(&mem->as_res, (nr_pages * PAGE_SIZE));
+	css_put(&mem->css);
+}
+
+/*
+ * Check if the address space of the cgroup can be expanded.
+ * Returns 0 on success, anything else indicates failure
+ */
+int mem_cgroup_cannot_expand_as(struct mm_struct *mm, long nr_pages)
+{
+	int ret = 0;
+	struct mem_cgroup *mem;
+
+	if (mem_cgroup_subsys.disabled)
+		return ret;
+
+	rcu_read_lock();
+	mem = rcu_dereference(mm->mem_cgroup);
+	css_get(&mem->css);
+	rcu_read_unlock();
+
+	if (!res_counter_check_charge(&mem->as_res, (nr_pages * PAGE_SIZE)))
+		ret = -ENOMEM;
+	css_put(&mem->css);
+	return ret;
+}
+
+#endif /* CONFIG_CGROUP_MEM_RES_CTLR_AS */
+
 /*
  * Charge the memory controller for page usage.
  * Return
@@ -1111,6 +1177,19 @@ static void mem_cgroup_move_task(struct 
 		goto out;
 
 	css_get(&mem->css);
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_AS
+	/*
+	 * For address space accounting, the charges are migrated.
+	 * We need to migrate it since all the future uncharge/charge will
+	 * now happen to the new cgroup. For consistency, we need to migrate
+	 * all charges, otherwise we could end up dropping charges from
+	 * the new cgroup (even though they were incurred in the current
+	 * group).
+	 */
+	if (res_counter_charge(&mem->as_res, (mm->total_vm * PAGE_SIZE)))
+		goto out;
+	res_counter_uncharge(&old_mem->as_res, (mm->total_vm * PAGE_SIZE));
+#endif
 	rcu_assign_pointer(mm->mem_cgroup, mem);
 	css_put(&old_mem->css);
 
diff -puN include/linux/memcontrol.h~memory-controller-virtual-address-space-accounting-and-control include/linux/memcontrol.h
--- linux-2.6.25-rc5/include/linux/memcontrol.h~memory-controller-virtual-address-space-accounting-and-control	2008-03-26 16:27:59.000000000 +0530
+++ linux-2.6.25-rc5-balbir/include/linux/memcontrol.h	2008-03-26 18:57:21.000000000 +0530
@@ -54,7 +54,6 @@ int task_in_mem_cgroup(struct task_struc
 extern int mem_cgroup_prepare_migration(struct page *page);
 extern void mem_cgroup_end_migration(struct page *page);
 extern void mem_cgroup_page_migration(struct page *page, struct page *newpage);
-
 /*
  * For memory reclaim.
  */
@@ -172,7 +171,32 @@ static inline long mem_cgroup_calc_recla
 {
 	return 0;
 }
+
 #endif /* CONFIG_CGROUP_MEM_CONT */
 
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_AS
+
+extern void mem_cgroup_charge_as(struct mm_struct *mm, long nr_pages);
+extern void mem_cgroup_uncharge_as(struct mm_struct *mm, long nr_pages);
+extern int mem_cgroup_cannot_expand_as(struct mm_struct *mm, long nr_pages);
+
+#else /* CONFIG_CGROUP_MEM_RES_CTLR_AS */
+
+static inline void mem_cgroup_charge_as(struct mm_struct *mm, long nr_pages)
+{
+}
+
+static inline void mem_cgroup_uncharge_as(struct mm_struct *mm, long nr_pages)
+{
+}
+
+static inline int
+mem_cgroup_cannot_expand_as(struct mm_struct *mm, long nr_pages)
+{
+	return 0;
+}
+
+#endif /* CONFIG_CGROUP_MEM_RES_CTLR_AS */
+
 #endif /* _LINUX_MEMCONTROL_H */
 
diff -puN mm/mmap.c~memory-controller-virtual-address-space-accounting-and-control mm/mmap.c
--- linux-2.6.25-rc5/mm/mmap.c~memory-controller-virtual-address-space-accounting-and-control	2008-03-26 16:27:59.000000000 +0530
+++ linux-2.6.25-rc5-balbir/mm/mmap.c	2008-03-26 22:37:25.000000000 +0530
@@ -1205,6 +1205,7 @@ munmap_back:
 		atomic_inc(&inode->i_writecount);
 out:
 	mm->total_vm += len >> PAGE_SHIFT;
+	mem_cgroup_charge_as(mm, len >> PAGE_SHIFT);
 	vm_stat_account(mm, vm_flags, file, len >> PAGE_SHIFT);
 	if (vm_flags & VM_LOCKED) {
 		mm->locked_vm += len >> PAGE_SHIFT;
@@ -1557,6 +1558,7 @@ static int acct_stack_growth(struct vm_a
 
 	/* Ok, everything looks good - let it rip */
 	mm->total_vm += grow;
+	mem_cgroup_charge_as(mm, grow);
 	if (vma->vm_flags & VM_LOCKED)
 		mm->locked_vm += grow;
 	vm_stat_account(mm, vma->vm_flags, vma->vm_file, grow);
@@ -1730,6 +1732,7 @@ static void remove_vma_list(struct mm_st
 		long nrpages = vma_pages(vma);
 
 		mm->total_vm -= nrpages;
+		mem_cgroup_uncharge_as(mm, nrpages);
 		if (vma->vm_flags & VM_LOCKED)
 			mm->locked_vm -= nrpages;
 		vm_stat_account(mm, vma->vm_flags, vma->vm_file, -nrpages);
@@ -2029,6 +2032,7 @@ unsigned long do_brk(unsigned long addr,
 	vma_link(mm, vma, prev, rb_link, rb_parent);
 out:
 	mm->total_vm += len >> PAGE_SHIFT;
+	mem_cgroup_charge_as(mm, len >> PAGE_SHIFT);
 	if (flags & VM_LOCKED) {
 		mm->locked_vm += len >> PAGE_SHIFT;
 		make_pages_present(addr, addr + len);
@@ -2056,6 +2060,7 @@ void exit_mmap(struct mm_struct *mm)
 	/* Use -1 here to ensure all VMAs in the mm are unmapped */
 	end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL);
 	vm_unacct_memory(nr_accounted);
+	mem_cgroup_uncharge_as(mm, mm->total_vm);
 	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0);
 	tlb_finish_mmu(tlb, 0, end);
 
@@ -2098,7 +2103,8 @@ int insert_vm_struct(struct mm_struct * 
 	if (__vma && __vma->vm_start < vma->vm_end)
 		return -ENOMEM;
 	if ((vma->vm_flags & VM_ACCOUNT) &&
-	     security_vm_enough_memory_mm(mm, vma_pages(vma)))
+	     (security_vm_enough_memory_mm(mm, vma_pages(vma)) ||
+		mem_cgroup_cannot_expand_as(mm, vma_pages(vma))))
 		return -ENOMEM;
 	vma_link(mm, vma, prev, rb_link, rb_parent);
 	return 0;
@@ -2174,6 +2180,8 @@ int may_expand_vm(struct mm_struct *mm, 
 
 	if (cur + npages > lim)
 		return 0;
+	if (mem_cgroup_cannot_expand_as(mm, npages))
+		return 0;
 	return 1;
 }
 
@@ -2252,6 +2260,7 @@ int install_special_mapping(struct mm_st
 	}
 
 	mm->total_vm += len >> PAGE_SHIFT;
+	mem_cgroup_charge_as(mm, len >> PAGE_SHIFT);
 
 	return 0;
 }
diff -puN arch/x86/kernel/ptrace.c~memory-controller-virtual-address-space-accounting-and-control arch/x86/kernel/ptrace.c
--- linux-2.6.25-rc5/arch/x86/kernel/ptrace.c~memory-controller-virtual-address-space-accounting-and-control	2008-03-26 16:27:59.000000000 +0530
+++ linux-2.6.25-rc5-balbir/arch/x86/kernel/ptrace.c	2008-03-26 19:05:51.000000000 +0530
@@ -20,6 +20,7 @@
 #include <linux/audit.h>
 #include <linux/seccomp.h>
 #include <linux/signal.h>
+#include <linux/memcontrol.h>
 
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
@@ -787,6 +788,8 @@ static int ptrace_bts_realloc(struct tas
 	current->mm->total_vm  -= old_size;
 	current->mm->locked_vm -= old_size;
 
+	mem_cgroup_uncharge_as(current->mm, old_size);
+
 	if (size == 0)
 		goto out;
 
@@ -803,6 +806,9 @@ static int ptrace_bts_realloc(struct tas
 			goto out;
 	}
 
+	if (mem_cgroup_cannot_expand_as(current->mm, size))
+		goto out;
+
 	rlim = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur >> PAGE_SHIFT;
 	vm = current->mm->locked_vm  + size;
 	if (rlim < vm) {
@@ -823,6 +829,7 @@ static int ptrace_bts_realloc(struct tas
 
 	current->mm->total_vm  += size;
 	current->mm->locked_vm += size;
+	mem_cgroup_charge_as(current->mm, size);
 
 out:
 	if (child->thread.ds_area_msr)
diff -puN kernel/fork.c~memory-controller-virtual-address-space-accounting-and-control kernel/fork.c
--- linux-2.6.25-rc5/kernel/fork.c~memory-controller-virtual-address-space-accounting-and-control	2008-03-26 16:27:59.000000000 +0530
+++ linux-2.6.25-rc5-balbir/kernel/fork.c	2008-03-26 22:36:17.000000000 +0530
@@ -53,6 +53,7 @@
 #include <linux/tty.h>
 #include <linux/proc_fs.h>
 #include <linux/blkdev.h>
+#include <linux/memcontrol.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -237,17 +238,18 @@ static int dup_mmap(struct mm_struct *mm
 
 	for (mpnt = oldmm->mmap; mpnt; mpnt = mpnt->vm_next) {
 		struct file *file;
+		unsigned int len = vma_pages(mpnt);
 
 		if (mpnt->vm_flags & VM_DONTCOPY) {
 			long pages = vma_pages(mpnt);
 			mm->total_vm -= pages;
+			mem_cgroup_uncharge_as(mm, pages);
 			vm_stat_account(mm, mpnt->vm_flags, mpnt->vm_file,
 								-pages);
 			continue;
 		}
 		charge = 0;
 		if (mpnt->vm_flags & VM_ACCOUNT) {
-			unsigned int len = (mpnt->vm_end - mpnt->vm_start) >> PAGE_SHIFT;
 			if (security_vm_enough_memory(len))
 				goto fail_nomem;
 			charge = len;
@@ -311,8 +313,8 @@ out:
 fail_nomem_policy:
 	kmem_cache_free(vm_area_cachep, tmp);
 fail_nomem:
-	retval = -ENOMEM;
 	vm_unacct_memory(charge);
+	retval = -ENOMEM;
 	goto out;
 }
 
@@ -1047,6 +1049,17 @@ static struct task_struct *copy_process(
 	DEBUG_LOCKS_WARN_ON(!p->hardirqs_enabled);
 	DEBUG_LOCKS_WARN_ON(!p->softirqs_enabled);
 #endif
+
+	/*
+	 * It's OK to duplicate the charges of current->mm on fork
+	 */
+	if (current->mm && !(clone_flags & CLONE_VM)) {
+		if (mem_cgroup_cannot_expand_as(current->mm,
+						current->mm->total_vm))
+			goto bad_fork_free;
+		mem_cgroup_charge_as(current->mm, current->mm->total_vm);
+	}
+
 	retval = -EAGAIN;
 	if (atomic_read(&p->user->processes) >=
 			p->signal->rlim[RLIMIT_NPROC].rlim_cur) {
diff -puN mm/mremap.c~memory-controller-virtual-address-space-accounting-and-control mm/mremap.c
--- linux-2.6.25-rc5/mm/mremap.c~memory-controller-virtual-address-space-accounting-and-control	2008-03-26 16:27:59.000000000 +0530
+++ linux-2.6.25-rc5-balbir/mm/mremap.c	2008-03-26 19:08:18.000000000 +0530
@@ -213,6 +213,7 @@ static unsigned long move_vma(struct vm_
 	 */
 	hiwater_vm = mm->hiwater_vm;
 	mm->total_vm += new_len >> PAGE_SHIFT;
+	mem_cgroup_charge_as(mm, new_len >> PAGE_SHIFT);
 	vm_stat_account(mm, vma->vm_flags, vma->vm_file, new_len>>PAGE_SHIFT);
 
 	if (do_munmap(mm, old_addr, old_len) < 0) {
@@ -370,6 +371,7 @@ unsigned long do_mremap(unsigned long ad
 				addr + new_len, vma->vm_pgoff, NULL);
 
 			mm->total_vm += pages;
+			mem_cgroup_charge_as(mm, pages);
 			vm_stat_account(mm, vma->vm_flags, vma->vm_file, pages);
 			if (vma->vm_flags & VM_LOCKED) {
 				mm->locked_vm += pages;
diff -puN init/Kconfig~memory-controller-virtual-address-space-accounting-and-control init/Kconfig
--- linux-2.6.25-rc5/init/Kconfig~memory-controller-virtual-address-space-accounting-and-control	2008-03-26 16:27:59.000000000 +0530
+++ linux-2.6.25-rc5-balbir/init/Kconfig	2008-03-26 18:06:49.000000000 +0530
@@ -381,7 +381,7 @@ config CGROUP_MEM_RES_CTLR
 
 config CGROUP_MEM_RES_CTLR_AS
 	bool "Virtual Address Space Controller for Control Groups"
-	depends on CGROUP_MEM_RES_CTLR
+	depends on CGROUP_MEM_RES_CTLR && MMU
 	help
 	  Provides control over the maximum amount of virtual address space
 	  that can be consumed by the tasks in the cgroup. Setting a reasonable
diff -puN mm/swapfile.c~memory-controller-virtual-address-space-accounting-and-control mm/swapfile.c
diff -puN mm/memory.c~memory-controller-virtual-address-space-accounting-and-control mm/memory.c
diff -puN arch/ia64/kernel/perfmon.c~memory-controller-virtual-address-space-accounting-and-control arch/ia64/kernel/perfmon.c
--- linux-2.6.25-rc5/arch/ia64/kernel/perfmon.c~memory-controller-virtual-address-space-accounting-and-control	2008-03-26 16:32:02.000000000 +0530
+++ linux-2.6.25-rc5-balbir/arch/ia64/kernel/perfmon.c	2008-03-26 16:41:42.000000000 +0530
@@ -40,6 +40,7 @@
 #include <linux/capability.h>
 #include <linux/rcupdate.h>
 #include <linux/completion.h>
+#include <linux/memcontrol.h>
 
 #include <asm/errno.h>
 #include <asm/intrinsics.h>
@@ -2375,6 +2376,7 @@ pfm_smpl_buffer_alloc(struct task_struct
 	insert_vm_struct(mm, vma);
 
 	mm->total_vm  += size >> PAGE_SHIFT;
+	mem_cgroup_charge_as(mm, (size >> PAGE_SHIFT));
 	vm_stat_account(vma->vm_mm, vma->vm_flags, vma->vm_file,
 							vma_pages(vma));
 	up_write(&task->mm->mmap_sem);
diff -puN include/linux/res_counter.h~memory-controller-virtual-address-space-accounting-and-control include/linux/res_counter.h
--- linux-2.6.25-rc5/include/linux/res_counter.h~memory-controller-virtual-address-space-accounting-and-control	2008-03-26 18:53:22.000000000 +0530
+++ linux-2.6.25-rc5-balbir/include/linux/res_counter.h	2008-03-26 20:09:06.000000000 +0530
@@ -104,9 +104,10 @@ int res_counter_charge(struct res_counte
 void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val);
 void res_counter_uncharge(struct res_counter *counter, unsigned long val);
 
-static inline bool res_counter_limit_check_locked(struct res_counter *cnt)
+static inline bool res_counter_limit_check_locked(struct res_counter *cnt,
+							unsigned long val)
 {
-	if (cnt->usage < cnt->limit)
+	if (cnt->usage + val < cnt->limit)
 		return true;
 
 	return false;
@@ -122,7 +123,19 @@ static inline bool res_counter_check_und
 	unsigned long flags;
 
 	spin_lock_irqsave(&cnt->lock, flags);
-	ret = res_counter_limit_check_locked(cnt);
+	ret = res_counter_limit_check_locked(cnt, 0);
+	spin_unlock_irqrestore(&cnt->lock, flags);
+	return ret;
+}
+
+static inline bool res_counter_check_charge(struct res_counter *cnt,
+						unsigned long val)
+{
+	bool ret;
+	unsigned long flags;
+
+	spin_lock_irqsave(&cnt->lock, flags);
+	ret = res_counter_limit_check_locked(cnt, val);
 	spin_unlock_irqrestore(&cnt->lock, flags);
 	return ret;
 }
diff -puN fs/exec.c~memory-controller-virtual-address-space-accounting-and-control fs/exec.c
--- linux-2.6.25-rc5/fs/exec.c~memory-controller-virtual-address-space-accounting-and-control	2008-03-26 23:32:59.000000000 +0530
+++ linux-2.6.25-rc5-balbir/fs/exec.c	2008-03-26 23:34:02.000000000 +0530
@@ -51,6 +51,7 @@
 #include <linux/tsacct_kern.h>
 #include <linux/cn_proc.h>
 #include <linux/audit.h>
+#include <linux/memcontrol.h>
 
 #include <asm/uaccess.h>
 #include <asm/mmu_context.h>
@@ -250,6 +251,7 @@ static int __bprm_mm_init(struct linux_b
 	}
 
 	mm->stack_vm = mm->total_vm = 1;
+	mem_cgroup_charge_as(mm, 1);
 	up_write(&mm->mmap_sem);
 
 	bprm->p = vma->vm_end - sizeof(void *);
_

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC][0/3] Virtual address space control for cgroups (v2)
  2008-03-26 18:49 [RFC][0/3] Virtual address space control for cgroups (v2) Balbir Singh
                   ` (2 preceding siblings ...)
  2008-03-26 18:50 ` [RFC][3/3] Update documentation for virtual address space control (v2) Balbir Singh
@ 2008-03-26 22:22 ` Paul Menage
  2008-03-27  8:04   ` Balbir Singh
  2008-03-27 10:03   ` KAMEZAWA Hiroyuki
  3 siblings, 2 replies; 23+ messages in thread
From: Paul Menage @ 2008-03-26 22:22 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Andrew Morton, Pavel Emelianov, Hugh Dickins, Sudhir Kumar,
	YAMAMOTO Takashi, lizf, linux-kernel, taka, linux-mm,
	David Rientjes, KAMEZAWA Hiroyuki

On Wed, Mar 26, 2008 at 11:49 AM, Balbir Singh
<balbir@linux.vnet.ibm.com> wrote:
>
>  The changelog in each patchset documents what has changed in version 2.
>  The most important one being that virtual address space accounting is
>  now a config option.
>
>  Reviews, Comments?
>

I'm still of the strong opinion that this belongs in a separate
subsystem. (Some of these arguments will sound familiar, largely
because they went unaddressed previously.)


The basic philosophy of cgroups is that one size does not fit all
(either all users, or all task groups), hence the ability to
pick'n'mix subsystems in a hierarchy, and have multiple different
hierarchies. So users who want physical memory isolation but not
virtual address isolation shouldn't have to pay the cost (multiple
atomic operations on a shared structure) on every mmap/munmap or other
address space change.


Trying to account/control physical memory or swap usage via virtual
address space limits is IMO a hopeless task. Taking Google's
production clusters and the virtual server systems that I worked on in
my previous job as real-life examples that I've encountered, there's
far too much variety of application behaviour (including Java apps
that have massive sparse heaps, jobs with lots of forked children
sharing pages but not address spaces with their parents, and multiple
serving processes mapping large shared data repositories from SHM
segments) for a policy of VA = RAM + swap not to break lots of jobs.
But pushing up the VA limit massively makes it useless for the purpose
of preventing excessive swapping. If you want to prevent excessive
swap space usage without breaking a large class of apps, you need to
limit swap space, not virtual address space.

Additionally, you suggested that VA limits provide a "soft-landing".
But I'm think that the number of applications that will do much other
than abort() if mmap() returns ENOMEM is extremely small - I'd be
interested to hear if you know of any.

I'm not going to argue that there are no good reasons for VA limits,
but I think my arguments above will apply in enough cases that VA
limits won't be used in the majority of cases that are using the
memory controller, let alone all machines running kernels with the
memory controller configured (e.g. distro kernels). Hence it should be
possible to use the memory controller without paying the full overhead
for the virtual address space limits.


And in cases that do want to use VA limits, can you be 100% sure that
they're going to want to use the same groupings as the memory
controller? I'm not sure that I can come up with a realistic example
of why you'd want to have VA limits and memory limits in different
hierarchies (maybe tracking memory leaks in subgroups of a job and
using physical memory control for the job as a whole?), but any such
example would work for free if they were two separate subsystems.

The only real technical argument against having them in separate
subsystems is that there needs to be an extra pointer from mm_struct
to a va_limit subsystem object if they're separate, since the VA
limits can no longer use mm->mem_cgroup. This is basically 8 bytes of
overhead per process (not per-thread) which is minimal, and even that
could go away if we were to implement the mm->owner concept.


Paul


* Re: [RFC][2/3] Account and control virtual address space allocations (v2)
  2008-03-26 18:50 ` [RFC][2/3] Account and control virtual address space allocations (v2) Balbir Singh
  2008-03-26 19:10   ` Balbir Singh
@ 2008-03-27  7:19   ` Pavel Emelyanov
  2008-03-27  8:02     ` Balbir Singh
  1 sibling, 1 reply; 23+ messages in thread
From: Pavel Emelyanov @ 2008-03-27  7:19 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Andrew Morton, Hugh Dickins, Sudhir Kumar, YAMAMOTO Takashi,
	Paul Menage, lizf, linux-kernel, taka, linux-mm, David Rientjes,
	KAMEZAWA Hiroyuki

Balbir Singh wrote:
> Changelog v2
> ------------
> Change the accounting to what is already present in the kernel. Split
> the address space accounting into mem_cgroup_charge_as and
> mem_cgroup_uncharge_as. At the time of VM expansion, call
> mem_cgroup_cannot_expand_as to check if the new allocation will push
> us over the limit
> 
> This patch implements accounting and control of virtual address space.
> Accounting is done when the virtual address space of any task/mm_struct
> belonging to the cgroup is incremented or decremented. This patch
> fails the expansion if the cgroup goes over its limit.
> 
> TODOs
> 
> 1. Only when CONFIG_MMU is enabled, is the virtual address space control
>    enabled. Should we do this for nommu cases as well? My suspicion is
>    that we don't have to.
> 
> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
> ---
> 
>  arch/ia64/kernel/perfmon.c  |    2 +
>  arch/x86/kernel/ptrace.c    |    7 +++
>  fs/exec.c                   |    2 +
>  include/linux/memcontrol.h  |   26 +++++++++++++
>  include/linux/res_counter.h |   19 ++++++++--
>  init/Kconfig                |    2 -
>  kernel/fork.c               |   17 +++++++--
>  mm/memcontrol.c             |   83 ++++++++++++++++++++++++++++++++++++++++++++
>  mm/mmap.c                   |   11 +++++
>  mm/mremap.c                 |    2 +
>  10 files changed, 163 insertions(+), 8 deletions(-)
> 
> diff -puN mm/memcontrol.c~memory-controller-virtual-address-space-accounting-and-control mm/memcontrol.c
> --- linux-2.6.25-rc5/mm/memcontrol.c~memory-controller-virtual-address-space-accounting-and-control	2008-03-26 16:27:59.000000000 +0530
> +++ linux-2.6.25-rc5-balbir/mm/memcontrol.c	2008-03-27 00:18:16.000000000 +0530
> @@ -526,6 +526,76 @@ unsigned long mem_cgroup_isolate_pages(u
>  	return nr_taken;
>  }
>  
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_AS
> +/*
> + * Charge the address space usage for cgroup. This routine is most
> + * likely to be called from places that expand the total_vm of a mm_struct.
> + */
> +void mem_cgroup_charge_as(struct mm_struct *mm, long nr_pages)
> +{
> +	struct mem_cgroup *mem;
> +
> +	if (mem_cgroup_subsys.disabled)
> +		return;
> +
> +	rcu_read_lock();
> +	mem = rcu_dereference(mm->mem_cgroup);
> +	css_get(&mem->css);
> +	rcu_read_unlock();
> +
> +	res_counter_charge(&mem->as_res, (nr_pages * PAGE_SIZE));
> +	css_put(&mem->css);

Why don't you check whether the counter is charged? This is
bad for two reasons:
1. you allow for some growth above the limit (e.g. in expand_stack)
2. you will undercharge it in the future when uncharging the
   vma, whose charge failed and was thus unaccounted.


* Re: [RFC][2/3] Account and control virtual address space allocations (v2)
  2008-03-27  7:19   ` Pavel Emelyanov
@ 2008-03-27  8:02     ` Balbir Singh
  2008-03-27  8:24       ` Pavel Emelyanov
  0 siblings, 1 reply; 23+ messages in thread
From: Balbir Singh @ 2008-03-27  8:02 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Andrew Morton, Hugh Dickins, Sudhir Kumar, YAMAMOTO Takashi,
	Paul Menage, lizf, linux-kernel, taka, linux-mm, David Rientjes,
	KAMEZAWA Hiroyuki

Pavel Emelyanov wrote:
> Balbir Singh wrote:
>> Changelog v2
>> ------------
>> Change the accounting to what is already present in the kernel. Split
>> the address space accounting into mem_cgroup_charge_as and
>> mem_cgroup_uncharge_as. At the time of VM expansion, call
>> mem_cgroup_cannot_expand_as to check if the new allocation will push
>> us over the limit
>>
>> This patch implements accounting and control of virtual address space.
>> Accounting is done when the virtual address space of any task/mm_struct
>> belonging to the cgroup is incremented or decremented. This patch
>> fails the expansion if the cgroup goes over its limit.
>>
>> TODOs
>>
>> 1. Only when CONFIG_MMU is enabled, is the virtual address space control
>>    enabled. Should we do this for nommu cases as well? My suspicion is
>>    that we don't have to.
>>
>> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
>> ---
>>
>>  arch/ia64/kernel/perfmon.c  |    2 +
>>  arch/x86/kernel/ptrace.c    |    7 +++
>>  fs/exec.c                   |    2 +
>>  include/linux/memcontrol.h  |   26 +++++++++++++
>>  include/linux/res_counter.h |   19 ++++++++--
>>  init/Kconfig                |    2 -
>>  kernel/fork.c               |   17 +++++++--
>>  mm/memcontrol.c             |   83 ++++++++++++++++++++++++++++++++++++++++++++
>>  mm/mmap.c                   |   11 +++++
>>  mm/mremap.c                 |    2 +
>>  10 files changed, 163 insertions(+), 8 deletions(-)
>>
>> diff -puN mm/memcontrol.c~memory-controller-virtual-address-space-accounting-and-control mm/memcontrol.c
>> --- linux-2.6.25-rc5/mm/memcontrol.c~memory-controller-virtual-address-space-accounting-and-control	2008-03-26 16:27:59.000000000 +0530
>> +++ linux-2.6.25-rc5-balbir/mm/memcontrol.c	2008-03-27 00:18:16.000000000 +0530
>> @@ -526,6 +526,76 @@ unsigned long mem_cgroup_isolate_pages(u
>>  	return nr_taken;
>>  }
>>  
>> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_AS
>> +/*
>> + * Charge the address space usage for cgroup. This routine is most
>> + * likely to be called from places that expand the total_vm of a mm_struct.
>> + */
>> +void mem_cgroup_charge_as(struct mm_struct *mm, long nr_pages)
>> +{
>> +	struct mem_cgroup *mem;
>> +
>> +	if (mem_cgroup_subsys.disabled)
>> +		return;
>> +
>> +	rcu_read_lock();
>> +	mem = rcu_dereference(mm->mem_cgroup);
>> +	css_get(&mem->css);
>> +	rcu_read_unlock();
>> +
>> +	res_counter_charge(&mem->as_res, (nr_pages * PAGE_SIZE));
>> +	css_put(&mem->css);
> 
> Why don't you check whether the counter is charged? This is
> bad for two reasons:
> 1. you allow for some growth above the limit (e.g. in expand_stack)

I was doing that earlier and then decided to keep the virtual address space code
in sync with the RLIMIT_AS checking code in the kernel. If you look at the flow, it
closely resembles what we do with mm->total_vm and may_expand_vm().
expand_stack() in turn calls acct_stack_growth(), which calls may_expand_vm().

> 2. you will undercharge it in the future when uncharging the
>    vma, whose charge failed and was thus unaccounted.

Hmmm...  This should ideally never happen, since we do a may_expand_vm() before
expanding the VM and in our case the virtual address space usage. I've not seen
it during my runs either. But it is something to keep in mind.

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL


* Re: [RFC][0/3] Virtual address space control for cgroups (v2)
  2008-03-26 22:22 ` [RFC][0/3] Virtual address space control for cgroups (v2) Paul Menage
@ 2008-03-27  8:04   ` Balbir Singh
  2008-03-27 14:28     ` Paul Menage
  2008-03-27 10:03   ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 23+ messages in thread
From: Balbir Singh @ 2008-03-27  8:04 UTC (permalink / raw)
  To: Paul Menage
  Cc: Andrew Morton, Pavel Emelianov, Hugh Dickins, Sudhir Kumar,
	YAMAMOTO Takashi, lizf, linux-kernel, taka, linux-mm,
	David Rientjes, KAMEZAWA Hiroyuki

Paul Menage wrote:
> On Wed, Mar 26, 2008 at 11:49 AM, Balbir Singh
> <balbir@linux.vnet.ibm.com> wrote:
>>  The changelog in each patchset documents what has changed in version 2.
>>  The most important one being that virtual address space accounting is
>>  now a config option.
>>
>>  Reviews, Comments?
>>
> 
> I'm still of the strong opinion that this belongs in a separate
> subsystem. (So some of these arguments will appear familiar, but are
> generally because they were unaddressed previously).
> 

I thought I addressed some of those by adding a separate config option. You
could enable just the address space control by leaving memory.limit_in_bytes at
its current (maximum) value.

> 
> The basic philosophy of cgroups is that one size does not fit all
> (either all users, or all task groups), hence the ability to
> pick'n'mix subsystems in a hierarchy, and have multiple different
> hierarchies. So users who want physical memory isolation but not
> virtual address isolation shouldn't have to pay the cost (multiple
> atomic operations on a shared structure) on every mmap/munmap or other
> address space change.
> 

Yes, I agree with the overhead philosophy. I suspect that users will enable
both. I am not against making it a separate controller. I am still hopeful of
getting the mm->owner approach working.

> 
> Trying to account/control physical memory or swap usage via virtual
> address space limits is IMO a hopeless task. Taking Google's
> production clusters and the virtual server systems that I worked on in
> my previous job as real-life examples that I've encountered, there's
> far too much variety of application behaviour (including Java apps
> that have massive sparse heaps, jobs with lots of forked children
> sharing pages but not address spaces with their parents, and multiple
> serving processes mapping large shared data repositories from SHM
> segments) that saying VA = RAM + swap is going to break lots of jobs.
> But pushing up the VA limit massively makes it useless for the purpose
> of preventing excessive swapping. If you want to prevent excessive
> swap space usage without breaking a large class of apps, you need to
> limit swap space, not virtual address space.
> 
> Additionally, you suggested that VA limits provide a "soft-landing".
> But I'm think that the number of applications that will do much other
> than abort() if mmap() returns ENOMEM is extremely small - I'd be
> interested to hear if you know of any.
> 

What happens if swap is completely disabled? Should the running task be OOM
killed in the container? How does the application get to know that it is
approaching its limit? I suspect the system administrator will consider
vm.overcommit_ratio while setting up the virtual address space limit and the real
page usage limit. As far as applications failing gracefully is concerned, my opinion is:

1. Let's not let badly written applications dictate the design of our features
2. Autonomic computing is pushing applications to be aware of what resources
they actually have access to
3. Swapping is expensive, so most application developers I have spoken to at
conferences recently state that they can manage their own memory, provided
they are given sufficient hints from the OS. An mmap() failure, for example, can
force the application to free memory it is not currently using or trigger the
garbage collector in a managed environment.


> I'm not going to argue that there are no good reasons for VA limits,
> but I think my arguments above will apply in enough cases that VA
> limits won't be used in the majority of cases that are using the
> memory controller, let alone all machines running kernels with the
> memory controller configured (e.g. distro kernels). Hence it should be
> possible to use the memory controller without paying the full overhead
> for the virtual address space limits.
> 

Yes, the overhead part is a compelling reason to split out the controllers. But
then again, we have a config option to disable the overhead.

> 
> And in cases that do want to use VA limits, can you be 100% sure that
> they're going to want to use the same groupings as the memory
> controller? I'm not sure that I can come up with a realistic example
> of why you'd want to have VA limits and memory limits in different
> hierarchies (maybe tracking memory leaks in subgroups of a job and
> using physical memory control for the job as a whole?), but any such
> example would work for free if they were two separate subsystems.
> 
> The only real technical argument against having them in separate
> subsystems is that there needs to be an extra pointer from mm_struct
> to a va_limit subsystem object if they're separate, since the VA
> limits can no longer use mm->mem_cgroup. This is basically 8 bytes of
> overhead per process (not per-thread) which is minimal, and even that
> could go away if we were to implement the mm->owner concept.
>

Yes, the mm->owner patches would help split the controller out more easily. Let
me see if I can get another revision of that working and measure the overhead of
finding the next mm->owner.


> 
> Paul


-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL


* Re: [RFC][2/3] Account and control virtual address space allocations (v2)
  2008-03-27  8:02     ` Balbir Singh
@ 2008-03-27  8:24       ` Pavel Emelyanov
  2008-03-27  8:30         ` Balbir Singh
  0 siblings, 1 reply; 23+ messages in thread
From: Pavel Emelyanov @ 2008-03-27  8:24 UTC (permalink / raw)
  To: balbir
  Cc: Andrew Morton, Hugh Dickins, Sudhir Kumar, YAMAMOTO Takashi,
	Paul Menage, lizf, linux-kernel, taka, linux-mm, David Rientjes,
	KAMEZAWA Hiroyuki

Balbir Singh wrote:
> Pavel Emelyanov wrote:
>> Balbir Singh wrote:
>>> Changelog v2
>>> ------------
>>> Change the accounting to what is already present in the kernel. Split
>>> the address space accounting into mem_cgroup_charge_as and
>>> mem_cgroup_uncharge_as. At the time of VM expansion, call
>>> mem_cgroup_cannot_expand_as to check if the new allocation will push
>>> us over the limit
>>>
>>> This patch implements accounting and control of virtual address space.
>>> Accounting is done when the virtual address space of any task/mm_struct
>>> belonging to the cgroup is incremented or decremented. This patch
>>> fails the expansion if the cgroup goes over its limit.
>>>
>>> TODOs
>>>
>>> 1. Only when CONFIG_MMU is enabled, is the virtual address space control
>>>    enabled. Should we do this for nommu cases as well? My suspicion is
>>>    that we don't have to.
>>>
>>> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
>>> ---
>>>
>>>  arch/ia64/kernel/perfmon.c  |    2 +
>>>  arch/x86/kernel/ptrace.c    |    7 +++
>>>  fs/exec.c                   |    2 +
>>>  include/linux/memcontrol.h  |   26 +++++++++++++
>>>  include/linux/res_counter.h |   19 ++++++++--
>>>  init/Kconfig                |    2 -
>>>  kernel/fork.c               |   17 +++++++--
>>>  mm/memcontrol.c             |   83 ++++++++++++++++++++++++++++++++++++++++++++
>>>  mm/mmap.c                   |   11 +++++
>>>  mm/mremap.c                 |    2 +
>>>  10 files changed, 163 insertions(+), 8 deletions(-)
>>>
>>> diff -puN mm/memcontrol.c~memory-controller-virtual-address-space-accounting-and-control mm/memcontrol.c
>>> --- linux-2.6.25-rc5/mm/memcontrol.c~memory-controller-virtual-address-space-accounting-and-control	2008-03-26 16:27:59.000000000 +0530
>>> +++ linux-2.6.25-rc5-balbir/mm/memcontrol.c	2008-03-27 00:18:16.000000000 +0530
>>> @@ -526,6 +526,76 @@ unsigned long mem_cgroup_isolate_pages(u
>>>  	return nr_taken;
>>>  }
>>>  
>>> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_AS
>>> +/*
>>> + * Charge the address space usage for cgroup. This routine is most
>>> + * likely to be called from places that expand the total_vm of a mm_struct.
>>> + */
>>> +void mem_cgroup_charge_as(struct mm_struct *mm, long nr_pages)
>>> +{
>>> +	struct mem_cgroup *mem;
>>> +
>>> +	if (mem_cgroup_subsys.disabled)
>>> +		return;
>>> +
>>> +	rcu_read_lock();
>>> +	mem = rcu_dereference(mm->mem_cgroup);
>>> +	css_get(&mem->css);
>>> +	rcu_read_unlock();
>>> +
>>> +	res_counter_charge(&mem->as_res, (nr_pages * PAGE_SIZE));
>>> +	css_put(&mem->css);
>> Why don't you check whether the counter is charged? This is
>> bad for two reasons:
>> 1. you allow for some growth above the limit (e.g. in expand_stack)
> 
> I was doing that earlier and then decided to keep the virtual address space code
> in sync with the RLIMIT_AS checking code in the kernel. If you see the flow, it
> closely resembles what we do with mm->total_vm and may_expand_vm().
> expand_stack() in turn calls acct_stack_growth() which calls may_expand_vm()

But this is racy! Suppose you do expand_stack on two CPUs when the limit is
almost reached, so that there's room for only a single expansion. In this case
may_expand_vm() will return true for both, since it only checks the limit,
while the subsequent charge will fail on one of them, since it actually
tries to raise the usage...

>> 2. you will undercharge it in the future when uncharging the
>>    vme, whose charge was failed and thus unaccounted.
> 
> Hmmm...  This should ideally never happen, since we do a may_expand_vm() before
> expanding the VM and in our case the virtual address space usage. I've not seen
> it during my runs either. But it is something to keep in mind.
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC][2/3] Account and control virtual address space allocations (v2)
  2008-03-27  8:24       ` Pavel Emelyanov
@ 2008-03-27  8:30         ` Balbir Singh
  2008-03-27  8:38           ` Pavel Emelyanov
  0 siblings, 1 reply; 23+ messages in thread
From: Balbir Singh @ 2008-03-27  8:30 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Andrew Morton, Hugh Dickins, Sudhir Kumar, YAMAMOTO Takashi,
	Paul Menage, lizf, linux-kernel, taka, linux-mm, David Rientjes,
	KAMEZAWA Hiroyuki

Pavel Emelyanov wrote:
> Balbir Singh wrote:
>> Pavel Emelyanov wrote:
>>> Balbir Singh wrote:
>>>> Changelog v2
>>>> ------------
>>>> Change the accounting to what is already present in the kernel. Split
>>>> the address space accounting into mem_cgroup_charge_as and
>>>> mem_cgroup_uncharge_as. At the time of VM expansion, call
>>>> mem_cgroup_cannot_expand_as to check if the new allocation will push
>>>> us over the limit
>>>>
>>>> This patch implements accounting and control of virtual address space.
>>>> Accounting is done when the virtual address space of any task/mm_struct
>>>> belonging to the cgroup is incremented or decremented. This patch
>>>> fails the expansion if the cgroup goes over its limit.
>>>>
>>>> TODOs
>>>>
>>>> 1. Virtual address space control is enabled only when CONFIG_MMU is
>>>>    enabled. Should we do this for the nommu case as well? My suspicion is
>>>>    that we don't have to.
>>>>
>>>> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
>>>> ---
>>>>
>>>>  arch/ia64/kernel/perfmon.c  |    2 +
>>>>  arch/x86/kernel/ptrace.c    |    7 +++
>>>>  fs/exec.c                   |    2 +
>>>>  include/linux/memcontrol.h  |   26 +++++++++++++
>>>>  include/linux/res_counter.h |   19 ++++++++--
>>>>  init/Kconfig                |    2 -
>>>>  kernel/fork.c               |   17 +++++++--
>>>>  mm/memcontrol.c             |   83 ++++++++++++++++++++++++++++++++++++++++++++
>>>>  mm/mmap.c                   |   11 +++++
>>>>  mm/mremap.c                 |    2 +
>>>>  10 files changed, 163 insertions(+), 8 deletions(-)
>>>>
>>>> diff -puN mm/memcontrol.c~memory-controller-virtual-address-space-accounting-and-control mm/memcontrol.c
>>>> --- linux-2.6.25-rc5/mm/memcontrol.c~memory-controller-virtual-address-space-accounting-and-control	2008-03-26 16:27:59.000000000 +0530
>>>> +++ linux-2.6.25-rc5-balbir/mm/memcontrol.c	2008-03-27 00:18:16.000000000 +0530
>>>> @@ -526,6 +526,76 @@ unsigned long mem_cgroup_isolate_pages(u
>>>>  	return nr_taken;
>>>>  }
>>>>  
>>>> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_AS
>>>> +/*
>>>> + * Charge the address space usage for cgroup. This routine is most
>>>> + * likely to be called from places that expand the total_vm of a mm_struct.
>>>> + */
>>>> +void mem_cgroup_charge_as(struct mm_struct *mm, long nr_pages)
>>>> +{
>>>> +	struct mem_cgroup *mem;
>>>> +
>>>> +	if (mem_cgroup_subsys.disabled)
>>>> +		return;
>>>> +
>>>> +	rcu_read_lock();
>>>> +	mem = rcu_dereference(mm->mem_cgroup);
>>>> +	css_get(&mem->css);
>>>> +	rcu_read_unlock();
>>>> +
>>>> +	res_counter_charge(&mem->as_res, (nr_pages * PAGE_SIZE));
>>>> +	css_put(&mem->css);
>>> Why don't you check whether the counter is charged? This is
>>> bad for two reasons:
>>> 1. you allow for some growth above the limit (e.g. in expand_stack)
>> I was doing that earlier and then decided to keep the virtual address space code
>> in sync with the RLIMIT_AS checking code in the kernel. If you see the flow, it
>> closely resembles what we do with mm->total_vm and may_expand_vm().
>> expand_stack() in turn calls acct_stack_growth() which calls may_expand_vm()
> 
> But this is racy! Look - you do expand_stack on two CPUs and the limit is
> almost reached - so that there's room for a single expansion. In this case 
> may_expand_vm will return true for both, since it only checks the limit, 
> while the subsequent charge will fail on one of them, since it actually 
> tries to raise the usage...
> 

Hmm... yes, possibly. Thanks for pointing this out. For a single mm_struct, the
check is done under mmap_sem, so it's OK within a process. I suspect I'll have
to go back to what I had earlier. I don't want to add a mutex to mem_cgroup;
that would hurt parallelism badly.

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL


* Re: [RFC][2/3] Account and control virtual address space allocations (v2)
  2008-03-27  8:30         ` Balbir Singh
@ 2008-03-27  8:38           ` Pavel Emelyanov
  0 siblings, 0 replies; 23+ messages in thread
From: Pavel Emelyanov @ 2008-03-27  8:38 UTC (permalink / raw)
  To: balbir
  Cc: Andrew Morton, Hugh Dickins, Sudhir Kumar, YAMAMOTO Takashi,
	Paul Menage, lizf, linux-kernel, taka, linux-mm, David Rientjes,
	KAMEZAWA Hiroyuki

[snip]

>>>>> +	css_put(&mem->css);
>>>> Why don't you check whether the counter is charged? This is
>>>> bad for two reasons:
>>>> 1. you allow for some growth above the limit (e.g. in expand_stack)
>>> I was doing that earlier and then decided to keep the virtual address space code
>>> in sync with the RLIMIT_AS checking code in the kernel. If you see the flow, it
>>> closely resembles what we do with mm->total_vm and may_expand_vm().
>>> expand_stack() in turn calls acct_stack_growth() which calls may_expand_vm()
>> But this is racy! Look - you do expand_stack on two CPUs and the limit is
>> almost reached - so that there's room for a single expansion. In this case 
>> may_expand_vm will return true for both, since it only checks the limit, 
>> while the subsequent charge will fail on one of them, since it actually 
>> tries to raise the usage...
>>
> 
> Hmm... yes, possibly. Thanks for pointing this out. For a single mm_struct, the
> check is done under mmap_sem, so it's OK within a process. I suspect I'll have

Sure, but this controller should work with an arbitrary group of processes ;)

> to go back to what I had earlier. I don't want to add a mutex to mem_cgroup;
> that would hurt parallelism badly.

My opinion is that we should always perform a pure charge without any
pre-checks, etc.

Thanks,
Pavel


* Re: [RFC][1/3] Add user interface for virtual address space control (v2)
  2008-03-26 18:50 ` [RFC][1/3] Add user interface for virtual address space control (v2) Balbir Singh
@ 2008-03-27  9:14   ` KAMEZAWA Hiroyuki
  2008-03-27  9:39     ` Pavel Emelyanov
  0 siblings, 1 reply; 23+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-03-27  9:14 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Andrew Morton, Pavel Emelianov, Hugh Dickins, Sudhir Kumar,
	YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel, taka,
	linux-mm, David Rientjes

On Thu, 27 Mar 2008 00:20:06 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> 
> 
> Add as_usage_in_bytes and as_limit_in_bytes interfaces. These provide
> control over the total address space that the processes in the cgroup,
> taken together, can grow up to. This functionality is analogous to
> the RLIMIT_AS resource limit of the getrlimit(2) and setrlimit(2) calls.
> An as_res resource counter is added to the mem_cgroup structure. The
> as_res counter handles all the accounting and control associated with
> the virtual address space of cgroups.
> 
> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>

I wonder whether it's better to create an "rlimit cgroup" rather than enhance
the memory controller. (But I have no strong opinion.)
What do you think?

Thanks,
-Kame



* Re: [RFC][1/3] Add user interface for virtual address space control (v2)
  2008-03-27  9:14   ` KAMEZAWA Hiroyuki
@ 2008-03-27  9:39     ` Pavel Emelyanov
  2008-03-27  9:46       ` Balbir Singh
  0 siblings, 1 reply; 23+ messages in thread
From: Pavel Emelyanov @ 2008-03-27  9:39 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki, Balbir Singh
  Cc: Andrew Morton, Hugh Dickins, Sudhir Kumar, YAMAMOTO Takashi,
	Paul Menage, lizf, linux-kernel, taka, linux-mm, David Rientjes

KAMEZAWA Hiroyuki wrote:
> On Thu, 27 Mar 2008 00:20:06 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
>>
>> Add as_usage_in_bytes and as_limit_in_bytes interfaces. These provide
>> control over the total address space that the processes in the cgroup,
>> taken together, can grow up to. This functionality is analogous to
>> the RLIMIT_AS resource limit of the getrlimit(2) and setrlimit(2) calls.
>> An as_res resource counter is added to the mem_cgroup structure. The
>> as_res counter handles all the accounting and control associated with
>> the virtual address space of cgroups.
>>
>> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
> 
> I wonder whether it's better to create an "rlimit cgroup" rather than enhance
> the memory controller. (But I have no strong opinion.)
> What do you think?

I believe it's better to have all memory management in one controller...

> Thanks,
> -Kame
> 



* Re: [RFC][1/3] Add user interface for virtual address space control (v2)
  2008-03-27  9:39     ` Pavel Emelyanov
@ 2008-03-27  9:46       ` Balbir Singh
  0 siblings, 0 replies; 23+ messages in thread
From: Balbir Singh @ 2008-03-27  9:46 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, Hugh Dickins, Sudhir Kumar,
	YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel, taka,
	linux-mm, David Rientjes

Pavel Emelyanov wrote:
> KAMEZAWA Hiroyuki wrote:
>> On Thu, 27 Mar 2008 00:20:06 +0530
>> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>>
>>> Add as_usage_in_bytes and as_limit_in_bytes interfaces. These provide
>>> control over the total address space that the processes in the cgroup,
>>> taken together, can grow up to. This functionality is analogous to
>>> the RLIMIT_AS resource limit of the getrlimit(2) and setrlimit(2) calls.
>>> An as_res resource counter is added to the mem_cgroup structure. The
>>> as_res counter handles all the accounting and control associated with
>>> the virtual address space of cgroups.
>>>
>>> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
>> I wonder whether it's better to create an "rlimit cgroup" rather than enhance
>> the memory controller. (But I have no strong opinion.)
>> What do you think?
> 
> I believe it's better to have all memory management in one controller...
> 

Paul wants to see it in a different controller. He has been reasoning it out in
another email thread.

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL


* Re: [RFC][0/3] Virtual address space control for cgroups (v2)
  2008-03-26 22:22 ` [RFC][0/3] Virtual address space control for cgroups (v2) Paul Menage
  2008-03-27  8:04   ` Balbir Singh
@ 2008-03-27 10:03   ` KAMEZAWA Hiroyuki
  2008-03-27 13:59     ` Paul Menage
  1 sibling, 1 reply; 23+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-03-27 10:03 UTC (permalink / raw)
  To: Paul Menage
  Cc: Balbir Singh, Andrew Morton, Pavel Emelianov, Hugh Dickins,
	Sudhir Kumar, YAMAMOTO Takashi, lizf, linux-kernel, taka,
	linux-mm, David Rientjes

On Wed, 26 Mar 2008 15:22:47 -0700
"Paul Menage" <menage@google.com> wrote:

> On Wed, Mar 26, 2008 at 11:49 AM, Balbir Singh
> <balbir@linux.vnet.ibm.com> wrote:
> >
> >  The changelog in each patchset documents what has changed in version 2.
> >  The most important one being that virtual address space accounting is
> >  now a config option.
> >
> >  Reviews, Comments?
> >
> 
> I'm still of the strong opinion that this belongs in a separate
> subsystem. (So some of these arguments will appear familiar, but that's
> generally because they were unaddressed previously).
> 
> 
How about creating an "rlimit controller" and expanding rlimit to process groups?
I think it's more straightforward to do this.

Thanks,
-Kame



* Re: [RFC][0/3] Virtual address space control for cgroups (v2)
  2008-03-27 10:03   ` KAMEZAWA Hiroyuki
@ 2008-03-27 13:59     ` Paul Menage
  0 siblings, 0 replies; 23+ messages in thread
From: Paul Menage @ 2008-03-27 13:59 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Balbir Singh, Andrew Morton, Pavel Emelianov, Hugh Dickins,
	Sudhir Kumar, YAMAMOTO Takashi, lizf, linux-kernel, taka,
	linux-mm, David Rientjes

On Thu, Mar 27, 2008 at 3:03 AM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
>  How about creating an "rlimit controller" and expanding rlimit to process groups?
>  I think it's more straightforward to do this.
>

Yes, that could be useful - the only concern that I would have is that
putting all the rlimits in the same subsystem could limit flexibility.

Paul


* Re: [RFC][0/3] Virtual address space control for cgroups (v2)
  2008-03-27  8:04   ` Balbir Singh
@ 2008-03-27 14:28     ` Paul Menage
  2008-03-27 17:50       ` Balbir Singh
  0 siblings, 1 reply; 23+ messages in thread
From: Paul Menage @ 2008-03-27 14:28 UTC (permalink / raw)
  To: balbir
  Cc: Andrew Morton, Pavel Emelianov, Hugh Dickins, Sudhir Kumar,
	YAMAMOTO Takashi, lizf, linux-kernel, taka, linux-mm,
	David Rientjes, KAMEZAWA Hiroyuki

On Thu, Mar 27, 2008 at 1:04 AM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>
>  I thought I addressed some of those by adding a separate config option. You
>  could enable just the address space control by leaving memory.limit_in_bytes
>  at the maximum value it is at (at the moment).
>

Having a config option is better than none at all, certainly for
people who roll their own kernels. But what config choice should a
distro make when they're deciding what to build into their kernel
configuration?

It's much easier to decide to build in a feature that can be ignored
by those who don't use it.

>
>  Yes, I agree with the overhead philosophy. I suspect that users will enable
>  both. I am not against making it a separate controller. I am still hopeful of
>  getting the mm->owner approach working
>

I was thinking more about that, and I think I found a possibly fatal flaw:


>
>  >
>  > Trying to account/control physical memory or swap usage via virtual
>  > address space limits is IMO a hopeless task. Taking Google's
>  > production clusters and the virtual server systems that I worked on in
>  > my previous job as real-life examples that I've encountered, there's
>  > far too much variety of application behaviour (including Java apps
>  > that have massive sparse heaps, jobs with lots of forked children
>  > sharing pages but not address spaces with their parents, and multiple
>  > serving processes mapping large shared data repositories from SHM
>  > segments) that saying VA = RAM + swap is going to break lots of jobs.
>  > But pushing up the VA limit massively makes it useless for the purpose
>  > of preventing excessive swapping. If you want to prevent excessive
>  > swap space usage without breaking a large class of apps, you need to
>  > limit swap space, not virtual address space.
>  >
>  > Additionally, you suggested that VA limits provide a "soft-landing".
>  > But I think that the number of applications that will do much other
>  > than abort() if mmap() returns ENOMEM is extremely small - I'd be
>  > interested to hear if you know of any.
>  >
>
>  What happens if swap is completely disabled? Should the task running be OOM
>  killed in the container?

Yes, I think so.

>  How does the application get to know that it is
>  reaching its limit?

That's something that needs to be addressed outside of the concept of
cgroups too.

> I suspect the system administrator will consider
>  vm.overcommit_ratio while setting up virtual address space limits and real page
>  usage limit. As far as applications failing gracefully is concerned, my opinion is
>
>  1. Let's not let bad applications dictate the design of our features
>  2. Autonomic computing is forcing applications to be aware of what
>  resources they have access to

Yes, you're right - I shouldn't be arguing this based on current apps,
I should be thinking of the potential for future apps.

>  3. Swapping is expensive, so most application developers I've spoken to at
>  conferences recently state that they can manage their own memory, provided
>  they are given sufficient hints from the OS. An mmap() failure, for example,
>  can force the application to free memory it is not currently using or trigger
>  the garbage collector in a managed environment.

But the problem that I have with this is that mmap() is only very
loosely connected with physical memory. If we're trying to help
applications avoid swapping, and giving them advance warning that
they're running out of physical memory, then we should do exactly
that, not try to treat address space as a proxy for physical memory.
For apps where there's a close correspondence between virtual address
space and physical memory, this should work equally well. For apps
that use a lot more virtual address space than physical memory this
should work much better.

Paul


* Re: [RFC][0/3] Virtual address space control for cgroups (v2)
  2008-03-27 14:28     ` Paul Menage
@ 2008-03-27 17:50       ` Balbir Singh
  2008-03-27 18:44         ` Paul Menage
  0 siblings, 1 reply; 23+ messages in thread
From: Balbir Singh @ 2008-03-27 17:50 UTC (permalink / raw)
  To: Paul Menage
  Cc: Andrew Morton, Pavel Emelianov, Hugh Dickins, Sudhir Kumar,
	YAMAMOTO Takashi, lizf, linux-kernel, taka, linux-mm,
	David Rientjes, KAMEZAWA Hiroyuki

Paul Menage wrote:
> On Thu, Mar 27, 2008 at 1:04 AM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>>  I thought I addressed some of those by adding a separate config option. You
>>  could enable just the address space control by leaving memory.limit_in_bytes
>>  at the maximum value it is at (at the moment).
>>
> 
> Having a config option is better than none at all, certainly for
> people who roll their own kernels. But what config choice should a
> distro make when they're deciding what to build into their kernel
> configuration?
> 
> It's much easier to decide to build in a feature that can be ignored
> by those who don't use it.
> 

Yes, the distro problem definitely arises.

>>  Yes, I agree with the overhead philosophy. I suspect that users will enable
>>  both. I am not against making it a separate controller. I am still hopeful of
>>  getting the mm->owner approach working
>>
> 
> I was thinking more about that, and I think I found a possibly fatal flaw:
> 

What is the critical flaw?

> 
>>  >
>>  > Trying to account/control physical memory or swap usage via virtual
>>  > address space limits is IMO a hopeless task. Taking Google's
>>  > production clusters and the virtual server systems that I worked on in
>>  > my previous job as real-life examples that I've encountered, there's
>>  > far too much variety of application behaviour (including Java apps
>>  > that have massive sparse heaps, jobs with lots of forked children
>>  > sharing pages but not address spaces with their parents, and multiple
>>  > serving processes mapping large shared data repositories from SHM
>>  > segments) that saying VA = RAM + swap is going to break lots of jobs.
>>  > But pushing up the VA limit massively makes it useless for the purpose
>>  > of preventing excessive swapping. If you want to prevent excessive
>>  > swap space usage without breaking a large class of apps, you need to
>>  > limit swap space, not virtual address space.
>>  >
>>  > Additionally, you suggested that VA limits provide a "soft-landing".
>>  > But I think that the number of applications that will do much other
>>  > than abort() if mmap() returns ENOMEM is extremely small - I'd be
>>  > interested to hear if you know of any.
>>  >
>>
>>  What happens if swap is completely disabled? Should the task running be OOM
>>  killed in the container?
> 
> Yes, I think so.
> 
>>  How does the application get to know that it is
>>  reaching its limit?
> 
> That's something that needs to be addressed outside of the concept of
> cgroups too.
> 

Yes, I've seen some patches there as well. As far as sparse virtual addresses
are concerned, I find it hard to understand why applications would use sparse
physical memory and large virtual addresses. Please see my comment on overcommit
below.

>> I suspect the system administrator will consider
>>  vm.overcommit_ratio while setting up virtual address space limits and real page
>>  usage limit. As far as applications failing gracefully is concerned, my opinion is
>>
>>  1. Let's not let bad applications dictate the design of our features
>>  2. Autonomic computing is forcing applications to be aware of what
>>  resources they have access to
> 
> Yes, you're right - I shouldn't be arguing this based on current apps,
> I should be thinking of the potential for future apps.
> 
>>  3. Swapping is expensive, so most application developers I've spoken to at
>>  conferences recently state that they can manage their own memory, provided
>>  they are given sufficient hints from the OS. An mmap() failure, for example,
>>  can force the application to free memory it is not currently using or trigger
>>  the garbage collector in a managed environment.
> 
> But the problem that I have with this is that mmap() is only very
> loosely connected with physical memory. If we're trying to help
> applications avoid swapping, and giving them advance warning that
> they're running out of physical memory, then we should do exactly
> that, not try to treat address space as a proxy for physical memory.

Consider why we have the overcommit feature in the Linux kernel. Virtual memory
limits (decided by the administrator) help us prevent excessive overcommitment
of the system. Please try it on a system where you predict that physical memory
usage is sparse compared to virtual memory usage, and see whether you can
allocate more than Committed_AS (as seen in /proc/meminfo).

> For apps where there's a close correspondence between virtual address
> space and physical memory, this should work equally well. For apps
> that use a lot more virtual address space than physical memory this
> should work much better.
> 
> Paul
> 


-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL


* Re: [RFC][0/3] Virtual address space control for cgroups (v2)
  2008-03-27 17:50       ` Balbir Singh
@ 2008-03-27 18:44         ` Paul Menage
  2008-03-28  3:59           ` Balbir Singh
  0 siblings, 1 reply; 23+ messages in thread
From: Paul Menage @ 2008-03-27 18:44 UTC (permalink / raw)
  To: balbir
  Cc: Andrew Morton, Pavel Emelianov, Hugh Dickins, Sudhir Kumar,
	YAMAMOTO Takashi, lizf, linux-kernel, taka, linux-mm,
	David Rientjes, KAMEZAWA Hiroyuki

On Thu, Mar 27, 2008 at 10:50 AM, Balbir Singh
<balbir@linux.vnet.ibm.com> wrote:
>  >
>  > I was thinking more about that, and I think I found a possibly fatal flaw:
>  >
>
>  What is the critical flaw?
>

Oops, after I'd written that I decided while describing it that maybe
it wasn't that fatal after all, just fiddly, and so deleted the
description but forgot to delete the preceding sentence. :-)

There were a couple of issues. The first was that if the new owner is
in a different cgroup, we might have to fix up the address space
charges when we pass off the ownership, which would be a bit of a
layer violation but maybe manageable.

The other was to do with ensuring that mm->owner remains valid until
after exit_mmap() has been called (so the va limit controller can
deduct from the va usage).
>
>  Yes, I've seen some patches there as well. As far as sparse virtual addresses
>  are concerned, I find it hard to understand why applications would use sparse
>  physical memory and large virtual addresses. Please see my comment on overcommit
>  below.

Java (or at least, Sun's JRE) is an example of a common application
that does this. It creates a huge heap mapping at startup, and faults
it in as necessary.

>  > But the problem that I have with this is that mmap() is only very
>  > loosely connected with physical memory. If we're trying to help
>  > applications avoid swapping, and giving them advance warning that
>  > they're running out of physical memory, then we should do exactly
>  > that, not try to treat address space as a proxy for physical memory.
>
>  Consider why we have the overcommit feature in the Linux kernel. Virtual memory
>  limits (decided by the administrator) help us prevent excessive overcommitment
>  of the system.

Well if I don't believe in per-container virtual address space limits,
I'm unlikely to be a big fan of system-wide virtual address space
limits either. So running with vm.overcommit_memory=2 is right out ...

I'm certainly not disputing that it's possible to avoid excessive
overcommit by using virtual address space limits.

It's just that, for both of the real-world large-scale production
systems I've worked with (a virtual server system for ISPs, and
Google's production datacenters) there were enough cases of apps/jobs
that used far more virtual address space than actual physical memory
that picking a virtual address space ratio/limit that would be useful
for preventing dangerous overcommit while not breaking lots of apps
would be pretty much impossible to do automatically. And specifying
them manually requires either unusually clueful users (most of whom
have enough trouble figuring out how much physical memory they'll
need, and would just set very high virtual address space limits) or
sysadmins with way too much time on their hands ...

As I said, I think focussing on ways to tell apps that they're running
low on physical memory would be much more productive.

Paul


* Re: [RFC][0/3] Virtual address space control for cgroups (v2)
  2008-03-27 18:44         ` Paul Menage
@ 2008-03-28  3:59           ` Balbir Singh
  2008-03-28 14:37             ` Paul Menage
  0 siblings, 1 reply; 23+ messages in thread
From: Balbir Singh @ 2008-03-28  3:59 UTC (permalink / raw)
  To: Paul Menage
  Cc: Andrew Morton, Pavel Emelianov, Hugh Dickins, Sudhir Kumar,
	YAMAMOTO Takashi, lizf, linux-kernel, taka, linux-mm,
	David Rientjes, KAMEZAWA Hiroyuki

Paul Menage wrote:
> On Thu, Mar 27, 2008 at 10:50 AM, Balbir Singh
> <balbir@linux.vnet.ibm.com> wrote:
>>  >
>>  > I was thinking more about that, and I think I found a possibly fatal flaw:
>>  >
>>
>>  What is the critical flaw?
>>
> 
> Oops, after I'd written that I decided while describing it that maybe
> it wasn't that fatal after all, just fiddly, and so deleted the
> description but forgot to delete the preceding sentence. :-)
> 
> There were a couple of issues. The first was that if the new owner is
> in a different cgroup, we might have to fix up the address space
> charges when we pass off the ownership, which would be a bit of a
> layer violation but maybe manageable.
> 

Yes, we do pass off virtual address space charges during migration. As far as
physical memory control is concerned, the page_cgroup has a pointer to the
mem_cgroup and thus gets returned back to the original mem_cgroup.

> The other was to do with ensuring that mm->owner remains valid until
> after exit_mmap() has been called (so the va limit controller can
> deduct from the va usage).
>>  Yes, I've seen some patches there as well. As far as sparse virtual addresses
>>  are concerned, I find it hard to understand why applications would use sparse
>>  physical memory and large virtual addresses. Please see my comment on overcommit
>>  below.
> 
> Java (or at least, Sun's JRE) is an example of a common application
> that does this. It creates a huge heap mapping at startup, and faults
> it in as necessary.
> 

Isn't this controlled by the java -Xm options?

>>  > But the problem that I have with this is that mmap() is only very
>>  > loosely connected with physical memory. If we're trying to help
>>  > applications avoid swapping, and giving them advance warning that
>>  > they're running out of physical memory, then we should do exactly
>>  > that, not try to treat address space as a proxy for physical memory.
>>
>>  Consider why we have the overcommit feature in the Linux kernel. Virtual memory
>>  limits (decided by the administrator) help us prevent excessive overcommitment
>>  of the system.
> 
> Well if I don't believe in per-container virtual address space limits,
> I'm unlikely to be a big fan of system-wide virtual address space
> limits either. So running with vm.overcommit_memory=2 is right out ...
> 

Yes, most distros don't do that; on my distro, we have

vm.overcommit_ratio = 50
vm.overcommit_memory = 0


> I'm certainly not disputing that it's possible to avoid excessive
> overcommit by using virtual address space limits.
> 
> It's just that, for both of the real-world large-scale production
> systems I've worked with (a virtual server system for ISPs, and
> Google's production datacenters) there were enough cases of apps/jobs
> that used far more virtual address space than actual physical memory
> that picking a virtual address space ratio/limit that would be useful
> for preventing dangerous overcommit while not breaking lots of apps
> would be pretty much impossible to do automatically.

I understand, but

1. The system enforces overcommit by default on most distros, so why should we
not have something similarly flexible for cgroups?
2. The administrator is expected to figure out which applications need virtual
address space control. Some might need it, some might not.


>  And specifying
> them manually requires either unusually clueful users (most of whom
> have enough trouble figuring out how much physical memory they'll
> need, and would just set very high virtual address space limits) or
> sysadmins with way too much time on their hands ...
> 

It's a one-time thing for sysadmins to set up.

> As I said, I think focussing on ways to tell apps that they're running
> low on physical memory would be much more productive.
>

We intend to do that as well. We intend to have user-space OOM notification;
it's on the long-term TODO list, along with watermarks, etc.


-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL


* Re: [RFC][0/3] Virtual address space control for cgroups (v2)
  2008-03-28  3:59           ` Balbir Singh
@ 2008-03-28 14:37             ` Paul Menage
  2008-03-28 18:13               ` Balbir Singh
  0 siblings, 1 reply; 23+ messages in thread
From: Paul Menage @ 2008-03-28 14:37 UTC (permalink / raw)
  To: balbir
  Cc: Andrew Morton, Pavel Emelianov, Hugh Dickins, Sudhir Kumar,
	YAMAMOTO Takashi, lizf, linux-kernel, taka, linux-mm,
	David Rientjes, KAMEZAWA Hiroyuki

On Thu, Mar 27, 2008 at 8:59 PM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>  > Java (or at least, Sun's JRE) is an example of a common application
>  > that does this. It creates a huge heap mapping at startup, and faults
>  > it in as necessary.
>  >
>
>  Isn't this controlled by the java -Xm options?
>

Probably - that was just an example, and the behaviour of Java isn't
exactly unreasonable. A different example would be an app that maps a
massive database file, but only pages small amounts of it in at any
one time.
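The divergence described here — a huge mapping of which only a little is ever resident — is easy to observe from user space. A minimal Linux-only demonstration (assumes `/proc/self/statm` is readable; sizes are illustrative):

```python
# Map a large anonymous region, touch only one page, and watch virtual
# size grow while resident size barely moves.
import mmap
import resource

PAGE = resource.getpagesize()

def statm_pages():
    """Return (total_pages, resident_pages) for this process."""
    with open("/proc/self/statm") as f:
        size, resident = f.read().split()[:2]
    return int(size), int(resident)

size_before, resident_before = statm_pages()

region = mmap.mmap(-1, 256 * 1024 * 1024)  # reserve 256 MiB of address space
region[0] = 1                              # fault in a single page

size_after, resident_after = statm_pages()
virtual_grew = (size_after - size_before) * PAGE
resident_grew = (resident_after - resident_before) * PAGE

print(f"virtual grew by ~{virtual_grew >> 20} MiB, "
      f"resident by ~{resident_grew >> 20} MiB")
```

A virtual address space limit charges the full 256 MiB at `mmap()` time, even though almost none of it is backed by physical pages — which is exactly why such a limit can be too blunt for this class of application.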

>
>  I understand, but
>
>  1. The system already enforces an overcommit policy by default on most
>  distros, so why shouldn't cgroups have something similarly flexible?

Right, I guess I should make it clear that I'm *not* arguing that we
shouldn't have a virtual address space limit subsystem.

My main arguments in this and my previous email were to back up my
assertion that there is a significant set of real-world cases where
it doesn't help, and hence it should be a separate subsystem that can
be turned on or off as desired.

It strikes me that when split into its own subsystem, this is going to
be very simple - basically just a resource counter and some file
handlers. We should probably have something like
include/linux/rescounter_subsys_template.h, so you can do:

#define SUBSYS_NAME va
#define SUBSYS_UNIT_SUFFIX in_bytes
#include <linux/rescounter_subsys_template.h>

then all you have to add are the hooks to call the rescounter
charge/uncharge functions and you're done. It would be nice to have a
separate trivial subsystem like this for each of the rlimit types, not
just virtual address space.
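The template proposed above reduces each rlimit-style subsystem to a counter plus charge/uncharge hooks. A user-space Python model of those counter semantics — not kernel code, and the names are illustrative rather than the actual res_counter API:

```python
# Model of the "just a resource counter" subsystem: charge/uncharge
# against a hard limit, tracking usage and refused charges.
class ResCounter:
    def __init__(self, limit=float("inf")):
        self.usage = 0        # bytes currently charged
        self.limit = limit    # hard cap on usage
        self.failcnt = 0      # charges refused so far

    def charge(self, amount):
        """Try to account `amount` bytes; return False if over limit."""
        if self.usage + amount > self.limit:
            self.failcnt += 1
            return False
        self.usage += amount
        return True

    def uncharge(self, amount):
        self.usage = max(0, self.usage - amount)

# e.g. a per-cgroup virtual address space counter capped at 1 MiB:
va = ResCounter(limit=1 << 20)
assert va.charge(512 << 10)       # first 512 KiB mapping fits
assert not va.charge(768 << 10)   # this mapping would exceed the limit
va.uncharge(512 << 10)            # mapping torn down
```

In the kernel the charge hook would sit in the mmap/brk paths and the uncharge hook in unmap/exit; everything else really is just the counter and its control files.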

>   And specifying
>  > them manually requires either unusually clueful users (most of whom
>  > have enough trouble figuring out how much physical memory they'll
>  > need, and would just set very high virtual address space limits) or
>  > sysadmins with way too much time on their hands ...
>  >
>
>  It's a one-time thing to set up for sysadmins.
>

Sure, it's a one-time thing to set up *if* your cluster workload is
completely static.

>
>  > As I said, I think focussing on ways to tell apps that they're running
>  > low on physical memory would be much more productive.
>  >
>
>  We intend to do that as well. We intend to have user space OOM notification.

We've been playing with a user-space OOM notification system at Google
- it's on my TODO list to push it to mainline (as an independent
subsystem, since either cpusets or the memory controller can be used
to cause OOMs that are localized to a cgroup). What we have works
pretty well but I think our interface is a bit too much of a kludge at
this point.

Paul

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC][0/3] Virtual address space control for cgroups (v2)
  2008-03-28 14:37             ` Paul Menage
@ 2008-03-28 18:13               ` Balbir Singh
  0 siblings, 0 replies; 23+ messages in thread
From: Balbir Singh @ 2008-03-28 18:13 UTC (permalink / raw)
  To: Paul Menage
  Cc: Andrew Morton, Pavel Emelianov, Hugh Dickins, Sudhir Kumar,
	YAMAMOTO Takashi, lizf, linux-kernel, taka, linux-mm,
	David Rientjes, KAMEZAWA Hiroyuki

Paul Menage wrote:
> On Thu, Mar 27, 2008 at 8:59 PM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>>  > Java (or at least, Sun's JRE) is an example of a common application
>>  > that does this. It creates a huge heap mapping at startup, and faults
>>  > it in as necessary.
>>  >
>>
>>  Isn't this controlled by the java -Xm options?
>>
> 
> Probably - that was just an example, and the behaviour of Java isn't
> exactly unreasonable. A different example would be an app that maps a
> massive database file, but only pages small amounts of it in at any
> one time.
> 
>>  I understand, but
>>
>>  1. The system already enforces an overcommit policy by default on most
>>  distros, so why shouldn't cgroups have something similarly flexible?
> 
> Right, I guess I should make it clear that I'm *not* arguing that we
> shouldn't have a virtual address space limit subsystem.
> 
> My main arguments in this and my previous email were to back up my
> assertion that there is a significant set of real-world cases where
> it doesn't help, and hence it should be a separate subsystem that can
> be turned on or off as desired.
> 
> It strikes me that when split into its own subsystem, this is going to
> be very simple - basically just a resource counter and some file
> handlers. We should probably have something like
> include/linux/rescounter_subsys_template.h, so you can do:
> 
> #define SUBSYS_NAME va
> #define SUBSYS_UNIT_SUFFIX in_bytes
> #include <linux/rescounter_subsys_template.h>
> 
> then all you have to add are the hooks to call the rescounter
> charge/uncharge functions and you're done. It would be nice to have a
> separate trivial subsystem like this for each of the rlimit types, not
> just virtual address space.
> 

OK, I'll consider doing a separate controller once we get the mm->owner
issue sorted out.

>>   And specifying
>>  > them manually requires either unusually clueful users (most of whom
>>  > have enough trouble figuring out how much physical memory they'll
>>  > need, and would just set very high virtual address space limits) or
>>  > sysadmins with way too much time on their hands ...
>>  >
>>
>>  It's a one-time thing to set up for sysadmins.
>>
> 
> Sure, it's a one-time thing to set up *if* your cluster workload is
> completely static.
> 
>>  > As I said, I think focussing on ways to tell apps that they're running
>>  > low on physical memory would be much more productive.
>>  >
>>
>>  We intend to do that as well. We intend to have user space OOM notification.
> 
> We've been playing with a user-space OOM notification system at Google
> - it's on my TODO list to push it to mainline (as an independent
> subsystem, since either cpusets or the memory controller can be used
> to cause OOMs that are localized to a cgroup). What we have works
> pretty well but I think our interface is a bit too much of a kludge at
> this point.

It's good to know you have something generic working. I was planning to start
work on it later.

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2008-03-28 18:17 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-03-26 18:49 [RFC][0/3] Virtual address space control for cgroups (v2) Balbir Singh
2008-03-26 18:50 ` [RFC][1/3] Add user interface for virtual address space control (v2) Balbir Singh
2008-03-27  9:14   ` KAMEZAWA Hiroyuki
2008-03-27  9:39     ` Pavel Emelyanov
2008-03-27  9:46       ` Balbir Singh
2008-03-26 18:50 ` [RFC][2/3] Account and control virtual address space allocations (v2) Balbir Singh
2008-03-26 19:10   ` Balbir Singh
2008-03-27  7:19   ` Pavel Emelyanov
2008-03-27  8:02     ` Balbir Singh
2008-03-27  8:24       ` Pavel Emelyanov
2008-03-27  8:30         ` Balbir Singh
2008-03-27  8:38           ` Pavel Emelyanov
2008-03-26 18:50 ` [RFC][3/3] Update documentation for virtual address space control (v2) Balbir Singh
2008-03-26 22:22 ` [RFC][0/3] Virtual address space control for cgroups (v2) Paul Menage
2008-03-27  8:04   ` Balbir Singh
2008-03-27 14:28     ` Paul Menage
2008-03-27 17:50       ` Balbir Singh
2008-03-27 18:44         ` Paul Menage
2008-03-28  3:59           ` Balbir Singh
2008-03-28 14:37             ` Paul Menage
2008-03-28 18:13               ` Balbir Singh
2008-03-27 10:03   ` KAMEZAWA Hiroyuki
2008-03-27 13:59     ` Paul Menage

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).