linux-kernel.vger.kernel.org archive mirror
* [PATCH -V6 00/14] memcg: Add memcg extension to control HugeTLB allocation
@ 2012-04-16 10:44 Aneesh Kumar K.V
  2012-04-16 10:44 ` [PATCH -V6 01/14] hugetlb: rename max_hstate to hugetlb_max_hstate Aneesh Kumar K.V
                   ` (13 more replies)
  0 siblings, 14 replies; 44+ messages in thread
From: Aneesh Kumar K.V @ 2012-04-16 10:44 UTC (permalink / raw)
  To: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
	akpm, hannes
  Cc: linux-kernel, cgroups

Hi,

This patchset implements a memory controller extension to control
HugeTLB allocations. The extension allows limiting HugeTLB usage per
control group and enforces the controller limit during page fault.
Since HugeTLB doesn't support page reclaim, enforcing the limit at
page fault time implies that the application will get a SIGBUS signal
if it tries to access HugeTLB pages beyond its limit. This requires
the application to know beforehand how many HugeTLB pages it will
require for its use.
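As a sketch of how a job scheduler might configure such a limit,
assuming a cgroup hierarchy mounted at /cgroup and an illustrative
per-hstate control file name (the actual file names are defined by the
interface and documentation patches later in this series):

```shell
# Hedged sketch only: mount the memory controller, create a group for
# a job, cap its HugeTLB usage, and move the job's task into it.
# "memory.hugetlb.16MB.limit_in_bytes" is a placeholder name.
mount -t cgroup -o memory none /cgroup
mkdir /cgroup/job1
# Allow at most 10 pages of the 16MB hstate (10 * 16MB).
echo $((10 * 16 * 1024 * 1024)) > /cgroup/job1/memory.hugetlb.16MB.limit_in_bytes
echo "$JOB_PID" > /cgroup/job1/tasks
```

A task in job1 that faults in HugeTLB pages beyond the limit would
then get SIGBUS, since there is no reclaim to fall back on.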

The goal is to control how many HugeTLB pages a group of tasks can
allocate. It can be looked at as an extension of the existing quota
interface, which limits the number of HugeTLB pages per hugetlbfs
superblock. HPC job schedulers require jobs to specify their resource
requirements in the job file. Once those requirements can be met, job
schedulers (such as SLURM) will schedule the job. We need to make sure
that the jobs won't consume more resources than requested. If they do,
we should either error out or kill the application.

Patches are on top of

git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-3.5

Changes from V5:
 * Address review feedback.

Changes from V4:
 * Add support for charge/uncharge during page migration
 * Drop the usage of page->lru in unmap_hugepage_range.

Changes from V3:
 * Address review feedback.
 * Fix a bug in cgroup removal-related parent charging with use_hierarchy set

Changes from V2:
* Changed the implementation to limit the HugeTLB usage during page
  fault time. This simplifies the extension and keeps it closer to the
  memcg design. It also allows supporting cgroup removal with less
  complexity. The only caveat is that the application should ensure its
  HugeTLB usage doesn't cross the cgroup limit.

Changes from V1:
* Changed the implementation to be a memcg extension. We still use
  the same logic to track the cgroup and range.

Changes from RFC post:
* Added support for HugeTLB cgroup hierarchy
* Added support for task migration
* Added documentation patch
* Other bug fixes

-aneesh



^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH -V6 01/14] hugetlb: rename max_hstate to hugetlb_max_hstate
  2012-04-16 10:44 [PATCH -V6 00/14] memcg: Add memcg extension to control HugeTLB allocation Aneesh Kumar K.V
@ 2012-04-16 10:44 ` Aneesh Kumar K.V
  2012-05-24 21:11   ` David Rientjes
  2012-04-16 10:44 ` [PATCH -V6 02/14] hugetlbfs: don't use ERR_PTR with VM_FAULT* values Aneesh Kumar K.V
                   ` (12 subsequent siblings)
  13 siblings, 1 reply; 44+ messages in thread
From: Aneesh Kumar K.V @ 2012-04-16 10:44 UTC (permalink / raw)
  To: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
	akpm, hannes
  Cc: linux-kernel, cgroups, Aneesh Kumar K.V

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

We will be using this from other subsystems like memcg
in later patches.

Acked-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 mm/hugetlb.c |   14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index b8ce6f4..766eb90 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -34,7 +34,7 @@ const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL;
 static gfp_t htlb_alloc_mask = GFP_HIGHUSER;
 unsigned long hugepages_treat_as_movable;
 
-static int max_hstate;
+static int hugetlb_max_hstate;
 unsigned int default_hstate_idx;
 struct hstate hstates[HUGE_MAX_HSTATE];
 
@@ -46,7 +46,7 @@ static unsigned long __initdata default_hstate_max_huge_pages;
 static unsigned long __initdata default_hstate_size;
 
 #define for_each_hstate(h) \
-	for ((h) = hstates; (h) < &hstates[max_hstate]; (h)++)
+	for ((h) = hstates; (h) < &hstates[hugetlb_max_hstate]; (h)++)
 
 /*
  * Protects updates to hugepage_freelists, nr_huge_pages, and free_huge_pages
@@ -1897,9 +1897,9 @@ void __init hugetlb_add_hstate(unsigned order)
 		printk(KERN_WARNING "hugepagesz= specified twice, ignoring\n");
 		return;
 	}
-	BUG_ON(max_hstate >= HUGE_MAX_HSTATE);
+	BUG_ON(hugetlb_max_hstate >= HUGE_MAX_HSTATE);
 	BUG_ON(order == 0);
-	h = &hstates[max_hstate++];
+	h = &hstates[hugetlb_max_hstate++];
 	h->order = order;
 	h->mask = ~((1ULL << (order + PAGE_SHIFT)) - 1);
 	h->nr_huge_pages = 0;
@@ -1920,10 +1920,10 @@ static int __init hugetlb_nrpages_setup(char *s)
 	static unsigned long *last_mhp;
 
 	/*
-	 * !max_hstate means we haven't parsed a hugepagesz= parameter yet,
+	 * !hugetlb_max_hstate means we haven't parsed a hugepagesz= parameter yet,
 	 * so this hugepages= parameter goes to the "default hstate".
 	 */
-	if (!max_hstate)
+	if (!hugetlb_max_hstate)
 		mhp = &default_hstate_max_huge_pages;
 	else
 		mhp = &parsed_hstate->max_huge_pages;
@@ -1942,7 +1942,7 @@ static int __init hugetlb_nrpages_setup(char *s)
 	 * But we need to allocate >= MAX_ORDER hstates here early to still
 	 * use the bootmem allocator.
 	 */
-	if (max_hstate && parsed_hstate->order >= MAX_ORDER)
+	if (hugetlb_max_hstate && parsed_hstate->order >= MAX_ORDER)
 		hugetlb_hstate_alloc_pages(parsed_hstate);
 
 	last_mhp = mhp;
-- 
1.7.10


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH -V6 02/14] hugetlbfs: don't use ERR_PTR with VM_FAULT* values
  2012-04-16 10:44 [PATCH -V6 00/14] memcg: Add memcg extension to control HugeTLB allocation Aneesh Kumar K.V
  2012-04-16 10:44 ` [PATCH -V6 01/14] hugetlb: rename max_hstate to hugetlb_max_hstate Aneesh Kumar K.V
@ 2012-04-16 10:44 ` Aneesh Kumar K.V
  2012-05-24 21:17   ` David Rientjes
  2012-04-16 10:44 ` [PATCH -V6 03/14] hugetlbfs: Add an inline helper for finding hstate index Aneesh Kumar K.V
                   ` (11 subsequent siblings)
  13 siblings, 1 reply; 44+ messages in thread
From: Aneesh Kumar K.V @ 2012-04-16 10:44 UTC (permalink / raw)
  To: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
	akpm, hannes
  Cc: linux-kernel, cgroups, Aneesh Kumar K.V

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

Using VM_FAULT_* codes with ERR_PTR would require us to make sure that
VM_FAULT_* values never exceed the MAX_ERRNO value, so return plain
errno values instead and map them back to fault codes in the callers.

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 mm/hugetlb.c |   18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 766eb90..5a472a5 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1123,10 +1123,10 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
 	 */
 	chg = vma_needs_reservation(h, vma, addr);
 	if (chg < 0)
-		return ERR_PTR(-VM_FAULT_OOM);
+		return ERR_PTR(-ENOMEM);
 	if (chg)
 		if (hugepage_subpool_get_pages(spool, chg))
-			return ERR_PTR(-VM_FAULT_SIGBUS);
+			return ERR_PTR(-ENOSPC);
 
 	spin_lock(&hugetlb_lock);
 	page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve);
@@ -1136,7 +1136,7 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
 		page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
 		if (!page) {
 			hugepage_subpool_put_pages(spool, chg);
-			return ERR_PTR(-VM_FAULT_SIGBUS);
+			return ERR_PTR(-ENOSPC);
 		}
 	}
 
@@ -2486,6 +2486,7 @@ retry_avoidcopy:
 	new_page = alloc_huge_page(vma, address, outside_reserve);
 
 	if (IS_ERR(new_page)) {
+		int err = PTR_ERR(new_page);
 		page_cache_release(old_page);
 
 		/*
@@ -2515,7 +2516,10 @@ retry_avoidcopy:
 
 		/* Caller expects lock to be held */
 		spin_lock(&mm->page_table_lock);
-		return -PTR_ERR(new_page);
+		if (err == -ENOMEM)
+			return VM_FAULT_OOM;
+		else
+			return VM_FAULT_SIGBUS;
 	}
 
 	/*
@@ -2633,7 +2637,11 @@ retry:
 			goto out;
 		page = alloc_huge_page(vma, address, 0);
 		if (IS_ERR(page)) {
-			ret = -PTR_ERR(page);
+			ret = PTR_ERR(page);
+			if (ret == -ENOMEM)
+				ret = VM_FAULT_OOM;
+			else
+				ret = VM_FAULT_SIGBUS;
 			goto out;
 		}
 		clear_huge_page(page, address, pages_per_huge_page(h));
-- 
1.7.10


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH -V6 03/14] hugetlbfs: Add an inline helper for finding hstate index
  2012-04-16 10:44 [PATCH -V6 00/14] memcg: Add memcg extension to control HugeTLB allocation Aneesh Kumar K.V
  2012-04-16 10:44 ` [PATCH -V6 01/14] hugetlb: rename max_hstate to hugetlb_max_hstate Aneesh Kumar K.V
  2012-04-16 10:44 ` [PATCH -V6 02/14] hugetlbfs: don't use ERR_PTR with VM_FAULT* values Aneesh Kumar K.V
@ 2012-04-16 10:44 ` Aneesh Kumar K.V
  2012-05-24 21:22   ` David Rientjes
  2012-04-16 10:44 ` [PATCH -V6 04/14] hugetlb: Use mmu_gather instead of a temporary linked list for accumulating pages Aneesh Kumar K.V
                   ` (10 subsequent siblings)
  13 siblings, 1 reply; 44+ messages in thread
From: Aneesh Kumar K.V @ 2012-04-16 10:44 UTC (permalink / raw)
  To: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
	akpm, hannes
  Cc: linux-kernel, cgroups, Aneesh Kumar K.V

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

Add an inline helper and use it in the code.

Acked-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 include/linux/hugetlb.h |    6 ++++++
 mm/hugetlb.c            |   18 ++++++++++--------
 2 files changed, 16 insertions(+), 8 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 000837e..876457e 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -294,6 +294,11 @@ static inline unsigned hstate_index_to_shift(unsigned index)
 	return hstates[index].order + PAGE_SHIFT;
 }
 
+static inline int hstate_index(struct hstate *h)
+{
+	return h - hstates;
+}
+
 #else
 struct hstate {};
 #define alloc_huge_page_node(h, nid) NULL
@@ -312,6 +317,7 @@ static inline unsigned int pages_per_huge_page(struct hstate *h)
 	return 1;
 }
 #define hstate_index_to_shift(index) 0
+#define hstate_index(h) 0
 #endif
 
 #endif /* _LINUX_HUGETLB_H */
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 5a472a5..d94c987 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1646,7 +1646,7 @@ static int hugetlb_sysfs_add_hstate(struct hstate *h, struct kobject *parent,
 				    struct attribute_group *hstate_attr_group)
 {
 	int retval;
-	int hi = h - hstates;
+	int hi = hstate_index(h);
 
 	hstate_kobjs[hi] = kobject_create_and_add(h->name, parent);
 	if (!hstate_kobjs[hi])
@@ -1741,11 +1741,13 @@ void hugetlb_unregister_node(struct node *node)
 	if (!nhs->hugepages_kobj)
 		return;		/* no hstate attributes */
 
-	for_each_hstate(h)
-		if (nhs->hstate_kobjs[h - hstates]) {
-			kobject_put(nhs->hstate_kobjs[h - hstates]);
-			nhs->hstate_kobjs[h - hstates] = NULL;
+	for_each_hstate(h) {
+		int idx = hstate_index(h);
+		if (nhs->hstate_kobjs[idx]) {
+			kobject_put(nhs->hstate_kobjs[idx]);
+			nhs->hstate_kobjs[idx] = NULL;
 		}
+	}
 
 	kobject_put(nhs->hugepages_kobj);
 	nhs->hugepages_kobj = NULL;
@@ -1848,7 +1850,7 @@ static void __exit hugetlb_exit(void)
 	hugetlb_unregister_all_nodes();
 
 	for_each_hstate(h) {
-		kobject_put(hstate_kobjs[h - hstates]);
+		kobject_put(hstate_kobjs[hstate_index(h)]);
 	}
 
 	kobject_put(hugepages_kobj);
@@ -2678,7 +2680,7 @@ retry:
 		 */
 		if (unlikely(PageHWPoison(page))) {
 			ret = VM_FAULT_HWPOISON |
-			      VM_FAULT_SET_HINDEX(h - hstates);
+				VM_FAULT_SET_HINDEX(hstate_index(h));
 			goto backout_unlocked;
 		}
 	}
@@ -2751,7 +2753,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 			return 0;
 		} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
 			return VM_FAULT_HWPOISON_LARGE |
-			       VM_FAULT_SET_HINDEX(h - hstates);
+				VM_FAULT_SET_HINDEX(hstate_index(h));
 	}
 
 	ptep = huge_pte_alloc(mm, address, huge_page_size(h));
-- 
1.7.10


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH -V6 04/14] hugetlb: Use mmu_gather instead of a temporary linked list for accumulating pages
  2012-04-16 10:44 [PATCH -V6 00/14] memcg: Add memcg extension to control HugeTLB allocation Aneesh Kumar K.V
                   ` (2 preceding siblings ...)
  2012-04-16 10:44 ` [PATCH -V6 03/14] hugetlbfs: Add an inline helper for finding hstate index Aneesh Kumar K.V
@ 2012-04-16 10:44 ` Aneesh Kumar K.V
  2012-04-23 23:44   ` Andrew Morton
  2012-04-16 10:44 ` [PATCH -V6 05/14] hugetlb: Avoid taking i_mmap_mutex in unmap_single_vma for hugetlb Aneesh Kumar K.V
                   ` (9 subsequent siblings)
  13 siblings, 1 reply; 44+ messages in thread
From: Aneesh Kumar K.V @ 2012-04-16 10:44 UTC (permalink / raw)
  To: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
	akpm, hannes
  Cc: linux-kernel, cgroups, Aneesh Kumar K.V

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

Use mmu_gather instead of a temporary linked list for accumulating
pages when we unmap a hugepage range. This also allows us to get rid
of i_mmap_mutex in unmap_hugepage_range in the following patch.

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 fs/hugetlbfs/inode.c    |    4 ++--
 include/linux/hugetlb.h |    6 ++---
 mm/hugetlb.c            |   59 ++++++++++++++++++++++++++++-------------------
 mm/memory.c             |    7 ++++--
 4 files changed, 45 insertions(+), 31 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index ea25174..92f75aa 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -416,8 +416,8 @@ hugetlb_vmtruncate_list(struct prio_tree_root *root, pgoff_t pgoff)
 		else
 			v_offset = 0;
 
-		__unmap_hugepage_range(vma,
-				vma->vm_start + v_offset, vma->vm_end, NULL);
+		unmap_hugepage_range(vma, vma->vm_start + v_offset,
+				     vma->vm_end, NULL);
 	}
 }
 
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 876457e..46c6cbd 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -40,9 +40,9 @@ int follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
 			struct page **, struct vm_area_struct **,
 			unsigned long *, int *, int, unsigned int flags);
 void unmap_hugepage_range(struct vm_area_struct *,
-			unsigned long, unsigned long, struct page *);
-void __unmap_hugepage_range(struct vm_area_struct *,
-			unsigned long, unsigned long, struct page *);
+			  unsigned long, unsigned long, struct page *);
+void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *,
+			    unsigned long, unsigned long, struct page *);
 int hugetlb_prefault(struct address_space *, struct vm_area_struct *);
 void hugetlb_report_meminfo(struct seq_file *);
 int hugetlb_report_node_meminfo(int, char *);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d94c987..a3ac624 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -24,8 +24,9 @@
 
 #include <asm/page.h>
 #include <asm/pgtable.h>
-#include <linux/io.h>
+#include <asm/tlb.h>
 
+#include <linux/io.h>
 #include <linux/hugetlb.h>
 #include <linux/node.h>
 #include "internal.h"
@@ -2300,30 +2301,26 @@ static int is_hugetlb_entry_hwpoisoned(pte_t pte)
 		return 0;
 }
 
-void __unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
-			    unsigned long end, struct page *ref_page)
+void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
+			    unsigned long start, unsigned long end,
+			    struct page *ref_page)
 {
+	int force_flush = 0;
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long address;
 	pte_t *ptep;
 	pte_t pte;
 	struct page *page;
-	struct page *tmp;
 	struct hstate *h = hstate_vma(vma);
 	unsigned long sz = huge_page_size(h);
 
-	/*
-	 * A page gathering list, protected by per file i_mmap_mutex. The
-	 * lock is used to avoid list corruption from multiple unmapping
-	 * of the same page since we are using page->lru.
-	 */
-	LIST_HEAD(page_list);
-
 	WARN_ON(!is_vm_hugetlb_page(vma));
 	BUG_ON(start & ~huge_page_mask(h));
 	BUG_ON(end & ~huge_page_mask(h));
 
+	tlb_start_vma(tlb, vma);
 	mmu_notifier_invalidate_range_start(mm, start, end);
+again:
 	spin_lock(&mm->page_table_lock);
 	for (address = start; address < end; address += sz) {
 		ptep = huge_pte_offset(mm, address);
@@ -2362,30 +2359,45 @@ void __unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
 		}
 
 		pte = huge_ptep_get_and_clear(mm, address, ptep);
+		tlb_remove_tlb_entry(tlb, ptep, address);
 		if (pte_dirty(pte))
 			set_page_dirty(page);
-		list_add(&page->lru, &page_list);
 
+		page_remove_rmap(page);
+		force_flush = !__tlb_remove_page(tlb, page);
+		if (force_flush)
+			break;
 		/* Bail out after unmapping reference page if supplied */
 		if (ref_page)
 			break;
 	}
-	flush_tlb_range(vma, start, end);
 	spin_unlock(&mm->page_table_lock);
-	mmu_notifier_invalidate_range_end(mm, start, end);
-	list_for_each_entry_safe(page, tmp, &page_list, lru) {
-		page_remove_rmap(page);
-		list_del(&page->lru);
-		put_page(page);
+	/*
+	 * mmu_gather ran out of room to batch pages, we break out of
+	 * the PTE lock to avoid doing the potential expensive TLB invalidate
+	 * and page-free while holding it.
+	 */
+	if (force_flush) {
+		force_flush = 0;
+		tlb_flush_mmu(tlb);
+		if (address < end && !ref_page)
+			goto again;
 	}
+	mmu_notifier_invalidate_range_end(mm, start, end);
+	tlb_end_vma(tlb, vma);
 }
 
 void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
 			  unsigned long end, struct page *ref_page)
 {
-	mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
-	__unmap_hugepage_range(vma, start, end, ref_page);
-	mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
+	struct mm_struct *mm;
+	struct mmu_gather tlb;
+
+	mm = vma->vm_mm;
+
+	tlb_gather_mmu(&tlb, mm, 0);
+	__unmap_hugepage_range(&tlb, vma, start, end, ref_page);
+	tlb_finish_mmu(&tlb, start, end);
 }
 
 /*
@@ -2430,9 +2442,8 @@ static int unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * from the time of fork. This would look like data corruption
 		 */
 		if (!is_vma_resv_set(iter_vma, HPAGE_RESV_OWNER))
-			__unmap_hugepage_range(iter_vma,
-				address, address + huge_page_size(h),
-				page);
+			unmap_hugepage_range(iter_vma, address,
+					     address + huge_page_size(h), page);
 	}
 	mutex_unlock(&mapping->i_mmap_mutex);
 
diff --git a/mm/memory.c b/mm/memory.c
index 6105f47..4b11961 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1326,8 +1326,11 @@ static void unmap_single_vma(struct mmu_gather *tlb,
 			 * Since no pte has actually been setup, it is
 			 * safe to do nothing in this case.
 			 */
-			if (vma->vm_file)
-				unmap_hugepage_range(vma, start, end, NULL);
+			if (vma->vm_file) {
+				mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
+				__unmap_hugepage_range(tlb, vma, start, end, NULL);
+				mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
+			}
 		} else
 			unmap_page_range(tlb, vma, start, end, details);
 	}
-- 
1.7.10


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH -V6 05/14] hugetlb: Avoid taking i_mmap_mutex in unmap_single_vma for hugetlb
  2012-04-16 10:44 [PATCH -V6 00/14] memcg: Add memcg extension to control HugeTLB allocation Aneesh Kumar K.V
                   ` (3 preceding siblings ...)
  2012-04-16 10:44 ` [PATCH -V6 04/14] hugetlb: Use mmu_gather instead of a temporary linked list for accumulating pages Aneesh Kumar K.V
@ 2012-04-16 10:44 ` Aneesh Kumar K.V
  2012-04-16 10:44 ` [PATCH -V6 06/14] hugetlb: Simplify migrate_huge_page Aneesh Kumar K.V
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 44+ messages in thread
From: Aneesh Kumar K.V @ 2012-04-16 10:44 UTC (permalink / raw)
  To: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
	akpm, hannes
  Cc: linux-kernel, cgroups, Aneesh Kumar K.V

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

The i_mmap_mutex lock was added to unmap_single_vma by commit
502717f4e112b18d9c37753a32f675bec9f2838b, but we no longer use
page->lru in unmap_hugepage_range. Also, the lock is taken higher up
in the stack in some code paths, which would result in a deadlock:

unmap_mapping_range (i_mmap_mutex)
 -> unmap_mapping_range_tree
    -> unmap_mapping_range_vma
       -> zap_page_range_single
         -> unmap_single_vma
	      -> unmap_hugepage_range (i_mmap_mutex)

For shared pagetable support for huge pages, since pagetable pages are
refcounted, we don't need any lock during huge_pmd_unshare. We do take
i_mmap_mutex in huge_pmd_share while walking the vma_prio_tree in the
mapping (commit 39dde65c9940c97fcd178a3d2b1c57ed8b7b68aa).

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 mm/memory.c |    5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 4b11961..d642b3e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1326,11 +1326,8 @@ static void unmap_single_vma(struct mmu_gather *tlb,
 			 * Since no pte has actually been setup, it is
 			 * safe to do nothing in this case.
 			 */
-			if (vma->vm_file) {
-				mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
+			if (vma->vm_file)
 				__unmap_hugepage_range(tlb, vma, start, end, NULL);
-				mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
-			}
 		} else
 			unmap_page_range(tlb, vma, start, end, details);
 	}
-- 
1.7.10


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH -V6 06/14] hugetlb: Simplify migrate_huge_page
  2012-04-16 10:44 [PATCH -V6 00/14] memcg: Add memcg extension to control HugeTLB allocation Aneesh Kumar K.V
                   ` (4 preceding siblings ...)
  2012-04-16 10:44 ` [PATCH -V6 05/14] hugetlb: Avoid taking i_mmap_mutex in unmap_single_vma for hugetlb Aneesh Kumar K.V
@ 2012-04-16 10:44 ` Aneesh Kumar K.V
  2012-05-24 21:35   ` David Rientjes
  2012-04-16 10:44 ` [PATCH -V6 07/14] memcg: Add HugeTLB extension Aneesh Kumar K.V
                   ` (7 subsequent siblings)
  13 siblings, 1 reply; 44+ messages in thread
From: Aneesh Kumar K.V @ 2012-04-16 10:44 UTC (permalink / raw)
  To: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
	akpm, hannes
  Cc: linux-kernel, cgroups, Aneesh Kumar K.V

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

Since we migrate only one hugepage, don't use a linked list for
passing the page around. Instead, directly pass the page that needs to
be migrated as an argument. This also removes the usage of page->lru
in the migrate path.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 include/linux/migrate.h |    4 +--
 mm/memory-failure.c     |   13 ++--------
 mm/migrate.c            |   65 +++++++++++++++--------------------------------
 3 files changed, 25 insertions(+), 57 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 855c337..ce7e667 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -15,7 +15,7 @@ extern int migrate_page(struct address_space *,
 extern int migrate_pages(struct list_head *l, new_page_t x,
 			unsigned long private, bool offlining,
 			enum migrate_mode mode);
-extern int migrate_huge_pages(struct list_head *l, new_page_t x,
+extern int migrate_huge_page(struct page *, new_page_t x,
 			unsigned long private, bool offlining,
 			enum migrate_mode mode);
 
@@ -36,7 +36,7 @@ static inline void putback_lru_pages(struct list_head *l) {}
 static inline int migrate_pages(struct list_head *l, new_page_t x,
 		unsigned long private, bool offlining,
 		enum migrate_mode mode) { return -ENOSYS; }
-static inline int migrate_huge_pages(struct list_head *l, new_page_t x,
+static inline int migrate_huge_page(struct page *page, new_page_t x,
 		unsigned long private, bool offlining,
 		enum migrate_mode mode) { return -ENOSYS; }
 
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 97cc273..1f092db 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1414,7 +1414,6 @@ static int soft_offline_huge_page(struct page *page, int flags)
 	int ret;
 	unsigned long pfn = page_to_pfn(page);
 	struct page *hpage = compound_head(page);
-	LIST_HEAD(pagelist);
 
 	ret = get_any_page(page, pfn, flags);
 	if (ret < 0)
@@ -1429,19 +1428,11 @@ static int soft_offline_huge_page(struct page *page, int flags)
 	}
 
 	/* Keep page count to indicate a given hugepage is isolated. */
-
-	list_add(&hpage->lru, &pagelist);
-	ret = migrate_huge_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL, 0,
-				true);
+	ret = migrate_huge_page(page, new_page, MPOL_MF_MOVE_ALL, 0, true);
+	put_page(page);
 	if (ret) {
-		struct page *page1, *page2;
-		list_for_each_entry_safe(page1, page2, &pagelist, lru)
-			put_page(page1);
-
 		pr_info("soft offline: %#lx: migration failed %d, type %lx\n",
 			pfn, ret, page->flags);
-		if (ret > 0)
-			ret = -EIO;
 		return ret;
 	}
 done:
diff --git a/mm/migrate.c b/mm/migrate.c
index 51c08a0..d7eb82d 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -929,15 +929,8 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
 	if (anon_vma)
 		put_anon_vma(anon_vma);
 	unlock_page(hpage);
-
 out:
-	if (rc != -EAGAIN) {
-		list_del(&hpage->lru);
-		put_page(hpage);
-	}
-
 	put_page(new_hpage);
-
 	if (result) {
 		if (rc)
 			*result = rc;
@@ -1013,48 +1006,32 @@ out:
 	return nr_failed + retry;
 }
 
-int migrate_huge_pages(struct list_head *from,
-		new_page_t get_new_page, unsigned long private, bool offlining,
-		enum migrate_mode mode)
+int migrate_huge_page(struct page *hpage, new_page_t get_new_page,
+		      unsigned long private, bool offlining,
+		      enum migrate_mode mode)
 {
-	int retry = 1;
-	int nr_failed = 0;
-	int pass = 0;
-	struct page *page;
-	struct page *page2;
-	int rc;
-
-	for (pass = 0; pass < 10 && retry; pass++) {
-		retry = 0;
-
-		list_for_each_entry_safe(page, page2, from, lru) {
+	int pass, rc;
+
+	for (pass = 0; pass < 10; pass++) {
+		rc = unmap_and_move_huge_page(get_new_page,
+					      private, hpage, pass > 2, offlining,
+					      mode);
+		switch (rc) {
+		case -ENOMEM:
+			goto out;
+		case -EAGAIN:
+			/* try again */
 			cond_resched();
-
-			rc = unmap_and_move_huge_page(get_new_page,
-					private, page, pass > 2, offlining,
-					mode);
-
-			switch(rc) {
-			case -ENOMEM:
-				goto out;
-			case -EAGAIN:
-				retry++;
-				break;
-			case 0:
-				break;
-			default:
-				/* Permanent failure */
-				nr_failed++;
-				break;
-			}
+			break;
+		case 0:
+			goto out;
+		default:
+			rc = -EIO;
+			goto out;
 		}
 	}
-	rc = 0;
 out:
-	if (rc)
-		return rc;
-
-	return nr_failed + retry;
+	return rc;
 }
 
 #ifdef CONFIG_NUMA
-- 
1.7.10


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH -V6 07/14] memcg: Add HugeTLB extension
  2012-04-16 10:44 [PATCH -V6 00/14] memcg: Add memcg extension to control HugeTLB allocation Aneesh Kumar K.V
                   ` (5 preceding siblings ...)
  2012-04-16 10:44 ` [PATCH -V6 06/14] hugetlb: Simplify migrate_huge_page Aneesh Kumar K.V
@ 2012-04-16 10:44 ` Aneesh Kumar K.V
  2012-05-02  0:20   ` Paul Gortmaker
  2012-05-24 21:52   ` David Rientjes
  2012-04-16 10:44 ` [PATCH -V6 08/14] hugetlb: add charge/uncharge calls for HugeTLB alloc/free Aneesh Kumar K.V
                   ` (6 subsequent siblings)
  13 siblings, 2 replies; 44+ messages in thread
From: Aneesh Kumar K.V @ 2012-04-16 10:44 UTC (permalink / raw)
  To: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
	akpm, hannes
  Cc: linux-kernel, cgroups, Aneesh Kumar K.V

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

This patch implements a memcg extension that allows us to control
HugeTLB allocations via the memory controller. The extension allows
limiting the HugeTLB usage per control group and enforces the
controller limit during page fault. Since HugeTLB doesn't support page
reclaim, enforcing the limit at page fault time implies that the
application will get a SIGBUS signal if it tries to access HugeTLB
pages beyond its limit. This requires the application to know
beforehand how many HugeTLB pages it will require for its use.

The charge/uncharge calls will be added to the HugeTLB code in a later
patch. Support for memcg removal will be added in later patches.

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 include/linux/hugetlb.h    |    1 +
 include/linux/memcontrol.h |   42 ++++++++++++++
 init/Kconfig               |    8 +++
 mm/hugetlb.c               |    2 +-
 mm/memcontrol.c            |  132 ++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 184 insertions(+), 1 deletion(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 46c6cbd..995c238 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -226,6 +226,7 @@ struct hstate *size_to_hstate(unsigned long size);
 #define HUGE_MAX_HSTATE 1
 #endif
 
+extern int hugetlb_max_hstate;
 extern struct hstate hstates[HUGE_MAX_HSTATE];
 extern unsigned int default_hstate_idx;
 
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index f94efd2..1d07e14 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -448,5 +448,47 @@ static inline void sock_release_memcg(struct sock *sk)
 {
 }
 #endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
+
+#ifdef CONFIG_MEM_RES_CTLR_HUGETLB
+extern int mem_cgroup_hugetlb_charge_page(int idx, unsigned long nr_pages,
+					  struct mem_cgroup **ptr);
+extern void mem_cgroup_hugetlb_commit_charge(int idx, unsigned long nr_pages,
+					     struct mem_cgroup *memcg,
+					     struct page *page);
+extern void mem_cgroup_hugetlb_uncharge_page(int idx, unsigned long nr_pages,
+					     struct page *page);
+extern void mem_cgroup_hugetlb_uncharge_memcg(int idx, unsigned long nr_pages,
+					      struct mem_cgroup *memcg);
+
+#else
+static inline int
+mem_cgroup_hugetlb_charge_page(int idx, unsigned long nr_pages,
+						 struct mem_cgroup **ptr)
+{
+	return 0;
+}
+
+static inline void
+mem_cgroup_hugetlb_commit_charge(int idx, unsigned long nr_pages,
+				 struct mem_cgroup *memcg,
+				 struct page *page)
+{
+	return;
+}
+
+static inline void
+mem_cgroup_hugetlb_uncharge_page(int idx, unsigned long nr_pages,
+				 struct page *page)
+{
+	return;
+}
+
+static inline void
+mem_cgroup_hugetlb_uncharge_memcg(int idx, unsigned long nr_pages,
+				  struct mem_cgroup *memcg)
+{
+	return;
+}
+#endif  /* CONFIG_MEM_RES_CTLR_HUGETLB */
 #endif /* _LINUX_MEMCONTROL_H */
 
diff --git a/init/Kconfig b/init/Kconfig
index 72f33fa..a3b5665 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -716,6 +716,14 @@ config CGROUP_PERF
 
 	  Say N if unsure.
 
+config MEM_RES_CTLR_HUGETLB
+	bool "Memory Resource Controller HugeTLB Extension (EXPERIMENTAL)"
+	depends on CGROUP_MEM_RES_CTLR && HUGETLB_PAGE && EXPERIMENTAL
+	default n
+	help
+	  Add HugeTLB management to memory resource controller. When you
+	  enable this, you can put a per cgroup limit on HugeTLB usage.
+
 menuconfig CGROUP_SCHED
 	bool "Group CPU scheduler"
 	default n
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a3ac624..8cd89b4 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -35,7 +35,7 @@ const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL;
 static gfp_t htlb_alloc_mask = GFP_HIGHUSER;
 unsigned long hugepages_treat_as_movable;
 
-static int hugetlb_max_hstate;
+int hugetlb_max_hstate;
 unsigned int default_hstate_idx;
 struct hstate hstates[HUGE_MAX_HSTATE];
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 901bb03..884f479 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -252,6 +252,10 @@ struct mem_cgroup {
 	};
 
 	/*
+	 * Counters to account for HugeTLB pages, one per hstate.
+	 */
+	struct res_counter hugepage[HUGE_MAX_HSTATE];
+	/*
 	 * Per cgroup active and inactive list, similar to the
 	 * per zone LRU lists.
 	 */
@@ -3213,6 +3217,114 @@ static inline int mem_cgroup_move_swap_account(swp_entry_t entry,
 }
 #endif
 
+#ifdef CONFIG_MEM_RES_CTLR_HUGETLB
+static bool mem_cgroup_have_hugetlb_usage(struct mem_cgroup *memcg)
+{
+	int idx;
+	for (idx = 0; idx < hugetlb_max_hstate; idx++) {
+		if ((res_counter_read_u64(&memcg->hugepage[idx], RES_USAGE)) > 0)
+			return 1;
+	}
+	return 0;
+}
+
+int mem_cgroup_hugetlb_charge_page(int idx, unsigned long nr_pages,
+				   struct mem_cgroup **ptr)
+{
+	int ret = 0;
+	struct mem_cgroup *memcg = NULL;
+	struct res_counter *fail_res;
+	unsigned long csize = nr_pages * PAGE_SIZE;
+
+	if (mem_cgroup_disabled())
+		goto done;
+again:
+	rcu_read_lock();
+	memcg = mem_cgroup_from_task(current);
+	if (!memcg)
+		memcg = root_mem_cgroup;
+
+	if (!css_tryget(&memcg->css)) {
+		rcu_read_unlock();
+		goto again;
+	}
+	rcu_read_unlock();
+
+	ret = res_counter_charge(&memcg->hugepage[idx], csize, &fail_res);
+	css_put(&memcg->css);
+done:
+	*ptr = memcg;
+	return ret;
+}
+
+void mem_cgroup_hugetlb_commit_charge(int idx, unsigned long nr_pages,
+				      struct mem_cgroup *memcg,
+				      struct page *page)
+{
+	struct page_cgroup *pc;
+
+	if (mem_cgroup_disabled())
+		return;
+
+	pc = lookup_page_cgroup(page);
+	lock_page_cgroup(pc);
+	if (unlikely(PageCgroupUsed(pc))) {
+		unlock_page_cgroup(pc);
+		mem_cgroup_hugetlb_uncharge_memcg(idx, nr_pages, memcg);
+		return;
+	}
+	pc->mem_cgroup = memcg;
+	SetPageCgroupUsed(pc);
+	unlock_page_cgroup(pc);
+	return;
+}
+
+void mem_cgroup_hugetlb_uncharge_page(int idx, unsigned long nr_pages,
+				      struct page *page)
+{
+	struct page_cgroup *pc;
+	struct mem_cgroup *memcg;
+	unsigned long csize = nr_pages * PAGE_SIZE;
+
+	if (mem_cgroup_disabled())
+		return;
+
+	pc = lookup_page_cgroup(page);
+	if (unlikely(!PageCgroupUsed(pc)))
+		return;
+
+	lock_page_cgroup(pc);
+	if (!PageCgroupUsed(pc)) {
+		unlock_page_cgroup(pc);
+		return;
+	}
+	memcg = pc->mem_cgroup;
+	pc->mem_cgroup = root_mem_cgroup;
+	ClearPageCgroupUsed(pc);
+	unlock_page_cgroup(pc);
+
+	res_counter_uncharge(&memcg->hugepage[idx], csize);
+	return;
+}
+
+void mem_cgroup_hugetlb_uncharge_memcg(int idx, unsigned long nr_pages,
+				       struct mem_cgroup *memcg)
+{
+	unsigned long csize = nr_pages * PAGE_SIZE;
+
+	if (mem_cgroup_disabled())
+		return;
+
+	res_counter_uncharge(&memcg->hugepage[idx], csize);
+	return;
+}
+#else
+static bool mem_cgroup_have_hugetlb_usage(struct mem_cgroup *memcg)
+{
+	return 0;
+}
+#endif /* CONFIG_MEM_RES_CTLR_HUGETLB */
+
 /*
  * Before starting migration, account PAGE_SIZE to mem_cgroup that the old
  * page belongs to.
@@ -4955,6 +5067,7 @@ err_cleanup:
 static struct cgroup_subsys_state * __ref
 mem_cgroup_create(struct cgroup *cont)
 {
+	int idx;
 	struct mem_cgroup *memcg, *parent;
 	long error = -ENOMEM;
 	int node;
@@ -4997,9 +5110,22 @@ mem_cgroup_create(struct cgroup *cont)
 		 * mem_cgroup(see mem_cgroup_put).
 		 */
 		mem_cgroup_get(parent);
+		/*
+		 * We could get called before hugetlb init is called.
+		 * Use HUGE_MAX_HSTATE as the max index.
+		 */
+		for (idx = 0; idx < HUGE_MAX_HSTATE; idx++)
+			res_counter_init(&memcg->hugepage[idx],
+					 &parent->hugepage[idx]);
 	} else {
 		res_counter_init(&memcg->res, NULL);
 		res_counter_init(&memcg->memsw, NULL);
+		/*
+		 * We could get called before hugetlb init is called.
+		 * Use HUGE_MAX_HSTATE as the max index.
+		 */
+		for (idx = 0; idx < HUGE_MAX_HSTATE; idx++)
+			res_counter_init(&memcg->hugepage[idx], NULL);
 	}
 	memcg->last_scanned_node = MAX_NUMNODES;
 	INIT_LIST_HEAD(&memcg->oom_notify);
@@ -5030,6 +5156,12 @@ free_out:
 static int mem_cgroup_pre_destroy(struct cgroup *cont)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
+	/*
+	 * Don't allow memcg removal if we have HugeTLB resource
+	 * usage.
+	 */
+	if (mem_cgroup_have_hugetlb_usage(memcg))
+		return -EBUSY;
 
 	return mem_cgroup_force_empty(memcg, false);
 }
-- 
1.7.10


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH -V6 08/14] hugetlb: add charge/uncharge calls for HugeTLB alloc/free
  2012-04-16 10:44 [PATCH -V6 00/14] memcg: Add memcg extension to control HugeTLB allocation Aneesh Kumar K.V
                   ` (6 preceding siblings ...)
  2012-04-16 10:44 ` [PATCH -V6 07/14] memcg: Add HugeTLB extension Aneesh Kumar K.V
@ 2012-04-16 10:44 ` Aneesh Kumar K.V
  2012-04-16 10:44 ` [PATCH -V6 09/14] memcg: track resource index in cftype private Aneesh Kumar K.V
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 44+ messages in thread
From: Aneesh Kumar K.V @ 2012-04-16 10:44 UTC (permalink / raw)
  To: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
	akpm, hannes
  Cc: linux-kernel, cgroups, Aneesh Kumar K.V

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

This adds the necessary charge/uncharge calls to the HugeTLB code. We
do the memcg charge in the page allocation path and the uncharge in the
compound page destructor. We also need to ignore HugeTLB pages in
__mem_cgroup_uncharge_common, because it gets called from
delete_from_page_cache.

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 mm/hugetlb.c    |   20 +++++++++++++++++++-
 mm/memcontrol.c |    5 +++++
 2 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 8cd89b4..dd00087 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -21,6 +21,8 @@
 #include <linux/rmap.h>
 #include <linux/swap.h>
 #include <linux/swapops.h>
+#include <linux/memcontrol.h>
+#include <linux/page_cgroup.h>
 
 #include <asm/page.h>
 #include <asm/pgtable.h>
@@ -628,6 +630,8 @@ static void free_huge_page(struct page *page)
 	BUG_ON(page_mapcount(page));
 	INIT_LIST_HEAD(&page->lru);
 
+	mem_cgroup_hugetlb_uncharge_page(hstate_index(h),
+					 pages_per_huge_page(h), page);
 	spin_lock(&hugetlb_lock);
 	if (h->surplus_huge_pages_node[nid] && huge_page_order(h) < MAX_ORDER) {
 		update_and_free_page(h, page);
@@ -1113,7 +1117,10 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
 	struct hstate *h = hstate_vma(vma);
 	struct page *page;
 	long chg;
+	int ret, idx;
+	struct mem_cgroup *memcg;
 
+	idx = hstate_index(h);
 	/*
 	 * Processes that did not create the mapping will have no
 	 * reserves and will not have accounted against subpool
@@ -1129,6 +1136,12 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
 		if (hugepage_subpool_get_pages(spool, chg))
 			return ERR_PTR(-ENOSPC);
 
+	ret = mem_cgroup_hugetlb_charge_page(idx, pages_per_huge_page(h),
+					     &memcg);
+	if (ret) {
+		hugepage_subpool_put_pages(spool, chg);
+		return ERR_PTR(-ENOSPC);
+	}
 	spin_lock(&hugetlb_lock);
 	page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve);
 	spin_unlock(&hugetlb_lock);
@@ -1136,6 +1149,9 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
 	if (!page) {
 		page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
 		if (!page) {
+			mem_cgroup_hugetlb_uncharge_memcg(idx,
+							  pages_per_huge_page(h),
+							  memcg);
 			hugepage_subpool_put_pages(spool, chg);
 			return ERR_PTR(-ENOSPC);
 		}
@@ -1144,7 +1160,9 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
 	set_page_private(page, (unsigned long)spool);
 
 	vma_commit_reservation(h, vma, addr);
-
+	/* update page cgroup details */
+	mem_cgroup_hugetlb_commit_charge(idx, pages_per_huge_page(h),
+					 memcg, page);
 	return page;
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 884f479..e906b41 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2966,6 +2966,11 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
 
 	if (PageSwapCache(page))
 		return NULL;
+	/*
+	 * HugeTLB page uncharge happens in the HugeTLB compound page destructor
+	 */
+	if (PageHuge(page))
+		return NULL;
 
 	if (PageTransHuge(page)) {
 		nr_pages <<= compound_order(page);
-- 
1.7.10


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH -V6 09/14] memcg: track resource index in cftype private
  2012-04-16 10:44 [PATCH -V6 00/14] memcg: Add memcg extension to control HugeTLB allocation Aneesh Kumar K.V
                   ` (7 preceding siblings ...)
  2012-04-16 10:44 ` [PATCH -V6 08/14] hugetlb: add charge/uncharge calls for HugeTLB alloc/free Aneesh Kumar K.V
@ 2012-04-16 10:44 ` Aneesh Kumar K.V
  2012-04-16 10:44 ` [PATCH -V6 10/14] hugetlbfs: Add memcg control files for hugetlbfs Aneesh Kumar K.V
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 44+ messages in thread
From: Aneesh Kumar K.V @ 2012-04-16 10:44 UTC (permalink / raw)
  To: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
	akpm, hannes
  Cc: linux-kernel, cgroups, Aneesh Kumar K.V

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

This patch adds a new charge type, _MEMHUGETLB, for tracking HugeTLB
resources. We also use the cftype private field to encode the HugeTLB
resource index, which lets the same memcg callbacks serve the HugeTLB
control files.

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 mm/memcontrol.c |   27 +++++++++++++++++++++------
 1 file changed, 21 insertions(+), 6 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e906b41..0f9ec34 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -381,9 +381,14 @@ enum charge_type {
 #define _MEM			(0)
 #define _MEMSWAP		(1)
 #define _OOM_TYPE		(2)
-#define MEMFILE_PRIVATE(x, val)	(((x) << 16) | (val))
-#define MEMFILE_TYPE(val)	(((val) >> 16) & 0xffff)
-#define MEMFILE_ATTR(val)	((val) & 0xffff)
+#define _MEMHUGETLB		(3)
+
+/* bits 0..15: val (attr), bits 16..23: type, bits 24..31: hstate idx */
+#define __MEMFILE_PRIVATE(idx, x, val)	(((idx) << 24) | ((x) << 16) | (val))
+#define MEMFILE_PRIVATE(x, val)		__MEMFILE_PRIVATE(0, x, val)
+#define MEMFILE_TYPE(val)		(((val) >> 16) & 0xff)
+#define MEMFILE_IDX(val)		(((val) >> 24) & 0xff)
+#define MEMFILE_ATTR(val)		((val) & 0xffff)
 /* Used for OOM nofiier */
 #define OOM_CONTROL		(0)
 
@@ -4003,7 +4008,7 @@ static ssize_t mem_cgroup_read(struct cgroup *cont, struct cftype *cft,
 	struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
 	char str[64];
 	u64 val;
-	int type, name, len;
+	int type, name, len, idx;
 
 	type = MEMFILE_TYPE(cft->private);
 	name = MEMFILE_ATTR(cft->private);
@@ -4024,6 +4029,10 @@ static ssize_t mem_cgroup_read(struct cgroup *cont, struct cftype *cft,
 		else
 			val = res_counter_read_u64(&memcg->memsw, name);
 		break;
+	case _MEMHUGETLB:
+		idx = MEMFILE_IDX(cft->private);
+		val = res_counter_read_u64(&memcg->hugepage[idx], name);
+		break;
 	default:
 		BUG();
 	}
@@ -4061,7 +4070,10 @@ static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft,
 			break;
 		if (type == _MEM)
 			ret = mem_cgroup_resize_limit(memcg, val);
-		else
+		else if (type == _MEMHUGETLB) {
+			int idx = MEMFILE_IDX(cft->private);
+			ret = res_counter_set_limit(&memcg->hugepage[idx], val);
+		} else
 			ret = mem_cgroup_resize_memsw_limit(memcg, val);
 		break;
 	case RES_SOFT_LIMIT:
@@ -4127,7 +4139,10 @@ static int mem_cgroup_reset(struct cgroup *cont, unsigned int event)
 	case RES_MAX_USAGE:
 		if (type == _MEM)
 			res_counter_reset_max(&memcg->res);
-		else
+		else if (type == _MEMHUGETLB) {
+			int idx = MEMFILE_IDX(event);
+			res_counter_reset_max(&memcg->hugepage[idx]);
+		} else
 			res_counter_reset_max(&memcg->memsw);
 		break;
 	case RES_FAILCNT:
-- 
1.7.10


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH -V6 10/14] hugetlbfs: Add memcg control files for hugetlbfs
  2012-04-16 10:44 [PATCH -V6 00/14] memcg: Add memcg extension to control HugeTLB allocation Aneesh Kumar K.V
                   ` (8 preceding siblings ...)
  2012-04-16 10:44 ` [PATCH -V6 09/14] memcg: track resource index in cftype private Aneesh Kumar K.V
@ 2012-04-16 10:44 ` Aneesh Kumar K.V
  2012-04-16 23:13   ` Andrew Morton
  2012-04-16 10:44 ` [PATCH -V6 11/14] hugetlbfs: Add a list for tracking in-use HugeTLB pages Aneesh Kumar K.V
                   ` (3 subsequent siblings)
  13 siblings, 1 reply; 44+ messages in thread
From: Aneesh Kumar K.V @ 2012-04-16 10:44 UTC (permalink / raw)
  To: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
	akpm, hannes
  Cc: linux-kernel, cgroups, Aneesh Kumar K.V

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

This adds control files for hugetlbfs in memcg.

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 include/linux/hugetlb.h    |    5 +++++
 include/linux/memcontrol.h |    7 ++++++
 mm/hugetlb.c               |    2 +-
 mm/memcontrol.c            |   51 ++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 64 insertions(+), 1 deletion(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 995c238..d008342 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -4,6 +4,7 @@
 #include <linux/mm_types.h>
 #include <linux/fs.h>
 #include <linux/hugetlb_inline.h>
+#include <linux/cgroup.h>
 
 struct ctl_table;
 struct user_struct;
@@ -203,6 +204,10 @@ struct hstate {
 	unsigned int nr_huge_pages_node[MAX_NUMNODES];
 	unsigned int free_huge_pages_node[MAX_NUMNODES];
 	unsigned int surplus_huge_pages_node[MAX_NUMNODES];
+#ifdef CONFIG_MEM_RES_CTLR_HUGETLB
+	/* mem cgroup control files */
+	struct cftype mem_cgroup_files[4];
+#endif
 	char name[HSTATE_NAME_LEN];
 };
 
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 1d07e14..4f17574 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -459,6 +459,7 @@ extern void mem_cgroup_hugetlb_uncharge_page(int idx, unsigned long nr_pages,
 					     struct page *page);
 extern void mem_cgroup_hugetlb_uncharge_memcg(int idx, unsigned long nr_pages,
 					      struct mem_cgroup *memcg);
+extern int mem_cgroup_hugetlb_file_init(int idx) __init;
 
 #else
 static inline int
@@ -489,6 +490,12 @@ mem_cgroup_hugetlb_uncharge_memcg(int idx, unsigned long nr_pages,
 {
 	return;
 }
+
+static inline int mem_cgroup_hugetlb_file_init(int idx)
+{
+	return 0;
+}
+
 #endif  /* CONFIG_MEM_RES_CTLR_HUGETLB */
 #endif /* _LINUX_MEMCONTROL_H */
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index dd00087..340e575 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1931,7 +1931,7 @@ void __init hugetlb_add_hstate(unsigned order)
 	h->next_nid_to_free = first_node(node_states[N_HIGH_MEMORY]);
 	snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB",
 					huge_page_size(h)/1024);
-
+	mem_cgroup_hugetlb_file_init(hugetlb_max_hstate - 1);
 	parsed_hstate = h;
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0f9ec34..93e077a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5195,6 +5195,57 @@ static void mem_cgroup_destroy(struct cgroup *cont)
 	mem_cgroup_put(memcg);
 }
 
+#ifdef CONFIG_MEM_RES_CTLR_HUGETLB
+static char *mem_fmt(char *buf, unsigned long n)
+{
+	if (n >= (1UL << 30))
+		sprintf(buf, "%luGB", n >> 30);
+	else if (n >= (1UL << 20))
+		sprintf(buf, "%luMB", n >> 20);
+	else
+		sprintf(buf, "%luKB", n >> 10);
+	return buf;
+}
+
+int __init mem_cgroup_hugetlb_file_init(int idx)
+{
+	char buf[32];
+	struct cftype *cft;
+	struct hstate *h = &hstates[idx];
+
+	/* format the size */
+	mem_fmt(buf, huge_page_size(h));
+
+	/* Add the limit file */
+	cft = &h->mem_cgroup_files[0];
+	snprintf(cft->name, MAX_CFTYPE_NAME, "hugetlb.%s.limit_in_bytes", buf);
+	cft->private = __MEMFILE_PRIVATE(idx, _MEMHUGETLB, RES_LIMIT);
+	cft->read = mem_cgroup_read;
+	cft->write_string = mem_cgroup_write;
+
+	/* Add the usage file */
+	cft = &h->mem_cgroup_files[1];
+	snprintf(cft->name, MAX_CFTYPE_NAME, "hugetlb.%s.usage_in_bytes", buf);
+	cft->private  = __MEMFILE_PRIVATE(idx, _MEMHUGETLB, RES_USAGE);
+	cft->read = mem_cgroup_read;
+
+	/* Add the MAX usage file */
+	cft = &h->mem_cgroup_files[2];
+	snprintf(cft->name, MAX_CFTYPE_NAME, "hugetlb.%s.max_usage_in_bytes", buf);
+	cft->private  = __MEMFILE_PRIVATE(idx, _MEMHUGETLB, RES_MAX_USAGE);
+	cft->trigger  = mem_cgroup_reset;
+	cft->read = mem_cgroup_read;
+
+	/* NULL terminate the last cft */
+	cft = &h->mem_cgroup_files[3];
+	memset(cft, 0, sizeof(*cft));
+
+	WARN_ON(cgroup_add_cftypes(&mem_cgroup_subsys, h->mem_cgroup_files));
+
+	return 0;
+}
+#endif
+
 #ifdef CONFIG_MMU
 /* Handlers for move charge at task migration. */
 #define PRECHARGE_COUNT_AT_ONCE	256
-- 
1.7.10


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH -V6 11/14] hugetlbfs: Add a list for tracking in-use HugeTLB pages
  2012-04-16 10:44 [PATCH -V6 00/14] memcg: Add memcg extension to control HugeTLB allocation Aneesh Kumar K.V
                   ` (9 preceding siblings ...)
  2012-04-16 10:44 ` [PATCH -V6 10/14] hugetlbfs: Add memcg control files for hugetlbfs Aneesh Kumar K.V
@ 2012-04-16 10:44 ` Aneesh Kumar K.V
  2012-04-16 10:44 ` [PATCH -V6 12/14] memcg: move HugeTLB resource count to parent cgroup on memcg removal Aneesh Kumar K.V
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 44+ messages in thread
From: Aneesh Kumar K.V @ 2012-04-16 10:44 UTC (permalink / raw)
  To: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
	akpm, hannes
  Cc: linux-kernel, cgroups, Aneesh Kumar K.V

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

hugepage_activelist will be used to track HugeTLB pages that are
currently in use. We need to find the in-use HugeTLB pages to support
memcg removal: on memcg removal we update each page's memory cgroup to
point to the parent cgroup.

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 include/linux/hugetlb.h |    1 +
 mm/hugetlb.c            |   12 +++++++-----
 2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index d008342..6bf6afc 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -200,6 +200,7 @@ struct hstate {
 	unsigned long resv_huge_pages;
 	unsigned long surplus_huge_pages;
 	unsigned long nr_overcommit_huge_pages;
+	struct list_head hugepage_activelist;
 	struct list_head hugepage_freelists[MAX_NUMNODES];
 	unsigned int nr_huge_pages_node[MAX_NUMNODES];
 	unsigned int free_huge_pages_node[MAX_NUMNODES];
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 340e575..8a520b5 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -512,7 +512,7 @@ void copy_huge_page(struct page *dst, struct page *src)
 static void enqueue_huge_page(struct hstate *h, struct page *page)
 {
 	int nid = page_to_nid(page);
-	list_add(&page->lru, &h->hugepage_freelists[nid]);
+	list_move(&page->lru, &h->hugepage_freelists[nid]);
 	h->free_huge_pages++;
 	h->free_huge_pages_node[nid]++;
 }
@@ -524,7 +524,7 @@ static struct page *dequeue_huge_page_node(struct hstate *h, int nid)
 	if (list_empty(&h->hugepage_freelists[nid]))
 		return NULL;
 	page = list_entry(h->hugepage_freelists[nid].next, struct page, lru);
-	list_del(&page->lru);
+	list_move(&page->lru, &h->hugepage_activelist);
 	set_page_refcounted(page);
 	h->free_huge_pages--;
 	h->free_huge_pages_node[nid]--;
@@ -628,12 +628,13 @@ static void free_huge_page(struct page *page)
 	page->mapping = NULL;
 	BUG_ON(page_count(page));
 	BUG_ON(page_mapcount(page));
-	INIT_LIST_HEAD(&page->lru);
 
 	mem_cgroup_hugetlb_uncharge_page(hstate_index(h),
 					 pages_per_huge_page(h), page);
 	spin_lock(&hugetlb_lock);
 	if (h->surplus_huge_pages_node[nid] && huge_page_order(h) < MAX_ORDER) {
+		/* remove the page from active list */
+		list_del(&page->lru);
 		update_and_free_page(h, page);
 		h->surplus_huge_pages--;
 		h->surplus_huge_pages_node[nid]--;
@@ -646,6 +647,7 @@ static void free_huge_page(struct page *page)
 
 static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
 {
+	INIT_LIST_HEAD(&page->lru);
 	set_compound_page_dtor(page, free_huge_page);
 	spin_lock(&hugetlb_lock);
 	h->nr_huge_pages++;
@@ -894,6 +896,7 @@ static struct page *alloc_buddy_huge_page(struct hstate *h, int nid)
 
 	spin_lock(&hugetlb_lock);
 	if (page) {
+		INIT_LIST_HEAD(&page->lru);
 		r_nid = page_to_nid(page);
 		set_compound_page_dtor(page, free_huge_page);
 		/*
@@ -998,7 +1001,6 @@ retry:
 	list_for_each_entry_safe(page, tmp, &surplus_list, lru) {
 		if ((--needed) < 0)
 			break;
-		list_del(&page->lru);
 		/*
 		 * This page is now managed by the hugetlb allocator and has
 		 * no users -- drop the buddy allocator's reference.
@@ -1013,7 +1015,6 @@ free:
 	/* Free unnecessary surplus pages to the buddy allocator */
 	if (!list_empty(&surplus_list)) {
 		list_for_each_entry_safe(page, tmp, &surplus_list, lru) {
-			list_del(&page->lru);
 			put_page(page);
 		}
 	}
@@ -1927,6 +1928,7 @@ void __init hugetlb_add_hstate(unsigned order)
 	h->free_huge_pages = 0;
 	for (i = 0; i < MAX_NUMNODES; ++i)
 		INIT_LIST_HEAD(&h->hugepage_freelists[i]);
+	INIT_LIST_HEAD(&h->hugepage_activelist);
 	h->next_nid_to_alloc = first_node(node_states[N_HIGH_MEMORY]);
 	h->next_nid_to_free = first_node(node_states[N_HIGH_MEMORY]);
 	snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB",
-- 
1.7.10


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH -V6 12/14] memcg: move HugeTLB resource count to parent cgroup on memcg removal
  2012-04-16 10:44 [PATCH -V6 00/14] memcg: Add memcg extension to control HugeTLB allocation Aneesh Kumar K.V
                   ` (10 preceding siblings ...)
  2012-04-16 10:44 ` [PATCH -V6 11/14] hugetlbfs: Add a list for tracking in-use HugeTLB pages Aneesh Kumar K.V
@ 2012-04-16 10:44 ` Aneesh Kumar K.V
  2012-04-23 22:45   ` Andrew Morton
  2012-04-16 10:44 ` [PATCH -V6 13/14] hugetlb: migrate memcg info from oldpage to new page during migration Aneesh Kumar K.V
  2012-04-16 10:44 ` [PATCH -V6 14/14] memcg: Add memory controller documentation for hugetlb management Aneesh Kumar K.V
  13 siblings, 1 reply; 44+ messages in thread
From: Aneesh Kumar K.V @ 2012-04-16 10:44 UTC (permalink / raw)
  To: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
	akpm, hannes
  Cc: linux-kernel, cgroups, Aneesh Kumar K.V

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

This adds support for removing a memcg that still has HugeTLB resource
usage.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 include/linux/hugetlb.h    |    8 ++++++
 include/linux/memcontrol.h |   14 ++++++++++
 mm/hugetlb.c               |   43 +++++++++++++++++++++++++++++
 mm/memcontrol.c            |   65 +++++++++++++++++++++++++++++++++++++-------
 4 files changed, 120 insertions(+), 10 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 6bf6afc..bada0ac 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -327,4 +327,12 @@ static inline unsigned int pages_per_huge_page(struct hstate *h)
 #define hstate_index(h) 0
 #endif
 
+#ifdef CONFIG_MEM_RES_CTLR_HUGETLB
+extern int hugetlb_force_memcg_empty(struct cgroup *cgroup);
+#else
+static inline int hugetlb_force_memcg_empty(struct cgroup *cgroup)
+{
+	return 0;
+}
+#endif
 #endif /* _LINUX_HUGETLB_H */
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 4f17574..70317e5 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -460,6 +460,9 @@ extern void mem_cgroup_hugetlb_uncharge_page(int idx, unsigned long nr_pages,
 extern void mem_cgroup_hugetlb_uncharge_memcg(int idx, unsigned long nr_pages,
 					      struct mem_cgroup *memcg);
 extern int mem_cgroup_hugetlb_file_init(int idx) __init;
+extern int mem_cgroup_move_hugetlb_parent(int idx, struct cgroup *cgroup,
+					  struct page *page);
+extern bool mem_cgroup_have_hugetlb_usage(struct cgroup *cgroup);
 
 #else
 static inline int
@@ -496,6 +499,17 @@ static inline int mem_cgroup_hugetlb_file_init(int idx)
 	return 0;
 }
 
+static inline int
+mem_cgroup_move_hugetlb_parent(int idx, struct cgroup *cgroup,
+			       struct page *page)
+{
+	return 0;
+}
+
+static inline bool mem_cgroup_have_hugetlb_usage(struct cgroup *cgroup)
+{
+	return 0;
+}
 #endif  /* CONFIG_MEM_RES_CTLR_HUGETLB */
 #endif /* _LINUX_MEMCONTROL_H */
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 8a520b5..1d3c8ea9 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1909,6 +1909,49 @@ static int __init hugetlb_init(void)
 }
 module_init(hugetlb_init);
 
+#ifdef CONFIG_MEM_RES_CTLR_HUGETLB
+/*
+ * Force the memcg to empty the hugetlb resources by moving them to
+ * the parent cgroup. We can fail if the parent cgroup's limit prevented
+ * the charging. This should only happen if use_hierarchy is not set.
+ */
+int hugetlb_force_memcg_empty(struct cgroup *cgroup)
+{
+	struct hstate *h;
+	struct page *page;
+	int ret = 0, idx = 0;
+
+	do {
+		if (cgroup_task_count(cgroup) || !list_empty(&cgroup->children))
+			goto out;
+		/*
+		 * If the task doing the cgroup_rmdir got a signal
+		 * we don't really need to loop till the hugetlb resource
+		 * usage becomes zero.
+		 */
+		if (signal_pending(current)) {
+			ret = -EINTR;
+			goto out;
+		}
+		for_each_hstate(h) {
+			spin_lock(&hugetlb_lock);
+			list_for_each_entry(page, &h->hugepage_activelist, lru) {
+				ret = mem_cgroup_move_hugetlb_parent(idx, cgroup, page);
+				if (ret) {
+					spin_unlock(&hugetlb_lock);
+					goto out;
+				}
+			}
+			spin_unlock(&hugetlb_lock);
+			idx++;
+		}
+		cond_resched();
+	} while (mem_cgroup_have_hugetlb_usage(cgroup));
+out:
+	return ret;
+}
+#endif
+
 /* Should be called on processing a hugepagesz=... option */
 void __init hugetlb_add_hstate(unsigned order)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 93e077a..0b245fb 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3228,9 +3228,11 @@ static inline int mem_cgroup_move_swap_account(swp_entry_t entry,
 #endif
 
 #ifdef CONFIG_MEM_RES_CTLR_HUGETLB
-static bool mem_cgroup_have_hugetlb_usage(struct mem_cgroup *memcg)
+bool mem_cgroup_have_hugetlb_usage(struct cgroup *cgroup)
 {
 	int idx;
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgroup);
+
 	for (idx = 0; idx < hugetlb_max_hstate; idx++) {
 		if ((res_counter_read_u64(&memcg->hugepage[idx], RES_USAGE)) > 0)
 			return 1;
@@ -3328,10 +3330,54 @@ void mem_cgroup_hugetlb_uncharge_memcg(int idx, unsigned long nr_pages,
 	res_counter_uncharge(&memcg->hugepage[idx], csize);
 	return;
 }
-#else
-static bool mem_cgroup_have_hugetlb_usage(struct mem_cgroup *memcg)
+
+int mem_cgroup_move_hugetlb_parent(int idx, struct cgroup *cgroup,
+				   struct page *page)
 {
-	return 0;
+	struct page_cgroup *pc;
+	int csize,  ret = 0;
+	struct res_counter *fail_res;
+	struct cgroup *pcgrp = cgroup->parent;
+	struct mem_cgroup *parent = mem_cgroup_from_cont(pcgrp);
+	struct mem_cgroup *memcg  = mem_cgroup_from_cont(cgroup);
+
+	if (!get_page_unless_zero(page))
+		goto out;
+
+	pc = lookup_page_cgroup(page);
+	lock_page_cgroup(pc);
+	if (!PageCgroupUsed(pc) || pc->mem_cgroup != memcg)
+		goto err_out;
+
+	csize = PAGE_SIZE << compound_order(page);
+	/*
+	 * If we have use_hierarchy set we can never fail here. So instead of
+	 * using res_counter_uncharge use the open-coded variant which just
+	 * uncharges the child res_counter. The parent will retain the charge.
+	 */
+	if (parent->use_hierarchy) {
+		unsigned long flags;
+		struct res_counter *counter;
+
+		counter = &memcg->hugepage[idx];
+		spin_lock_irqsave(&counter->lock, flags);
+		res_counter_uncharge_locked(counter, csize);
+		spin_unlock_irqrestore(&counter->lock, flags);
+	} else {
+		ret = res_counter_charge(&parent->hugepage[idx],
+					 csize, &fail_res);
+		if (ret) {
+			ret = -EBUSY;
+			goto err_out;
+		}
+		res_counter_uncharge(&memcg->hugepage[idx], csize);
+	}
+	pc->mem_cgroup = parent;
+err_out:
+	unlock_page_cgroup(pc);
+	put_page(page);
+out:
+	return ret;
 }
 #endif /* CONFIG_MEM_RES_CTLR_HUGETLB */
 
@@ -3852,6 +3898,11 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg, bool free_all)
 	/* should free all ? */
 	if (free_all)
 		goto try_to_free;
+
+	/* move the hugetlb charges */
+	ret = hugetlb_force_memcg_empty(cgrp);
+	if (ret)
+		goto out;
 move_account:
 	do {
 		ret = -EBUSY;
@@ -5176,12 +5227,6 @@ free_out:
 static int mem_cgroup_pre_destroy(struct cgroup *cont)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
-	/*
-	 * Don't allow memcg removal if we have HugeTLB resource
-	 * usage.
-	 */
-	if (mem_cgroup_have_hugetlb_usage(memcg))
-		return -EBUSY;
 
 	return mem_cgroup_force_empty(memcg, false);
 }
-- 
1.7.10


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH -V6 13/14] hugetlb: migrate memcg info from oldpage to new page during migration
  2012-04-16 10:44 [PATCH -V6 00/14] memcg: Add memcg extension to control HugeTLB allocation Aneesh Kumar K.V
                   ` (11 preceding siblings ...)
  2012-04-16 10:44 ` [PATCH -V6 12/14] memcg: move HugeTLB resource count to parent cgroup on memcg removal Aneesh Kumar K.V
@ 2012-04-16 10:44 ` Aneesh Kumar K.V
  2012-04-16 10:44 ` [PATCH -V6 14/14] memcg: Add memory controller documentation for hugetlb management Aneesh Kumar K.V
  13 siblings, 0 replies; 44+ messages in thread
From: Aneesh Kumar K.V @ 2012-04-16 10:44 UTC (permalink / raw)
  To: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
	akpm, hannes
  Cc: linux-kernel, cgroups, Aneesh Kumar K.V

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

With HugeTLB pages, memcg is uncharged in the compound page destructor.
Since we are holding a reference on the hugepage, we can be sure that
the old page won't get uncharged until the last put_page(). On a
successful migrate, we can move the memcg information to the new page's
page_cgroup and mark the old page's page_cgroup unused.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 include/linux/memcontrol.h |    8 ++++++++
 mm/memcontrol.c            |   28 ++++++++++++++++++++++++++++
 mm/migrate.c               |    4 ++++
 3 files changed, 40 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 70317e5..6f2d392 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -464,6 +464,8 @@ extern int mem_cgroup_move_hugetlb_parent(int idx, struct cgroup *cgroup,
 					  struct page *page);
 extern bool mem_cgroup_have_hugetlb_usage(struct cgroup *cgroup);
 
+extern void  mem_cgroup_hugetlb_migrate(struct page *oldhpage,
+					struct page *newhpage);
 #else
 static inline int
 mem_cgroup_hugetlb_charge_page(int idx, unsigned long nr_pages,
@@ -510,6 +512,12 @@ static inline bool mem_cgroup_have_hugetlb_usage(struct cgroup *cgroup)
 {
 	return 0;
 }
+
+static inline  void  mem_cgroup_hugetlb_migrate(struct page *oldhpage,
+						struct page *newhpage)
+{
+	return;
+}
 #endif  /* CONFIG_MEM_RES_CTLR_HUGETLB */
 #endif /* _LINUX_MEMCONTROL_H */
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0b245fb..519d370 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3379,6 +3379,34 @@ err_out:
 out:
 	return ret;
 }
+
+void  mem_cgroup_hugetlb_migrate(struct page *oldhpage, struct page *newhpage)
+{
+	struct mem_cgroup *memcg;
+	struct page_cgroup *pc;
+
+	VM_BUG_ON(!PageHuge(oldhpage));
+
+	if (mem_cgroup_disabled())
+		return;
+
+	pc = lookup_page_cgroup(oldhpage);
+	lock_page_cgroup(pc);
+	memcg = pc->mem_cgroup;
+	pc->mem_cgroup = root_mem_cgroup;
+	ClearPageCgroupUsed(pc);
+	cgroup_exclude_rmdir(&memcg->css);
+	unlock_page_cgroup(pc);
+
+	/* move the mem_cg details to new cgroup */
+	pc = lookup_page_cgroup(newhpage);
+	lock_page_cgroup(pc);
+	pc->mem_cgroup = memcg;
+	SetPageCgroupUsed(pc);
+	unlock_page_cgroup(pc);
+	cgroup_release_and_wakeup_rmdir(&memcg->css);
+	return;
+}
 #endif /* CONFIG_MEM_RES_CTLR_HUGETLB */
 
 /*
diff --git a/mm/migrate.c b/mm/migrate.c
index d7eb82d..2b931e5 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -928,6 +928,10 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
 
 	if (anon_vma)
 		put_anon_vma(anon_vma);
+
+	if (!rc)
+		mem_cgroup_hugetlb_migrate(hpage, new_hpage);
+
 	unlock_page(hpage);
 out:
 	put_page(new_hpage);
-- 
1.7.10


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH -V6 14/14] memcg: Add memory controller documentation for hugetlb management
  2012-04-16 10:44 [PATCH -V6 00/14] memcg: Add memcg extension to control HugeTLB allocation Aneesh Kumar K.V
                   ` (12 preceding siblings ...)
  2012-04-16 10:44 ` [PATCH -V6 13/14] hugetlb: migrate memcg info from oldpage to new page during migration Aneesh Kumar K.V
@ 2012-04-16 10:44 ` Aneesh Kumar K.V
  13 siblings, 0 replies; 44+ messages in thread
From: Aneesh Kumar K.V @ 2012-04-16 10:44 UTC (permalink / raw)
  To: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
	akpm, hannes
  Cc: linux-kernel, cgroups, Aneesh Kumar K.V

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 Documentation/cgroups/memory.txt |   29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index 4c95c00..d99c41b 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -43,6 +43,7 @@ Features:
  - usage threshold notifier
  - oom-killer disable knob and oom-notifier
  - Root cgroup has no limit controls.
+ - resource accounting for HugeTLB pages
 
  Kernel memory support is work in progress, and the current version provides
  basically functionality. (See Section 2.7)
@@ -75,6 +76,12 @@ Brief summary of control files.
  memory.kmem.tcp.limit_in_bytes  # set/show hard limit for tcp buf memory
  memory.kmem.tcp.usage_in_bytes  # show current tcp buf memory allocation
 
+
+ memory.hugetlb.<hugepagesize>.limit_in_bytes     # set/show limit of "hugepagesize" hugetlb usage
+ memory.hugetlb.<hugepagesize>.max_usage_in_bytes # show max "hugepagesize" hugetlb  usage recorded
+ memory.hugetlb.<hugepagesize>.usage_in_bytes     # show current res_counter usage for "hugepagesize" hugetlb
+						  # see 5.7 for details
+
 1. History
 
 The memory controller has a long history. A request for comments for the memory
@@ -279,6 +286,15 @@ per cgroup, instead of globally.
 
 * tcp memory pressure: sockets memory pressure for the tcp protocol.
 
+2.8 HugeTLB extension
+
+This extension allows HugeTLB usage to be limited per control group, with
+the limit enforced during page fault. Since HugeTLB doesn't support page
+reclaim, enforcing the limit at page fault time implies that the
+application will get a SIGBUS signal if it tries to access HugeTLB pages
+beyond its limit. This requires the application to know beforehand how
+many HugeTLB pages it will require.
+
 3. User Interface
 
 0. Configuration
@@ -287,6 +303,7 @@ a. Enable CONFIG_CGROUPS
 b. Enable CONFIG_RESOURCE_COUNTERS
 c. Enable CONFIG_CGROUP_MEM_RES_CTLR
 d. Enable CONFIG_CGROUP_MEM_RES_CTLR_SWAP (to use swap extension)
+e. Enable CONFIG_MEM_RES_CTLR_HUGETLB (to use HugeTLB extension)
 
 1. Prepare the cgroups (see cgroups.txt, Why are cgroups needed?)
 # mount -t tmpfs none /sys/fs/cgroup
@@ -510,6 +527,18 @@ unevictable=<total anon pages> N0=<node 0 pages> N1=<node 1 pages> ...
 
 And we have total = file + anon + unevictable.
 
+5.7 HugeTLB resource control files
+For a system supporting two hugepage sizes (16MB and 16GB), the control
+files include:
+
+ memory.hugetlb.16GB.limit_in_bytes
+ memory.hugetlb.16GB.max_usage_in_bytes
+ memory.hugetlb.16GB.usage_in_bytes
+ memory.hugetlb.16MB.limit_in_bytes
+ memory.hugetlb.16MB.max_usage_in_bytes
+ memory.hugetlb.16MB.usage_in_bytes
+
+
 6. Hierarchy support
 
 The memory controller supports a deep hierarchy and hierarchical accounting.
-- 
1.7.10


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH -V6 10/14] hugetlbfs: Add memcg control files for hugetlbfs
  2012-04-16 10:44 ` [PATCH -V6 10/14] hugetlbfs: Add memcg control files for hugetlbfs Aneesh Kumar K.V
@ 2012-04-16 23:13   ` Andrew Morton
  2012-04-18  6:15     ` [PATCH] memcg: Use scnprintf instead of sprintf Aneesh Kumar K.V
  2012-04-18  6:16     ` [PATCH -V6 10/14] hugetlbfs: Add memcg control files for hugetlbfs Aneesh Kumar K.V
  0 siblings, 2 replies; 44+ messages in thread
From: Andrew Morton @ 2012-04-16 23:13 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
	hannes, linux-kernel, cgroups

On Mon, 16 Apr 2012 16:14:47 +0530
"Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> wrote:

> +#ifdef CONFIG_MEM_RES_CTLR_HUGETLB
> +static char *mem_fmt(char *buf, unsigned long n)
> +{
> +	if (n >= (1UL << 30))
> +		sprintf(buf, "%luGB", n >> 30);
> +	else if (n >= (1UL << 20))
> +		sprintf(buf, "%luMB", n >> 20);
> +	else
> +		sprintf(buf, "%luKB", n >> 10);
> +	return buf;
> +}
> +
> +int __init mem_cgroup_hugetlb_file_init(int idx)
> +{
> +	char buf[32];
> +	struct cftype *cft;
> +	struct hstate *h = &hstates[idx];
> +
> +	/* format the size */
> +	mem_fmt(buf, huge_page_size(h));

The sprintf() into a fixed-sized buffer is a bit ugly.  I didn't check
it for possible overflows because 32 looks like "enough".  Actually too
much.

Oh well, it's hard to avoid.  But using scnprintf() would prevent nasty
accidents.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH] memcg: Use scnprintf instead of sprintf
  2012-04-16 23:13   ` Andrew Morton
@ 2012-04-18  6:15     ` Aneesh Kumar K.V
  2012-04-18 22:36       ` Andrew Morton
  2012-04-18  6:16     ` [PATCH -V6 10/14] hugetlbfs: Add memcg control files for hugetlbfs Aneesh Kumar K.V
  1 sibling, 1 reply; 44+ messages in thread
From: Aneesh Kumar K.V @ 2012-04-18  6:15 UTC (permalink / raw)
  To: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
	akpm, hannes
  Cc: linux-kernel, cgroups, Aneesh Kumar K.V

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

This makes sure we don't overflow the buffer.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 mm/memcontrol.c |   14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 519d370..0ccf934 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5269,14 +5269,14 @@ static void mem_cgroup_destroy(struct cgroup *cont)
 }
 
 #ifdef CONFIG_MEM_RES_CTLR_HUGETLB
-static char *mem_fmt(char *buf, unsigned long n)
+static char *mem_fmt(char *buf, int size, unsigned long hsize)
 {
-	if (n >= (1UL << 30))
-		sprintf(buf, "%luGB", n >> 30);
-	else if (n >= (1UL << 20))
-		sprintf(buf, "%luMB", n >> 20);
+	if (hsize >= (1UL << 30))
+		scnprintf(buf, size, "%luGB", hsize >> 30);
+	else if (hsize >= (1UL << 20))
+		scnprintf(buf, size, "%luMB", hsize >> 20);
 	else
-		sprintf(buf, "%luKB", n >> 10);
+		scnprintf(buf, size, "%luKB", hsize >> 10);
 	return buf;
 }
 
@@ -5287,7 +5287,7 @@ int __init mem_cgroup_hugetlb_file_init(int idx)
 	struct hstate *h = &hstates[idx];
 
 	/* format the size */
-	mem_fmt(buf, huge_page_size(h));
+	mem_fmt(buf, 32, huge_page_size(h));
 
 	/* Add the limit file */
 	cft = &h->mem_cgroup_files[0];
-- 
1.7.10


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH -V6 10/14] hugetlbfs: Add memcg control files for hugetlbfs
  2012-04-16 23:13   ` Andrew Morton
  2012-04-18  6:15     ` [PATCH] memcg: Use scnprintf instead of sprintf Aneesh Kumar K.V
@ 2012-04-18  6:16     ` Aneesh Kumar K.V
  1 sibling, 0 replies; 44+ messages in thread
From: Aneesh Kumar K.V @ 2012-04-18  6:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
	hannes, linux-kernel, cgroups

Andrew Morton <akpm@linux-foundation.org> writes:

> On Mon, 16 Apr 2012 16:14:47 +0530
> "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> wrote:
>
>> +#ifdef CONFIG_MEM_RES_CTLR_HUGETLB
>> +static char *mem_fmt(char *buf, unsigned long n)
>> +{
>> +	if (n >= (1UL << 30))
>> +		sprintf(buf, "%luGB", n >> 30);
>> +	else if (n >= (1UL << 20))
>> +		sprintf(buf, "%luMB", n >> 20);
>> +	else
>> +		sprintf(buf, "%luKB", n >> 10);
>> +	return buf;
>> +}
>> +
>> +int __init mem_cgroup_hugetlb_file_init(int idx)
>> +{
>> +	char buf[32];
>> +	struct cftype *cft;
>> +	struct hstate *h = &hstates[idx];
>> +
>> +	/* format the size */
>> +	mem_fmt(buf, huge_page_size(h));
>
> The sprintf() into a fixed-sized buffer is a bit ugly.  I didn't check
> it for possible overflows because 32 looks like "enough".  Actually too
> much.
>
> Oh well, it's hard to avoid.  But using scnprintf() would prevent nasty
> accidents.
>
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 519d370..0ccf934 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5269,14 +5269,14 @@ static void mem_cgroup_destroy(struct cgroup *cont)
 }
 
 #ifdef CONFIG_MEM_RES_CTLR_HUGETLB
-static char *mem_fmt(char *buf, unsigned long n)
+static char *mem_fmt(char *buf, int size, unsigned long hsize)
 {
-	if (n >= (1UL << 30))
-		sprintf(buf, "%luGB", n >> 30);
-	else if (n >= (1UL << 20))
-		sprintf(buf, "%luMB", n >> 20);
+	if (hsize >= (1UL << 30))
+		scnprintf(buf, size, "%luGB", hsize >> 30);
+	else if (hsize >= (1UL << 20))
+		scnprintf(buf, size, "%luMB", hsize >> 20);
 	else
-		sprintf(buf, "%luKB", n >> 10);
+		scnprintf(buf, size, "%luKB", hsize >> 10);
 	return buf;
 }
 
@@ -5287,7 +5287,7 @@ int __init mem_cgroup_hugetlb_file_init(int idx)
 	struct hstate *h = &hstates[idx];
 
 	/* format the size */
-	mem_fmt(buf, huge_page_size(h));
+	mem_fmt(buf, 32, huge_page_size(h));
 
 	/* Add the limit file */
 	cft = &h->mem_cgroup_files[0];


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH] memcg: Use scnprintf instead of sprintf
  2012-04-18  6:15     ` [PATCH] memcg: Use scnprintf instead of sprintf Aneesh Kumar K.V
@ 2012-04-18 22:36       ` Andrew Morton
  2012-04-19  8:26         ` Andreas Schwab
  0 siblings, 1 reply; 44+ messages in thread
From: Andrew Morton @ 2012-04-18 22:36 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
	hannes, linux-kernel, cgroups, James Bottomley

On Wed, 18 Apr 2012 11:45:56 +0530
"Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> wrote:

> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
> 
> This make sure we don't overflow.
> 
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> ---
>  mm/memcontrol.c |   14 +++++++-------
>  1 file changed, 7 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 519d370..0ccf934 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5269,14 +5269,14 @@ static void mem_cgroup_destroy(struct cgroup *cont)
>  }
>  
>  #ifdef CONFIG_MEM_RES_CTLR_HUGETLB
> -static char *mem_fmt(char *buf, unsigned long n)
> +static char *mem_fmt(char *buf, int size, unsigned long hsize)
>  {
> -	if (n >= (1UL << 30))
> -		sprintf(buf, "%luGB", n >> 30);
> -	else if (n >= (1UL << 20))
> -		sprintf(buf, "%luMB", n >> 20);
> +	if (hsize >= (1UL << 30))
> +		scnprintf(buf, size, "%luGB", hsize >> 30);
> +	else if (hsize >= (1UL << 20))
> +		scnprintf(buf, size, "%luMB", hsize >> 20);
>  	else
> -		sprintf(buf, "%luKB", n >> 10);
> +		scnprintf(buf, size, "%luKB", hsize >> 10);
>  	return buf;
>  }

We could use snprintf() here too but it doesn't seem to matter either
way.  I guess we _should_ use snprintf as it causes less surprise.

--- a/mm/memcontrol.c~hugetlbfs-add-memcg-control-files-for-hugetlbfs-use-scnprintf-instead-of-sprintf-fix
+++ a/mm/memcontrol.c
@@ -4037,7 +4037,7 @@ static ssize_t mem_cgroup_read(struct cg
 		BUG();
 	}
 
-	len = scnprintf(str, sizeof(str), "%llu\n", (unsigned long long)val);
+	len = snprintf(str, sizeof(str), "%llu\n", (unsigned long long)val);
 	return simple_read_from_buffer(buf, nbytes, ppos, str, len);
 }
 /*
@@ -5199,11 +5199,11 @@ static void mem_cgroup_destroy(struct cg
 static char *mem_fmt(char *buf, int size, unsigned long hsize)
 {
 	if (hsize >= (1UL << 30))
-		scnprintf(buf, size, "%luGB", hsize >> 30);
+		snprintf(buf, size, "%luGB", hsize >> 30);
 	else if (hsize >= (1UL << 20))
-		scnprintf(buf, size, "%luMB", hsize >> 20);
+		snprintf(buf, size, "%luMB", hsize >> 20);
 	else
-		scnprintf(buf, size, "%luKB", hsize >> 10);
+		snprintf(buf, size, "%luKB", hsize >> 10);
 	return buf;
 }
 

It is regrettable that your mem_fmt() exists, especially within
memcontrol.c - it is quite a generic thing.  Can't we use
lib/string_helpers.c:string_get_size()?  Or if not, modify
string_get_size() so it is usable here?

Speaking of which,

From: Andrew Morton <akpm@linux-foundation.org>
Subject: lib/string_helpers.c: make arrays static

Moving these arrays into static storage shrinks the kernel a bit:

   text    data     bss     dec     hex filename
    723     112      64     899     383 lib/string_helpers.o
    516     272      64     852     354 lib/string_helpers.o

Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 lib/string_helpers.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff -puN lib/string_helpers.c~lib-string_helpersc-make-arrays-static lib/string_helpers.c
--- a/lib/string_helpers.c~lib-string_helpersc-make-arrays-static
+++ a/lib/string_helpers.c
@@ -23,15 +23,15 @@
 int string_get_size(u64 size, const enum string_size_units units,
 		    char *buf, int len)
 {
-	const char *units_10[] = { "B", "kB", "MB", "GB", "TB", "PB",
+	static const char *units_10[] = { "B", "kB", "MB", "GB", "TB", "PB",
 				   "EB", "ZB", "YB", NULL};
-	const char *units_2[] = {"B", "KiB", "MiB", "GiB", "TiB", "PiB",
+	static const char *units_2[] = {"B", "KiB", "MiB", "GiB", "TiB", "PiB",
 				 "EiB", "ZiB", "YiB", NULL };
-	const char **units_str[] = {
+	static const char **units_str[] = {
 		[STRING_UNITS_10] =  units_10,
 		[STRING_UNITS_2] = units_2,
 	};
-	const unsigned int divisor[] = {
+	static const unsigned int divisor[] = {
 		[STRING_UNITS_10] = 1000,
 		[STRING_UNITS_2] = 1024,
 	};
_


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] memcg: Use scnprintf instead of sprintf
  2012-04-18 22:36       ` Andrew Morton
@ 2012-04-19  8:26         ` Andreas Schwab
  0 siblings, 0 replies; 44+ messages in thread
From: Andreas Schwab @ 2012-04-19  8:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Aneesh Kumar K.V, linux-mm, mgorman-l3A5Bk7waGM,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A,
	dhillf-Re5JQEeQqe8AvxtiuMwx3w, aarcange-H+wXaHxf7aLQT0dZR+AlfA,
	mhocko-AlSwsSmVLrQ, hannes-druUgvl0LCNAfugRpC6u6w, linux-kernel,
	cgroups-u79uwXL29TY76Z2rM5mHXA, James Bottomley

Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
writes:

> diff -puN lib/string_helpers.c~lib-string_helpersc-make-arrays-static lib/string_helpers.c
> --- a/lib/string_helpers.c~lib-string_helpersc-make-arrays-static
> +++ a/lib/string_helpers.c
> @@ -23,15 +23,15 @@
>  int string_get_size(u64 size, const enum string_size_units units,
>  		    char *buf, int len)
>  {
> -	const char *units_10[] = { "B", "kB", "MB", "GB", "TB", "PB",
> +	static const char *units_10[] = { "B", "kB", "MB", "GB", "TB", "PB",
>  				   "EB", "ZB", "YB", NULL};
> -	const char *units_2[] = {"B", "KiB", "MiB", "GiB", "TiB", "PiB",
> +	static const char *units_2[] = {"B", "KiB", "MiB", "GiB", "TiB", "PiB",
>  				 "EiB", "ZiB", "YiB", NULL };
> -	const char **units_str[] = {
> +	static const char **units_str[] = {
>  		[STRING_UNITS_10] =  units_10,
>  		[STRING_UNITS_2] = units_2,
>  	};

You could even make them const, I think.

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH -V6 12/14] memcg: move HugeTLB resource count to parent cgroup on memcg removal
  2012-04-16 10:44 ` [PATCH -V6 12/14] memcg: move HugeTLB resource count to parent cgroup on memcg removal Aneesh Kumar K.V
@ 2012-04-23 22:45   ` Andrew Morton
  0 siblings, 0 replies; 44+ messages in thread
From: Andrew Morton @ 2012-04-23 22:45 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
	hannes, linux-kernel, cgroups

On Mon, 16 Apr 2012 16:14:49 +0530
"Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> wrote:

> This adds support for memcg removal with HugeTLB resource usage.

include/linux/memcontrol.h:504: warning: 'struct cgroup' declared inside parameter list
include/linux/memcontrol.h:504: warning: its scope is only this definition or declaration, which is probably not what you want
include/linux/memcontrol.h:509: warning: 'struct cgroup' declared inside parameter list

Documentation/SubmitChecklist, section 2.  Please do these things -
what you have done here is to send untested code, for some
configuration options.


I'll try this:

 include/linux/hugetlb.h    |    6 +-----
 include/linux/memcontrol.h |   11 -----------
 2 files changed, 1 insertion(+), 16 deletions(-)

--- a/include/linux/memcontrol.h~memcg-move-hugetlb-resource-count-to-parent-cgroup-on-memcg-removal-fix
+++ a/include/linux/memcontrol.h
@@ -499,17 +499,6 @@ static inline int mem_cgroup_hugetlb_fil
 	return 0;
 }
 
-static inline int
-mem_cgroup_move_hugetlb_parent(int idx, struct cgroup *cgroup,
-			       struct page *page)
-{
-	return 0;
-}
-
-static inline bool mem_cgroup_have_hugetlb_usage(struct cgroup *cgroup)
-{
-	return 0;
-}
 #endif  /* CONFIG_MEM_RES_CTLR_HUGETLB */
 #endif /* _LINUX_MEMCONTROL_H */
 
--- a/include/linux/hugetlb.h~memcg-move-hugetlb-resource-count-to-parent-cgroup-on-memcg-removal-fix
+++ a/include/linux/hugetlb.h
@@ -337,10 +337,6 @@ static inline unsigned int pages_per_hug
 
 #ifdef CONFIG_MEM_RES_CTLR_HUGETLB
 extern int hugetlb_force_memcg_empty(struct cgroup *cgroup);
-#else
-static inline int hugetlb_force_memcg_empty(struct cgroup *cgroup)
-{
-	return 0;
-}
 #endif
+
 #endif /* _LINUX_HUGETLB_H */
_
 

We shouldn't be calling these functions if CONFIG_MEM_RES_CTLR_HUGETLB=n?

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH -V6 04/14] hugetlb: Use mmu_gather instead of a temporary linked list for accumulating pages
  2012-04-16 10:44 ` [PATCH -V6 04/14] hugetlb: Use mmu_gather instead of a temporary linked list for accumulating pages Aneesh Kumar K.V
@ 2012-04-23 23:44   ` Andrew Morton
  0 siblings, 0 replies; 44+ messages in thread
From: Andrew Morton @ 2012-04-23 23:44 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
	hannes, linux-kernel, cgroups

On Mon, 16 Apr 2012 16:14:41 +0530
"Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> wrote:

> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
> 
> Use mmu_gather instead of a temporary linked list for accumulating pages
> when we unmap a hugepage range. This also allows us to get rid of the
> i_mmap_mutex in unmap_hugepage_range in the following patch.
> 

Another warning and a build error, due to inadequate coverage testing.

mm/memory.c: In function 'unmap_single_vma':
mm/memory.c:1334: error: implicit declaration of function '__unmap_hugepage_range'

--- a/include/linux/hugetlb.h~hugetlb-use-mmu_gather-instead-of-a-temporary-linked-list-for-accumulating-pages-fix
+++ a/include/linux/hugetlb.h
@@ -41,8 +41,9 @@ int follow_hugetlb_page(struct mm_struct
 			unsigned long *, int *, int, unsigned int flags);
 void unmap_hugepage_range(struct vm_area_struct *,
 			  unsigned long, unsigned long, struct page *);
-void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *,
-			    unsigned long, unsigned long, struct page *);
+void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vms,
+				unsigned long start, unsigned long end,
+				struct page *ref_page);
 int hugetlb_prefault(struct address_space *, struct vm_area_struct *);
 void hugetlb_report_meminfo(struct seq_file *);
 int hugetlb_report_node_meminfo(int, char *);
@@ -119,6 +120,12 @@ static inline void copy_huge_page(struct
 
 #define hugetlb_change_protection(vma, address, end, newprot)
 
+static inline void __unmap_hugepage_range(struct mmu_gather *tlb,
+			struct vm_area_struct *vma, unsigned long start,
+			unsigned long end, struct page *ref_page)
+{
+}
+
 #endif /* !CONFIG_HUGETLB_PAGE */
 
 #define HUGETLB_ANON_FILE "anon_hugepage"


I also fixed up that __unmap_hugepage_range() declaration - it's quite
maddening to work on and review and read code when people have gone and
left out the names of the arguments.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH -V6 07/14] memcg: Add HugeTLB extension
  2012-04-16 10:44 ` [PATCH -V6 07/14] memcg: Add HugeTLB extension Aneesh Kumar K.V
@ 2012-05-02  0:20   ` Paul Gortmaker
  2012-05-03  4:37     ` Aneesh Kumar K.V
  2012-05-24 21:52   ` David Rientjes
  1 sibling, 1 reply; 44+ messages in thread
From: Paul Gortmaker @ 2012-05-02  0:20 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
	akpm, hannes, linux-kernel, cgroups, linux-next

On Mon, Apr 16, 2012 at 6:44 AM, Aneesh Kumar K.V
<aneesh.kumar@linux.vnet.ibm.com> wrote:
> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
>
> This patch implements a memcg extension that allows us to control HugeTLB
> allocations via memory controller. The extension allows to limit the

Hi Aneesh,

This breaks linux-next on some arches, because they don't have
HUGE_MAX_HSTATE in scope with the current #ifdef layout.

The breakage is in sh4, m68k, s390, and possibly others.

http://kisskb.ellerman.id.au/kisskb/buildresult/6228689/
http://kisskb.ellerman.id.au/kisskb/buildresult/6228670/
http://kisskb.ellerman.id.au/kisskb/buildresult/6228484/

This is a commit in akpm's mmotm queue, which used to be here:

http://userweb.kernel.org/~akpm/mmotm

Of course the above is invalid since userweb.kernel.org is dead.
I don't have a post-kernel.org break-in link handy and a quick
search didn't give me one, but I'm sure you'll recognize the change.

Thanks,
Paul.
--

> HugeTLB usage per control group and enforces the controller limit during
> page fault. Since HugeTLB doesn't support page reclaim, enforcing the limit
> at page fault time implies that, the application will get SIGBUS signal if it
> tries to access HugeTLB pages beyond its limit. This requires the application
> to know beforehand how much HugeTLB pages it would require for its use.
>
> The charge/uncharge calls will be added to HugeTLB code in later patch.
> Support for memcg removal will be added in later patches.
>
> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> ---
>  include/linux/hugetlb.h    |    1 +
>  include/linux/memcontrol.h |   42 ++++++++++++++
>  init/Kconfig               |    8 +++
>  mm/hugetlb.c               |    2 +-
>  mm/memcontrol.c            |  132 ++++++++++++++++++++++++++++++++++++++++++++
>  5 files changed, 184 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 46c6cbd..995c238 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -226,6 +226,7 @@ struct hstate *size_to_hstate(unsigned long size);
>  #define HUGE_MAX_HSTATE 1
>  #endif
>
> +extern int hugetlb_max_hstate;
>  extern struct hstate hstates[HUGE_MAX_HSTATE];
>  extern unsigned int default_hstate_idx;
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index f94efd2..1d07e14 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -448,5 +448,47 @@ static inline void sock_release_memcg(struct sock *sk)
>  {
>  }
>  #endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
> +
> +#ifdef CONFIG_MEM_RES_CTLR_HUGETLB
> +extern int mem_cgroup_hugetlb_charge_page(int idx, unsigned long nr_pages,
> +                                         struct mem_cgroup **ptr);
> +extern void mem_cgroup_hugetlb_commit_charge(int idx, unsigned long nr_pages,
> +                                            struct mem_cgroup *memcg,
> +                                            struct page *page);
> +extern void mem_cgroup_hugetlb_uncharge_page(int idx, unsigned long nr_pages,
> +                                            struct page *page);
> +extern void mem_cgroup_hugetlb_uncharge_memcg(int idx, unsigned long nr_pages,
> +                                             struct mem_cgroup *memcg);
> +
> +#else
> +static inline int
> +mem_cgroup_hugetlb_charge_page(int idx, unsigned long nr_pages,
> +                                                struct mem_cgroup **ptr)
> +{
> +       return 0;
> +}
> +
> +static inline void
> +mem_cgroup_hugetlb_commit_charge(int idx, unsigned long nr_pages,
> +                                struct mem_cgroup *memcg,
> +                                struct page *page)
> +{
> +       return;
> +}
> +
> +static inline void
> +mem_cgroup_hugetlb_uncharge_page(int idx, unsigned long nr_pages,
> +                                struct page *page)
> +{
> +       return;
> +}
> +
> +static inline void
> +mem_cgroup_hugetlb_uncharge_memcg(int idx, unsigned long nr_pages,
> +                                 struct mem_cgroup *memcg)
> +{
> +       return;
> +}
> +#endif  /* CONFIG_MEM_RES_CTLR_HUGETLB */
>  #endif /* _LINUX_MEMCONTROL_H */
>
> diff --git a/init/Kconfig b/init/Kconfig
> index 72f33fa..a3b5665 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -716,6 +716,14 @@ config CGROUP_PERF
>
>          Say N if unsure.
>
> +config MEM_RES_CTLR_HUGETLB
> +       bool "Memory Resource Controller HugeTLB Extension (EXPERIMENTAL)"
> +       depends on CGROUP_MEM_RES_CTLR && HUGETLB_PAGE && EXPERIMENTAL
> +       default n
> +       help
> +         Add HugeTLB management to memory resource controller. When you
> +         enable this, you can put a per cgroup limit on HugeTLB usage.
> +
>  menuconfig CGROUP_SCHED
>        bool "Group CPU scheduler"
>        default n
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index a3ac624..8cd89b4 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -35,7 +35,7 @@ const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL;
>  static gfp_t htlb_alloc_mask = GFP_HIGHUSER;
>  unsigned long hugepages_treat_as_movable;
>
> -static int hugetlb_max_hstate;
> +int hugetlb_max_hstate;
>  unsigned int default_hstate_idx;
>  struct hstate hstates[HUGE_MAX_HSTATE];
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 901bb03..884f479 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -252,6 +252,10 @@ struct mem_cgroup {
>        };
>
>        /*
> +        * the counter to account for hugepages from hugetlb.
> +        */
> +       struct res_counter hugepage[HUGE_MAX_HSTATE];
> +       /*
>         * Per cgroup active and inactive list, similar to the
>         * per zone LRU lists.
>         */
> @@ -3213,6 +3217,114 @@ static inline int mem_cgroup_move_swap_account(swp_entry_t entry,
>  }
>  #endif
>
> +#ifdef CONFIG_MEM_RES_CTLR_HUGETLB
> +static bool mem_cgroup_have_hugetlb_usage(struct mem_cgroup *memcg)
> +{
> +       int idx;
> +       for (idx = 0; idx < hugetlb_max_hstate; idx++) {
> +               if ((res_counter_read_u64(&memcg->hugepage[idx], RES_USAGE)) > 0)
> +                       return 1;
> +       }
> +       return 0;
> +}
> +
> +int mem_cgroup_hugetlb_charge_page(int idx, unsigned long nr_pages,
> +                                  struct mem_cgroup **ptr)
> +{
> +       int ret = 0;
> +       struct mem_cgroup *memcg = NULL;
> +       struct res_counter *fail_res;
> +       unsigned long csize = nr_pages * PAGE_SIZE;
> +
> +       if (mem_cgroup_disabled())
> +               goto done;
> +again:
> +       rcu_read_lock();
> +       memcg = mem_cgroup_from_task(current);
> +       if (!memcg)
> +               memcg = root_mem_cgroup;
> +
> +       if (!css_tryget(&memcg->css)) {
> +               rcu_read_unlock();
> +               goto again;
> +       }
> +       rcu_read_unlock();
> +
> +       ret = res_counter_charge(&memcg->hugepage[idx], csize, &fail_res);
> +       css_put(&memcg->css);
> +done:
> +       *ptr = memcg;
> +       return ret;
> +}
> +
> +void mem_cgroup_hugetlb_commit_charge(int idx, unsigned long nr_pages,
> +                                     struct mem_cgroup *memcg,
> +                                     struct page *page)
> +{
> +       struct page_cgroup *pc;
> +
> +       if (mem_cgroup_disabled())
> +               return;
> +
> +       pc = lookup_page_cgroup(page);
> +       lock_page_cgroup(pc);
> +       if (unlikely(PageCgroupUsed(pc))) {
> +               unlock_page_cgroup(pc);
> +               mem_cgroup_hugetlb_uncharge_memcg(idx, nr_pages, memcg);
> +               return;
> +       }
> +       pc->mem_cgroup = memcg;
> +       SetPageCgroupUsed(pc);
> +       unlock_page_cgroup(pc);
> +       return;
> +}
> +
> +void mem_cgroup_hugetlb_uncharge_page(int idx, unsigned long nr_pages,
> +                                     struct page *page)
> +{
> +       struct page_cgroup *pc;
> +       struct mem_cgroup *memcg;
> +       unsigned long csize = nr_pages * PAGE_SIZE;
> +
> +       if (mem_cgroup_disabled())
> +               return;
> +
> +       pc = lookup_page_cgroup(page);
> +       if (unlikely(!PageCgroupUsed(pc)))
> +               return;
> +
> +       lock_page_cgroup(pc);
> +       if (!PageCgroupUsed(pc)) {
> +               unlock_page_cgroup(pc);
> +               return;
> +       }
> +       memcg = pc->mem_cgroup;
> +       pc->mem_cgroup = root_mem_cgroup;
> +       ClearPageCgroupUsed(pc);
> +       unlock_page_cgroup(pc);
> +
> +       res_counter_uncharge(&memcg->hugepage[idx], csize);
> +       return;
> +}
> +
> +void mem_cgroup_hugetlb_uncharge_memcg(int idx, unsigned long nr_pages,
> +                                      struct mem_cgroup *memcg)
> +{
> +       unsigned long csize = nr_pages * PAGE_SIZE;
> +
> +       if (mem_cgroup_disabled())
> +               return;
> +
> +       res_counter_uncharge(&memcg->hugepage[idx], csize);
> +       return;
> +}
> +#else
> +static bool mem_cgroup_have_hugetlb_usage(struct mem_cgroup *memcg)
> +{
> +       return 0;
> +}
> +#endif /* CONFIG_MEM_RES_CTLR_HUGETLB */
> +
>  /*
>  * Before starting migration, account PAGE_SIZE to mem_cgroup that the old
>  * page belongs to.
> @@ -4955,6 +5067,7 @@ err_cleanup:
>  static struct cgroup_subsys_state * __ref
>  mem_cgroup_create(struct cgroup *cont)
>  {
> +       int idx;
>        struct mem_cgroup *memcg, *parent;
>        long error = -ENOMEM;
>        int node;
> @@ -4997,9 +5110,22 @@ mem_cgroup_create(struct cgroup *cont)
>                 * mem_cgroup(see mem_cgroup_put).
>                 */
>                mem_cgroup_get(parent);
> +               /*
> +                * We could get called before hugetlb init is called.
> +                * Use HUGE_MAX_HSTATE as the max index.
> +                */
> +               for (idx = 0; idx < HUGE_MAX_HSTATE; idx++)
> +                       res_counter_init(&memcg->hugepage[idx],
> +                                        &parent->hugepage[idx]);
>        } else {
>                res_counter_init(&memcg->res, NULL);
>                res_counter_init(&memcg->memsw, NULL);
> +               /*
> +                * We could get called before hugetlb init is called.
> +                * Use HUGE_MAX_HSTATE as the max index.
> +                */
> +               for (idx = 0; idx < HUGE_MAX_HSTATE; idx++)
> +                       res_counter_init(&memcg->hugepage[idx], NULL);
>        }
>        memcg->last_scanned_node = MAX_NUMNODES;
>        INIT_LIST_HEAD(&memcg->oom_notify);
> @@ -5030,6 +5156,12 @@ free_out:
>  static int mem_cgroup_pre_destroy(struct cgroup *cont)
>  {
>        struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
> +       /*
> +        * Don't allow memcg removal if we have HugeTLB resource
> +        * usage.
> +        */
> +       if (mem_cgroup_have_hugetlb_usage(memcg))
> +               return -EBUSY;
>
>        return mem_cgroup_force_empty(memcg, false);
>  }
> --
> 1.7.10
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH -V6 07/14] memcg: Add HugeTLB extension
  2012-05-02  0:20   ` Paul Gortmaker
@ 2012-05-03  4:37     ` Aneesh Kumar K.V
  0 siblings, 0 replies; 44+ messages in thread
From: Aneesh Kumar K.V @ 2012-05-03  4:37 UTC (permalink / raw)
  To: Paul Gortmaker
  Cc: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
	akpm, hannes, linux-kernel, cgroups, linux-next

Paul Gortmaker <paul.gortmaker@windriver.com> writes:

> On Mon, Apr 16, 2012 at 6:44 AM, Aneesh Kumar K.V
> <aneesh.kumar@linux.vnet.ibm.com> wrote:
>> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
>>
>> This patch implements a memcg extension that allows us to control HugeTLB
>> allocations via memory controller. The extension allows to limit the
>
> Hi Aneesh,
>
> This breaks linux-next on some arch because they don't have any
> HUGE_MAX_HSTATE in scope with the current #ifdef layout.
>
> The breakage is in sh4, m68k, s390, and possibly others.
>
> http://kisskb.ellerman.id.au/kisskb/buildresult/6228689/
> http://kisskb.ellerman.id.au/kisskb/buildresult/6228670/
> http://kisskb.ellerman.id.au/kisskb/buildresult/6228484/
>
> This is a commit in akpm's mmotm queue, which used to be here:
>
> http://userweb.kernel.org/~akpm/mmotm
>
> Of course the above is invalid since userweb.kernel.org is dead.
> I don't have a post-kernel.org break-in link handy and a quick
> search didn't give me one, but I'm sure you'll recognize the change.
>

Andrew has the below patch

http://article.gmane.org/gmane.linux.kernel.commits.mm/71649

Does that fix the error?

-aneesh


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH -V6 01/14] hugetlb: rename max_hstate to hugetlb_max_hstate
  2012-04-16 10:44 ` [PATCH -V6 01/14] hugetlb: rename max_hstate to hugetlb_max_hstate Aneesh Kumar K.V
@ 2012-05-24 21:11   ` David Rientjes
  0 siblings, 0 replies; 44+ messages in thread
From: David Rientjes @ 2012-05-24 21:11 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
	akpm, hannes, linux-kernel, cgroups

On Mon, 16 Apr 2012, Aneesh Kumar K.V wrote:

> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
> 
> We will be using this from other subsystems like memcg
> in later patches.
> 
> Acked-by: Michal Hocko <mhocko@suse.cz>
> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Acked-by: Hillf Danton <dhillf@gmail.com>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

Acked-by: David Rientjes <rientjes@google.com>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH -V6 02/14] hugetlbfs: don't use ERR_PTR with VM_FAULT* values
  2012-04-16 10:44 ` [PATCH -V6 02/14] hugetlbfs: don't use ERR_PTR with VM_FAULT* values Aneesh Kumar K.V
@ 2012-05-24 21:17   ` David Rientjes
  0 siblings, 0 replies; 44+ messages in thread
From: David Rientjes @ 2012-05-24 21:17 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
	akpm, hannes, linux-kernel, cgroups

On Mon, 16 Apr 2012, Aneesh Kumar K.V wrote:

> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
> 
> Using VM_FAULT_* codes with ERR_PTR will require us to make sure
> VM_FAULT_* values will not exceed MAX_ERRNO value.
> 

Is it worth the extra branches?

> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> ---
>  mm/hugetlb.c |   18 +++++++++++++-----
>  1 file changed, 13 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 766eb90..5a472a5 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1123,10 +1123,10 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
>  	 */
>  	chg = vma_needs_reservation(h, vma, addr);
>  	if (chg < 0)
> -		return ERR_PTR(-VM_FAULT_OOM);
> +		return ERR_PTR(-ENOMEM);
>  	if (chg)
>  		if (hugepage_subpool_get_pages(spool, chg))
> -			return ERR_PTR(-VM_FAULT_SIGBUS);
> +			return ERR_PTR(-ENOSPC);
>  
>  	spin_lock(&hugetlb_lock);
>  	page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve);
> @@ -1136,7 +1136,7 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
>  		page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
>  		if (!page) {
>  			hugepage_subpool_put_pages(spool, chg);
> -			return ERR_PTR(-VM_FAULT_SIGBUS);
> +			return ERR_PTR(-ENOSPC);
>  		}
>  	}
>  
> @@ -2486,6 +2486,7 @@ retry_avoidcopy:
>  	new_page = alloc_huge_page(vma, address, outside_reserve);
>  
>  	if (IS_ERR(new_page)) {
> +		int err = PTR_ERR(new_page);
>  		page_cache_release(old_page);
>  
>  		/*

PTR_ERR() returns long.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH -V6 03/14] hugetlbfs: Add an inline helper for finding hstate index
  2012-04-16 10:44 ` [PATCH -V6 03/14] hugetlbfs: Add an inline helper for finding hstate index Aneesh Kumar K.V
@ 2012-05-24 21:22   ` David Rientjes
  2012-05-27 20:07     ` Aneesh Kumar K.V
  0 siblings, 1 reply; 44+ messages in thread
From: David Rientjes @ 2012-05-24 21:22 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
	akpm, hannes, linux-kernel, cgroups

On Mon, 16 Apr 2012, Aneesh Kumar K.V wrote:

> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
> 
> Add an inline helper and use it in the code.
> 
> Acked-by: Michal Hocko <mhocko@suse.cz>
> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

I like the helper function, but you missed using it in 
hugetlb_init().

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH -V6 06/14] hugetlb: Simplify migrate_huge_page
  2012-04-16 10:44 ` [PATCH -V6 06/14] hugetlb: Simplify migrate_huge_page Aneesh Kumar K.V
@ 2012-05-24 21:35   ` David Rientjes
  2012-05-27 20:13     ` Aneesh Kumar K.V
  0 siblings, 1 reply; 44+ messages in thread
From: David Rientjes @ 2012-05-24 21:35 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
	akpm, hannes, linux-kernel, cgroups

On Mon, 16 Apr 2012, Aneesh Kumar K.V wrote:

> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 97cc273..1f092db 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1414,7 +1414,6 @@ static int soft_offline_huge_page(struct page *page, int flags)
>  	int ret;
>  	unsigned long pfn = page_to_pfn(page);
>  	struct page *hpage = compound_head(page);
> -	LIST_HEAD(pagelist);
>  
>  	ret = get_any_page(page, pfn, flags);
>  	if (ret < 0)
> @@ -1429,19 +1428,11 @@ static int soft_offline_huge_page(struct page *page, int flags)
>  	}
>  
>  	/* Keep page count to indicate a given hugepage is isolated. */
> -
> -	list_add(&hpage->lru, &pagelist);
> -	ret = migrate_huge_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL, 0,
> -				true);
> +	ret = migrate_huge_page(page, new_page, MPOL_MF_MOVE_ALL, 0, true);

Was this tested?  Shouldn't this be migrate_huge_page(compound_head(page), 
...)?

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH -V6 07/14] memcg: Add HugeTLB extension
  2012-04-16 10:44 ` [PATCH -V6 07/14] memcg: Add HugeTLB extension Aneesh Kumar K.V
  2012-05-02  0:20   ` Paul Gortmaker
@ 2012-05-24 21:52   ` David Rientjes
  2012-05-24 22:57     ` Andrew Morton
  2012-05-27 20:28     ` Aneesh Kumar K.V
  1 sibling, 2 replies; 44+ messages in thread
From: David Rientjes @ 2012-05-24 21:52 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-mm, mgorman, KAMEZAWA Hiroyuki, dhillf, aarcange, mhocko,
	Andrew Morton, hannes, linux-kernel, cgroups

On Mon, 16 Apr 2012, Aneesh Kumar K.V wrote:

> This patch implements a memcg extension that allows us to control HugeTLB
> allocations via memory controller. The extension allows to limit the
> HugeTLB usage per control group and enforces the controller limit during
> page fault. Since HugeTLB doesn't support page reclaim, enforcing the limit
> at page fault time implies that, the application will get SIGBUS signal if it
> tries to access HugeTLB pages beyond its limit. This requires the application
> to know beforehand how much HugeTLB pages it would require for its use.
> 
> The charge/uncharge calls will be added to HugeTLB code in later patch.
> Support for memcg removal will be added in later patches.
> 

Again, I disagree with this approach because it's adding the functionality 
to memcg when it's unnecessary; it would be a completely legitimate use case 
to want to limit the number of globally available hugepages to a set of 
tasks without incurring the per-page tracking from memcg.

This can be implemented as a separate cgroup and as we move to a single 
hierarchy, you lose no functionality if you mount both cgroups from what 
is done here.

It would be much cleaner in terms of

 - build: not requiring ifdefs and dependencies on CONFIG_HUGETLB_PAGE, 
   which is a prerequisite for this functionality and is not for 
   CONFIG_CGROUP_MEM_RES_CTLR,

 - code: separating hugetlb bits out from memcg bits to avoid growing 
   mm/memcontrol.c beyond its current 5650 lines, and

 - performance: not incurring any overhead of enabling memcg for per-
   page tracking that is unnecessary if users only want to limit hugetlb 
   pages.

Kmem accounting and swap accounting are really a separate topic and make 
sense to incorporate directly into memcg because their usage is a 
single number, the same is not true for hugetlb pages where charging one 
1GB page is not the same as charging 512 2M pages.  And we have no 
usecases for wanting to track kmem or swap only without user page 
tracking, what would be the point?

There's a reason we don't enable CONFIG_CGROUP_MEM_RES_CTLR in the 
defconfig, we don't want the extra 1% metadata overhead of enabling it and 
the potential performance regression from doing per-page tracking if we 
only want to limit a global resource (hugetlb pages) to a set of tasks.

So please consider separating this functionality out into its own cgroup, 
there's no reason not to do it and it would benefit hugetlb users who 
don't want to incur the disadvantages of enabling memcg entirely.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH -V6 07/14] memcg: Add HugeTLB extension
  2012-05-24 21:52   ` David Rientjes
@ 2012-05-24 22:57     ` Andrew Morton
  2012-05-24 23:20       ` David Rientjes
  2012-05-27 20:28     ` Aneesh Kumar K.V
  1 sibling, 1 reply; 44+ messages in thread
From: Andrew Morton @ 2012-05-24 22:57 UTC (permalink / raw)
  To: David Rientjes
  Cc: Aneesh Kumar K.V, linux-mm, mgorman, KAMEZAWA Hiroyuki, dhillf,
	aarcange, mhocko, hannes, linux-kernel, cgroups

On Thu, 24 May 2012 14:52:26 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> On Mon, 16 Apr 2012, Aneesh Kumar K.V wrote:
> 
> > This patch implements a memcg extension that allows us to control HugeTLB
> > allocations via memory controller. The extension allows to limit the
> > HugeTLB usage per control group and enforces the controller limit during
> > page fault. Since HugeTLB doesn't support page reclaim, enforcing the limit
> > at page fault time implies that, the application will get SIGBUS signal if it
> > tries to access HugeTLB pages beyond its limit. This requires the application
> > to know beforehand how much HugeTLB pages it would require for its use.
> > 
> > The charge/uncharge calls will be added to HugeTLB code in later patch.
> > Support for memcg removal will be added in later patches.
> > 
> 
> Again, I disagree with this approach because it's adding the functionality 
> to memcg when it's unnecessary; it would be a complete legitimate usecase 
> to want to limit the number of globally available hugepages to a set of 
> tasks without incurring the per-page tracking from memcg.
> 
> This can be implemented as a seperate cgroup and as we move to a single 
> hierarchy, you lose no functionality if you mount both cgroups from what 
> is done here.
> 
> It would be much cleaner in terms of
> 
>  - build: not requiring ifdefs and dependencies on CONFIG_HUGETLB_PAGE, 
>    which is a prerequisite for this functionality and is not for 
>    CONFIG_CGROUP_MEM_RES_CTLR,
> 
>  - code: seperating hugetlb bits out from memcg bits to avoid growing 
>    mm/memcontrol.c beyond its current 5650 lines, and
> 
>  - performance: not incurring any overhead of enabling memcg for per-
>    page tracking that is unnecessary if users only want to limit hugetlb 
>    pages.
> 
> Kmem accounting and swap accounting is really a seperate topic and makes 
> sense to be incorporated directly into memcg because their usage is a 
> single number, the same is not true for hugetlb pages where charging one 
> 1GB page is not the same as charging 512 2M pages.  And we have no 
> usecases for wanting to track kmem or swap only without user page 
> tracking, what would be the point?
> 
> There's a reason we don't enable CONFIG_CGROUP_MEM_RES_CTLR in the 
> defconfig, we don't want the extra 1% metadata overhead of enabling it and 
> the potential performance regression from doing per-page tracking if we 
> only want to limit a global resource (hugetlb pages) to a set of tasks.
> 
> So please consider seperating this functionality out into its own cgroup, 
> there's no reason not to do it and it would benefit hugetlb users who 
> don't want to incur the disadvantages of enabling memcg entirely.

These arguments look pretty strong to me.  But poorly timed :(

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH -V6 07/14] memcg: Add HugeTLB extension
  2012-05-24 22:57     ` Andrew Morton
@ 2012-05-24 23:20       ` David Rientjes
  0 siblings, 0 replies; 44+ messages in thread
From: David Rientjes @ 2012-05-24 23:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Aneesh Kumar K.V, linux-mm, mgorman, KAMEZAWA Hiroyuki, dhillf,
	aarcange, mhocko, hannes, linux-kernel, cgroups

On Thu, 24 May 2012, Andrew Morton wrote:

> These arguments look pretty strong to me.  But poorly timed :(
> 

What I argued here is nothing new, I said the same thing back on April 27 
and I was expecting it to be reproposed as a separate controller.  The 
counter argument that memcg shouldn't cause a performance degradation 
doesn't hold water: you can't expect every page to be tracked without 
incurring some penalty somewhere.  And it certainly causes ~1% of memory 
to be used up at boot with all the struct page_cgroups.

The counter argument that we'd have to duplicate cgroup setup and 
initialization code from memcg also is irrelevant: all generic cgroup 
mounting, creation, and initialization code should be in kernel/cgroup.c.  
Obviously there will be added code because we're introducing a new cgroup, 
but that's not a reason to force everybody who wants to control hugetlb 
pages to be forced to enable memcg.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH -V6 03/14] hugetlbfs: Add an inline helper for finding hstate index
  2012-05-24 21:22   ` David Rientjes
@ 2012-05-27 20:07     ` Aneesh Kumar K.V
  0 siblings, 0 replies; 44+ messages in thread
From: Aneesh Kumar K.V @ 2012-05-27 20:07 UTC (permalink / raw)
  To: David Rientjes
  Cc: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
	akpm, hannes, linux-kernel, cgroups

On Thu, May 24, 2012 at 02:22:27PM -0700, David Rientjes wrote:
> On Mon, 16 Apr 2012, Aneesh Kumar K.V wrote:
> 
> > From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
> > 
> > Add an inline helper and use it in the code.
> > 
> > Acked-by: Michal Hocko <mhocko@suse.cz>
> > Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> 
> I like the helper function, but you missed using it in 
> hugetlb_init().
> 

Will do this as an add-on patch on top:

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 4b90dd5..58eead5 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1891,7 +1891,7 @@ static int __init hugetlb_init(void)
 		if (!size_to_hstate(default_hstate_size))
 			hugetlb_add_hstate(HUGETLB_PAGE_ORDER);
 	}
-	default_hstate_idx = size_to_hstate(default_hstate_size) - hstates;
+	default_hstate_idx = hstate_index(size_to_hstate(default_hstate_size));
 	if (default_hstate_max_huge_pages)
 		default_hstate.max_huge_pages = default_hstate_max_huge_pages;
 


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH -V6 06/14] hugetlb: Simplify migrate_huge_page
  2012-05-24 21:35   ` David Rientjes
@ 2012-05-27 20:13     ` Aneesh Kumar K.V
  0 siblings, 0 replies; 44+ messages in thread
From: Aneesh Kumar K.V @ 2012-05-27 20:13 UTC (permalink / raw)
  To: David Rientjes
  Cc: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
	akpm, hannes, linux-kernel, cgroups

On Thu, May 24, 2012 at 02:35:05PM -0700, David Rientjes wrote:
> On Mon, 16 Apr 2012, Aneesh Kumar K.V wrote:
> 
> > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > index 97cc273..1f092db 100644
> > --- a/mm/memory-failure.c
> > +++ b/mm/memory-failure.c
> > @@ -1414,7 +1414,6 @@ static int soft_offline_huge_page(struct page *page, int flags)
> >  	int ret;
> >  	unsigned long pfn = page_to_pfn(page);
> >  	struct page *hpage = compound_head(page);
> > -	LIST_HEAD(pagelist);
> >  
> >  	ret = get_any_page(page, pfn, flags);
> >  	if (ret < 0)
> > @@ -1429,19 +1428,11 @@ static int soft_offline_huge_page(struct page *page, int flags)
> >  	}
> >  
> >  	/* Keep page count to indicate a given hugepage is isolated. */
> > -
> > -	list_add(&hpage->lru, &pagelist);
> > -	ret = migrate_huge_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL, 0,
> > -				true);
> > +	ret = migrate_huge_page(page, new_page, MPOL_MF_MOVE_ALL, 0, true);
> 
> Was this tested?  Shouldn't this be migrate_huge_page(compound_head(page), 
> ...)?
> 

I tested this using madvise, but not with tail pages. How about the below diff?

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 4a45098..53a1495 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1428,8 +1428,8 @@ static int soft_offline_huge_page(struct page *page, int flags)
 	}
 
 	/* Keep page count to indicate a given hugepage is isolated. */
-	ret = migrate_huge_page(page, new_page, MPOL_MF_MOVE_ALL, 0, true);
-	put_page(page);
+	ret = migrate_huge_page(hpage, new_page, MPOL_MF_MOVE_ALL, 0, true);
+	put_page(hpage);
 	if (ret) {
 		pr_info("soft offline: %#lx: migration failed %d, type %lx\n",
 			pfn, ret, page->flags);


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH -V6 07/14] memcg: Add HugeTLB extension
  2012-05-24 21:52   ` David Rientjes
  2012-05-24 22:57     ` Andrew Morton
@ 2012-05-27 20:28     ` Aneesh Kumar K.V
  2012-05-30 14:43       ` Aneesh Kumar K.V
  1 sibling, 1 reply; 44+ messages in thread
From: Aneesh Kumar K.V @ 2012-05-27 20:28 UTC (permalink / raw)
  To: David Rientjes
  Cc: linux-mm, mgorman, KAMEZAWA Hiroyuki, dhillf, aarcange, mhocko,
	Andrew Morton, hannes, linux-kernel, cgroups

On Thu, May 24, 2012 at 02:52:26PM -0700, David Rientjes wrote:
> On Mon, 16 Apr 2012, Aneesh Kumar K.V wrote:
> 
> > This patch implements a memcg extension that allows us to control HugeTLB
> > allocations via memory controller. The extension allows to limit the
> > HugeTLB usage per control group and enforces the controller limit during
> > page fault. Since HugeTLB doesn't support page reclaim, enforcing the limit
> > at page fault time implies that, the application will get SIGBUS signal if it
> > tries to access HugeTLB pages beyond its limit. This requires the application
> > to know beforehand how much HugeTLB pages it would require for its use.
> > 
> > The charge/uncharge calls will be added to HugeTLB code in later patch.
> > Support for memcg removal will be added in later patches.
> > 
> 
> Again, I disagree with this approach because it's adding the functionality 
> to memcg when it's unnecessary; it would be a complete legitimate usecase 
> to want to limit the number of globally available hugepages to a set of 
> tasks without incurring the per-page tracking from memcg.
> 
> This can be implemented as a seperate cgroup and as we move to a single 
> hierarchy, you lose no functionality if you mount both cgroups from what 
> is done here.
> 
> It would be much cleaner in terms of
> 
>  - build: not requiring ifdefs and dependencies on CONFIG_HUGETLB_PAGE, 
>    which is a prerequisite for this functionality and is not for 
>    CONFIG_CGROUP_MEM_RES_CTLR,

I am not sure we have a large number of #ifdefs, as you outlined above.
Most of the hugetlb limit code is already well isolated. If we were to
split it out as a separate controller, we would be duplicating code
related to cgroup deletion, migration support, etc. from memcg, because
both memcg and the hugetlb limit depend on struct page. So I would expect
we would end up with #ifdefs around that code, or duplicate it in the
new controller, if we were to do the hugetlb limit as a separate controller.

Another reason for it to be part of memcg is that it is natural to look
at hugetlb usage also as memory usage. One piece of feedback I got
for the earlier post was to see if I can enhance the current code to
make sure memory.usage_in_bytes can also account for hugetlb usage.
People would also like to use memory.limit_in_bytes to limit total
usage (inclusive of hugetlb).

> 
>  - code: seperating hugetlb bits out from memcg bits to avoid growing 
>    mm/memcontrol.c beyond its current 5650 lines, and
> 

I can definitely look at splitting mm/memcontrol.c.


>  - performance: not incurring any overhead of enabling memcg for per-
>    page tracking that is unnecessary if users only want to limit hugetlb 
>    pages.
> 

-aneesh


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH -V6 07/14] memcg: Add HugeTLB extension
  2012-05-27 20:28     ` Aneesh Kumar K.V
@ 2012-05-30 14:43       ` Aneesh Kumar K.V
  2012-06-08 23:06         ` Andrew Morton
  0 siblings, 1 reply; 44+ messages in thread
From: Aneesh Kumar K.V @ 2012-05-30 14:43 UTC (permalink / raw)
  To: David Rientjes
  Cc: linux-mm, mgorman, KAMEZAWA Hiroyuki, dhillf, aarcange, mhocko,
	Andrew Morton, hannes, linux-kernel, cgroups

"Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> writes:

> On Thu, May 24, 2012 at 02:52:26PM -0700, David Rientjes wrote:
>> On Mon, 16 Apr 2012, Aneesh Kumar K.V wrote:
>> 
>> > This patch implements a memcg extension that allows us to control HugeTLB
>> > allocations via memory controller. The extension allows to limit the
>> > HugeTLB usage per control group and enforces the controller limit during
>> > page fault. Since HugeTLB doesn't support page reclaim, enforcing the limit
>> > at page fault time implies that, the application will get SIGBUS signal if it
>> > tries to access HugeTLB pages beyond its limit. This requires the application
>> > to know beforehand how much HugeTLB pages it would require for its use.
>> > 
>> > The charge/uncharge calls will be added to HugeTLB code in later patch.
>> > Support for memcg removal will be added in later patches.
>> > 
>> 
>> Again, I disagree with this approach because it's adding the functionality 
>> to memcg when it's unnecessary; it would be a complete legitimate usecase 
>> to want to limit the number of globally available hugepages to a set of 
>> tasks without incurring the per-page tracking from memcg.
>> 
>> This can be implemented as a seperate cgroup and as we move to a single 
>> hierarchy, you lose no functionality if you mount both cgroups from what 
>> is done here.
>> 
>> It would be much cleaner in terms of
>> 
>>  - build: not requiring ifdefs and dependencies on CONFIG_HUGETLB_PAGE, 
>>    which is a prerequisite for this functionality and is not for 
>>    CONFIG_CGROUP_MEM_RES_CTLR,
>
> I am not sure we have large number of #ifdef as you have outlined above.
> Most of the hugetlb limit code is well isolated already. If we were to
> split it as a seperate controller, we will be duplicating code related
> cgroup deletion,  migration support etc from memcg, because in case
> of memcg and hugetlb limit they depend on struct page. So I would expect
> we would be end up #ifdef around that code or duplicate them in the
> new controller if we were to do hugetlb limit as a seperate controller.
>
> Another reason for it to be part of memcg is, it is normal to look
> at hugetlb usage also as a memory usage. One of the feedback I got
> for the earlier post is to see if i can enhace the current code to
> make sure memory.usage_in_bytes can also account for hugetlb usage.
> People would also like to look at memory.limit_in_bytes to limit total
> usage. (inclusive of hugetlb).
>
>> 
>>  - code: seperating hugetlb bits out from memcg bits to avoid growing 
>>    mm/memcontrol.c beyond its current 5650 lines, and
>> 
>
> I can definitely look at spliting mm/memcontrol.c 
>
>
>>  - performance: not incurring any overhead of enabling memcg for per-
>>    page tracking that is unnecessary if users only want to limit hugetlb 
>>    pages.
>> 

Since Andrew didn't send the patchset to Linus because of this
discussion, I looked at reworking the patchset as a separate
controller. The patchset I sent here

http://thread.gmane.org/gmane.linux.kernel.mm/79230

has seen minimal testing. I also folded the fixup patches
Andrew had in -mm into the original patchset.

Let me know if the changes look good.
-aneesh


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH -V6 07/14] memcg: Add HugeTLB extension
  2012-05-30 14:43       ` Aneesh Kumar K.V
@ 2012-06-08 23:06         ` Andrew Morton
  2012-06-09 14:16           ` Aneesh Kumar K.V
                             ` (2 more replies)
  0 siblings, 3 replies; 44+ messages in thread
From: Andrew Morton @ 2012-06-08 23:06 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: David Rientjes, linux-mm, mgorman, KAMEZAWA Hiroyuki, dhillf,
	aarcange, mhocko, hannes, linux-kernel, cgroups, Michal Hocko,
	Ying Han

On Wed, 30 May 2012 20:13:31 +0530
"Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> wrote:

> >> 
> >>  - code: seperating hugetlb bits out from memcg bits to avoid growing 
> >>    mm/memcontrol.c beyond its current 5650 lines, and
> >> 
> >
> > I can definitely look at spliting mm/memcontrol.c 
> >
> >
> >>  - performance: not incurring any overhead of enabling memcg for per-
> >>    page tracking that is unnecessary if users only want to limit hugetlb 
> >>    pages.
> >> 
> 
> Since Andrew didn't sent the patchset to Linus because of this
> discussion, I looked at reworking the patchset as a seperate
> controller. The patchset I sent here
> 
> http://thread.gmane.org/gmane.linux.kernel.mm/79230
> 
> have seen minimal testing. I also folded the fixup patches
> Andrew had in -mm to original patchset.
> 
> Let me know if the changes looks good.

This is starting to be a problem.  I'm still sitting on the old version
of this patchset and it will start to get in the way of other work.

We now have this new version of the patchset which implements a
separate controller but it is unclear to me which way we want to go.

Can the memcg developers please drop everything else and make a
decision here?

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH -V6 07/14] memcg: Add HugeTLB extension
  2012-06-08 23:06         ` Andrew Morton
@ 2012-06-09 14:16           ` Aneesh Kumar K.V
  2012-06-10  1:55             ` David Rientjes
  2012-06-11  3:55           ` Kamezawa Hiroyuki
  2012-06-11  9:32           ` Michal Hocko
  2 siblings, 1 reply; 44+ messages in thread
From: Aneesh Kumar K.V @ 2012-06-09 14:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Rientjes, linux-mm, mgorman, KAMEZAWA Hiroyuki, dhillf,
	aarcange, mhocko, hannes, linux-kernel, cgroups, Michal Hocko,
	Ying Han

Andrew Morton <akpm@linux-foundation.org> writes:

> On Wed, 30 May 2012 20:13:31 +0530
> "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> wrote:
>
>> >> 
>> >>  - code: seperating hugetlb bits out from memcg bits to avoid growing 
>> >>    mm/memcontrol.c beyond its current 5650 lines, and
>> >> 
>> >
>> > I can definitely look at spliting mm/memcontrol.c 
>> >
>> >
>> >>  - performance: not incurring any overhead of enabling memcg for per-
>> >>    page tracking that is unnecessary if users only want to limit hugetlb 
>> >>    pages.
>> >> 
>> 
>> Since Andrew didn't sent the patchset to Linus because of this
>> discussion, I looked at reworking the patchset as a seperate
>> controller. The patchset I sent here
>> 
>> http://thread.gmane.org/gmane.linux.kernel.mm/79230
>> 
>> have seen minimal testing. I also folded the fixup patches
>> Andrew had in -mm to original patchset.
>> 
>> Let me know if the changes looks good.
>
> This is starting to be a problem.  I'm still sitting on the old version
> of this patchset and it will start to get in the way of other work.
>
> We now have this new version of the patchset which implements a
> separate controller but it is unclear to me which way we want to go.
>
> Can the memcg developers please drop everything else and make a
> decision here?


David Rientjes didn't like the HugeTLB limit being a memcg extension and
wanted this to be a separate controller. I posted a v7 version that did
the HugeTLB limit as a separate controller and used page_cgroup to track
the HugeTLB cgroup. Kamezawa Hiroyuki didn't like the usage of page_cgroup
in the HugeTLB controller (http://mid.gmane.org/4FCD648E.90709@jp.fujitsu.com).

I ended up doing a v8 that used page[2].lru.next for storing the hugetlb
cgroup information.

http://mid.gmane.org/1339232401-14392-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com

I guess that should address all the concerns.

-aneesh


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH -V6 07/14] memcg: Add HugeTLB extension
  2012-06-09 14:16           ` Aneesh Kumar K.V
@ 2012-06-10  1:55             ` David Rientjes
  2012-06-10 15:04               ` Aneesh Kumar K.V
  0 siblings, 1 reply; 44+ messages in thread
From: David Rientjes @ 2012-06-10  1:55 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Andrew Morton, linux-mm, mgorman, KAMEZAWA Hiroyuki, dhillf,
	aarcange, hannes, linux-kernel, cgroups, Michal Hocko, Ying Han

On Sat, 9 Jun 2012, Aneesh Kumar K.V wrote:

> David Rientjes didn't like HugetTLB limit to be a memcg extension and
> wanted this to be a separate controller. I posted a v7 version that did
> HugeTLB limit as a separate controller and used page cgroup to track
> HugeTLB cgroup. Kamezawa Hiroyuki didn't like the usage of page_cgroup
> in HugeTLB controller( http://mid.gmane.org/4FCD648E.90709@jp.fujitsu.com )
> 

Yes, and thank you very much for working on v8 to remove the dependency on 
page_cgroup and to separate this out.  I think it will benefit users who 
don't want to enable all of memcg but still want to account and restrict 
hugetlb page usage, and I think the code separation is much cleaner 
internally.

I'll review that patchset and suggest that the old hugetlb extension in 
-mm be dropped in the interim.


* Re: [PATCH -V6 07/14] memcg: Add HugeTLB extension
  2012-06-10  1:55             ` David Rientjes
@ 2012-06-10 15:04               ` Aneesh Kumar K.V
  0 siblings, 0 replies; 44+ messages in thread
From: Aneesh Kumar K.V @ 2012-06-10 15:04 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, linux-mm, mgorman, KAMEZAWA Hiroyuki, dhillf,
	aarcange, hannes, linux-kernel, cgroups, Michal Hocko, Ying Han

On Sat, Jun 09, 2012 at 06:55:30PM -0700, David Rientjes wrote:
> On Sat, 9 Jun 2012, Aneesh Kumar K.V wrote:
> 
> > David Rientjes didn't like the HugeTLB limit being a memcg extension and
> > wanted it to be a separate controller. I posted a v7 version that did
> > the HugeTLB limit as a separate controller and used page_cgroup to track
> > the HugeTLB cgroup. Kamezawa Hiroyuki didn't like the usage of page_cgroup
> > in the HugeTLB controller (http://mid.gmane.org/4FCD648E.90709@jp.fujitsu.com)
> > 
> 
> Yes, and thank you very much for working on v8 to remove the dependency on 
> page_cgroup and to separate this out.  I think it will benefit users who 
> don't want to enable all of memcg but still want to account and restrict 
> hugetlb page usage, and I think the code separation is much cleaner 
> internally.
> 

I have V9 ready to post. The only change I have against v8 is to fix the
compound_order comparison and fold the charge/uncharge patches into their
users. I will wait for your review feedback before posting V9 so that I can
address the review comments in V9. Once we get V9 out, can we get the series
added to -mm?

> I'll review that patchset and suggest that the old hugetlb extension in 
> -mm be dropped in the interim.
> 

I also agree with dropping the old hugetlb extension patchset in -mm.

-aneesh



* Re: [PATCH -V6 07/14] memcg: Add HugeTLB extension
  2012-06-08 23:06         ` Andrew Morton
  2012-06-09 14:16           ` Aneesh Kumar K.V
@ 2012-06-11  3:55           ` Kamezawa Hiroyuki
  2012-06-11  9:23             ` David Rientjes
  2012-06-11  9:32           ` Michal Hocko
  2 siblings, 1 reply; 44+ messages in thread
From: Kamezawa Hiroyuki @ 2012-06-11  3:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Aneesh Kumar K.V, David Rientjes, linux-mm, mgorman, dhillf,
	aarcange, mhocko, hannes, linux-kernel, cgroups, Ying Han

(2012/06/09 8:06), Andrew Morton wrote:
> On Wed, 30 May 2012 20:13:31 +0530
> "Aneesh Kumar K.V"<aneesh.kumar@linux.vnet.ibm.com>  wrote:
>
>>>>
>>>>   - code: separating hugetlb bits out from memcg bits to avoid growing
>>>>     mm/memcontrol.c beyond its current 5650 lines, and
>>>>
>>>
>>> I can definitely look at splitting mm/memcontrol.c
>>>
>>>
>>>>   - performance: not incurring any overhead of enabling memcg for per-
>>>>     page tracking that is unnecessary if users only want to limit hugetlb
>>>>     pages.
>>>>
>>
>> Since Andrew didn't send the patchset to Linus because of this
>> discussion, I looked at reworking the patchset as a separate
>> controller. The patchset I sent here
>>
>> http://thread.gmane.org/gmane.linux.kernel.mm/79230
>>
>> has seen minimal testing. I also folded the fixup patches
>> Andrew had in -mm into the original patchset.
>>
>> Let me know if the changes look good.
>
> This is starting to be a problem.  I'm still sitting on the old version
> of this patchset and it will start to get in the way of other work.
>
> We now have this new version of the patchset which implements a
> separate controller but it is unclear to me which way we want to go.
>
> Can the memcg developers please drop everything else and make a
> decision here?

Following is a summary in my point of view.
I think there are several topics.

  - overheads.
   (A) IMHO, runtime overhead will be negligible because...
      - if hugetlb is used, anonymous memory accounting doesn't add much
        overhead because those paths are barely used.
      - when it comes to file-cache accounting, I/O dominates performance
        rather than memcg.
      - but you may see some overhead on a 100+ cpu system...I'm not sure.

   (B) memory space overhead will not be negligible.
      - now, memcg uses 16 bytes per page, i.e. 4GB of metadata per 1TB.
        This may be an obvious overhead to the system if the working set
        size is quite big and the apps want to use a huge amount of memory.

   (C) what hugetlbfs is.
    - hugetlb pages are statically allocated, so they're not usual memory.
      From that point of view, a separate hugetlb cgroup is better.

    - IMHO, hugetlb is memory, and I thought memory.limit_in_bytes should
      take it into account....

   (D) code duplication
    - memory cgroup and hugetlb cgroup will have similar hooks, code, and UIs.
    - we need some #ifdefs if we have a consolidated memory/hugetlb cgroup.

   (E) user experience
    - with an independent hugetlb cgroup, users can disable the memory cgroup.
    - with a consolidated memcg+hugetlb cgroup, we'll be able to limit
      usual page + hugetlb usage with a single limit.


Now, I think...

   1. I need to agree that the overhead is _not_ negligible.

   2. THP should be the way rather than hugetlb for my main target platform.
      (shmem/tmpfs should support THP; we need to study that.)
      The user experience should be fixed by THP+tmpfs+memcg.

   3. It seems Aneesh decided to have an independent hugetlb cgroup.

So, now, I agree to having an independent hugetlb cgroup.
Other opinions?

Thanks,
-Kame

* Re: [PATCH -V6 07/14] memcg: Add HugeTLB extension
  2012-06-11  3:55           ` Kamezawa Hiroyuki
@ 2012-06-11  9:23             ` David Rientjes
  2012-06-15 22:31               ` Aditya Kali
  0 siblings, 1 reply; 44+ messages in thread
From: David Rientjes @ 2012-06-11  9:23 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: Andrew Morton, Aneesh Kumar K.V, linux-mm, mgorman, dhillf,
	aarcange, mhocko, hannes, linux-kernel, cgroups, Ying Han

On Mon, 11 Jun 2012, Kamezawa Hiroyuki wrote:

> Now, I think...
> 
>   1. I need to agree that the overhead is _not_ negligible.
> 
>   2. THP should be the way rather than hugetlb for my main target platform.
>      (shmem/tmpfs should support THP; we need to study that.)
>      The user experience should be fixed by THP+tmpfs+memcg.
> 
>   3. It seems Aneesh decided to have an independent hugetlb cgroup.
> 
> So, now, I agree to having an independent hugetlb cgroup.
> Other opinions?
> 

I suggested the separate controller in the review of the patchset, so I 
obviously agree with your conclusion.  I don't think we should account for 
hugetlb pages in memory.usage_in_bytes and enforce memory.limit_in_bytes, 
since 512 4K pages are not the same as one 2M page, which may be a scarce 
resource if fragmentation is high.

Many thanks to Aneesh for continuing to update the patchset and working 
toward a resolution on this; I love the direction it's taking.


* Re: [PATCH -V6 07/14] memcg: Add HugeTLB extension
  2012-06-08 23:06         ` Andrew Morton
  2012-06-09 14:16           ` Aneesh Kumar K.V
  2012-06-11  3:55           ` Kamezawa Hiroyuki
@ 2012-06-11  9:32           ` Michal Hocko
  2 siblings, 0 replies; 44+ messages in thread
From: Michal Hocko @ 2012-06-11  9:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Aneesh Kumar K.V, David Rientjes, linux-mm, mgorman,
	KAMEZAWA Hiroyuki, dhillf, aarcange, hannes, linux-kernel,
	cgroups, Ying Han

On Fri 08-06-12 16:06:12, Andrew Morton wrote:
> On Wed, 30 May 2012 20:13:31 +0530
> "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> wrote:
> 
> > >> 
> > >>  - code: separating hugetlb bits out from memcg bits to avoid growing 
> > >>    mm/memcontrol.c beyond its current 5650 lines, and
> > >> 
> > >
> > > I can definitely look at splitting mm/memcontrol.c 
> > >
> > >
> > >>  - performance: not incurring any overhead of enabling memcg for per-
> > >>    page tracking that is unnecessary if users only want to limit hugetlb 
> > >>    pages.
> > >> 
> > 
> > Since Andrew didn't send the patchset to Linus because of this
> > discussion, I looked at reworking the patchset as a separate
> > controller. The patchset I sent here
> > 
> > http://thread.gmane.org/gmane.linux.kernel.mm/79230
> > 
> > has seen minimal testing. I also folded the fixup patches
> > Andrew had in -mm into the original patchset.
> > 
> > Let me know if the changes look good.
> 
> This is starting to be a problem.  I'm still sitting on the old version
> of this patchset and it will start to get in the way of other work.
> 
> We now have this new version of the patchset which implements a
> separate controller but it is unclear to me which way we want to go.
 
I guess you are talking about v7, which is mem_cgroup based. That one has
some drawbacks (e.g. the most user-visible one is that if one wants to
avoid the memory overhead of memcg, he has to disable the hugetlb
controller as well).
v8 took a different approach ((ab)using lru.next of the 3rd page to store
the group pointer), which looks like a reasonable compromise.

> Can the memcg developers please drop everything else and make a
> decision here?

I think that v8 (+fixups) is the way to go.
-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic


* Re: [PATCH -V6 07/14] memcg: Add HugeTLB extension
  2012-06-11  9:23             ` David Rientjes
@ 2012-06-15 22:31               ` Aditya Kali
  2012-06-16 20:26                 ` David Rientjes
  0 siblings, 1 reply; 44+ messages in thread
From: Aditya Kali @ 2012-06-15 22:31 UTC (permalink / raw)
  To: David Rientjes
  Cc: Kamezawa Hiroyuki, Andrew Morton, Aneesh Kumar K.V, linux-mm,
	mgorman, dhillf, aarcange, mhocko, hannes, linux-kernel, cgroups,
	Ying Han

On Mon, Jun 11, 2012 at 2:23 AM, David Rientjes <rientjes@google.com> wrote:
> On Mon, 11 Jun 2012, Kamezawa Hiroyuki wrote:
>
>> Now, I think...
>>
>>   1. I need to agree that the overhead is _not_ negligible.
>>
>>   2. THP should be the way rather than hugetlb for my main target platform.
>>      (shmem/tmpfs should support THP; we need to study that.)
>>      The user experience should be fixed by THP+tmpfs+memcg.
>>
>>   3. It seems Aneesh decided to have an independent hugetlb cgroup.
>>
>> So, now, I agree to having an independent hugetlb cgroup.
>> Other opinions?
>>
>
> I suggested the separate controller in the review of the patchset, so I
> obviously agree with your conclusion.  I don't think we should account for
> hugetlb pages in memory.usage_in_bytes and enforce memory.limit_in_bytes,
> since 512 4K pages are not the same as one 2M page, which may be a scarce
> resource if fragmentation is high.
>
Based on the usecase at Google, I see definite value in including
hugepage usage in memory.usage_in_bytes as well and having a single
memory limit for the job. Our jobs want to specify only one
(total) memory limit (including slab usage, other kernel memory
usage, hugepages, etc.).

The hugepage/smallpage requirements of a job vary during its
lifetime. Having two different limits means less flexibility for jobs,
as they now have to specify their limits as (max_hugepage,
max_smallpage) instead of a single max(hugepage + smallpage). Two limits
complicate the API for users and require them to over-specify
resources.

> Many thanks to Aneesh for continuing to update the patchset and working
> toward a resolution on this; I love the direction it's taking.

Thanks,
-- 
Aditya


* Re: [PATCH -V6 07/14] memcg: Add HugeTLB extension
  2012-06-15 22:31               ` Aditya Kali
@ 2012-06-16 20:26                 ` David Rientjes
  0 siblings, 0 replies; 44+ messages in thread
From: David Rientjes @ 2012-06-16 20:26 UTC (permalink / raw)
  To: Aditya Kali
  Cc: Kamezawa Hiroyuki, Andrew Morton, Aneesh Kumar K.V, linux-mm,
	mgorman, dhillf, aarcange, mhocko, hannes, linux-kernel, cgroups,
	Ying Han

On Fri, 15 Jun 2012, Aditya Kali wrote:

> Based on the usecase at Google, I see a definite value in including
> hugepage usage in memory.usage_in_bytes as well and having a single
> limit for memory usage for the job. Our jobs wants to specify only one
> (total) memory limit (including slab usage, and other kernel memory
> usage, hugepages, etc.).
> 
> The hugepage/smallpage requirements of the job vary during its
> lifetime. Having two different limits means less flexibility for jobs
> as they now have to specify their limit as (max_hugepage,
> max_smallpage) instead of max(hugepage + smallpage). Two limits
> complicates the API for the users and requires them to over-specify
> the resources.
> 

If a large number of hugepages are allocated on the command line, for 
example because dynamic allocation has a lower success rate due to 
fragmentation, your suggestion would no longer allow the admin to 
restrict the use of those hugepages to only a particular set of tasks.  
Consider especially 1GB hugepages on x86: your suggestion would treat a 
single 1GB hugepage, which cannot be freed after boot, exactly the same 
as 1GB of regular memory, which is obviously not the desired behavior of 
any hugetlb controller.


end of thread, other threads:[~2012-06-16 20:26 UTC | newest]

Thread overview: 44+ messages
2012-04-16 10:44 [PATCH -V6 00/14] memcg: Add memcg extension to control HugeTLB allocation Aneesh Kumar K.V
2012-04-16 10:44 ` [PATCH -V6 01/14] hugetlb: rename max_hstate to hugetlb_max_hstate Aneesh Kumar K.V
2012-05-24 21:11   ` David Rientjes
2012-04-16 10:44 ` [PATCH -V6 02/14] hugetlbfs: don't use ERR_PTR with VM_FAULT* values Aneesh Kumar K.V
2012-05-24 21:17   ` David Rientjes
2012-04-16 10:44 ` [PATCH -V6 03/14] hugetlbfs: Add an inline helper for finding hstate index Aneesh Kumar K.V
2012-05-24 21:22   ` David Rientjes
2012-05-27 20:07     ` Aneesh Kumar K.V
2012-04-16 10:44 ` [PATCH -V6 04/14] hugetlb: Use mmu_gather instead of a temporary linked list for accumulating pages Aneesh Kumar K.V
2012-04-23 23:44   ` Andrew Morton
2012-04-16 10:44 ` [PATCH -V6 05/14] hugetlb: Avoid taking i_mmap_mutex in unmap_single_vma for hugetlb Aneesh Kumar K.V
2012-04-16 10:44 ` [PATCH -V6 06/14] hugetlb: Simplify migrate_huge_page Aneesh Kumar K.V
2012-05-24 21:35   ` David Rientjes
2012-05-27 20:13     ` Aneesh Kumar K.V
2012-04-16 10:44 ` [PATCH -V6 07/14] memcg: Add HugeTLB extension Aneesh Kumar K.V
2012-05-02  0:20   ` Paul Gortmaker
2012-05-03  4:37     ` Aneesh Kumar K.V
2012-05-24 21:52   ` David Rientjes
2012-05-24 22:57     ` Andrew Morton
2012-05-24 23:20       ` David Rientjes
2012-05-27 20:28     ` Aneesh Kumar K.V
2012-05-30 14:43       ` Aneesh Kumar K.V
2012-06-08 23:06         ` Andrew Morton
2012-06-09 14:16           ` Aneesh Kumar K.V
2012-06-10  1:55             ` David Rientjes
2012-06-10 15:04               ` Aneesh Kumar K.V
2012-06-11  3:55           ` Kamezawa Hiroyuki
2012-06-11  9:23             ` David Rientjes
2012-06-15 22:31               ` Aditya Kali
2012-06-16 20:26                 ` David Rientjes
2012-06-11  9:32           ` Michal Hocko
2012-04-16 10:44 ` [PATCH -V6 08/14] hugetlb: add charge/uncharge calls for HugeTLB alloc/free Aneesh Kumar K.V
2012-04-16 10:44 ` [PATCH -V6 09/14] memcg: track resource index in cftype private Aneesh Kumar K.V
2012-04-16 10:44 ` [PATCH -V6 10/14] hugetlbfs: Add memcg control files for hugetlbfs Aneesh Kumar K.V
2012-04-16 23:13   ` Andrew Morton
2012-04-18  6:15     ` [PATCH] memcg: Use scnprintf instead of sprintf Aneesh Kumar K.V
2012-04-18 22:36       ` Andrew Morton
2012-04-19  8:26         ` Andreas Schwab
2012-04-18  6:16     ` [PATCH -V6 10/14] hugetlbfs: Add memcg control files for hugetlbfs Aneesh Kumar K.V
2012-04-16 10:44 ` [PATCH -V6 11/14] hugetlbfs: Add a list for tracking in-use HugeTLB pages Aneesh Kumar K.V
2012-04-16 10:44 ` [PATCH -V6 12/14] memcg: move HugeTLB resource count to parent cgroup on memcg removal Aneesh Kumar K.V
2012-04-23 22:45   ` Andrew Morton
2012-04-16 10:44 ` [PATCH -V6 13/14] hugetlb: migrate memcg info from oldpage to new page during migration Aneesh Kumar K.V
2012-04-16 10:44 ` [PATCH -V6 14/14] memcg: Add memory controller documentation for hugetlb management Aneesh Kumar K.V
