linux-kernel.vger.kernel.org archive mirror
* [PATCH -V5 00/14] memcg: Add memcg extension to control HugeTLB allocation
@ 2012-04-06 18:50 Aneesh Kumar K.V
  2012-04-06 18:50 ` [PATCH -V5 01/14] hugetlb: rename max_hstate to hugetlb_max_hstate Aneesh Kumar K.V
                   ` (13 more replies)
  0 siblings, 14 replies; 27+ messages in thread
From: Aneesh Kumar K.V @ 2012-04-06 18:50 UTC (permalink / raw)
  To: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
	akpm, hannes
  Cc: linux-kernel, cgroups

Hi,

This patchset implements a memory controller extension to control
HugeTLB allocations. The extension allows limiting the HugeTLB
usage per control group and enforces the controller limit at page
fault time. Since HugeTLB doesn't support page reclaim, enforcing
the limit at page fault time implies that the application will get
a SIGBUS signal if it tries to access HugeTLB pages beyond its limit.
This requires the application to know beforehand how many HugeTLB
pages it would require for its use.
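
To make that failure mode concrete, below is a minimal userspace sketch
(illustrative only, not part of this series) that maps anonymous HugeTLB
memory and catches the SIGBUS a task would receive once it faults in
hugepages beyond its cgroup limit. The mapping size and the fallback
MAP_HUGETLB definition are assumptions made for the example.

#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MAP_HUGETLB
#define MAP_HUGETLB 0x40000	/* x86 value; assumed here for the example */
#endif

static void sigbus_handler(int sig)
{
	/* With this series, faulting past the cgroup limit lands here. */
	(void)sig;
	write(2, "SIGBUS: over the hugepage limit?\n", 33);
	_exit(1);
}

int main(void)
{
	size_t len = 4UL << 20;		/* e.g. two 2MB hugepages */
	char *p;

	signal(SIGBUS, sigbus_handler);
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	memset(p, 0, len);		/* faults the hugepages in */
	printf("touched %zu bytes of hugepage memory\n", len);
	return 0;
}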

The goal is to control how many HugeTLB pages a group of tasks can
allocate. It can be looked at as an extension of the existing quota
interface, which limits the number of HugeTLB pages per hugetlbfs
superblock. HPC job schedulers require jobs to specify their resource
requirements in the job file. Once their requirements can be met,
job schedulers (like SLURM) will schedule the job. We need to make sure
that the jobs won't consume more resources than requested. If they do,
we should either error out or kill the application.

Patches are on top of
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-3.5

Changes from V4:
 * Add support for charge/uncharge during page migration
 * Drop the usage of page->lru in unmap_hugepage_range.

Changes from v3:
 * Address review feedback.
 * Fix a bug in parent charging during cgroup removal with use_hierarchy set

Changes from V2:
* Changed the implementation to limit the HugeTLB usage at page
  fault time. This simplifies the extension and keeps it closer to
  the memcg design. This also allows supporting cgroup removal with
  less complexity. The only caveat is that the application should
  ensure its HugeTLB usage doesn't cross the cgroup limit.

Changes from V1:
* Changed the implementation to be a memcg extension. We still use
  the same logic to track the cgroup and range.

Changes from RFC post:
* Added support for HugeTLB cgroup hierarchy
* Added support for task migration
* Added documentation patch
* Other bug fixes

-aneesh



^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH -V5 01/14] hugetlb: rename max_hstate to hugetlb_max_hstate
  2012-04-06 18:50 [PATCH -V5 00/14] memcg: Add memcg extension to control HugeTLB allocation Aneesh Kumar K.V
@ 2012-04-06 18:50 ` Aneesh Kumar K.V
  2012-04-06 18:50 ` [PATCH -V5 02/14] hugetlbfs: don't use ERR_PTR with VM_FAULT* values Aneesh Kumar K.V
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 27+ messages in thread
From: Aneesh Kumar K.V @ 2012-04-06 18:50 UTC (permalink / raw)
  To: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
	akpm, hannes
  Cc: linux-kernel, cgroups, Aneesh Kumar K.V

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

We will be using this from other subsystems like memcg
in later patches.

Acked-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 mm/hugetlb.c |   14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index b8ce6f4..766eb90 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -34,7 +34,7 @@ const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL;
 static gfp_t htlb_alloc_mask = GFP_HIGHUSER;
 unsigned long hugepages_treat_as_movable;
 
-static int max_hstate;
+static int hugetlb_max_hstate;
 unsigned int default_hstate_idx;
 struct hstate hstates[HUGE_MAX_HSTATE];
 
@@ -46,7 +46,7 @@ static unsigned long __initdata default_hstate_max_huge_pages;
 static unsigned long __initdata default_hstate_size;
 
 #define for_each_hstate(h) \
-	for ((h) = hstates; (h) < &hstates[max_hstate]; (h)++)
+	for ((h) = hstates; (h) < &hstates[hugetlb_max_hstate]; (h)++)
 
 /*
  * Protects updates to hugepage_freelists, nr_huge_pages, and free_huge_pages
@@ -1897,9 +1897,9 @@ void __init hugetlb_add_hstate(unsigned order)
 		printk(KERN_WARNING "hugepagesz= specified twice, ignoring\n");
 		return;
 	}
-	BUG_ON(max_hstate >= HUGE_MAX_HSTATE);
+	BUG_ON(hugetlb_max_hstate >= HUGE_MAX_HSTATE);
 	BUG_ON(order == 0);
-	h = &hstates[max_hstate++];
+	h = &hstates[hugetlb_max_hstate++];
 	h->order = order;
 	h->mask = ~((1ULL << (order + PAGE_SHIFT)) - 1);
 	h->nr_huge_pages = 0;
@@ -1920,10 +1920,10 @@ static int __init hugetlb_nrpages_setup(char *s)
 	static unsigned long *last_mhp;
 
 	/*
-	 * !max_hstate means we haven't parsed a hugepagesz= parameter yet,
+	 * !hugetlb_max_hstate means we haven't parsed a hugepagesz= parameter yet,
 	 * so this hugepages= parameter goes to the "default hstate".
 	 */
-	if (!max_hstate)
+	if (!hugetlb_max_hstate)
 		mhp = &default_hstate_max_huge_pages;
 	else
 		mhp = &parsed_hstate->max_huge_pages;
@@ -1942,7 +1942,7 @@ static int __init hugetlb_nrpages_setup(char *s)
 	 * But we need to allocate >= MAX_ORDER hstates here early to still
 	 * use the bootmem allocator.
 	 */
-	if (max_hstate && parsed_hstate->order >= MAX_ORDER)
+	if (hugetlb_max_hstate && parsed_hstate->order >= MAX_ORDER)
 		hugetlb_hstate_alloc_pages(parsed_hstate);
 
 	last_mhp = mhp;
-- 
1.7.10.rc3.3.g19a6c


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH -V5 02/14] hugetlbfs: don't use ERR_PTR with VM_FAULT* values
  2012-04-06 18:50 [PATCH -V5 00/14] memcg: Add memcg extension to control HugeTLB allocation Aneesh Kumar K.V
  2012-04-06 18:50 ` [PATCH -V5 01/14] hugetlb: rename max_hstate to hugetlb_max_hstate Aneesh Kumar K.V
@ 2012-04-06 18:50 ` Aneesh Kumar K.V
  2012-04-06 18:50 ` [PATCH -V5 03/14] hugetlbfs: Add an inline helper for finding hstate index Aneesh Kumar K.V
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 27+ messages in thread
From: Aneesh Kumar K.V @ 2012-04-06 18:50 UTC (permalink / raw)
  To: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
	akpm, hannes
  Cc: linux-kernel, cgroups, Aneesh Kumar K.V

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

Using VM_FAULT_* codes with ERR_PTR would require us to make sure
that the VM_FAULT_* values do not exceed MAX_ERRNO.
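
For reference, here is a small standalone sketch of the ERR_PTR()/IS_ERR()
convention (modelled on the kernel's definitions, not code from this patch):
only the top MAX_ERRNO (4095) values of the address space are treated as
encoded errors, so stuffing VM_FAULT_* bit values into ERR_PTR() would
silently depend on those values staying that small.

/* Standalone model of the kernel's error-pointer encoding. */
#include <stdio.h>

#define MAX_ERRNO	4095
#define IS_ERR_VALUE(x)	((unsigned long)(x) >= (unsigned long)-MAX_ERRNO)

static inline void *ERR_PTR(long error)	{ return (void *)error; }
static inline long PTR_ERR(const void *ptr)	{ return (long)ptr; }
static inline int IS_ERR(const void *ptr)
{
	return IS_ERR_VALUE((unsigned long)ptr);
}

int main(void)
{
	void *page = ERR_PTR(-12);	/* -ENOMEM */

	if (IS_ERR(page))
		printf("allocation failed, errno %ld\n", -PTR_ERR(page));
	return 0;
}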

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 mm/hugetlb.c |   18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 766eb90..5a472a5 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1123,10 +1123,10 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
 	 */
 	chg = vma_needs_reservation(h, vma, addr);
 	if (chg < 0)
-		return ERR_PTR(-VM_FAULT_OOM);
+		return ERR_PTR(-ENOMEM);
 	if (chg)
 		if (hugepage_subpool_get_pages(spool, chg))
-			return ERR_PTR(-VM_FAULT_SIGBUS);
+			return ERR_PTR(-ENOSPC);
 
 	spin_lock(&hugetlb_lock);
 	page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve);
@@ -1136,7 +1136,7 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
 		page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
 		if (!page) {
 			hugepage_subpool_put_pages(spool, chg);
-			return ERR_PTR(-VM_FAULT_SIGBUS);
+			return ERR_PTR(-ENOSPC);
 		}
 	}
 
@@ -2486,6 +2486,7 @@ retry_avoidcopy:
 	new_page = alloc_huge_page(vma, address, outside_reserve);
 
 	if (IS_ERR(new_page)) {
+		int err = PTR_ERR(new_page);
 		page_cache_release(old_page);
 
 		/*
@@ -2515,7 +2516,10 @@ retry_avoidcopy:
 
 		/* Caller expects lock to be held */
 		spin_lock(&mm->page_table_lock);
-		return -PTR_ERR(new_page);
+		if (err == -ENOMEM)
+			return VM_FAULT_OOM;
+		else
+			return VM_FAULT_SIGBUS;
 	}
 
 	/*
@@ -2633,7 +2637,11 @@ retry:
 			goto out;
 		page = alloc_huge_page(vma, address, 0);
 		if (IS_ERR(page)) {
-			ret = -PTR_ERR(page);
+			ret = PTR_ERR(page);
+			if (ret == -ENOMEM)
+				ret = VM_FAULT_OOM;
+			else
+				ret = VM_FAULT_SIGBUS;
 			goto out;
 		}
 		clear_huge_page(page, address, pages_per_huge_page(h));
-- 
1.7.10.rc3.3.g19a6c


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH -V5 03/14] hugetlbfs: Add an inline helper for finding hstate index
  2012-04-06 18:50 [PATCH -V5 00/14] memcg: Add memcg extension to control HugeTLB allocation Aneesh Kumar K.V
  2012-04-06 18:50 ` [PATCH -V5 01/14] hugetlb: rename max_hstate to hugetlb_max_hstate Aneesh Kumar K.V
  2012-04-06 18:50 ` [PATCH -V5 02/14] hugetlbfs: don't use ERR_PTR with VM_FAULT* values Aneesh Kumar K.V
@ 2012-04-06 18:50 ` Aneesh Kumar K.V
  2012-04-06 18:50 ` [PATCH -V5 04/14] hugetlb: Use mmu_gather instead of a temporary linked list for accumulating pages Aneesh Kumar K.V
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 27+ messages in thread
From: Aneesh Kumar K.V @ 2012-04-06 18:50 UTC (permalink / raw)
  To: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
	akpm, hannes
  Cc: linux-kernel, cgroups, Aneesh Kumar K.V

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

Add an inline helper and use it in the code.

Acked-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 include/linux/hugetlb.h |    6 ++++++
 mm/hugetlb.c            |   18 ++++++++++--------
 2 files changed, 16 insertions(+), 8 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 000837e..876457e 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -294,6 +294,11 @@ static inline unsigned hstate_index_to_shift(unsigned index)
 	return hstates[index].order + PAGE_SHIFT;
 }
 
+static inline int hstate_index(struct hstate *h)
+{
+	return h - hstates;
+}
+
 #else
 struct hstate {};
 #define alloc_huge_page_node(h, nid) NULL
@@ -312,6 +317,7 @@ static inline unsigned int pages_per_huge_page(struct hstate *h)
 	return 1;
 }
 #define hstate_index_to_shift(index) 0
+#define hstate_index(h) 0
 #endif
 
 #endif /* _LINUX_HUGETLB_H */
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 5a472a5..d94c987 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1646,7 +1646,7 @@ static int hugetlb_sysfs_add_hstate(struct hstate *h, struct kobject *parent,
 				    struct attribute_group *hstate_attr_group)
 {
 	int retval;
-	int hi = h - hstates;
+	int hi = hstate_index(h);
 
 	hstate_kobjs[hi] = kobject_create_and_add(h->name, parent);
 	if (!hstate_kobjs[hi])
@@ -1741,11 +1741,13 @@ void hugetlb_unregister_node(struct node *node)
 	if (!nhs->hugepages_kobj)
 		return;		/* no hstate attributes */
 
-	for_each_hstate(h)
-		if (nhs->hstate_kobjs[h - hstates]) {
-			kobject_put(nhs->hstate_kobjs[h - hstates]);
-			nhs->hstate_kobjs[h - hstates] = NULL;
+	for_each_hstate(h) {
+		int idx = hstate_index(h);
+		if (nhs->hstate_kobjs[idx]) {
+			kobject_put(nhs->hstate_kobjs[idx]);
+			nhs->hstate_kobjs[idx] = NULL;
 		}
+	}
 
 	kobject_put(nhs->hugepages_kobj);
 	nhs->hugepages_kobj = NULL;
@@ -1848,7 +1850,7 @@ static void __exit hugetlb_exit(void)
 	hugetlb_unregister_all_nodes();
 
 	for_each_hstate(h) {
-		kobject_put(hstate_kobjs[h - hstates]);
+		kobject_put(hstate_kobjs[hstate_index(h)]);
 	}
 
 	kobject_put(hugepages_kobj);
@@ -2678,7 +2680,7 @@ retry:
 		 */
 		if (unlikely(PageHWPoison(page))) {
 			ret = VM_FAULT_HWPOISON |
-			      VM_FAULT_SET_HINDEX(h - hstates);
+				VM_FAULT_SET_HINDEX(hstate_index(h));
 			goto backout_unlocked;
 		}
 	}
@@ -2751,7 +2753,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 			return 0;
 		} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
 			return VM_FAULT_HWPOISON_LARGE |
-			       VM_FAULT_SET_HINDEX(h - hstates);
+				VM_FAULT_SET_HINDEX(hstate_index(h));
 	}
 
 	ptep = huge_pte_alloc(mm, address, huge_page_size(h));
-- 
1.7.10.rc3.3.g19a6c


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH -V5 04/14] hugetlb: Use mmu_gather instead of a temporary linked list for accumulating pages
  2012-04-06 18:50 [PATCH -V5 00/14] memcg: Add memcg extension to control HugeTLB allocation Aneesh Kumar K.V
                   ` (2 preceding siblings ...)
  2012-04-06 18:50 ` [PATCH -V5 03/14] hugetlbfs: Add an inline helper for finding hstate index Aneesh Kumar K.V
@ 2012-04-06 18:50 ` Aneesh Kumar K.V
  2012-04-09  5:36   ` KAMEZAWA Hiroyuki
  2012-04-06 18:50 ` [PATCH -V5 05/14] hugetlb: Avoid taking i_mmap_mutex in unmap_single_vma for hugetlb Aneesh Kumar K.V
                   ` (9 subsequent siblings)
  13 siblings, 1 reply; 27+ messages in thread
From: Aneesh Kumar K.V @ 2012-04-06 18:50 UTC (permalink / raw)
  To: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
	akpm, hannes
  Cc: linux-kernel, cgroups, Aneesh Kumar K.V

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

Use mmu_gather instead of a temporary linked list for accumulating pages
when we unmap a hugepage range. This also allows us to get rid of the
i_mmap_mutex lock around unmap_hugepage_range in the following patch.
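
The batching idea, reduced to a standalone toy (an analogy only, not the
kernel's mmu_gather implementation): pages are collected into a bounded
batch and released with one flush, and when the batch runs out of room the
caller flushes early and continues, instead of threading every page onto
an unbounded list via page->lru.

/* Toy model of batch-and-flush gathering; names are illustrative only. */
#include <stdio.h>

#define BATCH_SIZE 8

struct gather {
	int nr;
	int pages[BATCH_SIZE];
};

/* Returns 0 when the batch is full and the caller must flush. */
static int gather_add(struct gather *g, int page)
{
	g->pages[g->nr++] = page;
	return g->nr < BATCH_SIZE;
}

static void gather_flush(struct gather *g)
{
	/* One "TLB flush" and free pass for the whole batch. */
	printf("flushing %d pages\n", g->nr);
	g->nr = 0;
}

int main(void)
{
	struct gather g = { .nr = 0 };
	int page;

	for (page = 0; page < 20; page++) {
		if (!gather_add(&g, page))
			gather_flush(&g);	/* ran out of room: flush early */
	}
	if (g.nr)
		gather_flush(&g);		/* final flush */
	return 0;
}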

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 fs/hugetlbfs/inode.c    |    4 ++--
 include/linux/hugetlb.h |    6 ++---
 mm/hugetlb.c            |   59 ++++++++++++++++++++++++++++-------------------
 mm/memory.c             |    7 ++++--
 4 files changed, 45 insertions(+), 31 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index ea25174..92f75aa 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -416,8 +416,8 @@ hugetlb_vmtruncate_list(struct prio_tree_root *root, pgoff_t pgoff)
 		else
 			v_offset = 0;
 
-		__unmap_hugepage_range(vma,
-				vma->vm_start + v_offset, vma->vm_end, NULL);
+		unmap_hugepage_range(vma, vma->vm_start + v_offset,
+				     vma->vm_end, NULL);
 	}
 }
 
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 876457e..46c6cbd 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -40,9 +40,9 @@ int follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
 			struct page **, struct vm_area_struct **,
 			unsigned long *, int *, int, unsigned int flags);
 void unmap_hugepage_range(struct vm_area_struct *,
-			unsigned long, unsigned long, struct page *);
-void __unmap_hugepage_range(struct vm_area_struct *,
-			unsigned long, unsigned long, struct page *);
+			  unsigned long, unsigned long, struct page *);
+void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *,
+			    unsigned long, unsigned long, struct page *);
 int hugetlb_prefault(struct address_space *, struct vm_area_struct *);
 void hugetlb_report_meminfo(struct seq_file *);
 int hugetlb_report_node_meminfo(int, char *);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d94c987..a3ac624 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -24,8 +24,9 @@
 
 #include <asm/page.h>
 #include <asm/pgtable.h>
-#include <linux/io.h>
+#include <asm/tlb.h>
 
+#include <linux/io.h>
 #include <linux/hugetlb.h>
 #include <linux/node.h>
 #include "internal.h"
@@ -2300,30 +2301,26 @@ static int is_hugetlb_entry_hwpoisoned(pte_t pte)
 		return 0;
 }
 
-void __unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
-			    unsigned long end, struct page *ref_page)
+void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
+			    unsigned long start, unsigned long end,
+			    struct page *ref_page)
 {
+	int force_flush = 0;
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long address;
 	pte_t *ptep;
 	pte_t pte;
 	struct page *page;
-	struct page *tmp;
 	struct hstate *h = hstate_vma(vma);
 	unsigned long sz = huge_page_size(h);
 
-	/*
-	 * A page gathering list, protected by per file i_mmap_mutex. The
-	 * lock is used to avoid list corruption from multiple unmapping
-	 * of the same page since we are using page->lru.
-	 */
-	LIST_HEAD(page_list);
-
 	WARN_ON(!is_vm_hugetlb_page(vma));
 	BUG_ON(start & ~huge_page_mask(h));
 	BUG_ON(end & ~huge_page_mask(h));
 
+	tlb_start_vma(tlb, vma);
 	mmu_notifier_invalidate_range_start(mm, start, end);
+again:
 	spin_lock(&mm->page_table_lock);
 	for (address = start; address < end; address += sz) {
 		ptep = huge_pte_offset(mm, address);
@@ -2362,30 +2359,45 @@ void __unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
 		}
 
 		pte = huge_ptep_get_and_clear(mm, address, ptep);
+		tlb_remove_tlb_entry(tlb, ptep, address);
 		if (pte_dirty(pte))
 			set_page_dirty(page);
-		list_add(&page->lru, &page_list);
 
+		page_remove_rmap(page);
+		force_flush = !__tlb_remove_page(tlb, page);
+		if (force_flush)
+			break;
 		/* Bail out after unmapping reference page if supplied */
 		if (ref_page)
 			break;
 	}
-	flush_tlb_range(vma, start, end);
 	spin_unlock(&mm->page_table_lock);
-	mmu_notifier_invalidate_range_end(mm, start, end);
-	list_for_each_entry_safe(page, tmp, &page_list, lru) {
-		page_remove_rmap(page);
-		list_del(&page->lru);
-		put_page(page);
+	/*
+	 * mmu_gather ran out of room to batch pages, we break out of
+	 * the PTE lock to avoid doing the potential expensive TLB invalidate
+	 * and page-free while holding it.
+	 */
+	if (force_flush) {
+		force_flush = 0;
+		tlb_flush_mmu(tlb);
+		if (address < end && !ref_page)
+			goto again;
 	}
+	mmu_notifier_invalidate_range_end(mm, start, end);
+	tlb_end_vma(tlb, vma);
 }
 
 void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
 			  unsigned long end, struct page *ref_page)
 {
-	mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
-	__unmap_hugepage_range(vma, start, end, ref_page);
-	mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
+	struct mm_struct *mm;
+	struct mmu_gather tlb;
+
+	mm = vma->vm_mm;
+
+	tlb_gather_mmu(&tlb, mm, 0);
+	__unmap_hugepage_range(&tlb, vma, start, end, ref_page);
+	tlb_finish_mmu(&tlb, start, end);
 }
 
 /*
@@ -2430,9 +2442,8 @@ static int unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * from the time of fork. This would look like data corruption
 		 */
 		if (!is_vma_resv_set(iter_vma, HPAGE_RESV_OWNER))
-			__unmap_hugepage_range(iter_vma,
-				address, address + huge_page_size(h),
-				page);
+			unmap_hugepage_range(iter_vma, address,
+					     address + huge_page_size(h), page);
 	}
 	mutex_unlock(&mapping->i_mmap_mutex);
 
diff --git a/mm/memory.c b/mm/memory.c
index 6105f47..4b11961 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1326,8 +1326,11 @@ static void unmap_single_vma(struct mmu_gather *tlb,
 			 * Since no pte has actually been setup, it is
 			 * safe to do nothing in this case.
 			 */
-			if (vma->vm_file)
-				unmap_hugepage_range(vma, start, end, NULL);
+			if (vma->vm_file) {
+				mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
+				__unmap_hugepage_range(tlb, vma, start, end, NULL);
+				mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
+			}
 		} else
 			unmap_page_range(tlb, vma, start, end, details);
 	}
-- 
1.7.10.rc3.3.g19a6c


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH -V5 05/14] hugetlb: Avoid taking i_mmap_mutex in unmap_single_vma for hugetlb
  2012-04-06 18:50 [PATCH -V5 00/14] memcg: Add memcg extension to control HugeTLB allocation Aneesh Kumar K.V
                   ` (3 preceding siblings ...)
  2012-04-06 18:50 ` [PATCH -V5 04/14] hugetlb: Use mmu_gather instead of a temporary linked list for accumulating pages Aneesh Kumar K.V
@ 2012-04-06 18:50 ` Aneesh Kumar K.V
  2012-04-06 18:50 ` [PATCH -V5 06/14] hugetlb: Simplify migrate_huge_page Aneesh Kumar K.V
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 27+ messages in thread
From: Aneesh Kumar K.V @ 2012-04-06 18:50 UTC (permalink / raw)
  To: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
	akpm, hannes
  Cc: linux-kernel, cgroups, Aneesh Kumar K.V

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

The i_mmap_mutex lock was added in unmap_single_vma by commit
502717f4e112b18d9c37753a32f675bec9f2838b. But we don't use page->lru in
unmap_hugepage_range any more. Also, the lock is taken higher up in the
stack in some code paths, which would result in a deadlock:

unmap_mapping_range (i_mmap_mutex)
 -> unmap_mapping_range_tree
    -> unmap_mapping_range_vma
       -> zap_page_range_single
          -> unmap_single_vma
             -> unmap_hugepage_range (i_mmap_mutex)

For shared pagetable support for huge pages, since pagetable pages are
ref counted, we don't need any lock during huge_pmd_unshare. We do take
i_mmap_mutex in huge_pmd_share while walking the vma_prio_tree in the
mapping (commit 39dde65c9940c97fcd178a3d2b1c57ed8b7b68aa).

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 mm/memory.c |    5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 4b11961..d642b3e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1326,11 +1326,8 @@ static void unmap_single_vma(struct mmu_gather *tlb,
 			 * Since no pte has actually been setup, it is
 			 * safe to do nothing in this case.
 			 */
-			if (vma->vm_file) {
-				mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
+			if (vma->vm_file)
 				__unmap_hugepage_range(tlb, vma, start, end, NULL);
-				mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
-			}
 		} else
 			unmap_page_range(tlb, vma, start, end, details);
 	}
-- 
1.7.10.rc3.3.g19a6c


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH -V5 06/14] hugetlb: Simplify migrate_huge_page
  2012-04-06 18:50 [PATCH -V5 00/14] memcg: Add memcg extension to control HugeTLB allocation Aneesh Kumar K.V
                   ` (4 preceding siblings ...)
  2012-04-06 18:50 ` [PATCH -V5 05/14] hugetlb: Avoid taking i_mmap_mutex in unmap_single_vma for hugetlb Aneesh Kumar K.V
@ 2012-04-06 18:50 ` Aneesh Kumar K.V
  2012-04-09  5:47   ` KAMEZAWA Hiroyuki
  2012-04-06 18:50 ` [PATCH -V5 07/14] memcg: Add HugeTLB extension Aneesh Kumar K.V
                   ` (7 subsequent siblings)
  13 siblings, 1 reply; 27+ messages in thread
From: Aneesh Kumar K.V @ 2012-04-06 18:50 UTC (permalink / raw)
  To: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
	akpm, hannes
  Cc: linux-kernel, cgroups, Aneesh Kumar K.V

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

Since we migrate only one hugepage, don't use a linked list for passing
the page around. Directly pass the page that needs to be migrated as an
argument. This also removes the usage of page->lru in the migrate path.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 include/linux/migrate.h |    4 +--
 mm/memory-failure.c     |   13 ++--------
 mm/migrate.c            |   65 +++++++++++++++--------------------------------
 3 files changed, 25 insertions(+), 57 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 855c337..ce7e667 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -15,7 +15,7 @@ extern int migrate_page(struct address_space *,
 extern int migrate_pages(struct list_head *l, new_page_t x,
 			unsigned long private, bool offlining,
 			enum migrate_mode mode);
-extern int migrate_huge_pages(struct list_head *l, new_page_t x,
+extern int migrate_huge_page(struct page *, new_page_t x,
 			unsigned long private, bool offlining,
 			enum migrate_mode mode);
 
@@ -36,7 +36,7 @@ static inline void putback_lru_pages(struct list_head *l) {}
 static inline int migrate_pages(struct list_head *l, new_page_t x,
 		unsigned long private, bool offlining,
 		enum migrate_mode mode) { return -ENOSYS; }
-static inline int migrate_huge_pages(struct list_head *l, new_page_t x,
+static inline int migrate_huge_page(struct page *page, new_page_t x,
 		unsigned long private, bool offlining,
 		enum migrate_mode mode) { return -ENOSYS; }
 
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 97cc273..1f092db 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1414,7 +1414,6 @@ static int soft_offline_huge_page(struct page *page, int flags)
 	int ret;
 	unsigned long pfn = page_to_pfn(page);
 	struct page *hpage = compound_head(page);
-	LIST_HEAD(pagelist);
 
 	ret = get_any_page(page, pfn, flags);
 	if (ret < 0)
@@ -1429,19 +1428,11 @@ static int soft_offline_huge_page(struct page *page, int flags)
 	}
 
 	/* Keep page count to indicate a given hugepage is isolated. */
-
-	list_add(&hpage->lru, &pagelist);
-	ret = migrate_huge_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL, 0,
-				true);
+	ret = migrate_huge_page(page, new_page, MPOL_MF_MOVE_ALL, 0, true);
+	put_page(page);
 	if (ret) {
-		struct page *page1, *page2;
-		list_for_each_entry_safe(page1, page2, &pagelist, lru)
-			put_page(page1);
-
 		pr_info("soft offline: %#lx: migration failed %d, type %lx\n",
 			pfn, ret, page->flags);
-		if (ret > 0)
-			ret = -EIO;
 		return ret;
 	}
 done:
diff --git a/mm/migrate.c b/mm/migrate.c
index 51c08a0..d7eb82d 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -929,15 +929,8 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
 	if (anon_vma)
 		put_anon_vma(anon_vma);
 	unlock_page(hpage);
-
 out:
-	if (rc != -EAGAIN) {
-		list_del(&hpage->lru);
-		put_page(hpage);
-	}
-
 	put_page(new_hpage);
-
 	if (result) {
 		if (rc)
 			*result = rc;
@@ -1013,48 +1006,32 @@ out:
 	return nr_failed + retry;
 }
 
-int migrate_huge_pages(struct list_head *from,
-		new_page_t get_new_page, unsigned long private, bool offlining,
-		enum migrate_mode mode)
+int migrate_huge_page(struct page *hpage, new_page_t get_new_page,
+		      unsigned long private, bool offlining,
+		      enum migrate_mode mode)
 {
-	int retry = 1;
-	int nr_failed = 0;
-	int pass = 0;
-	struct page *page;
-	struct page *page2;
-	int rc;
-
-	for (pass = 0; pass < 10 && retry; pass++) {
-		retry = 0;
-
-		list_for_each_entry_safe(page, page2, from, lru) {
+	int pass, rc;
+
+	for (pass = 0; pass < 10; pass++) {
+		rc = unmap_and_move_huge_page(get_new_page,
+					      private, hpage, pass > 2, offlining,
+					      mode);
+		switch (rc) {
+		case -ENOMEM:
+			goto out;
+		case -EAGAIN:
+			/* try again */
 			cond_resched();
-
-			rc = unmap_and_move_huge_page(get_new_page,
-					private, page, pass > 2, offlining,
-					mode);
-
-			switch(rc) {
-			case -ENOMEM:
-				goto out;
-			case -EAGAIN:
-				retry++;
-				break;
-			case 0:
-				break;
-			default:
-				/* Permanent failure */
-				nr_failed++;
-				break;
-			}
+			break;
+		case 0:
+			goto out;
+		default:
+			rc = -EIO;
+			goto out;
 		}
 	}
-	rc = 0;
 out:
-	if (rc)
-		return rc;
-
-	return nr_failed + retry;
+	return rc;
 }
 
 #ifdef CONFIG_NUMA
-- 
1.7.10.rc3.3.g19a6c


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH -V5 07/14] memcg: Add HugeTLB extension
  2012-04-06 18:50 [PATCH -V5 00/14] memcg: Add memcg extension to control HugeTLB allocation Aneesh Kumar K.V
                   ` (5 preceding siblings ...)
  2012-04-06 18:50 ` [PATCH -V5 06/14] hugetlb: Simplify migrate_huge_page Aneesh Kumar K.V
@ 2012-04-06 18:50 ` Aneesh Kumar K.V
  2012-04-09  6:04   ` KAMEZAWA Hiroyuki
  2012-04-06 18:50 ` [PATCH -V5 08/14] hugetlb: add charge/uncharge calls for HugeTLB alloc/free Aneesh Kumar K.V
                   ` (6 subsequent siblings)
  13 siblings, 1 reply; 27+ messages in thread
From: Aneesh Kumar K.V @ 2012-04-06 18:50 UTC (permalink / raw)
  To: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
	akpm, hannes
  Cc: linux-kernel, cgroups, Aneesh Kumar K.V

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

This patch implements a memcg extension that allows us to control HugeTLB
allocations via the memory controller. The extension allows limiting the
HugeTLB usage per control group and enforces the controller limit at page
fault time. Since HugeTLB doesn't support page reclaim, enforcing the limit
at page fault time implies that the application will get a SIGBUS signal if
it tries to access HugeTLB pages beyond its limit. This requires the
application to know beforehand how many HugeTLB pages it would require for
its use.

The charge/uncharge calls will be added to the HugeTLB code in a later patch.
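
As a rough sketch of the intended calling protocol (the real call sites are
added to alloc_huge_page() and free_huge_page() in a later patch of this
series): charge first, then allocate, then either commit the charge to the
page or give it back on failure. The toy below models only the counter side
of that protocol; the names and sizes are illustrative, not the kernel
interfaces.

/* Toy model of the charge -> commit / uncharge protocol (analogy only). */
#include <stdio.h>

struct counter {
	unsigned long usage;
	unsigned long limit;
};

static int counter_charge(struct counter *c, unsigned long bytes)
{
	if (c->usage + bytes > c->limit)
		return -1;	/* the caller maps this to SIGBUS at fault time */
	c->usage += bytes;
	return 0;
}

static void counter_uncharge(struct counter *c, unsigned long bytes)
{
	c->usage -= bytes;
}

int main(void)
{
	struct counter hugepage = { .usage = 0, .limit = 4UL << 20 };
	unsigned long hpage = 2UL << 20;	/* one 2MB hugepage */
	int i;

	for (i = 0; i < 3; i++) {
		if (counter_charge(&hugepage, hpage)) {
			printf("fault %d: over limit -> SIGBUS\n", i);
			continue;
		}
		/* allocation would happen here; on failure, uncharge again */
		printf("fault %d: charged, usage now %lu bytes\n",
		       i, hugepage.usage);
	}
	counter_uncharge(&hugepage, hpage);	/* page freed in the destructor */
	return 0;
}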

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 include/linux/hugetlb.h    |    1 +
 include/linux/memcontrol.h |   42 ++++++++++++++
 init/Kconfig               |    8 +++
 mm/hugetlb.c               |    2 +-
 mm/memcontrol.c            |  132 ++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 184 insertions(+), 1 deletion(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 46c6cbd..995c238 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -226,6 +226,7 @@ struct hstate *size_to_hstate(unsigned long size);
 #define HUGE_MAX_HSTATE 1
 #endif
 
+extern int hugetlb_max_hstate;
 extern struct hstate hstates[HUGE_MAX_HSTATE];
 extern unsigned int default_hstate_idx;
 
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index f94efd2..1d07e14 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -448,5 +448,47 @@ static inline void sock_release_memcg(struct sock *sk)
 {
 }
 #endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
+
+#ifdef CONFIG_MEM_RES_CTLR_HUGETLB
+extern int mem_cgroup_hugetlb_charge_page(int idx, unsigned long nr_pages,
+					  struct mem_cgroup **ptr);
+extern void mem_cgroup_hugetlb_commit_charge(int idx, unsigned long nr_pages,
+					     struct mem_cgroup *memcg,
+					     struct page *page);
+extern void mem_cgroup_hugetlb_uncharge_page(int idx, unsigned long nr_pages,
+					     struct page *page);
+extern void mem_cgroup_hugetlb_uncharge_memcg(int idx, unsigned long nr_pages,
+					      struct mem_cgroup *memcg);
+
+#else
+static inline int
+mem_cgroup_hugetlb_charge_page(int idx, unsigned long nr_pages,
+						 struct mem_cgroup **ptr)
+{
+	return 0;
+}
+
+static inline void
+mem_cgroup_hugetlb_commit_charge(int idx, unsigned long nr_pages,
+				 struct mem_cgroup *memcg,
+				 struct page *page)
+{
+	return;
+}
+
+static inline void
+mem_cgroup_hugetlb_uncharge_page(int idx, unsigned long nr_pages,
+				 struct page *page)
+{
+	return;
+}
+
+static inline void
+mem_cgroup_hugetlb_uncharge_memcg(int idx, unsigned long nr_pages,
+				  struct mem_cgroup *memcg)
+{
+	return;
+}
+#endif  /* CONFIG_MEM_RES_CTLR_HUGETLB */
 #endif /* _LINUX_MEMCONTROL_H */
 
diff --git a/init/Kconfig b/init/Kconfig
index 72f33fa..a3b5665 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -716,6 +716,14 @@ config CGROUP_PERF
 
 	  Say N if unsure.
 
+config MEM_RES_CTLR_HUGETLB
+	bool "Memory Resource Controller HugeTLB Extension (EXPERIMENTAL)"
+	depends on CGROUP_MEM_RES_CTLR && HUGETLB_PAGE && EXPERIMENTAL
+	default n
+	help
+	  Add HugeTLB management to memory resource controller. When you
+	  enable this, you can put a per cgroup limit on HugeTLB usage.
+
 menuconfig CGROUP_SCHED
 	bool "Group CPU scheduler"
 	default n
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a3ac624..8cd89b4 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -35,7 +35,7 @@ const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL;
 static gfp_t htlb_alloc_mask = GFP_HIGHUSER;
 unsigned long hugepages_treat_as_movable;
 
-static int hugetlb_max_hstate;
+int hugetlb_max_hstate;
 unsigned int default_hstate_idx;
 struct hstate hstates[HUGE_MAX_HSTATE];
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d28359c..1a2e041 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -252,6 +252,10 @@ struct mem_cgroup {
 	};
 
 	/*
+	 * the counter to account for hugepages from hugetlb.
+	 */
+	struct res_counter hugepage[HUGE_MAX_HSTATE];
+	/*
 	 * Per cgroup active and inactive list, similar to the
 	 * per zone LRU lists.
 	 */
@@ -3213,6 +3217,114 @@ static inline int mem_cgroup_move_swap_account(swp_entry_t entry,
 }
 #endif
 
+#ifdef CONFIG_MEM_RES_CTLR_HUGETLB
+static bool mem_cgroup_have_hugetlb_usage(struct mem_cgroup *memcg)
+{
+	int idx;
+	for (idx = 0; idx < hugetlb_max_hstate; idx++) {
+		if ((res_counter_read_u64(&memcg->hugepage[idx], RES_USAGE)) > 0)
+			return 1;
+	}
+	return 0;
+}
+
+int mem_cgroup_hugetlb_charge_page(int idx, unsigned long nr_pages,
+				   struct mem_cgroup **ptr)
+{
+	int ret = 0;
+	struct mem_cgroup *memcg = NULL;
+	struct res_counter *fail_res;
+	unsigned long csize = nr_pages * PAGE_SIZE;
+
+	if (mem_cgroup_disabled())
+		goto done;
+again:
+	rcu_read_lock();
+	memcg = mem_cgroup_from_task(current);
+	if (!memcg)
+		memcg = root_mem_cgroup;
+
+	if (!css_tryget(&memcg->css)) {
+		rcu_read_unlock();
+		goto again;
+	}
+	rcu_read_unlock();
+
+	ret = res_counter_charge(&memcg->hugepage[idx], csize, &fail_res);
+	css_put(&memcg->css);
+done:
+	*ptr = memcg;
+	return ret;
+}
+
+void mem_cgroup_hugetlb_commit_charge(int idx, unsigned long nr_pages,
+				      struct mem_cgroup *memcg,
+				      struct page *page)
+{
+	struct page_cgroup *pc;
+
+	if (mem_cgroup_disabled())
+		return;
+
+	pc = lookup_page_cgroup(page);
+	lock_page_cgroup(pc);
+	if (unlikely(PageCgroupUsed(pc))) {
+		unlock_page_cgroup(pc);
+		mem_cgroup_hugetlb_uncharge_memcg(idx, nr_pages, memcg);
+		return;
+	}
+	pc->mem_cgroup = memcg;
+	SetPageCgroupUsed(pc);
+	unlock_page_cgroup(pc);
+	return;
+}
+
+void mem_cgroup_hugetlb_uncharge_page(int idx, unsigned long nr_pages,
+				      struct page *page)
+{
+	struct page_cgroup *pc;
+	struct mem_cgroup *memcg;
+	unsigned long csize = nr_pages * PAGE_SIZE;
+
+	if (mem_cgroup_disabled())
+		return;
+
+	pc = lookup_page_cgroup(page);
+	if (unlikely(!PageCgroupUsed(pc)))
+		return;
+
+	lock_page_cgroup(pc);
+	if (!PageCgroupUsed(pc)) {
+		unlock_page_cgroup(pc);
+		return;
+	}
+	memcg = pc->mem_cgroup;
+	pc->mem_cgroup = root_mem_cgroup;
+	ClearPageCgroupUsed(pc);
+	unlock_page_cgroup(pc);
+
+	res_counter_uncharge(&memcg->hugepage[idx], csize);
+	return;
+}
+
+void mem_cgroup_hugetlb_uncharge_memcg(int idx, unsigned long nr_pages,
+				       struct mem_cgroup *memcg)
+{
+	unsigned long csize = nr_pages * PAGE_SIZE;
+
+	if (mem_cgroup_disabled())
+		return;
+
+	res_counter_uncharge(&memcg->hugepage[idx], csize);
+	return;
+}
+#else
+static bool mem_cgroup_have_hugetlb_usage(struct mem_cgroup *memcg)
+{
+	return 0;
+}
+#endif /* CONFIG_MEM_RES_CTLR_HUGETLB */
+
 /*
  * Before starting migration, account PAGE_SIZE to mem_cgroup that the old
  * page belongs to.
@@ -4962,6 +5074,7 @@ err_cleanup:
 static struct cgroup_subsys_state * __ref
 mem_cgroup_create(struct cgroup *cont)
 {
+	int idx;
 	struct mem_cgroup *memcg, *parent;
 	long error = -ENOMEM;
 	int node;
@@ -5004,9 +5117,22 @@ mem_cgroup_create(struct cgroup *cont)
 		 * mem_cgroup(see mem_cgroup_put).
 		 */
 		mem_cgroup_get(parent);
+		/*
+		 * We could get called before hugetlb init is called.
+		 * Use HUGE_MAX_HSTATE as the max index.
+		 */
+		for (idx = 0; idx < HUGE_MAX_HSTATE; idx++)
+			res_counter_init(&memcg->hugepage[idx],
+					 &parent->hugepage[idx]);
 	} else {
 		res_counter_init(&memcg->res, NULL);
 		res_counter_init(&memcg->memsw, NULL);
+		/*
+		 * We could get called before hugetlb init is called.
+		 * Use HUGE_MAX_HSTATE as the max index.
+		 */
+		for (idx = 0; idx < HUGE_MAX_HSTATE; idx++)
+			res_counter_init(&memcg->hugepage[idx], NULL);
 	}
 	memcg->last_scanned_node = MAX_NUMNODES;
 	INIT_LIST_HEAD(&memcg->oom_notify);
@@ -5026,6 +5152,12 @@ free_out:
 static int mem_cgroup_pre_destroy(struct cgroup *cont)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
+	/*
+	 * Don't allow memcg removal if we have HugeTLB resource
+	 * usage.
+	 */
+	if (mem_cgroup_have_hugetlb_usage(memcg))
+		return -EBUSY;
 
 	return mem_cgroup_force_empty(memcg, false);
 }
-- 
1.7.10.rc3.3.g19a6c


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH -V5 08/14] hugetlb: add charge/uncharge calls for HugeTLB alloc/free
  2012-04-06 18:50 [PATCH -V5 00/14] memcg: Add memcg extension to control HugeTLB allocation Aneesh Kumar K.V
                   ` (6 preceding siblings ...)
  2012-04-06 18:50 ` [PATCH -V5 07/14] memcg: Add HugeTLB extension Aneesh Kumar K.V
@ 2012-04-06 18:50 ` Aneesh Kumar K.V
  2012-04-06 18:50 ` [PATCH -V5 09/14] memcg: track resource index in cftype private Aneesh Kumar K.V
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 27+ messages in thread
From: Aneesh Kumar K.V @ 2012-04-06 18:50 UTC (permalink / raw)
  To: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
	akpm, hannes
  Cc: linux-kernel, cgroups, Aneesh Kumar K.V

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

This adds the necessary charge/uncharge calls in the HugeTLB code. We do
the memcg charge in page alloc and the uncharge in the compound page
destructor. We also need to ignore HugeTLB pages in
__mem_cgroup_uncharge_common because that gets called from
delete_from_page_cache.

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 mm/hugetlb.c    |   20 +++++++++++++++++++-
 mm/memcontrol.c |    5 +++++
 2 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 8cd89b4..dd00087 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -21,6 +21,8 @@
 #include <linux/rmap.h>
 #include <linux/swap.h>
 #include <linux/swapops.h>
+#include <linux/memcontrol.h>
+#include <linux/page_cgroup.h>
 
 #include <asm/page.h>
 #include <asm/pgtable.h>
@@ -628,6 +630,8 @@ static void free_huge_page(struct page *page)
 	BUG_ON(page_mapcount(page));
 	INIT_LIST_HEAD(&page->lru);
 
+	mem_cgroup_hugetlb_uncharge_page(hstate_index(h),
+					 pages_per_huge_page(h), page);
 	spin_lock(&hugetlb_lock);
 	if (h->surplus_huge_pages_node[nid] && huge_page_order(h) < MAX_ORDER) {
 		update_and_free_page(h, page);
@@ -1113,7 +1117,10 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
 	struct hstate *h = hstate_vma(vma);
 	struct page *page;
 	long chg;
+	int ret, idx;
+	struct mem_cgroup *memcg;
 
+	idx = hstate_index(h);
 	/*
 	 * Processes that did not create the mapping will have no
 	 * reserves and will not have accounted against subpool
@@ -1129,6 +1136,12 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
 		if (hugepage_subpool_get_pages(spool, chg))
 			return ERR_PTR(-ENOSPC);
 
+	ret = mem_cgroup_hugetlb_charge_page(idx, pages_per_huge_page(h),
+					     &memcg);
+	if (ret) {
+		hugepage_subpool_put_pages(spool, chg);
+		return ERR_PTR(-ENOSPC);
+	}
 	spin_lock(&hugetlb_lock);
 	page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve);
 	spin_unlock(&hugetlb_lock);
@@ -1136,6 +1149,9 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
 	if (!page) {
 		page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
 		if (!page) {
+			mem_cgroup_hugetlb_uncharge_memcg(idx,
+							  pages_per_huge_page(h),
+							  memcg);
 			hugepage_subpool_put_pages(spool, chg);
 			return ERR_PTR(-ENOSPC);
 		}
@@ -1144,7 +1160,9 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
 	set_page_private(page, (unsigned long)spool);
 
 	vma_commit_reservation(h, vma, addr);
-
+	/* update page cgroup details */
+	mem_cgroup_hugetlb_commit_charge(idx, pages_per_huge_page(h),
+					 memcg, page);
 	return page;
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1a2e041..0a1f776 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2966,6 +2966,11 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
 
 	if (PageSwapCache(page))
 		return NULL;
+	/*
+	 * HugeTLB page uncharge happen in the HugeTLB compound page destructor
+	 */
+	if (PageHuge(page))
+		return NULL;
 
 	if (PageTransHuge(page)) {
 		nr_pages <<= compound_order(page);
-- 
1.7.10.rc3.3.g19a6c


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH -V5 09/14] memcg: track resource index in cftype private
  2012-04-06 18:50 [PATCH -V5 00/14] memcg: Add memcg extension to control HugeTLB allocation Aneesh Kumar K.V
                   ` (7 preceding siblings ...)
  2012-04-06 18:50 ` [PATCH -V5 08/14] hugetlb: add charge/uncharge calls for HugeTLB alloc/free Aneesh Kumar K.V
@ 2012-04-06 18:50 ` Aneesh Kumar K.V
  2012-04-09  5:56   ` KAMEZAWA Hiroyuki
  2012-04-06 18:50 ` [PATCH -V5 10/14] hugetlbfs: Add memcg control files for hugetlbfs Aneesh Kumar K.V
                   ` (4 subsequent siblings)
  13 siblings, 1 reply; 27+ messages in thread
From: Aneesh Kumar K.V @ 2012-04-06 18:50 UTC (permalink / raw)
  To: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
	akpm, hannes
  Cc: linux-kernel, cgroups, Aneesh Kumar K.V

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

This helps in using the same memcg callbacks for non-reclaim resource
control files.
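
A small standalone check of how the widened ->private value packs the hstate
index together with the file type and attribute, using the same macros this
patch introduces (the numeric type/attr/index values below are illustrative):

/* Standalone check of the cftype->private packing used below. */
#include <stdio.h>

/* bits 0-15: attr, bits 16-23: type, bits 24-31: hstate index */
#define __MEMFILE_PRIVATE(idx, x, val)	(((idx) << 24) | ((x) << 16) | (val))
#define MEMFILE_TYPE(val)		(((val) >> 16) & 0xff)
#define MEMFILE_IDX(val)		(((val) >> 24) & 0xff)
#define MEMFILE_ATTR(val)		((val) & 0xffff)

int main(void)
{
	/* hstate index 2, type _MEMHUGETLB (3 in this patch), attr 1 */
	int private = __MEMFILE_PRIVATE(2, 3, 1);

	printf("idx=%d type=%d attr=%d\n",
	       MEMFILE_IDX(private), MEMFILE_TYPE(private),
	       MEMFILE_ATTR(private));	/* prints: idx=2 type=3 attr=1 */
	return 0;
}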

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 mm/memcontrol.c |   27 +++++++++++++++++++++------
 1 file changed, 21 insertions(+), 6 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0a1f776..3bb3b42 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -381,9 +381,14 @@ enum charge_type {
 #define _MEM			(0)
 #define _MEMSWAP		(1)
 #define _OOM_TYPE		(2)
-#define MEMFILE_PRIVATE(x, val)	(((x) << 16) | (val))
-#define MEMFILE_TYPE(val)	(((val) >> 16) & 0xffff)
-#define MEMFILE_ATTR(val)	((val) & 0xffff)
+#define _MEMHUGETLB		(3)
+
+/*  0 ... val ...16.... x...24...idx...32*/
+#define __MEMFILE_PRIVATE(idx, x, val)	(((idx) << 24) | ((x) << 16) | (val))
+#define MEMFILE_PRIVATE(x, val)		__MEMFILE_PRIVATE(0, x, val)
+#define MEMFILE_TYPE(val)		(((val) >> 16) & 0xff)
+#define MEMFILE_IDX(val)		(((val) >> 24) & 0xff)
+#define MEMFILE_ATTR(val)		((val) & 0xffff)
 /* Used for OOM nofiier */
 #define OOM_CONTROL		(0)
 
@@ -4003,7 +4008,7 @@ static ssize_t mem_cgroup_read(struct cgroup *cont, struct cftype *cft,
 	struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
 	char str[64];
 	u64 val;
-	int type, name, len;
+	int type, name, len, idx;
 
 	type = MEMFILE_TYPE(cft->private);
 	name = MEMFILE_ATTR(cft->private);
@@ -4024,6 +4029,10 @@ static ssize_t mem_cgroup_read(struct cgroup *cont, struct cftype *cft,
 		else
 			val = res_counter_read_u64(&memcg->memsw, name);
 		break;
+	case _MEMHUGETLB:
+		idx = MEMFILE_IDX(cft->private);
+		val = res_counter_read_u64(&memcg->hugepage[idx], name);
+		break;
 	default:
 		BUG();
 	}
@@ -4061,7 +4070,10 @@ static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft,
 			break;
 		if (type == _MEM)
 			ret = mem_cgroup_resize_limit(memcg, val);
-		else
+		else if (type == _MEMHUGETLB) {
+			int idx = MEMFILE_IDX(cft->private);
+			ret = res_counter_set_limit(&memcg->hugepage[idx], val);
+		} else
 			ret = mem_cgroup_resize_memsw_limit(memcg, val);
 		break;
 	case RES_SOFT_LIMIT:
@@ -4127,7 +4139,10 @@ static int mem_cgroup_reset(struct cgroup *cont, unsigned int event)
 	case RES_MAX_USAGE:
 		if (type == _MEM)
 			res_counter_reset_max(&memcg->res);
-		else
+		else if (type == _MEMHUGETLB) {
+			int idx = MEMFILE_IDX(event);
+			res_counter_reset_max(&memcg->hugepage[idx]);
+		} else
 			res_counter_reset_max(&memcg->memsw);
 		break;
 	case RES_FAILCNT:
-- 
1.7.10.rc3.3.g19a6c


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH -V5 10/14] hugetlbfs: Add memcg control files for hugetlbfs
  2012-04-06 18:50 [PATCH -V5 00/14] memcg: Add memcg extension to control HugeTLB allocation Aneesh Kumar K.V
                   ` (8 preceding siblings ...)
  2012-04-06 18:50 ` [PATCH -V5 09/14] memcg: track resource index in cftype private Aneesh Kumar K.V
@ 2012-04-06 18:50 ` Aneesh Kumar K.V
  2012-04-09  6:00   ` KAMEZAWA Hiroyuki
  2012-04-06 18:50 ` [PATCH -V5 11/14] hugetlbfs: Add a list for tracking in-use HugeTLB pages Aneesh Kumar K.V
                   ` (3 subsequent siblings)
  13 siblings, 1 reply; 27+ messages in thread
From: Aneesh Kumar K.V @ 2012-04-06 18:50 UTC (permalink / raw)
  To: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
	akpm, hannes
  Cc: linux-kernel, cgroups, Aneesh Kumar K.V

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

This adds control files for hugetlbfs to memcg.
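
As a quick standalone check of the resulting per-hstate file names (using
the mem_fmt() helper and the name format string from the patch; the 2MB
hstate is just an example):

/* Standalone check of the control-file naming done by mem_fmt() below. */
#include <stdio.h>

static char *mem_fmt(char *buf, unsigned long n)
{
	if (n >= (1UL << 30))
		sprintf(buf, "%luGB", n >> 30);
	else if (n >= (1UL << 20))
		sprintf(buf, "%luMB", n >> 20);
	else
		sprintf(buf, "%luKB", n >> 10);
	return buf;
}

int main(void)
{
	char buf[32], name[64];

	mem_fmt(buf, 2UL << 20);	/* a 2MB hstate */
	snprintf(name, sizeof(name), "hugetlb.%s.limit_in_bytes", buf);
	printf("%s\n", name);		/* hugetlb.2MB.limit_in_bytes */
	return 0;
}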

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 include/linux/hugetlb.h    |    5 +++++
 include/linux/memcontrol.h |    7 ++++++
 mm/hugetlb.c               |    2 +-
 mm/memcontrol.c            |   51 ++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 64 insertions(+), 1 deletion(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 995c238..d008342 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -4,6 +4,7 @@
 #include <linux/mm_types.h>
 #include <linux/fs.h>
 #include <linux/hugetlb_inline.h>
+#include <linux/cgroup.h>
 
 struct ctl_table;
 struct user_struct;
@@ -203,6 +204,10 @@ struct hstate {
 	unsigned int nr_huge_pages_node[MAX_NUMNODES];
 	unsigned int free_huge_pages_node[MAX_NUMNODES];
 	unsigned int surplus_huge_pages_node[MAX_NUMNODES];
+#ifdef CONFIG_MEM_RES_CTLR_HUGETLB
+	/* mem cgroup control files */
+	struct cftype mem_cgroup_files[4];
+#endif
 	char name[HSTATE_NAME_LEN];
 };
 
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 1d07e14..4f17574 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -459,6 +459,7 @@ extern void mem_cgroup_hugetlb_uncharge_page(int idx, unsigned long nr_pages,
 					     struct page *page);
 extern void mem_cgroup_hugetlb_uncharge_memcg(int idx, unsigned long nr_pages,
 					      struct mem_cgroup *memcg);
+extern int mem_cgroup_hugetlb_file_init(int idx) __init;
 
 #else
 static inline int
@@ -489,6 +490,12 @@ mem_cgroup_hugetlb_uncharge_memcg(int idx, unsigned long nr_pages,
 {
 	return;
 }
+
+static inline int mem_cgroup_hugetlb_file_init(int idx)
+{
+	return 0;
+}
+
 #endif  /* CONFIG_MEM_RES_CTLR_HUGETLB */
 #endif /* _LINUX_MEMCONTROL_H */
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index dd00087..340e575 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1931,7 +1931,7 @@ void __init hugetlb_add_hstate(unsigned order)
 	h->next_nid_to_free = first_node(node_states[N_HIGH_MEMORY]);
 	snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB",
 					huge_page_size(h)/1024);
-
+	mem_cgroup_hugetlb_file_init(hugetlb_max_hstate - 1);
 	parsed_hstate = h;
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3bb3b42..7d3330e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5191,6 +5191,57 @@ static void mem_cgroup_destroy(struct cgroup *cont)
 	mem_cgroup_put(memcg);
 }
 
+#ifdef CONFIG_MEM_RES_CTLR_HUGETLB
+static char *mem_fmt(char *buf, unsigned long n)
+{
+	if (n >= (1UL << 30))
+		sprintf(buf, "%luGB", n >> 30);
+	else if (n >= (1UL << 20))
+		sprintf(buf, "%luMB", n >> 20);
+	else
+		sprintf(buf, "%luKB", n >> 10);
+	return buf;
+}
+
+int __init mem_cgroup_hugetlb_file_init(int idx)
+{
+	char buf[32];
+	struct cftype *cft;
+	struct hstate *h = &hstates[idx];
+
+	/* format the size */
+	mem_fmt(buf, huge_page_size(h));
+
+	/* Add the limit file */
+	cft = &h->mem_cgroup_files[0];
+	snprintf(cft->name, MAX_CFTYPE_NAME, "hugetlb.%s.limit_in_bytes", buf);
+	cft->private = __MEMFILE_PRIVATE(idx, _MEMHUGETLB, RES_LIMIT);
+	cft->read = mem_cgroup_read;
+	cft->write_string = mem_cgroup_write;
+
+	/* Add the usage file */
+	cft = &h->mem_cgroup_files[1];
+	snprintf(cft->name, MAX_CFTYPE_NAME, "hugetlb.%s.usage_in_bytes", buf);
+	cft->private  = __MEMFILE_PRIVATE(idx, _MEMHUGETLB, RES_USAGE);
+	cft->read = mem_cgroup_read;
+
+	/* Add the MAX usage file */
+	cft = &h->mem_cgroup_files[2];
+	snprintf(cft->name, MAX_CFTYPE_NAME, "hugetlb.%s.max_usage_in_bytes", buf);
+	cft->private  = __MEMFILE_PRIVATE(idx, _MEMHUGETLB, RES_MAX_USAGE);
+	cft->trigger  = mem_cgroup_reset;
+	cft->read = mem_cgroup_read;
+
+	/* NULL terminate the last cft */
+	cft = &h->mem_cgroup_files[3];
+	memset(cft, 0, sizeof(*cft));
+
+	WARN_ON(cgroup_add_cftypes(&mem_cgroup_subsys, h->mem_cgroup_files));
+
+	return 0;
+}
+#endif
+
 static int mem_cgroup_populate(struct cgroup_subsys *ss,
 				struct cgroup *cont)
 {
-- 
1.7.10.rc3.3.g19a6c


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH -V5 11/14] hugetlbfs: Add a list for tracking in-use HugeTLB pages
  2012-04-06 18:50 [PATCH -V5 00/14] memcg: Add memcg extension to control HugeTLB allocation Aneesh Kumar K.V
                   ` (9 preceding siblings ...)
  2012-04-06 18:50 ` [PATCH -V5 10/14] hugetlbfs: Add memcg control files for hugetlbfs Aneesh Kumar K.V
@ 2012-04-06 18:50 ` Aneesh Kumar K.V
  2012-04-06 18:50 ` [PATCH -V5 12/14] memcg: move HugeTLB resource count to parent cgroup on memcg removal Aneesh Kumar K.V
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 27+ messages in thread
From: Aneesh Kumar K.V @ 2012-04-06 18:50 UTC (permalink / raw)
  To: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
	akpm, hannes
  Cc: linux-kernel, cgroups, Aneesh Kumar K.V

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

hugepage_activelist will be used to track currently used HugeTLB pages.
We need to find the in-use HugeTLB pages to support memcg removal.
On memcg removal we update the page's memory cgroup to point to the
parent cgroup.

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 include/linux/hugetlb.h |    1 +
 mm/hugetlb.c            |   12 +++++++-----
 2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index d008342..6bf6afc 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -200,6 +200,7 @@ struct hstate {
 	unsigned long resv_huge_pages;
 	unsigned long surplus_huge_pages;
 	unsigned long nr_overcommit_huge_pages;
+	struct list_head hugepage_activelist;
 	struct list_head hugepage_freelists[MAX_NUMNODES];
 	unsigned int nr_huge_pages_node[MAX_NUMNODES];
 	unsigned int free_huge_pages_node[MAX_NUMNODES];
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 340e575..8a520b5 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -512,7 +512,7 @@ void copy_huge_page(struct page *dst, struct page *src)
 static void enqueue_huge_page(struct hstate *h, struct page *page)
 {
 	int nid = page_to_nid(page);
-	list_add(&page->lru, &h->hugepage_freelists[nid]);
+	list_move(&page->lru, &h->hugepage_freelists[nid]);
 	h->free_huge_pages++;
 	h->free_huge_pages_node[nid]++;
 }
@@ -524,7 +524,7 @@ static struct page *dequeue_huge_page_node(struct hstate *h, int nid)
 	if (list_empty(&h->hugepage_freelists[nid]))
 		return NULL;
 	page = list_entry(h->hugepage_freelists[nid].next, struct page, lru);
-	list_del(&page->lru);
+	list_move(&page->lru, &h->hugepage_activelist);
 	set_page_refcounted(page);
 	h->free_huge_pages--;
 	h->free_huge_pages_node[nid]--;
@@ -628,12 +628,13 @@ static void free_huge_page(struct page *page)
 	page->mapping = NULL;
 	BUG_ON(page_count(page));
 	BUG_ON(page_mapcount(page));
-	INIT_LIST_HEAD(&page->lru);
 
 	mem_cgroup_hugetlb_uncharge_page(hstate_index(h),
 					 pages_per_huge_page(h), page);
 	spin_lock(&hugetlb_lock);
 	if (h->surplus_huge_pages_node[nid] && huge_page_order(h) < MAX_ORDER) {
+		/* remove the page from active list */
+		list_del(&page->lru);
 		update_and_free_page(h, page);
 		h->surplus_huge_pages--;
 		h->surplus_huge_pages_node[nid]--;
@@ -646,6 +647,7 @@ static void free_huge_page(struct page *page)
 
 static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
 {
+	INIT_LIST_HEAD(&page->lru);
 	set_compound_page_dtor(page, free_huge_page);
 	spin_lock(&hugetlb_lock);
 	h->nr_huge_pages++;
@@ -894,6 +896,7 @@ static struct page *alloc_buddy_huge_page(struct hstate *h, int nid)
 
 	spin_lock(&hugetlb_lock);
 	if (page) {
+		INIT_LIST_HEAD(&page->lru);
 		r_nid = page_to_nid(page);
 		set_compound_page_dtor(page, free_huge_page);
 		/*
@@ -998,7 +1001,6 @@ retry:
 	list_for_each_entry_safe(page, tmp, &surplus_list, lru) {
 		if ((--needed) < 0)
 			break;
-		list_del(&page->lru);
 		/*
 		 * This page is now managed by the hugetlb allocator and has
 		 * no users -- drop the buddy allocator's reference.
@@ -1013,7 +1015,6 @@ free:
 	/* Free unnecessary surplus pages to the buddy allocator */
 	if (!list_empty(&surplus_list)) {
 		list_for_each_entry_safe(page, tmp, &surplus_list, lru) {
-			list_del(&page->lru);
 			put_page(page);
 		}
 	}
@@ -1927,6 +1928,7 @@ void __init hugetlb_add_hstate(unsigned order)
 	h->free_huge_pages = 0;
 	for (i = 0; i < MAX_NUMNODES; ++i)
 		INIT_LIST_HEAD(&h->hugepage_freelists[i]);
+	INIT_LIST_HEAD(&h->hugepage_activelist);
 	h->next_nid_to_alloc = first_node(node_states[N_HIGH_MEMORY]);
 	h->next_nid_to_free = first_node(node_states[N_HIGH_MEMORY]);
 	snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB",
-- 
1.7.10.rc3.3.g19a6c


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH -V5 12/14] memcg: move HugeTLB resource count to parent cgroup on memcg removal
  2012-04-06 18:50 [PATCH -V5 00/14] memcg: Add memcg extension to control HugeTLB allocation Aneesh Kumar K.V
                   ` (10 preceding siblings ...)
  2012-04-06 18:50 ` [PATCH -V5 11/14] hugetlbfs: Add a list for tracking in-use HugeTLB pages Aneesh Kumar K.V
@ 2012-04-06 18:50 ` Aneesh Kumar K.V
  2012-04-09  6:16   ` KAMEZAWA Hiroyuki
  2012-04-06 18:50 ` [PATCH -V5 13/14] hugetlb: migrate memcg info from oldpage to new page during migration Aneesh Kumar K.V
  2012-04-06 18:51 ` [PATCH -V5 14/14] memcg: Add memory controller documentation for hugetlb management Aneesh Kumar K.V
  13 siblings, 1 reply; 27+ messages in thread
From: Aneesh Kumar K.V @ 2012-04-06 18:50 UTC (permalink / raw)
  To: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
	akpm, hannes
  Cc: linux-kernel, cgroups, Aneesh Kumar K.V

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

This adds support for removing a memcg while it has HugeTLB resource usage.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 include/linux/hugetlb.h    |    8 ++++++
 include/linux/memcontrol.h |   14 +++++++++
 mm/hugetlb.c               |   43 ++++++++++++++++++++++++++++
 mm/memcontrol.c            |   68 +++++++++++++++++++++++++++++++++++++-------
 4 files changed, 123 insertions(+), 10 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 6bf6afc..bada0ac 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -327,4 +327,12 @@ static inline unsigned int pages_per_huge_page(struct hstate *h)
 #define hstate_index(h) 0
 #endif
 
+#ifdef CONFIG_MEM_RES_CTLR_HUGETLB
+extern int hugetlb_force_memcg_empty(struct cgroup *cgroup);
+#else
+static inline int hugetlb_force_memcg_empty(struct cgroup *cgroup)
+{
+	return 0;
+}
+#endif
 #endif /* _LINUX_HUGETLB_H */
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 4f17574..70317e5 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -460,6 +460,9 @@ extern void mem_cgroup_hugetlb_uncharge_page(int idx, unsigned long nr_pages,
 extern void mem_cgroup_hugetlb_uncharge_memcg(int idx, unsigned long nr_pages,
 					      struct mem_cgroup *memcg);
 extern int mem_cgroup_hugetlb_file_init(int idx) __init;
+extern int mem_cgroup_move_hugetlb_parent(int idx, struct cgroup *cgroup,
+					  struct page *page);
+extern bool mem_cgroup_have_hugetlb_usage(struct cgroup *cgroup);
 
 #else
 static inline int
@@ -496,6 +499,17 @@ static inline int mem_cgroup_hugetlb_file_init(int idx)
 	return 0;
 }
 
+static inline int
+mem_cgroup_move_hugetlb_parent(int idx, struct cgroup *cgroup,
+			       struct page *page)
+{
+	return 0;
+}
+
+static inline bool mem_cgroup_have_hugetlb_usage(struct cgroup *cgroup)
+{
+	return 0;
+}
 #endif  /* CONFIG_MEM_RES_CTLR_HUGETLB */
 #endif /* _LINUX_MEMCONTROL_H */
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 8a520b5..1d3c8ea9 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1909,6 +1909,49 @@ static int __init hugetlb_init(void)
 }
 module_init(hugetlb_init);
 
+#ifdef CONFIG_MEM_RES_CTLR_HUGETLB
+/*
+ * Force the memcg to empty the hugetlb resources by moving them to
+ * the parent cgroup. We can fail if the parent cgroup's limit prevented
+ * the charging. This should only happen if use_hierarchy is not set.
+ */
+int hugetlb_force_memcg_empty(struct cgroup *cgroup)
+{
+	struct hstate *h;
+	struct page *page;
+	int ret = 0, idx = 0;
+
+	do {
+		if (cgroup_task_count(cgroup) || !list_empty(&cgroup->children))
+			goto out;
+		/*
+		 * If the task doing the cgroup_rmdir got a signal
+		 * we don't really need to loop till the hugetlb resource
+		 * usage become zero.
+		 */
+		if (signal_pending(current)) {
+			ret = -EINTR;
+			goto out;
+		}
+		for_each_hstate(h) {
+			spin_lock(&hugetlb_lock);
+			list_for_each_entry(page, &h->hugepage_activelist, lru) {
+				ret = mem_cgroup_move_hugetlb_parent(idx, cgroup, page);
+				if (ret) {
+					spin_unlock(&hugetlb_lock);
+					goto out;
+				}
+			}
+			spin_unlock(&hugetlb_lock);
+			idx++;
+		}
+		cond_resched();
+	} while (mem_cgroup_have_hugetlb_usage(cgroup));
+out:
+	return ret;
+}
+#endif
+
 /* Should be called on processing a hugepagesz=... option */
 void __init hugetlb_add_hstate(unsigned order)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7d3330e..7b6e79a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3228,9 +3228,11 @@ static inline int mem_cgroup_move_swap_account(swp_entry_t entry,
 #endif
 
 #ifdef CONFIG_MEM_RES_CTLR_HUGETLB
-static bool mem_cgroup_have_hugetlb_usage(struct mem_cgroup *memcg)
+bool mem_cgroup_have_hugetlb_usage(struct cgroup *cgroup)
 {
 	int idx;
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgroup);
+
 	for (idx = 0; idx < hugetlb_max_hstate; idx++) {
 		if ((res_counter_read_u64(&memcg->hugepage[idx], RES_USAGE)) > 0)
 			return 1;
@@ -3328,10 +3330,57 @@ void mem_cgroup_hugetlb_uncharge_memcg(int idx, unsigned long nr_pages,
 	res_counter_uncharge(&memcg->hugepage[idx], csize);
 	return;
 }
-#else
-static bool mem_cgroup_have_hugetlb_usage(struct mem_cgroup *memcg)
+
+int mem_cgroup_move_hugetlb_parent(int idx, struct cgroup *cgroup,
+				   struct page *page)
 {
-	return 0;
+	struct page_cgroup *pc;
+	int csize,  ret = 0;
+	struct res_counter *fail_res;
+	struct cgroup *pcgrp = cgroup->parent;
+	struct mem_cgroup *parent = mem_cgroup_from_cont(pcgrp);
+	struct mem_cgroup *memcg  = mem_cgroup_from_cont(cgroup);
+
+	if (!get_page_unless_zero(page))
+		goto out;
+
+	pc = lookup_page_cgroup(page);
+	lock_page_cgroup(pc);
+	if (!PageCgroupUsed(pc) || pc->mem_cgroup != memcg)
+		goto err_out;
+
+	csize = PAGE_SIZE << compound_order(page);
+	/*
+	 * uncharge from child and charge the parent. If we have
+	 * use_hierarchy set, we can never fail here. In-order to make
+	 * sure we don't get -ENOMEM on parent charge, we first uncharge
+	 * the child and then charge the parent.
+	 */
+	if (parent->use_hierarchy) {
+		res_counter_uncharge(&memcg->hugepage[idx], csize);
+		if (!mem_cgroup_is_root(parent))
+			ret = res_counter_charge(&parent->hugepage[idx],
+						 csize, &fail_res);
+	} else {
+		if (!mem_cgroup_is_root(parent)) {
+			ret = res_counter_charge(&parent->hugepage[idx],
+						 csize, &fail_res);
+			if (ret) {
+				ret = -EBUSY;
+				goto err_out;
+			}
+		}
+		res_counter_uncharge(&memcg->hugepage[idx], csize);
+	}
+	/*
+	 * caller should have done css_get
+	 */
+	pc->mem_cgroup = parent;
+err_out:
+	unlock_page_cgroup(pc);
+	put_page(page);
+out:
+	return ret;
 }
 #endif /* CONFIG_MEM_RES_CTLR_HUGETLB */
 
@@ -3852,6 +3901,11 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg, bool free_all)
 	/* should free all ? */
 	if (free_all)
 		goto try_to_free;
+
+	/* move the hugetlb charges */
+	ret = hugetlb_force_memcg_empty(cgrp);
+	if (ret)
+		goto out;
 move_account:
 	do {
 		ret = -EBUSY;
@@ -5172,12 +5226,6 @@ free_out:
 static int mem_cgroup_pre_destroy(struct cgroup *cont)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
-	/*
-	 * Don't allow memcg removal if we have HugeTLB resource
-	 * usage.
-	 */
-	if (mem_cgroup_have_hugetlb_usage(memcg))
-		return -EBUSY;
 
 	return mem_cgroup_force_empty(memcg, false);
 }
-- 
1.7.10.rc3.3.g19a6c


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH -V5 13/14] hugetlb: migrate memcg info from oldpage to new page during migration
  2012-04-06 18:50 [PATCH -V5 00/14] memcg: Add memcg extension to control HugeTLB allocation Aneesh Kumar K.V
                   ` (11 preceding siblings ...)
  2012-04-06 18:50 ` [PATCH -V5 12/14] memcg: move HugeTLB resource count to parent cgroup on memcg removal Aneesh Kumar K.V
@ 2012-04-06 18:50 ` Aneesh Kumar K.V
  2012-04-06 18:51 ` [PATCH -V5 14/14] memcg: Add memory controller documentation for hugetlb management Aneesh Kumar K.V
  13 siblings, 0 replies; 27+ messages in thread
From: Aneesh Kumar K.V @ 2012-04-06 18:50 UTC (permalink / raw)
  To: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
	akpm, hannes
  Cc: linux-kernel, cgroups, Aneesh Kumar K.V

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

With HugeTLB pages, memcg is uncharged in the compound page destructor.
Since we are holding a hugepage reference, we can be sure that the old
page won't get uncharged until the last put_page(). On a successful
migrate, we can move the memcg information to the new page's page_cgroup
and mark the old page's page_cgroup unused.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 include/linux/memcontrol.h |    8 ++++++++
 mm/memcontrol.c            |   28 ++++++++++++++++++++++++++++
 mm/migrate.c               |    4 ++++
 3 files changed, 40 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 70317e5..6f2d392 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -464,6 +464,8 @@ extern int mem_cgroup_move_hugetlb_parent(int idx, struct cgroup *cgroup,
 					  struct page *page);
 extern bool mem_cgroup_have_hugetlb_usage(struct cgroup *cgroup);
 
+extern void  mem_cgroup_hugetlb_migrate(struct page *oldhpage,
+					struct page *newhpage);
 #else
 static inline int
 mem_cgroup_hugetlb_charge_page(int idx, unsigned long nr_pages,
@@ -510,6 +512,12 @@ static inline bool mem_cgroup_have_hugetlb_usage(struct cgroup *cgroup)
 {
 	return 0;
 }
+
+static inline  void  mem_cgroup_hugetlb_migrate(struct page *oldhpage,
+						struct page *newhpage)
+{
+	return;
+}
 #endif  /* CONFIG_MEM_RES_CTLR_HUGETLB */
 #endif /* _LINUX_MEMCONTROL_H */
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7b6e79a..7b373a2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3382,6 +3382,34 @@ err_out:
 out:
 	return ret;
 }
+
+void  mem_cgroup_hugetlb_migrate(struct page *oldhpage, struct page *newhpage)
+{
+	struct mem_cgroup *memcg;
+	struct page_cgroup *pc;
+
+	VM_BUG_ON(!PageHuge(oldhpage));
+
+	if (mem_cgroup_disabled())
+		return;
+
+	pc = lookup_page_cgroup(oldhpage);
+	lock_page_cgroup(pc);
+	memcg = pc->mem_cgroup;
+	pc->mem_cgroup = root_mem_cgroup;
+	ClearPageCgroupUsed(pc);
+	cgroup_exclude_rmdir(&memcg->css);
+	unlock_page_cgroup(pc);
+
+	/* move the mem_cg details to new cgroup */
+	pc = lookup_page_cgroup(newhpage);
+	lock_page_cgroup(pc);
+	pc->mem_cgroup = memcg;
+	SetPageCgroupUsed(pc);
+	unlock_page_cgroup(pc);
+	cgroup_release_and_wakeup_rmdir(&memcg->css);
+	return;
+}
 #endif /* CONFIG_MEM_RES_CTLR_HUGETLB */
 
 /*
diff --git a/mm/migrate.c b/mm/migrate.c
index d7eb82d..2b931e5 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -928,6 +928,10 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
 
 	if (anon_vma)
 		put_anon_vma(anon_vma);
+
+	if (!rc)
+		mem_cgroup_hugetlb_migrate(hpage, new_hpage);
+
 	unlock_page(hpage);
 out:
 	put_page(new_hpage);
-- 
1.7.10.rc3.3.g19a6c


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH -V5 14/14] memcg: Add memory controller documentation for hugetlb management
  2012-04-06 18:50 [PATCH -V5 00/14] memcg: Add memcg extension to control HugeTLB allocation Aneesh Kumar K.V
                   ` (12 preceding siblings ...)
  2012-04-06 18:50 ` [PATCH -V5 13/14] hugetlb: migrate memcg info from oldpage to new page during migration Aneesh Kumar K.V
@ 2012-04-06 18:51 ` Aneesh Kumar K.V
  13 siblings, 0 replies; 27+ messages in thread
From: Aneesh Kumar K.V @ 2012-04-06 18:51 UTC (permalink / raw)
  To: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
	akpm, hannes
  Cc: linux-kernel, cgroups, Aneesh Kumar K.V

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 Documentation/cgroups/memory.txt |   29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index 4c95c00..d99c41b 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -43,6 +43,7 @@ Features:
  - usage threshold notifier
  - oom-killer disable knob and oom-notifier
  - Root cgroup has no limit controls.
+ - resource accounting for HugeTLB pages
 
  Kernel memory support is work in progress, and the current version provides
  basically functionality. (See Section 2.7)
@@ -75,6 +76,12 @@ Brief summary of control files.
  memory.kmem.tcp.limit_in_bytes  # set/show hard limit for tcp buf memory
  memory.kmem.tcp.usage_in_bytes  # show current tcp buf memory allocation
 
+
+ memory.hugetlb.<hugepagesize>.limit_in_bytes     # set/show limit of "hugepagesize" hugetlb usage
+ memory.hugetlb.<hugepagesize>.max_usage_in_bytes # show max "hugepagesize" hugetlb  usage recorded
+ memory.hugetlb.<hugepagesize>.usage_in_bytes     # show current res_counter usage for "hugepagesize" hugetlb
+						  # see 5.7 for details
+
 1. History
 
 The memory controller has a long history. A request for comments for the memory
@@ -279,6 +286,15 @@ per cgroup, instead of globally.
 
 * tcp memory pressure: sockets memory pressure for the tcp protocol.
 
+2.8 HugeTLB extension
+
+This extension allows limiting the HugeTLB usage per control group and
+enforces the controller limit during page fault. Since HugeTLB doesn't
+support page reclaim, enforcing the limit at page fault time implies that
+the application will get a SIGBUS signal if it tries to access HugeTLB
+pages beyond its limit. This requires the application to know beforehand
+how many HugeTLB pages it would require for its use.
+
 3. User Interface
 
 0. Configuration
@@ -287,6 +303,7 @@ a. Enable CONFIG_CGROUPS
 b. Enable CONFIG_RESOURCE_COUNTERS
 c. Enable CONFIG_CGROUP_MEM_RES_CTLR
 d. Enable CONFIG_CGROUP_MEM_RES_CTLR_SWAP (to use swap extension)
+f. Enable CONFIG_MEM_RES_CTLR_HUGETLB (to use HugeTLB extension)
 
 1. Prepare the cgroups (see cgroups.txt, Why are cgroups needed?)
 # mount -t tmpfs none /sys/fs/cgroup
@@ -510,6 +527,18 @@ unevictable=<total anon pages> N0=<node 0 pages> N1=<node 1 pages> ...
 
 And we have total = file + anon + unevictable.
 
+5.7 HugeTLB resource control files
+For a system supporting two hugepage sizes (16MB and 16GB) the control
+files include:
+
+ memory.hugetlb.16GB.limit_in_bytes
+ memory.hugetlb.16GB.max_usage_in_bytes
+ memory.hugetlb.16GB.usage_in_bytes
+ memory.hugetlb.16MB.limit_in_bytes
+ memory.hugetlb.16MB.max_usage_in_bytes
+ memory.hugetlb.16MB.usage_in_bytes
+
+
 6. Hierarchy support
 
 The memory controller supports a deep hierarchy and hierarchical accounting.
-- 
1.7.10.rc3.3.g19a6c
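
To make the new control files concrete, here is a minimal userspace sketch.
The cgroup path, the hugetlbfs mount point, the 16MB page size and the
"job1" group name are illustrative assumptions, not part of the patch; the
calling task is also assumed to already be in that cgroup via its tasks file.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define HPAGE_SZ	(16UL << 20)

int main(void)
{
	const char *limit =
		"/sys/fs/cgroup/memory/job1/memory.hugetlb.16MB.limit_in_bytes";
	void *p;
	int fd;

	/* allow this group at most four 16MB pages (64MB) */
	fd = open(limit, O_WRONLY);
	if (fd < 0 || write(fd, "67108864", 8) != 8) {
		perror("set hugetlb limit");
		return 1;
	}
	close(fd);

	fd = open("/mnt/hugetlbfs/data", O_CREAT | O_RDWR, 0600);
	if (fd < 0) {
		perror("open hugetlbfs file");
		return 1;
	}
	p = mmap(NULL, HPAGE_SZ, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	/* Faulting the page in charges memory.hugetlb.16MB.usage_in_bytes;
	 * touching pages beyond the limit would raise SIGBUS instead. */
	memset(p, 0, HPAGE_SZ);
	munmap(p, HPAGE_SZ);
	close(fd);
	return 0;
}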


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH -V5 04/14] hugetlb: Use mmu_gather instead of a temporary linked list for accumulating pages
  2012-04-06 18:50 ` [PATCH -V5 04/14] hugetlb: Use mmu_gather instead of a temporary linked list for accumulating pages Aneesh Kumar K.V
@ 2012-04-09  5:36   ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 27+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-04-09  5:36 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-mm, mgorman, dhillf, aarcange, mhocko, akpm, hannes,
	linux-kernel, cgroups

(2012/04/07 3:50), Aneesh Kumar K.V wrote:

> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
> 
> Use mmu_gather instead of temporary linked list for accumulating pages when
> we unmap a hugepage range. This also allows us to get rid of i_mmap_mutex
> unmap_hugepage_range in the following patch.
> 
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>


seems nice to me.
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH -V5 06/14] hugetlb: Simplify migrate_huge_page
  2012-04-06 18:50 ` [PATCH -V5 06/14] hugetlb: Simplify migrate_huge_page Aneesh Kumar K.V
@ 2012-04-09  5:47   ` KAMEZAWA Hiroyuki
  2012-04-09  8:36     ` Aneesh Kumar K.V
  0 siblings, 1 reply; 27+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-04-09  5:47 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-mm, mgorman, dhillf, aarcange, mhocko, akpm, hannes,
	linux-kernel, cgroups

(2012/04/07 3:50), Aneesh Kumar K.V wrote:

> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
> 
> Since we migrate only one hugepage don't use linked list for passing
> the page around. Directly pass page that need to be migrated as argument.
> This also remove the usage page->lru in migrate path.
> 
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>


seems good to me. I have one question below.


> ---
>  include/linux/migrate.h |    4 +--
>  mm/memory-failure.c     |   13 ++--------
>  mm/migrate.c            |   65 +++++++++++++++--------------------------------
>  3 files changed, 25 insertions(+), 57 deletions(-)
> 
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 855c337..ce7e667 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -15,7 +15,7 @@ extern int migrate_page(struct address_space *,
>  extern int migrate_pages(struct list_head *l, new_page_t x,
>  			unsigned long private, bool offlining,
>  			enum migrate_mode mode);
> -extern int migrate_huge_pages(struct list_head *l, new_page_t x,
> +extern int migrate_huge_page(struct page *, new_page_t x,
>  			unsigned long private, bool offlining,
>  			enum migrate_mode mode);
>  
> @@ -36,7 +36,7 @@ static inline void putback_lru_pages(struct list_head *l) {}
>  static inline int migrate_pages(struct list_head *l, new_page_t x,
>  		unsigned long private, bool offlining,
>  		enum migrate_mode mode) { return -ENOSYS; }
> -static inline int migrate_huge_pages(struct list_head *l, new_page_t x,
> +static inline int migrate_huge_page(struct page *page, new_page_t x,
>  		unsigned long private, bool offlining,
>  		enum migrate_mode mode) { return -ENOSYS; }
>  
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 97cc273..1f092db 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1414,7 +1414,6 @@ static int soft_offline_huge_page(struct page *page, int flags)
>  	int ret;
>  	unsigned long pfn = page_to_pfn(page);
>  	struct page *hpage = compound_head(page);
> -	LIST_HEAD(pagelist);
>  
>  	ret = get_any_page(page, pfn, flags);
>  	if (ret < 0)
> @@ -1429,19 +1428,11 @@ static int soft_offline_huge_page(struct page *page, int flags)
>  	}
>  
>  	/* Keep page count to indicate a given hugepage is isolated. */
> -
> -	list_add(&hpage->lru, &pagelist);
> -	ret = migrate_huge_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL, 0,
> -				true);
> +	ret = migrate_huge_page(page, new_page, MPOL_MF_MOVE_ALL, 0, true);
> +	put_page(page);
>  	if (ret) {
> -		struct page *page1, *page2;
> -		list_for_each_entry_safe(page1, page2, &pagelist, lru)
> -			put_page(page1);
> -
>  		pr_info("soft offline: %#lx: migration failed %d, type %lx\n",
>  			pfn, ret, page->flags);
> -		if (ret > 0)
> -			ret = -EIO;
>  		return ret;
>  	}
>  done:
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 51c08a0..d7eb82d 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -929,15 +929,8 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
>  	if (anon_vma)
>  		put_anon_vma(anon_vma);
>  	unlock_page(hpage);
> -
>  out:
> -	if (rc != -EAGAIN) {
> -		list_del(&hpage->lru);
> -		put_page(hpage);
> -	}
> -
>  	put_page(new_hpage);
> -
>  	if (result) {
>  		if (rc)
>  			*result = rc;
> @@ -1013,48 +1006,32 @@ out:
>  	return nr_failed + retry;
>  }
>  
> -int migrate_huge_pages(struct list_head *from,
> -		new_page_t get_new_page, unsigned long private, bool offlining,
> -		enum migrate_mode mode)
> +int migrate_huge_page(struct page *hpage, new_page_t get_new_page,
> +		      unsigned long private, bool offlining,
> +		      enum migrate_mode mode)
>  {
> -	int retry = 1;
> -	int nr_failed = 0;
> -	int pass = 0;
> -	struct page *page;
> -	struct page *page2;
> -	int rc;
> -
> -	for (pass = 0; pass < 10 && retry; pass++) {
> -		retry = 0;
> -
> -		list_for_each_entry_safe(page, page2, from, lru) {
> +	int pass, rc;
> +
> +	for (pass = 0; pass < 10; pass++) {
> +		rc = unmap_and_move_huge_page(get_new_page,
> +					      private, hpage, pass > 2, offlining,
> +					      mode);
> +		switch (rc) {
> +		case -ENOMEM:
> +			goto out;
> +		case -EAGAIN:
> +			/* try again */
>  			cond_resched();
> -
> -			rc = unmap_and_move_huge_page(get_new_page,
> -					private, page, pass > 2, offlining,
> -					mode);
> -
> -			switch(rc) {
> -			case -ENOMEM:
> -				goto out;
> -			case -EAGAIN:
> -				retry++;
> -				break;
> -			case 0:
> -				break;
> -			default:
> -				/* Permanent failure */
> -				nr_failed++;
> -				break;
> -			}
> +			break;
> +		case 0:
> +			goto out;
> +		default:
> +			rc = -EIO;
> +			goto out;


why -EIO ? Isn't this BUG() ??

Thanks,
-Kame

>  		}
>  	}
> -	rc = 0;
>  out:
> -	if (rc)
> -		return rc;
> -
> -	return nr_failed + retry;
> +	return rc;
>  }
>  
>  #ifdef CONFIG_NUMA




^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH -V5 09/14] memcg: track resource index in cftype private
  2012-04-06 18:50 ` [PATCH -V5 09/14] memcg: track resource index in cftype private Aneesh Kumar K.V
@ 2012-04-09  5:56   ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 27+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-04-09  5:56 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-mm, mgorman, dhillf, aarcange, mhocko, akpm, hannes,
	linux-kernel, cgroups

(2012/04/07 3:50), Aneesh Kumar K.V wrote:

> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
> 
> This helps in using same memcg callbacks for non reclaim resource
> control files.
>


Please modify the changelog. It doesn't explain any of the contents of this patch.

 - add 'index' field to memcg files's attribute
 - support hugetlb type.
 - support modification of hugetlb res_counter.
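
As a concrete illustration of the packing introduced in the hunk quoted
below, a small standalone sketch (RES_LIMIT == 2 is an assumption taken
from include/linux/res_counter.h; hstate index 1 would be the second
hugepage size):

#include <stdio.h>

#define _MEMHUGETLB			(3)
#define __MEMFILE_PRIVATE(idx, x, val)	(((idx) << 24) | ((x) << 16) | (val))
#define MEMFILE_TYPE(val)		(((val) >> 16) & 0xff)
#define MEMFILE_IDX(val)		(((val) >> 24) & 0xff)
#define MEMFILE_ATTR(val)		((val) & 0xffff)

int main(void)
{
	int priv = __MEMFILE_PRIVATE(1, _MEMHUGETLB, 2);	/* 0x01030002 */

	printf("idx=%d type=%d attr=%d\n",
	       MEMFILE_IDX(priv), MEMFILE_TYPE(priv), MEMFILE_ATTR(priv));
	/* prints: idx=1 type=3 attr=2 */
	return 0;
}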

Thanks,
-Kame

 
> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> ---
>  mm/memcontrol.c |   27 +++++++++++++++++++++------
>  1 file changed, 21 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 0a1f776..3bb3b42 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -381,9 +381,14 @@ enum charge_type {
>  #define _MEM			(0)
>  #define _MEMSWAP		(1)
>  #define _OOM_TYPE		(2)
> -#define MEMFILE_PRIVATE(x, val)	(((x) << 16) | (val))
> -#define MEMFILE_TYPE(val)	(((val) >> 16) & 0xffff)
> -#define MEMFILE_ATTR(val)	((val) & 0xffff)
> +#define _MEMHUGETLB		(3)
> +
> +/*  0 ... val ...16.... x...24...idx...32*/
> +#define __MEMFILE_PRIVATE(idx, x, val)	(((idx) << 24) | ((x) << 16) | (val))
> +#define MEMFILE_PRIVATE(x, val)		__MEMFILE_PRIVATE(0, x, val)
> +#define MEMFILE_TYPE(val)		(((val) >> 16) & 0xff)
> +#define MEMFILE_IDX(val)		(((val) >> 24) & 0xff)
> +#define MEMFILE_ATTR(val)		((val) & 0xffff)
>  /* Used for OOM nofiier */
>  #define OOM_CONTROL		(0)
>  
> @@ -4003,7 +4008,7 @@ static ssize_t mem_cgroup_read(struct cgroup *cont, struct cftype *cft,
>  	struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
>  	char str[64];
>  	u64 val;
> -	int type, name, len;
> +	int type, name, len, idx;
>  
>  	type = MEMFILE_TYPE(cft->private);
>  	name = MEMFILE_ATTR(cft->private);
> @@ -4024,6 +4029,10 @@ static ssize_t mem_cgroup_read(struct cgroup *cont, struct cftype *cft,
>  		else
>  			val = res_counter_read_u64(&memcg->memsw, name);
>  		break;
> +	case _MEMHUGETLB:
> +		idx = MEMFILE_IDX(cft->private);
> +		val = res_counter_read_u64(&memcg->hugepage[idx], name);
> +		break;
>  	default:
>  		BUG();
>  	}
> @@ -4061,7 +4070,10 @@ static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft,
>  			break;
>  		if (type == _MEM)
>  			ret = mem_cgroup_resize_limit(memcg, val);
> -		else
> +		else if (type == _MEMHUGETLB) {
> +			int idx = MEMFILE_IDX(cft->private);
> +			ret = res_counter_set_limit(&memcg->hugepage[idx], val);
> +		} else
>  			ret = mem_cgroup_resize_memsw_limit(memcg, val);
>  		break;
>  	case RES_SOFT_LIMIT:
> @@ -4127,7 +4139,10 @@ static int mem_cgroup_reset(struct cgroup *cont, unsigned int event)
>  	case RES_MAX_USAGE:
>  		if (type == _MEM)
>  			res_counter_reset_max(&memcg->res);
> -		else
> +		else if (type == _MEMHUGETLB) {
> +			int idx = MEMFILE_IDX(event);
> +			res_counter_reset_max(&memcg->hugepage[idx]);
> +		} else
>  			res_counter_reset_max(&memcg->memsw);
>  		break;
>  	case RES_FAILCNT:




^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH -V5 10/14] hugetlbfs: Add memcg control files for hugetlbfs
  2012-04-06 18:50 ` [PATCH -V5 10/14] hugetlbfs: Add memcg control files for hugetlbfs Aneesh Kumar K.V
@ 2012-04-09  6:00   ` KAMEZAWA Hiroyuki
  2012-04-09  8:46     ` Aneesh Kumar K.V
  0 siblings, 1 reply; 27+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-04-09  6:00 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-mm, mgorman, dhillf, aarcange, mhocko, akpm, hannes,
	linux-kernel, cgroups

(2012/04/07 3:50), Aneesh Kumar K.V wrote:

> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
> 
> This add control files for hugetlbfs in memcg
> 
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>


seems ok. This uses Tejun's new interface...right ?

It might be better to explain that this patch depends on the "cgroup: implement
cgroup_add_cftypes() and friends" patch from the cgroup development tree.

Anyway,
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

> ---
>  include/linux/hugetlb.h    |    5 +++++
>  include/linux/memcontrol.h |    7 ++++++
>  mm/hugetlb.c               |    2 +-
>  mm/memcontrol.c            |   51 ++++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 64 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 995c238..d008342 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -4,6 +4,7 @@
>  #include <linux/mm_types.h>
>  #include <linux/fs.h>
>  #include <linux/hugetlb_inline.h>
> +#include <linux/cgroup.h>
>  
>  struct ctl_table;
>  struct user_struct;
> @@ -203,6 +204,10 @@ struct hstate {
>  	unsigned int nr_huge_pages_node[MAX_NUMNODES];
>  	unsigned int free_huge_pages_node[MAX_NUMNODES];
>  	unsigned int surplus_huge_pages_node[MAX_NUMNODES];
> +#ifdef CONFIG_MEM_RES_CTLR_HUGETLB
> +	/* mem cgroup control files */
> +	struct cftype mem_cgroup_files[4];
> +#endif
>  	char name[HSTATE_NAME_LEN];
>  };
>  
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 1d07e14..4f17574 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -459,6 +459,7 @@ extern void mem_cgroup_hugetlb_uncharge_page(int idx, unsigned long nr_pages,
>  					     struct page *page);
>  extern void mem_cgroup_hugetlb_uncharge_memcg(int idx, unsigned long nr_pages,
>  					      struct mem_cgroup *memcg);
> +extern int mem_cgroup_hugetlb_file_init(int idx) __init;
>  
>  #else
>  static inline int
> @@ -489,6 +490,12 @@ mem_cgroup_hugetlb_uncharge_memcg(int idx, unsigned long nr_pages,
>  {
>  	return;
>  }
> +
> +static inline int mem_cgroup_hugetlb_file_init(int idx)
> +{
> +	return 0;
> +}
> +
>  #endif  /* CONFIG_MEM_RES_CTLR_HUGETLB */
>  #endif /* _LINUX_MEMCONTROL_H */
>  
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index dd00087..340e575 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1931,7 +1931,7 @@ void __init hugetlb_add_hstate(unsigned order)
>  	h->next_nid_to_free = first_node(node_states[N_HIGH_MEMORY]);
>  	snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB",
>  					huge_page_size(h)/1024);
> -
> +	mem_cgroup_hugetlb_file_init(hugetlb_max_hstate - 1);
>  	parsed_hstate = h;
>  }
>  
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 3bb3b42..7d3330e 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5191,6 +5191,57 @@ static void mem_cgroup_destroy(struct cgroup *cont)
>  	mem_cgroup_put(memcg);
>  }
>  
> +#ifdef CONFIG_MEM_RES_CTLR_HUGETLB
> +static char *mem_fmt(char *buf, unsigned long n)
> +{
> +	if (n >= (1UL << 30))
> +		sprintf(buf, "%luGB", n >> 30);
> +	else if (n >= (1UL << 20))
> +		sprintf(buf, "%luMB", n >> 20);
> +	else
> +		sprintf(buf, "%luKB", n >> 10);
> +	return buf;
> +}
> +
> +int __init mem_cgroup_hugetlb_file_init(int idx)
> +{
> +	char buf[32];
> +	struct cftype *cft;
> +	struct hstate *h = &hstates[idx];
> +
> +	/* format the size */
> +	mem_fmt(buf, huge_page_size(h));
> +
> +	/* Add the limit file */
> +	cft = &h->mem_cgroup_files[0];
> +	snprintf(cft->name, MAX_CFTYPE_NAME, "hugetlb.%s.limit_in_bytes", buf);
> +	cft->private = __MEMFILE_PRIVATE(idx, _MEMHUGETLB, RES_LIMIT);
> +	cft->read = mem_cgroup_read;
> +	cft->write_string = mem_cgroup_write;
> +
> +	/* Add the usage file */
> +	cft = &h->mem_cgroup_files[1];
> +	snprintf(cft->name, MAX_CFTYPE_NAME, "hugetlb.%s.usage_in_bytes", buf);
> +	cft->private  = __MEMFILE_PRIVATE(idx, _MEMHUGETLB, RES_USAGE);
> +	cft->read = mem_cgroup_read;
> +
> +	/* Add the MAX usage file */
> +	cft = &h->mem_cgroup_files[2];
> +	snprintf(cft->name, MAX_CFTYPE_NAME, "hugetlb.%s.max_usage_in_bytes", buf);
> +	cft->private  = __MEMFILE_PRIVATE(idx, _MEMHUGETLB, RES_MAX_USAGE);
> +	cft->trigger  = mem_cgroup_reset;
> +	cft->read = mem_cgroup_read;
> +
> +	/* NULL terminate the last cft */
> +	cft = &h->mem_cgroup_files[3];
> +	memset(cft, 0, sizeof(*cft));
> +
> +	WARN_ON(cgroup_add_cftypes(&mem_cgroup_subsys, h->mem_cgroup_files));
> +
> +	return 0;
> +}
> +#endif
> +
>  static int mem_cgroup_populate(struct cgroup_subsys *ss,
>  				struct cgroup *cont)
>  {




^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH -V5 07/14] memcg: Add HugeTLB extension
  2012-04-06 18:50 ` [PATCH -V5 07/14] memcg: Add HugeTLB extension Aneesh Kumar K.V
@ 2012-04-09  6:04   ` KAMEZAWA Hiroyuki
  2012-04-09  8:43     ` Aneesh Kumar K.V
  0 siblings, 1 reply; 27+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-04-09  6:04 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-mm, mgorman, dhillf, aarcange, mhocko, akpm, hannes,
	linux-kernel, cgroups

(2012/04/07 3:50), Aneesh Kumar K.V wrote:

> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
> 
> This patch implements a memcg extension that allows us to control HugeTLB
> allocations via memory controller. The extension allows to limit the
> HugeTLB usage per control group and enforces the controller limit during
> page fault. Since HugeTLB doesn't support page reclaim, enforcing the limit
> at page fault time implies that, the application will get SIGBUS signal if it
> tries to access HugeTLB pages beyond its limit. This requires the application
> to know beforehand how much HugeTLB pages it would require for its use.
> 
> The charge/uncharge calls will be added to HugeTLB code in later patch.
> 
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>


Hmm, seems ok to me. Please explain 'this patch doesn't include updates
for memcg destroying; it will be in patch 12/14' or something similar...

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


BTW, you don't put the res_counter for hugetlb under CONFIG_MEM_RES_CTLR_HUGETLB...
do you think we need the config ?



> ---
>  include/linux/hugetlb.h    |    1 +
>  include/linux/memcontrol.h |   42 ++++++++++++++
>  init/Kconfig               |    8 +++
>  mm/hugetlb.c               |    2 +-
>  mm/memcontrol.c            |  132 ++++++++++++++++++++++++++++++++++++++++++++
>  5 files changed, 184 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 46c6cbd..995c238 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -226,6 +226,7 @@ struct hstate *size_to_hstate(unsigned long size);
>  #define HUGE_MAX_HSTATE 1
>  #endif
>  
> +extern int hugetlb_max_hstate;
>  extern struct hstate hstates[HUGE_MAX_HSTATE];
>  extern unsigned int default_hstate_idx;
>  
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index f94efd2..1d07e14 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -448,5 +448,47 @@ static inline void sock_release_memcg(struct sock *sk)
>  {
>  }
>  #endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
> +
> +#ifdef CONFIG_MEM_RES_CTLR_HUGETLB
> +extern int mem_cgroup_hugetlb_charge_page(int idx, unsigned long nr_pages,
> +					  struct mem_cgroup **ptr);
> +extern void mem_cgroup_hugetlb_commit_charge(int idx, unsigned long nr_pages,
> +					     struct mem_cgroup *memcg,
> +					     struct page *page);
> +extern void mem_cgroup_hugetlb_uncharge_page(int idx, unsigned long nr_pages,
> +					     struct page *page);
> +extern void mem_cgroup_hugetlb_uncharge_memcg(int idx, unsigned long nr_pages,
> +					      struct mem_cgroup *memcg);
> +
> +#else
> +static inline int
> +mem_cgroup_hugetlb_charge_page(int idx, unsigned long nr_pages,
> +						 struct mem_cgroup **ptr)
> +{
> +	return 0;
> +}
> +
> +static inline void
> +mem_cgroup_hugetlb_commit_charge(int idx, unsigned long nr_pages,
> +				 struct mem_cgroup *memcg,
> +				 struct page *page)
> +{
> +	return;
> +}
> +
> +static inline void
> +mem_cgroup_hugetlb_uncharge_page(int idx, unsigned long nr_pages,
> +				 struct page *page)
> +{
> +	return;
> +}
> +
> +static inline void
> +mem_cgroup_hugetlb_uncharge_memcg(int idx, unsigned long nr_pages,
> +				  struct mem_cgroup *memcg)
> +{
> +	return;
> +}
> +#endif  /* CONFIG_MEM_RES_CTLR_HUGETLB */
>  #endif /* _LINUX_MEMCONTROL_H */
>  
> diff --git a/init/Kconfig b/init/Kconfig
> index 72f33fa..a3b5665 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -716,6 +716,14 @@ config CGROUP_PERF
>  
>  	  Say N if unsure.
>  
> +config MEM_RES_CTLR_HUGETLB
> +	bool "Memory Resource Controller HugeTLB Extension (EXPERIMENTAL)"
> +	depends on CGROUP_MEM_RES_CTLR && HUGETLB_PAGE && EXPERIMENTAL
> +	default n
> +	help
> +	  Add HugeTLB management to memory resource controller. When you
> +	  enable this, you can put a per cgroup limit on HugeTLB usage.
> +
>  menuconfig CGROUP_SCHED
>  	bool "Group CPU scheduler"
>  	default n
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index a3ac624..8cd89b4 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -35,7 +35,7 @@ const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL;
>  static gfp_t htlb_alloc_mask = GFP_HIGHUSER;
>  unsigned long hugepages_treat_as_movable;
>  
> -static int hugetlb_max_hstate;
> +int hugetlb_max_hstate;
>  unsigned int default_hstate_idx;
>  struct hstate hstates[HUGE_MAX_HSTATE];
>  
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index d28359c..1a2e041 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -252,6 +252,10 @@ struct mem_cgroup {
>  	};
>  
>  	/*
> +	 * the counter to account for hugepages from hugetlb.
> +	 */
> +	struct res_counter hugepage[HUGE_MAX_HSTATE];
> +	/*
>  	 * Per cgroup active and inactive list, similar to the
>  	 * per zone LRU lists.
>  	 */
> @@ -3213,6 +3217,114 @@ static inline int mem_cgroup_move_swap_account(swp_entry_t entry,
>  }
>  #endif
>  
> +#ifdef CONFIG_MEM_RES_CTLR_HUGETLB
> +static bool mem_cgroup_have_hugetlb_usage(struct mem_cgroup *memcg)
> +{
> +	int idx;
> +	for (idx = 0; idx < hugetlb_max_hstate; idx++) {
> +		if ((res_counter_read_u64(&memcg->hugepage[idx], RES_USAGE)) > 0)
> +			return 1;
> +	}
> +	return 0;
> +}
> +
> +int mem_cgroup_hugetlb_charge_page(int idx, unsigned long nr_pages,
> +				   struct mem_cgroup **ptr)
> +{
> +	int ret = 0;
> +	struct mem_cgroup *memcg = NULL;
> +	struct res_counter *fail_res;
> +	unsigned long csize = nr_pages * PAGE_SIZE;
> +
> +	if (mem_cgroup_disabled())
> +		goto done;
> +again:
> +	rcu_read_lock();
> +	memcg = mem_cgroup_from_task(current);
> +	if (!memcg)
> +		memcg = root_mem_cgroup;
> +
> +	if (!css_tryget(&memcg->css)) {
> +		rcu_read_unlock();
> +		goto again;
> +	}
> +	rcu_read_unlock();
> +
> +	ret = res_counter_charge(&memcg->hugepage[idx], csize, &fail_res);
> +	css_put(&memcg->css);
> +done:
> +	*ptr = memcg;
> +	return ret;
> +}
> +
> +void mem_cgroup_hugetlb_commit_charge(int idx, unsigned long nr_pages,
> +				      struct mem_cgroup *memcg,
> +				      struct page *page)
> +{
> +	struct page_cgroup *pc;
> +
> +	if (mem_cgroup_disabled())
> +		return;
> +
> +	pc = lookup_page_cgroup(page);
> +	lock_page_cgroup(pc);
> +	if (unlikely(PageCgroupUsed(pc))) {
> +		unlock_page_cgroup(pc);
> +		mem_cgroup_hugetlb_uncharge_memcg(idx, nr_pages, memcg);
> +		return;
> +	}
> +	pc->mem_cgroup = memcg;
> +	SetPageCgroupUsed(pc);
> +	unlock_page_cgroup(pc);
> +	return;
> +}
> +
> +void mem_cgroup_hugetlb_uncharge_page(int idx, unsigned long nr_pages,
> +				      struct page *page)
> +{
> +	struct page_cgroup *pc;
> +	struct mem_cgroup *memcg;
> +	unsigned long csize = nr_pages * PAGE_SIZE;
> +
> +	if (mem_cgroup_disabled())
> +		return;
> +
> +	pc = lookup_page_cgroup(page);
> +	if (unlikely(!PageCgroupUsed(pc)))
> +		return;
> +
> +	lock_page_cgroup(pc);
> +	if (!PageCgroupUsed(pc)) {
> +		unlock_page_cgroup(pc);
> +		return;
> +	}
> +	memcg = pc->mem_cgroup;
> +	pc->mem_cgroup = root_mem_cgroup;
> +	ClearPageCgroupUsed(pc);
> +	unlock_page_cgroup(pc);
> +
> +	res_counter_uncharge(&memcg->hugepage[idx], csize);
> +	return;
> +}
> +
> +void mem_cgroup_hugetlb_uncharge_memcg(int idx, unsigned long nr_pages,
> +				       struct mem_cgroup *memcg)
> +{
> +	unsigned long csize = nr_pages * PAGE_SIZE;
> +
> +	if (mem_cgroup_disabled())
> +		return;
> +
> +	res_counter_uncharge(&memcg->hugepage[idx], csize);
> +	return;
> +}
> +#else
> +static bool mem_cgroup_have_hugetlb_usage(struct mem_cgroup *memcg)
> +{
> +	return 0;
> +}
> +#endif /* CONFIG_MEM_RES_CTLR_HUGETLB */
> +
>  /*
>   * Before starting migration, account PAGE_SIZE to mem_cgroup that the old
>   * page belongs to.
> @@ -4962,6 +5074,7 @@ err_cleanup:
>  static struct cgroup_subsys_state * __ref
>  mem_cgroup_create(struct cgroup *cont)
>  {
> +	int idx;
>  	struct mem_cgroup *memcg, *parent;
>  	long error = -ENOMEM;
>  	int node;
> @@ -5004,9 +5117,22 @@ mem_cgroup_create(struct cgroup *cont)
>  		 * mem_cgroup(see mem_cgroup_put).
>  		 */
>  		mem_cgroup_get(parent);
> +		/*
> +		 * We could get called before hugetlb init is called.
> +		 * Use HUGE_MAX_HSTATE as the max index.
> +		 */
> +		for (idx = 0; idx < HUGE_MAX_HSTATE; idx++)
> +			res_counter_init(&memcg->hugepage[idx],
> +					 &parent->hugepage[idx]);
>  	} else {
>  		res_counter_init(&memcg->res, NULL);
>  		res_counter_init(&memcg->memsw, NULL);
> +		/*
> +		 * We could get called before hugetlb init is called.
> +		 * Use HUGE_MAX_HSTATE as the max index.
> +		 */
> +		for (idx = 0; idx < HUGE_MAX_HSTATE; idx++)
> +			res_counter_init(&memcg->hugepage[idx], NULL);
>  	}
>  	memcg->last_scanned_node = MAX_NUMNODES;
>  	INIT_LIST_HEAD(&memcg->oom_notify);
> @@ -5026,6 +5152,12 @@ free_out:
>  static int mem_cgroup_pre_destroy(struct cgroup *cont)
>  {
>  	struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
> +	/*
> +	 * Don't allow memcg removal if we have HugeTLB resource
> +	 * usage.
> +	 */
> +	if (mem_cgroup_have_hugetlb_usage(memcg))
> +		return -EBUSY;
>  



>  	return mem_cgroup_force_empty(memcg, false);
>  }




^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH -V5 12/14] memcg: move HugeTLB resource count to parent cgroup on memcg removal
  2012-04-06 18:50 ` [PATCH -V5 12/14] memcg: move HugeTLB resource count to parent cgroup on memcg removal Aneesh Kumar K.V
@ 2012-04-09  6:16   ` KAMEZAWA Hiroyuki
  2012-04-09 10:00     ` Aneesh Kumar K.V
  0 siblings, 1 reply; 27+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-04-09  6:16 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-mm, mgorman, dhillf, aarcange, mhocko, akpm, hannes,
	linux-kernel, cgroups

(2012/04/07 3:50), Aneesh Kumar K.V wrote:

> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
> 
> This add support for memcg removal with HugeTLB resource usage.
> 
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>


Hmm 


> +#ifdef CONFIG_MEM_RES_CTLR_HUGETLB
> +/*
> + * Force the memcg to empty the hugetlb resources by moving them to
> + * the parent cgroup. We can fail if the parent cgroup's limit prevented
> + * the charging. This should only happen if use_hierarchy is not set.
> + */
> +int hugetlb_force_memcg_empty(struct cgroup *cgroup)
> +{
> +	struct hstate *h;
> +	struct page *page;
> +	int ret = 0, idx = 0;
> +
> +	do {
> +		if (cgroup_task_count(cgroup) || !list_empty(&cgroup->children))
> +			goto out;
> +		/*
> +		 * If the task doing the cgroup_rmdir got a signal
> +		 * we don't really need to loop till the hugetlb resource
> +		 * usage become zero.
> +		 */
> +		if (signal_pending(current)) {
> +			ret = -EINTR;
> +			goto out;
> +		}
> +		for_each_hstate(h) {
> +			spin_lock(&hugetlb_lock);
> +			list_for_each_entry(page, &h->hugepage_activelist, lru) {
> +				ret = mem_cgroup_move_hugetlb_parent(idx, cgroup, page);
> +				if (ret) {
> +					spin_unlock(&hugetlb_lock);
> +					goto out;
> +				}
> +			}
> +			spin_unlock(&hugetlb_lock);
> +			idx++;
> +		}
> +		cond_resched();
> +	} while (mem_cgroup_have_hugetlb_usage(cgroup));
> +out:
> +	return ret;
> +}
> +#endif
> +
>  /* Should be called on processing a hugepagesz=... option */
>  void __init hugetlb_add_hstate(unsigned order)
>  {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 7d3330e..7b6e79a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3228,9 +3228,11 @@ static inline int mem_cgroup_move_swap_account(swp_entry_t entry,
>  #endif
>  
>  #ifdef CONFIG_MEM_RES_CTLR_HUGETLB
> -static bool mem_cgroup_have_hugetlb_usage(struct mem_cgroup *memcg)
> +bool mem_cgroup_have_hugetlb_usage(struct cgroup *cgroup)
>  {
>  	int idx;
> +	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgroup);
> +
>  	for (idx = 0; idx < hugetlb_max_hstate; idx++) {
>  		if ((res_counter_read_u64(&memcg->hugepage[idx], RES_USAGE)) > 0)
>  			return 1;
> @@ -3328,10 +3330,57 @@ void mem_cgroup_hugetlb_uncharge_memcg(int idx, unsigned long nr_pages,
>  	res_counter_uncharge(&memcg->hugepage[idx], csize);
>  	return;
>  }
> -#else
> -static bool mem_cgroup_have_hugetlb_usage(struct mem_cgroup *memcg)
> +
> +int mem_cgroup_move_hugetlb_parent(int idx, struct cgroup *cgroup,
> +				   struct page *page)
>  {
> -	return 0;
> +	struct page_cgroup *pc;
> +	int csize,  ret = 0;
> +	struct res_counter *fail_res;
> +	struct cgroup *pcgrp = cgroup->parent;
> +	struct mem_cgroup *parent = mem_cgroup_from_cont(pcgrp);
> +	struct mem_cgroup *memcg  = mem_cgroup_from_cont(cgroup);
> +
> +	if (!get_page_unless_zero(page))
> +		goto out;
> +
> +	pc = lookup_page_cgroup(page);
> +	lock_page_cgroup(pc);
> +	if (!PageCgroupUsed(pc) || pc->mem_cgroup != memcg)
> +		goto err_out;
> +
> +	csize = PAGE_SIZE << compound_order(page);
> +	/*
> +	 * uncharge from child and charge the parent. If we have
> +	 * use_hierarchy set, we can never fail here. In-order to make
> +	 * sure we don't get -ENOMEM on parent charge, we first uncharge
> +	 * the child and then charge the parent.
> +	 */
> +	if (parent->use_hierarchy) {


> +		res_counter_uncharge(&memcg->hugepage[idx], csize);
> +		if (!mem_cgroup_is_root(parent))
> +			ret = res_counter_charge(&parent->hugepage[idx],
> +						 csize, &fail_res);


Ah, why is !mem_cgroup_is_root() checked ? no res_counter update for
root cgroup ?

I think it's better to have res_counter_move_parent()...to do ops in atomic.
(I'll post a patch for that for my purpose). OR, just ignore res->usage if
parent->use_hierarchy == 1.

uncharge->charge will have a race.
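
A rough sketch of the kind of helper being suggested -- hypothetical, not
an existing kernel/res_counter.c interface. With use_hierarchy set the
parent already holds the charge, so "moving" to the parent only needs the
child's usage dropped under the child's own lock, leaving no
uncharge-then-charge window:

#include <linux/res_counter.h>
#include <linux/spinlock.h>

/* Hypothetical helper: drop @val from @child while every ancestor keeps
 * its charge.  A single lock, so there is no window for a racing charge. */
static void res_counter_move_parent(struct res_counter *child,
				    unsigned long val)
{
	unsigned long flags;

	spin_lock_irqsave(&child->lock, flags);
	res_counter_uncharge_locked(child, val);
	spin_unlock_irqrestore(&child->lock, flags);
}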

> +	} else {
> +		if (!mem_cgroup_is_root(parent)) {
> +			ret = res_counter_charge(&parent->hugepage[idx],
> +						 csize, &fail_res);
> +			if (ret) {
> +				ret = -EBUSY;
> +				goto err_out;
> +			}
> +		}
> +		res_counter_uncharge(&memcg->hugepage[idx], csize);
> +	}


Just a notice. Recently, Tejun changed failure of pre_destroy() to show a WARNING.
Then, I'd like to move the usage to the root cgroup if use_hierarchy=0.
Will it work for you ?

> +	/*
> +	 * caller should have done css_get
> +	 */


Could you explain meaning of this comment ?


Thanks,
-Kame

> +	pc->mem_cgroup = parent;
> +err_out:
> +	unlock_page_cgroup(pc);
> +	put_page(page);
> +out:
> +	return ret;
>  }
>  #endif /* CONFIG_MEM_RES_CTLR_HUGETLB */
>  
> @@ -3852,6 +3901,11 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg, bool free_all)
>  	/* should free all ? */
>  	if (free_all)
>  		goto try_to_free;
> +
> +	/* move the hugetlb charges */
> +	ret = hugetlb_force_memcg_empty(cgrp);
> +	if (ret)
> +		goto out;
>  move_account:
>  	do {
>  		ret = -EBUSY;
> @@ -5172,12 +5226,6 @@ free_out:
>  static int mem_cgroup_pre_destroy(struct cgroup *cont)
>  {
>  	struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
> -	/*
> -	 * Don't allow memcg removal if we have HugeTLB resource
> -	 * usage.
> -	 */
> -	if (mem_cgroup_have_hugetlb_usage(memcg))
> -		return -EBUSY;
>  
>  	return mem_cgroup_force_empty(memcg, false);
>  }




^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH -V5 06/14] hugetlb: Simplify migrate_huge_page
  2012-04-09  5:47   ` KAMEZAWA Hiroyuki
@ 2012-04-09  8:36     ` Aneesh Kumar K.V
  0 siblings, 0 replies; 27+ messages in thread
From: Aneesh Kumar K.V @ 2012-04-09  8:36 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, mgorman, dhillf, aarcange, mhocko, akpm, hannes,
	linux-kernel, cgroups

KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> writes:

> (2012/04/07 3:50), Aneesh Kumar K.V wrote:
>
>> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
>> 
>> Since we migrate only one hugepage don't use linked list for passing
>> the page around. Directly pass page that need to be migrated as argument.
>> This also remove the usage page->lru in migrate path.
>> 
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
>
>
> seems good to me. I have one question below.
>
>
>> ---

...... snip ......


>> -	list_add(&hpage->lru, &pagelist);
>> -	ret = migrate_huge_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL, 0,
>> -				true);
>> +	ret = migrate_huge_page(page, new_page, MPOL_MF_MOVE_ALL, 0, true);
>> +	put_page(page);
>>  	if (ret) {
>> -		struct page *page1, *page2;
>> -		list_for_each_entry_safe(page1, page2, &pagelist, lru)
>> -			put_page(page1);
>> -
>>  		pr_info("soft offline: %#lx: migration failed %d, type %lx\n",
>>  			pfn, ret, page->flags);
>> -		if (ret > 0)
>> -			ret = -EIO;   <---------------------------- here
>>  		return ret;
>>  	}
>>  done:
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index 51c08a0..d7eb82d 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -929,15 +929,8 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
>>  	if (anon_vma)
>>  		put_anon_vma(anon_vma);
>>  	unlock_page(hpage);

.... snip .....

>> -
>> -			rc = unmap_and_move_huge_page(get_new_page,
>> -					private, page, pass > 2, offlining,
>> -					mode);
>> -
>> -			switch(rc) {
>> -			case -ENOMEM:
>> -				goto out;
>> -			case -EAGAIN:
>> -				retry++;
>> -				break;
>> -			case 0:
>> -				break;
>> -			default:
>> -				/* Permanent failure */
>> -				nr_failed++;
>> -				break;
>> -			}
>> +			break;
>> +		case 0:
>> +			goto out;
>> +		default:
>> +			rc = -EIO;
>> +			goto out;
>
>
> why -EIO ? Isn't this BUG() ??

I am not sure doing a BUG() for migrate is a good idea. We may want to
return error and let the higher layer handle this. Also as you see in
the two hunks I listed above, default is mapped to -EIO in the current
code. I didn't want to change that.

-aneesh


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH -V5 07/14] memcg: Add HugeTLB extension
  2012-04-09  6:04   ` KAMEZAWA Hiroyuki
@ 2012-04-09  8:43     ` Aneesh Kumar K.V
  2012-04-09  9:00       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 27+ messages in thread
From: Aneesh Kumar K.V @ 2012-04-09  8:43 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, mgorman, dhillf, aarcange, mhocko, akpm, hannes,
	linux-kernel, cgroups

KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> writes:

> (2012/04/07 3:50), Aneesh Kumar K.V wrote:
>
>> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
>> 
>> This patch implements a memcg extension that allows us to control HugeTLB
>> allocations via memory controller. The extension allows to limit the
>> HugeTLB usage per control group and enforces the controller limit during
>> page fault. Since HugeTLB doesn't support page reclaim, enforcing the limit
>> at page fault time implies that, the application will get SIGBUS signal if it
>> tries to access HugeTLB pages beyond its limit. This requires the application
>> to know beforehand how much HugeTLB pages it would require for its use.
>> 
>> The charge/uncharge calls will be added to HugeTLB code in later patch.
>> 
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
>
>
> Hmm, seems ok to me. please explain 'this patch doesn't include updates
> for memcg destroying, it will be in patch 12/14' or some...
>
> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
>
> BTW, you don't put res_counter for hugeltb under CONFIG_MEM_RES_CTLR_HUGETLB...
> do you think we need the config ?


That results in more #ifdef CONFIG_MEM_RES_CTLR_HUGETLB in the
memcg code (mem_cgroup_create/mem_cgroup_read/write etc). I was not
sure we want to do that. Let me know if you think we really need to do this.


-aneesh


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH -V5 10/14] hugetlbfs: Add memcg control files for hugetlbfs
  2012-04-09  6:00   ` KAMEZAWA Hiroyuki
@ 2012-04-09  8:46     ` Aneesh Kumar K.V
  0 siblings, 0 replies; 27+ messages in thread
From: Aneesh Kumar K.V @ 2012-04-09  8:46 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, mgorman, dhillf, aarcange, mhocko, akpm, hannes,
	linux-kernel, cgroups

KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> writes:

> (2012/04/07 3:50), Aneesh Kumar K.V wrote:
>
>> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
>> 
>> This add control files for hugetlbfs in memcg
>> 
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
>
>
> seems ok. This uses Tejun's new interface...right ?
>
> It might be better to explain "this patch depends on "cgroup: implement
> cgroup_add_cftypes() and friends" patch via cgroup development tree.

I have mentioned in the 0th patch that the series is on top of the cgroup
development tree.

-aneesh


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH -V5 07/14] memcg: Add HugeTLB extension
  2012-04-09  8:43     ` Aneesh Kumar K.V
@ 2012-04-09  9:00       ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 27+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-04-09  9:00 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-mm, mgorman, dhillf, aarcange, mhocko, akpm, hannes,
	linux-kernel, cgroups

(2012/04/09 17:43), Aneesh Kumar K.V wrote:

> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> writes:
> 
>> (2012/04/07 3:50), Aneesh Kumar K.V wrote:
>>
>>> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
>>>
>>> This patch implements a memcg extension that allows us to control HugeTLB
>>> allocations via memory controller. The extension allows to limit the
>>> HugeTLB usage per control group and enforces the controller limit during
>>> page fault. Since HugeTLB doesn't support page reclaim, enforcing the limit
>>> at page fault time implies that, the application will get SIGBUS signal if it
>>> tries to access HugeTLB pages beyond its limit. This requires the application
>>> to know beforehand how much HugeTLB pages it would require for its use.
>>>
>>> The charge/uncharge calls will be added to HugeTLB code in later patch.
>>>
>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
>>
>>
>> Hmm, seems ok to me. please explain 'this patch doesn't include updates
>> for memcg destroying, it will be in patch 12/14' or some...
>>
>> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>>
>>
>> BTW, you don't put res_counter for hugeltb under CONFIG_MEM_RES_CTLR_HUGETLB...
>> do you think we need the config ?
> 
> 
> That results in more #ifdef CONFIG_MEM_RES_CTLR_HUGETLB in the
> memcg code (mem_cgroup_create/mem_cgroup_read/write etc). I was not
> sure we want to do that. Let me know if you think we really need to do this.
> 


Hm, ok. BTW, how about removing CONFIG_MEM_RES_CTLR_HUGETLB altogether and making
all the code just depend on CONFIG_CGROUP_MEM_RES_CTLR && CONFIG_HUGETLB ?

What do others think ? (Anyway, we can do it later....)

Thanks,
-Kame



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH -V5 12/14] memcg: move HugeTLB resource count to parent cgroup on memcg removal
  2012-04-09  6:16   ` KAMEZAWA Hiroyuki
@ 2012-04-09 10:00     ` Aneesh Kumar K.V
  2012-04-10  6:55       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 27+ messages in thread
From: Aneesh Kumar K.V @ 2012-04-09 10:00 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, mgorman, dhillf, aarcange, mhocko, akpm, hannes,
	linux-kernel, cgroups

KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> writes:

> (2012/04/07 3:50), Aneesh Kumar K.V wrote:
>
>> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
>> 
>> This add support for memcg removal with HugeTLB resource usage.
>> 
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
>
>
> Hmm 
>
>

....
...

>> +	csize = PAGE_SIZE << compound_order(page);
>> +	/*
>> +	 * uncharge from child and charge the parent. If we have
>> +	 * use_hierarchy set, we can never fail here. In-order to make
>> +	 * sure we don't get -ENOMEM on parent charge, we first uncharge
>> +	 * the child and then charge the parent.
>> +	 */
>> +	if (parent->use_hierarchy) {
>
>
>> +		res_counter_uncharge(&memcg->hugepage[idx], csize);
>> +		if (!mem_cgroup_is_root(parent))
>> +			ret = res_counter_charge(&parent->hugepage[idx],
>> +						 csize, &fail_res);
>
>
> Ah, why is !mem_cgroup_is_root() checked ? no res_counter update for
> root cgroup ?

My mistake. An earlier version of the patch series didn't charge/uncharge the root
cgroup during the different operations. Later, as per your review, I updated
the charge/uncharge path to charge the root cgroup. I missed updating this code.

>
> I think it's better to have res_counter_move_parent()...to do ops in atomic.
> (I'll post a patch for that for my purpose). OR, just ignore res->usage if
> parent->use_hierarchy == 1.
>
> uncharge->charge will have a race.



How about the below

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7b6e79a..5b4bc98 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3351,24 +3351,24 @@ int mem_cgroup_move_hugetlb_parent(int idx, struct cgroup *cgroup,
 
 	csize = PAGE_SIZE << compound_order(page);
 	/*
-	 * uncharge from child and charge the parent. If we have
-	 * use_hierarchy set, we can never fail here. In-order to make
-	 * sure we don't get -ENOMEM on parent charge, we first uncharge
-	 * the child and then charge the parent.
+	 * If we have use_hierarchy set we can never fail here. So instead of
+	 * using res_counter_uncharge use the open-coded variant which just
+	 * uncharge the child res_counter. The parent will retain the charge.
 	 */
 	if (parent->use_hierarchy) {
-		res_counter_uncharge(&memcg->hugepage[idx], csize);
-		if (!mem_cgroup_is_root(parent))
-			ret = res_counter_charge(&parent->hugepage[idx],
-						 csize, &fail_res);
+		unsigned long flags;
+		struct res_counter *counter;
+
+		counter = &memcg->hugepage[idx];
+		spin_lock_irqsave(&counter->lock, flags);
+		res_counter_uncharge_locked(counter, csize);
+		spin_unlock_irqrestore(&counter->lock, flags);
 	} else {
-		if (!mem_cgroup_is_root(parent)) {
-			ret = res_counter_charge(&parent->hugepage[idx],
-						 csize, &fail_res);
-			if (ret) {
-				ret = -EBUSY;
-				goto err_out;
-			}
+		ret = res_counter_charge(&parent->hugepage[idx],
+					 csize, &fail_res);
+		if (ret) {
+			ret = -EBUSY;
+			goto err_out;
 		}
 		res_counter_uncharge(&memcg->hugepage[idx], csize);
 	}


>
>> +	} else {
>> +		if (!mem_cgroup_is_root(parent)) {
>> +			ret = res_counter_charge(&parent->hugepage[idx],
>> +						 csize, &fail_res);
>> +			if (ret) {
>> +				ret = -EBUSY;
>> +				goto err_out;
>> +			}
>> +		}
>> +		res_counter_uncharge(&memcg->hugepage[idx], csize);
>> +	}
>
>
> Just a notice. Recently, Tejun changed failure of pre_destory() to show WARNING.
> Then, I'd like to move the usage to the root cgroup if use_hierarchy=0.
> Will it work for you ?

That should work.


>
>> +	/*
>> +	 * caller should have done css_get
>> +	 */
>
>
> Could you explain meaning of this comment ?
>

Inherited from mem_cgroup_move_account. I guess it means the css cannot go
away at this point; we have done a css_get on the child. For a generic
move_account function maybe the comment is needed. I guess in our case
the comment is redundant ?

-aneesh


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH -V5 12/14] memcg: move HugeTLB resource count to parent cgroup on memcg removal
  2012-04-09 10:00     ` Aneesh Kumar K.V
@ 2012-04-10  6:55       ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 27+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-04-10  6:55 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-mm, mgorman, dhillf, aarcange, mhocko, akpm, hannes,
	linux-kernel, cgroups

(2012/04/09 19:00), Aneesh Kumar K.V wrote:

> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> writes:
> 
>> (2012/04/07 3:50), Aneesh Kumar K.V wrote:
>>
>>> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
>>>
>>> This add support for memcg removal with HugeTLB resource usage.
>>>
>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
>>
>>
>> Hmm 
>>
>>
> 
> ....
> ...
> 
>>> +	csize = PAGE_SIZE << compound_order(page);
>>> +	/*
>>> +	 * uncharge from child and charge the parent. If we have
>>> +	 * use_hierarchy set, we can never fail here. In-order to make
>>> +	 * sure we don't get -ENOMEM on parent charge, we first uncharge
>>> +	 * the child and then charge the parent.
>>> +	 */
>>> +	if (parent->use_hierarchy) {
>>
>>
>>> +		res_counter_uncharge(&memcg->hugepage[idx], csize);
>>> +		if (!mem_cgroup_is_root(parent))
>>> +			ret = res_counter_charge(&parent->hugepage[idx],
>>> +						 csize, &fail_res);
>>
>>
>> Ah, why is !mem_cgroup_is_root() checked ? no res_counter update for
>> root cgroup ?
> 
> My mistake. Earlier version of the patch series didn't charge/uncharge the root
> cgroup during different operations. Later as per your review I updated
> the charge/uncharge path to charge root cgroup. I missed to update this code.
> 
>>
>> I think it's better to have res_counter_move_parent()...to do ops in atomic.
>> (I'll post a patch for that for my purpose). OR, just ignore res->usage if
>> parent->use_hierarchy == 1.
>>
>> uncharge->charge will have a race.
> 
> 
> 
> How about the below
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 7b6e79a..5b4bc98 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3351,24 +3351,24 @@ int mem_cgroup_move_hugetlb_parent(int idx, struct cgroup *cgroup,
>  
>  	csize = PAGE_SIZE << compound_order(page);
>  	/*
> -	 * uncharge from child and charge the parent. If we have
> -	 * use_hierarchy set, we can never fail here. In-order to make
> -	 * sure we don't get -ENOMEM on parent charge, we first uncharge
> -	 * the child and then charge the parent.
> +	 * If we have use_hierarchy set we can never fail here. So instead of
> +	 * using res_counter_uncharge use the open-coded variant which just
> +	 * uncharge the child res_counter. The parent will retain the charge.
>  	 */
>  	if (parent->use_hierarchy) {
> -		res_counter_uncharge(&memcg->hugepage[idx], csize);
> -		if (!mem_cgroup_is_root(parent))
> -			ret = res_counter_charge(&parent->hugepage[idx],
> -						 csize, &fail_res);
> +		unsigned long flags;
> +		struct res_counter *counter;
> +
> +		counter = &memcg->hugepage[idx];
> +		spin_lock_irqsave(&counter->lock, flags);
> +		res_counter_uncharge_locked(counter, csize);


Hm, I see, uncharge_locked is not propagated to the parent.
OK, it seems to work, but please add a detailed comment here,
or define res_counter_move_parent().
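
For reference, a minimal sketch of what such a helper could look like,
assuming the locking conventions used in the diff above; this
res_counter_move_parent() does not exist in the tree, and its name and
signature here are only illustrative:

static void res_counter_move_parent(struct res_counter *child,
				    unsigned long val)
{
	unsigned long flags;

	spin_lock_irqsave(&child->lock, flags);
	/*
	 * With use_hierarchy set, every charge in the child is already
	 * reflected in its ancestors, so "moving" the charge to the
	 * parent only needs to drop it from the child counter itself;
	 * the ancestors keep their usage.
	 */
	res_counter_uncharge_locked(child, val);
	spin_unlock_irqrestore(&child->lock, flags);
}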

> +		spin_unlock_irqrestore(&counter->lock, flags);
>  	} else {
> -		if (!mem_cgroup_is_root(parent)) {
> -			ret = res_counter_charge(&parent->hugepage[idx],
> -						 csize, &fail_res);
> -			if (ret) {
> -				ret = -EBUSY;
> -				goto err_out;
> -			}
> +		ret = res_counter_charge(&parent->hugepage[idx],
> +					 csize, &fail_res);
> +		if (ret) {
> +			ret = -EBUSY;
> +			goto err_out;
>  		}
>  		res_counter_uncharge(&memcg->hugepage[idx], csize);
>  	}
> 
> 
>>
>>> +	} else {
>>> +		if (!mem_cgroup_is_root(parent)) {
>>> +			ret = res_counter_charge(&parent->hugepage[idx],
>>> +						 csize, &fail_res);
>>> +			if (ret) {
>>> +				ret = -EBUSY;
>>> +				goto err_out;
>>> +			}
>>> +		}
>>> +		res_counter_uncharge(&memcg->hugepage[idx], csize);
>>> +	}
>>
>>
>> Just a note: Tejun recently changed pre_destroy() failures to trigger a
>> WARNING. Given that, I'd like to move the usage to the root cgroup when
>> use_hierarchy=0. Will that work for you?
> 
> That should work.
> 

OK, I'll go ahead that way.

> 
>>
>>> +	/*
>>> +	 * caller should have done css_get
>>> +	 */
>>
>>
>> Could you explain the meaning of this comment?
>>
> 
> Inherited from mem_cgroup_move_account. I guess it means the css cannot go
> away at this point: we have done a css_get on the child. For a generic
> move_account function the comment may be needed, but in our case I guess
> it is redundant?
> 


Ah, IIUC, this code is the hugetlb version of mem_cgroup_move_parent().
In move_parent(), we don't need to take care of css reference counting
because we're moving from an existing cgroup to a cgroup which cannot be
destroyed. (move_account() is the function that moves accounting between
arbitrary cgroups.)

So, yes, please remove the comment.
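
As a rough illustration of that difference (hypothetical helper name,
not the actual memcontrol.c code): in the move-to-parent path the
destination is always cgroup->parent, which cgroup core keeps alive as
long as the child exists, so only a generic move between two arbitrary
cgroups has to pin both ends with css_get()/css_put().

static struct mem_cgroup *hugetlb_move_destination(struct cgroup *cgroup)
{
	/* the parent cannot be removed while this child still exists */
	struct cgroup *pcg = cgroup->parent;

	return mem_cgroup_from_cont(pcg);
}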

Thanks,
-Kame



^ permalink raw reply	[flat|nested] 27+ messages in thread


Thread overview: 27+ messages
2012-04-06 18:50 [PATCH -V5 00/14] memcg: Add memcg extension to control HugeTLB allocation Aneesh Kumar K.V
2012-04-06 18:50 ` [PATCH -V5 01/14] hugetlb: rename max_hstate to hugetlb_max_hstate Aneesh Kumar K.V
2012-04-06 18:50 ` [PATCH -V5 02/14] hugetlbfs: don't use ERR_PTR with VM_FAULT* values Aneesh Kumar K.V
2012-04-06 18:50 ` [PATCH -V5 03/14] hugetlbfs: Add an inline helper for finding hstate index Aneesh Kumar K.V
2012-04-06 18:50 ` [PATCH -V5 04/14] hugetlb: Use mmu_gather instead of a temporary linked list for accumulating pages Aneesh Kumar K.V
2012-04-09  5:36   ` KAMEZAWA Hiroyuki
2012-04-06 18:50 ` [PATCH -V5 05/14] hugetlb: Avoid taking i_mmap_mutex in unmap_single_vma for hugetlb Aneesh Kumar K.V
2012-04-06 18:50 ` [PATCH -V5 06/14] hugetlb: Simplify migrate_huge_page Aneesh Kumar K.V
2012-04-09  5:47   ` KAMEZAWA Hiroyuki
2012-04-09  8:36     ` Aneesh Kumar K.V
2012-04-06 18:50 ` [PATCH -V5 07/14] memcg: Add HugeTLB extension Aneesh Kumar K.V
2012-04-09  6:04   ` KAMEZAWA Hiroyuki
2012-04-09  8:43     ` Aneesh Kumar K.V
2012-04-09  9:00       ` KAMEZAWA Hiroyuki
2012-04-06 18:50 ` [PATCH -V5 08/14] hugetlb: add charge/uncharge calls for HugeTLB alloc/free Aneesh Kumar K.V
2012-04-06 18:50 ` [PATCH -V5 09/14] memcg: track resource index in cftype private Aneesh Kumar K.V
2012-04-09  5:56   ` KAMEZAWA Hiroyuki
2012-04-06 18:50 ` [PATCH -V5 10/14] hugetlbfs: Add memcg control files for hugetlbfs Aneesh Kumar K.V
2012-04-09  6:00   ` KAMEZAWA Hiroyuki
2012-04-09  8:46     ` Aneesh Kumar K.V
2012-04-06 18:50 ` [PATCH -V5 11/14] hugetlbfs: Add a list for tracking in-use HugeTLB pages Aneesh Kumar K.V
2012-04-06 18:50 ` [PATCH -V5 12/14] memcg: move HugeTLB resource count to parent cgroup on memcg removal Aneesh Kumar K.V
2012-04-09  6:16   ` KAMEZAWA Hiroyuki
2012-04-09 10:00     ` Aneesh Kumar K.V
2012-04-10  6:55       ` KAMEZAWA Hiroyuki
2012-04-06 18:50 ` [PATCH -V5 13/14] hugetlb: migrate memcg info from oldpage to new page during migration Aneesh Kumar K.V
2012-04-06 18:51 ` [PATCH -V5 14/14] memcg: Add memory controller documentation for hugetlb management Aneesh Kumar K.V
