* [PATCH v2 00/20] mm, hugetlb: remove a hugetlb_instantiation_mutex
From: Joonsoo Kim @ 2013-08-09  9:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Michal Hocko, Aneesh Kumar K.V,
	KAMEZAWA Hiroyuki, Hugh Dickins, Davidlohr Bueso, David Gibson,
	linux-mm, linux-kernel, Joonsoo Kim, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton, Joonsoo Kim

Without the hugetlb_instantiation_mutex, if parallel faults occur, we can
fail to allocate a hugepage, because many threads dequeue a hugepage to
handle a fault at the same address. This makes the reserved pool run short
for a little while, which causes a faulting thread to get a SIGBUS signal
even though there are enough hugepages.
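
To make the failure mode concrete, below is a minimal userspace sketch of
the kind of racing access that triggers it. This is only an illustration,
not part of the series; the hugetlbfs mount point, huge page size and
thread count are placeholder assumptions.

/*
 * Illustrative reproducer sketch (not part of this series).  Assumes a
 * hugetlbfs mount at /mnt/huge and a 2MB huge page size; both are
 * placeholders, adjust for your system.  Build with: gcc -pthread.
 */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define HPAGE_SIZE	(2UL * 1024 * 1024)
#define NTHREADS	16

static char *map;

static void *fault_thread(void *arg)
{
	/* All threads fault the same huge page concurrently; without the
	 * mutex, one of them can get SIGBUS here while the reserved pool
	 * is momentarily depleted by the racing dequeues. */
	map[0] = 1;
	return NULL;
}

int main(void)
{
	pthread_t th[NTHREADS];
	int fd, i;

	fd = open("/mnt/huge/race-test", O_CREAT | O_RDWR, 0600);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	map = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE,
		   MAP_SHARED, fd, 0);
	if (map == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	for (i = 0; i < NTHREADS; i++)
		pthread_create(&th[i], NULL, fault_thread, NULL);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(th[i], NULL);

	munmap(map, HPAGE_SIZE);
	close(fd);
	return 0;
}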

To solve this problem, we already have a nice solution, namely the
hugetlb_instantiation_mutex, which blocks other threads from diving into
the fault handler. It solves the problem cleanly, but it introduces
performance degradation, because it serializes all fault handling.
    
Now, I try to remove the hugetlb_instantiation_mutex to get rid of the
performance problem reported by Davidlohr Bueso [1].

This patchset roughly consists of 4 parts.

Part 1. (1-6) Assorted fixes and clean-ups, enhancing error handling.
	
	These can be merged into mainline separately.

Part 2. (7-9) Protect region tracking via its own spinlock, instead of
	the hugetlb_instantiation_mutex.

	Breaking the dependency on the hugetlb_instantiation_mutex for
	tracking a region is also needed by other approaches such as
	'table mutexes', so these can be merged into mainline separately.

Part 3. (10-13) Clean-up.

	IMO, these make the code much simpler, so they are worth going into
	mainline separately, regardless of whether my approach succeeds.

Part 4. (14-20) Remove the hugetlb_instantiation_mutex.

	Most of these patches just clean up the error handling path.
	In patch 19, a retry approach is implemented: if a faulting thread
	fails to allocate a hugepage, it keeps running the fault handler
	until there is no concurrent thread holding a hugepage. This
	serializes the threads that want the last hugepage, so threads
	don't get a SIGBUS if enough hugepages exist; a rough sketch of
	the idea follows below. In patch 20, the
	hugetlb_instantiation_mutex is finally removed.
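
For reference, the retry idea can be illustrated with the following
userspace analogy. It is only a rough sketch of the concept (a pool counter
plus a count of concurrent holders); the real patch works on the hugetlb
fault path, and all names below are made up for the illustration.

/*
 * Userspace analogy of the retry idea in patch 19 (a sketch only, not
 * kernel code): threads "allocate" from a tiny pool; on failure a thread
 * retries as long as some other thread still holds an item it may give
 * back, and only fails hard once no concurrent holder remains.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_int pool = 1;	/* one "last hugepage" left */
static atomic_int holders;	/* threads currently holding a page */

static bool try_alloc(void)
{
	int avail = atomic_load(&pool);

	while (avail > 0) {
		if (atomic_compare_exchange_weak(&pool, &avail, avail - 1)) {
			atomic_fetch_add(&holders, 1);
			return true;
		}
	}
	return false;
}

static void give_back(void)
{
	/* Publish the page before dropping the holder count. */
	atomic_fetch_add(&pool, 1);
	atomic_fetch_sub(&holders, 1);
}

static void *fault_handler(void *arg)
{
	for (;;) {
		if (try_alloc())
			break;
		/*
		 * Allocation failed.  Retry while some concurrent thread
		 * still holds a page it may return; once the holder count
		 * is zero, take one last look at the pool and then treat
		 * it as a real shortage (where the kernel raises SIGBUS).
		 */
		if (atomic_load(&holders) == 0) {
			if (try_alloc())
				break;
			fprintf(stderr, "genuine shortage\n");
			return NULL;
		}
	}

	/* "Use" the page briefly, then return it to the pool, as the
	 * duplicate allocations for an already-handled fault are. */
	give_back();
	return NULL;
}

int main(void)
{
	pthread_t th[8];
	int i;

	for (i = 0; i < 8; i++)
		pthread_create(&th[i], NULL, fault_handler, NULL);
	for (i = 0; i < 8; i++)
		pthread_join(th[i], NULL);
	return 0;
}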

These patches are based on my previous patchset [2], which is now in mmotm.
In my compile testing, [2] and this patchset apply cleanly to v3.11-rc4,
but I did the runtime testing of this patchset on top of v3.10 :)

With these applied, I cleanly passed the libhugetlbfs test suite, which
includes allocation-instantiation race test cases.

If there is anything I should consider, please let me know!
Thanks.

* Changes in v2
- Re-order patches to clarify their relationships
- Do sleepable object allocation (kmalloc) without holding a spinlock
	(pointed out by Hillf)
- Remove vma_has_reserves() instead of vma_needs_reservation()
	(suggested by Aneesh and Naoya)
- Change the way a hugepage is returned to the reserved pool
	(suggested by Naoya)


[1] http://lwn.net/Articles/558863/ 
	"[PATCH] mm/hugetlb: per-vma instantiation mutexes"
[2] https://lkml.org/lkml/2013/7/22/96
	"[PATCH v2 00/10] mm, hugetlb: clean-up and possible bug fix"

Joonsoo Kim (20):
  mm, hugetlb: protect reserved pages when soft offlining a hugepage
  mm, hugetlb: change variable name reservations to resv
  mm, hugetlb: fix subpool accounting handling
  mm, hugetlb: remove useless check about mapping type
  mm, hugetlb: grab a page_table_lock after page_cache_release
  mm, hugetlb: return a reserved page to a reserved pool if failed
  mm, hugetlb: unify region structure handling
  mm, hugetlb: region manipulation functions take resv_map rather
    list_head
  mm, hugetlb: protect region tracking via newly introduced resv_map
    lock
  mm, hugetlb: remove resv_map_put()
  mm, hugetlb: make vma_resv_map() works for all mapping type
  mm, hugetlb: remove vma_has_reserves()
  mm, hugetlb: unify chg and avoid_reserve to use_reserve
  mm, hugetlb: call vma_needs_reservation before entering
    alloc_huge_page()
  mm, hugetlb: remove a check for return value of alloc_huge_page()
  mm, hugetlb: move down outside_reserve check
  mm, hugetlb: move up anon_vma_prepare()
  mm, hugetlb: clean-up error handling in hugetlb_cow()
  mm, hugetlb: retry if failed to allocate and there is concurrent user
  mm, hugetlb: remove a hugetlb_instantiation_mutex

 fs/hugetlbfs/inode.c    |   16 +-
 include/linux/hugetlb.h |   11 ++
 mm/hugetlb.c            |  419 +++++++++++++++++++++++++----------------------
 3 files changed, 250 insertions(+), 196 deletions(-)

-- 
1.7.9.5


* [PATCH v2 01/20] mm, hugetlb: protect reserved pages when soft offlining a hugepage
From: Joonsoo Kim @ 2013-08-09  9:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Michal Hocko, Aneesh Kumar K.V,
	KAMEZAWA Hiroyuki, Hugh Dickins, Davidlohr Bueso, David Gibson,
	linux-mm, linux-kernel, Joonsoo Kim, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton, Joonsoo Kim

Don't use the reserve pool when soft offlining a hugepage. Check that
we have free pages outside of the reserve pool before we dequeue the
huge page; otherwise, we can steal another task's reserved page.

Reviewed-by: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6782b41..d971233 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -935,10 +935,11 @@ static struct page *alloc_buddy_huge_page(struct hstate *h, int nid)
  */
 struct page *alloc_huge_page_node(struct hstate *h, int nid)
 {
-	struct page *page;
+	struct page *page = NULL;
 
 	spin_lock(&hugetlb_lock);
-	page = dequeue_huge_page_node(h, nid);
+	if (h->free_huge_pages - h->resv_huge_pages > 0)
+		page = dequeue_huge_page_node(h, nid);
 	spin_unlock(&hugetlb_lock);
 
 	if (!page)
-- 
1.7.9.5


* [PATCH v2 02/20] mm, hugetlb: change variable name reservations to resv
From: Joonsoo Kim @ 2013-08-09  9:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Michal Hocko, Aneesh Kumar K.V,
	KAMEZAWA Hiroyuki, Hugh Dickins, Davidlohr Bueso, David Gibson,
	linux-mm, linux-kernel, Joonsoo Kim, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton, Joonsoo Kim

'reservations' is a rather long variable name, and we already use
'resv_map' to represent 'struct resv_map' elsewhere. Rename it to reduce
confusion and improve readability.

Reviewed-by: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d971233..12b6581 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1095,9 +1095,9 @@ static long vma_needs_reservation(struct hstate *h,
 	} else  {
 		long err;
 		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
-		struct resv_map *reservations = vma_resv_map(vma);
+		struct resv_map *resv = vma_resv_map(vma);
 
-		err = region_chg(&reservations->regions, idx, idx + 1);
+		err = region_chg(&resv->regions, idx, idx + 1);
 		if (err < 0)
 			return err;
 		return 0;
@@ -1115,10 +1115,10 @@ static void vma_commit_reservation(struct hstate *h,
 
 	} else if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
 		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
-		struct resv_map *reservations = vma_resv_map(vma);
+		struct resv_map *resv = vma_resv_map(vma);
 
 		/* Mark this page used in the map. */
-		region_add(&reservations->regions, idx, idx + 1);
+		region_add(&resv->regions, idx, idx + 1);
 	}
 }
 
@@ -2168,7 +2168,7 @@ out:
 
 static void hugetlb_vm_op_open(struct vm_area_struct *vma)
 {
-	struct resv_map *reservations = vma_resv_map(vma);
+	struct resv_map *resv = vma_resv_map(vma);
 
 	/*
 	 * This new VMA should share its siblings reservation map if present.
@@ -2178,34 +2178,34 @@ static void hugetlb_vm_op_open(struct vm_area_struct *vma)
 	 * after this open call completes.  It is therefore safe to take a
 	 * new reference here without additional locking.
 	 */
-	if (reservations)
-		kref_get(&reservations->refs);
+	if (resv)
+		kref_get(&resv->refs);
 }
 
 static void resv_map_put(struct vm_area_struct *vma)
 {
-	struct resv_map *reservations = vma_resv_map(vma);
+	struct resv_map *resv = vma_resv_map(vma);
 
-	if (!reservations)
+	if (!resv)
 		return;
-	kref_put(&reservations->refs, resv_map_release);
+	kref_put(&resv->refs, resv_map_release);
 }
 
 static void hugetlb_vm_op_close(struct vm_area_struct *vma)
 {
 	struct hstate *h = hstate_vma(vma);
-	struct resv_map *reservations = vma_resv_map(vma);
+	struct resv_map *resv = vma_resv_map(vma);
 	struct hugepage_subpool *spool = subpool_vma(vma);
 	unsigned long reserve;
 	unsigned long start;
 	unsigned long end;
 
-	if (reservations) {
+	if (resv) {
 		start = vma_hugecache_offset(h, vma, vma->vm_start);
 		end = vma_hugecache_offset(h, vma, vma->vm_end);
 
 		reserve = (end - start) -
-			region_count(&reservations->regions, start, end);
+			region_count(&resv->regions, start, end);
 
 		resv_map_put(vma);
 
-- 
1.7.9.5


* [PATCH v2 03/20] mm, hugetlb: fix subpool accounting handling
From: Joonsoo Kim @ 2013-08-09  9:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Michal Hocko, Aneesh Kumar K.V,
	KAMEZAWA Hiroyuki, Hugh Dickins, Davidlohr Bueso, David Gibson,
	linux-mm, linux-kernel, Joonsoo Kim, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton, Joonsoo Kim

If we allocate a hugepage with avoid_reserve set, we don't dequeue a
reserved one, so we should also account against the subpool counter in
the avoid_reserve case. This patch implements that.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 12b6581..ea1ae0a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1144,13 +1144,14 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
 	chg = vma_needs_reservation(h, vma, addr);
 	if (chg < 0)
 		return ERR_PTR(-ENOMEM);
-	if (chg)
-		if (hugepage_subpool_get_pages(spool, chg))
+	if (chg || avoid_reserve)
+		if (hugepage_subpool_get_pages(spool, 1))
 			return ERR_PTR(-ENOSPC);
 
 	ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg);
 	if (ret) {
-		hugepage_subpool_put_pages(spool, chg);
+		if (chg || avoid_reserve)
+			hugepage_subpool_put_pages(spool, 1);
 		return ERR_PTR(-ENOSPC);
 	}
 	spin_lock(&hugetlb_lock);
@@ -1162,7 +1163,8 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
 			hugetlb_cgroup_uncharge_cgroup(idx,
 						       pages_per_huge_page(h),
 						       h_cg);
-			hugepage_subpool_put_pages(spool, chg);
+			if (chg || avoid_reserve)
+				hugepage_subpool_put_pages(spool, 1);
 			return ERR_PTR(-ENOSPC);
 		}
 		spin_lock(&hugetlb_lock);
-- 
1.7.9.5


* [PATCH v2 04/20] mm, hugetlb: remove useless check about mapping type
From: Joonsoo Kim @ 2013-08-09  9:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Michal Hocko, Aneesh Kumar K.V,
	KAMEZAWA Hiroyuki, Hugh Dickins, Davidlohr Bueso, David Gibson,
	linux-mm, linux-kernel, Joonsoo Kim, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton, Joonsoo Kim

is_vma_resv_set(vma, HPAGE_RESV_OWNER) implies that this mapping is
private, so we don't need to additionally check whether this mapping is
shared or not.

This patch is just a clean-up.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ea1ae0a..c017c52 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2544,8 +2544,7 @@ retry_avoidcopy:
 	 * at the time of fork() could consume its reserves on COW instead
 	 * of the full address range.
 	 */
-	if (!(vma->vm_flags & VM_MAYSHARE) &&
-			is_vma_resv_set(vma, HPAGE_RESV_OWNER) &&
+	if (is_vma_resv_set(vma, HPAGE_RESV_OWNER) &&
 			old_page != pagecache_page)
 		outside_reserve = 1;
 
-- 
1.7.9.5


* [PATCH v2 05/20] mm, hugetlb: grab a page_table_lock after page_cache_release
From: Joonsoo Kim @ 2013-08-09  9:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Michal Hocko, Aneesh Kumar K.V,
	KAMEZAWA Hiroyuki, Hugh Dickins, Davidlohr Bueso, David Gibson,
	linux-mm, linux-kernel, Joonsoo Kim, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton, Joonsoo Kim

We don't need to hold the page_table_lock while we release the pages,
so defer re-taking the page_table_lock until after they are released.

Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c017c52..6c8eec2 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2627,10 +2627,11 @@ retry_avoidcopy:
 	}
 	spin_unlock(&mm->page_table_lock);
 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
-	/* Caller expects lock to be held */
-	spin_lock(&mm->page_table_lock);
 	page_cache_release(new_page);
 	page_cache_release(old_page);
+
+	/* Caller expects lock to be held */
+	spin_lock(&mm->page_table_lock);
 	return 0;
 }
 
-- 
1.7.9.5


* [PATCH v2 06/20] mm, hugetlb: return a reserved page to a reserved pool if failed
From: Joonsoo Kim @ 2013-08-09  9:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Michal Hocko, Aneesh Kumar K.V,
	KAMEZAWA Hiroyuki, Hugh Dickins, Davidlohr Bueso, David Gibson,
	linux-mm, linux-kernel, Joonsoo Kim, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton, Joonsoo Kim

If we fail after getting a reserved page, just calling put_page() is not
sufficient, because put_page() ends up invoking free_huge_page(), which
doesn't know whether the page came from the reserved pool, so it does
nothing about the reserve count. This leaves the reserve count lower than
it should be, because it was already decremented in
dequeue_huge_page_vma(). This patch fixes that situation.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6c8eec2..3f834f1 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -572,6 +572,7 @@ retry_cpuset:
 				if (!vma_has_reserves(vma, chg))
 					break;
 
+				SetPagePrivate(page);
 				h->resv_huge_pages--;
 				break;
 			}
@@ -626,15 +627,20 @@ static void free_huge_page(struct page *page)
 	int nid = page_to_nid(page);
 	struct hugepage_subpool *spool =
 		(struct hugepage_subpool *)page_private(page);
+	bool restore_reserve;
 
 	set_page_private(page, 0);
 	page->mapping = NULL;
 	BUG_ON(page_count(page));
 	BUG_ON(page_mapcount(page));
+	restore_reserve = PagePrivate(page);
 
 	spin_lock(&hugetlb_lock);
 	hugetlb_cgroup_uncharge_page(hstate_index(h),
 				     pages_per_huge_page(h), page);
+	if (restore_reserve)
+		h->resv_huge_pages++;
+
 	if (h->surplus_huge_pages_node[nid] && huge_page_order(h) < MAX_ORDER) {
 		/* remove the page from active list */
 		list_del(&page->lru);
@@ -2616,6 +2622,8 @@ retry_avoidcopy:
 	spin_lock(&mm->page_table_lock);
 	ptep = huge_pte_offset(mm, address & huge_page_mask(h));
 	if (likely(pte_same(huge_ptep_get(ptep), pte))) {
+		ClearPagePrivate(new_page);
+
 		/* Break COW */
 		huge_ptep_clear_flush(vma, address, ptep);
 		set_huge_pte_at(mm, address, ptep,
@@ -2727,6 +2735,7 @@ retry:
 					goto retry;
 				goto out;
 			}
+			ClearPagePrivate(page);
 
 			spin_lock(&inode->i_lock);
 			inode->i_blocks += blocks_per_huge_page(h);
@@ -2773,8 +2782,10 @@ retry:
 	if (!huge_pte_none(huge_ptep_get(ptep)))
 		goto backout;
 
-	if (anon_rmap)
+	if (anon_rmap) {
+		ClearPagePrivate(page);
 		hugepage_add_new_anon_rmap(page, vma, address);
+	}
 	else
 		page_dup_rmap(page);
 	new_pte = make_huge_pte(vma, page, ((vma->vm_flags & VM_WRITE)
-- 
1.7.9.5


* [PATCH v2 07/20] mm, hugetlb: unify region structure handling
From: Joonsoo Kim @ 2013-08-09  9:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Michal Hocko, Aneesh Kumar K.V,
	KAMEZAWA Hiroyuki, Hugh Dickins, Davidlohr Bueso, David Gibson,
	linux-mm, linux-kernel, Joonsoo Kim, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton, Joonsoo Kim

Currently, to track reserved and allocated regions, we use two different
schemes for MAP_SHARED and MAP_PRIVATE: for MAP_SHARED we use the
address_space's private_list, and for MAP_PRIVATE we use a resv_map.
We are preparing to replace the coarse-grained lock protecting the region
structures with a fine-grained one, and this difference gets in the way.
So, before changing the locking, unify region structure handling.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index a3f868a..9bf2c4a 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -366,7 +366,12 @@ static void truncate_hugepages(struct inode *inode, loff_t lstart)
 
 static void hugetlbfs_evict_inode(struct inode *inode)
 {
+	struct resv_map *resv_map;
+
 	truncate_hugepages(inode, 0);
+	resv_map = (struct resv_map *)inode->i_mapping->private_data;
+	if (resv_map)
+		kref_put(&resv_map->refs, resv_map_release);
 	clear_inode(inode);
 }
 
@@ -468,6 +473,11 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb,
 					umode_t mode, dev_t dev)
 {
 	struct inode *inode;
+	struct resv_map *resv_map;
+
+	resv_map = resv_map_alloc();
+	if (!resv_map)
+		return NULL;
 
 	inode = new_inode(sb);
 	if (inode) {
@@ -477,7 +487,7 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb,
 		inode->i_mapping->a_ops = &hugetlbfs_aops;
 		inode->i_mapping->backing_dev_info =&hugetlbfs_backing_dev_info;
 		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
-		INIT_LIST_HEAD(&inode->i_mapping->private_list);
+		inode->i_mapping->private_data = resv_map;
 		info = HUGETLBFS_I(inode);
 		/*
 		 * The policy is initialized here even if we are creating a
@@ -507,7 +517,9 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb,
 			break;
 		}
 		lockdep_annotate_inode_mutex_key(inode);
-	}
+	} else
+		kref_put(&resv_map->refs, resv_map_release);
+
 	return inode;
 }
 
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 6b4890f..2677c07 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -5,6 +5,8 @@
 #include <linux/fs.h>
 #include <linux/hugetlb_inline.h>
 #include <linux/cgroup.h>
+#include <linux/list.h>
+#include <linux/kref.h>
 
 struct ctl_table;
 struct user_struct;
@@ -22,6 +24,13 @@ struct hugepage_subpool {
 	long max_hpages, used_hpages;
 };
 
+struct resv_map {
+	struct kref refs;
+	struct list_head regions;
+};
+extern struct resv_map *resv_map_alloc(void);
+void resv_map_release(struct kref *ref);
+
 extern spinlock_t hugetlb_lock;
 extern int hugetlb_max_hstate __read_mostly;
 #define for_each_hstate(h) \
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3f834f1..8751e2c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -375,12 +375,7 @@ static void set_vma_private_data(struct vm_area_struct *vma,
 	vma->vm_private_data = (void *)value;
 }
 
-struct resv_map {
-	struct kref refs;
-	struct list_head regions;
-};
-
-static struct resv_map *resv_map_alloc(void)
+struct resv_map *resv_map_alloc(void)
 {
 	struct resv_map *resv_map = kmalloc(sizeof(*resv_map), GFP_KERNEL);
 	if (!resv_map)
@@ -392,7 +387,7 @@ static struct resv_map *resv_map_alloc(void)
 	return resv_map;
 }
 
-static void resv_map_release(struct kref *ref)
+void resv_map_release(struct kref *ref)
 {
 	struct resv_map *resv_map = container_of(ref, struct resv_map, refs);
 
@@ -1092,8 +1087,9 @@ static long vma_needs_reservation(struct hstate *h,
 
 	if (vma->vm_flags & VM_MAYSHARE) {
 		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
-		return region_chg(&inode->i_mapping->private_list,
-							idx, idx + 1);
+		struct resv_map *resv = inode->i_mapping->private_data;
+
+		return region_chg(&resv->regions, idx, idx + 1);
 
 	} else if (!is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
 		return 1;
@@ -1117,7 +1113,9 @@ static void vma_commit_reservation(struct hstate *h,
 
 	if (vma->vm_flags & VM_MAYSHARE) {
 		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
-		region_add(&inode->i_mapping->private_list, idx, idx + 1);
+		struct resv_map *resv = inode->i_mapping->private_data;
+
+		region_add(&resv->regions, idx, idx + 1);
 
 	} else if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
 		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
@@ -3074,6 +3072,7 @@ int hugetlb_reserve_pages(struct inode *inode,
 	long ret, chg;
 	struct hstate *h = hstate_inode(inode);
 	struct hugepage_subpool *spool = subpool_inode(inode);
+	struct resv_map *resv_map;
 
 	/*
 	 * Only apply hugepage reservation if asked. At fault time, an
@@ -3089,10 +3088,13 @@ int hugetlb_reserve_pages(struct inode *inode,
 	 * to reserve the full area even if read-only as mprotect() may be
 	 * called to make the mapping read-write. Assume !vma is a shm mapping
 	 */
-	if (!vma || vma->vm_flags & VM_MAYSHARE)
-		chg = region_chg(&inode->i_mapping->private_list, from, to);
-	else {
-		struct resv_map *resv_map = resv_map_alloc();
+	if (!vma || vma->vm_flags & VM_MAYSHARE) {
+		resv_map = inode->i_mapping->private_data;
+
+		chg = region_chg(&resv_map->regions, from, to);
+
+	} else {
+		resv_map = resv_map_alloc();
 		if (!resv_map)
 			return -ENOMEM;
 
@@ -3135,7 +3137,7 @@ int hugetlb_reserve_pages(struct inode *inode,
 	 * else has to be done for private mappings here
 	 */
 	if (!vma || vma->vm_flags & VM_MAYSHARE)
-		region_add(&inode->i_mapping->private_list, from, to);
+		region_add(&resv_map->regions, from, to);
 	return 0;
 out_err:
 	if (vma)
@@ -3146,9 +3148,12 @@ out_err:
 void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
 {
 	struct hstate *h = hstate_inode(inode);
-	long chg = region_truncate(&inode->i_mapping->private_list, offset);
+	struct resv_map *resv_map = inode->i_mapping->private_data;
+	long chg = 0;
 	struct hugepage_subpool *spool = subpool_inode(inode);
 
+	if (resv_map)
+		chg = region_truncate(&resv_map->regions, offset);
 	spin_lock(&inode->i_lock);
 	inode->i_blocks -= (blocks_per_huge_page(h) * freed);
 	spin_unlock(&inode->i_lock);
-- 
1.7.9.5


* [PATCH v2 08/20] mm, hugetlb: region manipulation functions take resv_map rather list_head
From: Joonsoo Kim @ 2013-08-09  9:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Michal Hocko, Aneesh Kumar K.V,
	KAMEZAWA Hiroyuki, Hugh Dickins, Davidlohr Bueso, David Gibson,
	linux-mm, linux-kernel, Joonsoo Kim, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton, Joonsoo Kim

To move region tracking to a finer-grained protection scheme, pass the
resv_map, instead of the list_head, to the region manipulation functions.
This introduces no functional change; it merely prepares for the next
step.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 8751e2c..d9cabf6 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -150,8 +150,9 @@ struct file_region {
 	long to;
 };
 
-static long region_add(struct list_head *head, long f, long t)
+static long region_add(struct resv_map *resv, long f, long t)
 {
+	struct list_head *head = &resv->regions;
 	struct file_region *rg, *nrg, *trg;
 
 	/* Locate the region we are either in or before. */
@@ -186,8 +187,9 @@ static long region_add(struct list_head *head, long f, long t)
 	return 0;
 }
 
-static long region_chg(struct list_head *head, long f, long t)
+static long region_chg(struct resv_map *resv, long f, long t)
 {
+	struct list_head *head = &resv->regions;
 	struct file_region *rg, *nrg;
 	long chg = 0;
 
@@ -235,8 +237,9 @@ static long region_chg(struct list_head *head, long f, long t)
 	return chg;
 }
 
-static long region_truncate(struct list_head *head, long end)
+static long region_truncate(struct resv_map *resv, long end)
 {
+	struct list_head *head = &resv->regions;
 	struct file_region *rg, *trg;
 	long chg = 0;
 
@@ -265,8 +268,9 @@ static long region_truncate(struct list_head *head, long end)
 	return chg;
 }
 
-static long region_count(struct list_head *head, long f, long t)
+static long region_count(struct resv_map *resv, long f, long t)
 {
+	struct list_head *head = &resv->regions;
 	struct file_region *rg;
 	long chg = 0;
 
@@ -392,7 +396,7 @@ void resv_map_release(struct kref *ref)
 	struct resv_map *resv_map = container_of(ref, struct resv_map, refs);
 
 	/* Clear out any active regions before we release the map. */
-	region_truncate(&resv_map->regions, 0);
+	region_truncate(resv_map, 0);
 	kfree(resv_map);
 }
 
@@ -1099,7 +1103,7 @@ static long vma_needs_reservation(struct hstate *h,
 		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
 		struct resv_map *resv = vma_resv_map(vma);
 
-		err = region_chg(&resv->regions, idx, idx + 1);
+		err = region_chg(resv, idx, idx + 1);
 		if (err < 0)
 			return err;
 		return 0;
@@ -1121,9 +1125,8 @@ static void vma_commit_reservation(struct hstate *h,
 		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
 		struct resv_map *resv = vma_resv_map(vma);
 
-		/* Mark this page used in the map. */
-		region_add(&resv->regions, idx, idx + 1);
-	}
+	idx = vma_hugecache_offset(h, vma, addr);
+	region_add(resv, idx, idx + 1);
 }
 
 static struct page *alloc_huge_page(struct vm_area_struct *vma,
@@ -2211,7 +2214,7 @@ static void hugetlb_vm_op_close(struct vm_area_struct *vma)
 		end = vma_hugecache_offset(h, vma, vma->vm_end);
 
 		reserve = (end - start) -
-			region_count(&resv->regions, start, end);
+			region_count(resv, start, end);
 
 		resv_map_put(vma);
 
@@ -3091,7 +3094,7 @@ int hugetlb_reserve_pages(struct inode *inode,
 	if (!vma || vma->vm_flags & VM_MAYSHARE) {
 		resv_map = inode->i_mapping->private_data;
 
-		chg = region_chg(&resv_map->regions, from, to);
+		chg = region_chg(resv_map, from, to);
 
 	} else {
 		resv_map = resv_map_alloc();
@@ -3137,7 +3140,7 @@ int hugetlb_reserve_pages(struct inode *inode,
 	 * else has to be done for private mappings here
 	 */
 	if (!vma || vma->vm_flags & VM_MAYSHARE)
-		region_add(&resv_map->regions, from, to);
+		region_add(resv_map, from, to);
 	return 0;
 out_err:
 	if (vma)
@@ -3153,7 +3156,7 @@ void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
 	struct hugepage_subpool *spool = subpool_inode(inode);
 
 	if (resv_map)
-		chg = region_truncate(&resv_map->regions, offset);
+		chg = region_truncate(resv_map, offset);
 	spin_lock(&inode->i_lock);
 	inode->i_blocks -= (blocks_per_huge_page(h) * freed);
 	spin_unlock(&inode->i_lock);
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 139+ messages in thread

* [PATCH v2 08/20] mm, hugetlb: region manipulation functions take resv_map rather list_head
@ 2013-08-09  9:26   ` Joonsoo Kim
  0 siblings, 0 replies; 139+ messages in thread
From: Joonsoo Kim @ 2013-08-09  9:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Michal Hocko, Aneesh Kumar K.V,
	KAMEZAWA Hiroyuki, Hugh Dickins, Davidlohr Bueso, David Gibson,
	linux-mm, linux-kernel, Joonsoo Kim, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton, Joonsoo Kim

To change a protection method for region tracking to find grained one,
we pass the resv_map, instead of list_head, to region manipulation
functions. This doesn't introduce any functional change, and it is just
for preparing a next step.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 8751e2c..d9cabf6 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -150,8 +150,9 @@ struct file_region {
 	long to;
 };
 
-static long region_add(struct list_head *head, long f, long t)
+static long region_add(struct resv_map *resv, long f, long t)
 {
+	struct list_head *head = &resv->regions;
 	struct file_region *rg, *nrg, *trg;
 
 	/* Locate the region we are either in or before. */
@@ -186,8 +187,9 @@ static long region_add(struct list_head *head, long f, long t)
 	return 0;
 }
 
-static long region_chg(struct list_head *head, long f, long t)
+static long region_chg(struct resv_map *resv, long f, long t)
 {
+	struct list_head *head = &resv->regions;
 	struct file_region *rg, *nrg;
 	long chg = 0;
 
@@ -235,8 +237,9 @@ static long region_chg(struct list_head *head, long f, long t)
 	return chg;
 }
 
-static long region_truncate(struct list_head *head, long end)
+static long region_truncate(struct resv_map *resv, long end)
 {
+	struct list_head *head = &resv->regions;
 	struct file_region *rg, *trg;
 	long chg = 0;
 
@@ -265,8 +268,9 @@ static long region_truncate(struct list_head *head, long end)
 	return chg;
 }
 
-static long region_count(struct list_head *head, long f, long t)
+static long region_count(struct resv_map *resv, long f, long t)
 {
+	struct list_head *head = &resv->regions;
 	struct file_region *rg;
 	long chg = 0;
 
@@ -392,7 +396,7 @@ void resv_map_release(struct kref *ref)
 	struct resv_map *resv_map = container_of(ref, struct resv_map, refs);
 
 	/* Clear out any active regions before we release the map. */
-	region_truncate(&resv_map->regions, 0);
+	region_truncate(resv_map, 0);
 	kfree(resv_map);
 }
 
@@ -1099,7 +1103,7 @@ static long vma_needs_reservation(struct hstate *h,
 		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
 		struct resv_map *resv = vma_resv_map(vma);
 
-		err = region_chg(&resv->regions, idx, idx + 1);
+		err = region_chg(resv, idx, idx + 1);
 		if (err < 0)
 			return err;
 		return 0;
@@ -1121,9 +1125,8 @@ static void vma_commit_reservation(struct hstate *h,
 		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
 		struct resv_map *resv = vma_resv_map(vma);
 
-		/* Mark this page used in the map. */
-		region_add(&resv->regions, idx, idx + 1);
-	}
+	idx = vma_hugecache_offset(h, vma, addr);
+	region_add(resv, idx, idx + 1);
 }
 
 static struct page *alloc_huge_page(struct vm_area_struct *vma,
@@ -2211,7 +2214,7 @@ static void hugetlb_vm_op_close(struct vm_area_struct *vma)
 		end = vma_hugecache_offset(h, vma, vma->vm_end);
 
 		reserve = (end - start) -
-			region_count(&resv->regions, start, end);
+			region_count(resv, start, end);
 
 		resv_map_put(vma);
 
@@ -3091,7 +3094,7 @@ int hugetlb_reserve_pages(struct inode *inode,
 	if (!vma || vma->vm_flags & VM_MAYSHARE) {
 		resv_map = inode->i_mapping->private_data;
 
-		chg = region_chg(&resv_map->regions, from, to);
+		chg = region_chg(resv_map, from, to);
 
 	} else {
 		resv_map = resv_map_alloc();
@@ -3137,7 +3140,7 @@ int hugetlb_reserve_pages(struct inode *inode,
 	 * else has to be done for private mappings here
 	 */
 	if (!vma || vma->vm_flags & VM_MAYSHARE)
-		region_add(&resv_map->regions, from, to);
+		region_add(resv_map, from, to);
 	return 0;
 out_err:
 	if (vma)
@@ -3153,7 +3156,7 @@ void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
 	struct hugepage_subpool *spool = subpool_inode(inode);
 
 	if (resv_map)
-		chg = region_truncate(&resv_map->regions, offset);
+		chg = region_truncate(resv_map, offset);
 	spin_lock(&inode->i_lock);
 	inode->i_blocks -= (blocks_per_huge_page(h) * freed);
 	spin_unlock(&inode->i_lock);
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 139+ messages in thread

* [PATCH v2 09/20] mm, hugetlb: protect region tracking via newly introduced resv_map lock
  2013-08-09  9:26 ` Joonsoo Kim
@ 2013-08-09  9:26   ` Joonsoo Kim
  -1 siblings, 0 replies; 139+ messages in thread
From: Joonsoo Kim @ 2013-08-09  9:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Michal Hocko, Aneesh Kumar K.V,
	KAMEZAWA Hiroyuki, Hugh Dickins, Davidlohr Bueso, David Gibson,
	linux-mm, linux-kernel, Joonsoo Kim, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton, Joonsoo Kim

There is a race condition if we map the same file from different processes.
Region tracking is protected by mmap_sem and the hugetlb_instantiation_mutex.
When we mmap, we grab mmap_sem but not the hugetlb_instantiation_mutex.
This does not prevent another process from modifying the region structure,
so it can be modified by two processes concurrently.

To solve this, introduce a lock in resv_map and make the region manipulation
functions grab it before they do the actual work. This makes region
tracking safe.
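
As an illustration only (a simplified sketch, not the patch itself), the
interesting part of the conversion is region_chg(): kmalloc() may sleep,
which is not allowed under a spinlock, so the code first tries an atomic
allocation and only drops the lock for a sleeping allocation, retrying the
scan afterwards. Here need_new_region stands in for the real "no existing
region covers [f, t)" test:

        struct file_region *nrg = NULL;
        long chg;

retry:
        spin_lock(&resv->lock);
        /* ... scan resv->regions ... */
        if (need_new_region && !nrg) {
                nrg = kmalloc(sizeof(*nrg), GFP_NOWAIT);
                if (!nrg) {
                        /* Can't sleep under a spinlock: drop it and retry. */
                        spin_unlock(&resv->lock);
                        nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
                        if (!nrg)
                                return -ENOMEM;
                        goto retry;
                }
        }
        /* ... link nrg in (and set nrg = NULL), or just compute chg ... */
        spin_unlock(&resv->lock);
        kfree(nrg);     /* no-op if nrg was linked in and reset above */
        return chg;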

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 2677c07..e29e28f 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -26,6 +26,7 @@ struct hugepage_subpool {
 
 struct resv_map {
 	struct kref refs;
+	spinlock_t lock;
 	struct list_head regions;
 };
 extern struct resv_map *resv_map_alloc(void);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d9cabf6..73034dd 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -134,15 +134,8 @@ static inline struct hugepage_subpool *subpool_vma(struct vm_area_struct *vma)
  * Region tracking -- allows tracking of reservations and instantiated pages
  *                    across the pages in a mapping.
  *
- * The region data structures are protected by a combination of the mmap_sem
- * and the hugetlb_instantiation_mutex.  To access or modify a region the caller
- * must either hold the mmap_sem for write, or the mmap_sem for read and
- * the hugetlb_instantiation_mutex:
- *
- *	down_write(&mm->mmap_sem);
- * or
- *	down_read(&mm->mmap_sem);
- *	mutex_lock(&hugetlb_instantiation_mutex);
+ * The region data structures are embedded into a resv_map and
+ * protected by a resv_map's lock
  */
 struct file_region {
 	struct list_head link;
@@ -155,6 +148,7 @@ static long region_add(struct resv_map *resv, long f, long t)
 	struct list_head *head = &resv->regions;
 	struct file_region *rg, *nrg, *trg;
 
+	spin_lock(&resv->lock);
 	/* Locate the region we are either in or before. */
 	list_for_each_entry(rg, head, link)
 		if (f <= rg->to)
@@ -184,15 +178,18 @@ static long region_add(struct resv_map *resv, long f, long t)
 	}
 	nrg->from = f;
 	nrg->to = t;
+	spin_unlock(&resv->lock);
 	return 0;
 }
 
 static long region_chg(struct resv_map *resv, long f, long t)
 {
 	struct list_head *head = &resv->regions;
-	struct file_region *rg, *nrg;
+	struct file_region *rg, *nrg = NULL;
 	long chg = 0;
 
+retry:
+	spin_lock(&resv->lock);
 	/* Locate the region we are before or in. */
 	list_for_each_entry(rg, head, link)
 		if (f <= rg->to)
@@ -202,15 +199,27 @@ static long region_chg(struct resv_map *resv, long f, long t)
 	 * Subtle, allocate a new region at the position but make it zero
 	 * size such that we can guarantee to record the reservation. */
 	if (&rg->link == head || t < rg->from) {
-		nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
-		if (!nrg)
-			return -ENOMEM;
+		if (!nrg) {
+			nrg = kmalloc(sizeof(*nrg), GFP_NOWAIT);
+			if (!nrg) {
+				spin_unlock(&resv->lock);
+				nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
+				if (!nrg) {
+					chg = -ENOMEM;
+					goto out;
+				}
+				goto retry;
+			}
+		}
+
 		nrg->from = f;
 		nrg->to   = f;
 		INIT_LIST_HEAD(&nrg->link);
 		list_add(&nrg->link, rg->link.prev);
+		nrg = NULL;
 
-		return t - f;
+		chg = t - f;
+		goto out_locked;
 	}
 
 	/* Round our left edge to the current segment if it encloses us. */
@@ -223,7 +232,7 @@ static long region_chg(struct resv_map *resv, long f, long t)
 		if (&rg->link == head)
 			break;
 		if (rg->from > t)
-			return chg;
+			goto out_locked;
 
 		/* We overlap with this area, if it extends further than
 		 * us then we must extend ourselves.  Account for its
@@ -234,6 +243,11 @@ static long region_chg(struct resv_map *resv, long f, long t)
 		}
 		chg -= rg->to - rg->from;
 	}
+
+out_locked:
+	spin_unlock(&resv->lock);
+out:
+	kfree(nrg);
 	return chg;
 }
 
@@ -243,12 +257,13 @@ static long region_truncate(struct resv_map *resv, long end)
 	struct file_region *rg, *trg;
 	long chg = 0;
 
+	spin_lock(&resv->lock);
 	/* Locate the region we are either in or before. */
 	list_for_each_entry(rg, head, link)
 		if (end <= rg->to)
 			break;
 	if (&rg->link == head)
-		return 0;
+		goto out;
 
 	/* If we are in the middle of a region then adjust it. */
 	if (end > rg->from) {
@@ -265,6 +280,9 @@ static long region_truncate(struct resv_map *resv, long end)
 		list_del(&rg->link);
 		kfree(rg);
 	}
+
+out:
+	spin_unlock(&resv->lock);
 	return chg;
 }
 
@@ -274,6 +292,7 @@ static long region_count(struct resv_map *resv, long f, long t)
 	struct file_region *rg;
 	long chg = 0;
 
+	spin_lock(&resv->lock);
 	/* Locate each segment we overlap with, and count that overlap. */
 	list_for_each_entry(rg, head, link) {
 		long seg_from;
@@ -289,6 +308,7 @@ static long region_count(struct resv_map *resv, long f, long t)
 
 		chg += seg_to - seg_from;
 	}
+	spin_unlock(&resv->lock);
 
 	return chg;
 }
@@ -386,6 +406,7 @@ struct resv_map *resv_map_alloc(void)
 		return NULL;
 
 	kref_init(&resv_map->refs);
+	spin_lock_init(&resv_map->lock);
 	INIT_LIST_HEAD(&resv_map->regions);
 
 	return resv_map;
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 139+ messages in thread

* [PATCH v2 10/20] mm, hugetlb: remove resv_map_put()
  2013-08-09  9:26 ` Joonsoo Kim
@ 2013-08-09  9:26   ` Joonsoo Kim
  -1 siblings, 0 replies; 139+ messages in thread
From: Joonsoo Kim @ 2013-08-09  9:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Michal Hocko, Aneesh Kumar K.V,
	KAMEZAWA Hiroyuki, Hugh Dickins, Davidlohr Bueso, David Gibson,
	linux-mm, linux-kernel, Joonsoo Kim, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton, Joonsoo Kim

In a following patch, vma_resv_map() is changed to return a resv_map
for all cases. This patch prepares for that by removing resv_map_put(),
which would not work properly with that change, because it handles only
HPAGE_RESV_OWNER's resv_map, not all resv_maps.
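
To see why (illustration only, based on the description above): resv_map_put()
dropped a reference on whatever vma_resv_map() returned, which is correct
while vma_resv_map() returns non-NULL only for the HPAGE_RESV_OWNER case:

        struct resv_map *resv = vma_resv_map(vma);

        if (resv)
                kref_put(&resv->refs, resv_map_release);

Once vma_resv_map() is unified to also return the inode's resv_map for shared
mappings (next patch), such an unconditional put would drop a reference the
vma does not own, so the put is done explicitly and only for the reserve
owner, as in the hunks below.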

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 73034dd..869c3e0 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2212,15 +2212,6 @@ static void hugetlb_vm_op_open(struct vm_area_struct *vma)
 		kref_get(&resv->refs);
 }
 
-static void resv_map_put(struct vm_area_struct *vma)
-{
-	struct resv_map *resv = vma_resv_map(vma);
-
-	if (!resv)
-		return;
-	kref_put(&resv->refs, resv_map_release);
-}
-
 static void hugetlb_vm_op_close(struct vm_area_struct *vma)
 {
 	struct hstate *h = hstate_vma(vma);
@@ -2237,7 +2228,7 @@ static void hugetlb_vm_op_close(struct vm_area_struct *vma)
 		reserve = (end - start) -
 			region_count(resv, start, end);
 
-		resv_map_put(vma);
+		kref_put(&resv->refs, resv_map_release);
 
 		if (reserve) {
 			hugetlb_acct_memory(h, -reserve);
@@ -3164,8 +3155,8 @@ int hugetlb_reserve_pages(struct inode *inode,
 		region_add(resv_map, from, to);
 	return 0;
 out_err:
-	if (vma)
-		resv_map_put(vma);
+	if (vma && is_vma_resv_set(vma, HPAGE_RESV_OWNER))
+		kref_put(&resv_map->refs, resv_map_release);
 	return ret;
 }
 
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 139+ messages in thread

* [PATCH v2 11/20] mm, hugetlb: make vma_resv_map() work for all mapping types
  2013-08-09  9:26 ` Joonsoo Kim
@ 2013-08-09  9:26   ` Joonsoo Kim
  -1 siblings, 0 replies; 139+ messages in thread
From: Joonsoo Kim @ 2013-08-09  9:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Michal Hocko, Aneesh Kumar K.V,
	KAMEZAWA Hiroyuki, Hugh Dickins, Davidlohr Bueso, David Gibson,
	linux-mm, linux-kernel, Joonsoo Kim, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton, Joonsoo Kim

Until now, we have obtained a resv_map in two different ways depending on
the mapping type. This makes the code messy and hard to read, so unify it.
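
After this change, callers no longer need to care about the mapping type;
a rough sketch of the resulting pattern (as vma_commit_reservation() below
ends up doing) is:

        struct resv_map *resv = vma_resv_map(vma);  /* shared or private */
        pgoff_t idx;

        if (!resv)
                return;         /* no reservation tracking for this vma */

        idx = vma_hugecache_offset(h, vma, addr);
        region_add(resv, idx, idx + 1);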

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 869c3e0..e6c0c77 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -421,13 +421,24 @@ void resv_map_release(struct kref *ref)
 	kfree(resv_map);
 }
 
+static inline struct resv_map *inode_resv_map(struct inode *inode)
+{
+	return inode->i_mapping->private_data;
+}
+
 static struct resv_map *vma_resv_map(struct vm_area_struct *vma)
 {
 	VM_BUG_ON(!is_vm_hugetlb_page(vma));
-	if (!(vma->vm_flags & VM_MAYSHARE))
+	if (vma->vm_flags & VM_MAYSHARE) {
+		struct address_space *mapping = vma->vm_file->f_mapping;
+		struct inode *inode = mapping->host;
+
+		return inode_resv_map(inode);
+
+	} else {
 		return (struct resv_map *)(get_vma_private_data(vma) &
 							~HPAGE_RESV_MASK);
-	return NULL;
+	}
 }
 
 static void set_vma_resv_map(struct vm_area_struct *vma, struct resv_map *map)
@@ -1107,44 +1118,31 @@ static void return_unused_surplus_pages(struct hstate *h,
 static long vma_needs_reservation(struct hstate *h,
 			struct vm_area_struct *vma, unsigned long addr)
 {
-	struct address_space *mapping = vma->vm_file->f_mapping;
-	struct inode *inode = mapping->host;
-
-	if (vma->vm_flags & VM_MAYSHARE) {
-		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
-		struct resv_map *resv = inode->i_mapping->private_data;
-
-		return region_chg(&resv->regions, idx, idx + 1);
+	struct resv_map *resv;
+	pgoff_t idx;
+	long chg;
 
-	} else if (!is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
+	resv = vma_resv_map(vma);
+	if (!resv)
 		return 1;
 
-	} else  {
-		long err;
-		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
-		struct resv_map *resv = vma_resv_map(vma);
+	idx = vma_hugecache_offset(h, vma, addr);
+	chg = region_chg(resv, idx, idx + 1);
 
-		err = region_chg(resv, idx, idx + 1);
-		if (err < 0)
-			return err;
-		return 0;
-	}
+	if (vma->vm_flags & VM_MAYSHARE)
+		return chg;
+	else
+		return chg < 0 ? chg : 0;
 }
 static void vma_commit_reservation(struct hstate *h,
 			struct vm_area_struct *vma, unsigned long addr)
 {
-	struct address_space *mapping = vma->vm_file->f_mapping;
-	struct inode *inode = mapping->host;
-
-	if (vma->vm_flags & VM_MAYSHARE) {
-		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
-		struct resv_map *resv = inode->i_mapping->private_data;
-
-		region_add(&resv->regions, idx, idx + 1);
+	struct resv_map *resv;
+	pgoff_t idx;
 
-	} else if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
-		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
-		struct resv_map *resv = vma_resv_map(vma);
+	resv = vma_resv_map(vma);
+	if (!resv)
+		return;
 
 	idx = vma_hugecache_offset(h, vma, addr);
 	region_add(resv, idx, idx + 1);
@@ -2208,7 +2206,7 @@ static void hugetlb_vm_op_open(struct vm_area_struct *vma)
 	 * after this open call completes.  It is therefore safe to take a
 	 * new reference here without additional locking.
 	 */
-	if (resv)
+	if (resv && is_vma_resv_set(vma, HPAGE_RESV_OWNER))
 		kref_get(&resv->refs);
 }
 
@@ -2221,7 +2219,10 @@ static void hugetlb_vm_op_close(struct vm_area_struct *vma)
 	unsigned long start;
 	unsigned long end;
 
-	if (resv) {
+	if (!resv)
+		return;
+
+	if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
 		start = vma_hugecache_offset(h, vma, vma->vm_start);
 		end = vma_hugecache_offset(h, vma, vma->vm_end);
 
@@ -3104,7 +3105,7 @@ int hugetlb_reserve_pages(struct inode *inode,
 	 * called to make the mapping read-write. Assume !vma is a shm mapping
 	 */
 	if (!vma || vma->vm_flags & VM_MAYSHARE) {
-		resv_map = inode->i_mapping->private_data;
+		resv_map = inode_resv_map(inode);
 
 		chg = region_chg(resv_map, from, to);
 
@@ -3163,7 +3164,7 @@ out_err:
 void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
 {
 	struct hstate *h = hstate_inode(inode);
-	struct resv_map *resv_map = inode->i_mapping->private_data;
+	struct resv_map *resv_map = inode_resv_map(inode);
 	long chg = 0;
 	struct hugepage_subpool *spool = subpool_inode(inode);
 
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 139+ messages in thread

* [PATCH v2 12/20] mm, hugetlb: remove vma_has_reserves()
  2013-08-09  9:26 ` Joonsoo Kim
@ 2013-08-09  9:26   ` Joonsoo Kim
  -1 siblings, 0 replies; 139+ messages in thread
From: Joonsoo Kim @ 2013-08-09  9:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Michal Hocko, Aneesh Kumar K.V,
	KAMEZAWA Hiroyuki, Hugh Dickins, Davidlohr Bueso, David Gibson,
	linux-mm, linux-kernel, Joonsoo Kim, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton, Joonsoo Kim

vma_has_reserves() can be replaced by the return value of
vma_needs_reservation(). If the chg returned by vma_needs_reservation()
is 0, the vma has reserves. Otherwise, the vma has no reserves and needs
a hugepage from outside the reserve pool. This definition is exactly the
same as vma_has_reserves(), so remove vma_has_reserves().
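
In other words (sketch only), the chg value already computed in
alloc_huge_page() carries the same information vma_has_reserves() used to
provide:

        chg = vma_needs_reservation(h, vma, addr);  /* 0: reserve exists */
        ...
        /* was: if (!vma_has_reserves(vma, chg) && no free pages left) */
        if (chg && h->free_huge_pages - h->resv_huge_pages == 0)
                return NULL;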

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e6c0c77..22ceb04 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -473,39 +473,6 @@ void reset_vma_resv_huge_pages(struct vm_area_struct *vma)
 		vma->vm_private_data = (void *)0;
 }
 
-/* Returns true if the VMA has associated reserve pages */
-static int vma_has_reserves(struct vm_area_struct *vma, long chg)
-{
-	if (vma->vm_flags & VM_NORESERVE) {
-		/*
-		 * This address is already reserved by other process(chg == 0),
-		 * so, we should decreament reserved count. Without
-		 * decreamenting, reserve count is remained after releasing
-		 * inode, because this allocated page will go into page cache
-		 * and is regarded as coming from reserved pool in releasing
-		 * step. Currently, we don't have any other solution to deal
-		 * with this situation properly, so add work-around here.
-		 */
-		if (vma->vm_flags & VM_MAYSHARE && chg == 0)
-			return 1;
-		else
-			return 0;
-	}
-
-	/* Shared mappings always use reserves */
-	if (vma->vm_flags & VM_MAYSHARE)
-		return 1;
-
-	/*
-	 * Only the process that called mmap() has reserves for
-	 * private mappings.
-	 */
-	if (is_vma_resv_set(vma, HPAGE_RESV_OWNER))
-		return 1;
-
-	return 0;
-}
-
 static void copy_gigantic_page(struct page *dst, struct page *src)
 {
 	int i;
@@ -580,8 +547,7 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
 	 * have no page reserves. This check ensures that reservations are
 	 * not "stolen". The child may still get SIGKILLed
 	 */
-	if (!vma_has_reserves(vma, chg) &&
-			h->free_huge_pages - h->resv_huge_pages == 0)
+	if (chg && h->free_huge_pages - h->resv_huge_pages == 0)
 		return NULL;
 
 	/* If reserves cannot be used, ensure enough pages are in the pool */
@@ -600,7 +566,7 @@ retry_cpuset:
 			if (page) {
 				if (avoid_reserve)
 					break;
-				if (!vma_has_reserves(vma, chg))
+				if (chg)
 					break;
 
 				SetPagePrivate(page);
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 139+ messages in thread

* [PATCH v2 13/20] mm, hugetlb: unify chg and avoid_reserve to use_reserve
  2013-08-09  9:26 ` Joonsoo Kim
@ 2013-08-09  9:26   ` Joonsoo Kim
  -1 siblings, 0 replies; 139+ messages in thread
From: Joonsoo Kim @ 2013-08-09  9:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Michal Hocko, Aneesh Kumar K.V,
	KAMEZAWA Hiroyuki, Hugh Dickins, Davidlohr Bueso, David Gibson,
	linux-mm, linux-kernel, Joonsoo Kim, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton, Joonsoo Kim

Currently, we have two variables, chg and avoid_reserve, to represent
whether we can use a reserved page or not. By combining them, we get
cleaner code. This makes no functional difference.
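
Sketch of the aggregation (illustration only):

        use_reserve = (!chg && !avoid_reserve);

so every former 'if (chg || avoid_reserve)' test becomes 'if (!use_reserve)',
and dequeue_huge_page_vma() only needs the single boolean.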

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 22ceb04..8dff972 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -531,8 +531,7 @@ static struct page *dequeue_huge_page_node(struct hstate *h, int nid)
 
 static struct page *dequeue_huge_page_vma(struct hstate *h,
 				struct vm_area_struct *vma,
-				unsigned long address, int avoid_reserve,
-				long chg)
+				unsigned long address, bool use_reserve)
 {
 	struct page *page = NULL;
 	struct mempolicy *mpol;
@@ -546,12 +545,10 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
 	 * A child process with MAP_PRIVATE mappings created by their parent
 	 * have no page reserves. This check ensures that reservations are
 	 * not "stolen". The child may still get SIGKILLed
+	 * Or, when parent process do COW, we cannot use reserved page.
+	 * In this case, ensure enough pages are in the pool.
 	 */
-	if (chg && h->free_huge_pages - h->resv_huge_pages == 0)
-		return NULL;
-
-	/* If reserves cannot be used, ensure enough pages are in the pool */
-	if (avoid_reserve && h->free_huge_pages - h->resv_huge_pages == 0)
+	if (!use_reserve && h->free_huge_pages - h->resv_huge_pages == 0)
 		return NULL;
 
 retry_cpuset:
@@ -564,9 +561,7 @@ retry_cpuset:
 		if (cpuset_zone_allowed_softwall(zone, htlb_alloc_mask)) {
 			page = dequeue_huge_page_node(h, zone_to_nid(zone));
 			if (page) {
-				if (avoid_reserve)
-					break;
-				if (chg)
+				if (!use_reserve)
 					break;
 
 				SetPagePrivate(page);
@@ -1121,6 +1116,7 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
 	struct hstate *h = hstate_vma(vma);
 	struct page *page;
 	long chg;
+	bool use_reserve;
 	int ret, idx;
 	struct hugetlb_cgroup *h_cg;
 
@@ -1136,18 +1132,19 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
 	chg = vma_needs_reservation(h, vma, addr);
 	if (chg < 0)
 		return ERR_PTR(-ENOMEM);
-	if (chg || avoid_reserve)
+	use_reserve = (!chg && !avoid_reserve);
+	if (!use_reserve)
 		if (hugepage_subpool_get_pages(spool, 1))
 			return ERR_PTR(-ENOSPC);
 
 	ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg);
 	if (ret) {
-		if (chg || avoid_reserve)
+		if (!use_reserve)
 			hugepage_subpool_put_pages(spool, 1);
 		return ERR_PTR(-ENOSPC);
 	}
 	spin_lock(&hugetlb_lock);
-	page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve, chg);
+	page = dequeue_huge_page_vma(h, vma, addr, use_reserve);
 	if (!page) {
 		spin_unlock(&hugetlb_lock);
 		page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
@@ -1155,7 +1152,7 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
 			hugetlb_cgroup_uncharge_cgroup(idx,
 						       pages_per_huge_page(h),
 						       h_cg);
-			if (chg || avoid_reserve)
+			if (!use_reserve)
 				hugepage_subpool_put_pages(spool, 1);
 			return ERR_PTR(-ENOSPC);
 		}
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 139+ messages in thread

* [PATCH v2 14/20] mm, hugetlb: call vma_needs_reservation before entering alloc_huge_page()
  2013-08-09  9:26 ` Joonsoo Kim
@ 2013-08-09  9:26   ` Joonsoo Kim
  -1 siblings, 0 replies; 139+ messages in thread
From: Joonsoo Kim @ 2013-08-09  9:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Michal Hocko, Aneesh Kumar K.V,
	KAMEZAWA Hiroyuki, Hugh Dickins, Davidlohr Bueso, David Gibson,
	linux-mm, linux-kernel, Joonsoo Kim, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton, Joonsoo Kim

In order to validate that an allocation failure is reasonable, the caller
needs to know whether the allocation request is for a reserved page or not.
So move vma_needs_reservation() up into the callers of alloc_huge_page().
There is no functional change in this patch; a following patch uses
this information.
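
A rough sketch of the new caller pattern (as in hugetlb_no_page() below):

        chg = vma_needs_reservation(h, vma, address);
        if (chg == -ENOMEM) {
                ret = VM_FAULT_OOM;
                goto out;
        }
        use_reserve = !chg;

        page = alloc_huge_page(vma, address, use_reserve);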

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 8dff972..bc666cf 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1110,13 +1110,11 @@ static void vma_commit_reservation(struct hstate *h,
 }
 
 static struct page *alloc_huge_page(struct vm_area_struct *vma,
-				    unsigned long addr, int avoid_reserve)
+				    unsigned long addr, int use_reserve)
 {
 	struct hugepage_subpool *spool = subpool_vma(vma);
 	struct hstate *h = hstate_vma(vma);
 	struct page *page;
-	long chg;
-	bool use_reserve;
 	int ret, idx;
 	struct hugetlb_cgroup *h_cg;
 
@@ -1129,10 +1127,6 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
 	 * need pages and subpool limit allocated allocated if no reserve
 	 * mapping overlaps.
 	 */
-	chg = vma_needs_reservation(h, vma, addr);
-	if (chg < 0)
-		return ERR_PTR(-ENOMEM);
-	use_reserve = (!chg && !avoid_reserve);
 	if (!use_reserve)
 		if (hugepage_subpool_get_pages(spool, 1))
 			return ERR_PTR(-ENOSPC);
@@ -2504,6 +2498,8 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct hstate *h = hstate_vma(vma);
 	struct page *old_page, *new_page;
 	int outside_reserve = 0;
+	long chg;
+	bool use_reserve;
 	unsigned long mmun_start;	/* For mmu_notifiers */
 	unsigned long mmun_end;		/* For mmu_notifiers */
 
@@ -2535,7 +2531,17 @@ retry_avoidcopy:
 
 	/* Drop page_table_lock as buddy allocator may be called */
 	spin_unlock(&mm->page_table_lock);
-	new_page = alloc_huge_page(vma, address, outside_reserve);
+	chg = vma_needs_reservation(h, vma, address);
+	if (chg == -ENOMEM) {
+		page_cache_release(old_page);
+
+		/* Caller expects lock to be held */
+		spin_lock(&mm->page_table_lock);
+		return VM_FAULT_OOM;
+	}
+	use_reserve = !chg && !outside_reserve;
+
+	new_page = alloc_huge_page(vma, address, use_reserve);
 
 	if (IS_ERR(new_page)) {
 		long err = PTR_ERR(new_page);
@@ -2664,6 +2670,8 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct page *page;
 	struct address_space *mapping;
 	pte_t new_pte;
+	long chg;
+	bool use_reserve;
 
 	/*
 	 * Currently, we are forced to kill the process in the event the
@@ -2689,7 +2697,15 @@ retry:
 		size = i_size_read(mapping->host) >> huge_page_shift(h);
 		if (idx >= size)
 			goto out;
-		page = alloc_huge_page(vma, address, 0);
+
+		chg = vma_needs_reservation(h, vma, address);
+		if (chg == -ENOMEM) {
+			ret = VM_FAULT_OOM;
+			goto out;
+		}
+		use_reserve = !chg;
+
+		page = alloc_huge_page(vma, address, use_reserve);
 		if (IS_ERR(page)) {
 			ret = PTR_ERR(page);
 			if (ret == -ENOMEM)
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 139+ messages in thread

* [PATCH v2 15/20] mm, hugetlb: remove a check for return value of alloc_huge_page()
  2013-08-09  9:26 ` Joonsoo Kim
@ 2013-08-09  9:26   ` Joonsoo Kim
  -1 siblings, 0 replies; 139+ messages in thread
From: Joonsoo Kim @ 2013-08-09  9:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Michal Hocko, Aneesh Kumar K.V,
	KAMEZAWA Hiroyuki, Hugh Dickins, Davidlohr Bueso, David Gibson,
	linux-mm, linux-kernel, Joonsoo Kim, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton, Joonsoo Kim

Now, alloc_huge_page() only returns -ENOSPC on failure, so we don't need
to worry about any other return value.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index bc666cf..24de2ca 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2544,7 +2544,6 @@ retry_avoidcopy:
 	new_page = alloc_huge_page(vma, address, use_reserve);
 
 	if (IS_ERR(new_page)) {
-		long err = PTR_ERR(new_page);
 		page_cache_release(old_page);
 
 		/*
@@ -2573,10 +2572,7 @@ retry_avoidcopy:
 
 		/* Caller expects lock to be held */
 		spin_lock(&mm->page_table_lock);
-		if (err == -ENOMEM)
-			return VM_FAULT_OOM;
-		else
-			return VM_FAULT_SIGBUS;
+		return VM_FAULT_SIGBUS;
 	}
 
 	/*
@@ -2707,11 +2703,7 @@ retry:
 
 		page = alloc_huge_page(vma, address, use_reserve);
 		if (IS_ERR(page)) {
-			ret = PTR_ERR(page);
-			if (ret == -ENOMEM)
-				ret = VM_FAULT_OOM;
-			else
-				ret = VM_FAULT_SIGBUS;
+			ret = VM_FAULT_SIGBUS;
 			goto out;
 		}
 		clear_huge_page(page, address, pages_per_huge_page(h));
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 139+ messages in thread

* [PATCH v2 16/20] mm, hugetlb: move down outside_reserve check
  2013-08-09  9:26 ` Joonsoo Kim
@ 2013-08-09  9:26   ` Joonsoo Kim
  -1 siblings, 0 replies; 139+ messages in thread
From: Joonsoo Kim @ 2013-08-09  9:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Michal Hocko, Aneesh Kumar K.V,
	KAMEZAWA Hiroyuki, Hugh Dickins, Davidlohr Bueso, David Gibson,
	linux-mm, linux-kernel, Joonsoo Kim, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton, Joonsoo Kim

Just move the outside_reserve check down and don't call
vma_needs_reservation() when outside_reserve is true. This is a slight
optimization.

It also makes the code more readable.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 24de2ca..2372f75 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2499,7 +2499,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct page *old_page, *new_page;
 	int outside_reserve = 0;
 	long chg;
-	bool use_reserve;
+	bool use_reserve = false;
 	unsigned long mmun_start;	/* For mmu_notifiers */
 	unsigned long mmun_end;		/* For mmu_notifiers */
 
@@ -2514,6 +2514,11 @@ retry_avoidcopy:
 		return 0;
 	}
 
+	page_cache_get(old_page);
+
+	/* Drop page_table_lock as buddy allocator may be called */
+	spin_unlock(&mm->page_table_lock);
+
 	/*
 	 * If the process that created a MAP_PRIVATE mapping is about to
 	 * perform a COW due to a shared page count, attempt to satisfy
@@ -2527,19 +2532,17 @@ retry_avoidcopy:
 			old_page != pagecache_page)
 		outside_reserve = 1;
 
-	page_cache_get(old_page);
-
-	/* Drop page_table_lock as buddy allocator may be called */
-	spin_unlock(&mm->page_table_lock);
-	chg = vma_needs_reservation(h, vma, address);
-	if (chg == -ENOMEM) {
-		page_cache_release(old_page);
+	if (!outside_reserve) {
+		chg = vma_needs_reservation(h, vma, address);
+		if (chg == -ENOMEM) {
+			page_cache_release(old_page);
 
-		/* Caller expects lock to be held */
-		spin_lock(&mm->page_table_lock);
-		return VM_FAULT_OOM;
+			/* Caller expects lock to be held */
+			spin_lock(&mm->page_table_lock);
+			return VM_FAULT_OOM;
+		}
+		use_reserve = !chg;
 	}
-	use_reserve = !chg && !outside_reserve;
 
 	new_page = alloc_huge_page(vma, address, use_reserve);
 
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 139+ messages in thread

* [PATCH v2 17/20] mm, hugetlb: move up anon_vma_prepare()
  2013-08-09  9:26 ` Joonsoo Kim
@ 2013-08-09  9:26   ` Joonsoo Kim
  -1 siblings, 0 replies; 139+ messages in thread
From: Joonsoo Kim @ 2013-08-09  9:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Michal Hocko, Aneesh Kumar K.V,
	KAMEZAWA Hiroyuki, Hugh Dickins, Davidlohr Bueso, David Gibson,
	linux-mm, linux-kernel, Joonsoo Kim, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton, Joonsoo Kim

If we fail after a hugepage has already been allocated, extra effort is
needed to recover properly, so it is better to delay the hugepage allocation
as long as possible. Therefore, move anon_vma_prepare(), which can fail in
an OOM situation, up before the allocation.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2372f75..7e9a651 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2520,6 +2520,17 @@ retry_avoidcopy:
 	spin_unlock(&mm->page_table_lock);
 
 	/*
+	 * When the original hugepage is shared one, it does not have
+	 * anon_vma prepared.
+	 */
+	if (unlikely(anon_vma_prepare(vma))) {
+		page_cache_release(old_page);
+		/* Caller expects lock to be held */
+		spin_lock(&mm->page_table_lock);
+		return VM_FAULT_OOM;
+	}
+
+	/*
 	 * If the process that created a MAP_PRIVATE mapping is about to
 	 * perform a COW due to a shared page count, attempt to satisfy
 	 * the allocation without using the existing reserves. The pagecache
@@ -2578,18 +2589,6 @@ retry_avoidcopy:
 		return VM_FAULT_SIGBUS;
 	}
 
-	/*
-	 * When the original hugepage is shared one, it does not have
-	 * anon_vma prepared.
-	 */
-	if (unlikely(anon_vma_prepare(vma))) {
-		page_cache_release(new_page);
-		page_cache_release(old_page);
-		/* Caller expects lock to be held */
-		spin_lock(&mm->page_table_lock);
-		return VM_FAULT_OOM;
-	}
-
 	copy_user_huge_page(new_page, old_page, address, vma,
 			    pages_per_huge_page(h));
 	__SetPageUptodate(new_page);
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 139+ messages in thread

* [PATCH v2 18/20] mm, hugetlb: clean-up error handling in hugetlb_cow()
  2013-08-09  9:26 ` Joonsoo Kim
@ 2013-08-09  9:26   ` Joonsoo Kim
  -1 siblings, 0 replies; 139+ messages in thread
From: Joonsoo Kim @ 2013-08-09  9:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Michal Hocko, Aneesh Kumar K.V,
	KAMEZAWA Hiroyuki, Hugh Dickins, Davidlohr Bueso, David Gibson,
	linux-mm, linux-kernel, Joonsoo Kim, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton, Joonsoo Kim

Current code repeats 'Caller expects lock to be held' in every error
path. Clean it up by doing the error handling in one place.
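
The pattern this converges on, roughly (a sketch only; the ret variable
and the labels are the ones introduced in the diff below):

	int ret = 0;

	/* every error path just sets ret and jumps to a common exit */
	if (unlikely(anon_vma_prepare(vma))) {
		ret = VM_FAULT_OOM;
		goto out_old_page;
	}
	/* ... */
out_old_page:
	page_cache_release(old_page);
out_lock:
	/* Caller expects lock to be held */
	spin_lock(&mm->page_table_lock);
	return ret;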

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 7e9a651..8743e5c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2500,6 +2500,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 	int outside_reserve = 0;
 	long chg;
 	bool use_reserve = false;
+	int ret = 0;
 	unsigned long mmun_start;	/* For mmu_notifiers */
 	unsigned long mmun_end;		/* For mmu_notifiers */
 
@@ -2524,10 +2525,8 @@ retry_avoidcopy:
 	 * anon_vma prepared.
 	 */
 	if (unlikely(anon_vma_prepare(vma))) {
-		page_cache_release(old_page);
-		/* Caller expects lock to be held */
-		spin_lock(&mm->page_table_lock);
-		return VM_FAULT_OOM;
+		ret = VM_FAULT_OOM;
+		goto out_old_page;
 	}
 
 	/*
@@ -2546,11 +2545,8 @@ retry_avoidcopy:
 	if (!outside_reserve) {
 		chg = vma_needs_reservation(h, vma, address);
 		if (chg == -ENOMEM) {
-			page_cache_release(old_page);
-
-			/* Caller expects lock to be held */
-			spin_lock(&mm->page_table_lock);
-			return VM_FAULT_OOM;
+			ret = VM_FAULT_OOM;
+			goto out_old_page;
 		}
 		use_reserve = !chg;
 	}
@@ -2584,9 +2580,8 @@ retry_avoidcopy:
 			WARN_ON_ONCE(1);
 		}
 
-		/* Caller expects lock to be held */
-		spin_lock(&mm->page_table_lock);
-		return VM_FAULT_SIGBUS;
+		ret = VM_FAULT_SIGBUS;
+		goto out_lock;
 	}
 
 	copy_user_huge_page(new_page, old_page, address, vma,
@@ -2617,11 +2612,12 @@ retry_avoidcopy:
 	spin_unlock(&mm->page_table_lock);
 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 	page_cache_release(new_page);
+out_old_page:
 	page_cache_release(old_page);
-
+out_lock:
 	/* Caller expects lock to be held */
 	spin_lock(&mm->page_table_lock);
-	return 0;
+	return ret;
 }
 
 /* Return the pagecache page at a given address within a VMA */
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 139+ messages in thread

* [PATCH v2 19/20] mm, hugetlb: retry if failed to allocate and there is concurrent user
  2013-08-09  9:26 ` Joonsoo Kim
@ 2013-08-09  9:26   ` Joonsoo Kim
  -1 siblings, 0 replies; 139+ messages in thread
From: Joonsoo Kim @ 2013-08-09  9:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Michal Hocko, Aneesh Kumar K.V,
	KAMEZAWA Hiroyuki, Hugh Dickins, Davidlohr Bueso, David Gibson,
	linux-mm, linux-kernel, Joonsoo Kim, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton, Joonsoo Kim

If parallel faults occur, we can fail to allocate a hugepage because
many threads dequeue a hugepage to handle a fault on the same address.
This makes the reserved pool fall short just for a little while, and it
causes a faulting thread that could otherwise get a hugepage to receive
a SIGBUS signal.

To solve this problem we already have a nice solution, the
hugetlb_instantiation_mutex. It blocks other threads from diving into
the fault handler. This solves the problem completely, but it
introduces performance degradation because it serializes all fault
handling.

Now I try to remove the hugetlb_instantiation_mutex to get rid of that
performance degradation. To achieve this, we should first ensure that
no one gets a SIGBUS if there are enough hugepages.

For this purpose, if we fail to allocate a new hugepage while there is
a concurrent user, we return just 0 instead of VM_FAULT_SIGBUS. With
this, those threads defer getting a SIGBUS signal until there is no
concurrent user, and so we can ensure that no one gets a SIGBUS if
there are enough hugepages.
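
In condensed form, the bookkeeping and the decision on allocation
failure look roughly like this (a sketch spanning dequeue_huge_page_vma(),
alloc_huge_page() and the fault handlers, using the names introduced in
the diff below; not the literal hunks):

	/* dequeue side: count users holding a freshly dequeued hugepage */
	spin_lock(&hugetlb_lock);
	page = dequeue_huge_page_vma(h, vma, addr, use_reserve);
	if (page)
		h->nr_dequeue_users++;	/* dropped by commit_dequeued_huge_page() */
	else
		nr_dequeue_users = h->nr_dequeue_users;	/* snapshot for the caller */
	spin_unlock(&hugetlb_lock);

	/* fault side: no page, but a concurrent user exists -> just retry */
	if (IS_ERR(new_page)) {
		if (nr_dequeue_users)
			return 0;		/* fault is retried, no SIGBUS */
		return VM_FAULT_SIGBUS;		/* really out of hugepages */
	}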

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index e29e28f..981c539 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -242,6 +242,7 @@ struct hstate {
 	int next_nid_to_free;
 	unsigned int order;
 	unsigned long mask;
+	unsigned long nr_dequeue_users;
 	unsigned long max_huge_pages;
 	unsigned long nr_huge_pages;
 	unsigned long free_huge_pages;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 8743e5c..0501fe5 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -561,6 +561,7 @@ retry_cpuset:
 		if (cpuset_zone_allowed_softwall(zone, htlb_alloc_mask)) {
 			page = dequeue_huge_page_node(h, zone_to_nid(zone));
 			if (page) {
+				h->nr_dequeue_users++;
 				if (!use_reserve)
 					break;
 
@@ -577,6 +578,16 @@ retry_cpuset:
 	return page;
 }
 
+static void commit_dequeued_huge_page(struct hstate *h, bool do_dequeue)
+{
+	if (!do_dequeue)
+		return;
+
+	spin_lock(&hugetlb_lock);
+	h->nr_dequeue_users--;
+	spin_unlock(&hugetlb_lock);
+}
+
 static void update_and_free_page(struct hstate *h, struct page *page)
 {
 	int i;
@@ -1110,7 +1121,9 @@ static void vma_commit_reservation(struct hstate *h,
 }
 
 static struct page *alloc_huge_page(struct vm_area_struct *vma,
-				    unsigned long addr, int use_reserve)
+				    unsigned long addr, int use_reserve,
+				    unsigned long *nr_dequeue_users,
+				    bool *do_dequeue)
 {
 	struct hugepage_subpool *spool = subpool_vma(vma);
 	struct hstate *h = hstate_vma(vma);
@@ -1138,8 +1151,11 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
 		return ERR_PTR(-ENOSPC);
 	}
 	spin_lock(&hugetlb_lock);
+	*do_dequeue = true;
 	page = dequeue_huge_page_vma(h, vma, addr, use_reserve);
 	if (!page) {
+		*nr_dequeue_users = h->nr_dequeue_users;
+		*do_dequeue = false;
 		spin_unlock(&hugetlb_lock);
 		page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
 		if (!page) {
@@ -1894,6 +1910,7 @@ void __init hugetlb_add_hstate(unsigned order)
 	h->mask = ~((1ULL << (order + PAGE_SHIFT)) - 1);
 	h->nr_huge_pages = 0;
 	h->free_huge_pages = 0;
+	h->nr_dequeue_users = 0;
 	for (i = 0; i < MAX_NUMNODES; ++i)
 		INIT_LIST_HEAD(&h->hugepage_freelists[i]);
 	INIT_LIST_HEAD(&h->hugepage_activelist);
@@ -2500,6 +2517,8 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 	int outside_reserve = 0;
 	long chg;
 	bool use_reserve = false;
+	unsigned long nr_dequeue_users = 0;
+	bool do_dequeue = false;
 	int ret = 0;
 	unsigned long mmun_start;	/* For mmu_notifiers */
 	unsigned long mmun_end;		/* For mmu_notifiers */
@@ -2551,11 +2570,17 @@ retry_avoidcopy:
 		use_reserve = !chg;
 	}
 
-	new_page = alloc_huge_page(vma, address, use_reserve);
+	new_page = alloc_huge_page(vma, address, use_reserve,
+						&nr_dequeue_users, &do_dequeue);
 
 	if (IS_ERR(new_page)) {
 		page_cache_release(old_page);
 
+		if (nr_dequeue_users) {
+			ret = 0;
+			goto out_lock;
+		}
+
 		/*
 		 * If a process owning a MAP_PRIVATE mapping fails to COW,
 		 * it is due to references held by a child and an insufficient
@@ -2580,6 +2605,9 @@ retry_avoidcopy:
 			WARN_ON_ONCE(1);
 		}
 
+		if (use_reserve)
+			WARN_ON_ONCE(1);
+
 		ret = VM_FAULT_SIGBUS;
 		goto out_lock;
 	}
@@ -2614,6 +2642,7 @@ retry_avoidcopy:
 	page_cache_release(new_page);
 out_old_page:
 	page_cache_release(old_page);
+	commit_dequeued_huge_page(h, do_dequeue);
 out_lock:
 	/* Caller expects lock to be held */
 	spin_lock(&mm->page_table_lock);
@@ -2666,6 +2695,8 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	pte_t new_pte;
 	long chg;
 	bool use_reserve;
+	unsigned long nr_dequeue_users = 0;
+	bool do_dequeue = false;
 
 	/*
 	 * Currently, we are forced to kill the process in the event the
@@ -2699,9 +2730,17 @@ retry:
 		}
 		use_reserve = !chg;
 
-		page = alloc_huge_page(vma, address, use_reserve);
+		page = alloc_huge_page(vma, address, use_reserve,
+					&nr_dequeue_users, &do_dequeue);
 		if (IS_ERR(page)) {
-			ret = VM_FAULT_SIGBUS;
+			if (nr_dequeue_users)
+				ret = 0;
+			else {
+				if (use_reserve)
+					WARN_ON_ONCE(1);
+
+				ret = VM_FAULT_SIGBUS;
+			}
 			goto out;
 		}
 		clear_huge_page(page, address, pages_per_huge_page(h));
@@ -2714,22 +2753,24 @@ retry:
 			err = add_to_page_cache(page, mapping, idx, GFP_KERNEL);
 			if (err) {
 				put_page(page);
+				commit_dequeued_huge_page(h, do_dequeue);
 				if (err == -EEXIST)
 					goto retry;
 				goto out;
 			}
 			ClearPagePrivate(page);
+			commit_dequeued_huge_page(h, do_dequeue);
 
 			spin_lock(&inode->i_lock);
 			inode->i_blocks += blocks_per_huge_page(h);
 			spin_unlock(&inode->i_lock);
 		} else {
 			lock_page(page);
+			anon_rmap = 1;
 			if (unlikely(anon_vma_prepare(vma))) {
 				ret = VM_FAULT_OOM;
 				goto backout_unlocked;
 			}
-			anon_rmap = 1;
 		}
 	} else {
 		/*
@@ -2783,6 +2824,8 @@ retry:
 	spin_unlock(&mm->page_table_lock);
 	unlock_page(page);
 out:
+	if (anon_rmap)
+		commit_dequeued_huge_page(h, do_dequeue);
 	return ret;
 
 backout:
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 139+ messages in thread

* [PATCH v2 20/20] mm, hugetlb: remove a hugetlb_instantiation_mutex
  2013-08-09  9:26 ` Joonsoo Kim
@ 2013-08-09  9:26   ` Joonsoo Kim
  -1 siblings, 0 replies; 139+ messages in thread
From: Joonsoo Kim @ 2013-08-09  9:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Michal Hocko, Aneesh Kumar K.V,
	KAMEZAWA Hiroyuki, Hugh Dickins, Davidlohr Bueso, David Gibson,
	linux-mm, linux-kernel, Joonsoo Kim, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton, Joonsoo Kim

Now that we have the infrastructure in place to remove this awkward
mutex, which serializes all faulting tasks, remove it.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 0501fe5..f2c3a51 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2504,9 +2504,7 @@ static int unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
 
 /*
  * Hugetlb_cow() should be called with page lock of the original hugepage held.
- * Called with hugetlb_instantiation_mutex held and pte_page locked so we
- * cannot race with other handlers or page migration.
- * Keep the pte_same checks anyway to make transition from the mutex easier.
+ * Called with pte_page locked so we cannot race with page migration.
  */
 static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long address, pte_t *ptep, pte_t pte,
@@ -2844,7 +2842,6 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	int ret;
 	struct page *page = NULL;
 	struct page *pagecache_page = NULL;
-	static DEFINE_MUTEX(hugetlb_instantiation_mutex);
 	struct hstate *h = hstate_vma(vma);
 
 	address &= huge_page_mask(h);
@@ -2864,17 +2861,9 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (!ptep)
 		return VM_FAULT_OOM;
 
-	/*
-	 * Serialize hugepage allocation and instantiation, so that we don't
-	 * get spurious allocation failures if two CPUs race to instantiate
-	 * the same page in the page cache.
-	 */
-	mutex_lock(&hugetlb_instantiation_mutex);
 	entry = huge_ptep_get(ptep);
-	if (huge_pte_none(entry)) {
-		ret = hugetlb_no_page(mm, vma, address, ptep, flags);
-		goto out_mutex;
-	}
+	if (huge_pte_none(entry))
+		return hugetlb_no_page(mm, vma, address, ptep, flags);
 
 	ret = 0;
 
@@ -2887,10 +2876,8 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * consumed.
 	 */
 	if ((flags & FAULT_FLAG_WRITE) && !huge_pte_write(entry)) {
-		if (vma_needs_reservation(h, vma, address) < 0) {
-			ret = VM_FAULT_OOM;
-			goto out_mutex;
-		}
+		if (vma_needs_reservation(h, vma, address) < 0)
+			return VM_FAULT_OOM;
 
 		if (!(vma->vm_flags & VM_MAYSHARE))
 			pagecache_page = hugetlbfs_pagecache_page(h,
@@ -2939,9 +2926,6 @@ out_page_table_lock:
 		unlock_page(page);
 	put_page(page);
 
-out_mutex:
-	mutex_unlock(&hugetlb_instantiation_mutex);
-
 	return ret;
 }
 
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 139+ messages in thread

* Re: [PATCH v2 01/20] mm, hugetlb: protect reserved pages when soft offlining a hugepage
  2013-08-09  9:26   ` Joonsoo Kim
@ 2013-08-12 13:20     ` Davidlohr Bueso
  -1 siblings, 0 replies; 139+ messages in thread
From: Davidlohr Bueso @ 2013-08-12 13:20 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Rik van Riel, Mel Gorman, Michal Hocko,
	Aneesh Kumar K.V, KAMEZAWA Hiroyuki, Hugh Dickins,
	Davidlohr Bueso, David Gibson, linux-mm, linux-kernel,
	Joonsoo Kim, Wanpeng Li, Naoya Horiguchi, Hillf Danton

On Fri, 2013-08-09 at 18:26 +0900, Joonsoo Kim wrote:
> Don't use the reserve pool when soft offlining a hugepage.
> Check we have free pages outside the reserve pool before we
> dequeue the huge page. Otherwise, we can steal other's reserve page.
> 
> Reviewed-by: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

Reviewed-by: Davidlohr Bueso <davidlohr@hp.com>

> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 6782b41..d971233 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -935,10 +935,11 @@ static struct page *alloc_buddy_huge_page(struct hstate *h, int nid)
>   */
>  struct page *alloc_huge_page_node(struct hstate *h, int nid)
>  {
> -	struct page *page;
> +	struct page *page = NULL;
>  
>  	spin_lock(&hugetlb_lock);
> -	page = dequeue_huge_page_node(h, nid);
> +	if (h->free_huge_pages - h->resv_huge_pages > 0)
> +		page = dequeue_huge_page_node(h, nid);
>  	spin_unlock(&hugetlb_lock);
>  
>  	if (!page)



^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH v2 02/20] mm, hugetlb: change variable name reservations to resv
  2013-08-09  9:26   ` Joonsoo Kim
@ 2013-08-12 13:21     ` Davidlohr Bueso
  -1 siblings, 0 replies; 139+ messages in thread
From: Davidlohr Bueso @ 2013-08-12 13:21 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Rik van Riel, Mel Gorman, Michal Hocko,
	Aneesh Kumar K.V, KAMEZAWA Hiroyuki, Hugh Dickins,
	Davidlohr Bueso, David Gibson, linux-mm, linux-kernel,
	Joonsoo Kim, Wanpeng Li, Naoya Horiguchi, Hillf Danton

On Fri, 2013-08-09 at 18:26 +0900, Joonsoo Kim wrote:
> 'reservations' is so long name as a variable and we use 'resv_map'
> to represent 'struct resv_map' in other place. To reduce confusion and
> unreadability, change it.
> 
> Reviewed-by: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> 

Reviewed-by: Davidlohr Bueso <davidlohr@hp.com>

> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index d971233..12b6581 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1095,9 +1095,9 @@ static long vma_needs_reservation(struct hstate *h,
>  	} else  {
>  		long err;
>  		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
> -		struct resv_map *reservations = vma_resv_map(vma);
> +		struct resv_map *resv = vma_resv_map(vma);
>  
> -		err = region_chg(&reservations->regions, idx, idx + 1);
> +		err = region_chg(&resv->regions, idx, idx + 1);
>  		if (err < 0)
>  			return err;
>  		return 0;
> @@ -1115,10 +1115,10 @@ static void vma_commit_reservation(struct hstate *h,
>  
>  	} else if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
>  		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
> -		struct resv_map *reservations = vma_resv_map(vma);
> +		struct resv_map *resv = vma_resv_map(vma);
>  
>  		/* Mark this page used in the map. */
> -		region_add(&reservations->regions, idx, idx + 1);
> +		region_add(&resv->regions, idx, idx + 1);
>  	}
>  }
>  
> @@ -2168,7 +2168,7 @@ out:
>  
>  static void hugetlb_vm_op_open(struct vm_area_struct *vma)
>  {
> -	struct resv_map *reservations = vma_resv_map(vma);
> +	struct resv_map *resv = vma_resv_map(vma);
>  
>  	/*
>  	 * This new VMA should share its siblings reservation map if present.
> @@ -2178,34 +2178,34 @@ static void hugetlb_vm_op_open(struct vm_area_struct *vma)
>  	 * after this open call completes.  It is therefore safe to take a
>  	 * new reference here without additional locking.
>  	 */
> -	if (reservations)
> -		kref_get(&reservations->refs);
> +	if (resv)
> +		kref_get(&resv->refs);
>  }
>  
>  static void resv_map_put(struct vm_area_struct *vma)
>  {
> -	struct resv_map *reservations = vma_resv_map(vma);
> +	struct resv_map *resv = vma_resv_map(vma);
>  
> -	if (!reservations)
> +	if (!resv)
>  		return;
> -	kref_put(&reservations->refs, resv_map_release);
> +	kref_put(&resv->refs, resv_map_release);
>  }
>  
>  static void hugetlb_vm_op_close(struct vm_area_struct *vma)
>  {
>  	struct hstate *h = hstate_vma(vma);
> -	struct resv_map *reservations = vma_resv_map(vma);
> +	struct resv_map *resv = vma_resv_map(vma);
>  	struct hugepage_subpool *spool = subpool_vma(vma);
>  	unsigned long reserve;
>  	unsigned long start;
>  	unsigned long end;
>  
> -	if (reservations) {
> +	if (resv) {
>  		start = vma_hugecache_offset(h, vma, vma->vm_start);
>  		end = vma_hugecache_offset(h, vma, vma->vm_end);
>  
>  		reserve = (end - start) -
> -			region_count(&reservations->regions, start, end);
> +			region_count(&resv->regions, start, end);
>  
>  		resv_map_put(vma);
>  



^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH v2 04/20] mm, hugetlb: remove useless check about mapping type
  2013-08-09  9:26   ` Joonsoo Kim
@ 2013-08-12 13:31     ` Davidlohr Bueso
  -1 siblings, 0 replies; 139+ messages in thread
From: Davidlohr Bueso @ 2013-08-12 13:31 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Rik van Riel, Mel Gorman, Michal Hocko,
	Aneesh Kumar K.V, KAMEZAWA Hiroyuki, Hugh Dickins,
	Davidlohr Bueso, David Gibson, linux-mm, linux-kernel,
	Joonsoo Kim, Wanpeng Li, Naoya Horiguchi, Hillf Danton

On Fri, 2013-08-09 at 18:26 +0900, Joonsoo Kim wrote:
> is_vma_resv_set(vma, HPAGE_RESV_OWNER) implys that this mapping is
> for private. So we don't need to check whether this mapping is for
> shared or not.
> 
> This patch is just for clean-up.
> 
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

Reviewed-by: Davidlohr Bueso <davidlohr@hp.com>

> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index ea1ae0a..c017c52 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2544,8 +2544,7 @@ retry_avoidcopy:
>  	 * at the time of fork() could consume its reserves on COW instead
>  	 * of the full address range.
>  	 */
> -	if (!(vma->vm_flags & VM_MAYSHARE) &&
> -			is_vma_resv_set(vma, HPAGE_RESV_OWNER) &&
> +	if (is_vma_resv_set(vma, HPAGE_RESV_OWNER) &&
>  			old_page != pagecache_page)
>  		outside_reserve = 1;
>  



^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH v2 05/20] mm, hugetlb: grab a page_table_lock after page_cache_release
  2013-08-09  9:26   ` Joonsoo Kim
@ 2013-08-12 13:35     ` Davidlohr Bueso
  -1 siblings, 0 replies; 139+ messages in thread
From: Davidlohr Bueso @ 2013-08-12 13:35 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Rik van Riel, Mel Gorman, Michal Hocko,
	Aneesh Kumar K.V, KAMEZAWA Hiroyuki, Hugh Dickins,
	Davidlohr Bueso, David Gibson, linux-mm, linux-kernel,
	Joonsoo Kim, Wanpeng Li, Naoya Horiguchi, Hillf Danton

On Fri, 2013-08-09 at 18:26 +0900, Joonsoo Kim wrote:
> We don't need to grab a page_table_lock when we try to release a page.
> So, defer to grab a page_table_lock.
> 
> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

Reviewed-by: Davidlohr Bueso <davidlohr@hp.com>

> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index c017c52..6c8eec2 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2627,10 +2627,11 @@ retry_avoidcopy:
>  	}
>  	spin_unlock(&mm->page_table_lock);
>  	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> -	/* Caller expects lock to be held */
> -	spin_lock(&mm->page_table_lock);
>  	page_cache_release(new_page);
>  	page_cache_release(old_page);
> +
> +	/* Caller expects lock to be held */
> +	spin_lock(&mm->page_table_lock);
>  	return 0;
>  }
>  



^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH v2 09/20] mm, hugetlb: protect region tracking via newly introduced resv_map lock
  2013-08-09  9:26   ` Joonsoo Kim
@ 2013-08-12 22:03     ` Davidlohr Bueso
  -1 siblings, 0 replies; 139+ messages in thread
From: Davidlohr Bueso @ 2013-08-12 22:03 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Rik van Riel, Mel Gorman, Michal Hocko,
	Aneesh Kumar K.V, KAMEZAWA Hiroyuki, Hugh Dickins,
	Davidlohr Bueso, David Gibson, linux-mm, linux-kernel,
	Joonsoo Kim, Wanpeng Li, Naoya Horiguchi, Hillf Danton

On Fri, 2013-08-09 at 18:26 +0900, Joonsoo Kim wrote:
> There is a race condition if we map a same file on different processes.
> Region tracking is protected by mmap_sem and hugetlb_instantiation_mutex.
> When we do mmap, we don't grab a hugetlb_instantiation_mutex, but,
> grab a mmap_sem. This doesn't prevent other process to modify region
> structure, so it can be modified by two processes concurrently.
> 
> To solve this, I introduce a lock to resv_map and make region manipulation
> function grab a lock before they do actual work. This makes region
> tracking safe.
> 
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 2677c07..e29e28f 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -26,6 +26,7 @@ struct hugepage_subpool {
>  
>  struct resv_map {
>  	struct kref refs;
> +	spinlock_t lock;
>  	struct list_head regions;
>  };
>  extern struct resv_map *resv_map_alloc(void);
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index d9cabf6..73034dd 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -134,15 +134,8 @@ static inline struct hugepage_subpool *subpool_vma(struct vm_area_struct *vma)
>   * Region tracking -- allows tracking of reservations and instantiated pages
>   *                    across the pages in a mapping.
>   *
> - * The region data structures are protected by a combination of the mmap_sem
> - * and the hugetlb_instantiation_mutex.  To access or modify a region the caller
> - * must either hold the mmap_sem for write, or the mmap_sem for read and
> - * the hugetlb_instantiation_mutex:
> - *
> - *	down_write(&mm->mmap_sem);
> - * or
> - *	down_read(&mm->mmap_sem);
> - *	mutex_lock(&hugetlb_instantiation_mutex);
> + * The region data structures are embedded into a resv_map and
> + * protected by a resv_map's lock
>   */
>  struct file_region {
>  	struct list_head link;
> @@ -155,6 +148,7 @@ static long region_add(struct resv_map *resv, long f, long t)
>  	struct list_head *head = &resv->regions;
>  	struct file_region *rg, *nrg, *trg;
>  
> +	spin_lock(&resv->lock);
>  	/* Locate the region we are either in or before. */
>  	list_for_each_entry(rg, head, link)
>  		if (f <= rg->to)
> @@ -184,15 +178,18 @@ static long region_add(struct resv_map *resv, long f, long t)
>  	}
>  	nrg->from = f;
>  	nrg->to = t;
> +	spin_unlock(&resv->lock);
>  	return 0;
>  }
>  
>  static long region_chg(struct resv_map *resv, long f, long t)
>  {
>  	struct list_head *head = &resv->regions;
> -	struct file_region *rg, *nrg;
> +	struct file_region *rg, *nrg = NULL;
>  	long chg = 0;
>  
> +retry:
> +	spin_lock(&resv->lock);
>  	/* Locate the region we are before or in. */
>  	list_for_each_entry(rg, head, link)
>  		if (f <= rg->to)
> @@ -202,15 +199,27 @@ static long region_chg(struct resv_map *resv, long f, long t)
>  	 * Subtle, allocate a new region at the position but make it zero
>  	 * size such that we can guarantee to record the reservation. */
>  	if (&rg->link == head || t < rg->from) {
> -		nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
> -		if (!nrg)
> -			return -ENOMEM;
> +		if (!nrg) {
> +			nrg = kmalloc(sizeof(*nrg), GFP_NOWAIT);
> +			if (!nrg) {
> +				spin_unlock(&resv->lock);
> +				nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
> +				if (!nrg) {
> +					chg = -ENOMEM;
> +					goto out;

Just return -ENOMEM here.

> +				}
> +				goto retry;
> +			}
> +		}
> +

You seem to be right; at least in my workloads, the hold times for the
region lock are quite small, so a spinlock is better than a sleeping
lock.

That said, this code is quite messy, but I cannot think of a
better/cleaner approach right now.


>  		nrg->from = f;
>  		nrg->to   = f;
>  		INIT_LIST_HEAD(&nrg->link);
>  		list_add(&nrg->link, rg->link.prev);
> +		nrg = NULL;
>  
> -		return t - f;
> +		chg = t - f;
> +		goto out_locked;
>  	}
>  
>  	/* Round our left edge to the current segment if it encloses us. */
> @@ -223,7 +232,7 @@ static long region_chg(struct resv_map *resv, long f, long t)
>  		if (&rg->link == head)
>  			break;
>  		if (rg->from > t)
> -			return chg;
> +			goto out_locked;
>  
>  		/* We overlap with this area, if it extends further than
>  		 * us then we must extend ourselves.  Account for its
> @@ -234,6 +243,11 @@ static long region_chg(struct resv_map *resv, long f, long t)
>  		}
>  		chg -= rg->to - rg->from;
>  	}
> +
> +out_locked:
> +	spin_unlock(&resv->lock);
> +out:
> +	kfree(nrg);
>  	return chg;
>  }
>  
> @@ -243,12 +257,13 @@ static long region_truncate(struct resv_map *resv, long end)
>  	struct file_region *rg, *trg;
>  	long chg = 0;
>  
> +	spin_lock(&resv->lock);
>  	/* Locate the region we are either in or before. */
>  	list_for_each_entry(rg, head, link)
>  		if (end <= rg->to)
>  			break;
>  	if (&rg->link == head)
> -		return 0;
> +		goto out;
>  
>  	/* If we are in the middle of a region then adjust it. */
>  	if (end > rg->from) {
> @@ -265,6 +280,9 @@ static long region_truncate(struct resv_map *resv, long end)
>  		list_del(&rg->link);
>  		kfree(rg);
>  	}
> +
> +out:
> +	spin_unlock(&resv->lock);
>  	return chg;
>  }
>  
> @@ -274,6 +292,7 @@ static long region_count(struct resv_map *resv, long f, long t)
>  	struct file_region *rg;
>  	long chg = 0;
>  
> +	spin_lock(&resv->lock);
>  	/* Locate each segment we overlap with, and count that overlap. */
>  	list_for_each_entry(rg, head, link) {
>  		long seg_from;
> @@ -289,6 +308,7 @@ static long region_count(struct resv_map *resv, long f, long t)
>  
>  		chg += seg_to - seg_from;
>  	}
> +	spin_unlock(&resv->lock);
>  
>  	return chg;
>  }
> @@ -386,6 +406,7 @@ struct resv_map *resv_map_alloc(void)
>  		return NULL;
>  
>  	kref_init(&resv_map->refs);
> +	spin_lock_init(&resv_map->lock);
>  	INIT_LIST_HEAD(&resv_map->regions);
>  
>  	return resv_map;



^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH v2 09/20] mm, hugetlb: protect region tracking via newly introduced resv_map lock
  2013-08-12 22:03     ` Davidlohr Bueso
@ 2013-08-13  7:45       ` Joonsoo Kim
  -1 siblings, 0 replies; 139+ messages in thread
From: Joonsoo Kim @ 2013-08-13  7:45 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Andrew Morton, Rik van Riel, Mel Gorman, Michal Hocko,
	Aneesh Kumar K.V, KAMEZAWA Hiroyuki, Hugh Dickins,
	Davidlohr Bueso, David Gibson, linux-mm, linux-kernel,
	Wanpeng Li, Naoya Horiguchi, Hillf Danton

> > @@ -202,15 +199,27 @@ static long region_chg(struct resv_map *resv, long f, long t)
> >  	 * Subtle, allocate a new region at the position but make it zero
> >  	 * size such that we can guarantee to record the reservation. */
> >  	if (&rg->link == head || t < rg->from) {
> > -		nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
> > -		if (!nrg)
> > -			return -ENOMEM;
> > +		if (!nrg) {
> > +			nrg = kmalloc(sizeof(*nrg), GFP_NOWAIT);
> > +			if (!nrg) {
> > +				spin_unlock(&resv->lock);
> > +				nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
> > +				if (!nrg) {
> > +					chg = -ENOMEM;
> > +					goto out;
> 
> Just return -ENOMEM here.

Okay. It looks better!

> 
> > +				}
> > +				goto retry;
> > +			}
> > +		}
> > +
> 
> You seem to be right: at least in my workloads, the hold times for the
> region lock are quite small, so a spinlock is better than a sleeping
> lock.
> 
> That said, this code is quite messy, but I cannot think of a
> better/cleaner approach right now.

Okay.

Thanks for the review!

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH v2 00/20] mm, hugetlb: remove a hugetlb_instantiation_mutex
  2013-08-09  9:26 ` Joonsoo Kim
@ 2013-08-14 23:22   ` Andrew Morton
  -1 siblings, 0 replies; 139+ messages in thread
From: Andrew Morton @ 2013-08-14 23:22 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Rik van Riel, Mel Gorman, Michal Hocko, Aneesh Kumar K.V,
	KAMEZAWA Hiroyuki, Hugh Dickins, Davidlohr Bueso, David Gibson,
	linux-mm, linux-kernel, Joonsoo Kim, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton

On Fri,  9 Aug 2013 18:26:18 +0900 Joonsoo Kim <iamjoonsoo.kim@lge.com> wrote:

> Without a hugetlb_instantiation_mutex, if parallel fault occur, we can
> fail to allocate a hugepage, because many threads dequeue a hugepage
> to handle a fault of same address. This makes reserved pool shortage
> just for a little while and this cause faulting thread to get a SIGBUS
> signal, although there are enough hugepages.
> 
> To solve this problem, we already have a nice solution, that is,
> a hugetlb_instantiation_mutex. This blocks other threads to dive into
> a fault handler. This solve the problem clearly, but it introduce
> performance degradation, because it serialize all fault handling.
>     
> Now, I try to remove a hugetlb_instantiation_mutex to get rid of
> performance problem reported by Davidlohr Bueso [1].
> 
> This patchset consist of 4 parts roughly.
> 
> Part 1. (1-6) Random fix and clean-up. Enhancing error handling.
> 	
> 	These can be merged into mainline separately.
> 
> Part 2. (7-9) Protect region tracking via it's own spinlock, instead of
> 	the hugetlb_instantiation_mutex.
> 	
> 	Breaking dependency on the hugetlb_instantiation_mutex for
> 	tracking a region is also needed by other approaches like as
> 	'table mutexes', so these can be merged into mainline separately.
> 
> Part 3. (10-13) Clean-up.
> 	
> 	IMO, these make code really simple, so these are worth to go into
> 	mainline separately, regardless success of my approach.
> 
> Part 4. (14-20) Remove a hugetlb_instantiation_mutex.
> 	
> 	Almost patches are just for clean-up to error handling path.
> 	In patch 19, retry approach is implemented that if faulted thread
> 	failed to allocate a hugepage, it continue to run a fault handler
> 	until there is no concurrent thread having a hugepage. This causes
> 	threads who want to get a last hugepage to be serialized, so
> 	threads don't get a SIGBUS if enough hugepage exist.
> 	In patch 20, remove a hugetlb_instantiation_mutex.

I grabbed the first six easy ones.  I'm getting a bit cross-eyed from
all the reviewing lately so I'll wait and see if someone else takes an
interest in the other patches, sorry.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH v2 00/20] mm, hugetlb: remove a hugetlb_instantiation_mutex
  2013-08-14 23:22   ` Andrew Morton
@ 2013-08-16 17:18     ` JoonSoo Kim
  -1 siblings, 0 replies; 139+ messages in thread
From: JoonSoo Kim @ 2013-08-16 17:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Joonsoo Kim, Rik van Riel, Mel Gorman, Michal Hocko,
	Aneesh Kumar K.V, KAMEZAWA Hiroyuki, Hugh Dickins,
	Davidlohr Bueso, David Gibson, Linux Memory Management List,
	LKML, Wanpeng Li, Naoya Horiguchi, Hillf Danton

2013/8/15 Andrew Morton <akpm@linux-foundation.org>:
> On Fri,  9 Aug 2013 18:26:18 +0900 Joonsoo Kim <iamjoonsoo.kim@lge.com> wrote:
>
>> Without a hugetlb_instantiation_mutex, if parallel fault occur, we can
>> fail to allocate a hugepage, because many threads dequeue a hugepage
>> to handle a fault of same address. This makes reserved pool shortage
>> just for a little while and this cause faulting thread to get a SIGBUS
>> signal, although there are enough hugepages.
>>
>> To solve this problem, we already have a nice solution, that is,
>> a hugetlb_instantiation_mutex. This blocks other threads to dive into
>> a fault handler. This solve the problem clearly, but it introduce
>> performance degradation, because it serialize all fault handling.
>>
>> Now, I try to remove a hugetlb_instantiation_mutex to get rid of
>> performance problem reported by Davidlohr Bueso [1].
>>
>> This patchset consist of 4 parts roughly.
>>
>> Part 1. (1-6) Random fix and clean-up. Enhancing error handling.
>>
>>       These can be merged into mainline separately.
>>
>> Part 2. (7-9) Protect region tracking via it's own spinlock, instead of
>>       the hugetlb_instantiation_mutex.
>>
>>       Breaking dependency on the hugetlb_instantiation_mutex for
>>       tracking a region is also needed by other approaches like as
>>       'table mutexes', so these can be merged into mainline separately.
>>
>> Part 3. (10-13) Clean-up.
>>
>>       IMO, these make code really simple, so these are worth to go into
>>       mainline separately, regardless success of my approach.
>>
>> Part 4. (14-20) Remove a hugetlb_instantiation_mutex.
>>
>>       Almost patches are just for clean-up to error handling path.
>>       In patch 19, retry approach is implemented that if faulted thread
>>       failed to allocate a hugepage, it continue to run a fault handler
>>       until there is no concurrent thread having a hugepage. This causes
>>       threads who want to get a last hugepage to be serialized, so
>>       threads don't get a SIGBUS if enough hugepage exist.
>>       In patch 20, remove a hugetlb_instantiation_mutex.
>
> I grabbed the first six easy ones.  I'm getting a bit cross-eyed from
> all the reviewing lately so I'll wait and see if someone else takes an
> interest in the other patches, sorry.

Hello, Andrew.
Thanks for reviewing this and merging the first six patches.
There is no hurry on my side, so I'm happy to wait for someone to review the rest of the series.

Thanks.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH v2 03/20] mm, hugetlb: fix subpool accounting handling
  2013-08-09  9:26   ` Joonsoo Kim
@ 2013-08-21  9:28     ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 139+ messages in thread
From: Aneesh Kumar K.V @ 2013-08-21  9:28 UTC (permalink / raw)
  To: Joonsoo Kim, Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Michal Hocko, KAMEZAWA Hiroyuki,
	Hugh Dickins, Davidlohr Bueso, David Gibson, linux-mm,
	linux-kernel, Joonsoo Kim, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton, Joonsoo Kim

Joonsoo Kim <iamjoonsoo.kim@lge.com> writes:

> If we allocate a hugepage with avoid_reserve, we don't dequeue a reserved one.
> So, we should check the subpool counter when avoid_reserve is set.
> This patch implements that.

Can you explain this better? i.e., if we don't have a reservation in the
area, chg != 0. So why look at avoid_reserve?

Also the code will become simpler if you did

if (!chg && avoid_reserve)
   chg = 1;

and then the rest of the code will be able to handle the case.
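
A rough, untested sketch of how alloc_huge_page() would then look (only
to show the shape of the change, not an actual patch):

	chg = vma_needs_reservation(h, vma, addr);
	if (chg < 0)
		return ERR_PTR(-ENOMEM);
	/* No reservation to consume, so the subpool must still be charged. */
	if (!chg && avoid_reserve)
		chg = 1;

	if (chg)
		if (hugepage_subpool_get_pages(spool, 1))
			return ERR_PTR(-ENOSPC);

and the error paths would only need to test chg before calling
hugepage_subpool_put_pages(spool, 1).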

>
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 12b6581..ea1ae0a 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1144,13 +1144,14 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
>  	chg = vma_needs_reservation(h, vma, addr);
>  	if (chg < 0)
>  		return ERR_PTR(-ENOMEM);
> -	if (chg)
> -		if (hugepage_subpool_get_pages(spool, chg))
> +	if (chg || avoid_reserve)
> +		if (hugepage_subpool_get_pages(spool, 1))
>  			return ERR_PTR(-ENOSPC);
>
>  	ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg);
>  	if (ret) {
> -		hugepage_subpool_put_pages(spool, chg);
> +		if (chg || avoid_reserve)
> +			hugepage_subpool_put_pages(spool, 1);
>  		return ERR_PTR(-ENOSPC);
>  	}
>  	spin_lock(&hugetlb_lock);
> @@ -1162,7 +1163,8 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
>  			hugetlb_cgroup_uncharge_cgroup(idx,
>  						       pages_per_huge_page(h),
>  						       h_cg);
> -			hugepage_subpool_put_pages(spool, chg);
> +			if (chg || avoid_reserve)
> +				hugepage_subpool_put_pages(spool, 1);
>  			return ERR_PTR(-ENOSPC);
>  		}
>  		spin_lock(&hugetlb_lock);
> -- 
> 1.7.9.5


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH v2 04/20] mm, hugetlb: remove useless check about mapping type
  2013-08-09  9:26   ` Joonsoo Kim
@ 2013-08-21  9:30     ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 139+ messages in thread
From: Aneesh Kumar K.V @ 2013-08-21  9:30 UTC (permalink / raw)
  To: Joonsoo Kim, Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Michal Hocko, KAMEZAWA Hiroyuki,
	Hugh Dickins, Davidlohr Bueso, David Gibson, linux-mm,
	linux-kernel, Joonsoo Kim, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton, Joonsoo Kim

Joonsoo Kim <iamjoonsoo.kim@lge.com> writes:

> is_vma_resv_set(vma, HPAGE_RESV_OWNER) implies that this mapping is
> private. So we don't need to check whether this mapping is
> shared or not.
>
> This patch is just for clean-up.
>
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index ea1ae0a..c017c52 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2544,8 +2544,7 @@ retry_avoidcopy:
>  	 * at the time of fork() could consume its reserves on COW instead
>  	 * of the full address range.
>  	 */
> -	if (!(vma->vm_flags & VM_MAYSHARE) &&
> -			is_vma_resv_set(vma, HPAGE_RESV_OWNER) &&
> +	if (is_vma_resv_set(vma, HPAGE_RESV_OWNER) &&
>  			old_page != pagecache_page)
>  		outside_reserve = 1;
>
> -- 
> 1.7.9.5


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH v2 05/20] mm, hugetlb: grab a page_table_lock after page_cache_release
  2013-08-09  9:26   ` Joonsoo Kim
@ 2013-08-21  9:31     ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 139+ messages in thread
From: Aneesh Kumar K.V @ 2013-08-21  9:31 UTC (permalink / raw)
  To: Joonsoo Kim, Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Michal Hocko, KAMEZAWA Hiroyuki,
	Hugh Dickins, Davidlohr Bueso, David Gibson, linux-mm,
	linux-kernel, Joonsoo Kim, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton, Joonsoo Kim

Joonsoo Kim <iamjoonsoo.kim@lge.com> writes:

> We don't need to hold the page_table_lock when we release a page.
> So, defer grabbing the page_table_lock.
>
> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>


Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index c017c52..6c8eec2 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2627,10 +2627,11 @@ retry_avoidcopy:
>  	}
>  	spin_unlock(&mm->page_table_lock);
>  	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> -	/* Caller expects lock to be held */
> -	spin_lock(&mm->page_table_lock);
>  	page_cache_release(new_page);
>  	page_cache_release(old_page);
> +
> +	/* Caller expects lock to be held */
> +	spin_lock(&mm->page_table_lock);
>  	return 0;
>  }
>
> -- 
> 1.7.9.5


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH v2 06/20] mm, hugetlb: return a reserved page to a reserved pool if failed
  2013-08-09  9:26   ` Joonsoo Kim
@ 2013-08-21  9:54     ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 139+ messages in thread
From: Aneesh Kumar K.V @ 2013-08-21  9:54 UTC (permalink / raw)
  To: Joonsoo Kim, Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Michal Hocko, KAMEZAWA Hiroyuki,
	Hugh Dickins, Davidlohr Bueso, David Gibson, linux-mm,
	linux-kernel, Joonsoo Kim, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton, Joonsoo Kim

Joonsoo Kim <iamjoonsoo.kim@lge.com> writes:

> If we fail with a reserved page, just calling put_page() is not sufficient,
> because put_page() invokes free_huge_page() as its last step, and that doesn't
> know whether the page came from the reserved pool or not, so it doesn't do
> anything related to the reserve count. This makes the reserve count lower
> than it should be, because the reserve count was already decreased in
> dequeue_huge_page_vma(). This patch fixes this situation.

You may want to document that you are using PagePrivate for tracking
the reservation and why it is ok to do that.
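
For example, something along these lines next to the SetPagePrivate()
in the dequeue path (wording is only a suggestion):

	/*
	 * A page dequeued from the reserved pool is tagged with
	 * PagePrivate so that free_huge_page() can give the
	 * reservation back (h->resv_huge_pages++) if the fault
	 * handler fails before the page is instantiated.  The
	 * flag is cleared once the page is mapped or added to
	 * the page cache, i.e. once the reserve is really used.
	 */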

>
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 6c8eec2..3f834f1 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -572,6 +572,7 @@ retry_cpuset:
>  				if (!vma_has_reserves(vma, chg))
>  					break;
>
> +				SetPagePrivate(page);
>  				h->resv_huge_pages--;
>  				break;
>  			}
> @@ -626,15 +627,20 @@ static void free_huge_page(struct page *page)
>  	int nid = page_to_nid(page);
>  	struct hugepage_subpool *spool =
>  		(struct hugepage_subpool *)page_private(page);
> +	bool restore_reserve;
>
>  	set_page_private(page, 0);
>  	page->mapping = NULL;
>  	BUG_ON(page_count(page));
>  	BUG_ON(page_mapcount(page));
> +	restore_reserve = PagePrivate(page);
>
>  	spin_lock(&hugetlb_lock);
>  	hugetlb_cgroup_uncharge_page(hstate_index(h),
>  				     pages_per_huge_page(h), page);
> +	if (restore_reserve)
> +		h->resv_huge_pages++;
> +
>  	if (h->surplus_huge_pages_node[nid] && huge_page_order(h) < MAX_ORDER) {
>  		/* remove the page from active list */
>  		list_del(&page->lru);
> @@ -2616,6 +2622,8 @@ retry_avoidcopy:
>  	spin_lock(&mm->page_table_lock);
>  	ptep = huge_pte_offset(mm, address & huge_page_mask(h));
>  	if (likely(pte_same(huge_ptep_get(ptep), pte))) {
> +		ClearPagePrivate(new_page);
> +
>  		/* Break COW */
>  		huge_ptep_clear_flush(vma, address, ptep);
>  		set_huge_pte_at(mm, address, ptep,
> @@ -2727,6 +2735,7 @@ retry:
>  					goto retry;
>  				goto out;
>  			}
> +			ClearPagePrivate(page);
>
>  			spin_lock(&inode->i_lock);
>  			inode->i_blocks += blocks_per_huge_page(h);
> @@ -2773,8 +2782,10 @@ retry:
>  	if (!huge_pte_none(huge_ptep_get(ptep)))
>  		goto backout;
>
> -	if (anon_rmap)
> +	if (anon_rmap) {
> +		ClearPagePrivate(page);
>  		hugepage_add_new_anon_rmap(page, vma, address);
> +	}
>  	else
>  		page_dup_rmap(page);
>  	new_pte = make_huge_pte(vma, page, ((vma->vm_flags & VM_WRITE)
> -- 
> 1.7.9.5


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH v2 07/20] mm, hugetlb: unify region structure handling
  2013-08-09  9:26   ` Joonsoo Kim
@ 2013-08-21  9:57     ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 139+ messages in thread
From: Aneesh Kumar K.V @ 2013-08-21  9:57 UTC (permalink / raw)
  To: Joonsoo Kim, Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Michal Hocko, KAMEZAWA Hiroyuki,
	Hugh Dickins, Davidlohr Bueso, David Gibson, linux-mm,
	linux-kernel, Joonsoo Kim, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton, Joonsoo Kim

Joonsoo Kim <iamjoonsoo.kim@lge.com> writes:

> Currently, to track a reserved and allocated region, we use two different
> ways for MAP_SHARED and MAP_PRIVATE. For MAP_SHARED, we use
> address_mapping's private_list and, for MAP_PRIVATE, we use a resv_map.
> Now, we are preparing to change the coarse grained lock which protects
> a region structure to a fine grained lock, and this difference hinders it.
> So, before changing it, unify region structure handling.
>
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

As mentioned earlier, kref_put is confusing because we always have a
reference count == 1; otherwise,

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

>
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index a3f868a..9bf2c4a 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -366,7 +366,12 @@ static void truncate_hugepages(struct inode *inode, loff_t lstart)
>
>  static void hugetlbfs_evict_inode(struct inode *inode)
>  {
> +	struct resv_map *resv_map;
> +
>  	truncate_hugepages(inode, 0);
> +	resv_map = (struct resv_map *)inode->i_mapping->private_data;
> +	if (resv_map)
> +		kref_put(&resv_map->refs, resv_map_release);
>  	clear_inode(inode);
>  }
>
> @@ -468,6 +473,11 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb,
>  					umode_t mode, dev_t dev)
>  {
>  	struct inode *inode;
> +	struct resv_map *resv_map;
> +
> +	resv_map = resv_map_alloc();
> +	if (!resv_map)
> +		return NULL;
>
>  	inode = new_inode(sb);
>  	if (inode) {
> @@ -477,7 +487,7 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb,
>  		inode->i_mapping->a_ops = &hugetlbfs_aops;
>  		inode->i_mapping->backing_dev_info =&hugetlbfs_backing_dev_info;
>  		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
> -		INIT_LIST_HEAD(&inode->i_mapping->private_list);
> +		inode->i_mapping->private_data = resv_map;
>  		info = HUGETLBFS_I(inode);
>  		/*
>  		 * The policy is initialized here even if we are creating a
> @@ -507,7 +517,9 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb,
>  			break;
>  		}
>  		lockdep_annotate_inode_mutex_key(inode);
> -	}
> +	} else
> +		kref_put(&resv_map->refs, resv_map_release);
> +
>  	return inode;
>  }
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 6b4890f..2677c07 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -5,6 +5,8 @@
>  #include <linux/fs.h>
>  #include <linux/hugetlb_inline.h>
>  #include <linux/cgroup.h>
> +#include <linux/list.h>
> +#include <linux/kref.h>
>
>  struct ctl_table;
>  struct user_struct;
> @@ -22,6 +24,13 @@ struct hugepage_subpool {
>  	long max_hpages, used_hpages;
>  };
>
> +struct resv_map {
> +	struct kref refs;
> +	struct list_head regions;
> +};
> +extern struct resv_map *resv_map_alloc(void);
> +void resv_map_release(struct kref *ref);
> +
>  extern spinlock_t hugetlb_lock;
>  extern int hugetlb_max_hstate __read_mostly;
>  #define for_each_hstate(h) \
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 3f834f1..8751e2c 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -375,12 +375,7 @@ static void set_vma_private_data(struct vm_area_struct *vma,
>  	vma->vm_private_data = (void *)value;
>  }
>
> -struct resv_map {
> -	struct kref refs;
> -	struct list_head regions;
> -};
> -
> -static struct resv_map *resv_map_alloc(void)
> +struct resv_map *resv_map_alloc(void)
>  {
>  	struct resv_map *resv_map = kmalloc(sizeof(*resv_map), GFP_KERNEL);
>  	if (!resv_map)
> @@ -392,7 +387,7 @@ static struct resv_map *resv_map_alloc(void)
>  	return resv_map;
>  }
>
> -static void resv_map_release(struct kref *ref)
> +void resv_map_release(struct kref *ref)
>  {
>  	struct resv_map *resv_map = container_of(ref, struct resv_map, refs);
>
> @@ -1092,8 +1087,9 @@ static long vma_needs_reservation(struct hstate *h,
>
>  	if (vma->vm_flags & VM_MAYSHARE) {
>  		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
> -		return region_chg(&inode->i_mapping->private_list,
> -							idx, idx + 1);
> +		struct resv_map *resv = inode->i_mapping->private_data;
> +
> +		return region_chg(&resv->regions, idx, idx + 1);
>
>  	} else if (!is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
>  		return 1;
> @@ -1117,7 +1113,9 @@ static void vma_commit_reservation(struct hstate *h,
>
>  	if (vma->vm_flags & VM_MAYSHARE) {
>  		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
> -		region_add(&inode->i_mapping->private_list, idx, idx + 1);
> +		struct resv_map *resv = inode->i_mapping->private_data;
> +
> +		region_add(&resv->regions, idx, idx + 1);
>
>  	} else if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
>  		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
> @@ -3074,6 +3072,7 @@ int hugetlb_reserve_pages(struct inode *inode,
>  	long ret, chg;
>  	struct hstate *h = hstate_inode(inode);
>  	struct hugepage_subpool *spool = subpool_inode(inode);
> +	struct resv_map *resv_map;
>
>  	/*
>  	 * Only apply hugepage reservation if asked. At fault time, an
> @@ -3089,10 +3088,13 @@ int hugetlb_reserve_pages(struct inode *inode,
>  	 * to reserve the full area even if read-only as mprotect() may be
>  	 * called to make the mapping read-write. Assume !vma is a shm mapping
>  	 */
> -	if (!vma || vma->vm_flags & VM_MAYSHARE)
> -		chg = region_chg(&inode->i_mapping->private_list, from, to);
> -	else {
> -		struct resv_map *resv_map = resv_map_alloc();
> +	if (!vma || vma->vm_flags & VM_MAYSHARE) {
> +		resv_map = inode->i_mapping->private_data;
> +
> +		chg = region_chg(&resv_map->regions, from, to);
> +
> +	} else {
> +		resv_map = resv_map_alloc();
>  		if (!resv_map)
>  			return -ENOMEM;
>
> @@ -3135,7 +3137,7 @@ int hugetlb_reserve_pages(struct inode *inode,
>  	 * else has to be done for private mappings here
>  	 */
>  	if (!vma || vma->vm_flags & VM_MAYSHARE)
> -		region_add(&inode->i_mapping->private_list, from, to);
> +		region_add(&resv_map->regions, from, to);
>  	return 0;
>  out_err:
>  	if (vma)
> @@ -3146,9 +3148,12 @@ out_err:
>  void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
>  {
>  	struct hstate *h = hstate_inode(inode);
> -	long chg = region_truncate(&inode->i_mapping->private_list, offset);
> +	struct resv_map *resv_map = inode->i_mapping->private_data;
> +	long chg = 0;
>  	struct hugepage_subpool *spool = subpool_inode(inode);
>
> +	if (resv_map)
> +		chg = region_truncate(&resv_map->regions, offset);
>  	spin_lock(&inode->i_lock);
>  	inode->i_blocks -= (blocks_per_huge_page(h) * freed);
>  	spin_unlock(&inode->i_lock);
> -- 
> 1.7.9.5


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH v2 08/20] mm, hugetlb: region manipulation functions take resv_map rather list_head
  2013-08-09  9:26   ` Joonsoo Kim
@ 2013-08-21  9:58     ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 139+ messages in thread
From: Aneesh Kumar K.V @ 2013-08-21  9:58 UTC (permalink / raw)
  To: Joonsoo Kim, Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Michal Hocko, KAMEZAWA Hiroyuki,
	Hugh Dickins, Davidlohr Bueso, David Gibson, linux-mm,
	linux-kernel, Joonsoo Kim, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton, Joonsoo Kim

Joonsoo Kim <iamjoonsoo.kim@lge.com> writes:

> To change the protection method for region tracking to a fine grained one,
> we pass the resv_map, instead of the list_head, to the region manipulation
> functions. This doesn't introduce any functional change, and it is just
> preparation for the next step.
>
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 8751e2c..d9cabf6 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -150,8 +150,9 @@ struct file_region {
>  	long to;
>  };
>
> -static long region_add(struct list_head *head, long f, long t)
> +static long region_add(struct resv_map *resv, long f, long t)
>  {
> +	struct list_head *head = &resv->regions;
>  	struct file_region *rg, *nrg, *trg;
>
>  	/* Locate the region we are either in or before. */
> @@ -186,8 +187,9 @@ static long region_add(struct list_head *head, long f, long t)
>  	return 0;
>  }
>
> -static long region_chg(struct list_head *head, long f, long t)
> +static long region_chg(struct resv_map *resv, long f, long t)
>  {
> +	struct list_head *head = &resv->regions;
>  	struct file_region *rg, *nrg;
>  	long chg = 0;
>
> @@ -235,8 +237,9 @@ static long region_chg(struct list_head *head, long f, long t)
>  	return chg;
>  }
>
> -static long region_truncate(struct list_head *head, long end)
> +static long region_truncate(struct resv_map *resv, long end)
>  {
> +	struct list_head *head = &resv->regions;
>  	struct file_region *rg, *trg;
>  	long chg = 0;
>
> @@ -265,8 +268,9 @@ static long region_truncate(struct list_head *head, long end)
>  	return chg;
>  }
>
> -static long region_count(struct list_head *head, long f, long t)
> +static long region_count(struct resv_map *resv, long f, long t)
>  {
> +	struct list_head *head = &resv->regions;
>  	struct file_region *rg;
>  	long chg = 0;
>
> @@ -392,7 +396,7 @@ void resv_map_release(struct kref *ref)
>  	struct resv_map *resv_map = container_of(ref, struct resv_map, refs);
>
>  	/* Clear out any active regions before we release the map. */
> -	region_truncate(&resv_map->regions, 0);
> +	region_truncate(resv_map, 0);
>  	kfree(resv_map);
>  }
>
> @@ -1099,7 +1103,7 @@ static long vma_needs_reservation(struct hstate *h,
>  		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
>  		struct resv_map *resv = vma_resv_map(vma);
>
> -		err = region_chg(&resv->regions, idx, idx + 1);
> +		err = region_chg(resv, idx, idx + 1);
>  		if (err < 0)
>  			return err;
>  		return 0;
> @@ -1121,9 +1125,8 @@ static void vma_commit_reservation(struct hstate *h,
>  		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
>  		struct resv_map *resv = vma_resv_map(vma);
>
> -		/* Mark this page used in the map. */
> -		region_add(&resv->regions, idx, idx + 1);
> -	}
> +	idx = vma_hugecache_offset(h, vma, addr);
> +	region_add(resv, idx, idx + 1);
>  }
>
>  static struct page *alloc_huge_page(struct vm_area_struct *vma,
> @@ -2211,7 +2214,7 @@ static void hugetlb_vm_op_close(struct vm_area_struct *vma)
>  		end = vma_hugecache_offset(h, vma, vma->vm_end);
>
>  		reserve = (end - start) -
> -			region_count(&resv->regions, start, end);
> +			region_count(resv, start, end);
>
>  		resv_map_put(vma);
>
> @@ -3091,7 +3094,7 @@ int hugetlb_reserve_pages(struct inode *inode,
>  	if (!vma || vma->vm_flags & VM_MAYSHARE) {
>  		resv_map = inode->i_mapping->private_data;
>
> -		chg = region_chg(&resv_map->regions, from, to);
> +		chg = region_chg(resv_map, from, to);
>
>  	} else {
>  		resv_map = resv_map_alloc();
> @@ -3137,7 +3140,7 @@ int hugetlb_reserve_pages(struct inode *inode,
>  	 * else has to be done for private mappings here
>  	 */
>  	if (!vma || vma->vm_flags & VM_MAYSHARE)
> -		region_add(&resv_map->regions, from, to);
> +		region_add(resv_map, from, to);
>  	return 0;
>  out_err:
>  	if (vma)
> @@ -3153,7 +3156,7 @@ void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
>  	struct hugepage_subpool *spool = subpool_inode(inode);
>
>  	if (resv_map)
> -		chg = region_truncate(&resv_map->regions, offset);
> +		chg = region_truncate(resv_map, offset);
>  	spin_lock(&inode->i_lock);
>  	inode->i_blocks -= (blocks_per_huge_page(h) * freed);
>  	spin_unlock(&inode->i_lock);
> -- 
> 1.7.9.5


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH v2 09/20] mm, hugetlb: protect region tracking via newly introduced resv_map lock
  2013-08-09  9:26   ` Joonsoo Kim
@ 2013-08-21 10:13     ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 139+ messages in thread
From: Aneesh Kumar K.V @ 2013-08-21 10:13 UTC (permalink / raw)
  To: Joonsoo Kim, Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Michal Hocko, KAMEZAWA Hiroyuki,
	Hugh Dickins, Davidlohr Bueso, David Gibson, linux-mm,
	linux-kernel, Joonsoo Kim, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton, Joonsoo Kim

Joonsoo Kim <iamjoonsoo.kim@lge.com> writes:

> There is a race condition if we map the same file from different processes.
> Region tracking is protected by mmap_sem and the hugetlb_instantiation_mutex.
> When we do mmap, we don't grab the hugetlb_instantiation_mutex, but only the
> mmap_sem. This doesn't prevent other processes from modifying the region
> structure, so it can be modified by two processes concurrently.
>
> To solve this, I introduce a lock to resv_map and make the region
> manipulation functions grab it before they do the actual work. This makes
> region tracking safe.
>
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 2677c07..e29e28f 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -26,6 +26,7 @@ struct hugepage_subpool {
>
>  struct resv_map {
>  	struct kref refs;
> +	spinlock_t lock;
>  	struct list_head regions;
>  };
>  extern struct resv_map *resv_map_alloc(void);
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index d9cabf6..73034dd 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -134,15 +134,8 @@ static inline struct hugepage_subpool *subpool_vma(struct vm_area_struct *vma)
>   * Region tracking -- allows tracking of reservations and instantiated pages
>   *                    across the pages in a mapping.
>   *
> - * The region data structures are protected by a combination of the mmap_sem
> - * and the hugetlb_instantiation_mutex.  To access or modify a region the caller
> - * must either hold the mmap_sem for write, or the mmap_sem for read and
> - * the hugetlb_instantiation_mutex:
> - *
> - *	down_write(&mm->mmap_sem);
> - * or
> - *	down_read(&mm->mmap_sem);
> - *	mutex_lock(&hugetlb_instantiation_mutex);
> + * The region data structures are embedded into a resv_map and
> + * protected by a resv_map's lock
>   */
>  struct file_region {
>  	struct list_head link;
> @@ -155,6 +148,7 @@ static long region_add(struct resv_map *resv, long f, long t)
>  	struct list_head *head = &resv->regions;
>  	struct file_region *rg, *nrg, *trg;
>
> +	spin_lock(&resv->lock);
>  	/* Locate the region we are either in or before. */
>  	list_for_each_entry(rg, head, link)
>  		if (f <= rg->to)
> @@ -184,15 +178,18 @@ static long region_add(struct resv_map *resv, long f, long t)
>  	}
>  	nrg->from = f;
>  	nrg->to = t;
> +	spin_unlock(&resv->lock);
>  	return 0;
>  }
>
>  static long region_chg(struct resv_map *resv, long f, long t)
>  {
>  	struct list_head *head = &resv->regions;
> -	struct file_region *rg, *nrg;
> +	struct file_region *rg, *nrg = NULL;
>  	long chg = 0;
>
> +retry:
> +	spin_lock(&resv->lock);
>  	/* Locate the region we are before or in. */
>  	list_for_each_entry(rg, head, link)
>  		if (f <= rg->to)
> @@ -202,15 +199,27 @@ static long region_chg(struct resv_map *resv, long f, long t)
>  	 * Subtle, allocate a new region at the position but make it zero
>  	 * size such that we can guarantee to record the reservation. */
>  	if (&rg->link == head || t < rg->from) {
> -		nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
> -		if (!nrg)
> -			return -ENOMEM;
> +		if (!nrg) {
> +			nrg = kmalloc(sizeof(*nrg), GFP_NOWAIT);

Do we really need the GFP_NOWAIT allocation attempt? Why can't we simply
allocate and retry? Or should resv->lock be a mutex?

> +			if (!nrg) {
> +				spin_unlock(&resv->lock);
> +				nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
> +				if (!nrg) {
> +					chg = -ENOMEM;
> +					goto out;
> +				}
> +				goto retry;
> +			}
> +		}
> +
>  		nrg->from = f;
>  		nrg->to   = f;
>  		INIT_LIST_HEAD(&nrg->link);
>  		list_add(&nrg->link, rg->link.prev);
> +		nrg = NULL;
>
> -		return t - f;
> +		chg = t - f;
> +		goto out_locked;
>  	}
>
>  	/* Round our left edge to the current segment if it encloses us. */
> @@ -223,7 +232,7 @@ static long region_chg(struct resv_map *resv, long f, long t)
>  		if (&rg->link == head)
>  			break;
>  		if (rg->from > t)
> -			return chg;
> +			goto out_locked;
>
>  		/* We overlap with this area, if it extends further than
>  		 * us then we must extend ourselves.  Account for its
> @@ -234,6 +243,11 @@ static long region_chg(struct resv_map *resv, long f, long t)
>  		}
>  		chg -= rg->to - rg->from;
>  	}
> +
> +out_locked:
> +	spin_unlock(&resv->lock);
> +out:
> +	kfree(nrg);
>  	return chg;
>  }
>
> @@ -243,12 +257,13 @@ static long region_truncate(struct resv_map *resv, long end)
>  	struct file_region *rg, *trg;
>  	long chg = 0;
>
> +	spin_lock(&resv->lock);
>  	/* Locate the region we are either in or before. */
>  	list_for_each_entry(rg, head, link)
>  		if (end <= rg->to)
>  			break;
>  	if (&rg->link == head)
> -		return 0;
> +		goto out;
>
>  	/* If we are in the middle of a region then adjust it. */
>  	if (end > rg->from) {
> @@ -265,6 +280,9 @@ static long region_truncate(struct resv_map *resv, long end)
>  		list_del(&rg->link);
>  		kfree(rg);
>  	}
> +
> +out:
> +	spin_unlock(&resv->lock);
>  	return chg;
>  }
>
> @@ -274,6 +292,7 @@ static long region_count(struct resv_map *resv, long f, long t)
>  	struct file_region *rg;
>  	long chg = 0;
>
> +	spin_lock(&resv->lock);
>  	/* Locate each segment we overlap with, and count that overlap. */
>  	list_for_each_entry(rg, head, link) {
>  		long seg_from;
> @@ -289,6 +308,7 @@ static long region_count(struct resv_map *resv, long f, long t)
>
>  		chg += seg_to - seg_from;
>  	}
> +	spin_unlock(&resv->lock);
>
>  	return chg;
>  }
> @@ -386,6 +406,7 @@ struct resv_map *resv_map_alloc(void)
>  		return NULL;
>
>  	kref_init(&resv_map->refs);
> +	spin_lock_init(&resv_map->lock);
>  	INIT_LIST_HEAD(&resv_map->regions);
>
>  	return resv_map;
> -- 
> 1.7.9.5


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH v2 07/20] mm, hugetlb: unify region structure handling
  2013-08-09  9:26   ` Joonsoo Kim
@ 2013-08-21 10:22     ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 139+ messages in thread
From: Aneesh Kumar K.V @ 2013-08-21 10:22 UTC (permalink / raw)
  To: Joonsoo Kim, Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Michal Hocko, KAMEZAWA Hiroyuki,
	Hugh Dickins, Davidlohr Bueso, David Gibson, linux-mm,
	linux-kernel, Joonsoo Kim, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton, Joonsoo Kim

Joonsoo Kim <iamjoonsoo.kim@lge.com> writes:

> Currently, to track reserved and allocated regions, we use two different
> ways for MAP_SHARED and MAP_PRIVATE. For MAP_SHARED, we use the
> address_space's private_list and, for MAP_PRIVATE, we use a resv_map.
> Now, we are preparing to change the coarse grained lock which protects
> the region structure to a fine grained lock, and this difference hinders
> that. So, before changing it, unify region structure handling.
>
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
>
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index a3f868a..9bf2c4a 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -366,7 +366,12 @@ static void truncate_hugepages(struct inode *inode, loff_t lstart)
>
>  static void hugetlbfs_evict_inode(struct inode *inode)
>  {
> +	struct resv_map *resv_map;
> +
>  	truncate_hugepages(inode, 0);
> +	resv_map = (struct resv_map *)inode->i_mapping->private_data;

Can you add a comment saying that the root inode doesn't have a resv_map?

> +	if (resv_map)
> +		kref_put(&resv_map->refs, resv_map_release);
>  	clear_inode(inode);
>  }
>
> @@ -468,6 +473,11 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb,
>  					umode_t mode, dev_t dev)
>  {
>  	struct inode *inode;
> +	struct resv_map *resv_map;
> +
> +	resv_map = resv_map_alloc();
> +	if (!resv_map)
> +		return NULL;
>
>  	inode = new_inode(sb);
>  	if (inode) {
> @@ -477,7 +487,7 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb,
>  		inode->i_mapping->a_ops = &hugetlbfs_aops;
>  		inode->i_mapping->backing_dev_info =&hugetlbfs_backing_dev_info;
>  		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
> -		INIT_LIST_HEAD(&inode->i_mapping->private_list);
> +		inode->i_mapping->private_data = resv_map;
>  		info = HUGETLBFS_I(inode);
>  		/*
>  		 * The policy is initialized here even if we are creating a
> @@ -507,7 +517,9 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb,
>  			break;
>  		}
>  		lockdep_annotate_inode_mutex_key(inode);
> -	}
> +	} else
> +		kref_put(&resv_map->refs, resv_map_release);
> +
>  	return inode;
>  }
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 6b4890f..2677c07 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -5,6 +5,8 @@
>  #include <linux/fs.h>
>  #include <linux/hugetlb_inline.h>
>  #include <linux/cgroup.h>
> +#include <linux/list.h>
> +#include <linux/kref.h>
>
>  struct ctl_table;
>  struct user_struct;
> @@ -22,6 +24,13 @@ struct hugepage_subpool {
>  	long max_hpages, used_hpages;
>  };
>
> +struct resv_map {
> +	struct kref refs;
> +	struct list_head regions;
> +};
> +extern struct resv_map *resv_map_alloc(void);
> +void resv_map_release(struct kref *ref);
> +
>  extern spinlock_t hugetlb_lock;
>  extern int hugetlb_max_hstate __read_mostly;
>  #define for_each_hstate(h) \
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 3f834f1..8751e2c 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -375,12 +375,7 @@ static void set_vma_private_data(struct vm_area_struct *vma,
>  	vma->vm_private_data = (void *)value;
>  }
>
> -struct resv_map {
> -	struct kref refs;
> -	struct list_head regions;
> -};
> -
> -static struct resv_map *resv_map_alloc(void)
> +struct resv_map *resv_map_alloc(void)
>  {
>  	struct resv_map *resv_map = kmalloc(sizeof(*resv_map), GFP_KERNEL);
>  	if (!resv_map)
> @@ -392,7 +387,7 @@ static struct resv_map *resv_map_alloc(void)
>  	return resv_map;
>  }
>
> -static void resv_map_release(struct kref *ref)
> +void resv_map_release(struct kref *ref)
>  {
>  	struct resv_map *resv_map = container_of(ref, struct resv_map, refs);
>
> @@ -1092,8 +1087,9 @@ static long vma_needs_reservation(struct hstate *h,
>
>  	if (vma->vm_flags & VM_MAYSHARE) {
>  		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
> -		return region_chg(&inode->i_mapping->private_list,
> -							idx, idx + 1);
> +		struct resv_map *resv = inode->i_mapping->private_data;
> +
> +		return region_chg(&resv->regions, idx, idx + 1);
>
>  	} else if (!is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
>  		return 1;
> @@ -1117,7 +1113,9 @@ static void vma_commit_reservation(struct hstate *h,
>
>  	if (vma->vm_flags & VM_MAYSHARE) {
>  		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
> -		region_add(&inode->i_mapping->private_list, idx, idx + 1);
> +		struct resv_map *resv = inode->i_mapping->private_data;
> +
> +		region_add(&resv->regions, idx, idx + 1);
>
>  	} else if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
>  		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
> @@ -3074,6 +3072,7 @@ int hugetlb_reserve_pages(struct inode *inode,
>  	long ret, chg;
>  	struct hstate *h = hstate_inode(inode);
>  	struct hugepage_subpool *spool = subpool_inode(inode);
> +	struct resv_map *resv_map;
>
>  	/*
>  	 * Only apply hugepage reservation if asked. At fault time, an
> @@ -3089,10 +3088,13 @@ int hugetlb_reserve_pages(struct inode *inode,
>  	 * to reserve the full area even if read-only as mprotect() may be
>  	 * called to make the mapping read-write. Assume !vma is a shm mapping
>  	 */
> -	if (!vma || vma->vm_flags & VM_MAYSHARE)
> -		chg = region_chg(&inode->i_mapping->private_list, from, to);
> -	else {
> -		struct resv_map *resv_map = resv_map_alloc();
> +	if (!vma || vma->vm_flags & VM_MAYSHARE) {
> +		resv_map = inode->i_mapping->private_data;
> +
> +		chg = region_chg(&resv_map->regions, from, to);
> +
> +	} else {
> +		resv_map = resv_map_alloc();
>  		if (!resv_map)
>  			return -ENOMEM;
>
> @@ -3135,7 +3137,7 @@ int hugetlb_reserve_pages(struct inode *inode,
>  	 * else has to be done for private mappings here
>  	 */
>  	if (!vma || vma->vm_flags & VM_MAYSHARE)
> -		region_add(&inode->i_mapping->private_list, from, to);
> +		region_add(&resv_map->regions, from, to);
>  	return 0;
>  out_err:
>  	if (vma)
> @@ -3146,9 +3148,12 @@ out_err:
>  void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
>  {
>  	struct hstate *h = hstate_inode(inode);
> -	long chg = region_truncate(&inode->i_mapping->private_list, offset);
> +	struct resv_map *resv_map = inode->i_mapping->private_data;
> +	long chg = 0;
>  	struct hugepage_subpool *spool = subpool_inode(inode);
>
> +	if (resv_map)
> +		chg = region_truncate(&resv_map->regions, offset);
>  	spin_lock(&inode->i_lock);
>  	inode->i_blocks -= (blocks_per_huge_page(h) * freed);
>  	spin_unlock(&inode->i_lock);
> -- 
> 1.7.9.5


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH v2 11/20] mm, hugetlb: make vma_resv_map() works for all mapping type
  2013-08-09  9:26   ` Joonsoo Kim
@ 2013-08-21 10:37     ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 139+ messages in thread
From: Aneesh Kumar K.V @ 2013-08-21 10:37 UTC (permalink / raw)
  To: Joonsoo Kim, Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Michal Hocko, KAMEZAWA Hiroyuki,
	Hugh Dickins, Davidlohr Bueso, David Gibson, linux-mm,
	linux-kernel, Joonsoo Kim, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton, Joonsoo Kim

Joonsoo Kim <iamjoonsoo.kim@lge.com> writes:

> Until now, we get a resv_map in two different ways according to the mapping
> type. This makes the code messy and unreadable, so unify it.
>
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 869c3e0..e6c0c77 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -421,13 +421,24 @@ void resv_map_release(struct kref *ref)
>  	kfree(resv_map);
>  }
>
> +static inline struct resv_map *inode_resv_map(struct inode *inode)
> +{
> +	return inode->i_mapping->private_data;
> +}

It would be nice to have another function that returns the resv_map only if
we have HPAGE_RESV_OWNER, so that we could use it in
hugetlb_vm_op_open/close. Otherwise,

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
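
One possible shape for the helper suggested above (the name is made up here
and is not from the series; it only wraps the existing vma_resv_map() and
is_vma_resv_set() helpers):

	static inline struct resv_map *vma_owner_resv_map(struct vm_area_struct *vma)
	{
		if (!is_vma_resv_set(vma, HPAGE_RESV_OWNER))
			return NULL;

		return vma_resv_map(vma);
	}

hugetlb_vm_op_open()/hugetlb_vm_op_close() could then simply test the return
value instead of open-coding the HPAGE_RESV_OWNER check.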




> +
>  static struct resv_map *vma_resv_map(struct vm_area_struct *vma)
>  {
>  	VM_BUG_ON(!is_vm_hugetlb_page(vma));
> -	if (!(vma->vm_flags & VM_MAYSHARE))
> +	if (vma->vm_flags & VM_MAYSHARE) {
> +		struct address_space *mapping = vma->vm_file->f_mapping;
> +		struct inode *inode = mapping->host;
> +
> +		return inode_resv_map(inode);
> +
> +	} else {
>  		return (struct resv_map *)(get_vma_private_data(vma) &
>  							~HPAGE_RESV_MASK);
> -	return NULL;
> +	}
>  }
>
>  static void set_vma_resv_map(struct vm_area_struct *vma, struct resv_map *map)
> @@ -1107,44 +1118,31 @@ static void return_unused_surplus_pages(struct hstate *h,
>  static long vma_needs_reservation(struct hstate *h,
>  			struct vm_area_struct *vma, unsigned long addr)
>  {
> -	struct address_space *mapping = vma->vm_file->f_mapping;
> -	struct inode *inode = mapping->host;
> -
> -	if (vma->vm_flags & VM_MAYSHARE) {
> -		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
> -		struct resv_map *resv = inode->i_mapping->private_data;
> -
> -		return region_chg(&resv->regions, idx, idx + 1);
> +	struct resv_map *resv;
> +	pgoff_t idx;
> +	long chg;
>
> -	} else if (!is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
> +	resv = vma_resv_map(vma);
> +	if (!resv)
>  		return 1;
>
> -	} else  {
> -		long err;
> -		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
> -		struct resv_map *resv = vma_resv_map(vma);
> +	idx = vma_hugecache_offset(h, vma, addr);
> +	chg = region_chg(resv, idx, idx + 1);
>
> -		err = region_chg(resv, idx, idx + 1);
> -		if (err < 0)
> -			return err;
> -		return 0;
> -	}
> +	if (vma->vm_flags & VM_MAYSHARE)
> +		return chg;
> +	else
> +		return chg < 0 ? chg : 0;
>  }
>  static void vma_commit_reservation(struct hstate *h,
>  			struct vm_area_struct *vma, unsigned long addr)
>  {
> -	struct address_space *mapping = vma->vm_file->f_mapping;
> -	struct inode *inode = mapping->host;
> -
> -	if (vma->vm_flags & VM_MAYSHARE) {
> -		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
> -		struct resv_map *resv = inode->i_mapping->private_data;
> -
> -		region_add(&resv->regions, idx, idx + 1);
> +	struct resv_map *resv;
> +	pgoff_t idx;
>
> -	} else if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
> -		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
> -		struct resv_map *resv = vma_resv_map(vma);
> +	resv = vma_resv_map(vma);
> +	if (!resv)
> +		return;
>
>  	idx = vma_hugecache_offset(h, vma, addr);
>  	region_add(resv, idx, idx + 1);
> @@ -2208,7 +2206,7 @@ static void hugetlb_vm_op_open(struct vm_area_struct *vma)
>  	 * after this open call completes.  It is therefore safe to take a
>  	 * new reference here without additional locking.
>  	 */
> -	if (resv)
> +	if (resv && is_vma_resv_set(vma, HPAGE_RESV_OWNER))
>  		kref_get(&resv->refs);
>  }
>
> @@ -2221,7 +2219,10 @@ static void hugetlb_vm_op_close(struct vm_area_struct *vma)
>  	unsigned long start;
>  	unsigned long end;
>
> -	if (resv) {
> +	if (!resv)
> +		return;
> +
> +	if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
>  		start = vma_hugecache_offset(h, vma, vma->vm_start);
>  		end = vma_hugecache_offset(h, vma, vma->vm_end);
>
> @@ -3104,7 +3105,7 @@ int hugetlb_reserve_pages(struct inode *inode,
>  	 * called to make the mapping read-write. Assume !vma is a shm mapping
>  	 */
>  	if (!vma || vma->vm_flags & VM_MAYSHARE) {
> -		resv_map = inode->i_mapping->private_data;
> +		resv_map = inode_resv_map(inode);
>
>  		chg = region_chg(resv_map, from, to);
>
> @@ -3163,7 +3164,7 @@ out_err:
>  void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
>  {
>  	struct hstate *h = hstate_inode(inode);
> -	struct resv_map *resv_map = inode->i_mapping->private_data;
> +	struct resv_map *resv_map = inode_resv_map(inode);
>  	long chg = 0;
>  	struct hugepage_subpool *spool = subpool_inode(inode);
>
> -- 
> 1.7.9.5


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH v2 10/20] mm, hugetlb: remove resv_map_put()
  2013-08-09  9:26   ` Joonsoo Kim
@ 2013-08-21 10:49     ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 139+ messages in thread
From: Aneesh Kumar K.V @ 2013-08-21 10:49 UTC (permalink / raw)
  To: Joonsoo Kim, Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Michal Hocko, KAMEZAWA Hiroyuki,
	Hugh Dickins, Davidlohr Bueso, David Gibson, linux-mm,
	linux-kernel, Joonsoo Kim, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton, Joonsoo Kim

Joonsoo Kim <iamjoonsoo.kim@lge.com> writes:

> In the following patch, I change vma_resv_map() to return a resv_map in
> all cases. This patch prepares for it by removing resv_map_put(), which
> wouldn't work properly with that change, because it works only for
> HPAGE_RESV_OWNER's resv_map, not for all resv_maps.
>
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 73034dd..869c3e0 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2212,15 +2212,6 @@ static void hugetlb_vm_op_open(struct vm_area_struct *vma)
>  		kref_get(&resv->refs);
>  }
>
> -static void resv_map_put(struct vm_area_struct *vma)
> -{
> -	struct resv_map *resv = vma_resv_map(vma);
> -
> -	if (!resv)
> -		return;
> -	kref_put(&resv->refs, resv_map_release);
> -}

Why not have separate functions, one to return the vma_resv_map for
HPAGE_RESV_OWNER and one for the put? That way we could have something like

resv_map_hpage_resv_owner_get()
resv_map_hpage_resv_put()

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
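
A rough sketch of what such a pair could look like (the bodies here are only
a guess at the suggestion above, built from the existing vma_resv_map(),
is_vma_resv_set() and resv_map_release() helpers; they are not code from the
series):

	static void resv_map_hpage_resv_owner_get(struct vm_area_struct *vma)
	{
		struct resv_map *resv = vma_resv_map(vma);

		if (resv && is_vma_resv_set(vma, HPAGE_RESV_OWNER))
			kref_get(&resv->refs);
	}

	static void resv_map_hpage_resv_put(struct vm_area_struct *vma)
	{
		struct resv_map *resv = vma_resv_map(vma);

		if (resv && is_vma_resv_set(vma, HPAGE_RESV_OWNER))
			kref_put(&resv->refs, resv_map_release);
	}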

> -
>  static void hugetlb_vm_op_close(struct vm_area_struct *vma)
>  {
>  	struct hstate *h = hstate_vma(vma);
> @@ -2237,7 +2228,7 @@ static void hugetlb_vm_op_close(struct vm_area_struct *vma)
>  		reserve = (end - start) -
>  			region_count(resv, start, end);
>
> -		resv_map_put(vma);
> +		kref_put(&resv->refs, resv_map_release);
>
>  		if (reserve) {
>  			hugetlb_acct_memory(h, -reserve);
> @@ -3164,8 +3155,8 @@ int hugetlb_reserve_pages(struct inode *inode,
>  		region_add(resv_map, from, to);
>  	return 0;
>  out_err:
> -	if (vma)
> -		resv_map_put(vma);
> +	if (vma && is_vma_resv_set(vma, HPAGE_RESV_OWNER))
> +		kref_put(&resv_map->refs, resv_map_release);

For this:

    if (alloc)
        resv_map_hpage_resv_put();

>  	return ret;
>  }
>
> -- 
> 1.7.9.5


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH v2 03/20] mm, hugetlb: fix subpool accounting handling
  2013-08-21  9:28     ` Aneesh Kumar K.V
@ 2013-08-22  6:50       ` Joonsoo Kim
  -1 siblings, 0 replies; 139+ messages in thread
From: Joonsoo Kim @ 2013-08-22  6:50 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Andrew Morton, Rik van Riel, Mel Gorman, Michal Hocko,
	KAMEZAWA Hiroyuki, Hugh Dickins, Davidlohr Bueso, David Gibson,
	linux-mm, linux-kernel, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton

Hello, Aneesh.

First of all, thank you for review!

On Wed, Aug 21, 2013 at 02:58:20PM +0530, Aneesh Kumar K.V wrote:
> Joonsoo Kim <iamjoonsoo.kim@lge.com> writes:
> 
> > If we allocate a hugepage with avoid_reserve, we don't dequeue a reserved
> > one. So, we should check the subpool counter when avoid_reserve is set.
> > This patch implements it.
> 
> Can you explain this better? I.e., if we don't have a reservation in the
> area, chg != 0. So why look at avoid_reserve?

We don't consider avoid_reserve when chg != 0.
Look at the following code.

+       if (chg || avoid_reserve)
+               if (hugepage_subpool_get_pages(spool, 1))

It means that if chg != 0, we skip checking avoid_reserve.

> 
> Also the code would become simpler if you did
> 
> if (!chg && avoid_reserve)
>    chg = 1;
> 
> and then the rest of the code will be able to handle the case.


We still pass avoid_reserve to dequeue_huge_page_vma() and check avoid_reserve
there, so keeping avoid_reserve and checking it separately makes the logic
easier to understand. And it doesn't matter much in the end, since I
eventually unify these in patch 13.

Thanks.
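
For context, the two forms being compared look roughly like this (the lines
around the quoted fragment are reconstructed here for illustration and may
not match the actual patch):

	/* Form in the patch: keep avoid_reserve explicit */
	if (chg || avoid_reserve)
		if (hugepage_subpool_get_pages(spool, 1))
			return ERR_PTR(-ENOSPC);

	/* Suggested alternative: fold avoid_reserve into chg up front */
	if (!chg && avoid_reserve)
		chg = 1;
	if (chg)
		if (hugepage_subpool_get_pages(spool, 1))
			return ERR_PTR(-ENOSPC);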

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH v2 06/20] mm, hugetlb: return a reserved page to a reserved pool if failed
  2013-08-21  9:54     ` Aneesh Kumar K.V
@ 2013-08-22  6:51       ` Joonsoo Kim
  -1 siblings, 0 replies; 139+ messages in thread
From: Joonsoo Kim @ 2013-08-22  6:51 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Andrew Morton, Rik van Riel, Mel Gorman, Michal Hocko,
	KAMEZAWA Hiroyuki, Hugh Dickins, Davidlohr Bueso, David Gibson,
	linux-mm, linux-kernel, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton

On Wed, Aug 21, 2013 at 03:24:13PM +0530, Aneesh Kumar K.V wrote:
> Joonsoo Kim <iamjoonsoo.kim@lge.com> writes:
> 
> > If we fail with a reserved page, just calling put_page() is not sufficient,
> > because put_page() invokes free_huge_page() as its last step, and that
> > doesn't know whether the page came from the reserved pool or not. So it
> > doesn't do anything related to the reserve count. This leaves the reserve
> > count lower than it should be, because the reserve count was already
> > decremented in dequeue_huge_page_vma(). This patch fixes that situation.
> 
> You may want to document that you are using PagePrivate for tracking the
> reservation, and why it is OK to do that.

Okay! I will do it in the next spin.

Thanks.
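
The documentation might end up looking something like this (the wording is
only a guess at what the next spin could say, not text from the patch):

	/*
	 * A hugepage taken from the reserved pool is marked with
	 * PagePrivate.  free_huge_page() checks that flag and, if it is
	 * set, gives the reservation back, so an error path that drops
	 * the page before it is ever mapped does not leave the reserve
	 * count short.
	 */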

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH v2 07/20] mm, hugetlb: unify region structure handling
  2013-08-21 10:22     ` Aneesh Kumar K.V
@ 2013-08-22  6:53       ` Joonsoo Kim
  -1 siblings, 0 replies; 139+ messages in thread
From: Joonsoo Kim @ 2013-08-22  6:53 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Andrew Morton, Rik van Riel, Mel Gorman, Michal Hocko,
	KAMEZAWA Hiroyuki, Hugh Dickins, Davidlohr Bueso, David Gibson,
	linux-mm, linux-kernel, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton

On Wed, Aug 21, 2013 at 03:52:57PM +0530, Aneesh Kumar K.V wrote:
> Joonsoo Kim <iamjoonsoo.kim@lge.com> writes:
> 
> > Currently, to track reserved and allocated regions, we use two different
> > ways for MAP_SHARED and MAP_PRIVATE. For MAP_SHARED, we use the
> > address_space's private_list and, for MAP_PRIVATE, we use a resv_map.
> > Now, we are preparing to change the coarse grained lock which protects
> > the region structure to a fine grained lock, and this difference hinders
> > that. So, before changing it, unify region structure handling.
> >
> > Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> >
> > diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> > index a3f868a..9bf2c4a 100644
> > --- a/fs/hugetlbfs/inode.c
> > +++ b/fs/hugetlbfs/inode.c
> > @@ -366,7 +366,12 @@ static void truncate_hugepages(struct inode *inode, loff_t lstart)
> >
> >  static void hugetlbfs_evict_inode(struct inode *inode)
> >  {
> > +	struct resv_map *resv_map;
> > +
> >  	truncate_hugepages(inode, 0);
> > +	resv_map = (struct resv_map *)inode->i_mapping->private_data;
> 
> Can you add a comment saying that the root inode doesn't have a resv_map?

Okay! I will do it.

Thanks.
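
The comment could be as simple as something like this (the wording is only a
suggestion, not text from the next version of the patch):

	/*
	 * The root inode of a hugetlbfs mount does not have a resv_map
	 * attached, so ->private_data may be NULL here.
	 */
	resv_map = (struct resv_map *)inode->i_mapping->private_data;
	if (resv_map)
		kref_put(&resv_map->refs, resv_map_release);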

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH v2 07/20] mm, hugetlb: unify region structure handling
  2013-08-21  9:57     ` Aneesh Kumar K.V
@ 2013-08-22  6:56       ` Joonsoo Kim
  -1 siblings, 0 replies; 139+ messages in thread
From: Joonsoo Kim @ 2013-08-22  6:56 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Andrew Morton, Rik van Riel, Mel Gorman, Michal Hocko,
	KAMEZAWA Hiroyuki, Hugh Dickins, Davidlohr Bueso, David Gibson,
	linux-mm, linux-kernel, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton

On Wed, Aug 21, 2013 at 03:27:38PM +0530, Aneesh Kumar K.V wrote:
> Joonsoo Kim <iamjoonsoo.kim@lge.com> writes:
> 
> > Currently, to track reserved and allocated regions, we use two different
> > ways for MAP_SHARED and MAP_PRIVATE. For MAP_SHARED, we use the
> > address_space's private_list and, for MAP_PRIVATE, we use a resv_map.
> > Now, we are preparing to change the coarse grained lock which protects
> > the region structure to a fine grained lock, and this difference hinders
> > that. So, before changing it, unify region structure handling.
> >
> > Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> 
> As mentioned earlier, kref_put() is confusing because we always have a
> reference count == 1. Otherwise,

Okay. In that case, I will use the release function directly.

> 
> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

Thanks!

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH v2 09/20] mm, hugetlb: protect region tracking via newly introduced resv_map lock
  2013-08-21 10:13     ` Aneesh Kumar K.V
@ 2013-08-22  6:59       ` Joonsoo Kim
  -1 siblings, 0 replies; 139+ messages in thread
From: Joonsoo Kim @ 2013-08-22  6:59 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Andrew Morton, Rik van Riel, Mel Gorman, Michal Hocko,
	KAMEZAWA Hiroyuki, Hugh Dickins, Davidlohr Bueso, David Gibson,
	linux-mm, linux-kernel, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton

On Wed, Aug 21, 2013 at 03:43:27PM +0530, Aneesh Kumar K.V wrote:

> >  static long region_chg(struct resv_map *resv, long f, long t)
> >  {
> >  	struct list_head *head = &resv->regions;
> > -	struct file_region *rg, *nrg;
> > +	struct file_region *rg, *nrg = NULL;
> >  	long chg = 0;
> >
> > +retry:
> > +	spin_lock(&resv->lock);
> >  	/* Locate the region we are before or in. */
> >  	list_for_each_entry(rg, head, link)
> >  		if (f <= rg->to)
> > @@ -202,15 +199,27 @@ static long region_chg(struct resv_map *resv, long f, long t)
> >  	 * Subtle, allocate a new region at the position but make it zero
> >  	 * size such that we can guarantee to record the reservation. */
> >  	if (&rg->link == head || t < rg->from) {
> > -		nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
> > -		if (!nrg)
> > -			return -ENOMEM;
> > +		if (!nrg) {
> > +			nrg = kmalloc(sizeof(*nrg), GFP_NOWAIT);
> 
> Do we really need the GFP_NOWAIT allocation attempt? Why can't we simply
> allocate and retry? Or should resv->lock be a mutex?
> 

Yes, your proposal to simply allocate and retry looks good to me.
I will change it.

Thanks.
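
For reference, a minimal sketch of that direction (drop the lock, allocate
with GFP_KERNEL, then retake the lock and retry), based on the region_chg()
code quoted above; this only illustrates the idea and is not the actual
follow-up patch:

	static long region_chg(struct resv_map *resv, long f, long t)
	{
		struct list_head *head = &resv->regions;
		struct file_region *rg, *nrg = NULL;
		long chg = 0;

	retry:
		spin_lock(&resv->lock);
		/* Locate the region we are before or in. */
		list_for_each_entry(rg, head, link)
			if (f <= rg->to)
				break;

		/* If we are below the current region then a new region is
		 * required. */
		if (&rg->link == head || t < rg->from) {
			if (!nrg) {
				/* Drop the lock so the allocation can sleep. */
				spin_unlock(&resv->lock);
				nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
				if (!nrg)
					return -ENOMEM;
				/* The list may have changed meanwhile. */
				goto retry;
			}

			nrg->from = f;
			nrg->to   = f;
			INIT_LIST_HEAD(&nrg->link);
			list_add(&nrg->link, rg->link.prev);
			nrg = NULL;

			chg = t - f;
			goto out;
		}

		/* ... round edges and account overlaps as in the quoted hunk ... */

	out:
		spin_unlock(&resv->lock);
		/* Frees the pre-allocated region if it ended up unused. */
		kfree(nrg);
		return chg;
	}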

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH v2 03/20] mm, hugetlb: fix subpool accounting handling
  2013-08-22  6:50       ` Joonsoo Kim
@ 2013-08-22  7:08         ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 139+ messages in thread
From: Aneesh Kumar K.V @ 2013-08-22  7:08 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Rik van Riel, Mel Gorman, Michal Hocko,
	KAMEZAWA Hiroyuki, Hugh Dickins, Davidlohr Bueso, David Gibson,
	linux-mm, linux-kernel, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton

Joonsoo Kim <iamjoonsoo.kim@lge.com> writes:

> Hello, Aneesh.
>
> First of all, thank you for review!
>
> On Wed, Aug 21, 2013 at 02:58:20PM +0530, Aneesh Kumar K.V wrote:
>> Joonsoo Kim <iamjoonsoo.kim@lge.com> writes:
>> 
>> > If we alloc hugepage with avoid_reserve, we don't dequeue reserved one.
>> > So, we should check subpool counter when avoid_reserve.
>> > This patch implement it.
>> 
>> Can you explain this better ? ie, if we don't have a reservation in the
>> area chg != 0. So why look at avoid_reserve. 
>
> We don't consider avoid_reserve when chg != 0.
> Look at following code.
>
> +       if (chg || avoid_reserve)
> +               if (hugepage_subpool_get_pages(spool, 1))
>
> It means that if chg != 0, we skip to check avoid_reserve.

When would we have avoid_reserve == 1 and chg == 0 ?

-aneesh


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH v2 10/20] mm, hugetlb: remove resv_map_put()
  2013-08-21 10:49     ` Aneesh Kumar K.V
@ 2013-08-22  7:24       ` Joonsoo Kim
  -1 siblings, 0 replies; 139+ messages in thread
From: Joonsoo Kim @ 2013-08-22  7:24 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Andrew Morton, Rik van Riel, Mel Gorman, Michal Hocko,
	KAMEZAWA Hiroyuki, Hugh Dickins, Davidlohr Bueso, David Gibson,
	linux-mm, linux-kernel, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton

On Wed, Aug 21, 2013 at 04:19:20PM +0530, Aneesh Kumar K.V wrote:
> Joonsoo Kim <iamjoonsoo.kim@lge.com> writes:
> 
> > In following patch, I change vma_resv_map() to return resv_map
> > for all case. This patch prepares it by removing resv_map_put() which
> > doesn't works properly with following change, because it works only for
> > HPAGE_RESV_OWNER's resv_map, not for all resv_maps.
> >
> > Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 73034dd..869c3e0 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -2212,15 +2212,6 @@ static void hugetlb_vm_op_open(struct vm_area_struct *vma)
> >  		kref_get(&resv->refs);
> >  }
> >
> > -static void resv_map_put(struct vm_area_struct *vma)
> > -{
> > -	struct resv_map *resv = vma_resv_map(vma);
> > -
> > -	if (!resv)
> > -		return;
> > -	kref_put(&resv->refs, resv_map_release);
> > -}
> 
> Why not have seperate functions to return vma_resv_map for
> HPAGE_RESV_OWNER and one for put ? That way we could have something like
> 
> resv_map_hpage_resv_owner_get()
> resv_map_hpge_resv_put() 

Because there is no place that calls this function more than once.
IMO, in this simple case, open-coding it is easier to understand and
keeps the code size down.
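In other words, something like this at the single remaining call site (only a sketch; I am assuming hugetlb_vm_op_close() stays the lone caller):

	struct resv_map *resv = vma_resv_map(vma);

	/* open-coded put: drop the reference, releasing it on zero */
	if (resv)
		kref_put(&resv->refs, resv_map_release);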

> 
> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

Thanks :)

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH v2 11/20] mm, hugetlb: make vma_resv_map() works for all mapping type
  2013-08-21 10:37     ` Aneesh Kumar K.V
@ 2013-08-22  7:25       ` Joonsoo Kim
  -1 siblings, 0 replies; 139+ messages in thread
From: Joonsoo Kim @ 2013-08-22  7:25 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Andrew Morton, Rik van Riel, Mel Gorman, Michal Hocko,
	KAMEZAWA Hiroyuki, Hugh Dickins, Davidlohr Bueso, David Gibson,
	linux-mm, linux-kernel, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton

On Wed, Aug 21, 2013 at 04:07:36PM +0530, Aneesh Kumar K.V wrote:
> Joonsoo Kim <iamjoonsoo.kim@lge.com> writes:
> 
> > Util now, we get a resv_map by two ways according to each mapping type.
> > This makes code dirty and unreadable. So unfiying it.
> >
> > Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 869c3e0..e6c0c77 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -421,13 +421,24 @@ void resv_map_release(struct kref *ref)
> >  	kfree(resv_map);
> >  }
> >
> > +static inline struct resv_map *inode_resv_map(struct inode *inode)
> > +{
> > +	return inode->i_mapping->private_data;
> > +}
> 
> it would be nice to get have another function that will return resv_map
> only if we have HPAGE_RESV_OWNER. So that we could use that in
> hugetlb_vm_op_open/close. ? Otherwise 

It is answered in my previous reply.
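For context, the unified helper ends up looking roughly like this (a sketch based on this series; the exact form may differ in the next spin):

	static struct resv_map *vma_resv_map(struct vm_area_struct *vma)
	{
		if (vma->vm_flags & VM_MAYSHARE) {
			struct address_space *mapping = vma->vm_file->f_mapping;
			struct inode *inode = mapping->host;

			/* shared mappings keep the resv_map on the inode */
			return inode_resv_map(inode);
		} else {
			/* private mappings keep it in vm_private_data */
			return (struct resv_map *)(get_vma_private_data(vma) &
							~HPAGE_RESV_MASK);
		}
	}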

> 
> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

Thanks.


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH v2 03/20] mm, hugetlb: fix subpool accounting handling
  2013-08-22  7:08         ` Aneesh Kumar K.V
@ 2013-08-22  7:47           ` Joonsoo Kim
  -1 siblings, 0 replies; 139+ messages in thread
From: Joonsoo Kim @ 2013-08-22  7:47 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Andrew Morton, Rik van Riel, Mel Gorman, Michal Hocko,
	KAMEZAWA Hiroyuki, Hugh Dickins, Davidlohr Bueso, David Gibson,
	linux-mm, linux-kernel, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton

On Thu, Aug 22, 2013 at 12:38:12PM +0530, Aneesh Kumar K.V wrote:
> Joonsoo Kim <iamjoonsoo.kim@lge.com> writes:
> 
> > Hello, Aneesh.
> >
> > First of all, thank you for review!
> >
> > On Wed, Aug 21, 2013 at 02:58:20PM +0530, Aneesh Kumar K.V wrote:
> >> Joonsoo Kim <iamjoonsoo.kim@lge.com> writes:
> >> 
> >> > If we alloc hugepage with avoid_reserve, we don't dequeue reserved one.
> >> > So, we should check subpool counter when avoid_reserve.
> >> > This patch implement it.
> >> 
> >> Can you explain this better ? ie, if we don't have a reservation in the
> >> area chg != 0. So why look at avoid_reserve. 
> >
> > We don't consider avoid_reserve when chg != 0.
> > Look at following code.
> >
> > +       if (chg || avoid_reserve)
> > +               if (hugepage_subpool_get_pages(spool, 1))
> >
> > It means that if chg != 0, we skip to check avoid_reserve.
> 
> when whould be avoid_reserve == 1 and chg == 0 ?

In that case, we should still do hugepage_subpool_get_pages(), since we
don't dequeue a reserved page when avoid_reserve is set.
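To make the case concrete (a sketch of the caller-side logic; the wording of the comment is mine, not from the patch):

	/*
	 * avoid_reserve == 1 with chg == 0 happens when the task that
	 * created a MAP_PRIVATE mapping (and so owns a reservation for
	 * this address) breaks COW on a page still shared with a child:
	 * it must not consume its own reserve, so the subpool has to be
	 * charged for the extra page.
	 */
	if (chg || avoid_reserve)
		if (hugepage_subpool_get_pages(spool, 1))
			return ERR_PTR(-ENOSPC);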

Thanks.


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH v2 12/20] mm, hugetlb: remove vma_has_reserves()
  2013-08-09  9:26   ` Joonsoo Kim
@ 2013-08-22  8:44     ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 139+ messages in thread
From: Aneesh Kumar K.V @ 2013-08-22  8:44 UTC (permalink / raw)
  To: Joonsoo Kim, Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Michal Hocko, KAMEZAWA Hiroyuki,
	Hugh Dickins, Davidlohr Bueso, David Gibson, linux-mm,
	linux-kernel, Joonsoo Kim, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton, Joonsoo Kim

Joonsoo Kim <iamjoonsoo.kim@lge.com> writes:

> vma_has_reserves() can be substituted by using the return value of
> vma_needs_reservation(). If the chg returned by vma_needs_reservation()
> is 0, it means that the vma has reserves. Otherwise, it means that the vma
> doesn't have reserves and needs a hugepage outside of the reserve pool.
> This definition is exactly the same as vma_has_reserves(), so remove
> vma_has_reserves().
>
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index e6c0c77..22ceb04 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -473,39 +473,6 @@ void reset_vma_resv_huge_pages(struct vm_area_struct *vma)
>  		vma->vm_private_data = (void *)0;
>  }
>
> -/* Returns true if the VMA has associated reserve pages */
> -static int vma_has_reserves(struct vm_area_struct *vma, long chg)
> -{
> -	if (vma->vm_flags & VM_NORESERVE) {
> -		/*
> -		 * This address is already reserved by other process(chg == 0),
> -		 * so, we should decreament reserved count. Without
> -		 * decreamenting, reserve count is remained after releasing
> -		 * inode, because this allocated page will go into page cache
> -		 * and is regarded as coming from reserved pool in releasing
> -		 * step. Currently, we don't have any other solution to deal
> -		 * with this situation properly, so add work-around here.
> -		 */
> -		if (vma->vm_flags & VM_MAYSHARE && chg == 0)
> -			return 1;
> -		else
> -			return 0;
> -	}
> -
> -	/* Shared mappings always use reserves */
> -	if (vma->vm_flags & VM_MAYSHARE)
> -		return 1;
> -
> -	/*
> -	 * Only the process that called mmap() has reserves for
> -	 * private mappings.
> -	 */
> -	if (is_vma_resv_set(vma, HPAGE_RESV_OWNER))
> -		return 1;
> -
> -	return 0;
> -}
> -
>  static void copy_gigantic_page(struct page *dst, struct page *src)
>  {
>  	int i;
> @@ -580,8 +547,7 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
>  	 * have no page reserves. This check ensures that reservations are
>  	 * not "stolen". The child may still get SIGKILLed
>  	 */
> -	if (!vma_has_reserves(vma, chg) &&
> -			h->free_huge_pages - h->resv_huge_pages == 0)
> +	if (chg && h->free_huge_pages - h->resv_huge_pages == 0)
>  		return NULL;
>
>  	/* If reserves cannot be used, ensure enough pages are in the pool */
> @@ -600,7 +566,7 @@ retry_cpuset:
>  			if (page) {
>  				if (avoid_reserve)
>  					break;
> -				if (!vma_has_reserves(vma, chg))
> +				if (chg)
>  					break;
>
>  				SetPagePrivate(page);

Can you add a comment above both places to explain why checking chg
is good enough?
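Something along these lines, perhaps (only a sketch of the kind of comment meant here):

	/*
	 * chg != 0 means this vma has no reservation covering the
	 * faulting address, so the page must come out of the free pool
	 * rather than the reserve pool; bail out if every remaining
	 * free page is already spoken for by a reservation.
	 */
	if (chg && h->free_huge_pages - h->resv_huge_pages == 0)
		return NULL;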

-aneesh


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH v2 12/20] mm, hugetlb: remove vma_has_reserves()
  2013-08-22  8:44     ` Aneesh Kumar K.V
@ 2013-08-22  9:17       ` Joonsoo Kim
  -1 siblings, 0 replies; 139+ messages in thread
From: Joonsoo Kim @ 2013-08-22  9:17 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Andrew Morton, Rik van Riel, Mel Gorman, Michal Hocko,
	KAMEZAWA Hiroyuki, Hugh Dickins, Davidlohr Bueso, David Gibson,
	linux-mm, linux-kernel, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton

On Thu, Aug 22, 2013 at 02:14:38PM +0530, Aneesh Kumar K.V wrote:
> Joonsoo Kim <iamjoonsoo.kim@lge.com> writes:
> 
> > vma_has_reserves() can be substituted by using return value of
> > vma_needs_reservation(). If chg returned by vma_needs_reservation()
> > is 0, it means that vma has reserves. Otherwise, it means that vma don't
> > have reserves and need a hugepage outside of reserve pool. This definition
> > is perfectly same as vma_has_reserves(), so remove vma_has_reserves().
> >
> > Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> 
> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

Thanks.

> > @@ -580,8 +547,7 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
> >  	 * have no page reserves. This check ensures that reservations are
> >  	 * not "stolen". The child may still get SIGKILLed
> >  	 */
> > -	if (!vma_has_reserves(vma, chg) &&
> > -			h->free_huge_pages - h->resv_huge_pages == 0)
> > +	if (chg && h->free_huge_pages - h->resv_huge_pages == 0)
> >  		return NULL;
> >
> >  	/* If reserves cannot be used, ensure enough pages are in the pool */
> > @@ -600,7 +566,7 @@ retry_cpuset:
> >  			if (page) {
> >  				if (avoid_reserve)
> >  					break;
> > -				if (!vma_has_reserves(vma, chg))
> > +				if (chg)
> >  					break;
> >
> >  				SetPagePrivate(page);
> 
> Can you add a comment above both the place to explain why checking chg
> is good enough ?

Yes, I can. But it will be changed to use_reserve in patch 13, and that
name represents its meaning perfectly. So a comment may be unnecessary.

Thanks.

> 
> -aneesh
> 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH v2 12/20] mm, hugetlb: remove vma_has_reserves()
  2013-08-22  9:17       ` Joonsoo Kim
@ 2013-08-22 11:04         ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 139+ messages in thread
From: Aneesh Kumar K.V @ 2013-08-22 11:04 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Rik van Riel, Mel Gorman, Michal Hocko,
	KAMEZAWA Hiroyuki, Hugh Dickins, Davidlohr Bueso, David Gibson,
	linux-mm, linux-kernel, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton

Joonsoo Kim <iamjoonsoo.kim@lge.com> writes:

> On Thu, Aug 22, 2013 at 02:14:38PM +0530, Aneesh Kumar K.V wrote:
>> Joonsoo Kim <iamjoonsoo.kim@lge.com> writes:
>> 
>> > vma_has_reserves() can be substituted by using return value of
>> > vma_needs_reservation(). If chg returned by vma_needs_reservation()
>> > is 0, it means that vma has reserves. Otherwise, it means that vma don't
>> > have reserves and need a hugepage outside of reserve pool. This definition
>> > is perfectly same as vma_has_reserves(), so remove vma_has_reserves().
>> >
>> > Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
>> 
>> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
>
> Thanks.
>
>> > @@ -580,8 +547,7 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
>> >  	 * have no page reserves. This check ensures that reservations are
>> >  	 * not "stolen". The child may still get SIGKILLed
>> >  	 */
>> > -	if (!vma_has_reserves(vma, chg) &&
>> > -			h->free_huge_pages - h->resv_huge_pages == 0)
>> > +	if (chg && h->free_huge_pages - h->resv_huge_pages == 0)
>> >  		return NULL;
>> >
>> >  	/* If reserves cannot be used, ensure enough pages are in the pool */
>> > @@ -600,7 +566,7 @@ retry_cpuset:
>> >  			if (page) {
>> >  				if (avoid_reserve)
>> >  					break;
>> > -				if (!vma_has_reserves(vma, chg))
>> > +				if (chg)
>> >  					break;
>> >
>> >  				SetPagePrivate(page);
>> 
>> Can you add a comment above both the place to explain why checking chg
>> is good enough ?
>
> Yes, I can. But it will be changed to use_reserve in patch 13 and it
> represent it's meaning perfectly. So commeting may be useless.

That should be OK, because having a comment in this patch helps in
understanding the patch better, even though you remove it again
later.

-aneesh


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH v2 12/20] mm, hugetlb: remove vma_has_reserves()
  2013-08-22 11:04         ` Aneesh Kumar K.V
@ 2013-08-23  6:16           ` Joonsoo Kim
  -1 siblings, 0 replies; 139+ messages in thread
From: Joonsoo Kim @ 2013-08-23  6:16 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Andrew Morton, Rik van Riel, Mel Gorman, Michal Hocko,
	KAMEZAWA Hiroyuki, Hugh Dickins, Davidlohr Bueso, David Gibson,
	linux-mm, linux-kernel, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton

On Thu, Aug 22, 2013 at 04:34:22PM +0530, Aneesh Kumar K.V wrote:
> Joonsoo Kim <iamjoonsoo.kim@lge.com> writes:
> 
> > On Thu, Aug 22, 2013 at 02:14:38PM +0530, Aneesh Kumar K.V wrote:
> >> Joonsoo Kim <iamjoonsoo.kim@lge.com> writes:
> >> 
> >> > vma_has_reserves() can be substituted by using return value of
> >> > vma_needs_reservation(). If chg returned by vma_needs_reservation()
> >> > is 0, it means that vma has reserves. Otherwise, it means that vma don't
> >> > have reserves and need a hugepage outside of reserve pool. This definition
> >> > is perfectly same as vma_has_reserves(), so remove vma_has_reserves().
> >> >
> >> > Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> >> 
> >> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> >
> > Thanks.
> >
> >> > @@ -580,8 +547,7 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
> >> >  	 * have no page reserves. This check ensures that reservations are
> >> >  	 * not "stolen". The child may still get SIGKILLed
> >> >  	 */
> >> > -	if (!vma_has_reserves(vma, chg) &&
> >> > -			h->free_huge_pages - h->resv_huge_pages == 0)
> >> > +	if (chg && h->free_huge_pages - h->resv_huge_pages == 0)
> >> >  		return NULL;
> >> >
> >> >  	/* If reserves cannot be used, ensure enough pages are in the pool */
> >> > @@ -600,7 +566,7 @@ retry_cpuset:
> >> >  			if (page) {
> >> >  				if (avoid_reserve)
> >> >  					break;
> >> > -				if (!vma_has_reserves(vma, chg))
> >> > +				if (chg)
> >> >  					break;
> >> >
> >> >  				SetPagePrivate(page);
> >> 
> >> Can you add a comment above both the place to explain why checking chg
> >> is good enough ?
> >
> > Yes, I can. But it will be changed to use_reserve in patch 13 and it
> > represent it's meaning perfectly. So commeting may be useless.
> 
> That should be ok, because having a comment in this patch helps in
> understanding the patch better, even though you are removing that
> later. 

Okay. I will add it in the next spin.

Thanks.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH v2 03/20] mm, hugetlb: fix subpool accounting handling
  2013-08-22  7:47           ` Joonsoo Kim
@ 2013-08-26 13:01             ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 139+ messages in thread
From: Aneesh Kumar K.V @ 2013-08-26 13:01 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Rik van Riel, Mel Gorman, Michal Hocko,
	KAMEZAWA Hiroyuki, Hugh Dickins, Davidlohr Bueso, David Gibson,
	linux-mm, linux-kernel, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton

Joonsoo Kim <iamjoonsoo.kim@lge.com> writes:

> On Thu, Aug 22, 2013 at 12:38:12PM +0530, Aneesh Kumar K.V wrote:
>> Joonsoo Kim <iamjoonsoo.kim@lge.com> writes:
>> 
>> > Hello, Aneesh.
>> >
>> > First of all, thank you for review!
>> >
>> > On Wed, Aug 21, 2013 at 02:58:20PM +0530, Aneesh Kumar K.V wrote:
>> >> Joonsoo Kim <iamjoonsoo.kim@lge.com> writes:
>> >> 
>> >> > If we alloc hugepage with avoid_reserve, we don't dequeue reserved one.
>> >> > So, we should check subpool counter when avoid_reserve.
>> >> > This patch implement it.
>> >> 
>> >> Can you explain this better ? ie, if we don't have a reservation in the
>> >> area chg != 0. So why look at avoid_reserve. 
>> >
>> > We don't consider avoid_reserve when chg != 0.
>> > Look at following code.
>> >
>> > +       if (chg || avoid_reserve)
>> > +               if (hugepage_subpool_get_pages(spool, 1))
>> >
>> > It means that if chg != 0, we skip to check avoid_reserve.
>> 
>> when whould be avoid_reserve == 1 and chg == 0 ?
>
> In this case, we should do hugepage_subpool_get_pages(), since we don't
> get a reserved page due to avoid_reserve.

As per the off-list discussion we had around this, please add additional
information to the commit message explaining when we have
avoid_reserve == 1 and chg == 0.

Something like the below, copied from the call site.

	 /* If the process that created a MAP_PRIVATE mapping is about to
	  * perform a COW due to a shared page count, attempt to satisfy
	  * the allocation without using the existing reserves
	  */

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

-aneesh


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH v2 13/20] mm, hugetlb: mm, hugetlb: unify chg and avoid_reserve to use_reserve
  2013-08-09  9:26   ` Joonsoo Kim
@ 2013-08-26 13:09     ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 139+ messages in thread
From: Aneesh Kumar K.V @ 2013-08-26 13:09 UTC (permalink / raw)
  To: Joonsoo Kim, Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Michal Hocko, KAMEZAWA Hiroyuki,
	Hugh Dickins, Davidlohr Bueso, David Gibson, linux-mm,
	linux-kernel, Joonsoo Kim, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton, Joonsoo Kim

Joonsoo Kim <iamjoonsoo.kim@lge.com> writes:

> Currently, we have two variables to represent whether we can use a reserved
> page or not: chg and avoid_reserve, respectively. By aggregating these,
> we can have cleaner code. This makes no functional difference.
>
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 22ceb04..8dff972 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -531,8 +531,7 @@ static struct page *dequeue_huge_page_node(struct hstate *h, int nid)
>
>  static struct page *dequeue_huge_page_vma(struct hstate *h,
>  				struct vm_area_struct *vma,
> -				unsigned long address, int avoid_reserve,
> -				long chg)
> +				unsigned long address, bool use_reserve)
>  {
>  	struct page *page = NULL;
>  	struct mempolicy *mpol;
> @@ -546,12 +545,10 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
>  	 * A child process with MAP_PRIVATE mappings created by their parent
>  	 * have no page reserves. This check ensures that reservations are
>  	 * not "stolen". The child may still get SIGKILLed
> +	 * Or, when parent process do COW, we cannot use reserved page.
> +	 * In this case, ensure enough pages are in the pool.
>  	 */
> -	if (chg && h->free_huge_pages - h->resv_huge_pages == 0)
> -		return NULL;

This hunk would be much easier to follow if you were changing:

	if (!vma_has_reserves(vma) &&
			h->free_huge_pages - h->resv_huge_pages == 0)
		goto err;

ie, !vma_has_reserves(vma) == !use_reserve.
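Spelling that equivalence out (just a sketch of how I read the series; the comment is mine):

	/*
	 * use_reserve is true only when chg == 0 and avoid_reserve is
	 * not set, i.e. exactly the cases where the old code was
	 * willing to dequeue a page from the reserve pool.
	 */
	use_reserve = (!chg && !avoid_reserve);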

So maybe a patch rearrangement would help? But nevertheless:

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

> -
> -	/* If reserves cannot be used, ensure enough pages are in the pool */
> -	if (avoid_reserve && h->free_huge_pages - h->resv_huge_pages == 0)
> +	if (!use_reserve && h->free_huge_pages - h->resv_huge_pages == 0)
>  		return NULL;



>
>  retry_cpuset:
> @@ -564,9 +561,7 @@ retry_cpuset:
>  		if (cpuset_zone_allowed_softwall(zone, htlb_alloc_mask)) {
>  			page = dequeue_huge_page_node(h, zone_to_nid(zone));
>  			if (page) {
> -				if (avoid_reserve)
> -					break;
> -				if (chg)
> +				if (!use_reserve)
>  					break;
>
>  				SetPagePrivate(page);
> @@ -1121,6 +1116,7 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
>  	struct hstate *h = hstate_vma(vma);
>  	struct page *page;
>  	long chg;
> +	bool use_reserve;
>  	int ret, idx;
>  	struct hugetlb_cgroup *h_cg;
>
> @@ -1136,18 +1132,19 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
>  	chg = vma_needs_reservation(h, vma, addr);
>  	if (chg < 0)
>  		return ERR_PTR(-ENOMEM);
> -	if (chg || avoid_reserve)
> +	use_reserve = (!chg && !avoid_reserve);
> +	if (!use_reserve)
>  		if (hugepage_subpool_get_pages(spool, 1))
>  			return ERR_PTR(-ENOSPC);
>
>  	ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg);
>  	if (ret) {
> -		if (chg || avoid_reserve)
> +		if (!use_reserve)
>  			hugepage_subpool_put_pages(spool, 1);
>  		return ERR_PTR(-ENOSPC);
>  	}
>  	spin_lock(&hugetlb_lock);
> -	page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve, chg);
> +	page = dequeue_huge_page_vma(h, vma, addr, use_reserve);
>  	if (!page) {
>  		spin_unlock(&hugetlb_lock);
>  		page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
> @@ -1155,7 +1152,7 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
>  			hugetlb_cgroup_uncharge_cgroup(idx,
>  						       pages_per_huge_page(h),
>  						       h_cg);
> -			if (chg || avoid_reserve)
> +			if (!use_reserve)
>  				hugepage_subpool_put_pages(spool, 1);
>  			return ERR_PTR(-ENOSPC);
>  		}
> -- 
> 1.7.9.5


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH v2 14/20] mm, hugetlb: call vma_needs_reservation before entering alloc_huge_page()
  2013-08-09  9:26   ` Joonsoo Kim
@ 2013-08-26 13:36     ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 139+ messages in thread
From: Aneesh Kumar K.V @ 2013-08-26 13:36 UTC (permalink / raw)
  To: Joonsoo Kim, Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Michal Hocko, KAMEZAWA Hiroyuki,
	Hugh Dickins, Davidlohr Bueso, David Gibson, linux-mm,
	linux-kernel, Joonsoo Kim, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton, Joonsoo Kim

Joonsoo Kim <iamjoonsoo.kim@lge.com> writes:

> In order to validate that this failure is reasonable, we need to know
> whether the allocation request is for a reserved page or not in the caller.
> So move vma_needs_reservation() up to the callers of alloc_huge_page().
> There is no functional change in this patch; the following patch uses
> this information.
>
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 8dff972..bc666cf 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1110,13 +1110,11 @@ static void vma_commit_reservation(struct hstate *h,
>  }
>
>  static struct page *alloc_huge_page(struct vm_area_struct *vma,
> -				    unsigned long addr, int avoid_reserve)
> +				    unsigned long addr, int use_reserve)
>  {
>  	struct hugepage_subpool *spool = subpool_vma(vma);
>  	struct hstate *h = hstate_vma(vma);
>  	struct page *page;
> -	long chg;
> -	bool use_reserve;
>  	int ret, idx;
>  	struct hugetlb_cgroup *h_cg;
>
> @@ -1129,10 +1127,6 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
>  	 * need pages and subpool limit allocated allocated if no reserve
>  	 * mapping overlaps.
>  	 */
> -	chg = vma_needs_reservation(h, vma, addr);
> -	if (chg < 0)
> -		return ERR_PTR(-ENOMEM);
> -	use_reserve = (!chg && !avoid_reserve);
>  	if (!use_reserve)
>  		if (hugepage_subpool_get_pages(spool, 1))
>  			return ERR_PTR(-ENOSPC);
> @@ -2504,6 +2498,8 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
>  	struct hstate *h = hstate_vma(vma);
>  	struct page *old_page, *new_page;
>  	int outside_reserve = 0;
> +	long chg;
> +	bool use_reserve;
>  	unsigned long mmun_start;	/* For mmu_notifiers */
>  	unsigned long mmun_end;		/* For mmu_notifiers */
>
> @@ -2535,7 +2531,17 @@ retry_avoidcopy:
>
>  	/* Drop page_table_lock as buddy allocator may be called */
>  	spin_unlock(&mm->page_table_lock);
> -	new_page = alloc_huge_page(vma, address, outside_reserve);
> +	chg = vma_needs_reservation(h, vma, address);
> +	if (chg == -ENOMEM) {

Why not

    if (chg < 0) ?

Should we try to unmap the page from the child and avoid the COW here?
Maybe with outside_reserve = 1 we will never have vma_needs_reservation()
fail. Anyhow, it would be nice to document why this error case is handled
differently from the alloc_huge_page() error case.
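i.e., roughly (an untested sketch of the suggested error handling, mirroring the hunk above):

	chg = vma_needs_reservation(h, vma, address);
	if (chg < 0) {		/* catch any error, not just -ENOMEM */
		page_cache_release(old_page);

		/* Caller expects page_table_lock to be held */
		spin_lock(&mm->page_table_lock);
		return VM_FAULT_OOM;
	}
	use_reserve = !chg && !outside_reserve;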


> +		page_cache_release(old_page);
> +
> +		/* Caller expects lock to be held */
> +		spin_lock(&mm->page_table_lock);
> +		return VM_FAULT_OOM;
> +	}
> +	use_reserve = !chg && !outside_reserve;
> +
> +	new_page = alloc_huge_page(vma, address, use_reserve);
>
>  	if (IS_ERR(new_page)) {
>  		long err = PTR_ERR(new_page);
> @@ -2664,6 +2670,8 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  	struct page *page;
>  	struct address_space *mapping;
>  	pte_t new_pte;
> +	long chg;
> +	bool use_reserve;
>
>  	/*
>  	 * Currently, we are forced to kill the process in the event the
> @@ -2689,7 +2697,15 @@ retry:
>  		size = i_size_read(mapping->host) >> huge_page_shift(h);
>  		if (idx >= size)
>  			goto out;
> -		page = alloc_huge_page(vma, address, 0);
> +
> +		chg = vma_needs_reservation(h, vma, address);
> +		if (chg == -ENOMEM) {

if (chg < 0)

> +			ret = VM_FAULT_OOM;
> +			goto out;
> +		}
> +		use_reserve = !chg;
> +
> +		page = alloc_huge_page(vma, address, use_reserve);
>  		if (IS_ERR(page)) {
>  			ret = PTR_ERR(page);
>  			if (ret == -ENOMEM)
> -- 
> 1.7.9.5


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH v2 14/20] mm, hugetlb: call vma_needs_reservation before entering alloc_huge_page()
@ 2013-08-26 13:36     ` Aneesh Kumar K.V
  0 siblings, 0 replies; 139+ messages in thread
From: Aneesh Kumar K.V @ 2013-08-26 13:36 UTC (permalink / raw)
  To: Joonsoo Kim, Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Michal Hocko, KAMEZAWA Hiroyuki,
	Hugh Dickins, Davidlohr Bueso, David Gibson, linux-mm,
	linux-kernel, Joonsoo Kim, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton

Joonsoo Kim <iamjoonsoo.kim@lge.com> writes:

> In order to validate that this failure is reasonable, we need to know
> whether allocation request is for reserved or not on caller function.
> So moving vma_needs_reservation() up to the caller of alloc_huge_page().
> There is no functional change in this patch and following patch use
> this information.
>
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 8dff972..bc666cf 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1110,13 +1110,11 @@ static void vma_commit_reservation(struct hstate *h,
>  }
>
>  static struct page *alloc_huge_page(struct vm_area_struct *vma,
> -				    unsigned long addr, int avoid_reserve)
> +				    unsigned long addr, int use_reserve)
>  {
>  	struct hugepage_subpool *spool = subpool_vma(vma);
>  	struct hstate *h = hstate_vma(vma);
>  	struct page *page;
> -	long chg;
> -	bool use_reserve;
>  	int ret, idx;
>  	struct hugetlb_cgroup *h_cg;
>
> @@ -1129,10 +1127,6 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
>  	 * need pages and subpool limit allocated allocated if no reserve
>  	 * mapping overlaps.
>  	 */
> -	chg = vma_needs_reservation(h, vma, addr);
> -	if (chg < 0)
> -		return ERR_PTR(-ENOMEM);
> -	use_reserve = (!chg && !avoid_reserve);
>  	if (!use_reserve)
>  		if (hugepage_subpool_get_pages(spool, 1))
>  			return ERR_PTR(-ENOSPC);
> @@ -2504,6 +2498,8 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
>  	struct hstate *h = hstate_vma(vma);
>  	struct page *old_page, *new_page;
>  	int outside_reserve = 0;
> +	long chg;
> +	bool use_reserve;
>  	unsigned long mmun_start;	/* For mmu_notifiers */
>  	unsigned long mmun_end;		/* For mmu_notifiers */
>
> @@ -2535,7 +2531,17 @@ retry_avoidcopy:
>
>  	/* Drop page_table_lock as buddy allocator may be called */
>  	spin_unlock(&mm->page_table_lock);
> -	new_page = alloc_huge_page(vma, address, outside_reserve);
> +	chg = vma_needs_reservation(h, vma, address);
> +	if (chg == -ENOMEM) {

why not 

    if (chg < 0) ?

Should we try to unmap the page from child and avoid cow here ?. May be
with outside_reserve = 1 we will never have vma_needs_reservation fail.
Any how it would be nice to document why this error case is different
from alloc_huge_page error case.


> +		page_cache_release(old_page);
> +
> +		/* Caller expects lock to be held */
> +		spin_lock(&mm->page_table_lock);
> +		return VM_FAULT_OOM;
> +	}
> +	use_reserve = !chg && !outside_reserve;
> +
> +	new_page = alloc_huge_page(vma, address, use_reserve);
>
>  	if (IS_ERR(new_page)) {
>  		long err = PTR_ERR(new_page);
> @@ -2664,6 +2670,8 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  	struct page *page;
>  	struct address_space *mapping;
>  	pte_t new_pte;
> +	long chg;
> +	bool use_reserve;
>
>  	/*
>  	 * Currently, we are forced to kill the process in the event the
> @@ -2689,7 +2697,15 @@ retry:
>  		size = i_size_read(mapping->host) >> huge_page_shift(h);
>  		if (idx >= size)
>  			goto out;
> -		page = alloc_huge_page(vma, address, 0);
> +
> +		chg = vma_needs_reservation(h, vma, address);
> +		if (chg == -ENOMEM) {

if (chg < 0)

> +			ret = VM_FAULT_OOM;
> +			goto out;
> +		}
> +		use_reserve = !chg;
> +
> +		page = alloc_huge_page(vma, address, use_reserve);
>  		if (IS_ERR(page)) {
>  			ret = PTR_ERR(page);
>  			if (ret == -ENOMEM)
> -- 
> 1.7.9.5


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH v2 15/20] mm, hugetlb: remove a check for return value of alloc_huge_page()
  2013-08-09  9:26   ` Joonsoo Kim
@ 2013-08-26 13:38     ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 139+ messages in thread
From: Aneesh Kumar K.V @ 2013-08-26 13:38 UTC (permalink / raw)
  To: Joonsoo Kim, Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Michal Hocko, KAMEZAWA Hiroyuki,
	Hugh Dickins, Davidlohr Bueso, David Gibson, linux-mm,
	linux-kernel, Joonsoo Kim, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton, Joonsoo Kim

Joonsoo Kim <iamjoonsoo.kim@lge.com> writes:

> Now, alloc_huge_page() only returns -ENOSPC on failure,
> so we don't need to worry about other return values.
>
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index bc666cf..24de2ca 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2544,7 +2544,6 @@ retry_avoidcopy:
>  	new_page = alloc_huge_page(vma, address, use_reserve);
>
>  	if (IS_ERR(new_page)) {
> -		long err = PTR_ERR(new_page);
>  		page_cache_release(old_page);
>
>  		/*
> @@ -2573,10 +2572,7 @@ retry_avoidcopy:
>
>  		/* Caller expects lock to be held */
>  		spin_lock(&mm->page_table_lock);
> -		if (err == -ENOMEM)
> -			return VM_FAULT_OOM;
> -		else
> -			return VM_FAULT_SIGBUS;
> +		return VM_FAULT_SIGBUS;
>  	}
>
>  	/*
> @@ -2707,11 +2703,7 @@ retry:
>
>  		page = alloc_huge_page(vma, address, use_reserve);
>  		if (IS_ERR(page)) {
> -			ret = PTR_ERR(page);
> -			if (ret == -ENOMEM)
> -				ret = VM_FAULT_OOM;
> -			else
> -				ret = VM_FAULT_SIGBUS;
> +			ret = VM_FAULT_SIGBUS;
>  			goto out;
>  		}
>  		clear_huge_page(page, address, pages_per_huge_page(h));
> -- 
> 1.7.9.5


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH v2 16/20] mm, hugetlb: move down outside_reserve check
  2013-08-09  9:26   ` Joonsoo Kim
@ 2013-08-26 13:44     ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 139+ messages in thread
From: Aneesh Kumar K.V @ 2013-08-26 13:44 UTC (permalink / raw)
  To: Joonsoo Kim, Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Michal Hocko, KAMEZAWA Hiroyuki,
	Hugh Dickins, Davidlohr Bueso, David Gibson, linux-mm,
	linux-kernel, Joonsoo Kim, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton, Joonsoo Kim

Joonsoo Kim <iamjoonsoo.kim@lge.com> writes:

> Just move the outside_reserve check down and don't call
> vma_needs_reservation() when outside_reserve is true. This is a slightly
> optimized implementation.
>
> It also makes the code more readable.
>
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

I guess this addresses the comment I had on the previous patch.

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 24de2ca..2372f75 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2499,7 +2499,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
>  	struct page *old_page, *new_page;
>  	int outside_reserve = 0;
>  	long chg;
> -	bool use_reserve;
> +	bool use_reserve = false;
>  	unsigned long mmun_start;	/* For mmu_notifiers */
>  	unsigned long mmun_end;		/* For mmu_notifiers */
>
> @@ -2514,6 +2514,11 @@ retry_avoidcopy:
>  		return 0;
>  	}
>
> +	page_cache_get(old_page);
> +
> +	/* Drop page_table_lock as buddy allocator may be called */
> +	spin_unlock(&mm->page_table_lock);
> +
>  	/*
>  	 * If the process that created a MAP_PRIVATE mapping is about to
>  	 * perform a COW due to a shared page count, attempt to satisfy
> @@ -2527,19 +2532,17 @@ retry_avoidcopy:
>  			old_page != pagecache_page)
>  		outside_reserve = 1;
>
> -	page_cache_get(old_page);
> -
> -	/* Drop page_table_lock as buddy allocator may be called */
> -	spin_unlock(&mm->page_table_lock);
> -	chg = vma_needs_reservation(h, vma, address);
> -	if (chg == -ENOMEM) {
> -		page_cache_release(old_page);
> +	if (!outside_reserve) {
> +		chg = vma_needs_reservation(h, vma, address);
> +		if (chg == -ENOMEM) {
> +			page_cache_release(old_page);
>
> -		/* Caller expects lock to be held */
> -		spin_lock(&mm->page_table_lock);
> -		return VM_FAULT_OOM;
> +			/* Caller expects lock to be held */
> +			spin_lock(&mm->page_table_lock);
> +			return VM_FAULT_OOM;
> +		}
> +		use_reserve = !chg;
>  	}
> -	use_reserve = !chg && !outside_reserve;
>
>  	new_page = alloc_huge_page(vma, address, use_reserve);
>
> -- 
> 1.7.9.5


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH v2 14/20] mm, hugetlb: call vma_needs_reservation before entering alloc_huge_page()
  2013-08-26 13:36     ` Aneesh Kumar K.V
@ 2013-08-26 13:46       ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 139+ messages in thread
From: Aneesh Kumar K.V @ 2013-08-26 13:46 UTC (permalink / raw)
  To: Joonsoo Kim, Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Michal Hocko, KAMEZAWA Hiroyuki,
	Hugh Dickins, Davidlohr Bueso, David Gibson, linux-mm,
	linux-kernel, Joonsoo Kim, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton, Joonsoo Kim

"Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> writes:

> Joonsoo Kim <iamjoonsoo.kim@lge.com> writes:
>
>> In order to validate that this failure is reasonable, we need to know
>> in the caller whether the allocation request is for a reserved page or not.
>> So move vma_needs_reservation() up to the callers of alloc_huge_page().
>> There is no functional change in this patch; the following patch uses
>> this information.
>>
>> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
>>
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 8dff972..bc666cf 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -1110,13 +1110,11 @@ static void vma_commit_reservation(struct hstate *h,
>>  }
>>
>>  static struct page *alloc_huge_page(struct vm_area_struct *vma,
>> -				    unsigned long addr, int avoid_reserve)
>> +				    unsigned long addr, int use_reserve)
>>  {
>>  	struct hugepage_subpool *spool = subpool_vma(vma);
>>  	struct hstate *h = hstate_vma(vma);
>>  	struct page *page;
>> -	long chg;
>> -	bool use_reserve;
>>  	int ret, idx;
>>  	struct hugetlb_cgroup *h_cg;
>>
>> @@ -1129,10 +1127,6 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
>>  	 * need pages and subpool limit allocated allocated if no reserve
>>  	 * mapping overlaps.
>>  	 */
>> -	chg = vma_needs_reservation(h, vma, addr);
>> -	if (chg < 0)
>> -		return ERR_PTR(-ENOMEM);
>> -	use_reserve = (!chg && !avoid_reserve);
>>  	if (!use_reserve)
>>  		if (hugepage_subpool_get_pages(spool, 1))
>>  			return ERR_PTR(-ENOSPC);
>> @@ -2504,6 +2498,8 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
>>  	struct hstate *h = hstate_vma(vma);
>>  	struct page *old_page, *new_page;
>>  	int outside_reserve = 0;
>> +	long chg;
>> +	bool use_reserve;
>>  	unsigned long mmun_start;	/* For mmu_notifiers */
>>  	unsigned long mmun_end;		/* For mmu_notifiers */
>>
>> @@ -2535,7 +2531,17 @@ retry_avoidcopy:
>>
>>  	/* Drop page_table_lock as buddy allocator may be called */
>>  	spin_unlock(&mm->page_table_lock);
>> -	new_page = alloc_huge_page(vma, address, outside_reserve);
>> +	chg = vma_needs_reservation(h, vma, address);
>> +	if (chg == -ENOMEM) {
>
> why not
>
>     if (chg < 0) ?
>
> Should we try to unmap the page from the child and avoid the COW here?
> Maybe with outside_reserve = 1 we will never have vma_needs_reservation()
> fail. Anyhow, it would be nice to document why this error case is
> different from the alloc_huge_page() error case.
>

I guess patch 16 addresses this. So if we do if (chg < 0) we are good
here.

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

-aneesh


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH v2 17/20] mm, hugetlb: move up anon_vma_prepare()
  2013-08-09  9:26   ` Joonsoo Kim
@ 2013-08-26 14:09     ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 139+ messages in thread
From: Aneesh Kumar K.V @ 2013-08-26 14:09 UTC (permalink / raw)
  To: Joonsoo Kim, Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Michal Hocko, KAMEZAWA Hiroyuki,
	Hugh Dickins, Davidlohr Bueso, David Gibson, linux-mm,
	linux-kernel, Joonsoo Kim, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton, Joonsoo Kim

Joonsoo Kim <iamjoonsoo.kim@lge.com> writes:

> If we fail after having allocated a hugepage, we need some effort to
> recover properly. So it is better to delay hugepage allocation as much as
> possible. Move up anon_vma_prepare(), which can fail in an OOM situation.
>
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 2372f75..7e9a651 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2520,6 +2520,17 @@ retry_avoidcopy:
>  	spin_unlock(&mm->page_table_lock);
>
>  	/*
> +	 * When the original hugepage is shared one, it does not have
> +	 * anon_vma prepared.
> +	 */
> +	if (unlikely(anon_vma_prepare(vma))) {
> +		page_cache_release(old_page);
> +		/* Caller expects lock to be held */
> +		spin_lock(&mm->page_table_lock);
> +		return VM_FAULT_OOM;
> +	}
> +
> +	/*
>  	 * If the process that created a MAP_PRIVATE mapping is about to
>  	 * perform a COW due to a shared page count, attempt to satisfy
>  	 * the allocation without using the existing reserves. The pagecache
> @@ -2578,18 +2589,6 @@ retry_avoidcopy:
>  		return VM_FAULT_SIGBUS;
>  	}
>
> -	/*
> -	 * When the original hugepage is shared one, it does not have
> -	 * anon_vma prepared.
> -	 */
> -	if (unlikely(anon_vma_prepare(vma))) {
> -		page_cache_release(new_page);
> -		page_cache_release(old_page);
> -		/* Caller expects lock to be held */
> -		spin_lock(&mm->page_table_lock);
> -		return VM_FAULT_OOM;
> -	}
> -
>  	copy_user_huge_page(new_page, old_page, address, vma,
>  			    pages_per_huge_page(h));
>  	__SetPageUptodate(new_page);
> -- 
> 1.7.9.5


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH v2 18/20] mm, hugetlb: clean-up error handling in hugetlb_cow()
  2013-08-09  9:26   ` Joonsoo Kim
@ 2013-08-26 14:12     ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 139+ messages in thread
From: Aneesh Kumar K.V @ 2013-08-26 14:12 UTC (permalink / raw)
  To: Joonsoo Kim, Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Michal Hocko, KAMEZAWA Hiroyuki,
	Hugh Dickins, Davidlohr Bueso, David Gibson, linux-mm,
	linux-kernel, Joonsoo Kim, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton, Joonsoo Kim

Joonsoo Kim <iamjoonsoo.kim@lge.com> writes:

> The current code repeats 'Caller expects lock to be held' in every error
> path. We can clean this up by doing the error handling in one place.
>
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 7e9a651..8743e5c 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2500,6 +2500,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
>  	int outside_reserve = 0;
>  	long chg;
>  	bool use_reserve = false;
> +	int ret = 0;
>  	unsigned long mmun_start;	/* For mmu_notifiers */
>  	unsigned long mmun_end;		/* For mmu_notifiers */
>
> @@ -2524,10 +2525,8 @@ retry_avoidcopy:
>  	 * anon_vma prepared.
>  	 */
>  	if (unlikely(anon_vma_prepare(vma))) {
> -		page_cache_release(old_page);
> -		/* Caller expects lock to be held */
> -		spin_lock(&mm->page_table_lock);
> -		return VM_FAULT_OOM;
> +		ret = VM_FAULT_OOM;
> +		goto out_old_page;
>  	}
>
>  	/*
> @@ -2546,11 +2545,8 @@ retry_avoidcopy:
>  	if (!outside_reserve) {
>  		chg = vma_needs_reservation(h, vma, address);
>  		if (chg == -ENOMEM) {
> -			page_cache_release(old_page);
> -
> -			/* Caller expects lock to be held */
> -			spin_lock(&mm->page_table_lock);
> -			return VM_FAULT_OOM;
> +			ret = VM_FAULT_OOM;
> +			goto out_old_page;
>  		}
>  		use_reserve = !chg;
>  	}
> @@ -2584,9 +2580,8 @@ retry_avoidcopy:
>  			WARN_ON_ONCE(1);
>  		}
>
> -		/* Caller expects lock to be held */
> -		spin_lock(&mm->page_table_lock);
> -		return VM_FAULT_SIGBUS;
> +		ret = VM_FAULT_SIGBUS;
> +		goto out_lock;
>  	}
>
>  	copy_user_huge_page(new_page, old_page, address, vma,
> @@ -2617,11 +2612,12 @@ retry_avoidcopy:
>  	spin_unlock(&mm->page_table_lock);
>  	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
>  	page_cache_release(new_page);
> +out_old_page:
>  	page_cache_release(old_page);
> -
> +out_lock:
>  	/* Caller expects lock to be held */
>  	spin_lock(&mm->page_table_lock);
> -	return 0;
> +	return ret;
>  }
>
>  /* Return the pagecache page at a given address within a VMA */
> -- 
> 1.7.9.5


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH v2 03/20] mm, hugetlb: fix subpool accounting handling
  2013-08-26 13:01             ` Aneesh Kumar K.V
@ 2013-08-27  7:40               ` Joonsoo Kim
  -1 siblings, 0 replies; 139+ messages in thread
From: Joonsoo Kim @ 2013-08-27  7:40 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Andrew Morton, Rik van Riel, Mel Gorman, Michal Hocko,
	KAMEZAWA Hiroyuki, Hugh Dickins, Davidlohr Bueso, David Gibson,
	linux-mm, linux-kernel, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton

On Mon, Aug 26, 2013 at 06:31:35PM +0530, Aneesh Kumar K.V wrote:
> Joonsoo Kim <iamjoonsoo.kim@lge.com> writes:
> 
> > On Thu, Aug 22, 2013 at 12:38:12PM +0530, Aneesh Kumar K.V wrote:
> >> Joonsoo Kim <iamjoonsoo.kim@lge.com> writes:
> >> 
> >> > Hello, Aneesh.
> >> >
> >> > First of all, thank you for review!
> >> >
> >> > On Wed, Aug 21, 2013 at 02:58:20PM +0530, Aneesh Kumar K.V wrote:
> >> >> Joonsoo Kim <iamjoonsoo.kim@lge.com> writes:
> >> >> 
> >> >> > If we alloc a hugepage with avoid_reserve, we don't dequeue a reserved one.
> >> >> > So, we should check the subpool counter when avoid_reserve is set.
> >> >> > This patch implements it.
> >> >> 
> >> >> Can you explain this better? i.e., if we don't have a reservation in the
> >> >> area, chg != 0. So why look at avoid_reserve?
> >> >
> >> > We don't consider avoid_reserve when chg != 0.
> >> > Look at following code.
> >> >
> >> > +       if (chg || avoid_reserve)
> >> > +               if (hugepage_subpool_get_pages(spool, 1))
> >> >
> >> > It means that if chg != 0, we skip the avoid_reserve check.
> >> 
> >> when would avoid_reserve == 1 and chg == 0 ?
> >
> > In this case, we should do hugepage_subpool_get_pages(), since we don't
> > get a reserved page due to avoid_reserve.
> 
> As per the off-list discussion we had around this, please add additional
> information to the commit message explaining when we have
> avoid_reserve == 1 and chg == 0.

Okay!

> 
> Something like the below copied from call site.
> 
> 	 /* If the process that created a MAP_PRIVATE mapping is about to
> 	  * perform a COW due to a shared page count, attempt to satisfy
> 	  * the allocation without using the existing reserves
>           */
> 
> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

Thanks.
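
To make the avoid_reserve == 1 && chg == 0 case concrete, here is a hedged
sketch of the call-site condition discussed above, with the suggested
comment folded in (illustrative only; the exact wording and placement in
the final patch may differ):

	/*
	 * A process that created a MAP_PRIVATE mapping may COW a shared
	 * page and then attempts to satisfy the allocation without using
	 * the existing reserves; in that case avoid_reserve == 1 while
	 * chg == 0, and the subpool must still be charged.
	 */
	if (chg || avoid_reserve)
		if (hugepage_subpool_get_pages(spool, 1))
			return ERR_PTR(-ENOSPC);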

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH v2 13/20] mm, hugetlb: mm, hugetlb: unify chg and avoid_reserve to use_reserve
  2013-08-26 13:09     ` Aneesh Kumar K.V
@ 2013-08-27  7:57       ` Joonsoo Kim
  -1 siblings, 0 replies; 139+ messages in thread
From: Joonsoo Kim @ 2013-08-27  7:57 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Andrew Morton, Rik van Riel, Mel Gorman, Michal Hocko,
	KAMEZAWA Hiroyuki, Hugh Dickins, Davidlohr Bueso, David Gibson,
	linux-mm, linux-kernel, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton

On Mon, Aug 26, 2013 at 06:39:35PM +0530, Aneesh Kumar K.V wrote:
> Joonsoo Kim <iamjoonsoo.kim@lge.com> writes:
> 
> > Currently, we have two variables to represent whether we can use a reserved
> > page or not: chg and avoid_reserve, respectively. By aggregating these,
> > we can have cleaner code. This makes no functional difference.
> >
> > Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 22ceb04..8dff972 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -531,8 +531,7 @@ static struct page *dequeue_huge_page_node(struct hstate *h, int nid)
> >
> >  static struct page *dequeue_huge_page_vma(struct hstate *h,
> >  				struct vm_area_struct *vma,
> > -				unsigned long address, int avoid_reserve,
> > -				long chg)
> > +				unsigned long address, bool use_reserve)
> >  {
> >  	struct page *page = NULL;
> >  	struct mempolicy *mpol;
> > @@ -546,12 +545,10 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
> >  	 * A child process with MAP_PRIVATE mappings created by their parent
> >  	 * have no page reserves. This check ensures that reservations are
> >  	 * not "stolen". The child may still get SIGKILLed
> > +	 * Or, when parent process do COW, we cannot use reserved page.
> > +	 * In this case, ensure enough pages are in the pool.
> >  	 */
> > -	if (chg && h->free_huge_pages - h->resv_huge_pages == 0)
> > -		return NULL;
> 
> This hunk would be much easier if what you were changing were
> 
> 	if (!vma_has_reserves(vma) &&
> 			h->free_huge_pages - h->resv_huge_pages == 0)
> 		goto err;
> 
> i.e., !vma_has_reserves(vma) == !use_reserve.
> 
> So maybe a patch rearrangement would help? But nevertheless.

I think that the current form is better since use_reserve is not the same as
vma_has_reserves(). I changed the call site of vma_has_reserves() to use chg
in a previous patch. In this patch, use_reserve in alloc_huge_page() is
computed from chg and avoid_reserve and is passed to dequeue_huge_page_vma(),
so the change in dequeue_huge_page_vma() is trivial.
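
A hedged sketch of the relationship described above (based on the quoted
hunks; the exact code in the series may differ):

	/* in alloc_huge_page(): fold the two signals into one flag */
	use_reserve = (!chg && !avoid_reserve);

	/* in dequeue_huge_page_vma(): a request that cannot use a reserve
	 * must not eat into the reserved part of the free pool */
	if (!use_reserve && h->free_huge_pages - h->resv_huge_pages == 0)
		return NULL;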

> 
> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

Thanks!


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH v2 14/20] mm, hugetlb: call vma_needs_reservation before entering alloc_huge_page()
  2013-08-26 13:46       ` Aneesh Kumar K.V
@ 2013-08-27  7:58         ` Joonsoo Kim
  -1 siblings, 0 replies; 139+ messages in thread
From: Joonsoo Kim @ 2013-08-27  7:58 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Andrew Morton, Rik van Riel, Mel Gorman, Michal Hocko,
	KAMEZAWA Hiroyuki, Hugh Dickins, Davidlohr Bueso, David Gibson,
	linux-mm, linux-kernel, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton

> >> @@ -2504,6 +2498,8 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
> >>  	struct hstate *h = hstate_vma(vma);
> >>  	struct page *old_page, *new_page;
> >>  	int outside_reserve = 0;
> >> +	long chg;
> >> +	bool use_reserve;
> >>  	unsigned long mmun_start;	/* For mmu_notifiers */
> >>  	unsigned long mmun_end;		/* For mmu_notifiers */
> >>
> >> @@ -2535,7 +2531,17 @@ retry_avoidcopy:
> >>
> >>  	/* Drop page_table_lock as buddy allocator may be called */
> >>  	spin_unlock(&mm->page_table_lock);
> >> -	new_page = alloc_huge_page(vma, address, outside_reserve);
> >> +	chg = vma_needs_reservation(h, vma, address);
> >> +	if (chg == -ENOMEM) {
> >
> > why not 
> >
> >     if (chg < 0) ?
> >
> > Should we try to unmap the page from the child and avoid the COW here?
> > Maybe with outside_reserve = 1 we will never have vma_needs_reservation()
> > fail. Anyhow, it would be nice to document why this error case is
> > different from the alloc_huge_page() error case.
> >
> 
> I guess patch 16 addresses this. So if we do if (chg < 0) we are good
> here.

Okay! I will change it.

> 
> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

Thanks.


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH v2 19/20] mm, hugetlb: retry if failed to allocate and there is concurrent user
  2013-08-09  9:26   ` Joonsoo Kim
@ 2013-09-04  8:44     ` Joonsoo Kim
  -1 siblings, 0 replies; 139+ messages in thread
From: Joonsoo Kim @ 2013-09-04  8:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Michal Hocko, Aneesh Kumar K.V,
	KAMEZAWA Hiroyuki, Hugh Dickins, Davidlohr Bueso, David Gibson,
	linux-mm, linux-kernel, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton

On Fri, Aug 09, 2013 at 06:26:37PM +0900, Joonsoo Kim wrote:
> If parallel faults occur, we can fail to allocate a hugepage, because
> many threads dequeue a hugepage to handle a fault at the same address.
> This makes the reserved pool run short just for a little while, and it
> causes a faulting thread that is entitled to a hugepage to get a SIGBUS.
> 
> To solve this problem, we already have a nice solution, that is,
> the hugetlb_instantiation_mutex. It blocks other threads from diving into
> the fault handler. This solves the problem cleanly, but it introduces
> performance degradation, because it serializes all fault handling.
> 
> Now, I try to remove the hugetlb_instantiation_mutex to get rid of this
> performance degradation. To achieve that, we should first ensure that
> no one gets a SIGBUS if there are enough hugepages.
> 
> For this purpose, if we fail to allocate a new hugepage when there is a
> concurrent user, we return just 0 instead of VM_FAULT_SIGBUS. With this,
> these threads defer getting a SIGBUS signal until there is no
> concurrent user, and so we can ensure that no one gets a SIGBUS if there
> are enough hugepages.
> 
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> 

Hello, David.
May I ask you to review this one?
I guess that you have already thought about the various race conditions,
so I think that you are the most appropriate reviewer for this patch. :)

Thanks.


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH v2 19/20] mm, hugetlb: retry if failed to allocate and there is concurrent user
  2013-08-09  9:26   ` Joonsoo Kim
  (?)
  (?)
@ 2013-09-05  1:15   ` David Gibson
  2013-09-05  5:43       ` Joonsoo Kim
  -1 siblings, 1 reply; 139+ messages in thread
From: David Gibson @ 2013-09-05  1:15 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Rik van Riel, Mel Gorman, Michal Hocko,
	Aneesh Kumar K.V, KAMEZAWA Hiroyuki, Hugh Dickins,
	Davidlohr Bueso, linux-mm, linux-kernel, Joonsoo Kim, Wanpeng Li,
	Naoya Horiguchi, Hillf Danton

[-- Attachment #1: Type: text/plain, Size: 7680 bytes --]

On Fri, Aug 09, 2013 at 06:26:37PM +0900, Joonsoo Kim wrote:
> If parallel faults occur, we can fail to allocate a hugepage, because
> many threads dequeue a hugepage to handle a fault at the same address.
> This makes the reserved pool run short just for a little while, and it
> causes a faulting thread that is entitled to a hugepage to get a SIGBUS.
> 
> To solve this problem, we already have a nice solution, that is,
> the hugetlb_instantiation_mutex. It blocks other threads from diving into
> the fault handler. This solves the problem cleanly, but it introduces
> performance degradation, because it serializes all fault handling.
> 
> Now, I try to remove the hugetlb_instantiation_mutex to get rid of this
> performance degradation. To achieve that, we should first ensure that
> no one gets a SIGBUS if there are enough hugepages.
> 
> For this purpose, if we fail to allocate a new hugepage when there is a
> concurrent user, we return just 0 instead of VM_FAULT_SIGBUS. With this,
> these threads defer getting a SIGBUS signal until there is no
> concurrent user, and so we can ensure that no one gets a SIGBUS if there
> are enough hugepages.
> 
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index e29e28f..981c539 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -242,6 +242,7 @@ struct hstate {
>  	int next_nid_to_free;
>  	unsigned int order;
>  	unsigned long mask;
> +	unsigned long nr_dequeue_users;
>  	unsigned long max_huge_pages;
>  	unsigned long nr_huge_pages;
>  	unsigned long free_huge_pages;
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 8743e5c..0501fe5 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -561,6 +561,7 @@ retry_cpuset:
>  		if (cpuset_zone_allowed_softwall(zone, htlb_alloc_mask)) {
>  			page = dequeue_huge_page_node(h, zone_to_nid(zone));
>  			if (page) {
> +				h->nr_dequeue_users++;

So, nr_dequeue_users doesn't seem to be incremented in the
alloc_huge_page_node() path.  I'm not sure exactly where that's used,
so I'm not sure if it's a problem.

>  				if (!use_reserve)
>  					break;
>  
> @@ -577,6 +578,16 @@ retry_cpuset:
>  	return page;
>  }
>  
> +static void commit_dequeued_huge_page(struct hstate *h, bool do_dequeue)
> +{
> +	if (!do_dequeue)
> +		return;

Seems like it would be easier to do this test in the callers, but I
doubt it matters much.

> +	spin_lock(&hugetlb_lock);
> +	h->nr_dequeue_users--;
> +	spin_unlock(&hugetlb_lock);
> +}
> +
>  static void update_and_free_page(struct hstate *h, struct page *page)
>  {
>  	int i;
> @@ -1110,7 +1121,9 @@ static void vma_commit_reservation(struct hstate *h,
>  }
>  
>  static struct page *alloc_huge_page(struct vm_area_struct *vma,
> -				    unsigned long addr, int use_reserve)
> +				    unsigned long addr, int use_reserve,
> +				    unsigned long *nr_dequeue_users,
> +				    bool *do_dequeue)
>  {
>  	struct hugepage_subpool *spool = subpool_vma(vma);
>  	struct hstate *h = hstate_vma(vma);
> @@ -1138,8 +1151,11 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
>  		return ERR_PTR(-ENOSPC);
>  	}
>  	spin_lock(&hugetlb_lock);
> +	*do_dequeue = true;
>  	page = dequeue_huge_page_vma(h, vma, addr, use_reserve);
>  	if (!page) {
> +		*nr_dequeue_users = h->nr_dequeue_users;

So, the nr_dequeue_users parameter is only initialized if !page here.
It's not obvious to me that the callers only use it in that case.
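
One way to make that contract explicit (a sketch only, not part of the
posted patch) is to give both out-parameters a defined value on every path
before attempting the dequeue:

	*nr_dequeue_users = 0;
	*do_dequeue = false;
	spin_lock(&hugetlb_lock);
	page = dequeue_huge_page_vma(h, vma, addr, use_reserve);
	if (page)
		*do_dequeue = true;
	else
		*nr_dequeue_users = h->nr_dequeue_users;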

> +		*do_dequeue = false;
>  		spin_unlock(&hugetlb_lock);
>  		page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
>  		if (!page) {

I think the counter also needs to be incremented in the case where we
call alloc_buddy_huge_page() from alloc_huge_page().  Even though it's
new, it gets added to the hugepage pool at this point and could still
be a contended page for the last allocation, unless I'm missing
something.
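
A hedged sketch of what that might look like on top of the quoted hunk
(the identifiers follow the patch, the placement is an assumption):

	page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
	if (page) {
		/* The fresh page joins the pool and may be exactly the
		 * page that racing faults are contending for, so count
		 * this thread as a dequeue user as well. */
		spin_lock(&hugetlb_lock);
		h->nr_dequeue_users++;
		*do_dequeue = true;
		spin_unlock(&hugetlb_lock);
	}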

> @@ -1894,6 +1910,7 @@ void __init hugetlb_add_hstate(unsigned order)
>  	h->mask = ~((1ULL << (order + PAGE_SHIFT)) - 1);
>  	h->nr_huge_pages = 0;
>  	h->free_huge_pages = 0;
> +	h->nr_dequeue_users = 0;
>  	for (i = 0; i < MAX_NUMNODES; ++i)
>  		INIT_LIST_HEAD(&h->hugepage_freelists[i]);
>  	INIT_LIST_HEAD(&h->hugepage_activelist);
> @@ -2500,6 +2517,8 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
>  	int outside_reserve = 0;
>  	long chg;
>  	bool use_reserve = false;
> +	unsigned long nr_dequeue_users = 0;
> +	bool do_dequeue = false;
>  	int ret = 0;
>  	unsigned long mmun_start;	/* For mmu_notifiers */
>  	unsigned long mmun_end;		/* For mmu_notifiers */
> @@ -2551,11 +2570,17 @@ retry_avoidcopy:
>  		use_reserve = !chg;
>  	}
>  
> -	new_page = alloc_huge_page(vma, address, use_reserve);
> +	new_page = alloc_huge_page(vma, address, use_reserve,
> +						&nr_dequeue_users, &do_dequeue);
>  
>  	if (IS_ERR(new_page)) {
>  		page_cache_release(old_page);
>  
> +		if (nr_dequeue_users) {
> +			ret = 0;
> +			goto out_lock;
> +		}
> +
>  		/*
>  		 * If a process owning a MAP_PRIVATE mapping fails to COW,
>  		 * it is due to references held by a child and an insufficient
> @@ -2580,6 +2605,9 @@ retry_avoidcopy:
>  			WARN_ON_ONCE(1);
>  		}
>  
> +		if (use_reserve)
> +			WARN_ON_ONCE(1);
> +
>  		ret = VM_FAULT_SIGBUS;
>  		goto out_lock;
>  	}
> @@ -2614,6 +2642,7 @@ retry_avoidcopy:
>  	page_cache_release(new_page);
>  out_old_page:
>  	page_cache_release(old_page);
> +	commit_dequeued_huge_page(h, do_dequeue);
>  out_lock:
>  	/* Caller expects lock to be held */
>  	spin_lock(&mm->page_table_lock);
> @@ -2666,6 +2695,8 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  	pte_t new_pte;
>  	long chg;
>  	bool use_reserve;
> +	unsigned long nr_dequeue_users = 0;
> +	bool do_dequeue = false;
>  
>  	/*
>  	 * Currently, we are forced to kill the process in the event the
> @@ -2699,9 +2730,17 @@ retry:
>  		}
>  		use_reserve = !chg;
>  
> -		page = alloc_huge_page(vma, address, use_reserve);
> +		page = alloc_huge_page(vma, address, use_reserve,
> +					&nr_dequeue_users, &do_dequeue);
>  		if (IS_ERR(page)) {
> -			ret = VM_FAULT_SIGBUS;
> +			if (nr_dequeue_users)
> +				ret = 0;
> +			else {
> +				if (use_reserve)
> +					WARN_ON_ONCE(1);
> +
> +				ret = VM_FAULT_SIGBUS;
> +			}
>  			goto out;
>  		}
>  		clear_huge_page(page, address, pages_per_huge_page(h));
> @@ -2714,22 +2753,24 @@ retry:
>  			err = add_to_page_cache(page, mapping, idx, GFP_KERNEL);
>  			if (err) {
>  				put_page(page);
> +				commit_dequeued_huge_page(h, do_dequeue);
>  				if (err == -EEXIST)
>  					goto retry;
>  				goto out;
>  			}
>  			ClearPagePrivate(page);
> +			commit_dequeued_huge_page(h, do_dequeue);
>  
>  			spin_lock(&inode->i_lock);
>  			inode->i_blocks += blocks_per_huge_page(h);
>  			spin_unlock(&inode->i_lock);
>  		} else {
>  			lock_page(page);
> +			anon_rmap = 1;
>  			if (unlikely(anon_vma_prepare(vma))) {
>  				ret = VM_FAULT_OOM;
>  				goto backout_unlocked;
>  			}
> -			anon_rmap = 1;
>  		}
>  	} else {
>  		/*
> @@ -2783,6 +2824,8 @@ retry:
>  	spin_unlock(&mm->page_table_lock);
>  	unlock_page(page);
>  out:
> +	if (anon_rmap)
> +		commit_dequeued_huge_page(h, do_dequeue);
>  	return ret;
>  
>  backout:

Otherwise I think it looks good.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH v2 19/20] mm, hugetlb: retry if failed to allocate and there is concurrent user
  2013-09-04  8:44     ` Joonsoo Kim
  (?)
@ 2013-09-05  1:16     ` David Gibson
  -1 siblings, 0 replies; 139+ messages in thread
From: David Gibson @ 2013-09-05  1:16 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Rik van Riel, Mel Gorman, Michal Hocko,
	Aneesh Kumar K.V, KAMEZAWA Hiroyuki, Hugh Dickins,
	Davidlohr Bueso, linux-mm, linux-kernel, Wanpeng Li,
	Naoya Horiguchi, Hillf Danton

[-- Attachment #1: Type: text/plain, Size: 1808 bytes --]

On Wed, Sep 04, 2013 at 05:44:30PM +0900, Joonsoo Kim wrote:
> On Fri, Aug 09, 2013 at 06:26:37PM +0900, Joonsoo Kim wrote:
> > If parallel faults occur, we can fail to allocate a hugepage, because
> > many threads dequeue a hugepage to handle a fault at the same address.
> > This makes the reserved pool run short just for a little while, and it
> > causes a faulting thread that is entitled to a hugepage to get a SIGBUS.
> > 
> > To solve this problem, we already have a nice solution, that is,
> > the hugetlb_instantiation_mutex. It blocks other threads from diving into
> > the fault handler. This solves the problem cleanly, but it introduces
> > performance degradation, because it serializes all fault handling.
> > 
> > Now, I try to remove the hugetlb_instantiation_mutex to get rid of this
> > performance degradation. To achieve that, we should first ensure that
> > no one gets a SIGBUS if there are enough hugepages.
> > 
> > For this purpose, if we fail to allocate a new hugepage when there is a
> > concurrent user, we return just 0 instead of VM_FAULT_SIGBUS. With this,
> > these threads defer getting a SIGBUS signal until there is no
> > concurrent user, and so we can ensure that no one gets a SIGBUS if there
> > are enough hugepages.
> > 
> > Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> > 
> 
> Hello, David.
> May I ask you to review this one?
> I guess that you have already thought about the various race conditions,
> so I think that you are the most appropriate reviewer for this patch. :)

Yeah, sorry, I meant to get to it but kept forgetting.  I've sent a
review now.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH v2 19/20] mm, hugetlb: retry if failed to allocate and there is concurrent user
  2013-09-05  1:15   ` David Gibson
@ 2013-09-05  5:43       ` Joonsoo Kim
  0 siblings, 0 replies; 139+ messages in thread
From: Joonsoo Kim @ 2013-09-05  5:43 UTC (permalink / raw)
  To: David Gibson
  Cc: Andrew Morton, Rik van Riel, Mel Gorman, Michal Hocko,
	Aneesh Kumar K.V, KAMEZAWA Hiroyuki, Hugh Dickins,
	Davidlohr Bueso, linux-mm, linux-kernel, Wanpeng Li,
	Naoya Horiguchi, Hillf Danton

Hello, David.

First of all, thanks for review!

On Thu, Sep 05, 2013 at 11:15:53AM +1000, David Gibson wrote:
> On Fri, Aug 09, 2013 at 06:26:37PM +0900, Joonsoo Kim wrote:
> > If parallel faults occur, we can fail to allocate a hugepage, because
> > many threads dequeue a hugepage to handle a fault at the same address.
> > This makes the reserved pool run short just for a little while, and it
> > causes a faulting thread that is entitled to a hugepage to get a SIGBUS.
> > 
> > To solve this problem, we already have a nice solution, that is,
> > the hugetlb_instantiation_mutex. It blocks other threads from diving into
> > the fault handler. This solves the problem cleanly, but it introduces
> > performance degradation, because it serializes all fault handling.
> > 
> > Now, I try to remove the hugetlb_instantiation_mutex to get rid of this
> > performance degradation. To achieve that, we should first ensure that
> > no one gets a SIGBUS if there are enough hugepages.
> > 
> > For this purpose, if we fail to allocate a new hugepage when there is a
> > concurrent user, we return just 0 instead of VM_FAULT_SIGBUS. With this,
> > these threads defer getting a SIGBUS signal until there is no
> > concurrent user, and so we can ensure that no one gets a SIGBUS if there
> > are enough hugepages.
> > 
> > Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> > 
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index e29e28f..981c539 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -242,6 +242,7 @@ struct hstate {
> >  	int next_nid_to_free;
> >  	unsigned int order;
> >  	unsigned long mask;
> > +	unsigned long nr_dequeue_users;
> >  	unsigned long max_huge_pages;
> >  	unsigned long nr_huge_pages;
> >  	unsigned long free_huge_pages;
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 8743e5c..0501fe5 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -561,6 +561,7 @@ retry_cpuset:
> >  		if (cpuset_zone_allowed_softwall(zone, htlb_alloc_mask)) {
> >  			page = dequeue_huge_page_node(h, zone_to_nid(zone));
> >  			if (page) {
> > +				h->nr_dequeue_users++;
> 
> So, nr_dequeue_users doesn't seem to be incremented in the
> alloc_huge_page_node() path.  I'm not sure exactly where that's used,
> so I'm not sure if it's a problem.
> 

Hmm.. I don't think it is a problem. The point is that we want to avoid
the race that kills legitimate users of hugepages through resource
exhaustion. This allocation doesn't harm the legitimate users.

> >  				if (!use_reserve)
> >  					break;
> >  
> > @@ -577,6 +578,16 @@ retry_cpuset:
> >  	return page;
> >  }
> >  
> > +static void commit_dequeued_huge_page(struct hstate *h, bool do_dequeue)
> > +{
> > +	if (!do_dequeue)
> > +		return;
> 
> Seems like it would be easier to do this test in the callers, but I
> doubt it matters much.

Yes, I will fix it.

> 
> > +	spin_lock(&hugetlb_lock);
> > +	h->nr_dequeue_users--;
> > +	spin_unlock(&hugetlb_lock);
> > +}
> > +
> >  static void update_and_free_page(struct hstate *h, struct page *page)
> >  {
> >  	int i;
> > @@ -1110,7 +1121,9 @@ static void vma_commit_reservation(struct hstate *h,
> >  }
> >  
> >  static struct page *alloc_huge_page(struct vm_area_struct *vma,
> > -				    unsigned long addr, int use_reserve)
> > +				    unsigned long addr, int use_reserve,
> > +				    unsigned long *nr_dequeue_users,
> > +				    bool *do_dequeue)
> >  {
> >  	struct hugepage_subpool *spool = subpool_vma(vma);
> >  	struct hstate *h = hstate_vma(vma);
> > @@ -1138,8 +1151,11 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
> >  		return ERR_PTR(-ENOSPC);
> >  	}
> >  	spin_lock(&hugetlb_lock);
> > +	*do_dequeue = true;
> >  	page = dequeue_huge_page_vma(h, vma, addr, use_reserve);
> >  	if (!page) {
> > +		*nr_dequeue_users = h->nr_dequeue_users;
> 
> So, the nr_dequeue_users parameter is only initialized if !page here.
> > It's not obvious to me that the callers only use it in that case.

Okay. I will fix it.
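
For reference, the intended use on the caller side is roughly the
following. This is only a sketch to show the intent, not the actual
patch; the function name and the simplified signature are made up for
illustration:

        /* Sketch only: how a fault-path caller might use the snapshot. */
        static int hugetlb_no_page_sketch(struct vm_area_struct *vma,
                                          unsigned long address,
                                          int use_reserve)
        {
                unsigned long nr_dequeue_users = 0;
                bool do_dequeue = false;
                struct page *page;

                page = alloc_huge_page(vma, address, use_reserve,
                                       &nr_dequeue_users, &do_dequeue);
                if (IS_ERR(page)) {
                        /*
                         * Another thread held a freshly dequeued hugepage
                         * when we failed.  It may return that page to the
                         * pool, so retry the fault instead of killing a
                         * legitimate user with SIGBUS.
                         */
                        if (nr_dequeue_users)
                                return 0;       /* re-fault and try again */
                        return VM_FAULT_SIGBUS;
                }

                /* ... map the new page, then drop nr_dequeue_users if we
                 * dequeued from the pool. */
                commit_dequeued_huge_page(hstate_vma(vma), do_dequeue);
                return 0;
        }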

> 
> > +		*do_dequeue = false;
> >  		spin_unlock(&hugetlb_lock);
> >  		page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
> >  		if (!page) {
> 
> I think the counter also needs to be incremented in the case where we
> call alloc_buddy_huge_page() from alloc_huge_page().  Even though it's
> new, it gets added to the hugepage pool at this point and could still
> be a contended page for the last allocation, unless I'm missing
> something.

Your comment has a reasonable point, but I have a different opinion.

As I already mentioned, the point is that we want to avoid the race
that kills legitimate users of hugepages through resource exhaustion.
I increase 'h->nr_dequeue_users' when a hugepage allocated by the
administrator is dequeued, because the hugepages I want to protect from
the race are the ones allocated by the administrator via a kernel
parameter or the /proc interface. The administrator may already know how
many hugepages the application needs and may set nr_hugepages to a
reasonable value accordingly. I want to guarantee that these hugepages
can be used by that application without any race, since the administrator
assumes the application will work fine with these hugepages.

Protecting hugepages returned from alloc_buddy_huge_page() from the race
is a different matter to me. Although such a page will be added to the
hugepage pool, that doesn't make any particular application's success
more likely. If an application's success depends on winning the race for
this new hugepage, its death due to the race doesn't matter, since nobody
assumed it would work fine in the first place.


[snip..]

> Otherwise I think it looks good.

Really thanks! :)


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH v2 19/20] mm, hugetlb: retry if failed to allocate and there is concurrent user
  2013-09-05  5:43       ` Joonsoo Kim
  (?)
@ 2013-09-16 12:09       ` David Gibson
  2013-09-30  7:47           ` Joonsoo Kim
  -1 siblings, 1 reply; 139+ messages in thread
From: David Gibson @ 2013-09-16 12:09 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Rik van Riel, Mel Gorman, Michal Hocko,
	Aneesh Kumar K.V, KAMEZAWA Hiroyuki, Hugh Dickins,
	Davidlohr Bueso, linux-mm, linux-kernel, Wanpeng Li,
	Naoya Horiguchi, Hillf Danton

[-- Attachment #1: Type: text/plain, Size: 6743 bytes --]

On Thu, Sep 05, 2013 at 02:43:57PM +0900, Joonsoo Kim wrote:
> Hello, David.
> 
> First of all, thanks for review!
> 
> On Thu, Sep 05, 2013 at 11:15:53AM +1000, David Gibson wrote:
> > On Fri, Aug 09, 2013 at 06:26:37PM +0900, Joonsoo Kim wrote:
> > > If parallel fault occur, we can fail to allocate a hugepage,
> > > because many threads dequeue a hugepage to handle a fault of same address.
> > > This makes reserved pool shortage just for a little while and this cause
> > > faulting thread who can get hugepages to get a SIGBUS signal.
> > > 
> > > To solve this problem, we already have a nice solution, that is,
> > > a hugetlb_instantiation_mutex. This blocks other threads to dive into
> > > a fault handler. This solve the problem clearly, but it introduce
> > > performance degradation, because it serialize all fault handling.
> > > 
> > > Now, I try to remove a hugetlb_instantiation_mutex to get rid of
> > > performance degradation. For achieving it, at first, we should ensure that
> > > no one get a SIGBUS if there are enough hugepages.
> > > 
> > > For this purpose, if we fail to allocate a new hugepage when there is
> > > concurrent user, we return just 0, instead of VM_FAULT_SIGBUS. With this,
> > > these threads defer to get a SIGBUS signal until there is no
> > > concurrent user, and so, we can ensure that no one get a SIGBUS if there
> > > are enough hugepages.
> > > 
> > > Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> > > 
> > > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > > index e29e28f..981c539 100644
> > > --- a/include/linux/hugetlb.h
> > > +++ b/include/linux/hugetlb.h
> > > @@ -242,6 +242,7 @@ struct hstate {
> > >  	int next_nid_to_free;
> > >  	unsigned int order;
> > >  	unsigned long mask;
> > > +	unsigned long nr_dequeue_users;
> > >  	unsigned long max_huge_pages;
> > >  	unsigned long nr_huge_pages;
> > >  	unsigned long free_huge_pages;
> > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > index 8743e5c..0501fe5 100644
> > > --- a/mm/hugetlb.c
> > > +++ b/mm/hugetlb.c
> > > @@ -561,6 +561,7 @@ retry_cpuset:
> > >  		if (cpuset_zone_allowed_softwall(zone, htlb_alloc_mask)) {
> > >  			page = dequeue_huge_page_node(h, zone_to_nid(zone));
> > >  			if (page) {
> > > +				h->nr_dequeue_users++;
> > 
> > So, nr_dequeue_users doesn't seem to be incremented in the
> > alloc_huge_page_node() path.  I'm not sure exactly where that's used,
> > so I'm not sure if it's a problem.
> > 
> 
> Hmm.. I think that it isn't a problem. The point is that we want to avoid
> the race which kill the legitimate users of hugepages by out of resources.
> This allocation doesn't harm to the legitimate users.

Well, my point is just that since whatever callers there are to this
function are external, they need to be checked to see if they can
participate in this race.

> 
> > >  				if (!use_reserve)
> > >  					break;
> > >  
> > > @@ -577,6 +578,16 @@ retry_cpuset:
> > >  	return page;
> > >  }
> > >  
> > > +static void commit_dequeued_huge_page(struct hstate *h, bool do_dequeue)
> > > +{
> > > +	if (!do_dequeue)
> > > +		return;
> > 
> > Seems like it would be easier to do this test in the callers, but I
> > doubt it matters much.
> 
> Yes, I will fix it.
> 
> > 
> > > +	spin_lock(&hugetlb_lock);
> > > +	h->nr_dequeue_users--;
> > > +	spin_unlock(&hugetlb_lock);
> > > +}
> > > +
> > >  static void update_and_free_page(struct hstate *h, struct page *page)
> > >  {
> > >  	int i;
> > > @@ -1110,7 +1121,9 @@ static void vma_commit_reservation(struct hstate *h,
> > >  }
> > >  
> > >  static struct page *alloc_huge_page(struct vm_area_struct *vma,
> > > -				    unsigned long addr, int use_reserve)
> > > +				    unsigned long addr, int use_reserve,
> > > +				    unsigned long *nr_dequeue_users,
> > > +				    bool *do_dequeue)
> > >  {
> > >  	struct hugepage_subpool *spool = subpool_vma(vma);
> > >  	struct hstate *h = hstate_vma(vma);
> > > @@ -1138,8 +1151,11 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
> > >  		return ERR_PTR(-ENOSPC);
> > >  	}
> > >  	spin_lock(&hugetlb_lock);
> > > +	*do_dequeue = true;
> > >  	page = dequeue_huge_page_vma(h, vma, addr, use_reserve);
> > >  	if (!page) {
> > > +		*nr_dequeue_users = h->nr_dequeue_users;
> > 
> > So, the nr_dequeue_users parameter is only initialized if !page here.
> > It's not obvious to me that the callers only use it in that case.
> 
> Okay. I will fix it.
> 
> > 
> > > +		*do_dequeue = false;
> > >  		spin_unlock(&hugetlb_lock);
> > >  		page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
> > >  		if (!page) {
> > 
> > I think the counter also needs to be incremented in the case where we
> > call alloc_buddy_huge_page() from alloc_huge_page().  Even though it's
> > new, it gets added to the hugepage pool at this point and could still
> > be a contended page for the last allocation, unless I'm missing
> > something.
> 
> Your comment has reasonable point to me, but I have a different opinion.
> 
> As I already mentioned, the point is that we want to avoid the race
> which kill the legitimate users of hugepages by out of resources.
> I increase 'h->nr_dequeue_users' when the hugepage allocated by
> administrator is dequeued. It is because what the hugepage I want to
> protect from the race is the one allocated by administrator via
> kernel param or /proc interface. Administrator may already know how many
> hugepages are needed for their application so that he may set nr_hugepage
> to reasonable value. I want to guarantee that these hugepages can be used
> for his application without any race, since he assume that the application
> would work fine with these hugepages.
> 
> To protect hugepages returned from alloc_buddy_huge_page() from the race
> is different for me. Although it will be added to the hugepage pool, this
> doesn't guarantee certain application's success more. If certain
> application's success depends on the race of this new hugepage, it's death
> by the race doesn't matter, since nobody assume that it works fine.

Hrm.  I still think this path should be included, although I'll agree
that failing in this case is less bad.

However, it can still lead to a situation where, with two processes or
threads faulting on exactly the same shared page, one succeeds and the
other fails.  That's strange behaviour, and I think we want to avoid it
in this case too.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH v2 19/20] mm, hugetlb: retry if failed to allocate and there is concurrent user
  2013-09-16 12:09       ` David Gibson
@ 2013-09-30  7:47           ` Joonsoo Kim
  0 siblings, 0 replies; 139+ messages in thread
From: Joonsoo Kim @ 2013-09-30  7:47 UTC (permalink / raw)
  To: David Gibson
  Cc: Andrew Morton, Rik van Riel, Mel Gorman, Michal Hocko,
	Aneesh Kumar K.V, KAMEZAWA Hiroyuki, Hugh Dickins,
	Davidlohr Bueso, linux-mm, linux-kernel, Wanpeng Li,
	Naoya Horiguchi, Hillf Danton

On Mon, Sep 16, 2013 at 10:09:09PM +1000, David Gibson wrote:
> > > 
> > > > +		*do_dequeue = false;
> > > >  		spin_unlock(&hugetlb_lock);
> > > >  		page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
> > > >  		if (!page) {
> > > 
> > > I think the counter also needs to be incremented in the case where we
> > > call alloc_buddy_huge_page() from alloc_huge_page().  Even though it's
> > > new, it gets added to the hugepage pool at this point and could still
> > > be a contended page for the last allocation, unless I'm missing
> > > something.
> > 
> > Your comment has reasonable point to me, but I have a different opinion.
> > 
> > As I already mentioned, the point is that we want to avoid the race
> > which kill the legitimate users of hugepages by out of resources.
> > I increase 'h->nr_dequeue_users' when the hugepage allocated by
> > administrator is dequeued. It is because what the hugepage I want to
> > protect from the race is the one allocated by administrator via
> > kernel param or /proc interface. Administrator may already know how many
> > hugepages are needed for their application so that he may set nr_hugepage
> > to reasonable value. I want to guarantee that these hugepages can be used
> > for his application without any race, since he assume that the application
> > would work fine with these hugepages.
> > 
> > To protect hugepages returned from alloc_buddy_huge_page() from the race
> > is different for me. Although it will be added to the hugepage pool, this
> > doesn't guarantee certain application's success more. If certain
> > application's success depends on the race of this new hugepage, it's death
> > by the race doesn't matter, since nobody assume that it works fine.
> 
> Hrm.  I still think this path should be included.  Although I'll agree
> that failing in this case is less bad.
> 
> However, it can still lead to a situation where with two processes or
> threads, faulting on exactly the same shared page we have one succeed
> and the other fail.  That's a strange behaviour and I think we want to
> avoid it in this case too.

Hello, David.

I don't think it is strange behaviour. A similar situation can occur even
when we use the mutex: hugepage allocation can fail for the first process
while the second process is blocked on the mutex. The second process then
enters the fault handler, and at that point it can succeed. So the result
is that one succeeds and the other fails.

It is slightly different from the case you mentioned, but I think the
effect for the user is the same. We cannot avoid this kind of race
completely, and I think that avoiding the race for the
administrator-managed hugepage pool is good enough.
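
For illustration only (the exact interleaving, and the reason the second
process later succeeds, are made up for this example), one way this can
play out even under the existing mutex:

        process A                           process B
        ---------                           ---------
        take hugetlb_instantiation_mutex    blocked on the mutex
        dequeue fails (pool exhausted)
        get VM_FAULT_SIGBUS, release mutex
                                            take the mutex
                                            dequeue succeeds (e.g. a hugepage
                                            was freed in the meantime)
                                            handle the fault, release mutex

So even with full serialization, one faulting process can get a SIGBUS
while a later one succeeds on the same range.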

Thanks.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH v2 19/20] mm, hugetlb: retry if failed to allocate and there is concurrent user
  2013-09-30  7:47           ` Joonsoo Kim
@ 2013-12-09 16:36             ` Davidlohr Bueso
  -1 siblings, 0 replies; 139+ messages in thread
From: Davidlohr Bueso @ 2013-12-09 16:36 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: David Gibson, Andrew Morton, Rik van Riel, Mel Gorman,
	Michal Hocko, Aneesh Kumar K.V, KAMEZAWA Hiroyuki, Hugh Dickins,
	Davidlohr Bueso, linux-mm, linux-kernel, Wanpeng Li,
	Naoya Horiguchi, Hillf Danton

On Mon, 2013-09-30 at 16:47 +0900, Joonsoo Kim wrote:
> On Mon, Sep 16, 2013 at 10:09:09PM +1000, David Gibson wrote:
> > > > 
> > > > > +		*do_dequeue = false;
> > > > >  		spin_unlock(&hugetlb_lock);
> > > > >  		page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
> > > > >  		if (!page) {
> > > > 
> > > > I think the counter also needs to be incremented in the case where we
> > > > call alloc_buddy_huge_page() from alloc_huge_page().  Even though it's
> > > > new, it gets added to the hugepage pool at this point and could still
> > > > be a contended page for the last allocation, unless I'm missing
> > > > something.
> > > 
> > > Your comment has reasonable point to me, but I have a different opinion.
> > > 
> > > As I already mentioned, the point is that we want to avoid the race
> > > which kill the legitimate users of hugepages by out of resources.
> > > I increase 'h->nr_dequeue_users' when the hugepage allocated by
> > > administrator is dequeued. It is because what the hugepage I want to
> > > protect from the race is the one allocated by administrator via
> > > kernel param or /proc interface. Administrator may already know how many
> > > hugepages are needed for their application so that he may set nr_hugepage
> > > to reasonable value. I want to guarantee that these hugepages can be used
> > > for his application without any race, since he assume that the application
> > > would work fine with these hugepages.
> > > 
> > > To protect hugepages returned from alloc_buddy_huge_page() from the race
> > > is different for me. Although it will be added to the hugepage pool, this
> > > doesn't guarantee certain application's success more. If certain
> > > application's success depends on the race of this new hugepage, it's death
> > > by the race doesn't matter, since nobody assume that it works fine.
> > 
> > Hrm.  I still think this path should be included.  Although I'll agree
> > that failing in this case is less bad.
> > 
> > However, it can still lead to a situation where with two processes or
> > threads, faulting on exactly the same shared page we have one succeed
> > and the other fail.  That's a strange behaviour and I think we want to
> > avoid it in this case too.
> 
> Hello, David.
> 
> I don't think it is a strange behaviour. Similar situation can occur
> even though we use the mutex. Hugepage allocation can be failed when
> the first process try to allocate the hugepage while second process is blocked
> by the mutex. And then, second process will go into the fault handler. And
> at this time, it can succeed. So result is that we have one succeed and
> the other fail.
> 
> It is slightly different from the case you mentioned, but I think that
> effect for user is same. We cannot avoid this kind of race completely and
> I think that avoiding the race for administrator managed hugepage pool is
> good enough to use.

What was the final decision on this issue? Is Joonsoo's approach to
removing this mutex viable, or are we stuck with it?

Thanks,
Davidlohr


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH v2 19/20] mm, hugetlb: retry if failed to allocate and there is concurrent user
  2013-12-09 16:36             ` Davidlohr Bueso
@ 2013-12-10  8:32               ` Joonsoo Kim
  -1 siblings, 0 replies; 139+ messages in thread
From: Joonsoo Kim @ 2013-12-10  8:32 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: David Gibson, Andrew Morton, Rik van Riel, Mel Gorman,
	Michal Hocko, Aneesh Kumar K.V, KAMEZAWA Hiroyuki, Hugh Dickins,
	Davidlohr Bueso, linux-mm, linux-kernel, Wanpeng Li,
	Naoya Horiguchi, Hillf Danton

On Mon, Dec 09, 2013 at 08:36:23AM -0800, Davidlohr Bueso wrote:
> On Mon, 2013-09-30 at 16:47 +0900, Joonsoo Kim wrote:
> > On Mon, Sep 16, 2013 at 10:09:09PM +1000, David Gibson wrote:
> > > > > 
> > > > > > +		*do_dequeue = false;
> > > > > >  		spin_unlock(&hugetlb_lock);
> > > > > >  		page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
> > > > > >  		if (!page) {
> > > > > 
> > > > > I think the counter also needs to be incremented in the case where we
> > > > > call alloc_buddy_huge_page() from alloc_huge_page().  Even though it's
> > > > > new, it gets added to the hugepage pool at this point and could still
> > > > > be a contended page for the last allocation, unless I'm missing
> > > > > something.
> > > > 
> > > > Your comment has reasonable point to me, but I have a different opinion.
> > > > 
> > > > As I already mentioned, the point is that we want to avoid the race
> > > > which kill the legitimate users of hugepages by out of resources.
> > > > I increase 'h->nr_dequeue_users' when the hugepage allocated by
> > > > administrator is dequeued. It is because what the hugepage I want to
> > > > protect from the race is the one allocated by administrator via
> > > > kernel param or /proc interface. Administrator may already know how many
> > > > hugepages are needed for their application so that he may set nr_hugepage
> > > > to reasonable value. I want to guarantee that these hugepages can be used
> > > > for his application without any race, since he assume that the application
> > > > would work fine with these hugepages.
> > > > 
> > > > To protect hugepages returned from alloc_buddy_huge_page() from the race
> > > > is different for me. Although it will be added to the hugepage pool, this
> > > > doesn't guarantee certain application's success more. If certain
> > > > application's success depends on the race of this new hugepage, it's death
> > > > by the race doesn't matter, since nobody assume that it works fine.
> > > 
> > > Hrm.  I still think this path should be included.  Although I'll agree
> > > that failing in this case is less bad.
> > > 
> > > However, it can still lead to a situation where with two processes or
> > > threads, faulting on exactly the same shared page we have one succeed
> > > and the other fail.  That's a strange behaviour and I think we want to
> > > avoid it in this case too.
> > 
> > Hello, David.
> > 
> > I don't think it is a strange behaviour. Similar situation can occur
> > even though we use the mutex. Hugepage allocation can be failed when
> > the first process try to allocate the hugepage while second process is blocked
> > by the mutex. And then, second process will go into the fault handler. And
> > at this time, it can succeed. So result is that we have one succeed and
> > the other fail.
> > 
> > It is slightly different from the case you mentioned, but I think that
> > effect for user is same. We cannot avoid this kind of race completely and
> > I think that avoiding the race for administrator managed hugepage pool is
> > good enough to use.
> 
> What was the final decision on this issue? Is Joonsoo's approach to
> removing this mutex viable, or are we stuck with it?

Hello.

After rebasing on the current kernel, I will repost it soon.

Thanks.

^ permalink raw reply	[flat|nested] 139+ messages in thread

end of thread, other threads:[~2013-12-10  8:29 UTC | newest]

Thread overview: 139+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-08-09  9:26 [PATCH v2 00/20] mm, hugetlb: remove a hugetlb_instantiation_mutex Joonsoo Kim
2013-08-09  9:26 ` Joonsoo Kim
2013-08-09  9:26 ` [PATCH v2 01/20] mm, hugetlb: protect reserved pages when soft offlining a hugepage Joonsoo Kim
2013-08-09  9:26   ` Joonsoo Kim
2013-08-12 13:20   ` Davidlohr Bueso
2013-08-12 13:20     ` Davidlohr Bueso
2013-08-09  9:26 ` [PATCH v2 02/20] mm, hugetlb: change variable name reservations to resv Joonsoo Kim
2013-08-09  9:26   ` Joonsoo Kim
2013-08-12 13:21   ` Davidlohr Bueso
2013-08-12 13:21     ` Davidlohr Bueso
2013-08-09  9:26 ` [PATCH v2 03/20] mm, hugetlb: fix subpool accounting handling Joonsoo Kim
2013-08-09  9:26   ` Joonsoo Kim
2013-08-21  9:28   ` Aneesh Kumar K.V
2013-08-21  9:28     ` Aneesh Kumar K.V
2013-08-22  6:50     ` Joonsoo Kim
2013-08-22  6:50       ` Joonsoo Kim
2013-08-22  7:08       ` Aneesh Kumar K.V
2013-08-22  7:08         ` Aneesh Kumar K.V
2013-08-22  7:47         ` Joonsoo Kim
2013-08-22  7:47           ` Joonsoo Kim
2013-08-26 13:01           ` Aneesh Kumar K.V
2013-08-26 13:01             ` Aneesh Kumar K.V
2013-08-27  7:40             ` Joonsoo Kim
2013-08-27  7:40               ` Joonsoo Kim
2013-08-09  9:26 ` [PATCH v2 04/20] mm, hugetlb: remove useless check about mapping type Joonsoo Kim
2013-08-09  9:26   ` Joonsoo Kim
2013-08-12 13:31   ` Davidlohr Bueso
2013-08-12 13:31     ` Davidlohr Bueso
2013-08-21  9:30   ` Aneesh Kumar K.V
2013-08-21  9:30     ` Aneesh Kumar K.V
2013-08-09  9:26 ` [PATCH v2 05/20] mm, hugetlb: grab a page_table_lock after page_cache_release Joonsoo Kim
2013-08-09  9:26   ` Joonsoo Kim
2013-08-12 13:35   ` Davidlohr Bueso
2013-08-12 13:35     ` Davidlohr Bueso
2013-08-21  9:31   ` Aneesh Kumar K.V
2013-08-21  9:31     ` Aneesh Kumar K.V
2013-08-09  9:26 ` [PATCH v2 06/20] mm, hugetlb: return a reserved page to a reserved pool if failed Joonsoo Kim
2013-08-09  9:26   ` Joonsoo Kim
2013-08-21  9:54   ` Aneesh Kumar K.V
2013-08-21  9:54     ` Aneesh Kumar K.V
2013-08-22  6:51     ` Joonsoo Kim
2013-08-22  6:51       ` Joonsoo Kim
2013-08-09  9:26 ` [PATCH v2 07/20] mm, hugetlb: unify region structure handling Joonsoo Kim
2013-08-09  9:26   ` Joonsoo Kim
2013-08-21  9:57   ` Aneesh Kumar K.V
2013-08-21  9:57     ` Aneesh Kumar K.V
2013-08-22  6:56     ` Joonsoo Kim
2013-08-22  6:56       ` Joonsoo Kim
2013-08-21 10:22   ` Aneesh Kumar K.V
2013-08-21 10:22     ` Aneesh Kumar K.V
2013-08-22  6:53     ` Joonsoo Kim
2013-08-22  6:53       ` Joonsoo Kim
2013-08-09  9:26 ` [PATCH v2 08/20] mm, hugetlb: region manipulation functions take resv_map rather list_head Joonsoo Kim
2013-08-09  9:26   ` Joonsoo Kim
2013-08-21  9:58   ` Aneesh Kumar K.V
2013-08-21  9:58     ` Aneesh Kumar K.V
2013-08-09  9:26 ` [PATCH v2 09/20] mm, hugetlb: protect region tracking via newly introduced resv_map lock Joonsoo Kim
2013-08-09  9:26   ` Joonsoo Kim
2013-08-12 22:03   ` Davidlohr Bueso
2013-08-12 22:03     ` Davidlohr Bueso
2013-08-13  7:45     ` Joonsoo Kim
2013-08-13  7:45       ` Joonsoo Kim
2013-08-21 10:13   ` Aneesh Kumar K.V
2013-08-21 10:13     ` Aneesh Kumar K.V
2013-08-22  6:59     ` Joonsoo Kim
2013-08-22  6:59       ` Joonsoo Kim
2013-08-09  9:26 ` [PATCH v2 10/20] mm, hugetlb: remove resv_map_put() Joonsoo Kim
2013-08-09  9:26   ` Joonsoo Kim
2013-08-21 10:49   ` Aneesh Kumar K.V
2013-08-21 10:49     ` Aneesh Kumar K.V
2013-08-22  7:24     ` Joonsoo Kim
2013-08-22  7:24       ` Joonsoo Kim
2013-08-09  9:26 ` [PATCH v2 11/20] mm, hugetlb: make vma_resv_map() works for all mapping type Joonsoo Kim
2013-08-09  9:26   ` Joonsoo Kim
2013-08-21 10:37   ` Aneesh Kumar K.V
2013-08-21 10:37     ` Aneesh Kumar K.V
2013-08-22  7:25     ` Joonsoo Kim
2013-08-22  7:25       ` Joonsoo Kim
2013-08-09  9:26 ` [PATCH v2 12/20] mm, hugetlb: remove vma_has_reserves() Joonsoo Kim
2013-08-09  9:26   ` Joonsoo Kim
2013-08-22  8:44   ` Aneesh Kumar K.V
2013-08-22  8:44     ` Aneesh Kumar K.V
2013-08-22  9:17     ` Joonsoo Kim
2013-08-22  9:17       ` Joonsoo Kim
2013-08-22 11:04       ` Aneesh Kumar K.V
2013-08-22 11:04         ` Aneesh Kumar K.V
2013-08-23  6:16         ` Joonsoo Kim
2013-08-23  6:16           ` Joonsoo Kim
2013-08-09  9:26 ` [PATCH v2 13/20] mm, hugetlb: mm, hugetlb: unify chg and avoid_reserve to use_reserve Joonsoo Kim
2013-08-09  9:26   ` Joonsoo Kim
2013-08-26 13:09   ` Aneesh Kumar K.V
2013-08-26 13:09     ` Aneesh Kumar K.V
2013-08-27  7:57     ` Joonsoo Kim
2013-08-27  7:57       ` Joonsoo Kim
2013-08-09  9:26 ` [PATCH v2 14/20] mm, hugetlb: call vma_needs_reservation before entering alloc_huge_page() Joonsoo Kim
2013-08-09  9:26   ` Joonsoo Kim
2013-08-26 13:36   ` Aneesh Kumar K.V
2013-08-26 13:36     ` Aneesh Kumar K.V
2013-08-26 13:46     ` Aneesh Kumar K.V
2013-08-26 13:46       ` Aneesh Kumar K.V
2013-08-27  7:58       ` Joonsoo Kim
2013-08-27  7:58         ` Joonsoo Kim
2013-08-09  9:26 ` [PATCH v2 15/20] mm, hugetlb: remove a check for return value of alloc_huge_page() Joonsoo Kim
2013-08-09  9:26   ` Joonsoo Kim
2013-08-26 13:38   ` Aneesh Kumar K.V
2013-08-26 13:38     ` Aneesh Kumar K.V
2013-08-09  9:26 ` [PATCH v2 16/20] mm, hugetlb: move down outside_reserve check Joonsoo Kim
2013-08-09  9:26   ` Joonsoo Kim
2013-08-26 13:44   ` Aneesh Kumar K.V
2013-08-26 13:44     ` Aneesh Kumar K.V
2013-08-09  9:26 ` [PATCH v2 17/20] mm, hugetlb: move up anon_vma_prepare() Joonsoo Kim
2013-08-09  9:26   ` Joonsoo Kim
2013-08-26 14:09   ` Aneesh Kumar K.V
2013-08-26 14:09     ` Aneesh Kumar K.V
2013-08-09  9:26 ` [PATCH v2 18/20] mm, hugetlb: clean-up error handling in hugetlb_cow() Joonsoo Kim
2013-08-09  9:26   ` Joonsoo Kim
2013-08-26 14:12   ` Aneesh Kumar K.V
2013-08-26 14:12     ` Aneesh Kumar K.V
2013-08-09  9:26 ` [PATCH v2 19/20] mm, hugetlb: retry if failed to allocate and there is concurrent user Joonsoo Kim
2013-08-09  9:26   ` Joonsoo Kim
2013-09-04  8:44   ` Joonsoo Kim
2013-09-04  8:44     ` Joonsoo Kim
2013-09-05  1:16     ` David Gibson
2013-09-05  1:15   ` David Gibson
2013-09-05  5:43     ` Joonsoo Kim
2013-09-05  5:43       ` Joonsoo Kim
2013-09-16 12:09       ` David Gibson
2013-09-30  7:47         ` Joonsoo Kim
2013-09-30  7:47           ` Joonsoo Kim
2013-12-09 16:36           ` Davidlohr Bueso
2013-12-09 16:36             ` Davidlohr Bueso
2013-12-10  8:32             ` Joonsoo Kim
2013-12-10  8:32               ` Joonsoo Kim
2013-08-09  9:26 ` [PATCH v2 20/20] mm, hugetlb: remove a hugetlb_instantiation_mutex Joonsoo Kim
2013-08-09  9:26   ` Joonsoo Kim
2013-08-14 23:22 ` [PATCH v2 00/20] " Andrew Morton
2013-08-14 23:22   ` Andrew Morton
2013-08-16 17:18   ` JoonSoo Kim
2013-08-16 17:18     ` JoonSoo Kim
