* [PATCH v3 00/14] mm, hugetlb: remove a hugetlb_instantiation_mutex
From: Joonsoo Kim @ 2013-12-18  6:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Michal Hocko, Aneesh Kumar K.V,
	KAMEZAWA Hiroyuki, Hugh Dickins, Davidlohr Bueso, David Gibson,
	linux-mm, linux-kernel, Joonsoo Kim, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton, Joonsoo Kim

* NOTE for v3
- This update is late because of other work, not because of any issue with
this patchset.

- While reviewing v2, David Gibson, who had tried to remove this mutex a
long time ago, suggested that the race between concurrent calls to
alloc_buddy_huge_page() in alloc_huge_page() should also be prevented[2],
since the *new* hugepage it returns is also a contended page for the last
allocation. I think this is unnecessary: if an application's success depends
on the *new* hugepage from alloc_buddy_huge_page() rather than on a *reserved*
page, its successful execution cannot be guaranteed in all cases anyway. So
I do not implement it. Apart from this point, there are no open issues with
this patchset.

* Changes in v3 (No big difference)
- Slightly modify the cover letter since Part 1. is already merged.
- Add Reviewed-by from "Aneesh Kumar K.V" to patches 1-12.
- Patches 1-12 and 14 are just rebased onto v3.13-rc4.
- Patch 13 is changed as follows:
	add a comment on alloc_huge_page()
	add in-flight user handling in alloc_huge_page_noerr()
	minor code position changes (Suggested by David)

* Changes in v2
- Re-order patches to clarify their relationship
- Allocate sleepable objects (kmalloc) without holding a spinlock
	(Pointed out by Hillf)
- Remove vma_has_reserves() and use vma_needs_reservation() instead
	(Suggested by Aneesh and Naoya)
- Change the way a hugepage is returned to the reserved pool
	(Suggested by Naoya)

Without the hugetlb_instantiation_mutex, if parallel faults occur, we can
fail to allocate a hugepage, because many threads dequeue a hugepage to
handle a fault on the same address. This makes the reserved pool fall short
for a little while, and the faulting thread gets a SIGBUS signal even though
there are enough hugepages.
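
To illustrate (a simplified scenario, not taken verbatim from the code),
consider two threads faulting on the same address when only one reserved
hugepage remains:

	thread A                        thread B
	--------                        --------
	fault on address X
	dequeue the last reserved page
	                                fault on address X
	                                dequeue fails: reserve pool is empty
	                                return VM_FAULT_SIGBUS
	map the page at X

Thread B gets SIGBUS even though the hugepage instantiated by thread A
would have satisfied its fault as well.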

We already have a nice solution to this problem: the
hugetlb_instantiation_mutex. It blocks other threads from diving into the
fault handler. This solves the problem cleanly, but it introduces a
performance degradation, because it serializes all fault handling.
    
Now, I try to remove the hugetlb_instantiation_mutex to get rid of the
performance problem reported by Davidlohr Bueso [1].

This patchset roughly consists of 4 parts.

Part 1. (Merged) Assorted fixes and clean-ups to enhance error handling.
	These are already merged into mainline.

Part 2. (1-3) Introduce a new protection method for the region tracking
	data structure, instead of the hugetlb_instantiation_mutex. There
	is a race condition when we map the same hugetlbfs file into two
	different processes. To prevent it, we need a new protection method
	like the one introduced in this patchset.

	This can be merged into mainline separately.

Part 3. (4-7) Clean-ups.

	IMO, these make the code really simple, so they are worth going into
	mainline separately.

Part 4. (8-14) Remove the hugetlb_instantiation_mutex.

	Most of these patches are just clean-ups of the error handling path.
	Patch 13 implements a retry approach: if the faulting thread fails
	to allocate a hugepage, it keeps running the fault handler until
	there is no concurrent thread holding a hugepage. This serializes
	the threads that want the last hugepage, so threads don't get a
	SIGBUS if enough hugepages exist. A rough sketch of this idea is
	shown below the part list.
	Patch 14 removes the hugetlb_instantiation_mutex.
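
The following is only an illustrative sketch of the retry idea in patch 13;
the helper other_users_in_flight() and the exact control flow are made up
here and do not match the actual patch:

	/*
	 * Illustrative pseudo-code only: the real logic lives in the
	 * hugetlb fault path and uses different bookkeeping.
	 */
	for (;;) {
		page = alloc_huge_page(vma, address, 0);
		if (!IS_ERR(page))
			break;				/* got a hugepage */
		if (!other_users_in_flight())		/* hypothetical check */
			return VM_FAULT_SIGBUS;		/* really out of pages */
		/*
		 * Another thread still holds a hugepage that it may return
		 * to the pool; retry instead of failing immediately.
		 */
	}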

These patches are based on v3.13-rc4.

With these applied, I cleanly passed the libhugetlbfs test suite, which
includes allocation-instantiation race test cases.

If there is something I should consider, please let me know!
Thanks.

[1] http://lwn.net/Articles/558863/ 
	"[PATCH] mm/hugetlb: per-vma instantiation mutexes"
[2] https://lkml.org/lkml/2013/9/4/630

Joonsoo Kim (14):
  mm, hugetlb: unify region structure handling
  mm, hugetlb: region manipulation functions take resv_map rather
    list_head
  mm, hugetlb: protect region tracking via newly introduced resv_map
    lock
  mm, hugetlb: remove resv_map_put()
  mm, hugetlb: make vma_resv_map() works for all mapping type
  mm, hugetlb: remove vma_has_reserves()
  mm, hugetlb: unify chg and avoid_reserve to use_reserve
  mm, hugetlb: call vma_needs_reservation before entering
    alloc_huge_page()
  mm, hugetlb: remove a check for return value of alloc_huge_page()
  mm, hugetlb: move down outside_reserve check
  mm, hugetlb: move up anon_vma_prepare()
  mm, hugetlb: clean-up error handling in hugetlb_cow()
  mm, hugetlb: retry if failed to allocate and there is concurrent user
  mm, hugetlb: remove a hugetlb_instantiation_mutex

 fs/hugetlbfs/inode.c    |   17 +-
 include/linux/hugetlb.h |   11 ++
 mm/hugetlb.c            |  401 +++++++++++++++++++++++++----------------------
 3 files changed, 241 insertions(+), 188 deletions(-)

-- 
1.7.9.5


* [PATCH v3 01/14] mm, hugetlb: unify region structure handling
From: Joonsoo Kim @ 2013-12-18  6:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Michal Hocko, Aneesh Kumar K.V,
	KAMEZAWA Hiroyuki, Hugh Dickins, Davidlohr Bueso, David Gibson,
	linux-mm, linux-kernel, Joonsoo Kim, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton, Joonsoo Kim

Currently, we track reserved and allocated regions in two different
ways for MAP_SHARED and MAP_PRIVATE. For MAP_SHARED, we use the
address_space's private_list and, for MAP_PRIVATE, we use a resv_map.
Now, we are preparing to change the coarse-grained lock which protects
the region structure to a fine-grained lock, and this difference hinders it.
So, before changing it, unify region structure handling.

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
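
As a condensed view of the result (resv_map_of() is an illustrative helper
made up for this description, not part of the patch; the actual change is in
the diff below, and patch 05 later adds a similar inode_resv_map() helper):

	static struct resv_map *resv_map_of(struct inode *inode,
					    struct vm_area_struct *vma)
	{
		/* MAP_SHARED: the resv_map now hangs off the inode's mapping */
		if (!vma || vma->vm_flags & VM_MAYSHARE)
			return inode->i_mapping->private_data;
		/* MAP_PRIVATE: the resv_map stays attached to the vma */
		return vma_resv_map(vma);
	}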

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index d19b30a..2040275 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -366,7 +366,13 @@ static void truncate_hugepages(struct inode *inode, loff_t lstart)
 
 static void hugetlbfs_evict_inode(struct inode *inode)
 {
+	struct resv_map *resv_map;
+
 	truncate_hugepages(inode, 0);
+	resv_map = (struct resv_map *)inode->i_mapping->private_data;
+	/* root inode doesn't have the resv_map, so we should check it */
+	if (resv_map)
+		resv_map_release(&resv_map->refs);
 	clear_inode(inode);
 }
 
@@ -476,6 +482,11 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb,
 					umode_t mode, dev_t dev)
 {
 	struct inode *inode;
+	struct resv_map *resv_map;
+
+	resv_map = resv_map_alloc();
+	if (!resv_map)
+		return NULL;
 
 	inode = new_inode(sb);
 	if (inode) {
@@ -487,7 +498,7 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb,
 		inode->i_mapping->a_ops = &hugetlbfs_aops;
 		inode->i_mapping->backing_dev_info =&hugetlbfs_backing_dev_info;
 		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
-		INIT_LIST_HEAD(&inode->i_mapping->private_list);
+		inode->i_mapping->private_data = resv_map;
 		info = HUGETLBFS_I(inode);
 		/*
 		 * The policy is initialized here even if we are creating a
@@ -517,7 +528,9 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb,
 			break;
 		}
 		lockdep_annotate_inode_mutex_key(inode);
-	}
+	} else
+		kref_put(&resv_map->refs, resv_map_release);
+
 	return inode;
 }
 
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index bd7e987..317b0a6 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -5,6 +5,8 @@
 #include <linux/fs.h>
 #include <linux/hugetlb_inline.h>
 #include <linux/cgroup.h>
+#include <linux/list.h>
+#include <linux/kref.h>
 
 struct ctl_table;
 struct user_struct;
@@ -22,6 +24,13 @@ struct hugepage_subpool {
 	long max_hpages, used_hpages;
 };
 
+struct resv_map {
+	struct kref refs;
+	struct list_head regions;
+};
+extern struct resv_map *resv_map_alloc(void);
+void resv_map_release(struct kref *ref);
+
 extern spinlock_t hugetlb_lock;
 extern int hugetlb_max_hstate __read_mostly;
 #define for_each_hstate(h) \
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index dee6cf4..2891902 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -376,12 +376,7 @@ static void set_vma_private_data(struct vm_area_struct *vma,
 	vma->vm_private_data = (void *)value;
 }
 
-struct resv_map {
-	struct kref refs;
-	struct list_head regions;
-};
-
-static struct resv_map *resv_map_alloc(void)
+struct resv_map *resv_map_alloc(void)
 {
 	struct resv_map *resv_map = kmalloc(sizeof(*resv_map), GFP_KERNEL);
 	if (!resv_map)
@@ -393,7 +388,7 @@ static struct resv_map *resv_map_alloc(void)
 	return resv_map;
 }
 
-static void resv_map_release(struct kref *ref)
+void resv_map_release(struct kref *ref)
 {
 	struct resv_map *resv_map = container_of(ref, struct resv_map, refs);
 
@@ -1164,8 +1159,9 @@ static long vma_needs_reservation(struct hstate *h,
 
 	if (vma->vm_flags & VM_MAYSHARE) {
 		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
-		return region_chg(&inode->i_mapping->private_list,
-							idx, idx + 1);
+		struct resv_map *resv = inode->i_mapping->private_data;
+
+		return region_chg(&resv->regions, idx, idx + 1);
 
 	} else if (!is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
 		return 1;
@@ -1189,7 +1185,9 @@ static void vma_commit_reservation(struct hstate *h,
 
 	if (vma->vm_flags & VM_MAYSHARE) {
 		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
-		region_add(&inode->i_mapping->private_list, idx, idx + 1);
+		struct resv_map *resv = inode->i_mapping->private_data;
+
+		region_add(&resv->regions, idx, idx + 1);
 
 	} else if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
 		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
@@ -3159,6 +3157,7 @@ int hugetlb_reserve_pages(struct inode *inode,
 	long ret, chg;
 	struct hstate *h = hstate_inode(inode);
 	struct hugepage_subpool *spool = subpool_inode(inode);
+	struct resv_map *resv_map;
 
 	/*
 	 * Only apply hugepage reservation if asked. At fault time, an
@@ -3174,10 +3173,13 @@ int hugetlb_reserve_pages(struct inode *inode,
 	 * to reserve the full area even if read-only as mprotect() may be
 	 * called to make the mapping read-write. Assume !vma is a shm mapping
 	 */
-	if (!vma || vma->vm_flags & VM_MAYSHARE)
-		chg = region_chg(&inode->i_mapping->private_list, from, to);
-	else {
-		struct resv_map *resv_map = resv_map_alloc();
+	if (!vma || vma->vm_flags & VM_MAYSHARE) {
+		resv_map = inode->i_mapping->private_data;
+
+		chg = region_chg(&resv_map->regions, from, to);
+
+	} else {
+		resv_map = resv_map_alloc();
 		if (!resv_map)
 			return -ENOMEM;
 
@@ -3220,7 +3222,7 @@ int hugetlb_reserve_pages(struct inode *inode,
 	 * else has to be done for private mappings here
 	 */
 	if (!vma || vma->vm_flags & VM_MAYSHARE)
-		region_add(&inode->i_mapping->private_list, from, to);
+		region_add(&resv_map->regions, from, to);
 	return 0;
 out_err:
 	if (vma)
@@ -3231,9 +3233,12 @@ out_err:
 void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
 {
 	struct hstate *h = hstate_inode(inode);
-	long chg = region_truncate(&inode->i_mapping->private_list, offset);
+	struct resv_map *resv_map = inode->i_mapping->private_data;
+	long chg = 0;
 	struct hugepage_subpool *spool = subpool_inode(inode);
 
+	if (resv_map)
+		chg = region_truncate(&resv_map->regions, offset);
 	spin_lock(&inode->i_lock);
 	inode->i_blocks -= (blocks_per_huge_page(h) * freed);
 	spin_unlock(&inode->i_lock);
-- 
1.7.9.5


* [PATCH v3 02/14] mm, hugetlb: region manipulation functions take resv_map rather list_head
From: Joonsoo Kim @ 2013-12-18  6:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Michal Hocko, Aneesh Kumar K.V,
	KAMEZAWA Hiroyuki, Hugh Dickins, Davidlohr Bueso, David Gibson,
	linux-mm, linux-kernel, Joonsoo Kim, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton, Joonsoo Kim

To change the protection method for region tracking to a fine-grained one,
pass the resv_map, instead of the list_head, to the region manipulation
functions. This doesn't introduce any functional change; it just prepares
for the next step.

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2891902..3e7a44b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -151,8 +151,9 @@ struct file_region {
 	long to;
 };
 
-static long region_add(struct list_head *head, long f, long t)
+static long region_add(struct resv_map *resv, long f, long t)
 {
+	struct list_head *head = &resv->regions;
 	struct file_region *rg, *nrg, *trg;
 
 	/* Locate the region we are either in or before. */
@@ -187,8 +188,9 @@ static long region_add(struct list_head *head, long f, long t)
 	return 0;
 }
 
-static long region_chg(struct list_head *head, long f, long t)
+static long region_chg(struct resv_map *resv, long f, long t)
 {
+	struct list_head *head = &resv->regions;
 	struct file_region *rg, *nrg;
 	long chg = 0;
 
@@ -236,8 +238,9 @@ static long region_chg(struct list_head *head, long f, long t)
 	return chg;
 }
 
-static long region_truncate(struct list_head *head, long end)
+static long region_truncate(struct resv_map *resv, long end)
 {
+	struct list_head *head = &resv->regions;
 	struct file_region *rg, *trg;
 	long chg = 0;
 
@@ -266,8 +269,9 @@ static long region_truncate(struct list_head *head, long end)
 	return chg;
 }
 
-static long region_count(struct list_head *head, long f, long t)
+static long region_count(struct resv_map *resv, long f, long t)
 {
+	struct list_head *head = &resv->regions;
 	struct file_region *rg;
 	long chg = 0;
 
@@ -393,7 +397,7 @@ void resv_map_release(struct kref *ref)
 	struct resv_map *resv_map = container_of(ref, struct resv_map, refs);
 
 	/* Clear out any active regions before we release the map. */
-	region_truncate(&resv_map->regions, 0);
+	region_truncate(resv_map, 0);
 	kfree(resv_map);
 }
 
@@ -1161,7 +1165,7 @@ static long vma_needs_reservation(struct hstate *h,
 		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
 		struct resv_map *resv = inode->i_mapping->private_data;
 
-		return region_chg(&resv->regions, idx, idx + 1);
+		return region_chg(resv, idx, idx + 1);
 
 	} else if (!is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
 		return 1;
@@ -1171,7 +1175,7 @@ static long vma_needs_reservation(struct hstate *h,
 		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
 		struct resv_map *resv = vma_resv_map(vma);
 
-		err = region_chg(&resv->regions, idx, idx + 1);
+		err = region_chg(resv, idx, idx + 1);
 		if (err < 0)
 			return err;
 		return 0;
@@ -1187,14 +1191,14 @@ static void vma_commit_reservation(struct hstate *h,
 		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
 		struct resv_map *resv = inode->i_mapping->private_data;
 
-		region_add(&resv->regions, idx, idx + 1);
+		region_add(resv, idx, idx + 1);
 
 	} else if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
 		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
 		struct resv_map *resv = vma_resv_map(vma);
 
 		/* Mark this page used in the map. */
-		region_add(&resv->regions, idx, idx + 1);
+		region_add(resv, idx, idx + 1);
 	}
 }
 
@@ -2285,7 +2289,7 @@ static void hugetlb_vm_op_close(struct vm_area_struct *vma)
 		end = vma_hugecache_offset(h, vma, vma->vm_end);
 
 		reserve = (end - start) -
-			region_count(&resv->regions, start, end);
+			region_count(resv, start, end);
 
 		resv_map_put(vma);
 
@@ -3176,7 +3180,7 @@ int hugetlb_reserve_pages(struct inode *inode,
 	if (!vma || vma->vm_flags & VM_MAYSHARE) {
 		resv_map = inode->i_mapping->private_data;
 
-		chg = region_chg(&resv_map->regions, from, to);
+		chg = region_chg(resv_map, from, to);
 
 	} else {
 		resv_map = resv_map_alloc();
@@ -3222,7 +3226,7 @@ int hugetlb_reserve_pages(struct inode *inode,
 	 * else has to be done for private mappings here
 	 */
 	if (!vma || vma->vm_flags & VM_MAYSHARE)
-		region_add(&resv_map->regions, from, to);
+		region_add(resv_map, from, to);
 	return 0;
 out_err:
 	if (vma)
@@ -3238,7 +3242,7 @@ void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
 	struct hugepage_subpool *spool = subpool_inode(inode);
 
 	if (resv_map)
-		chg = region_truncate(&resv_map->regions, offset);
+		chg = region_truncate(resv_map, offset);
 	spin_lock(&inode->i_lock);
 	inode->i_blocks -= (blocks_per_huge_page(h) * freed);
 	spin_unlock(&inode->i_lock);
-- 
1.7.9.5


* [PATCH v3 03/14] mm, hugetlb: protect region tracking via newly introduced resv_map lock
From: Joonsoo Kim @ 2013-12-18  6:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Michal Hocko, Aneesh Kumar K.V,
	KAMEZAWA Hiroyuki, Hugh Dickins, Davidlohr Bueso, David Gibson,
	linux-mm, linux-kernel, Joonsoo Kim, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton, Joonsoo Kim

There is a race condition if we map the same file into different processes.
Region tracking is protected by mmap_sem and the
hugetlb_instantiation_mutex. When we mmap, we don't grab the
hugetlb_instantiation_mutex, only mmap_sem. This doesn't prevent other
processes from modifying the region structure, so it can be modified by
two processes concurrently.

To solve this, introduce a lock into resv_map and make the region
manipulation functions grab it before they do the actual work. This makes
region tracking safe.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
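
One subtlety, visible in the region_chg() hunk below: a new file_region may
have to be allocated while the lock is wanted, but kmalloc(GFP_KERNEL) can
sleep, so the lock is dropped around the allocation and the list walk is
retried. A condensed sketch of that pattern (illustrative only; the variable
need_new_region stands in for the real condition in the diff):

	retry:
		spin_lock(&resv->lock);
		/* ... walk the region list ... */
		if (need_new_region && !nrg) {
			spin_unlock(&resv->lock);
			nrg = kmalloc(sizeof(*nrg), GFP_KERNEL); /* may sleep */
			if (!nrg)
				return -ENOMEM;
			goto retry;	/* the list may have changed meanwhile */
		}
		/* ... insert or account under the lock ... */
		spin_unlock(&resv->lock);
		kfree(nrg);		/* free the pre-allocation if unused */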

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 317b0a6..ee304d1 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -26,6 +26,7 @@ struct hugepage_subpool {
 
 struct resv_map {
 	struct kref refs;
+	spinlock_t lock;
 	struct list_head regions;
 };
 extern struct resv_map *resv_map_alloc(void);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3e7a44b..cf0eaff 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -135,15 +135,8 @@ static inline struct hugepage_subpool *subpool_vma(struct vm_area_struct *vma)
  * Region tracking -- allows tracking of reservations and instantiated pages
  *                    across the pages in a mapping.
  *
- * The region data structures are protected by a combination of the mmap_sem
- * and the hugetlb_instantiation_mutex.  To access or modify a region the caller
- * must either hold the mmap_sem for write, or the mmap_sem for read and
- * the hugetlb_instantiation_mutex:
- *
- *	down_write(&mm->mmap_sem);
- * or
- *	down_read(&mm->mmap_sem);
- *	mutex_lock(&hugetlb_instantiation_mutex);
+ * The region data structures are embedded into a resv_map and
+ * protected by a resv_map's lock
  */
 struct file_region {
 	struct list_head link;
@@ -156,6 +149,7 @@ static long region_add(struct resv_map *resv, long f, long t)
 	struct list_head *head = &resv->regions;
 	struct file_region *rg, *nrg, *trg;
 
+	spin_lock(&resv->lock);
 	/* Locate the region we are either in or before. */
 	list_for_each_entry(rg, head, link)
 		if (f <= rg->to)
@@ -185,15 +179,18 @@ static long region_add(struct resv_map *resv, long f, long t)
 	}
 	nrg->from = f;
 	nrg->to = t;
+	spin_unlock(&resv->lock);
 	return 0;
 }
 
 static long region_chg(struct resv_map *resv, long f, long t)
 {
 	struct list_head *head = &resv->regions;
-	struct file_region *rg, *nrg;
+	struct file_region *rg, *nrg = NULL;
 	long chg = 0;
 
+retry:
+	spin_lock(&resv->lock);
 	/* Locate the region we are before or in. */
 	list_for_each_entry(rg, head, link)
 		if (f <= rg->to)
@@ -203,15 +200,23 @@ static long region_chg(struct resv_map *resv, long f, long t)
 	 * Subtle, allocate a new region at the position but make it zero
 	 * size such that we can guarantee to record the reservation. */
 	if (&rg->link == head || t < rg->from) {
-		nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
-		if (!nrg)
-			return -ENOMEM;
+		if (!nrg) {
+			spin_unlock(&resv->lock);
+			nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
+			if (!nrg)
+				return -ENOMEM;
+
+			goto retry;
+		}
+
 		nrg->from = f;
 		nrg->to   = f;
 		INIT_LIST_HEAD(&nrg->link);
 		list_add(&nrg->link, rg->link.prev);
+		nrg = NULL;
 
-		return t - f;
+		chg = t - f;
+		goto out_locked;
 	}
 
 	/* Round our left edge to the current segment if it encloses us. */
@@ -224,7 +229,7 @@ static long region_chg(struct resv_map *resv, long f, long t)
 		if (&rg->link == head)
 			break;
 		if (rg->from > t)
-			return chg;
+			goto out_locked;
 
 		/* We overlap with this area, if it extends further than
 		 * us then we must extend ourselves.  Account for its
@@ -235,6 +240,10 @@ static long region_chg(struct resv_map *resv, long f, long t)
 		}
 		chg -= rg->to - rg->from;
 	}
+
+out_locked:
+	spin_unlock(&resv->lock);
+	kfree(nrg);
 	return chg;
 }
 
@@ -244,12 +253,13 @@ static long region_truncate(struct resv_map *resv, long end)
 	struct file_region *rg, *trg;
 	long chg = 0;
 
+	spin_lock(&resv->lock);
 	/* Locate the region we are either in or before. */
 	list_for_each_entry(rg, head, link)
 		if (end <= rg->to)
 			break;
 	if (&rg->link == head)
-		return 0;
+		goto out;
 
 	/* If we are in the middle of a region then adjust it. */
 	if (end > rg->from) {
@@ -266,6 +276,9 @@ static long region_truncate(struct resv_map *resv, long end)
 		list_del(&rg->link);
 		kfree(rg);
 	}
+
+out:
+	spin_unlock(&resv->lock);
 	return chg;
 }
 
@@ -275,6 +288,7 @@ static long region_count(struct resv_map *resv, long f, long t)
 	struct file_region *rg;
 	long chg = 0;
 
+	spin_lock(&resv->lock);
 	/* Locate each segment we overlap with, and count that overlap. */
 	list_for_each_entry(rg, head, link) {
 		long seg_from;
@@ -290,6 +304,7 @@ static long region_count(struct resv_map *resv, long f, long t)
 
 		chg += seg_to - seg_from;
 	}
+	spin_unlock(&resv->lock);
 
 	return chg;
 }
@@ -387,6 +402,7 @@ struct resv_map *resv_map_alloc(void)
 		return NULL;
 
 	kref_init(&resv_map->refs);
+	spin_lock_init(&resv_map->lock);
 	INIT_LIST_HEAD(&resv_map->regions);
 
 	return resv_map;
-- 
1.7.9.5


* [PATCH v3 04/14] mm, hugetlb: remove resv_map_put()
From: Joonsoo Kim @ 2013-12-18  6:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Michal Hocko, Aneesh Kumar K.V,
	KAMEZAWA Hiroyuki, Hugh Dickins, Davidlohr Bueso, David Gibson,
	linux-mm, linux-kernel, Joonsoo Kim, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton, Joonsoo Kim

The following patch changes vma_resv_map() to return a resv_map in all
cases. This patch prepares for that by removing resv_map_put(), which
would not work properly with the following change, because it handles
only HPAGE_RESV_OWNER's resv_map, not all resv_maps.

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index cf0eaff..ef70b6f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2282,15 +2282,6 @@ static void hugetlb_vm_op_open(struct vm_area_struct *vma)
 		kref_get(&resv->refs);
 }
 
-static void resv_map_put(struct vm_area_struct *vma)
-{
-	struct resv_map *resv = vma_resv_map(vma);
-
-	if (!resv)
-		return;
-	kref_put(&resv->refs, resv_map_release);
-}
-
 static void hugetlb_vm_op_close(struct vm_area_struct *vma)
 {
 	struct hstate *h = hstate_vma(vma);
@@ -2307,7 +2298,7 @@ static void hugetlb_vm_op_close(struct vm_area_struct *vma)
 		reserve = (end - start) -
 			region_count(resv, start, end);
 
-		resv_map_put(vma);
+		kref_put(&resv->refs, resv_map_release);
 
 		if (reserve) {
 			hugetlb_acct_memory(h, -reserve);
@@ -3245,8 +3236,8 @@ int hugetlb_reserve_pages(struct inode *inode,
 		region_add(resv_map, from, to);
 	return 0;
 out_err:
-	if (vma)
-		resv_map_put(vma);
+	if (vma && is_vma_resv_set(vma, HPAGE_RESV_OWNER))
+		kref_put(&resv_map->refs, resv_map_release);
 	return ret;
 }
 
-- 
1.7.9.5


* [PATCH v3 05/14] mm, hugetlb: make vma_resv_map() works for all mapping type
From: Joonsoo Kim @ 2013-12-18  6:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Michal Hocko, Aneesh Kumar K.V,
	KAMEZAWA Hiroyuki, Hugh Dickins, Davidlohr Bueso, David Gibson,
	linux-mm, linux-kernel, Joonsoo Kim, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton, Joonsoo Kim

Until now, we get a resv_map in two different ways depending on the
mapping type. This makes the code dirty and unreadable, so unify it.

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
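
The payoff, condensed from the vma_needs_reservation() hunk below, is that
callers can now treat both mapping types uniformly:

	resv = vma_resv_map(vma);
	if (!resv)
		return 1;	/* no reservation tracking for this vma */

	idx = vma_hugecache_offset(h, vma, addr);
	chg = region_chg(resv, idx, idx + 1);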

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ef70b6f..f394454 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -417,13 +417,24 @@ void resv_map_release(struct kref *ref)
 	kfree(resv_map);
 }
 
+static inline struct resv_map *inode_resv_map(struct inode *inode)
+{
+	return inode->i_mapping->private_data;
+}
+
 static struct resv_map *vma_resv_map(struct vm_area_struct *vma)
 {
 	VM_BUG_ON(!is_vm_hugetlb_page(vma));
-	if (!(vma->vm_flags & VM_MAYSHARE))
+	if (vma->vm_flags & VM_MAYSHARE) {
+		struct address_space *mapping = vma->vm_file->f_mapping;
+		struct inode *inode = mapping->host;
+
+		return inode_resv_map(inode);
+
+	} else {
 		return (struct resv_map *)(get_vma_private_data(vma) &
 							~HPAGE_RESV_MASK);
-	return NULL;
+	}
 }
 
 static void set_vma_resv_map(struct vm_area_struct *vma, struct resv_map *map)
@@ -1174,48 +1185,34 @@ static void return_unused_surplus_pages(struct hstate *h,
 static long vma_needs_reservation(struct hstate *h,
 			struct vm_area_struct *vma, unsigned long addr)
 {
-	struct address_space *mapping = vma->vm_file->f_mapping;
-	struct inode *inode = mapping->host;
-
-	if (vma->vm_flags & VM_MAYSHARE) {
-		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
-		struct resv_map *resv = inode->i_mapping->private_data;
-
-		return region_chg(resv, idx, idx + 1);
+	struct resv_map *resv;
+	pgoff_t idx;
+	long chg;
 
-	} else if (!is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
+	resv = vma_resv_map(vma);
+	if (!resv)
 		return 1;
 
-	} else  {
-		long err;
-		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
-		struct resv_map *resv = vma_resv_map(vma);
+	idx = vma_hugecache_offset(h, vma, addr);
+	chg = region_chg(resv, idx, idx + 1);
 
-		err = region_chg(resv, idx, idx + 1);
-		if (err < 0)
-			return err;
-		return 0;
-	}
+	if (vma->vm_flags & VM_MAYSHARE)
+		return chg;
+	else
+		return chg < 0 ? chg : 0;
 }
 static void vma_commit_reservation(struct hstate *h,
 			struct vm_area_struct *vma, unsigned long addr)
 {
-	struct address_space *mapping = vma->vm_file->f_mapping;
-	struct inode *inode = mapping->host;
-
-	if (vma->vm_flags & VM_MAYSHARE) {
-		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
-		struct resv_map *resv = inode->i_mapping->private_data;
-
-		region_add(resv, idx, idx + 1);
+	struct resv_map *resv;
+	pgoff_t idx;
 
-	} else if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
-		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
-		struct resv_map *resv = vma_resv_map(vma);
+	resv = vma_resv_map(vma);
+	if (!resv)
+		return;
 
-		/* Mark this page used in the map. */
-		region_add(resv, idx, idx + 1);
-	}
+	idx = vma_hugecache_offset(h, vma, addr);
+	region_add(resv, idx, idx + 1);
 }
 
 static struct page *alloc_huge_page(struct vm_area_struct *vma,
@@ -2278,7 +2275,7 @@ static void hugetlb_vm_op_open(struct vm_area_struct *vma)
 	 * after this open call completes.  It is therefore safe to take a
 	 * new reference here without additional locking.
 	 */
-	if (resv)
+	if (resv && is_vma_resv_set(vma, HPAGE_RESV_OWNER))
 		kref_get(&resv->refs);
 }
 
@@ -2291,7 +2288,10 @@ static void hugetlb_vm_op_close(struct vm_area_struct *vma)
 	unsigned long start;
 	unsigned long end;
 
-	if (resv) {
+	if (!resv)
+		return;
+
+	if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
 		start = vma_hugecache_offset(h, vma, vma->vm_start);
 		end = vma_hugecache_offset(h, vma, vma->vm_end);
 
@@ -3185,7 +3185,7 @@ int hugetlb_reserve_pages(struct inode *inode,
 	 * called to make the mapping read-write. Assume !vma is a shm mapping
 	 */
 	if (!vma || vma->vm_flags & VM_MAYSHARE) {
-		resv_map = inode->i_mapping->private_data;
+		resv_map = inode_resv_map(inode);
 
 		chg = region_chg(resv_map, from, to);
 
@@ -3244,7 +3244,7 @@ out_err:
 void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
 {
 	struct hstate *h = hstate_inode(inode);
-	struct resv_map *resv_map = inode->i_mapping->private_data;
+	struct resv_map *resv_map = inode_resv_map(inode);
 	long chg = 0;
 	struct hugepage_subpool *spool = subpool_inode(inode);
 
-- 
1.7.9.5


* [PATCH v3 06/14] mm, hugetlb: remove vma_has_reserves()
  2013-12-18  6:53 ` Joonsoo Kim
@ 2013-12-18  6:53   ` Joonsoo Kim
  -1 siblings, 0 replies; 90+ messages in thread
From: Joonsoo Kim @ 2013-12-18  6:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Michal Hocko, Aneesh Kumar K.V,
	KAMEZAWA Hiroyuki, Hugh Dickins, Davidlohr Bueso, David Gibson,
	linux-mm, linux-kernel, Joonsoo Kim, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton, Joonsoo Kim

vma_has_reserves() can be substituted by the return value of
vma_needs_reservation(). If the chg returned by vma_needs_reservation()
is 0, the vma has reserves. Otherwise, the vma has no reserves and needs
a hugepage from outside the reserve pool. This is exactly the same
definition as vma_has_reserves(), so remove vma_has_reserves().
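
To make the substitution concrete, the check the callers end up doing can
be modelled in plain userspace C roughly as follows; the function name and
the simplified types below are made up for illustration, this is not the
kernel code:

#include <stdbool.h>
#include <stdio.h>

/*
 * chg models the value vma_needs_reservation() would return:
 * 0 means a reservation already covers this address, > 0 means the
 * request must be satisfied from outside the reserve pool.
 */
static bool dequeue_would_steal_reserve(long chg,
                                        long free_huge_pages,
                                        long resv_huge_pages)
{
    /* Only a non-reserved request can run the unreserved pool dry. */
    return chg && (free_huge_pages - resv_huge_pages == 0);
}

int main(void)
{
    /* Reservation exists: always allowed to dequeue. */
    printf("%d\n", dequeue_would_steal_reserve(0, 1, 1));   /* 0 */
    /* No reservation and every free page is reserved: must not dequeue. */
    printf("%d\n", dequeue_would_steal_reserve(1, 1, 1));   /* 1 */
    return 0;
}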

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index f394454..9d456d4 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -469,39 +469,6 @@ void reset_vma_resv_huge_pages(struct vm_area_struct *vma)
 		vma->vm_private_data = (void *)0;
 }
 
-/* Returns true if the VMA has associated reserve pages */
-static int vma_has_reserves(struct vm_area_struct *vma, long chg)
-{
-	if (vma->vm_flags & VM_NORESERVE) {
-		/*
-		 * This address is already reserved by other process(chg == 0),
-		 * so, we should decrement reserved count. Without decrementing,
-		 * reserve count remains after releasing inode, because this
-		 * allocated page will go into page cache and is regarded as
-		 * coming from reserved pool in releasing step.  Currently, we
-		 * don't have any other solution to deal with this situation
-		 * properly, so add work-around here.
-		 */
-		if (vma->vm_flags & VM_MAYSHARE && chg == 0)
-			return 1;
-		else
-			return 0;
-	}
-
-	/* Shared mappings always use reserves */
-	if (vma->vm_flags & VM_MAYSHARE)
-		return 1;
-
-	/*
-	 * Only the process that called mmap() has reserves for
-	 * private mappings.
-	 */
-	if (is_vma_resv_set(vma, HPAGE_RESV_OWNER))
-		return 1;
-
-	return 0;
-}
-
 static void enqueue_huge_page(struct hstate *h, struct page *page)
 {
 	int nid = page_to_nid(page);
@@ -555,10 +522,11 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
 	/*
 	 * A child process with MAP_PRIVATE mappings created by their parent
 	 * have no page reserves. This check ensures that reservations are
-	 * not "stolen". The child may still get SIGKILLed
+	 * not "stolen". The child may still get SIGKILLed.
+	 * chg represents whether current user has a reserved hugepages or not,
+	 * so that we can use it to ensure that reservations are not "stolen".
 	 */
-	if (!vma_has_reserves(vma, chg) &&
-			h->free_huge_pages - h->resv_huge_pages == 0)
+	if (chg && h->free_huge_pages - h->resv_huge_pages == 0)
 		goto err;
 
 	/* If reserves cannot be used, ensure enough pages are in the pool */
@@ -577,7 +545,11 @@ retry_cpuset:
 			if (page) {
 				if (avoid_reserve)
 					break;
-				if (!vma_has_reserves(vma, chg))
+				/*
+				 * chg means whether current user allocates
+				 * a hugepage on the reserved pool or not
+				 */
+				if (chg)
 					break;
 
 				SetPagePrivate(page);
-- 
1.7.9.5



* [PATCH v3 07/14] mm, hugetlb: unify chg and avoid_reserve to use_reserve
  2013-12-18  6:53 ` Joonsoo Kim
@ 2013-12-18  6:53   ` Joonsoo Kim
  -1 siblings, 0 replies; 90+ messages in thread
From: Joonsoo Kim @ 2013-12-18  6:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Michal Hocko, Aneesh Kumar K.V,
	KAMEZAWA Hiroyuki, Hugh Dickins, Davidlohr Bueso, David Gibson,
	linux-mm, linux-kernel, Joonsoo Kim, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton, Joonsoo Kim

Currently, we have two variables, chg and avoid_reserve, to represent
whether we can use a reserved page or not. Aggregating them into a single
use_reserve makes the code cleaner. This makes no functional difference.
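
A rough userspace sketch of the aggregation (the helper name below is made
up for illustration, not a kernel symbol):

#include <stdbool.h>
#include <stdio.h>

/*
 * use_reserve is true only when a reservation covers this address
 * (chg == 0) and the caller did not ask to avoid the reserves.
 */
static bool compute_use_reserve(long chg, int avoid_reserve)
{
    return !chg && !avoid_reserve;
}

int main(void)
{
    printf("%d\n", compute_use_reserve(0, 0));  /* 1: use the reserve */
    printf("%d\n", compute_use_reserve(0, 1));  /* 0: caller avoids it */
    printf("%d\n", compute_use_reserve(1, 0));  /* 0: nothing reserved */
    return 0;
}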

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9d456d4..9927407 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -508,8 +508,7 @@ static inline gfp_t htlb_alloc_mask(struct hstate *h)
 
 static struct page *dequeue_huge_page_vma(struct hstate *h,
 				struct vm_area_struct *vma,
-				unsigned long address, int avoid_reserve,
-				long chg)
+				unsigned long address, bool use_reserve)
 {
 	struct page *page = NULL;
 	struct mempolicy *mpol;
@@ -523,14 +522,10 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
 	 * A child process with MAP_PRIVATE mappings created by their parent
 	 * have no page reserves. This check ensures that reservations are
 	 * not "stolen". The child may still get SIGKILLed.
-	 * chg represents whether current user has a reserved hugepages or not,
-	 * so that we can use it to ensure that reservations are not "stolen".
+	 * Or, when parent process do COW, we cannot use reserved page.
+	 * In this case, ensure enough pages are in the pool.
 	 */
-	if (chg && h->free_huge_pages - h->resv_huge_pages == 0)
-		goto err;
-
-	/* If reserves cannot be used, ensure enough pages are in the pool */
-	if (avoid_reserve && h->free_huge_pages - h->resv_huge_pages == 0)
+	if (!use_reserve && h->free_huge_pages - h->resv_huge_pages == 0)
 		goto err;
 
 retry_cpuset:
@@ -543,13 +538,7 @@ retry_cpuset:
 		if (cpuset_zone_allowed_softwall(zone, htlb_alloc_mask(h))) {
 			page = dequeue_huge_page_node(h, zone_to_nid(zone));
 			if (page) {
-				if (avoid_reserve)
-					break;
-				/*
-				 * chg means whether current user allocates
-				 * a hugepage on the reserved pool or not
-				 */
-				if (chg)
+				if (!use_reserve)
 					break;
 
 				SetPagePrivate(page);
@@ -1194,6 +1183,7 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
 	struct hstate *h = hstate_vma(vma);
 	struct page *page;
 	long chg;
+	bool use_reserve;
 	int ret, idx;
 	struct hugetlb_cgroup *h_cg;
 
@@ -1209,18 +1199,19 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
 	chg = vma_needs_reservation(h, vma, addr);
 	if (chg < 0)
 		return ERR_PTR(-ENOMEM);
-	if (chg || avoid_reserve)
+	use_reserve = (!chg && !avoid_reserve);
+	if (!use_reserve)
 		if (hugepage_subpool_get_pages(spool, 1))
 			return ERR_PTR(-ENOSPC);
 
 	ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg);
 	if (ret) {
-		if (chg || avoid_reserve)
+		if (!use_reserve)
 			hugepage_subpool_put_pages(spool, 1);
 		return ERR_PTR(-ENOSPC);
 	}
 	spin_lock(&hugetlb_lock);
-	page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve, chg);
+	page = dequeue_huge_page_vma(h, vma, addr, use_reserve);
 	if (!page) {
 		spin_unlock(&hugetlb_lock);
 		page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
@@ -1228,7 +1219,7 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
 			hugetlb_cgroup_uncharge_cgroup(idx,
 						       pages_per_huge_page(h),
 						       h_cg);
-			if (chg || avoid_reserve)
+			if (!use_reserve)
 				hugepage_subpool_put_pages(spool, 1);
 			return ERR_PTR(-ENOSPC);
 		}
-- 
1.7.9.5



* [PATCH v3 08/14] mm, hugetlb: call vma_needs_reservation before entering alloc_huge_page()
  2013-12-18  6:53 ` Joonsoo Kim
@ 2013-12-18  6:53   ` Joonsoo Kim
  -1 siblings, 0 replies; 90+ messages in thread
From: Joonsoo Kim @ 2013-12-18  6:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Michal Hocko, Aneesh Kumar K.V,
	KAMEZAWA Hiroyuki, Hugh Dickins, Davidlohr Bueso, David Gibson,
	linux-mm, linux-kernel, Joonsoo Kim, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton, Joonsoo Kim

In order to validate that this failure is reasonable, the caller needs to
know whether the allocation request was for a reserved page or not. So
move vma_needs_reservation() up to the callers of alloc_huge_page().
There is no functional change in this patch; the following patch uses
this information.
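
The new calling convention, sketched as standalone C with stubbed helpers;
the stub_* names and bodies are invented for illustration, only the
ordering of the calls reflects the patch:

#include <stdbool.h>
#include <stdio.h>

/* Stubs standing in for the real kernel helpers. */
static long stub_vma_needs_reservation(void) { return 0; }
static void *stub_alloc_huge_page(bool use_reserve)
{
    return use_reserve ? "reserved page" : "unreserved page";
}

/* The fault path now decides use_reserve before calling the allocator. */
static void *fault_path(void)
{
    long chg = stub_vma_needs_reservation();

    if (chg < 0)
        return NULL;                      /* would become VM_FAULT_OOM */
    return stub_alloc_huge_page(!chg);    /* chg == 0: reservation exists */
}

int main(void)
{
    printf("%s\n", (const char *)fault_path());
    return 0;
}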

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9927407..d960f46 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1177,13 +1177,11 @@ static void vma_commit_reservation(struct hstate *h,
 }
 
 static struct page *alloc_huge_page(struct vm_area_struct *vma,
-				    unsigned long addr, int avoid_reserve)
+				    unsigned long addr, int use_reserve)
 {
 	struct hugepage_subpool *spool = subpool_vma(vma);
 	struct hstate *h = hstate_vma(vma);
 	struct page *page;
-	long chg;
-	bool use_reserve;
 	int ret, idx;
 	struct hugetlb_cgroup *h_cg;
 
@@ -1196,10 +1194,6 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
 	 * need pages and subpool limit allocated allocated if no reserve
 	 * mapping overlaps.
 	 */
-	chg = vma_needs_reservation(h, vma, addr);
-	if (chg < 0)
-		return ERR_PTR(-ENOMEM);
-	use_reserve = (!chg && !avoid_reserve);
 	if (!use_reserve)
 		if (hugepage_subpool_get_pages(spool, 1))
 			return ERR_PTR(-ENOSPC);
@@ -1244,7 +1238,7 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
 struct page *alloc_huge_page_noerr(struct vm_area_struct *vma,
 				unsigned long addr, int avoid_reserve)
 {
-	struct page *page = alloc_huge_page(vma, addr, avoid_reserve);
+	struct page *page = alloc_huge_page(vma, addr, !avoid_reserve);
 	if (IS_ERR(page))
 		page = NULL;
 	return page;
@@ -2581,6 +2575,8 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct hstate *h = hstate_vma(vma);
 	struct page *old_page, *new_page;
 	int outside_reserve = 0;
+	long chg;
+	bool use_reserve;
 	unsigned long mmun_start;	/* For mmu_notifiers */
 	unsigned long mmun_end;		/* For mmu_notifiers */
 
@@ -2612,7 +2608,17 @@ retry_avoidcopy:
 
 	/* Drop page table lock as buddy allocator may be called */
 	spin_unlock(ptl);
-	new_page = alloc_huge_page(vma, address, outside_reserve);
+	chg = vma_needs_reservation(h, vma, address);
+	if (chg < 0) {
+		page_cache_release(old_page);
+
+		/* Caller expects lock to be held */
+		spin_lock(ptl);
+		return VM_FAULT_OOM;
+	}
+	use_reserve = !chg && !outside_reserve;
+
+	new_page = alloc_huge_page(vma, address, use_reserve);
 
 	if (IS_ERR(new_page)) {
 		long err = PTR_ERR(new_page);
@@ -2742,6 +2748,8 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct address_space *mapping;
 	pte_t new_pte;
 	spinlock_t *ptl;
+	long chg;
+	bool use_reserve;
 
 	/*
 	 * Currently, we are forced to kill the process in the event the
@@ -2767,7 +2775,15 @@ retry:
 		size = i_size_read(mapping->host) >> huge_page_shift(h);
 		if (idx >= size)
 			goto out;
-		page = alloc_huge_page(vma, address, 0);
+
+		chg = vma_needs_reservation(h, vma, address);
+		if (chg == -ENOMEM) {
+			ret = VM_FAULT_OOM;
+			goto out;
+		}
+		use_reserve = !chg;
+
+		page = alloc_huge_page(vma, address, use_reserve);
 		if (IS_ERR(page)) {
 			ret = PTR_ERR(page);
 			if (ret == -ENOMEM)
-- 
1.7.9.5



* [PATCH v3 09/14] mm, hugetlb: remove a check for return value of alloc_huge_page()
  2013-12-18  6:53 ` Joonsoo Kim
@ 2013-12-18  6:53   ` Joonsoo Kim
  -1 siblings, 0 replies; 90+ messages in thread
From: Joonsoo Kim @ 2013-12-18  6:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Michal Hocko, Aneesh Kumar K.V,
	KAMEZAWA Hiroyuki, Hugh Dickins, Davidlohr Bueso, David Gibson,
	linux-mm, linux-kernel, Joonsoo Kim, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton, Joonsoo Kim

Now, alloc_huge_page() only returns -ENOSPC on failure, so we don't need
to worry about any other return value.
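
In effect, the callers' error mapping collapses to a single case. A
trivial standalone sketch, using a local enum rather than the real
VM_FAULT_* values:

#include <stdio.h>

enum fault_result { FAULT_NONE, FAULT_SIGBUS };

/*
 * After this patch every error from alloc_huge_page() means "no page
 * available", so callers no longer distinguish -ENOMEM from -ENOSPC.
 */
static enum fault_result map_alloc_huge_page_error(int err)
{
    (void)err;          /* only -ENOSPC can reach us now */
    return FAULT_SIGBUS;
}

int main(void)
{
    printf("%d\n", map_alloc_huge_page_error(-28)); /* -ENOSPC */
    return 0;
}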

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d960f46..0f56bbf 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2621,7 +2621,6 @@ retry_avoidcopy:
 	new_page = alloc_huge_page(vma, address, use_reserve);
 
 	if (IS_ERR(new_page)) {
-		long err = PTR_ERR(new_page);
 		page_cache_release(old_page);
 
 		/*
@@ -2650,10 +2649,7 @@ retry_avoidcopy:
 
 		/* Caller expects lock to be held */
 		spin_lock(ptl);
-		if (err == -ENOMEM)
-			return VM_FAULT_OOM;
-		else
-			return VM_FAULT_SIGBUS;
+		return VM_FAULT_SIGBUS;
 	}
 
 	/*
@@ -2785,11 +2781,7 @@ retry:
 
 		page = alloc_huge_page(vma, address, use_reserve);
 		if (IS_ERR(page)) {
-			ret = PTR_ERR(page);
-			if (ret == -ENOMEM)
-				ret = VM_FAULT_OOM;
-			else
-				ret = VM_FAULT_SIGBUS;
+			ret = VM_FAULT_SIGBUS;
 			goto out;
 		}
 		clear_huge_page(page, address, pages_per_huge_page(h));
-- 
1.7.9.5



* [PATCH v3 10/14] mm, hugetlb: move down outside_reserve check
  2013-12-18  6:53 ` Joonsoo Kim
@ 2013-12-18  6:53   ` Joonsoo Kim
  -1 siblings, 0 replies; 90+ messages in thread
From: Joonsoo Kim @ 2013-12-18  6:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Michal Hocko, Aneesh Kumar K.V,
	KAMEZAWA Hiroyuki, Hugh Dickins, Davidlohr Bueso, David Gibson,
	linux-mm, linux-kernel, Joonsoo Kim, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton, Joonsoo Kim

Just move the outside_reserve check down and skip the
vma_needs_reservation() call when outside_reserve is true. It is a
slightly optimized implementation.

This also makes the code more readable.
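
A minimal userspace model of the reordered check; the names are made up
and only the control flow is meant to mirror the patch:

#include <stdbool.h>
#include <stdio.h>

static long calls;      /* counts how often the reservation check runs */

static long stub_vma_needs_reservation(void)
{
    calls++;
    return 0;
}

/* With the check moved down, outside_reserve short-circuits it entirely. */
static bool decide_use_reserve(bool outside_reserve)
{
    bool use_reserve = false;

    if (!outside_reserve) {
        long chg = stub_vma_needs_reservation();

        if (chg >= 0)
            use_reserve = !chg;
    }
    return use_reserve;
}

int main(void)
{
    decide_use_reserve(true);   /* skips the check */
    decide_use_reserve(false);  /* performs the check */
    printf("reservation check ran %ld time(s)\n", calls);  /* 1 */
    return 0;
}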

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 0f56bbf..03ab285 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2576,7 +2576,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct page *old_page, *new_page;
 	int outside_reserve = 0;
 	long chg;
-	bool use_reserve;
+	bool use_reserve = false;
 	unsigned long mmun_start;	/* For mmu_notifiers */
 	unsigned long mmun_end;		/* For mmu_notifiers */
 
@@ -2591,6 +2591,11 @@ retry_avoidcopy:
 		return 0;
 	}
 
+	page_cache_get(old_page);
+
+	/* Drop page table lock as buddy allocator may be called */
+	spin_unlock(ptl);
+
 	/*
 	 * If the process that created a MAP_PRIVATE mapping is about to
 	 * perform a COW due to a shared page count, attempt to satisfy
@@ -2604,19 +2609,17 @@ retry_avoidcopy:
 			old_page != pagecache_page)
 		outside_reserve = 1;
 
-	page_cache_get(old_page);
-
-	/* Drop page table lock as buddy allocator may be called */
-	spin_unlock(ptl);
-	chg = vma_needs_reservation(h, vma, address);
-	if (chg < 0) {
-		page_cache_release(old_page);
+	if (!outside_reserve) {
+		chg = vma_needs_reservation(h, vma, address);
+		if (chg < 0) {
+			page_cache_release(old_page);
 
-		/* Caller expects lock to be held */
-		spin_lock(ptl);
-		return VM_FAULT_OOM;
+			/* Caller expects lock to be held */
+			spin_lock(ptl);
+			return VM_FAULT_OOM;
+		}
+		use_reserve = !chg;
 	}
-	use_reserve = !chg && !outside_reserve;
 
 	new_page = alloc_huge_page(vma, address, use_reserve);
 
-- 
1.7.9.5



* [PATCH v3 11/14] mm, hugetlb: move up anon_vma_prepare()
  2013-12-18  6:53 ` Joonsoo Kim
@ 2013-12-18  6:53   ` Joonsoo Kim
  -1 siblings, 0 replies; 90+ messages in thread
From: Joonsoo Kim @ 2013-12-18  6:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Michal Hocko, Aneesh Kumar K.V,
	KAMEZAWA Hiroyuki, Hugh Dickins, Davidlohr Bueso, David Gibson,
	linux-mm, linux-kernel, Joonsoo Kim, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton, Joonsoo Kim

If we fail after having allocated a hugepage, it takes some effort to
recover properly. So it is better to delay allocating a hugepage as long
as possible. Therefore, move up anon_vma_prepare(), which can fail in an
OOM situation.
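
The underlying principle is to fail the cheap-to-undo step before taking
the expensive resource. A small standalone sketch with invented stub
names, not the kernel code:

#include <stdbool.h>
#include <stdio.h>

static bool stub_anon_vma_prepare_ok(void) { return false; } /* simulate OOM */
static bool stub_alloc_hugepage(void) { return true; }

static int cow_like_path(void)
{
    /* Fail the easy-to-recover step first: nothing to clean up yet. */
    if (!stub_anon_vma_prepare_ok())
        return -1;

    /* Only now take the expensive, hard-to-undo resource. */
    if (!stub_alloc_hugepage())
        return -1;

    return 0;
}

int main(void)
{
    /* With the old ordering, this failure would also have to release a
     * freshly allocated hugepage; with the new ordering it never has to. */
    printf("result: %d\n", cow_like_path());
    return 0;
}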

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 03ab285..1817720 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2597,6 +2597,17 @@ retry_avoidcopy:
 	spin_unlock(ptl);
 
 	/*
+	 * When the original hugepage is shared one, it does not have
+	 * anon_vma prepared.
+	 */
+	if (unlikely(anon_vma_prepare(vma))) {
+		page_cache_release(old_page);
+		/* Caller expects lock to be held */
+		spin_lock(ptl);
+		return VM_FAULT_OOM;
+	}
+
+	/*
 	 * If the process that created a MAP_PRIVATE mapping is about to
 	 * perform a COW due to a shared page count, attempt to satisfy
 	 * the allocation without using the existing reserves. The pagecache
@@ -2655,18 +2666,6 @@ retry_avoidcopy:
 		return VM_FAULT_SIGBUS;
 	}
 
-	/*
-	 * When the original hugepage is shared one, it does not have
-	 * anon_vma prepared.
-	 */
-	if (unlikely(anon_vma_prepare(vma))) {
-		page_cache_release(new_page);
-		page_cache_release(old_page);
-		/* Caller expects lock to be held */
-		spin_lock(ptl);
-		return VM_FAULT_OOM;
-	}
-
 	copy_user_huge_page(new_page, old_page, address, vma,
 			    pages_per_huge_page(h));
 	__SetPageUptodate(new_page);
-- 
1.7.9.5



* [PATCH v3 12/14] mm, hugetlb: clean-up error handling in hugetlb_cow()
  2013-12-18  6:53 ` Joonsoo Kim
@ 2013-12-18  6:53   ` Joonsoo Kim
  -1 siblings, 0 replies; 90+ messages in thread
From: Joonsoo Kim @ 2013-12-18  6:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Michal Hocko, Aneesh Kumar K.V,
	KAMEZAWA Hiroyuki, Hugh Dickins, Davidlohr Bueso, David Gibson,
	linux-mm, linux-kernel, Joonsoo Kim, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton, Joonsoo Kim

The current code repeats 'Caller expects lock to be held' in every error
path. We can clean this up by doing the error handling in one place.
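
The target shape is the usual goto-unwind pattern. A generic standalone
sketch of that pattern (not the actual hugetlb_cow() code) looks like
this:

#include <stdio.h>
#include <stdlib.h>

/* Single exit path: every error jumps to the label that releases
 * exactly what has been taken so far, then falls through to the
 * common tail. */
static int do_work(void)
{
    int ret = 0;
    char *a, *b;

    a = malloc(16);
    if (!a) {
        ret = -1;
        goto out_lock;
    }

    b = malloc(16);
    if (!b) {
        ret = -1;
        goto out_a;
    }

    free(b);
out_a:
    free(a);
out_lock:
    /* in hugetlb_cow() this is where the page table lock is retaken */
    return ret;
}

int main(void)
{
    printf("%d\n", do_work());
    return 0;
}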

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 1817720..a9ae7d3 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2577,6 +2577,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 	int outside_reserve = 0;
 	long chg;
 	bool use_reserve = false;
+	int ret = 0;
 	unsigned long mmun_start;	/* For mmu_notifiers */
 	unsigned long mmun_end;		/* For mmu_notifiers */
 
@@ -2601,10 +2602,8 @@ retry_avoidcopy:
 	 * anon_vma prepared.
 	 */
 	if (unlikely(anon_vma_prepare(vma))) {
-		page_cache_release(old_page);
-		/* Caller expects lock to be held */
-		spin_lock(ptl);
-		return VM_FAULT_OOM;
+		ret = VM_FAULT_OOM;
+		goto out_old_page;
 	}
 
 	/*
@@ -2623,11 +2622,8 @@ retry_avoidcopy:
 	if (!outside_reserve) {
 		chg = vma_needs_reservation(h, vma, address);
 		if (chg < 0) {
-			page_cache_release(old_page);
-
-			/* Caller expects lock to be held */
-			spin_lock(ptl);
-			return VM_FAULT_OOM;
+			ret = VM_FAULT_OOM;
+			goto out_old_page;
 		}
 		use_reserve = !chg;
 	}
@@ -2661,9 +2657,8 @@ retry_avoidcopy:
 			WARN_ON_ONCE(1);
 		}
 
-		/* Caller expects lock to be held */
-		spin_lock(ptl);
-		return VM_FAULT_SIGBUS;
+		ret = VM_FAULT_SIGBUS;
+		goto out_lock;
 	}
 
 	copy_user_huge_page(new_page, old_page, address, vma,
@@ -2694,11 +2689,12 @@ retry_avoidcopy:
 	spin_unlock(ptl);
 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 	page_cache_release(new_page);
+out_old_page:
 	page_cache_release(old_page);
-
+out_lock:
 	/* Caller expects lock to be held */
 	spin_lock(ptl);
-	return 0;
+	return ret;
 }
 
 /* Return the pagecache page at a given address within a VMA */
-- 
1.7.9.5



* [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user
  2013-12-18  6:53 ` Joonsoo Kim
@ 2013-12-18  6:53   ` Joonsoo Kim
  -1 siblings, 0 replies; 90+ messages in thread
From: Joonsoo Kim @ 2013-12-18  6:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Michal Hocko, Aneesh Kumar K.V,
	KAMEZAWA Hiroyuki, Hugh Dickins, Davidlohr Bueso, David Gibson,
	linux-mm, linux-kernel, Joonsoo Kim, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton, Joonsoo Kim

If parallel faults occur, we can fail to allocate a hugepage because
many threads dequeue a hugepage to handle a fault on the same address.
This makes the reserved pool fall short just for a little while, and it
causes a faulting thread that should be able to get a hugepage to
receive a SIGBUS signal.

To solve this problem, we already have a nice solution, that is, the
hugetlb_instantiation_mutex. It blocks other threads from diving into
the fault handler. This solves the problem cleanly, but it introduces
performance degradation because it serializes all fault handling.

Now, I try to remove the hugetlb_instantiation_mutex to get rid of this
performance degradation. To achieve that, we should first ensure that
no one gets a SIGBUS while there are enough hugepages.

For this purpose, if we fail to allocate a new hugepage when there is a
concurrent user, we return just 0 instead of VM_FAULT_SIGBUS. With
this, such threads defer getting a SIGBUS signal until there is no
concurrent user, and so we can ensure that no one gets a SIGBUS if
there are enough hugepages.
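
The retry rule can be modelled in userspace as below; the names are made
up, and in the real patch the counter lives in struct hstate and is read
under hugetlb_lock:

#include <stdbool.h>
#include <stdio.h>

/* Snapshot of how many other faults were holding a dequeued page when
 * our allocation failed.  If anyone was, one of them may return that
 * page to the pool, so the right answer is "retry", not SIGBUS. */
static int fault_result(bool alloc_failed, unsigned long nr_dequeue_users)
{
    if (!alloc_failed)
        return 0;               /* fault handled */
    if (nr_dequeue_users)
        return 0;               /* 0 => caller simply re-faults */
    return 1;                   /* genuinely out of pages: SIGBUS */
}

int main(void)
{
    printf("%d\n", fault_result(true, 3));  /* concurrent users: retry */
    printf("%d\n", fault_result(true, 0));  /* nobody else: SIGBUS */
    printf("%d\n", fault_result(false, 0)); /* allocation succeeded */
    return 0;
}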

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index ee304d1..daca347 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -255,6 +255,7 @@ struct hstate {
 	int next_nid_to_free;
 	unsigned int order;
 	unsigned long mask;
+	unsigned long nr_dequeue_users;
 	unsigned long max_huge_pages;
 	unsigned long nr_huge_pages;
 	unsigned long free_huge_pages;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a9ae7d3..843c554 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -538,6 +538,7 @@ retry_cpuset:
 		if (cpuset_zone_allowed_softwall(zone, htlb_alloc_mask(h))) {
 			page = dequeue_huge_page_node(h, zone_to_nid(zone));
 			if (page) {
+				h->nr_dequeue_users++;
 				if (!use_reserve)
 					break;
 
@@ -557,6 +558,15 @@ err:
 	return NULL;
 }
 
+static void commit_dequeued_huge_page(struct vm_area_struct *vma)
+{
+	struct hstate *h = hstate_vma(vma);
+
+	spin_lock(&hugetlb_lock);
+	h->nr_dequeue_users--;
+	spin_unlock(&hugetlb_lock);
+}
+
 static void update_and_free_page(struct hstate *h, struct page *page)
 {
 	int i;
@@ -1176,8 +1186,18 @@ static void vma_commit_reservation(struct hstate *h,
 	region_add(resv, idx, idx + 1);
 }
 
+/*
+ * alloc_huge_page() calls dequeue_huge_page_vma() and it would increase
+ * hstate's nr_dequeue_users if it gets a page from the queue. This
+ * nr_dequeue_users is used to prevent concurrent users who can get a page on
+ * the queue from being killed by SIGBUS. After determining if we actually use
+ * it or not, we should notify that we are done to hstate by calling
+ * commit_dequeued_huge_page().
+ */
 static struct page *alloc_huge_page(struct vm_area_struct *vma,
-				    unsigned long addr, int use_reserve)
+				    unsigned long addr, int use_reserve,
+				    unsigned long *nr_dequeue_users,
+				    bool *do_dequeue)
 {
 	struct hugepage_subpool *spool = subpool_vma(vma);
 	struct hstate *h = hstate_vma(vma);
@@ -1205,8 +1225,11 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
 		return ERR_PTR(-ENOSPC);
 	}
 	spin_lock(&hugetlb_lock);
+	*do_dequeue = true;
+	*nr_dequeue_users = h->nr_dequeue_users;
 	page = dequeue_huge_page_vma(h, vma, addr, use_reserve);
 	if (!page) {
+		*do_dequeue = false;
 		spin_unlock(&hugetlb_lock);
 		page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
 		if (!page) {
@@ -1238,9 +1261,16 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
 struct page *alloc_huge_page_noerr(struct vm_area_struct *vma,
 				unsigned long addr, int avoid_reserve)
 {
-	struct page *page = alloc_huge_page(vma, addr, !avoid_reserve);
+	struct page *page;
+	unsigned long nr_dequeue_users;
+	bool do_dequeue = false;
+
+	page = alloc_huge_page(vma, addr, !avoid_reserve,
+				&nr_dequeue_users, &do_dequeue);
 	if (IS_ERR(page))
 		page = NULL;
+	else if (do_dequeue)
+		commit_dequeued_huge_page(vma);
 	return page;
 }
 
@@ -1975,6 +2005,7 @@ void __init hugetlb_add_hstate(unsigned order)
 	h->mask = ~((1ULL << (order + PAGE_SHIFT)) - 1);
 	h->nr_huge_pages = 0;
 	h->free_huge_pages = 0;
+	h->nr_dequeue_users = 0;
 	for (i = 0; i < MAX_NUMNODES; ++i)
 		INIT_LIST_HEAD(&h->hugepage_freelists[i]);
 	INIT_LIST_HEAD(&h->hugepage_activelist);
@@ -2577,6 +2608,8 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 	int outside_reserve = 0;
 	long chg;
 	bool use_reserve = false;
+	unsigned long nr_dequeue_users = 0;
+	bool do_dequeue = false;
 	int ret = 0;
 	unsigned long mmun_start;	/* For mmu_notifiers */
 	unsigned long mmun_end;		/* For mmu_notifiers */
@@ -2628,11 +2661,17 @@ retry_avoidcopy:
 		use_reserve = !chg;
 	}
 
-	new_page = alloc_huge_page(vma, address, use_reserve);
+	new_page = alloc_huge_page(vma, address, use_reserve,
+						&nr_dequeue_users, &do_dequeue);
 
 	if (IS_ERR(new_page)) {
 		page_cache_release(old_page);
 
+		if (nr_dequeue_users) {
+			ret = 0;
+			goto out_lock;
+		}
+
 		/*
 		 * If a process owning a MAP_PRIVATE mapping fails to COW,
 		 * it is due to references held by a child and an insufficient
@@ -2657,6 +2696,9 @@ retry_avoidcopy:
 			WARN_ON_ONCE(1);
 		}
 
+		if (use_reserve)
+			WARN_ON_ONCE(1);
+
 		ret = VM_FAULT_SIGBUS;
 		goto out_lock;
 	}
@@ -2691,6 +2733,8 @@ retry_avoidcopy:
 	page_cache_release(new_page);
 out_old_page:
 	page_cache_release(old_page);
+	if (do_dequeue)
+		commit_dequeued_huge_page(vma);
 out_lock:
 	/* Caller expects lock to be held */
 	spin_lock(ptl);
@@ -2744,6 +2788,8 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	spinlock_t *ptl;
 	long chg;
 	bool use_reserve;
+	unsigned long nr_dequeue_users = 0;
+	bool do_dequeue = false;
 
 	/*
 	 * Currently, we are forced to kill the process in the event the
@@ -2777,9 +2823,17 @@ retry:
 		}
 		use_reserve = !chg;
 
-		page = alloc_huge_page(vma, address, use_reserve);
+		page = alloc_huge_page(vma, address, use_reserve,
+					&nr_dequeue_users, &do_dequeue);
 		if (IS_ERR(page)) {
-			ret = VM_FAULT_SIGBUS;
+			if (nr_dequeue_users)
+				ret = 0;
+			else {
+				if (use_reserve)
+					WARN_ON_ONCE(1);
+
+				ret = VM_FAULT_SIGBUS;
+			}
 			goto out;
 		}
 		clear_huge_page(page, address, pages_per_huge_page(h));
@@ -2792,22 +2846,26 @@ retry:
 			err = add_to_page_cache(page, mapping, idx, GFP_KERNEL);
 			if (err) {
 				put_page(page);
+				if (do_dequeue)
+					commit_dequeued_huge_page(vma);
 				if (err == -EEXIST)
 					goto retry;
 				goto out;
 			}
 			ClearPagePrivate(page);
+			if (do_dequeue)
+				commit_dequeued_huge_page(vma);
 
 			spin_lock(&inode->i_lock);
 			inode->i_blocks += blocks_per_huge_page(h);
 			spin_unlock(&inode->i_lock);
 		} else {
 			lock_page(page);
+			anon_rmap = 1;
 			if (unlikely(anon_vma_prepare(vma))) {
 				ret = VM_FAULT_OOM;
 				goto backout_unlocked;
 			}
-			anon_rmap = 1;
 		}
 	} else {
 		/*
@@ -2862,6 +2920,8 @@ retry:
 	spin_unlock(ptl);
 	unlock_page(page);
 out:
+	if (anon_rmap && do_dequeue)
+		commit_dequeued_huge_page(vma);
 	return ret;
 
 backout:
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 90+ messages in thread

* [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user
@ 2013-12-18  6:53   ` Joonsoo Kim
  0 siblings, 0 replies; 90+ messages in thread
From: Joonsoo Kim @ 2013-12-18  6:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Michal Hocko, Aneesh Kumar K.V,
	KAMEZAWA Hiroyuki, Hugh Dickins, Davidlohr Bueso, David Gibson,
	linux-mm, linux-kernel, Joonsoo Kim, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton, Joonsoo Kim

If parallel fault occur, we can fail to allocate a hugepage,
because many threads dequeue a hugepage to handle a fault of same address.
This makes reserved pool shortage just for a little while and this cause
faulting thread who can get hugepages to get a SIGBUS signal.

To solve this problem, we already have a nice solution, that is,
a hugetlb_instantiation_mutex. This blocks other threads to dive into
a fault handler. This solve the problem clearly, but it introduce
performance degradation, because it serialize all fault handling.

Now, I try to remove a hugetlb_instantiation_mutex to get rid of
performance degradation. For achieving it, at first, we should ensure that
no one get a SIGBUS if there are enough hugepages.

For this purpose, if we fail to allocate a new hugepage when there is
concurrent user, we return just 0, instead of VM_FAULT_SIGBUS. With this,
these threads defer to get a SIGBUS signal until there is no
concurrent user, and so, we can ensure that no one get a SIGBUS if there
are enough hugepages.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index ee304d1..daca347 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -255,6 +255,7 @@ struct hstate {
 	int next_nid_to_free;
 	unsigned int order;
 	unsigned long mask;
+	unsigned long nr_dequeue_users;
 	unsigned long max_huge_pages;
 	unsigned long nr_huge_pages;
 	unsigned long free_huge_pages;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a9ae7d3..843c554 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -538,6 +538,7 @@ retry_cpuset:
 		if (cpuset_zone_allowed_softwall(zone, htlb_alloc_mask(h))) {
 			page = dequeue_huge_page_node(h, zone_to_nid(zone));
 			if (page) {
+				h->nr_dequeue_users++;
 				if (!use_reserve)
 					break;
 
@@ -557,6 +558,15 @@ err:
 	return NULL;
 }
 
+static void commit_dequeued_huge_page(struct vm_area_struct *vma)
+{
+	struct hstate *h = hstate_vma(vma);
+
+	spin_lock(&hugetlb_lock);
+	h->nr_dequeue_users--;
+	spin_unlock(&hugetlb_lock);
+}
+
 static void update_and_free_page(struct hstate *h, struct page *page)
 {
 	int i;
@@ -1176,8 +1186,18 @@ static void vma_commit_reservation(struct hstate *h,
 	region_add(resv, idx, idx + 1);
 }
 
+/*
+ * alloc_huge_page() calls dequeue_huge_page_vma() and it would increase
+ * hstate's nr_dequeue_users if it gets a page from the queue. This
+ * nr_dequeue_users is used to prevent concurrent users who can get a page on
+ * the queue from being killed by SIGBUS. After determining if we actually use
+ * it or not, we should notify that we are done to hstate by calling
+ * commit_dequeued_huge_page().
+ */
 static struct page *alloc_huge_page(struct vm_area_struct *vma,
-				    unsigned long addr, int use_reserve)
+				    unsigned long addr, int use_reserve,
+				    unsigned long *nr_dequeue_users,
+				    bool *do_dequeue)
 {
 	struct hugepage_subpool *spool = subpool_vma(vma);
 	struct hstate *h = hstate_vma(vma);
@@ -1205,8 +1225,11 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
 		return ERR_PTR(-ENOSPC);
 	}
 	spin_lock(&hugetlb_lock);
+	*do_dequeue = true;
+	*nr_dequeue_users = h->nr_dequeue_users;
 	page = dequeue_huge_page_vma(h, vma, addr, use_reserve);
 	if (!page) {
+		*do_dequeue = false;
 		spin_unlock(&hugetlb_lock);
 		page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
 		if (!page) {
@@ -1238,9 +1261,16 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
 struct page *alloc_huge_page_noerr(struct vm_area_struct *vma,
 				unsigned long addr, int avoid_reserve)
 {
-	struct page *page = alloc_huge_page(vma, addr, !avoid_reserve);
+	struct page *page;
+	unsigned long nr_dequeue_users;
+	bool do_dequeue = false;
+
+	page = alloc_huge_page(vma, addr, !avoid_reserve,
+				&nr_dequeue_users, &do_dequeue);
 	if (IS_ERR(page))
 		page = NULL;
+	else if (do_dequeue)
+		commit_dequeued_huge_page(vma);
 	return page;
 }
 
@@ -1975,6 +2005,7 @@ void __init hugetlb_add_hstate(unsigned order)
 	h->mask = ~((1ULL << (order + PAGE_SHIFT)) - 1);
 	h->nr_huge_pages = 0;
 	h->free_huge_pages = 0;
+	h->nr_dequeue_users = 0;
 	for (i = 0; i < MAX_NUMNODES; ++i)
 		INIT_LIST_HEAD(&h->hugepage_freelists[i]);
 	INIT_LIST_HEAD(&h->hugepage_activelist);
@@ -2577,6 +2608,8 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 	int outside_reserve = 0;
 	long chg;
 	bool use_reserve = false;
+	unsigned long nr_dequeue_users = 0;
+	bool do_dequeue = false;
 	int ret = 0;
 	unsigned long mmun_start;	/* For mmu_notifiers */
 	unsigned long mmun_end;		/* For mmu_notifiers */
@@ -2628,11 +2661,17 @@ retry_avoidcopy:
 		use_reserve = !chg;
 	}
 
-	new_page = alloc_huge_page(vma, address, use_reserve);
+	new_page = alloc_huge_page(vma, address, use_reserve,
+						&nr_dequeue_users, &do_dequeue);
 
 	if (IS_ERR(new_page)) {
 		page_cache_release(old_page);
 
+		if (nr_dequeue_users) {
+			ret = 0;
+			goto out_lock;
+		}
+
 		/*
 		 * If a process owning a MAP_PRIVATE mapping fails to COW,
 		 * it is due to references held by a child and an insufficient
@@ -2657,6 +2696,9 @@ retry_avoidcopy:
 			WARN_ON_ONCE(1);
 		}
 
+		if (use_reserve)
+			WARN_ON_ONCE(1);
+
 		ret = VM_FAULT_SIGBUS;
 		goto out_lock;
 	}
@@ -2691,6 +2733,8 @@ retry_avoidcopy:
 	page_cache_release(new_page);
 out_old_page:
 	page_cache_release(old_page);
+	if (do_dequeue)
+		commit_dequeued_huge_page(vma);
 out_lock:
 	/* Caller expects lock to be held */
 	spin_lock(ptl);
@@ -2744,6 +2788,8 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	spinlock_t *ptl;
 	long chg;
 	bool use_reserve;
+	unsigned long nr_dequeue_users = 0;
+	bool do_dequeue = false;
 
 	/*
 	 * Currently, we are forced to kill the process in the event the
@@ -2777,9 +2823,17 @@ retry:
 		}
 		use_reserve = !chg;
 
-		page = alloc_huge_page(vma, address, use_reserve);
+		page = alloc_huge_page(vma, address, use_reserve,
+					&nr_dequeue_users, &do_dequeue);
 		if (IS_ERR(page)) {
-			ret = VM_FAULT_SIGBUS;
+			if (nr_dequeue_users)
+				ret = 0;
+			else {
+				if (use_reserve)
+					WARN_ON_ONCE(1);
+
+				ret = VM_FAULT_SIGBUS;
+			}
 			goto out;
 		}
 		clear_huge_page(page, address, pages_per_huge_page(h));
@@ -2792,22 +2846,26 @@ retry:
 			err = add_to_page_cache(page, mapping, idx, GFP_KERNEL);
 			if (err) {
 				put_page(page);
+				if (do_dequeue)
+					commit_dequeued_huge_page(vma);
 				if (err == -EEXIST)
 					goto retry;
 				goto out;
 			}
 			ClearPagePrivate(page);
+			if (do_dequeue)
+				commit_dequeued_huge_page(vma);
 
 			spin_lock(&inode->i_lock);
 			inode->i_blocks += blocks_per_huge_page(h);
 			spin_unlock(&inode->i_lock);
 		} else {
 			lock_page(page);
+			anon_rmap = 1;
 			if (unlikely(anon_vma_prepare(vma))) {
 				ret = VM_FAULT_OOM;
 				goto backout_unlocked;
 			}
-			anon_rmap = 1;
 		}
 	} else {
 		/*
@@ -2862,6 +2920,8 @@ retry:
 	spin_unlock(ptl);
 	unlock_page(page);
 out:
+	if (anon_rmap && do_dequeue)
+		commit_dequeued_huge_page(vma);
 	return ret;
 
 backout:
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 90+ messages in thread

* [PATCH v3 14/14] mm, hugetlb: remove a hugetlb_instantiation_mutex
  2013-12-18  6:53 ` Joonsoo Kim
@ 2013-12-18  6:54   ` Joonsoo Kim
  -1 siblings, 0 replies; 90+ messages in thread
From: Joonsoo Kim @ 2013-12-18  6:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Michal Hocko, Aneesh Kumar K.V,
	KAMEZAWA Hiroyuki, Hugh Dickins, Davidlohr Bueso, David Gibson,
	linux-mm, linux-kernel, Joonsoo Kim, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton, Joonsoo Kim

Now we have the infrastructure needed to remove this awkward mutex
which serializes all faulting tasks, so remove it.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 843c554..6edf423 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2595,9 +2595,7 @@ static int unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
 
 /*
  * Hugetlb_cow() should be called with page lock of the original hugepage held.
- * Called with hugetlb_instantiation_mutex held and pte_page locked so we
- * cannot race with other handlers or page migration.
- * Keep the pte_same checks anyway to make transition from the mutex easier.
+ * Called with pte_page locked so we cannot race with page migration.
  */
 static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long address, pte_t *ptep, pte_t pte,
@@ -2941,7 +2939,6 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	int ret;
 	struct page *page = NULL;
 	struct page *pagecache_page = NULL;
-	static DEFINE_MUTEX(hugetlb_instantiation_mutex);
 	struct hstate *h = hstate_vma(vma);
 
 	address &= huge_page_mask(h);
@@ -2961,17 +2958,9 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (!ptep)
 		return VM_FAULT_OOM;
 
-	/*
-	 * Serialize hugepage allocation and instantiation, so that we don't
-	 * get spurious allocation failures if two CPUs race to instantiate
-	 * the same page in the page cache.
-	 */
-	mutex_lock(&hugetlb_instantiation_mutex);
 	entry = huge_ptep_get(ptep);
-	if (huge_pte_none(entry)) {
-		ret = hugetlb_no_page(mm, vma, address, ptep, flags);
-		goto out_mutex;
-	}
+	if (huge_pte_none(entry))
+		return hugetlb_no_page(mm, vma, address, ptep, flags);
 
 	ret = 0;
 
@@ -2984,10 +2973,8 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * consumed.
 	 */
 	if ((flags & FAULT_FLAG_WRITE) && !huge_pte_write(entry)) {
-		if (vma_needs_reservation(h, vma, address) < 0) {
-			ret = VM_FAULT_OOM;
-			goto out_mutex;
-		}
+		if (vma_needs_reservation(h, vma, address) < 0)
+			return VM_FAULT_OOM;
 
 		if (!(vma->vm_flags & VM_MAYSHARE))
 			pagecache_page = hugetlbfs_pagecache_page(h,
@@ -3037,9 +3024,6 @@ out_ptl:
 		unlock_page(page);
 	put_page(page);
 
-out_mutex:
-	mutex_unlock(&hugetlb_instantiation_mutex);
-
 	return ret;
 }
 
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user
  2013-12-18  6:53   ` Joonsoo Kim
@ 2013-12-20  1:02     ` Andrew Morton
  -1 siblings, 0 replies; 90+ messages in thread
From: Andrew Morton @ 2013-12-20  1:02 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Rik van Riel, Mel Gorman, Michal Hocko, Aneesh Kumar K.V,
	KAMEZAWA Hiroyuki, Hugh Dickins, Davidlohr Bueso, David Gibson,
	linux-mm, linux-kernel, Joonsoo Kim, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton

On Wed, 18 Dec 2013 15:53:59 +0900 Joonsoo Kim <iamjoonsoo.kim@lge.com> wrote:

> If parallel fault occur, we can fail to allocate a hugepage,
> because many threads dequeue a hugepage to handle a fault of same address.
> This makes reserved pool shortage just for a little while and this cause
> faulting thread who can get hugepages to get a SIGBUS signal.
> 
> To solve this problem, we already have a nice solution, that is,
> a hugetlb_instantiation_mutex. This blocks other threads to dive into
> a fault handler. This solve the problem clearly, but it introduce
> performance degradation, because it serialize all fault handling.
> 
> Now, I try to remove a hugetlb_instantiation_mutex to get rid of
> performance degradation.

So the whole point of the patch is to improve performance, but the
changelog doesn't include any performance measurements!

Please, run some quantitative tests and include a nice summary of the
results in the changelog.

This is terribly important, because if the performance benefit is
infinitesimally small or negative, the patch goes into the bit bucket ;)

> For achieving it, at first, we should ensure that
> no one get a SIGBUS if there are enough hugepages.
> 
> For this purpose, if we fail to allocate a new hugepage when there is
> concurrent user, we return just 0, instead of VM_FAULT_SIGBUS. With this,
> these threads defer to get a SIGBUS signal until there is no
> concurrent user, and so, we can ensure that no one get a SIGBUS if there
> are enough hugepages.

So if I'm understanding this correctly...  if N threads all generate a
fault against the same address, they will all dive in and allocate a
hugepage, will then do an enormous memcpy into that page and will then
attempt to instantiate the page in pagetables.  All threads except one
will lose the race and will free the page again!  This sounds terribly
inefficient; it would be useful to write a microbenchmark which
triggers this scenario so we can explore the impact.
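
Something along these lines would do it -- just a rough sketch, not code
from this thread: it assumes 2MB huge pages, MAP_HUGETLB support and an
arbitrary thread count, and simply makes every thread first-touch the same
hugepage at once (build with -pthread):

#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <sys/mman.h>

#define NR_THREADS	16
#define HPAGE_SIZE	(2UL * 1024 * 1024)	/* assumes 2MB huge pages */

static char *map;
static pthread_barrier_t barrier;

static void *fault_thread(void *arg)
{
	/* Line all threads up so they fault the same hugepage together. */
	pthread_barrier_wait(&barrier);
	map[0] = 1;	/* first touch goes through hugetlb_fault() */
	return NULL;
}

int main(void)
{
	pthread_t threads[NR_THREADS];
	int i;

	map = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (map == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	pthread_barrier_init(&barrier, NULL, NR_THREADS);
	for (i = 0; i < NR_THREADS; i++)
		pthread_create(&threads[i], NULL, fault_thread, NULL);
	for (i = 0; i < NR_THREADS; i++)
		pthread_join(threads[i], NULL);

	munmap(map, HPAGE_SIZE);
	return 0;
}

Timing the fault phase with and without the series would show whether the
lost-race allocations hurt in practice.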

I'm wondering if a better solution to all of this would be to make
hugetlb_instantiation_mutex an array of, say, 1024 mutexes and index it
with a hash of the faulting address.  That will 99.9% solve the
performance issue which you believe exists without introducing this new
performance issue?
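
Very roughly, what I have in mind is the sketch below; the table size, the
hash inputs and all names here are placeholders, not a worked-out patch:

#include <linux/hugetlb.h>
#include <linux/jhash.h>
#include <linux/mm.h>
#include <linux/mutex.h>

#define NUM_FAULT_MUTEXES	1024
/* each entry would be mutex_init()'ed from hugetlb_init() */
static struct mutex fault_mutex_table[NUM_FAULT_MUTEXES];

static u32 fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
			    struct vm_area_struct *vma, unsigned long address)
{
	unsigned long key[2];

	if (vma->vm_flags & VM_SHARED) {
		/* shared mappings: hash on the file mapping and page index */
		key[0] = (unsigned long)vma->vm_file->f_mapping;
		key[1] = vma_hugecache_offset(h, vma, address);
	} else {
		/* private mappings: hash on the mm and the faulting address */
		key[0] = (unsigned long)mm;
		key[1] = address >> huge_page_shift(h);
	}

	return jhash2((u32 *)key, sizeof(key) / sizeof(u32), 0) &
		(NUM_FAULT_MUTEXES - 1);
}

	/* in hugetlb_fault(): */
	hash = fault_mutex_hash(h, mm, vma, address);
	mutex_lock(&fault_mutex_table[hash]);
	...
	mutex_unlock(&fault_mutex_table[hash]);

Faults on different hugepages would then only contend when they happen to
hash to the same bucket.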

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user
  2013-12-20  1:02     ` Andrew Morton
@ 2013-12-20  1:58       ` Joonsoo Kim
  -1 siblings, 0 replies; 90+ messages in thread
From: Joonsoo Kim @ 2013-12-20  1:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Michal Hocko, Aneesh Kumar K.V,
	KAMEZAWA Hiroyuki, Hugh Dickins, Davidlohr Bueso, David Gibson,
	linux-mm, linux-kernel, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton

On Thu, Dec 19, 2013 at 05:02:02PM -0800, Andrew Morton wrote:
> On Wed, 18 Dec 2013 15:53:59 +0900 Joonsoo Kim <iamjoonsoo.kim@lge.com> wrote:
> 
> > If parallel fault occur, we can fail to allocate a hugepage,
> > because many threads dequeue a hugepage to handle a fault of same address.
> > This makes reserved pool shortage just for a little while and this cause
> > faulting thread who can get hugepages to get a SIGBUS signal.
> > 
> > To solve this problem, we already have a nice solution, that is,
> > a hugetlb_instantiation_mutex. This blocks other threads to dive into
> > a fault handler. This solve the problem clearly, but it introduce
> > performance degradation, because it serialize all fault handling.
> > 
> > Now, I try to remove a hugetlb_instantiation_mutex to get rid of
> > performance degradation.
> 
> So the whole point of the patch is to improve performance, but the
> changelog doesn't include any performance measurements!
> 
> Please, run some quantitative tests and include a nice summary of the
> results in the changelog.
> 
> This is terribly important, because if the performance benefit is
> infinitesimally small or negative, the patch goes into the bit bucket ;)

Hello, Andrew, Davidlohr.

Yes, I should include the performance measurements.
I can measure it in an artificial setup, but I think the best option is for
Davidlohr, who reported the issue, to check the performance improvement.
https://lkml.org/lkml/2013/7/12/428

Davidlohr, could you measure it in your testing environment, where you
originally reported the issue?

> 
> > For achieving it, at first, we should ensure that
> > no one get a SIGBUS if there are enough hugepages.
> > 
> > For this purpose, if we fail to allocate a new hugepage when there is
> > concurrent user, we return just 0, instead of VM_FAULT_SIGBUS. With this,
> > these threads defer to get a SIGBUS signal until there is no
> > concurrent user, and so, we can ensure that no one get a SIGBUS if there
> > are enough hugepages.
> 
> So if I'm understanding this correctly...  if N threads all generate a
> fault against the same address, they will all dive in and allocate a
> hugepage, will then do an enormous memcpy into that page and will then
> attempt to instantiate the page in pagetables.  All threads except one
> will lose the race and will free the page again!  This sounds terribly
> inefficient; it would be useful to write a microbenchmark which
> triggers this scenario so we can explore the impact.

Yes, you understand correctly, I think.

I have an idea to prevent this overhead: mark the page when it is zeroed
and unmark it when it is mapped into a page table. If the current thread
fails to map the page, the zeroed page keeps the marker, and later we can
tell whether it has already been zeroed.

If you want to include this functionality in this series, I can do it ;)
Please let me know your decision.

> I'm wondering if a better solution to all of this would be to make
> hugetlb_instantiation_mutex an array of, say, 1024 mutexes and index it
> with a hash of the faulting address.  That will 99.9% solve the
> performance issue which you believe exists without introducing this new
> performance issue?

Yes, that approach would solve the performance issue.
IIRC, you already suggested this idea roughly 6 months ago and Davidlohr
implemented it. I remember that there is a race issue in the COW case with
this approach. See the following link for more information.
https://lkml.org/lkml/2013/8/7/142

And we need patches 1-3 to prevent other theoretical race issues
regardless of the approach.

Thanks.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user
  2013-12-20  1:58       ` Joonsoo Kim
@ 2013-12-20  2:15         ` Andrew Morton
  -1 siblings, 0 replies; 90+ messages in thread
From: Andrew Morton @ 2013-12-20  2:15 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Rik van Riel, Mel Gorman, Michal Hocko, Aneesh Kumar K.V,
	KAMEZAWA Hiroyuki, Hugh Dickins, Davidlohr Bueso, David Gibson,
	linux-mm, linux-kernel, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton

On Fri, 20 Dec 2013 10:58:10 +0900 Joonsoo Kim <iamjoonsoo.kim@lge.com> wrote:

> On Thu, Dec 19, 2013 at 05:02:02PM -0800, Andrew Morton wrote:
> > On Wed, 18 Dec 2013 15:53:59 +0900 Joonsoo Kim <iamjoonsoo.kim@lge.com> wrote:
> > 
> > > If parallel fault occur, we can fail to allocate a hugepage,
> > > because many threads dequeue a hugepage to handle a fault of same address.
> > > This makes reserved pool shortage just for a little while and this cause
> > > faulting thread who can get hugepages to get a SIGBUS signal.
> > > 
> > 
> > So if I'm understanding this correctly...  if N threads all generate a
> > fault against the same address, they will all dive in and allocate a
> > hugepage, will then do an enormous memcpy into that page and will then
> > attempt to instantiate the page in pagetables.  All threads except one
> > will lose the race and will free the page again!  This sounds terribly
> > inefficient; it would be useful to write a microbenchmark which
> > triggers this scenario so we can explore the impact.
> 
> Yes, you understand correctly, I think.
> 
> I have an idea to prevent this overhead. It is that marking page when it
> is zeroed and unmarking when it is mapped to page table. If page mapping
> is failed due to current thread, the zeroed page will keep the marker and
> later we can determine if it is zeroed or not.

Well OK, but the other threads will need to test that in-progress flag
and then do <something>.  Where <something> will involve some form of
open-coded sleep/wakeup thing.  To avoid all that wheel-reinventing we
can avoid using an internal flag and use an external flag instead. 
There's one in struct mutex!

I doubt if the additional complexity of the external flag is worth it,
but convincing performance testing results would sway me ;) Please have
a think about it all.

> If you want to include this functionality in this series, I can do it ;)
> Please let me know your decision.
> 
> > I'm wondering if a better solution to all of this would be to make
> > hugetlb_instantiation_mutex an array of, say, 1024 mutexes and index it
> > with a hash of the faulting address.  That will 99.9% solve the
> > performance issue which you believe exists without introducing this new
> > performance issue?
> 
> Yes, that approach would solve the performance issue.
> IIRC, you already suggested this idea roughly 6 months ago and it is
> implemented by Davidlohr. I remembered that there is a race issue on
> COW case with this approach. See following link for more information.
> https://lkml.org/lkml/2013/8/7/142

That seems to be unrelated to hugetlb_instantiation_mutex?

> And we need 1-3 patches to prevent other theorectical race issue
> regardless any approaches.

Yes, I'll be going through patches 1-12 very soon, thanks.


And to reiterate: I'm very uncomfortable mucking around with
performance patches when we have run no tests to measure their
magnitude, or even whether they are beneficial at all!

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user
  2013-12-20  1:02     ` Andrew Morton
@ 2013-12-20  2:31       ` Davidlohr Bueso
  -1 siblings, 0 replies; 90+ messages in thread
From: Davidlohr Bueso @ 2013-12-20  2:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Joonsoo Kim, Rik van Riel, Mel Gorman, Michal Hocko,
	Aneesh Kumar K.V, KAMEZAWA Hiroyuki, Hugh Dickins,
	Davidlohr Bueso, David Gibson, linux-mm, linux-kernel,
	Joonsoo Kim, Wanpeng Li, Naoya Horiguchi, Hillf Danton

On Thu, 2013-12-19 at 17:02 -0800, Andrew Morton wrote:
> On Wed, 18 Dec 2013 15:53:59 +0900 Joonsoo Kim <iamjoonsoo.kim@lge.com> wrote:
> 
> > If parallel fault occur, we can fail to allocate a hugepage,
> > because many threads dequeue a hugepage to handle a fault of same address.
> > This makes reserved pool shortage just for a little while and this cause
> > faulting thread who can get hugepages to get a SIGBUS signal.
> > 
> > To solve this problem, we already have a nice solution, that is,
> > a hugetlb_instantiation_mutex. This blocks other threads to dive into
> > a fault handler. This solve the problem clearly, but it introduce
> > performance degradation, because it serialize all fault handling.
> > 
> > Now, I try to remove a hugetlb_instantiation_mutex to get rid of
> > performance degradation.
> 
> So the whole point of the patch is to improve performance, but the
> changelog doesn't include any performance measurements!
> 
> Please, run some quantitative tests and include a nice summary of the
> results in the changelog.

I was actually spending this afternoon testing these patches with Oracle
(I haven't seen any issues so far) and unless Joonsoo already did so, I
want to run these through the libhugetlbfs test cases - I got sidetracked by
futexes though :/

Please do consider that, performance-wise, I haven't seen much in
particular. The thing is, I started dealing with this mutex once I
noticed it was the #1 hot lock during Oracle DB startup, but once the
faults are done, it really goes away. So I wouldn't say that the mutex
is a bottleneck except for the first few minutes.

> 
> This is terribly important, because if the performance benefit is
> infinitesimally small or negative, the patch goes into the bit bucket ;)

Well, this mutex is infinitesimally ugly and needs to die (as long as
performance isn't hurt).

Thanks,
Davidlohr


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user
  2013-12-20  2:31       ` Davidlohr Bueso
@ 2013-12-20  4:47         ` Joonsoo Kim
  -1 siblings, 0 replies; 90+ messages in thread
From: Joonsoo Kim @ 2013-12-20  4:47 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Andrew Morton, Rik van Riel, Mel Gorman, Michal Hocko,
	Aneesh Kumar K.V, KAMEZAWA Hiroyuki, Hugh Dickins,
	Davidlohr Bueso, David Gibson, linux-mm, linux-kernel,
	Wanpeng Li, Naoya Horiguchi, Hillf Danton

Hello, Davidlohr.

On Thu, Dec 19, 2013 at 06:31:21PM -0800, Davidlohr Bueso wrote:
> On Thu, 2013-12-19 at 17:02 -0800, Andrew Morton wrote:
> > On Wed, 18 Dec 2013 15:53:59 +0900 Joonsoo Kim <iamjoonsoo.kim@lge.com> wrote:
> > 
> > > If parallel fault occur, we can fail to allocate a hugepage,
> > > because many threads dequeue a hugepage to handle a fault of same address.
> > > This makes reserved pool shortage just for a little while and this cause
> > > faulting thread who can get hugepages to get a SIGBUS signal.
> > > 
> > > To solve this problem, we already have a nice solution, that is,
> > > a hugetlb_instantiation_mutex. This blocks other threads to dive into
> > > a fault handler. This solve the problem clearly, but it introduce
> > > performance degradation, because it serialize all fault handling.
> > > 
> > > Now, I try to remove a hugetlb_instantiation_mutex to get rid of
> > > performance degradation.
> > 
> > So the whole point of the patch is to improve performance, but the
> > changelog doesn't include any performance measurements!
> > 
> > Please, run some quantitative tests and include a nice summary of the
> > results in the changelog.
> 
> I was actually spending this afternoon testing these patches with Oracle
> (I haven't seen any issues so far) and unless Joonsoo already did so, I
> want to run these by the libhugetlb test cases - I got side tracked by
> futexes though :/

Thanks a lot for taking the time to test these patches.
I already ran the libhugetlbfs test cases and they all passed.

> 
> Please do consider that performance wise I haven't seen much in
> particular. The thing is, I started dealing with this mutex once I
> noticed it as the #1 hot lock in Oracle DB starts, but then once the
> faults are done, it really goes away. So I wouldn't say that the mutex
> is a bottleneck except for the first few minutes.

What I want to be sure about is those first few minutes you mentioned.
If possible, let me know the results in a form like the following link.
https://lkml.org/lkml/2013/7/12/428

Thanks in advance. :)

> > 
> > This is terribly important, because if the performance benefit is
> > infinitesimally small or negative, the patch goes into the bit bucket ;)
> 
> Well, this mutex is infinitesimally ugly and needs to die (as long as
> performance isn't hurt).

Yes, agreed.

Thanks.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user
  2013-12-20  2:15         ` Andrew Morton
@ 2013-12-20  5:00           ` Joonsoo Kim
  -1 siblings, 0 replies; 90+ messages in thread
From: Joonsoo Kim @ 2013-12-20  5:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Michal Hocko, Aneesh Kumar K.V,
	KAMEZAWA Hiroyuki, Hugh Dickins, Davidlohr Bueso, David Gibson,
	linux-mm, linux-kernel, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton

On Thu, Dec 19, 2013 at 06:15:20PM -0800, Andrew Morton wrote:
> On Fri, 20 Dec 2013 10:58:10 +0900 Joonsoo Kim <iamjoonsoo.kim@lge.com> wrote:
> 
> > On Thu, Dec 19, 2013 at 05:02:02PM -0800, Andrew Morton wrote:
> > > On Wed, 18 Dec 2013 15:53:59 +0900 Joonsoo Kim <iamjoonsoo.kim@lge.com> wrote:
> > > 
> > > > If parallel fault occur, we can fail to allocate a hugepage,
> > > > because many threads dequeue a hugepage to handle a fault of same address.
> > > > This makes reserved pool shortage just for a little while and this cause
> > > > faulting thread who can get hugepages to get a SIGBUS signal.
> > > > 
> > > 
> > > So if I'm understanding this correctly...  if N threads all generate a
> > > fault against the same address, they will all dive in and allocate a
> > > hugepage, will then do an enormous memcpy into that page and will then
> > > attempt to instantiate the page in pagetables.  All threads except one
> > > will lose the race and will free the page again!  This sounds terribly
> > > inefficient; it would be useful to write a microbenchmark which
> > > triggers this scenario so we can explore the impact.
> > 
> > Yes, you understand correctly, I think.
> > 
> > I have an idea to prevent this overhead. It is that marking page when it
> > is zeroed and unmarking when it is mapped to page table. If page mapping
> > is failed due to current thread, the zeroed page will keep the marker and
> > later we can determine if it is zeroed or not.
> 
> Well OK, but the other threads will need to test that in-progress flag
> and then do <something>.  Where <something> will involve some form of
> open-coded sleep/wakeup thing.  To avoid all that wheel-reinventing we
> can avoid using an internal flag and use an external flag instead. 
> There's one in struct mutex!

My idea only considers hugetlb_no_page() and doesn't need a sleep.
It just sets <some> page flag after zeroing, and if another thread takes
the page with this flag set when faulting, it simply uses the page without
zeroing it again.
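
Just to illustrate what I mean, a rough fragment is below. Note that this
is a sketch only: PG_hugetlb_zeroed and the PageHugetlbZeroed() family of
helpers are made-up names, not existing page flags.

	/*
	 * In hugetlb_no_page(), after allocating a hugepage: mark it once
	 * it has been zeroed (illustration only, these helpers don't exist).
	 */
	if (!PageHugetlbZeroed(page)) {
		clear_huge_page(page, address, pages_per_huge_page(h));
		SetPageHugetlbZeroed(page);
	}

	...

	/* Once the page is successfully mapped into a page table, unmark it. */
	ClearPageHugetlbZeroed(page);

	/*
	 * If we instead lose the race and the page goes back to the free
	 * list, the flag stays set, so the next faulting thread can skip
	 * clear_huge_page() for this page.
	 */

The common path stays unchanged; only the losing threads pay for the extra
flag handling.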

> 
> I doubt if the additional complexity of the external flag is worth it,
> but convincing performance testing results would sway me ;) Please have
> a think about it all.
> 
> > If you want to include this functionality in this series, I can do it ;)
> > Please let me know your decision.
> > 
> > > I'm wondering if a better solution to all of this would be to make
> > > hugetlb_instantiation_mutex an array of, say, 1024 mutexes and index it
> > > with a hash of the faulting address.  That will 99.9% solve the
> > > performance issue which you believe exists without introducing this new
> > > performance issue?
> > 
> > Yes, that approach would solve the performance issue.
> > IIRC, you already suggested this idea roughly 6 months ago and it is
> > implemented by Davidlohr. I remembered that there is a race issue on
> > COW case with this approach. See following link for more information.
> > https://lkml.org/lkml/2013/8/7/142
> 
> That seems to be unrelated to hugetlb_instantiation_mutex?

Yes, it is related to the hugetlb_instantiation_mutex. In the link, I mentioned
a race condition in the mutex-table patches, which are meant to replace the
hugetlb_instantiation_mutex, although the conversation isn't easy to follow.

> 
> > And we need 1-3 patches to prevent other theorectical race issue
> > regardless any approaches.
> 
> Yes, I'll be going through patches 1-12 very soon, thanks.

Okay. Thanks :)

> 
> 
> And to reiterate: I'm very uncomfortable mucking around with
> performance patches when we have run no tests to measure their
> magnitude, or even whether they are beneficial at all!

Okay. I will keep it in mind. :)

Thanks.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user
  2013-12-20  1:02     ` Andrew Morton
@ 2013-12-20 14:01       ` Mel Gorman
  -1 siblings, 0 replies; 90+ messages in thread
From: Mel Gorman @ 2013-12-20 14:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Joonsoo Kim, Rik van Riel, Michal Hocko, Aneesh Kumar K.V,
	KAMEZAWA Hiroyuki, Hugh Dickins, Davidlohr Bueso, David Gibson,
	linux-mm, linux-kernel, Joonsoo Kim, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton

On Thu, Dec 19, 2013 at 05:02:02PM -0800, Andrew Morton wrote:
> On Wed, 18 Dec 2013 15:53:59 +0900 Joonsoo Kim <iamjoonsoo.kim@lge.com> wrote:
> 
> > If parallel fault occur, we can fail to allocate a hugepage,
> > because many threads dequeue a hugepage to handle a fault of same address.
> > This makes reserved pool shortage just for a little while and this cause
> > faulting thread who can get hugepages to get a SIGBUS signal.
> > 
> > To solve this problem, we already have a nice solution, that is,
> > a hugetlb_instantiation_mutex. This blocks other threads to dive into
> > a fault handler. This solve the problem clearly, but it introduce
> > performance degradation, because it serialize all fault handling.
> > 
> > Now, I try to remove a hugetlb_instantiation_mutex to get rid of
> > performance degradation.
> 
> So the whole point of the patch is to improve performance, but the
> changelog doesn't include any performance measurements!
> 

I don't really deal with hugetlbfs any more and I have not examined this
series but I remember why I never really cared about this mutex. It wrecks
fault scalability but AFAIK fault scalability almost never mattered for
workloads using hugetlbfs.  The most common user of hugetlbfs by far is
sysv shared memory. The memory is faulted early in the lifetime of the
workload and after that it does not matter. At worst, it hurts application
startup time but that is still poor motivation for putting a lot of work
into removing the mutex.

Microbenchmarks will be able to trigger problems in this area but it'd
be important to check if any workload that matters is actually hitting
that problem.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user
  2013-12-20 14:01       ` Mel Gorman
@ 2013-12-21  6:48         ` Davidlohr Bueso
  -1 siblings, 0 replies; 90+ messages in thread
From: Davidlohr Bueso @ 2013-12-21  6:48 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Joonsoo Kim, Rik van Riel, Michal Hocko,
	Aneesh Kumar K.V, KAMEZAWA Hiroyuki, Hugh Dickins,
	Davidlohr Bueso, David Gibson, linux-mm, linux-kernel,
	Joonsoo Kim, Wanpeng Li, Naoya Horiguchi, Hillf Danton, aswin

On Fri, 2013-12-20 at 14:01 +0000, Mel Gorman wrote:
> On Thu, Dec 19, 2013 at 05:02:02PM -0800, Andrew Morton wrote:
> > On Wed, 18 Dec 2013 15:53:59 +0900 Joonsoo Kim <iamjoonsoo.kim@lge.com> wrote:
> > 
> > > If parallel fault occur, we can fail to allocate a hugepage,
> > > because many threads dequeue a hugepage to handle a fault of same address.
> > > This makes reserved pool shortage just for a little while and this cause
> > > faulting thread who can get hugepages to get a SIGBUS signal.
> > > 
> > > To solve this problem, we already have a nice solution, that is,
> > > a hugetlb_instantiation_mutex. This blocks other threads to dive into
> > > a fault handler. This solve the problem clearly, but it introduce
> > > performance degradation, because it serialize all fault handling.
> > > 
> > > Now, I try to remove a hugetlb_instantiation_mutex to get rid of
> > > performance degradation.
> > 
> > So the whole point of the patch is to improve performance, but the
> > changelog doesn't include any performance measurements!
> > 
> 
> I don't really deal with hugetlbfs any more and I have not examined this
> series but I remember why I never really cared about this mutex. It wrecks
> fault scalability but AFAIK fault scalability almost never mattered for
> workloads using hugetlbfs.  The most common user of hugetlbfs by far is
> sysv shared memory. The memory is faulted early in the lifetime of the
> workload and after that it does not matter. At worst, it hurts application
> startup time but that is still poor motivation for putting a lot of work
> into removing the mutex.

Yep, important hugepage workloads initially pound heavily on this lock,
and then the contention naturally decreases.

> Microbenchmarks will be able to trigger problems in this area but it'd
> be important to check if any workload that matters is actually hitting
> that problem.

I was thinking of writing one to actually get some numbers for this
patchset -- I don't know of any benchmark that might stress this lock. 
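
For what it's worth, a minimal sketch of such a microbenchmark (hypothetical
and untested; it assumes 2MB hugepages, MAP_HUGETLB support and an already
sized hugepage pool) could just time many threads faulting in the same shared
hugepage mapping:

#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>

#define HPAGE_SIZE	(2UL << 20)	/* assumed 2MB hugepages */
#define NR_PAGES	512
#define NR_THREADS	16

static char *map;

static void *toucher(void *arg)
{
	long i;

	/* every thread touches every page; losers of the race simply find
	   the page already instantiated */
	for (i = 0; i < NR_PAGES; i++)
		map[i * HPAGE_SIZE] = 1;
	return NULL;
}

int main(void)
{
	pthread_t tid[NR_THREADS];
	struct timespec a, b;
	int i;

	map = mmap(NULL, NR_PAGES * HPAGE_SIZE, PROT_READ | PROT_WRITE,
		   MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (map == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	clock_gettime(CLOCK_MONOTONIC, &a);
	for (i = 0; i < NR_THREADS; i++)
		pthread_create(&tid[i], NULL, toucher, NULL);
	for (i = 0; i < NR_THREADS; i++)
		pthread_join(tid[i], NULL);
	clock_gettime(CLOCK_MONOTONIC, &b);

	printf("faulted %d hugepages from %d threads in %.3f ms\n",
	       NR_PAGES, NR_THREADS,
	       (b.tv_sec - a.tv_sec) * 1e3 + (b.tv_nsec - a.tv_nsec) / 1e6);
	return 0;
}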

However, I first measured the number of cycles it costs to start an
Oracle DB, and things went south with these changes. A simple 'startup
immediate' calls hugetlb_fault() ~5000 times. For a vanilla kernel, this
costs ~7.5 billion cycles, and with this patchset it goes up to ~27.1
billion. While there is naturally a fair amount of variation, these
changes do seem to do more harm than good, at least in real-world
scenarios.

Thanks,
Davidlohr


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 01/14] mm, hugetlb: unify region structure handling
  2013-12-18  6:53   ` Joonsoo Kim
  (?)
@ 2013-12-21  9:04   ` David Gibson
  -1 siblings, 0 replies; 90+ messages in thread
From: David Gibson @ 2013-12-21  9:04 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Rik van Riel, Mel Gorman, Michal Hocko,
	Aneesh Kumar K.V, KAMEZAWA Hiroyuki, Hugh Dickins,
	Davidlohr Bueso, linux-mm, linux-kernel, Joonsoo Kim, Wanpeng Li,
	Naoya Horiguchi, Hillf Danton

[-- Attachment #1: Type: text/plain, Size: 862 bytes --]

On Wed, Dec 18, 2013 at 03:53:47PM +0900, Joonsoo Kim wrote:
> Currently, to track a reserved and allocated region, we use two different
> ways for MAP_SHARED and MAP_PRIVATE. For MAP_SHARED, we use
> address_mapping's private_list and, for MAP_PRIVATE, we use a resv_map.
> Now, we are preparing to change a coarse grained lock which protect
> a region structure to fine grained lock, and this difference hinder it.
> So, before changing it, unify region structure handling.
> 
> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 02/14] mm, hugetlb: region manipulation functions take resv_map rather list_head
  2013-12-18  6:53   ` Joonsoo Kim
  (?)
@ 2013-12-21 13:43   ` David Gibson
  -1 siblings, 0 replies; 90+ messages in thread
From: David Gibson @ 2013-12-21 13:43 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Rik van Riel, Mel Gorman, Michal Hocko,
	Aneesh Kumar K.V, KAMEZAWA Hiroyuki, Hugh Dickins,
	Davidlohr Bueso, linux-mm, linux-kernel, Joonsoo Kim, Wanpeng Li,
	Naoya Horiguchi, Hillf Danton

[-- Attachment #1: Type: text/plain, Size: 692 bytes --]

On Wed, Dec 18, 2013 at 03:53:48PM +0900, Joonsoo Kim wrote:
> To change a protection method for region tracking to find grained one,
> we pass the resv_map, instead of list_head, to region manipulation
> functions. This doesn't introduce any functional change, and it is just
> for preparing a next step.
> 
> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 03/14] mm, hugetlb: protect region tracking via newly introduced resv_map lock
  2013-12-18  6:53   ` Joonsoo Kim
  (?)
@ 2013-12-21 13:58   ` David Gibson
  2013-12-23  1:05       ` Joonsoo Kim
  -1 siblings, 1 reply; 90+ messages in thread
From: David Gibson @ 2013-12-21 13:58 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Rik van Riel, Mel Gorman, Michal Hocko,
	Aneesh Kumar K.V, KAMEZAWA Hiroyuki, Hugh Dickins,
	Davidlohr Bueso, linux-mm, linux-kernel, Joonsoo Kim, Wanpeng Li,
	Naoya Horiguchi, Hillf Danton

[-- Attachment #1: Type: text/plain, Size: 955 bytes --]

On Wed, Dec 18, 2013 at 03:53:49PM +0900, Joonsoo Kim wrote:
> There is a race condition if we map a same file on different processes.
> Region tracking is protected by mmap_sem and hugetlb_instantiation_mutex.
> When we do mmap, we don't grab a hugetlb_instantiation_mutex, but,
> grab a mmap_sem. This doesn't prevent other process to modify region
> structure, so it can be modified by two processes concurrently.
> 
> To solve this, I introduce a lock to resv_map and make region manipulation
> function grab a lock before they do actual work. This makes region
> tracking safe.

It's not clear to me if you're saying there is a list corruption race
bug in the existing code, or only that there will be if the
instantiation mutex goes away.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user
  2013-12-21  6:48         ` Davidlohr Bueso
@ 2013-12-23  0:44           ` Joonsoo Kim
  -1 siblings, 0 replies; 90+ messages in thread
From: Joonsoo Kim @ 2013-12-23  0:44 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Mel Gorman, Andrew Morton, Rik van Riel, Michal Hocko,
	Aneesh Kumar K.V, KAMEZAWA Hiroyuki, Hugh Dickins,
	Davidlohr Bueso, David Gibson, linux-mm, linux-kernel,
	Wanpeng Li, Naoya Horiguchi, Hillf Danton, aswin

On Fri, Dec 20, 2013 at 10:48:17PM -0800, Davidlohr Bueso wrote:
> On Fri, 2013-12-20 at 14:01 +0000, Mel Gorman wrote:
> > On Thu, Dec 19, 2013 at 05:02:02PM -0800, Andrew Morton wrote:
> > > On Wed, 18 Dec 2013 15:53:59 +0900 Joonsoo Kim <iamjoonsoo.kim@lge.com> wrote:
> > > 
> > > > If parallel fault occur, we can fail to allocate a hugepage,
> > > > because many threads dequeue a hugepage to handle a fault of same address.
> > > > This makes reserved pool shortage just for a little while and this cause
> > > > faulting thread who can get hugepages to get a SIGBUS signal.
> > > > 
> > > > To solve this problem, we already have a nice solution, that is,
> > > > a hugetlb_instantiation_mutex. This blocks other threads to dive into
> > > > a fault handler. This solve the problem clearly, but it introduce
> > > > performance degradation, because it serialize all fault handling.
> > > > 
> > > > Now, I try to remove a hugetlb_instantiation_mutex to get rid of
> > > > performance degradation.
> > > 
> > > So the whole point of the patch is to improve performance, but the
> > > changelog doesn't include any performance measurements!
> > > 
> > 
> > I don't really deal with hugetlbfs any more and I have not examined this
> > series but I remember why I never really cared about this mutex. It wrecks
> > fault scalability but AFAIK fault scalability almost never mattered for
> > workloads using hugetlbfs.  The most common user of hugetlbfs by far is
> > sysv shared memory. The memory is faulted early in the lifetime of the
> > workload and after that it does not matter. At worst, it hurts application
> > startup time but that is still poor motivation for putting a lot of work
> > into removing the mutex.
> 
> Yep, important hugepage workloads initially pound heavily on this lock,
> then it naturally decreases.
> 
> > Microbenchmarks will be able to trigger problems in this area but it'd
> > be important to check if any workload that matters is actually hitting
> > that problem.
> 
> I was thinking of writing one to actually get some numbers for this
> patchset -- I don't know of any benchmark that might stress this lock. 
> 
> However I first measured the amount of cycles it costs to start an
> Oracle DB and things went south with these changes. A simple 'startup
> immediate' calls hugetlb_fault() ~5000 times. For a vanilla kernel, this
> costs ~7.5 billion cycles and with this patchset it goes up to ~27.1
> billion. While there is naturally a fair amount of variation, these
> changes do seem to do more harm than good, at least in real world
> scenarios.

Hello,

I think that a raw cycle count is not the right metric for this patchset,
because cycles get wasted on fault-handling failures. Instead, the goal was
an improved elapsed time. Could you tell me how long it takes to fault in
all of its hugepages?

Anyway, a difference of this order of magnitude still looks like a problem. :/

I guess the cycles are wasted on zeroing hugepages in the fault path, as
Andrew pointed out.

I will send further patches to fix this problem.

Thanks.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 03/14] mm, hugetlb: protect region tracking via newly introduced resv_map lock
  2013-12-21 13:58   ` David Gibson
@ 2013-12-23  1:05       ` Joonsoo Kim
  0 siblings, 0 replies; 90+ messages in thread
From: Joonsoo Kim @ 2013-12-23  1:05 UTC (permalink / raw)
  To: David Gibson
  Cc: Andrew Morton, Rik van Riel, Mel Gorman, Michal Hocko,
	Aneesh Kumar K.V, KAMEZAWA Hiroyuki, Hugh Dickins,
	Davidlohr Bueso, linux-mm, linux-kernel, Wanpeng Li,
	Naoya Horiguchi, Hillf Danton

On Sun, Dec 22, 2013 at 12:58:19AM +1100, David Gibson wrote:
> On Wed, Dec 18, 2013 at 03:53:49PM +0900, Joonsoo Kim wrote:
> > There is a race condition if we map a same file on different processes.
> > Region tracking is protected by mmap_sem and hugetlb_instantiation_mutex.
> > When we do mmap, we don't grab a hugetlb_instantiation_mutex, but,
> > grab a mmap_sem. This doesn't prevent other process to modify region
> > structure, so it can be modified by two processes concurrently.
> > 
> > To solve this, I introduce a lock to resv_map and make region manipulation
> > function grab a lock before they do actual work. This makes region
> > tracking safe.
> 
> It's not clear to me if you're saying there is a list corruption race
> bug in the existing code, or only that there will be if the
> instantiation mutex goes away.

Hello,

The race exists in the current code.
Currently, region tracking is protected by either down_write(&mm->mmap_sem) or
down_read(&mm->mmap_sem) + the instantiation mutex. But if we map the same
hugetlbfs file into two different processes, holding one process's mmap_sem has
no effect on the other process, so concurrent access to the data structure is
possible.
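
Roughly, the idea in this series (a simplified sketch, not the exact diff;
the structure names just mirror the existing ones) is to give each resv_map
its own lock and take it inside the region helpers, so the list is safe no
matter which mm the fault comes from:

#include <linux/kernel.h>
#include <linux/kref.h>
#include <linux/list.h>
#include <linux/spinlock.h>

struct file_region {
	struct list_head link;
	long from;
	long to;
};

struct resv_map {
	struct kref refs;
	spinlock_t lock;		/* protects the regions list */
	struct list_head regions;
};

/* illustrative helper: count reserved pages in [f, t) under resv->lock */
static long region_count(struct resv_map *resv, long f, long t)
{
	struct file_region *rg;
	long chg = 0;

	spin_lock(&resv->lock);
	list_for_each_entry(rg, &resv->regions, link) {
		long seg_from, seg_to;

		if (rg->to <= f)
			continue;
		if (rg->from >= t)
			break;

		seg_from = max(rg->from, f);
		seg_to = min(rg->to, t);

		chg += seg_to - seg_from;
	}
	spin_unlock(&resv->lock);

	return chg;
}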

Thanks.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user
  2013-12-23  0:44           ` Joonsoo Kim
@ 2013-12-23  2:11             ` Joonsoo Kim
  -1 siblings, 0 replies; 90+ messages in thread
From: Joonsoo Kim @ 2013-12-23  2:11 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Mel Gorman, Andrew Morton, Rik van Riel, Michal Hocko,
	Aneesh Kumar K.V, KAMEZAWA Hiroyuki, Hugh Dickins,
	Davidlohr Bueso, David Gibson, linux-mm, linux-kernel,
	Wanpeng Li, Naoya Horiguchi, Hillf Danton, aswin

On Mon, Dec 23, 2013 at 09:44:38AM +0900, Joonsoo Kim wrote:
> On Fri, Dec 20, 2013 at 10:48:17PM -0800, Davidlohr Bueso wrote:
> > On Fri, 2013-12-20 at 14:01 +0000, Mel Gorman wrote:
> > > On Thu, Dec 19, 2013 at 05:02:02PM -0800, Andrew Morton wrote:
> > > > On Wed, 18 Dec 2013 15:53:59 +0900 Joonsoo Kim <iamjoonsoo.kim@lge.com> wrote:
> > > > 
> > > > > If parallel fault occur, we can fail to allocate a hugepage,
> > > > > because many threads dequeue a hugepage to handle a fault of same address.
> > > > > This makes reserved pool shortage just for a little while and this cause
> > > > > faulting thread who can get hugepages to get a SIGBUS signal.
> > > > > 
> > > > > To solve this problem, we already have a nice solution, that is,
> > > > > a hugetlb_instantiation_mutex. This blocks other threads to dive into
> > > > > a fault handler. This solve the problem clearly, but it introduce
> > > > > performance degradation, because it serialize all fault handling.
> > > > > 
> > > > > Now, I try to remove a hugetlb_instantiation_mutex to get rid of
> > > > > performance degradation.
> > > > 
> > > > So the whole point of the patch is to improve performance, but the
> > > > changelog doesn't include any performance measurements!
> > > > 
> > > 
> > > I don't really deal with hugetlbfs any more and I have not examined this
> > > series but I remember why I never really cared about this mutex. It wrecks
> > > fault scalability but AFAIK fault scalability almost never mattered for
> > > workloads using hugetlbfs.  The most common user of hugetlbfs by far is
> > > sysv shared memory. The memory is faulted early in the lifetime of the
> > > workload and after that it does not matter. At worst, it hurts application
> > > startup time but that is still poor motivation for putting a lot of work
> > > into removing the mutex.
> > 
> > Yep, important hugepage workloads initially pound heavily on this lock,
> > then it naturally decreases.
> > 
> > > Microbenchmarks will be able to trigger problems in this area but it'd
> > > be important to check if any workload that matters is actually hitting
> > > that problem.
> > 
> > I was thinking of writing one to actually get some numbers for this
> > patchset -- I don't know of any benchmark that might stress this lock. 
> > 
> > However I first measured the amount of cycles it costs to start an
> > Oracle DB and things went south with these changes. A simple 'startup
> > immediate' calls hugetlb_fault() ~5000 times. For a vanilla kernel, this
> > costs ~7.5 billion cycles and with this patchset it goes up to ~27.1
> > billion. While there is naturally a fair amount of variation, these
> > changes do seem to do more harm than good, at least in real world
> > scenarios.
> 
> Hello,
> 
> I think that number of cycles is not proper to measure this patchset,
> because cycles would be wasted by fault handling failure. Instead, it
> targeted improved elapsed time. Could you tell me how long it
> takes to fault all of it's hugepages?
> 
> Anyway, this order of magnitude still seems a problem. :/
> 
> I guess that cycles are wasted by zeroing hugepage in fault-path like as
> Andrew pointed out.
> 
> I will send another patches to fix this problem.

Hello, Davidlohr.

Here goes the fix on top of this series.
Thanks.

-------------->8---------------------------
From 5f20459d90dfa2f7cd28d62194ce22bd9a0df0f5 Mon Sep 17 00:00:00 2001
From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Date: Mon, 23 Dec 2013 10:32:04 +0900
Subject: [PATCH] mm, hugetlb: optimize zeroing hugepage

When parallel faults occur, some of them will fail. In this case, the cpu
cycles spent zeroing the failed hugepage are wasted. To reduce this overhead,
mark the hugepage as zeroed after zeroing it, and clear the marker once the
page is actually used. If it isn't used for any reason, it goes back to the
hugepage pool and will be reused later. At that point we will see the
zeroed-page marker and can skip the zeroing.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6edf423..b90b792 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -582,6 +582,7 @@ static void update_and_free_page(struct hstate *h, struct page *page)
 				1 << PG_private | 1 << PG_writeback);
 	}
 	VM_BUG_ON(hugetlb_cgroup_from_page(page));
+	ClearPageActive(page);
 	set_compound_page_dtor(page, NULL);
 	set_page_refcounted(page);
 	arch_release_hugepage(page);
@@ -2715,6 +2716,7 @@ retry_avoidcopy:
 	spin_lock(ptl);
 	ptep = huge_pte_offset(mm, address & huge_page_mask(h));
 	if (likely(pte_same(huge_ptep_get(ptep), pte))) {
+		ClearPageActive(new_page);
 		ClearPagePrivate(new_page);
 
 		/* Break COW */
@@ -2834,7 +2836,10 @@ retry:
 			}
 			goto out;
 		}
-		clear_huge_page(page, address, pages_per_huge_page(h));
+		if (!PageActive(page)) {
+			clear_huge_page(page, address, pages_per_huge_page(h));
+			SetPageActive(page);
+		}
 		__SetPageUptodate(page);
 
 		if (vma->vm_flags & VM_MAYSHARE) {
@@ -2850,6 +2855,7 @@ retry:
 					goto retry;
 				goto out;
 			}
+			ClearPageActive(page);
 			ClearPagePrivate(page);
 			if (do_dequeue)
 				commit_dequeued_huge_page(vma);
@@ -2901,6 +2907,7 @@ retry:
 		goto backout;
 
 	if (anon_rmap) {
+		ClearPageActive(page);
 		ClearPagePrivate(page);
 		hugepage_add_new_anon_rmap(page, vma, address);
 	}
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 03/14] mm, hugetlb: protect region tracking via newly introduced resv_map lock
  2013-12-23  1:05       ` Joonsoo Kim
  (?)
@ 2013-12-24 12:00       ` David Gibson
  2014-01-06  0:12           ` Joonsoo Kim
  -1 siblings, 1 reply; 90+ messages in thread
From: David Gibson @ 2013-12-24 12:00 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Rik van Riel, Mel Gorman, Michal Hocko,
	Aneesh Kumar K.V, KAMEZAWA Hiroyuki, Hugh Dickins,
	Davidlohr Bueso, linux-mm, linux-kernel, Wanpeng Li,
	Naoya Horiguchi, Hillf Danton

[-- Attachment #1: Type: text/plain, Size: 1681 bytes --]

On Mon, Dec 23, 2013 at 10:05:17AM +0900, Joonsoo Kim wrote:
> On Sun, Dec 22, 2013 at 12:58:19AM +1100, David Gibson wrote:
> > On Wed, Dec 18, 2013 at 03:53:49PM +0900, Joonsoo Kim wrote:
> > > There is a race condition if we map a same file on different processes.
> > > Region tracking is protected by mmap_sem and hugetlb_instantiation_mutex.
> > > When we do mmap, we don't grab a hugetlb_instantiation_mutex, but,
> > > grab a mmap_sem. This doesn't prevent other process to modify region
> > > structure, so it can be modified by two processes concurrently.
> > > 
> > > To solve this, I introduce a lock to resv_map and make region manipulation
> > > function grab a lock before they do actual work. This makes region
> > > tracking safe.
> > 
> > It's not clear to me if you're saying there is a list corruption race
> > bug in the existing code, or only that there will be if the
> > instantiation mutex goes away.
> 
> Hello,
> 
> The race exists in current code.
> Currently, region tracking is protected by either down_write(&mm->mmap_sem) or
> down_read(&mm->mmap_sem) + instantiation mutex. But if we map this hugetlbfs
> file to two different processes, holding a mmap_sem doesn't have any impact on
> the other process and concurrent access to data structure is possible.

Ouch.  In that case:

Acked-by: David Gibson <david@gibson.dropbear.id.au>

It would be really nice to add a testcase for this race to the
libhugetlbfs testsuite.
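
Something along these lines might do as a starting point (a hypothetical,
untested sketch rather than libhugetlbfs test API; it assumes a hugetlbfs
mount at /dev/hugepages and 2MB hugepages): two processes repeatedly map,
fault and unmap the same hugetlbfs file, so the shared reservation map is
manipulated concurrently while each process only holds its own mmap_sem.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define HPAGE_SIZE	(2UL << 20)	/* assumed 2MB hugepages */
#define NR_PAGES	64
#define ITERS		1000

static void hammer(int fd)
{
	int i, j;

	for (i = 0; i < ITERS; i++) {
		char *p = mmap(NULL, NR_PAGES * HPAGE_SIZE,
			       PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

		if (p == MAP_FAILED)
			exit(1);
		for (j = 0; j < NR_PAGES; j++)
			p[j * HPAGE_SIZE] = 1;
		munmap(p, NR_PAGES * HPAGE_SIZE);
	}
}

int main(void)
{
	int fd = open("/dev/hugepages/region-race", O_CREAT | O_RDWR, 0600);
	pid_t pid;

	if (fd < 0) {
		perror("open");
		return 1;
	}

	pid = fork();
	if (pid == 0) {
		hammer(fd);
		return 0;
	}
	hammer(fd);
	waitpid(pid, NULL, 0);
	unlink("/dev/hugepages/region-race");
	return 0;
}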

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user
  2013-12-23  2:11             ` Joonsoo Kim
@ 2014-01-03 19:55               ` Davidlohr Bueso
  -1 siblings, 0 replies; 90+ messages in thread
From: Davidlohr Bueso @ 2014-01-03 19:55 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Mel Gorman, Andrew Morton, Rik van Riel, Michal Hocko,
	Aneesh Kumar K.V, KAMEZAWA Hiroyuki, Hugh Dickins,
	Davidlohr Bueso, David Gibson, linux-mm, linux-kernel,
	Wanpeng Li, Naoya Horiguchi, Hillf Danton, aswin

Hi Joonsoo,

Sorry about the delay...

On Mon, 2013-12-23 at 11:11 +0900, Joonsoo Kim wrote:
> On Mon, Dec 23, 2013 at 09:44:38AM +0900, Joonsoo Kim wrote:
> > On Fri, Dec 20, 2013 at 10:48:17PM -0800, Davidlohr Bueso wrote:
> > > On Fri, 2013-12-20 at 14:01 +0000, Mel Gorman wrote:
> > > > On Thu, Dec 19, 2013 at 05:02:02PM -0800, Andrew Morton wrote:
> > > > > On Wed, 18 Dec 2013 15:53:59 +0900 Joonsoo Kim <iamjoonsoo.kim@lge.com> wrote:
> > > > > 
> > > > > > If parallel fault occur, we can fail to allocate a hugepage,
> > > > > > because many threads dequeue a hugepage to handle a fault of same address.
> > > > > > This makes reserved pool shortage just for a little while and this cause
> > > > > > faulting thread who can get hugepages to get a SIGBUS signal.
> > > > > > 
> > > > > > To solve this problem, we already have a nice solution, that is,
> > > > > > a hugetlb_instantiation_mutex. This blocks other threads to dive into
> > > > > > a fault handler. This solve the problem clearly, but it introduce
> > > > > > performance degradation, because it serialize all fault handling.
> > > > > > 
> > > > > > Now, I try to remove a hugetlb_instantiation_mutex to get rid of
> > > > > > performance degradation.
> > > > > 
> > > > > So the whole point of the patch is to improve performance, but the
> > > > > changelog doesn't include any performance measurements!
> > > > > 
> > > > 
> > > > I don't really deal with hugetlbfs any more and I have not examined this
> > > > series but I remember why I never really cared about this mutex. It wrecks
> > > > fault scalability but AFAIK fault scalability almost never mattered for
> > > > workloads using hugetlbfs.  The most common user of hugetlbfs by far is
> > > > sysv shared memory. The memory is faulted early in the lifetime of the
> > > > workload and after that it does not matter. At worst, it hurts application
> > > > startup time but that is still poor motivation for putting a lot of work
> > > > into removing the mutex.
> > > 
> > > Yep, important hugepage workloads initially pound heavily on this lock,
> > > then it naturally decreases.
> > > 
> > > > Microbenchmarks will be able to trigger problems in this area but it'd
> > > > be important to check if any workload that matters is actually hitting
> > > > that problem.
> > > 
> > > I was thinking of writing one to actually get some numbers for this
> > > patchset -- I don't know of any benchmark that might stress this lock. 
> > > 
> > > However I first measured the amount of cycles it costs to start an
> > > Oracle DB and things went south with these changes. A simple 'startup
> > > immediate' calls hugetlb_fault() ~5000 times. For a vanilla kernel, this
> > > costs ~7.5 billion cycles and with this patchset it goes up to ~27.1
> > > billion. While there is naturally a fair amount of variation, these
> > > changes do seem to do more harm than good, at least in real world
> > > scenarios.
> > 
> > Hello,
> > 
> > I think that number of cycles is not proper to measure this patchset,
> > because cycles would be wasted by fault handling failure. Instead, it
> > targeted improved elapsed time. 

Fair enough, however the fact of the matter is that this approach does end up
hurting performance. Regarding total startup time, I saw hardly any
difference: with both vanilla and this patchset it takes close to
33.5 seconds.

> Could you tell me how long it
> > takes to fault all of it's hugepages?
> > 
> > Anyway, this order of magnitude still seems a problem. :/
> > 
> > I guess that cycles are wasted by zeroing hugepage in fault-path like as
> > Andrew pointed out.
> > 
> > I will send another patches to fix this problem.
> 
> Hello, Davidlohr.
> 
> Here goes the fix on top of this series.

... and with this patch we go from 27 down to 11 billion cycles, so this
approach still costs more than what we currently have. A perf stat shows
that an entire 1GB-hugepage-aware DB startup costs around ~30 billion
cycles on a vanilla kernel, so the impact of hugetlb_fault() is
definitely non-trivial and IMO worth considering.

Now, I took my old patchset (https://lkml.org/lkml/2013/7/26/299) for a
ride and things look quite a bit better, which is basically what Andrew was
suggesting previously anyway. With the hash table approach the startup
time did go down to ~25.1 seconds, which is a nice -24.7% time
reduction, with hugetlb_fault() consuming roughly 5.3 billion cycles.
This hash table was on an 80-core system, so since we do the power-of-two
round up we end up with 256 entries -- I think we can do better if we
enlarge it further, maybe something like a static 1024 or, probably
better, 8-ish * nr cpus.
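
For reference, a rough sketch of that hashed-mutex idea (the names, sizing
and hash below are illustrative, not the actual patch) would be something
like:

#include <linux/cpumask.h>
#include <linux/errno.h>
#include <linux/jhash.h>
#include <linux/log2.h>
#include <linux/mutex.h>
#include <linux/pagemap.h>
#include <linux/slab.h>

static struct mutex *htlb_fault_mutex_table;
static unsigned int num_fault_mutexes;

static int __init htlb_fault_mutex_init(void)
{
	unsigned int i;

	/* illustrative sizing: ~8 mutexes per cpu, rounded up to a power of two */
	num_fault_mutexes = roundup_pow_of_two(8 * num_possible_cpus());
	htlb_fault_mutex_table = kmalloc(num_fault_mutexes *
					 sizeof(struct mutex), GFP_KERNEL);
	if (!htlb_fault_mutex_table)
		return -ENOMEM;

	for (i = 0; i < num_fault_mutexes; i++)
		mutex_init(&htlb_fault_mutex_table[i]);
	return 0;
}

/* pick the mutex to serialize on from the faulting mapping and page index */
static unsigned int htlb_fault_mutex_hash(struct address_space *mapping,
					  pgoff_t idx)
{
	unsigned long key[2] = { (unsigned long)mapping, idx };

	return jhash2((u32 *)key, sizeof(key) / sizeof(u32), 0) &
	       (num_fault_mutexes - 1);
}

Faults on unrelated (mapping, index) pairs would then take different mutexes,
while two threads faulting the same page still serialize, which is the
property the instantiation mutex was providing.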

Thoughts? Is there any reason why we cannot go with this instead? Yes,
we still keep the mutex, but the approach is (1) proven better for
performance on real world workloads and (2) far less invasive. 

Thanks,
Davidlohr


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 03/14] mm, hugetlb: protect region tracking via newly introduced resv_map lock
  2013-12-24 12:00       ` David Gibson
@ 2014-01-06  0:12           ` Joonsoo Kim
  0 siblings, 0 replies; 90+ messages in thread
From: Joonsoo Kim @ 2014-01-06  0:12 UTC (permalink / raw)
  To: David Gibson
  Cc: Andrew Morton, Rik van Riel, Mel Gorman, Michal Hocko,
	Aneesh Kumar K.V, KAMEZAWA Hiroyuki, Hugh Dickins,
	Davidlohr Bueso, linux-mm, linux-kernel, Wanpeng Li,
	Naoya Horiguchi, Hillf Danton

On Tue, Dec 24, 2013 at 11:00:12PM +1100, David Gibson wrote:
> On Mon, Dec 23, 2013 at 10:05:17AM +0900, Joonsoo Kim wrote:
> > On Sun, Dec 22, 2013 at 12:58:19AM +1100, David Gibson wrote:
> > > On Wed, Dec 18, 2013 at 03:53:49PM +0900, Joonsoo Kim wrote:
> > > > There is a race condition if we map a same file on different processes.
> > > > Region tracking is protected by mmap_sem and hugetlb_instantiation_mutex.
> > > > When we do mmap, we don't grab a hugetlb_instantiation_mutex, but,
> > > > grab a mmap_sem. This doesn't prevent other process to modify region
> > > > structure, so it can be modified by two processes concurrently.
> > > > 
> > > > To solve this, I introduce a lock to resv_map and make region manipulation
> > > > function grab a lock before they do actual work. This makes region
> > > > tracking safe.
> > > 
> > > It's not clear to me if you're saying there is a list corruption race
> > > bug in the existing code, or only that there will be if the
> > > instantiation mutex goes away.
> > 
> > Hello,
> > 
> > The race exists in current code.
> > Currently, region tracking is protected by either down_write(&mm->mmap_sem) or
> > down_read(&mm->mmap_sem) + instantiation mutex. But if we map this hugetlbfs
> > file to two different processes, holding a mmap_sem doesn't have any impact on
> > the other process and concurrent access to data structure is possible.
> 
> Ouch.  In that case:
> 
> Acked-by: David Gibson <david@gibson.dropbear.id.au>
> 
> It would be really nice to add a testcase for this race to the
> libhugetlbfs testsuite.

Okay!
I will add it.

Thanks.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user
  2014-01-03 19:55               ` Davidlohr Bueso
@ 2014-01-06  0:19                 ` Joonsoo Kim
  -1 siblings, 0 replies; 90+ messages in thread
From: Joonsoo Kim @ 2014-01-06  0:19 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Mel Gorman, Andrew Morton, Rik van Riel, Michal Hocko,
	Aneesh Kumar K.V, KAMEZAWA Hiroyuki, Hugh Dickins,
	Davidlohr Bueso, David Gibson, linux-mm, linux-kernel,
	Wanpeng Li, Naoya Horiguchi, Hillf Danton, aswin

On Fri, Jan 03, 2014 at 11:55:45AM -0800, Davidlohr Bueso wrote:
> Hi Joonsoo,
> 
> Sorry about the delay...
> 
> On Mon, 2013-12-23 at 11:11 +0900, Joonsoo Kim wrote:
> > On Mon, Dec 23, 2013 at 09:44:38AM +0900, Joonsoo Kim wrote:
> > > On Fri, Dec 20, 2013 at 10:48:17PM -0800, Davidlohr Bueso wrote:
> > > > On Fri, 2013-12-20 at 14:01 +0000, Mel Gorman wrote:
> > > > > On Thu, Dec 19, 2013 at 05:02:02PM -0800, Andrew Morton wrote:
> > > > > > On Wed, 18 Dec 2013 15:53:59 +0900 Joonsoo Kim <iamjoonsoo.kim@lge.com> wrote:
> > > > > > 
> > > > > > > If parallel fault occur, we can fail to allocate a hugepage,
> > > > > > > because many threads dequeue a hugepage to handle a fault of same address.
> > > > > > > This makes reserved pool shortage just for a little while and this cause
> > > > > > > faulting thread who can get hugepages to get a SIGBUS signal.
> > > > > > > 
> > > > > > > To solve this problem, we already have a nice solution, that is,
> > > > > > > a hugetlb_instantiation_mutex. This blocks other threads to dive into
> > > > > > > a fault handler. This solve the problem clearly, but it introduce
> > > > > > > performance degradation, because it serialize all fault handling.
> > > > > > > 
> > > > > > > Now, I try to remove a hugetlb_instantiation_mutex to get rid of
> > > > > > > performance degradation.
> > > > > > 
> > > > > > So the whole point of the patch is to improve performance, but the
> > > > > > changelog doesn't include any performance measurements!
> > > > > > 
> > > > > 
> > > > > I don't really deal with hugetlbfs any more and I have not examined this
> > > > > series but I remember why I never really cared about this mutex. It wrecks
> > > > > fault scalability but AFAIK fault scalability almost never mattered for
> > > > > workloads using hugetlbfs.  The most common user of hugetlbfs by far is
> > > > > sysv shared memory. The memory is faulted early in the lifetime of the
> > > > > workload and after that it does not matter. At worst, it hurts application
> > > > > startup time but that is still poor motivation for putting a lot of work
> > > > > into removing the mutex.
> > > > 
> > > > Yep, important hugepage workloads initially pound heavily on this lock,
> > > > then it naturally decreases.
> > > > 
> > > > > Microbenchmarks will be able to trigger problems in this area but it'd
> > > > > be important to check if any workload that matters is actually hitting
> > > > > that problem.
> > > > 
> > > > I was thinking of writing one to actually get some numbers for this
> > > > patchset -- I don't know of any benchmark that might stress this lock. 
> > > > 
> > > > However I first measured the amount of cycles it costs to start an
> > > > Oracle DB and things went south with these changes. A simple 'startup
> > > > immediate' calls hugetlb_fault() ~5000 times. For a vanilla kernel, this
> > > > costs ~7.5 billion cycles and with this patchset it goes up to ~27.1
> > > > billion. While there is naturally a fair amount of variation, these
> > > > changes do seem to do more harm than good, at least in real world
> > > > scenarios.
> > > 
> > > Hello,
> > > 
> > > I think that number of cycles is not proper to measure this patchset,
> > > because cycles would be wasted by fault handling failure. Instead, it
> > > targeted improved elapsed time. 
> 
> Fair enough, however the fact of the matter is this approach does en up
> hurting performance. Regarding total startup time, I didn't see hardly
> any differences, with both vanilla and this patchset it takes close to
> 33.5 seconds.
> 
> > Could you tell me how long it
> > > takes to fault all of it's hugepages?
> > > 
> > > Anyway, this order of magnitude still seems a problem. :/
> > > 
> > > I guess that cycles are wasted by zeroing hugepage in fault-path like as
> > > Andrew pointed out.
> > > 
> > > I will send another patches to fix this problem.
> > 
> > Hello, Davidlohr.
> > 
> > Here goes the fix on top of this series.
> 
> ... and with this patch we go from 27 down to 11 billion cycles, so this
> approach still costs more than what we currently have. A perf stat shows
> that an entire 1Gb huge page aware DB startup costs around ~30 billion
> cycles on a vanilla kernel, so the impact of hugetlb_fault() is
> definitely non trivial and IMO worth considering.

Thanks a lot for your help. :)

> 
> Now, I took my old patchset (https://lkml.org/lkml/2013/7/26/299) for a
> ride and things do look quite better, which is basically what Andrew was
> suggesting previously anyway. With the hash table approach the startup
> time did go down to ~25.1 seconds, which is a nice -24.7% time
> reduction, with hugetlb_fault() consuming roughly 5.3 billion cycles.
> This hash table was on a 80 core system, so since we do the power of two
> round up we end up with 256 entries -- I think we can do better if we
> enlarger further, maybe something like statically 1024, or probably
> better, 8-ish * nr cpus.
> 
> Thoughts? Is there any reason why we cannot go with this instead? Yes,
> we still keep the mutex, but the approach is (1) proven better for
> performance on real world workloads and (2) far less invasive. 

I have no further ideas for improving my patches right now, so I agree with
your approach. When I reviewed your approach last time, I found one race
condition, and at that time I couldn't think of a solution for it. If you
resend it, I will review it and think about it again.

Thanks.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user
  2014-01-06  0:19                 ` Joonsoo Kim
@ 2014-01-06 12:19                   ` Davidlohr Bueso
  -1 siblings, 0 replies; 90+ messages in thread
From: Davidlohr Bueso @ 2014-01-06 12:19 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Mel Gorman, Andrew Morton, Rik van Riel, Michal Hocko,
	Aneesh Kumar K.V, KAMEZAWA Hiroyuki, Hugh Dickins,
	Davidlohr Bueso, David Gibson, linux-mm, linux-kernel,
	Wanpeng Li, Naoya Horiguchi, Hillf Danton, aswin

On Mon, 2014-01-06 at 09:19 +0900, Joonsoo Kim wrote:
> On Fri, Jan 03, 2014 at 11:55:45AM -0800, Davidlohr Bueso wrote:
> > Hi Joonsoo,
> > 
> > Sorry about the delay...
> > 
> > On Mon, 2013-12-23 at 11:11 +0900, Joonsoo Kim wrote:
> > > On Mon, Dec 23, 2013 at 09:44:38AM +0900, Joonsoo Kim wrote:
> > > > On Fri, Dec 20, 2013 at 10:48:17PM -0800, Davidlohr Bueso wrote:
> > > > > On Fri, 2013-12-20 at 14:01 +0000, Mel Gorman wrote:
> > > > > > On Thu, Dec 19, 2013 at 05:02:02PM -0800, Andrew Morton wrote:
> > > > > > > On Wed, 18 Dec 2013 15:53:59 +0900 Joonsoo Kim <iamjoonsoo.kim@lge.com> wrote:
> > > > > > > 
> > > > > > > > If parallel fault occur, we can fail to allocate a hugepage,
> > > > > > > > because many threads dequeue a hugepage to handle a fault of same address.
> > > > > > > > This makes reserved pool shortage just for a little while and this cause
> > > > > > > > faulting thread who can get hugepages to get a SIGBUS signal.
> > > > > > > > 
> > > > > > > > To solve this problem, we already have a nice solution, that is,
> > > > > > > > a hugetlb_instantiation_mutex. This blocks other threads to dive into
> > > > > > > > a fault handler. This solve the problem clearly, but it introduce
> > > > > > > > performance degradation, because it serialize all fault handling.
> > > > > > > > 
> > > > > > > > Now, I try to remove a hugetlb_instantiation_mutex to get rid of
> > > > > > > > performance degradation.
> > > > > > > 
> > > > > > > So the whole point of the patch is to improve performance, but the
> > > > > > > changelog doesn't include any performance measurements!
> > > > > > > 
> > > > > > 
> > > > > > I don't really deal with hugetlbfs any more and I have not examined this
> > > > > > series but I remember why I never really cared about this mutex. It wrecks
> > > > > > fault scalability but AFAIK fault scalability almost never mattered for
> > > > > > workloads using hugetlbfs.  The most common user of hugetlbfs by far is
> > > > > > sysv shared memory. The memory is faulted early in the lifetime of the
> > > > > > workload and after that it does not matter. At worst, it hurts application
> > > > > > startup time but that is still poor motivation for putting a lot of work
> > > > > > into removing the mutex.
> > > > > 
> > > > > Yep, important hugepage workloads initially pound heavily on this lock,
> > > > > then it naturally decreases.
> > > > > 
> > > > > > Microbenchmarks will be able to trigger problems in this area but it'd
> > > > > > be important to check if any workload that matters is actually hitting
> > > > > > that problem.
> > > > > 
> > > > > I was thinking of writing one to actually get some numbers for this
> > > > > patchset -- I don't know of any benchmark that might stress this lock. 
> > > > > 
> > > > > However I first measured the amount of cycles it costs to start an
> > > > > Oracle DB and things went south with these changes. A simple 'startup
> > > > > immediate' calls hugetlb_fault() ~5000 times. For a vanilla kernel, this
> > > > > costs ~7.5 billion cycles and with this patchset it goes up to ~27.1
> > > > > billion. While there is naturally a fair amount of variation, these
> > > > > changes do seem to do more harm than good, at least in real world
> > > > > scenarios.
> > > > 
> > > > Hello,
> > > > 
> > > > I think that number of cycles is not proper to measure this patchset,
> > > > because cycles would be wasted by fault handling failure. Instead, it
> > > > targeted improved elapsed time. 
> > 
> > Fair enough, however the fact of the matter is this approach does en up
> > hurting performance. Regarding total startup time, I didn't see hardly
> > any differences, with both vanilla and this patchset it takes close to
> > 33.5 seconds.
> > 
> > > Could you tell me how long it
> > > > takes to fault all of it's hugepages?
> > > > 
> > > > Anyway, this order of magnitude still seems a problem. :/
> > > > 
> > > > I guess that cycles are wasted by zeroing hugepage in fault-path like as
> > > > Andrew pointed out.
> > > > 
> > > > I will send another patches to fix this problem.
> > > 
> > > Hello, Davidlohr.
> > > 
> > > Here goes the fix on top of this series.
> > 
> > ... and with this patch we go from 27 down to 11 billion cycles, so this
> > approach still costs more than what we currently have. A perf stat shows
> > that an entire 1Gb huge page aware DB startup costs around ~30 billion
> > cycles on a vanilla kernel, so the impact of hugetlb_fault() is
> > definitely non trivial and IMO worth considering.
> 
> Really thanks for your help. :)
> 
> > 
> > Now, I took my old patchset (https://lkml.org/lkml/2013/7/26/299) for a
> > ride and things do look quite better, which is basically what Andrew was
> > suggesting previously anyway. With the hash table approach the startup
> > time did go down to ~25.1 seconds, which is a nice -24.7% time
> > reduction, with hugetlb_fault() consuming roughly 5.3 billion cycles.
> > This hash table was on a 80 core system, so since we do the power of two
> > round up we end up with 256 entries -- I think we can do better if we
> > enlarger further, maybe something like statically 1024, or probably
> > better, 8-ish * nr cpus.
> > 
> > Thoughts? Is there any reason why we cannot go with this instead? Yes,
> > we still keep the mutex, but the approach is (1) proven better for
> > performance on real world workloads and (2) far less invasive. 
> 
> I have no more ideas for improving my patches now, so I agree with your approach.
> When I reviewed your approach last time, I found one race condition; at the
> time, I couldn't think of a solution for it. If you resend the series, I will
> review it and think about that race again.

Hmm, so how do you want to play this? Your first 3 patches basically
deal (more elegantly) with my patch 1/2, where I was just going to
change the lock -- we had agreed that serializing the region handling
with a spinlock was better than with a sleeping lock, as the critical
section is small enough and we just had to deal with that trivial
kmalloc case in region_chg(). So I can pick up your patches 1, 2 & 3
and then add the instantiation mutex hash table change on top -- sounds
good?
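
To be a bit more concrete, here is roughly what I have in mind for the
hash table. Completely untested sketch -- the sizing, the names and the
hash function are placeholders only, not what a real patch would end up
using:

#include <linux/cpumask.h>
#include <linux/fs.h>
#include <linux/hash.h>
#include <linux/init.h>
#include <linux/log2.h>
#include <linux/mutex.h>
#include <linux/slab.h>

static struct mutex *htlb_fault_mutex_table;
static unsigned int num_fault_mutexes;

static int __init htlb_fault_mutex_init(void)
{
	unsigned int i;

	/* e.g. ~8 mutexes per cpu, rounded up to a power of two */
	num_fault_mutexes = roundup_pow_of_two(8 * num_possible_cpus());
	htlb_fault_mutex_table = kmalloc(num_fault_mutexes * sizeof(struct mutex),
					 GFP_KERNEL);
	if (!htlb_fault_mutex_table)
		return -ENOMEM;

	for (i = 0; i < num_fault_mutexes; i++)
		mutex_init(&htlb_fault_mutex_table[i]);
	return 0;
}

/* one mutex per (mapping, index): only faults on the same page serialize */
static u32 htlb_fault_mutex_hash(struct address_space *mapping, pgoff_t idx)
{
	return hash_long((unsigned long)mapping ^ idx,
			 ilog2(num_fault_mutexes));
}

/*
 * hugetlb_fault() would then bracket the instantiation with:
 *
 *	hash = htlb_fault_mutex_hash(mapping, idx);
 *	mutex_lock(&htlb_fault_mutex_table[hash]);
 *	...
 *	mutex_unlock(&htlb_fault_mutex_table[hash]);
 */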

Thanks,
Davidlohr 


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user
  2014-01-06 12:19                   ` Davidlohr Bueso
@ 2014-01-07  1:57                     ` Joonsoo Kim
  -1 siblings, 0 replies; 90+ messages in thread
From: Joonsoo Kim @ 2014-01-07  1:57 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Mel Gorman, Andrew Morton, Rik van Riel, Michal Hocko,
	Aneesh Kumar K.V, KAMEZAWA Hiroyuki, Hugh Dickins,
	Davidlohr Bueso, David Gibson, linux-mm, linux-kernel,
	Wanpeng Li, Naoya Horiguchi, Hillf Danton, aswin

On Mon, Jan 06, 2014 at 04:19:05AM -0800, Davidlohr Bueso wrote:
> On Mon, 2014-01-06 at 09:19 +0900, Joonsoo Kim wrote:
> > On Fri, Jan 03, 2014 at 11:55:45AM -0800, Davidlohr Bueso wrote:
> > > Hi Joonsoo,
> > > 
> > > Sorry about the delay...
> > > 
> > > On Mon, 2013-12-23 at 11:11 +0900, Joonsoo Kim wrote:
> > > > On Mon, Dec 23, 2013 at 09:44:38AM +0900, Joonsoo Kim wrote:
> > > > > On Fri, Dec 20, 2013 at 10:48:17PM -0800, Davidlohr Bueso wrote:
> > > > > > On Fri, 2013-12-20 at 14:01 +0000, Mel Gorman wrote:
> > > > > > > On Thu, Dec 19, 2013 at 05:02:02PM -0800, Andrew Morton wrote:
> > > > > > > > On Wed, 18 Dec 2013 15:53:59 +0900 Joonsoo Kim <iamjoonsoo.kim@lge.com> wrote:
> > > > > > > > 
> > > > > > > > > If parallel fault occur, we can fail to allocate a hugepage,
> > > > > > > > > because many threads dequeue a hugepage to handle a fault of same address.
> > > > > > > > > This makes reserved pool shortage just for a little while and this cause
> > > > > > > > > faulting thread who can get hugepages to get a SIGBUS signal.
> > > > > > > > > 
> > > > > > > > > To solve this problem, we already have a nice solution, that is,
> > > > > > > > > a hugetlb_instantiation_mutex. This blocks other threads to dive into
> > > > > > > > > a fault handler. This solve the problem clearly, but it introduce
> > > > > > > > > performance degradation, because it serialize all fault handling.
> > > > > > > > > 
> > > > > > > > > Now, I try to remove a hugetlb_instantiation_mutex to get rid of
> > > > > > > > > performance degradation.
> > > > > > > > 
> > > > > > > > So the whole point of the patch is to improve performance, but the
> > > > > > > > changelog doesn't include any performance measurements!
> > > > > > > > 
> > > > > > > 
> > > > > > > I don't really deal with hugetlbfs any more and I have not examined this
> > > > > > > series but I remember why I never really cared about this mutex. It wrecks
> > > > > > > fault scalability but AFAIK fault scalability almost never mattered for
> > > > > > > workloads using hugetlbfs.  The most common user of hugetlbfs by far is
> > > > > > > sysv shared memory. The memory is faulted early in the lifetime of the
> > > > > > > workload and after that it does not matter. At worst, it hurts application
> > > > > > > startup time but that is still poor motivation for putting a lot of work
> > > > > > > into removing the mutex.
> > > > > > 
> > > > > > Yep, important hugepage workloads initially pound heavily on this lock,
> > > > > > then it naturally decreases.
> > > > > > 
> > > > > > > Microbenchmarks will be able to trigger problems in this area but it'd
> > > > > > > be important to check if any workload that matters is actually hitting
> > > > > > > that problem.
> > > > > > 
> > > > > > I was thinking of writing one to actually get some numbers for this
> > > > > > patchset -- I don't know of any benchmark that might stress this lock. 
> > > > > > 
> > > > > > However I first measured the amount of cycles it costs to start an
> > > > > > Oracle DB and things went south with these changes. A simple 'startup
> > > > > > immediate' calls hugetlb_fault() ~5000 times. For a vanilla kernel, this
> > > > > > costs ~7.5 billion cycles and with this patchset it goes up to ~27.1
> > > > > > billion. While there is naturally a fair amount of variation, these
> > > > > > changes do seem to do more harm than good, at least in real world
> > > > > > scenarios.
> > > > > 
> > > > > Hello,
> > > > > 
> > > > > I think that number of cycles is not proper to measure this patchset,
> > > > > because cycles would be wasted by fault handling failure. Instead, it
> > > > > targeted improved elapsed time. 
> > > 
> > > Fair enough, however the fact of the matter is this approach does en up
> > > hurting performance. Regarding total startup time, I didn't see hardly
> > > any differences, with both vanilla and this patchset it takes close to
> > > 33.5 seconds.
> > > 
> > > > Could you tell me how long it
> > > > > takes to fault all of it's hugepages?
> > > > > 
> > > > > Anyway, this order of magnitude still seems a problem. :/
> > > > > 
> > > > > I guess that cycles are wasted by zeroing hugepage in fault-path like as
> > > > > Andrew pointed out.
> > > > > 
> > > > > I will send another patches to fix this problem.
> > > > 
> > > > Hello, Davidlohr.
> > > > 
> > > > Here goes the fix on top of this series.
> > > 
> > > ... and with this patch we go from 27 down to 11 billion cycles, so this
> > > approach still costs more than what we currently have. A perf stat shows
> > > that an entire 1Gb huge page aware DB startup costs around ~30 billion
> > > cycles on a vanilla kernel, so the impact of hugetlb_fault() is
> > > definitely non trivial and IMO worth considering.
> > 
> > Really thanks for your help. :)
> > 
> > > 
> > > Now, I took my old patchset (https://lkml.org/lkml/2013/7/26/299) for a
> > > ride and things do look quite better, which is basically what Andrew was
> > > suggesting previously anyway. With the hash table approach the startup
> > > time did go down to ~25.1 seconds, which is a nice -24.7% time
> > > reduction, with hugetlb_fault() consuming roughly 5.3 billion cycles.
> > > This hash table was on a 80 core system, so since we do the power of two
> > > round up we end up with 256 entries -- I think we can do better if we
> > > enlarger further, maybe something like statically 1024, or probably
> > > better, 8-ish * nr cpus.
> > > 
> > > Thoughts? Is there any reason why we cannot go with this instead? Yes,
> > > we still keep the mutex, but the approach is (1) proven better for
> > > performance on real world workloads and (2) far less invasive. 
> > 
> > I have no more ideas for improving my patches now, so I agree with your approach.
> > When I reviewed your approach last time, I found one race condition; at the
> > time, I couldn't think of a solution for it. If you resend the series, I will
> > review it and think about that race again.
> 
> Hmm, so how do you want to play this? Your first 3 patches basically
> deal (more elegantly) with my patch 1/2, where I was just going to
> change the lock -- we had agreed that serializing the region handling
> with a spinlock was better than with a sleeping lock, as the critical
> section is small enough and we just had to deal with that trivial
> kmalloc case in region_chg(). So I can pick up your patches 1, 2 & 3
> and then add the instantiation mutex hash table change on top -- sounds
> good?

Hello,

If Andrew agrees, it would be great to merge patches 1-7 into mainline
before your mutex approach. Some of them are clean-up patches and, IMO,
they make the code more readable and maintainable, so they are worth
merging separately.

If he disagrees, taking patches 1-3 and then adding your approach is fine
with me.

Andrew, what do you think?

Thanks.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user
  2014-01-07  1:57                     ` Joonsoo Kim
@ 2014-01-07  2:36                       ` Davidlohr Bueso
  -1 siblings, 0 replies; 90+ messages in thread
From: Davidlohr Bueso @ 2014-01-07  2:36 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Mel Gorman, Andrew Morton, Rik van Riel, Michal Hocko,
	Aneesh Kumar K.V, KAMEZAWA Hiroyuki, Hugh Dickins,
	Davidlohr Bueso, David Gibson, linux-mm, linux-kernel,
	Wanpeng Li, Naoya Horiguchi, Hillf Danton, aswin

On Tue, 2014-01-07 at 10:57 +0900, Joonsoo Kim wrote:
> On Mon, Jan 06, 2014 at 04:19:05AM -0800, Davidlohr Bueso wrote:
> > On Mon, 2014-01-06 at 09:19 +0900, Joonsoo Kim wrote:
> > > On Fri, Jan 03, 2014 at 11:55:45AM -0800, Davidlohr Bueso wrote:
> > > > Hi Joonsoo,
> > > > 
> > > > Sorry about the delay...
> > > > 
> > > > On Mon, 2013-12-23 at 11:11 +0900, Joonsoo Kim wrote:
> > > > > On Mon, Dec 23, 2013 at 09:44:38AM +0900, Joonsoo Kim wrote:
> > > > > > On Fri, Dec 20, 2013 at 10:48:17PM -0800, Davidlohr Bueso wrote:
> > > > > > > On Fri, 2013-12-20 at 14:01 +0000, Mel Gorman wrote:
> > > > > > > > On Thu, Dec 19, 2013 at 05:02:02PM -0800, Andrew Morton wrote:
> > > > > > > > > On Wed, 18 Dec 2013 15:53:59 +0900 Joonsoo Kim <iamjoonsoo.kim@lge.com> wrote:
> > > > > > > > > 
> > > > > > > > > > If parallel fault occur, we can fail to allocate a hugepage,
> > > > > > > > > > because many threads dequeue a hugepage to handle a fault of same address.
> > > > > > > > > > This makes reserved pool shortage just for a little while and this cause
> > > > > > > > > > faulting thread who can get hugepages to get a SIGBUS signal.
> > > > > > > > > > 
> > > > > > > > > > To solve this problem, we already have a nice solution, that is,
> > > > > > > > > > a hugetlb_instantiation_mutex. This blocks other threads to dive into
> > > > > > > > > > a fault handler. This solve the problem clearly, but it introduce
> > > > > > > > > > performance degradation, because it serialize all fault handling.
> > > > > > > > > > 
> > > > > > > > > > Now, I try to remove a hugetlb_instantiation_mutex to get rid of
> > > > > > > > > > performance degradation.
> > > > > > > > > 
> > > > > > > > > So the whole point of the patch is to improve performance, but the
> > > > > > > > > changelog doesn't include any performance measurements!
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > I don't really deal with hugetlbfs any more and I have not examined this
> > > > > > > > series but I remember why I never really cared about this mutex. It wrecks
> > > > > > > > fault scalability but AFAIK fault scalability almost never mattered for
> > > > > > > > workloads using hugetlbfs.  The most common user of hugetlbfs by far is
> > > > > > > > sysv shared memory. The memory is faulted early in the lifetime of the
> > > > > > > > workload and after that it does not matter. At worst, it hurts application
> > > > > > > > startup time but that is still poor motivation for putting a lot of work
> > > > > > > > into removing the mutex.
> > > > > > > 
> > > > > > > Yep, important hugepage workloads initially pound heavily on this lock,
> > > > > > > then it naturally decreases.
> > > > > > > 
> > > > > > > > Microbenchmarks will be able to trigger problems in this area but it'd
> > > > > > > > be important to check if any workload that matters is actually hitting
> > > > > > > > that problem.
> > > > > > > 
> > > > > > > I was thinking of writing one to actually get some numbers for this
> > > > > > > patchset -- I don't know of any benchmark that might stress this lock. 
> > > > > > > 
> > > > > > > However I first measured the amount of cycles it costs to start an
> > > > > > > Oracle DB and things went south with these changes. A simple 'startup
> > > > > > > immediate' calls hugetlb_fault() ~5000 times. For a vanilla kernel, this
> > > > > > > costs ~7.5 billion cycles and with this patchset it goes up to ~27.1
> > > > > > > billion. While there is naturally a fair amount of variation, these
> > > > > > > changes do seem to do more harm than good, at least in real world
> > > > > > > scenarios.
> > > > > > 
> > > > > > Hello,
> > > > > > 
> > > > > > I think that number of cycles is not proper to measure this patchset,
> > > > > > because cycles would be wasted by fault handling failure. Instead, it
> > > > > > targeted improved elapsed time. 
> > > > 
> > > > Fair enough, however the fact of the matter is this approach does en up
> > > > hurting performance. Regarding total startup time, I didn't see hardly
> > > > any differences, with both vanilla and this patchset it takes close to
> > > > 33.5 seconds.
> > > > 
> > > > > Could you tell me how long it
> > > > > > takes to fault all of it's hugepages?
> > > > > > 
> > > > > > Anyway, this order of magnitude still seems a problem. :/
> > > > > > 
> > > > > > I guess that cycles are wasted by zeroing hugepage in fault-path like as
> > > > > > Andrew pointed out.
> > > > > > 
> > > > > > I will send another patches to fix this problem.
> > > > > 
> > > > > Hello, Davidlohr.
> > > > > 
> > > > > Here goes the fix on top of this series.
> > > > 
> > > > ... and with this patch we go from 27 down to 11 billion cycles, so this
> > > > approach still costs more than what we currently have. A perf stat shows
> > > > that an entire 1Gb huge page aware DB startup costs around ~30 billion
> > > > cycles on a vanilla kernel, so the impact of hugetlb_fault() is
> > > > definitely non trivial and IMO worth considering.
> > > 
> > > Really thanks for your help. :)
> > > 
> > > > 
> > > > Now, I took my old patchset (https://lkml.org/lkml/2013/7/26/299) for a
> > > > ride and things do look quite better, which is basically what Andrew was
> > > > suggesting previously anyway. With the hash table approach the startup
> > > > time did go down to ~25.1 seconds, which is a nice -24.7% time
> > > > reduction, with hugetlb_fault() consuming roughly 5.3 billion cycles.
> > > > This hash table was on a 80 core system, so since we do the power of two
> > > > round up we end up with 256 entries -- I think we can do better if we
> > > > enlarger further, maybe something like statically 1024, or probably
> > > > better, 8-ish * nr cpus.
> > > > 
> > > > Thoughts? Is there any reason why we cannot go with this instead? Yes,
> > > > we still keep the mutex, but the approach is (1) proven better for
> > > > performance on real world workloads and (2) far less invasive. 
> > > 
> > > I have no more ideas for improving my patches now, so I agree with your approach.
> > > When I reviewed your approach last time, I found one race condition; at the
> > > time, I couldn't think of a solution for it. If you resend the series, I will
> > > review it and think about that race again.
> > 
> > Hmm, so how do you want to play this? Your first 3 patches basically
> > deal (more elegantly) with my patch 1/2, where I was just going to
> > change the lock -- we had agreed that serializing the region handling
> > with a spinlock was better than with a sleeping lock, as the critical
> > section is small enough and we just had to deal with that trivial
> > kmalloc case in region_chg(). So I can pick up your patches 1, 2 & 3
> > and then add the instantiation mutex hash table change on top -- sounds
> > good?
> 
> Hello,
> 
> If Andrew agrees, it would be great to merge patches 1-7 into mainline
> before your mutex approach. Some of them are clean-up patches and, IMO,
> they make the code more readable and maintainable, so they are worth
> merging separately.

Fine by me.


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 01/14] mm, hugetlb: unify region structure handling
  2013-12-18  6:53   ` Joonsoo Kim
@ 2014-01-07  2:37     ` Davidlohr Bueso
  -1 siblings, 0 replies; 90+ messages in thread
From: Davidlohr Bueso @ 2014-01-07  2:37 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Rik van Riel, Mel Gorman, Michal Hocko,
	Aneesh Kumar K.V, KAMEZAWA Hiroyuki, Hugh Dickins,
	Davidlohr Bueso, David Gibson, linux-mm, linux-kernel,
	Joonsoo Kim, Wanpeng Li, Naoya Horiguchi, Hillf Danton

On Wed, 2013-12-18 at 15:53 +0900, Joonsoo Kim wrote:
> Currently, to track a reserved and allocated region, we use two different
> ways for MAP_SHARED and MAP_PRIVATE. For MAP_SHARED, we use
> address_mapping's private_list and, for MAP_PRIVATE, we use a resv_map.
> Now, we are preparing to change a coarse grained lock which protect
> a region structure to fine grained lock, and this difference hinder it.
> So, before changing it, unify region structure handling.
> 
> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

Reviewed-by: Davidlohr Bueso <davidlohr@hp.com>
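
As an aside for anyone following the thread, my reading of the unification
is roughly the following. Illustrative sketch only, not the patch itself --
the field and helper names here are mine and may not match the actual diff:

#include <linux/fs.h>
#include <linux/kref.h>
#include <linux/list.h>
#include <linux/mm.h>

/* one object for both mapping types, instead of a bare private_list */
struct resv_map {
	struct kref refs;
	struct list_head regions;
};

/* hypothetical accessor; the real lookup lives in mm/hugetlb.c */
static struct resv_map *vma_resv_map(struct vm_area_struct *vma)
{
	if (vma->vm_flags & VM_MAYSHARE) {
		/* shared mapping: resv_map hangs off the hugetlbfs mapping */
		struct address_space *mapping = vma->vm_file->f_mapping;

		return mapping->private_data;
	}
	/*
	 * private mapping: still carried in the vma (the real code also
	 * keeps some flag bits in the low bits of this pointer)
	 */
	return vma->vm_private_data;
}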


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 02/14] mm, hugetlb: region manipulation functions take resv_map rather list_head
  2013-12-18  6:53   ` Joonsoo Kim
@ 2014-01-07  2:39     ` Davidlohr Bueso
  -1 siblings, 0 replies; 90+ messages in thread
From: Davidlohr Bueso @ 2014-01-07  2:39 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Rik van Riel, Mel Gorman, Michal Hocko,
	Aneesh Kumar K.V, KAMEZAWA Hiroyuki, Hugh Dickins,
	Davidlohr Bueso, David Gibson, linux-mm, linux-kernel,
	Joonsoo Kim, Wanpeng Li, Naoya Horiguchi, Hillf Danton

On Wed, 2013-12-18 at 15:53 +0900, Joonsoo Kim wrote:
> To change a protection method for region tracking to find grained one,
> we pass the resv_map, instead of list_head, to region manipulation
> functions. This doesn't introduce any functional change, and it is just
> for preparing a next step.
> 
> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

Reviewed-by: Davidlohr Bueso <davidlohr@hp.com>
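
And for completeness, the mechanical shape of this one -- sketch only, the
real prototypes and the full list walk are in the diff, I'm just
illustrating the argument change:

#include <linux/list.h>

/* struct resv_map as introduced in patch 01 */

/* before: the region helpers operated on a bare list head, e.g.	*/
/*	static long region_chg(struct list_head *head, long f, long t);	*/

/* after: they take the owning resv_map and find the list inside it */
static long region_chg(struct resv_map *resv, long f, long t)
{
	struct list_head *head = &resv->regions;

	if (list_empty(head))
		return t - f;	/* nothing reserved yet: charge the whole range */

	/* ... the rest of the existing region-list walk, unchanged ... */
	return 0;
}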


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 03/14] mm, hugetlb: protect region tracking via newly introduced resv_map lock
  2013-12-18  6:53   ` Joonsoo Kim
@ 2014-01-07  2:39     ` Davidlohr Bueso
  -1 siblings, 0 replies; 90+ messages in thread
From: Davidlohr Bueso @ 2014-01-07  2:39 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Rik van Riel, Mel Gorman, Michal Hocko,
	Aneesh Kumar K.V, KAMEZAWA Hiroyuki, Hugh Dickins,
	Davidlohr Bueso, David Gibson, linux-mm, linux-kernel,
	Joonsoo Kim, Wanpeng Li, Naoya Horiguchi, Hillf Danton

On Wed, 2013-12-18 at 15:53 +0900, Joonsoo Kim wrote:
> There is a race condition if we map a same file on different processes.
> Region tracking is protected by mmap_sem and hugetlb_instantiation_mutex.
> When we do mmap, we don't grab a hugetlb_instantiation_mutex, but,
> grab a mmap_sem. This doesn't prevent other process to modify region
> structure, so it can be modified by two processes concurrently.
> 
> To solve this, I introduce a lock to resv_map and make region manipulation
> function grab a lock before they do actual work. This makes region
> tracking safe.
> 
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

Reviewed-by: Davidlohr Bueso <davidlohr@hp.com>
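
One more aside while I have my reviewer hat on: the locking shape this
introduces, as I read it. Sketch only -- the exact field placement and the
region_chg() allocation handling are in the patch, so double-check the
diff rather than this:

#include <linux/kref.h>
#include <linux/list.h>
#include <linux/spinlock.h>

/* illustrative sketch, not the literal patch */
struct resv_map {
	struct kref refs;
	spinlock_t lock;		/* new: protects the regions list */
	struct list_head regions;
};

static long region_add(struct resv_map *resv, long f, long t)
{
	spin_lock(&resv->lock);
	/* ... merge/extend entries on resv->regions, as before ... */
	spin_unlock(&resv->lock);
	return 0;
}

/*
 * region_chg() is the interesting one: it may need to allocate a new
 * file_region, which can sleep, so the allocation has to happen outside
 * the spinlock (drop the lock, kmalloc(GFP_KERNEL), retake the lock and
 * recheck) rather than under it.
 */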


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user
  2014-01-07  2:36                       ` Davidlohr Bueso
@ 2014-01-15  3:08                         ` David Rientjes
  -1 siblings, 0 replies; 90+ messages in thread
From: David Rientjes @ 2014-01-15  3:08 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Joonsoo Kim, Mel Gorman, Andrew Morton, Rik van Riel,
	Michal Hocko, Aneesh Kumar K.V, KAMEZAWA Hiroyuki, Hugh Dickins,
	Davidlohr Bueso, David Gibson, linux-mm, linux-kernel,
	Wanpeng Li, Naoya Horiguchi, Hillf Danton, aswin

On Mon, 6 Jan 2014, Davidlohr Bueso wrote:

> > If Andrew agrees, it would be great to merge patches 1-7 into mainline
> > before your mutex approach. Some of them are clean-up patches and, IMO,
> > they make the code more readable and maintainable, so they are worth
> > merging separately.
> 
> Fine by me.
> 

It appears that patches 1-7 are still missing from linux-next; would you
mind posting them in a series together with your approach?

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user
  2014-01-15  3:08                         ` David Rientjes
@ 2014-01-15  4:37                           ` Davidlohr Bueso
  -1 siblings, 0 replies; 90+ messages in thread
From: Davidlohr Bueso @ 2014-01-15  4:37 UTC (permalink / raw)
  To: David Rientjes
  Cc: Joonsoo Kim, Mel Gorman, Andrew Morton, Rik van Riel,
	Michal Hocko, Aneesh Kumar K.V, KAMEZAWA Hiroyuki, Hugh Dickins,
	Davidlohr Bueso, David Gibson, linux-mm, linux-kernel,
	Wanpeng Li, Naoya Horiguchi, Hillf Danton, aswin

On Tue, 2014-01-14 at 19:08 -0800, David Rientjes wrote:
> On Mon, 6 Jan 2014, Davidlohr Bueso wrote:
> 
> > > If Andrew agrees, it would be great to merge patches 1-7 into mainline
> > > before your mutex approach. Some of them are clean-up patches and, IMO,
> > > they make the code more readable and maintainable, so they are worth
> > > merging separately.
> > 
> > Fine by me.
> > 
> 
> It appears like patches 1-7 are still missing from linux-next, would you 
> mind posting them in a series with your approach?

I haven't looked much into patches 4-7, but at least the first three are
ok. I was waiting for Andrew to take all seven for linux-next and then
I'd rebase my approach on top. Anyway, unless Andrew has any
preferences, if by later this week they're not picked up, I'll resend
everything.


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user
  2014-01-15  4:37                           ` Davidlohr Bueso
@ 2014-01-15  4:56                             ` Andrew Morton
  -1 siblings, 0 replies; 90+ messages in thread
From: Andrew Morton @ 2014-01-15  4:56 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: David Rientjes, Joonsoo Kim, Mel Gorman, Rik van Riel,
	Michal Hocko, Aneesh Kumar K.V, KAMEZAWA Hiroyuki, Hugh Dickins,
	Davidlohr Bueso, David Gibson, linux-mm, linux-kernel,
	Wanpeng Li, Naoya Horiguchi, Hillf Danton, aswin

On Tue, 14 Jan 2014 20:37:49 -0800 Davidlohr Bueso <davidlohr@hp.com> wrote:

> On Tue, 2014-01-14 at 19:08 -0800, David Rientjes wrote:
> > On Mon, 6 Jan 2014, Davidlohr Bueso wrote:
> > 
> > > > If Andrew agrees, it would be great to merge patches 1-7 into mainline
> > > > before your mutex approach. Some of them are clean-up patches and, IMO,
> > > > they make the code more readable and maintainable, so they are worth
> > > > merging separately.
> > > 
> > > Fine by me.
> > > 
> > 
> > It appears like patches 1-7 are still missing from linux-next, would you 
> > mind posting them in a series with your approach?
> 
> I haven't looked much into patches 4-7, but at least the first three are
> ok. I was waiting for Andrew to take all seven for linux-next and then
> I'd rebase my approach on top. Anyway, unless Andrew has any
> preferences, if by later this week they're not picked up, I'll resend
> everything.

Well, we're mainly looking for bugfixes this late in the cycle.
"[PATCH v3 03/14] mm, hugetlb: protect region tracking via newly
introduced resv_map lock" fixes a bug, but I'd assumed that it depended
on earlier patches.  If we think that one is serious then it would be
better to cook up a minimal fix which is backportable into 3.12 and
earlier?

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user
  2014-01-15  4:56                             ` Andrew Morton
@ 2014-01-15 20:47                               ` Davidlohr Bueso
  -1 siblings, 0 replies; 90+ messages in thread
From: Davidlohr Bueso @ 2014-01-15 20:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Rientjes, Joonsoo Kim, Mel Gorman, Rik van Riel,
	Michal Hocko, Aneesh Kumar K.V, KAMEZAWA Hiroyuki, Hugh Dickins,
	Davidlohr Bueso, David Gibson, linux-mm, linux-kernel,
	Wanpeng Li, Naoya Horiguchi, Hillf Danton, aswin

On Tue, 2014-01-14 at 20:56 -0800, Andrew Morton wrote:
> On Tue, 14 Jan 2014 20:37:49 -0800 Davidlohr Bueso <davidlohr@hp.com> wrote:
> 
> > On Tue, 2014-01-14 at 19:08 -0800, David Rientjes wrote:
> > > On Mon, 6 Jan 2014, Davidlohr Bueso wrote:
> > > 
> > > > > If Andrew agrees, it would be great to merge patches 1-7 into mainline
> > > > > before your mutex approach. Some of them are clean-up patches and, IMO,
> > > > > they make the code more readable and maintainable, so they are worth
> > > > > merging separately.
> > > > 
> > > > Fine by me.
> > > > 
> > > 
> > > It appears like patches 1-7 are still missing from linux-next, would you 
> > > mind posting them in a series with your approach?
> > 
> > I haven't looked much into patches 4-7, but at least the first three are
> > ok. I was waiting for Andrew to take all seven for linux-next and then
> > I'd rebase my approach on top. Anyway, unless Andrew has any
> > preferences, if by later this week they're not picked up, I'll resend
> > everything.
> 
> Well, we're mainly looking for bugfixes this late in the cycle.
> "[PATCH v3 03/14] mm, hugetlb: protect region tracking via newly
> introduced resv_map lock" fixes a bug, but I'd assumed that it depended
> on earlier patches. 

It doesn't seem to depend on anything. Patches 1-7 all apply cleanly on
linux-next; the last change to mm/hugetlb.c was commit 3ebac7fa (mm:
dump page when hitting a VM_BUG_ON using VM_BUG_ON_PAGE).

>  If we think that one is serious then it would be
> better to cook up a minimal fix which is backportable into 3.12 and
> earlier?

I don't think it's too serious, afaik it's a theoretical race and I
haven't seen any bug reports for it. So we can probably just wait for
3.14, as you say, it's already late in the cycle anyways. Just let me
know what you want to do so we can continue working on the actual
performance issue.

Thanks,
Davidlohr


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user
  2014-01-15 20:47                               ` Davidlohr Bueso
@ 2014-01-15 20:50                                 ` Andrew Morton
  -1 siblings, 0 replies; 90+ messages in thread
From: Andrew Morton @ 2014-01-15 20:50 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: David Rientjes, Joonsoo Kim, Mel Gorman, Rik van Riel,
	Michal Hocko, Aneesh Kumar K.V, KAMEZAWA Hiroyuki, Hugh Dickins,
	Davidlohr Bueso, David Gibson, linux-mm, linux-kernel,
	Wanpeng Li, Naoya Horiguchi, Hillf Danton, aswin

On Wed, 15 Jan 2014 12:47:00 -0800 Davidlohr Bueso <davidlohr@hp.com> wrote:

> > Well, we're mainly looking for bugfixes this late in the cycle.
> > "[PATCH v3 03/14] mm, hugetlb: protect region tracking via newly
> > introduced resv_map lock" fixes a bug, but I'd assumed that it depended
> > on earlier patches. 
> 
> It doesn't seem to depend on anything. Patches 1-7 all apply cleanly on
> linux-next; the last change to mm/hugetlb.c was commit 3ebac7fa (mm:
> dump page when hitting a VM_BUG_ON using VM_BUG_ON_PAGE).
> 
> >  If we think that one is serious then it would be
> > better to cook up a minimal fix which is backportable into 3.12 and
> > earlier?
> 
> I don't think it's too serious, afaik it's a theoretical race and I
> haven't seen any bug reports for it. So we can probably just wait for
> 3.14, as you say, it's already late in the cycle anyways.

OK, thanks.

> Just let me
> know what you want to do so we can continue working on the actual
> performance issue.

A resend after -rc1 would suit.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 00/14] mm, hugetlb: remove a hugetlb_instantiation_mutex
  2013-12-18  6:53 ` Joonsoo Kim
@ 2014-03-31 16:27   ` Dave Hansen
  -1 siblings, 0 replies; 90+ messages in thread
From: Dave Hansen @ 2014-03-31 16:27 UTC (permalink / raw)
  To: Joonsoo Kim, Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Michal Hocko, Aneesh Kumar K.V,
	KAMEZAWA Hiroyuki, Hugh Dickins, Davidlohr Bueso, David Gibson,
	linux-mm, linux-kernel, Joonsoo Kim, Wanpeng Li, Naoya Horiguchi,
	Hillf Danton

On 12/17/2013 10:53 PM, Joonsoo Kim wrote:
> * NOTE for v3
> - Updating patchset is so late because of other works, not issue from
> this patchset.

Hey Joonsoo,

Any plans to repost these?

I've got some folks with a couple TB of RAM seeing long startup times
with $LARGE_DATABASE_PRODUCT.  It looks to be contention on
hugetlb_instantiation_mutex because everyone is trying to zero hugepages
under that lock in parallel.  Just removing the lock sped things up
quite a bit.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 00/14] mm, hugetlb: remove a hugetlb_instantiation_mutex
  2014-03-31 16:27   ` Dave Hansen
@ 2014-03-31 17:26     ` Davidlohr Bueso
  -1 siblings, 0 replies; 90+ messages in thread
From: Davidlohr Bueso @ 2014-03-31 17:26 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Joonsoo Kim, Andrew Morton, Rik van Riel, Mel Gorman,
	Michal Hocko, Aneesh Kumar K.V, KAMEZAWA Hiroyuki, Hugh Dickins,
	Davidlohr Bueso, David Gibson, linux-mm, linux-kernel,
	Joonsoo Kim, Wanpeng Li, Naoya Horiguchi, Hillf Danton

On Mon, 2014-03-31 at 09:27 -0700, Dave Hansen wrote:
> On 12/17/2013 10:53 PM, Joonsoo Kim wrote:
> > * NOTE for v3
> > - Updating patchset is so late because of other works, not issue from
> > this patchset.
> 
> Hey Joonsoo,
> 
> Any plans to repost these?
> 
> I've got some folks with a couple TB of RAM seeing long startup times
> with $LARGE_DATABASE_PRODUCT.  It looks to be contention on
> hugetlb_instantiation_mutex because everyone is trying to zero hugepages
> under that lock in parallel.  Just removing the lock sped things up
> quite a bit.

Welcome to my world. Regarding the instantiation mutex, it is addressed,
see commit c999c05ff595 in -next. 

As for the clear page overhead, I brought this up in lsfmm last week,
proposing some daemon to clear pages when we have idle cpu... but didn't
get much positive feedback. Basically (i) not worth the additional
complexity and (ii) can trigger different application startup times,
which seems to be something negative. I do have a patch that implements
huge_clear_page with non-temporal hinting but I didn't see much
difference on my environment, would you want to give it a try?

Thanks,
Davidlohr
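
As a rough illustration of the "non-temporal hinting" idea mentioned above,
the userspace sketch below zeroes a buffer with SSE2 streaming stores that
bypass the cache; it is an assumption-laden example for x86-64, not the
actual huge_clear_page patch. It avoids evicting the working set while
zeroing, but as the follow-up below notes, it does nothing about the
serialization under the mutex.

/* Userspace sketch of non-temporal clearing (x86-64/SSE2 assumed);
 * not the kernel's clear_huge_page() nor the patch referred to above. */
#include <emmintrin.h>
#include <stddef.h>

static void clear_buf_nontemporal(void *buf, size_t len)
{
	__m128i zero = _mm_setzero_si128();
	unsigned char *p = buf;
	size_t i;

	/* buf is assumed 16-byte aligned and len a multiple of 64,
	 * which trivially holds for a huge page. */
	for (i = 0; i < len; i += 64) {
		_mm_stream_si128((__m128i *)(p + i +  0), zero);
		_mm_stream_si128((__m128i *)(p + i + 16), zero);
		_mm_stream_si128((__m128i *)(p + i + 32), zero);
		_mm_stream_si128((__m128i *)(p + i + 48), zero);
	}
	_mm_sfence();	/* make the streaming stores globally visible */
}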


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 00/14] mm, hugetlb: remove a hugetlb_instantiation_mutex
  2014-03-31 17:26     ` Davidlohr Bueso
@ 2014-03-31 18:41       ` Dave Hansen
  -1 siblings, 0 replies; 90+ messages in thread
From: Dave Hansen @ 2014-03-31 18:41 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Joonsoo Kim, Andrew Morton, Rik van Riel, Mel Gorman,
	Michal Hocko, Aneesh Kumar K.V, KAMEZAWA Hiroyuki, Hugh Dickins,
	Davidlohr Bueso, David Gibson, linux-mm, linux-kernel,
	Joonsoo Kim, Naoya Horiguchi, Hillf Danton

On 03/31/2014 10:26 AM, Davidlohr Bueso wrote:
> On Mon, 2014-03-31 at 09:27 -0700, Dave Hansen wrote:
>> On 12/17/2013 10:53 PM, Joonsoo Kim wrote:
>>> * NOTE for v3
>>> - Updating patchset is so late because of other works, not issue from
>>> this patchset.
>>
>> I've got some folks with a couple TB of RAM seeing long startup times
>> with $LARGE_DATABASE_PRODUCT.  It looks to be contention on
>> hugetlb_instantiation_mutex because everyone is trying to zero hugepages
>> under that lock in parallel.  Just removing the lock sped things up
>> quite a bit.
> 
> Welcome to my world. Regarding the instantiation mutex, it is addressed,
> see commit c999c05ff595 in -next. 

Cool stuff.  That does seem to fix my parallel-fault hugetlbfs
microbenchmark.  I'll recommend that the $DATABASE folks check it as well.

> As for the clear page overhead, I brought this up in lsfmm last week,
> proposing some daemon to clear pages when we have idle cpu... but didn't
> get much positive feedback. Basically (i) not worth the additional
> complexity and (ii) can trigger different application startup times,
> which seems to be something negative. I do have a patch that implements
> huge_clear_page with non-temporal hinting but I didn't see much
> difference on my environment, would you want to give it a try?

I'd just be happy to see it happen outside of the locks.  As it stands
now, I have 1 CPU zeroing a huge page, and 159 sitting there sleeping
waiting for it to release the hugetlb_instantiation_mutex.  That's just
nonsense.  I don't think making them non-temporal will fundamentally
help that.  We need them parallelized.  According to ftrace, a
hugetlb_fault() takes ~700us.  Literally 99% of that is zeroing the page.
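
The ~700us figure is consistent with zeroing a 2MB page at roughly 3GB/s on
a single core, so the real win is letting many CPUs fault (and zero) in
parallel. A parallel-fault test of the kind mentioned above can be
approximated with the sketch below; the thread and page counts and the
anonymous MAP_HUGETLB mapping are assumptions, not the actual benchmark
from this thread.

/* Rough parallel hugetlb-fault sketch: each thread writes one byte per
 * huge page in its own slice, so each write triggers hugetlb_fault()
 * and the in-kernel zeroing. Assumes 2MB huge pages and enough pages
 * reserved via /proc/sys/vm/nr_hugepages. Build with -lpthread. */
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <sys/mman.h>

#define HPAGE_SIZE		(2UL << 20)
#define NTHREADS		16
#define PAGES_PER_THREAD	16

static char *map;

static void *toucher(void *arg)
{
	long id = (long)arg;
	char *base = map + id * PAGES_PER_THREAD * HPAGE_SIZE;
	long i;

	for (i = 0; i < PAGES_PER_THREAD; i++)
		base[i * HPAGE_SIZE] = 1;	/* fault in one huge page */
	return NULL;
}

int main(void)
{
	size_t len = (size_t)NTHREADS * PAGES_PER_THREAD * HPAGE_SIZE;
	pthread_t th[NTHREADS];
	long i;

	map = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (map == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	for (i = 0; i < NTHREADS; i++)
		pthread_create(&th[i], NULL, toucher, (void *)i);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(th[i], NULL);

	munmap(map, len);
	return 0;
}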



^ permalink raw reply	[flat|nested] 90+ messages in thread

end of thread, other threads:[~2014-03-31 18:42 UTC | newest]

Thread overview: 90+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-12-18  6:53 [PATCH v3 00/14] mm, hugetlb: remove a hugetlb_instantiation_mutex Joonsoo Kim
2013-12-18  6:53 ` Joonsoo Kim
2013-12-18  6:53 ` [PATCH v3 01/14] mm, hugetlb: unify region structure handling Joonsoo Kim
2013-12-18  6:53   ` Joonsoo Kim
2013-12-21  9:04   ` David Gibson
2014-01-07  2:37   ` Davidlohr Bueso
2014-01-07  2:37     ` Davidlohr Bueso
2013-12-18  6:53 ` [PATCH v3 02/14] mm, hugetlb: region manipulation functions take resv_map rather list_head Joonsoo Kim
2013-12-18  6:53   ` Joonsoo Kim
2013-12-21 13:43   ` David Gibson
2014-01-07  2:39   ` Davidlohr Bueso
2014-01-07  2:39     ` Davidlohr Bueso
2013-12-18  6:53 ` [PATCH v3 03/14] mm, hugetlb: protect region tracking via newly introduced resv_map lock Joonsoo Kim
2013-12-18  6:53   ` Joonsoo Kim
2013-12-21 13:58   ` David Gibson
2013-12-23  1:05     ` Joonsoo Kim
2013-12-23  1:05       ` Joonsoo Kim
2013-12-24 12:00       ` David Gibson
2014-01-06  0:12         ` Joonsoo Kim
2014-01-06  0:12           ` Joonsoo Kim
2014-01-07  2:39   ` Davidlohr Bueso
2014-01-07  2:39     ` Davidlohr Bueso
2013-12-18  6:53 ` [PATCH v3 04/14] mm, hugetlb: remove resv_map_put() Joonsoo Kim
2013-12-18  6:53   ` Joonsoo Kim
2013-12-18  6:53 ` [PATCH v3 05/14] mm, hugetlb: make vma_resv_map() works for all mapping type Joonsoo Kim
2013-12-18  6:53   ` Joonsoo Kim
2013-12-18  6:53 ` [PATCH v3 06/14] mm, hugetlb: remove vma_has_reserves() Joonsoo Kim
2013-12-18  6:53   ` Joonsoo Kim
2013-12-18  6:53 ` [PATCH v3 07/14] mm, hugetlb: mm, hugetlb: unify chg and avoid_reserve to use_reserve Joonsoo Kim
2013-12-18  6:53   ` Joonsoo Kim
2013-12-18  6:53 ` [PATCH v3 08/14] mm, hugetlb: call vma_needs_reservation before entering alloc_huge_page() Joonsoo Kim
2013-12-18  6:53   ` Joonsoo Kim
2013-12-18  6:53 ` [PATCH v3 09/14] mm, hugetlb: remove a check for return value of alloc_huge_page() Joonsoo Kim
2013-12-18  6:53   ` Joonsoo Kim
2013-12-18  6:53 ` [PATCH v3 10/14] mm, hugetlb: move down outside_reserve check Joonsoo Kim
2013-12-18  6:53   ` Joonsoo Kim
2013-12-18  6:53 ` [PATCH v3 11/14] mm, hugetlb: move up anon_vma_prepare() Joonsoo Kim
2013-12-18  6:53   ` Joonsoo Kim
2013-12-18  6:53 ` [PATCH v3 12/14] mm, hugetlb: clean-up error handling in hugetlb_cow() Joonsoo Kim
2013-12-18  6:53   ` Joonsoo Kim
2013-12-18  6:53 ` [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user Joonsoo Kim
2013-12-18  6:53   ` Joonsoo Kim
2013-12-20  1:02   ` Andrew Morton
2013-12-20  1:02     ` Andrew Morton
2013-12-20  1:58     ` Joonsoo Kim
2013-12-20  1:58       ` Joonsoo Kim
2013-12-20  2:15       ` Andrew Morton
2013-12-20  2:15         ` Andrew Morton
2013-12-20  5:00         ` Joonsoo Kim
2013-12-20  5:00           ` Joonsoo Kim
2013-12-20  2:31     ` Davidlohr Bueso
2013-12-20  2:31       ` Davidlohr Bueso
2013-12-20  4:47       ` Joonsoo Kim
2013-12-20  4:47         ` Joonsoo Kim
2013-12-20 14:01     ` Mel Gorman
2013-12-20 14:01       ` Mel Gorman
2013-12-21  6:48       ` Davidlohr Bueso
2013-12-21  6:48         ` Davidlohr Bueso
2013-12-23  0:44         ` Joonsoo Kim
2013-12-23  0:44           ` Joonsoo Kim
2013-12-23  2:11           ` Joonsoo Kim
2013-12-23  2:11             ` Joonsoo Kim
2014-01-03 19:55             ` Davidlohr Bueso
2014-01-03 19:55               ` Davidlohr Bueso
2014-01-06  0:19               ` Joonsoo Kim
2014-01-06  0:19                 ` Joonsoo Kim
2014-01-06 12:19                 ` Davidlohr Bueso
2014-01-06 12:19                   ` Davidlohr Bueso
2014-01-07  1:57                   ` Joonsoo Kim
2014-01-07  1:57                     ` Joonsoo Kim
2014-01-07  2:36                     ` Davidlohr Bueso
2014-01-07  2:36                       ` Davidlohr Bueso
2014-01-15  3:08                       ` David Rientjes
2014-01-15  3:08                         ` David Rientjes
2014-01-15  4:37                         ` Davidlohr Bueso
2014-01-15  4:37                           ` Davidlohr Bueso
2014-01-15  4:56                           ` Andrew Morton
2014-01-15  4:56                             ` Andrew Morton
2014-01-15 20:47                             ` Davidlohr Bueso
2014-01-15 20:47                               ` Davidlohr Bueso
2014-01-15 20:50                               ` Andrew Morton
2014-01-15 20:50                                 ` Andrew Morton
2013-12-18  6:54 ` [PATCH v3 14/14] mm, hugetlb: remove a hugetlb_instantiation_mutex Joonsoo Kim
2013-12-18  6:54   ` Joonsoo Kim
2014-03-31 16:27 ` [PATCH v3 00/14] " Dave Hansen
2014-03-31 16:27   ` Dave Hansen
2014-03-31 17:26   ` Davidlohr Bueso
2014-03-31 17:26     ` Davidlohr Bueso
2014-03-31 18:41     ` Dave Hansen
2014-03-31 18:41       ` Dave Hansen
