linux-kernel.vger.kernel.org archive mirror
* [RFC] [PATCH 0/5 V2] Huge page backed user-space stacks
@ 2008-07-28 19:17 Eric Munson
  2008-07-28 19:17 ` [PATCH 1/5 V2] Align stack boundaries based on personality Eric Munson
                   ` (7 more replies)
  0 siblings, 8 replies; 38+ messages in thread
From: Eric Munson @ 2008-07-28 19:17 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, linuxppc-dev, libhugetlbfs-devel, Eric Munson

Certain workloads benefit if their data or text segments are backed by
huge pages.  The stack is no exception to this rule, but there is
currently no mechanism that reliably backs a stack with huge pages.
Doing this from userspace is excessively messy and has some awkward
restrictions, particularly on POWER, where 256MB of address space is
wasted if the stack is set up there.

This patch series introduces a personality flag that indicates the kernel
should set up the stack as a hugetlbfs-backed region.  A userspace utility
may set this flag and then exec a process whose stack is to be backed by
hugetlb pages.

Eric Munson (5):
  Align stack boundaries based on personality
  Add shared and reservation control to hugetlb_file_setup
  Split boundary checking from body of do_munmap
  Build hugetlb backed process stacks
  [PPC] Setup stack memory segment for hugetlb pages

 arch/powerpc/mm/hugetlbpage.c |    6 +
 arch/powerpc/mm/slice.c       |   11 ++
 fs/exec.c                     |  209 ++++++++++++++++++++++++++++++++++++++---
 fs/hugetlbfs/inode.c          |   52 +++++++----
 include/asm-powerpc/hugetlb.h |    3 +
 include/linux/hugetlb.h       |   22 ++++-
 include/linux/mm.h            |    1 +
 include/linux/personality.h   |    3 +
 ipc/shm.c                     |    2 +-
 mm/mmap.c                     |   11 ++-
 10 files changed, 284 insertions(+), 36 deletions(-)



* [PATCH 1/5 V2] Align stack boundaries based on personality
  2008-07-28 19:17 [RFC] [PATCH 0/5 V2] Huge page backed user-space stacks Eric Munson
@ 2008-07-28 19:17 ` Eric Munson
  2008-07-28 20:09   ` Dave Hansen
  2008-07-28 19:17 ` [PATCH 2/5 V2] Add shared and reservation control to hugetlb_file_setup Eric Munson
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 38+ messages in thread
From: Eric Munson @ 2008-07-28 19:17 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, linuxppc-dev, libhugetlbfs-devel, Eric Munson,
	Andy Whitcroft

This patch adds a personality flag that requests hugetlb pages be used for
a process's stack.  It adds a helper function that chooses the proper ALIGN
macro based on the process personality, and calls this function from
setup_arg_pages when aligning the stack address.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Eric Munson <ebmunson@us.ibm.com>

---
Based on 2.6.26-rc8-mm1

Changes from V1:
Rebase to 2.6.26-rc8-mm1

 fs/exec.c                   |   15 ++++++++++++++-
 include/linux/hugetlb.h     |    3 +++
 include/linux/personality.h |    3 +++
 3 files changed, 20 insertions(+), 1 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index af9b29c..c99ba24 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -49,6 +49,7 @@
 #include <linux/tsacct_kern.h>
 #include <linux/cn_proc.h>
 #include <linux/audit.h>
+#include <linux/hugetlb.h>
 
 #include <asm/uaccess.h>
 #include <asm/mmu_context.h>
@@ -155,6 +156,18 @@ exit:
 	goto out;
 }
 
+static unsigned long personality_page_align(unsigned long addr)
+{
+	if (current->personality & HUGETLB_STACK)
+#ifdef CONFIG_STACK_GROWSUP
+		return HPAGE_ALIGN(addr);
+#else
+		return addr & HPAGE_MASK;
+#endif
+
+	return PAGE_ALIGN(addr);
+}
+
 #ifdef CONFIG_MMU
 
 static struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos,
@@ -596,7 +609,7 @@ int setup_arg_pages(struct linux_binprm *bprm,
 	bprm->p = vma->vm_end - stack_shift;
 #else
 	stack_top = arch_align_stack(stack_top);
-	stack_top = PAGE_ALIGN(stack_top);
+	stack_top = personality_page_align(stack_top);
 	stack_shift = vma->vm_end - stack_top;
 
 	bprm->p -= stack_shift;
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 9a71d4c..eed37d7 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -95,6 +95,9 @@ static inline unsigned long hugetlb_total_pages(void)
 #ifndef HPAGE_MASK
 #define HPAGE_MASK	PAGE_MASK		/* Keep the compiler happy */
 #define HPAGE_SIZE	PAGE_SIZE
+
+/* to align the pointer to the (next) huge page boundary */
+#define HPAGE_ALIGN(addr)	ALIGN(addr, HPAGE_SIZE)
 #endif
 
 #endif /* !CONFIG_HUGETLB_PAGE */
diff --git a/include/linux/personality.h b/include/linux/personality.h
index a84e9ff..2bb0f95 100644
--- a/include/linux/personality.h
+++ b/include/linux/personality.h
@@ -22,6 +22,9 @@ extern int		__set_personality(unsigned long);
  * These occupy the top three bytes.
  */
 enum {
+	HUGETLB_STACK =		0x0020000,	/* Attempt to use hugetlb pages
+						 * for the process stack
+						 */
 	ADDR_NO_RANDOMIZE = 	0x0040000,	/* disable randomization of VA space */
 	FDPIC_FUNCPTRS =	0x0080000,	/* userspace function ptrs point to descriptors
 						 * (signal handling)
-- 
1.5.6.1



* [PATCH 2/5 V2] Add shared and reservation control to hugetlb_file_setup
  2008-07-28 19:17 [RFC] [PATCH 0/5 V2] Huge page backed user-space stacks Eric Munson
  2008-07-28 19:17 ` [PATCH 1/5 V2] Align stack boundaries based on personality Eric Munson
@ 2008-07-28 19:17 ` Eric Munson
  2008-07-28 19:17 ` [PATCH 3/5] Split boundary checking from body of do_munmap Eric Munson
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 38+ messages in thread
From: Eric Munson @ 2008-07-28 19:17 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, linuxppc-dev, libhugetlbfs-devel, Eric Munson

There are two kinds of "shared" hugetlbfs mappings:
   1. mappings on the kernel-internal vfsmount, created via ipc/shm.c
      and shmctl()
   2. mmap() of a /hugetlbfs/file with MAP_SHARED

There is one kind of private mapping: mmap() of a /hugetlbfs/file with
MAP_PRIVATE

This patch adds a second class of "private" hugetlb-backed mapping.  But
we do it by sharing code with the ipc shm.  This is mostly because we
need to do our stack setup at execve() time and can't go opening files
from hugetlbfs.  The kernel-internal vfsmount for shm lets us get around
this.  We truly want anonymous memory, but MAP_PRIVATE is close enough
for now.

Currently, if the mapping on an internal mount is larger than a single
huge page, one page is allocated, one is reserved, and the rest are
faulted in as needed.  For hugetlb-backed stacks we do not want any
reserved pages.  This patch gives the caller of hugetlb_file_setup the
ability to control this behavior by specifying flags for private inodes
and page reservations.

Signed-off-by: Eric Munson <ebmunson@us.ibm.com>

---
Based on 2.6.26-rc8-mm1

Changes from V1:
Add creat_flags to struct hugetlbfs_inode_info
Check if space should be reserved in hugetlbfs_file_mmap
Rebase to 2.6.26-rc8-mm1

 fs/hugetlbfs/inode.c    |   52 ++++++++++++++++++++++++++++++----------------
 include/linux/hugetlb.h |   18 ++++++++++++---
 ipc/shm.c               |    2 +-
 3 files changed, 49 insertions(+), 23 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index dbd01d2..2e960d6 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -92,7 +92,7 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
 	 * way when do_mmap_pgoff unwinds (may be important on powerpc
 	 * and ia64).
 	 */
-	vma->vm_flags |= VM_HUGETLB | VM_RESERVED;
+	vma->vm_flags |= VM_HUGETLB;
 	vma->vm_ops = &hugetlb_vm_ops;
 
 	if (vma->vm_pgoff & ~(huge_page_mask(h) >> PAGE_SHIFT))
@@ -106,10 +106,13 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
 	ret = -ENOMEM;
 	len = vma_len + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
 
-	if (hugetlb_reserve_pages(inode,
+	if (HUGETLBFS_I(inode)->creat_flags & HUGETLB_RESERVE) {
+		vma->vm_flags |= VM_RESERVED;
+		if (hugetlb_reserve_pages(inode,
 				vma->vm_pgoff >> huge_page_order(h),
 				len >> huge_page_shift(h), vma))
-		goto out;
+			goto out;
+	}
 
 	ret = 0;
 	hugetlb_prefault_arch_hook(vma->vm_mm);
@@ -496,7 +499,8 @@ out:
 }
 
 static struct inode *hugetlbfs_get_inode(struct super_block *sb, uid_t uid, 
-					gid_t gid, int mode, dev_t dev)
+					gid_t gid, int mode, dev_t dev,
+					unsigned long creat_flags)
 {
 	struct inode *inode;
 
@@ -512,7 +516,9 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb, uid_t uid,
 		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
 		INIT_LIST_HEAD(&inode->i_mapping->private_list);
 		info = HUGETLBFS_I(inode);
-		mpol_shared_policy_init(&info->policy, NULL);
+		info->creat_flags = creat_flags;
+		if (!(creat_flags & HUGETLB_PRIVATE_INODE))
+			mpol_shared_policy_init(&info->policy, NULL);
 		switch (mode & S_IFMT) {
 		default:
 			init_special_inode(inode, mode, dev);
@@ -553,7 +559,8 @@ static int hugetlbfs_mknod(struct inode *dir,
 	} else {
 		gid = current->fsgid;
 	}
-	inode = hugetlbfs_get_inode(dir->i_sb, current->fsuid, gid, mode, dev);
+	inode = hugetlbfs_get_inode(dir->i_sb, current->fsuid, gid, mode, dev,
+					HUGETLB_RESERVE);
 	if (inode) {
 		dir->i_ctime = dir->i_mtime = CURRENT_TIME;
 		d_instantiate(dentry, inode);
@@ -589,7 +596,8 @@ static int hugetlbfs_symlink(struct inode *dir,
 		gid = current->fsgid;
 
 	inode = hugetlbfs_get_inode(dir->i_sb, current->fsuid,
-					gid, S_IFLNK|S_IRWXUGO, 0);
+					gid, S_IFLNK|S_IRWXUGO, 0,
+					HUGETLB_RESERVE);
 	if (inode) {
 		int l = strlen(symname)+1;
 		error = page_symlink(inode, symname, l);
@@ -693,7 +701,8 @@ static struct inode *hugetlbfs_alloc_inode(struct super_block *sb)
 static void hugetlbfs_destroy_inode(struct inode *inode)
 {
 	hugetlbfs_inc_free_inodes(HUGETLBFS_SB(inode->i_sb));
-	mpol_free_shared_policy(&HUGETLBFS_I(inode)->policy);
+	if (!(HUGETLBFS_I(inode)->creat_flags & HUGETLB_PRIVATE_INODE))
+		mpol_free_shared_policy(&HUGETLBFS_I(inode)->policy);
 	kmem_cache_free(hugetlbfs_inode_cachep, HUGETLBFS_I(inode));
 }
 
@@ -879,7 +888,8 @@ hugetlbfs_fill_super(struct super_block *sb, void *data, int silent)
 	sb->s_op = &hugetlbfs_ops;
 	sb->s_time_gran = 1;
 	inode = hugetlbfs_get_inode(sb, config.uid, config.gid,
-					S_IFDIR | config.mode, 0);
+					S_IFDIR | config.mode, 0,
+					HUGETLB_RESERVE);
 	if (!inode)
 		goto out_free;
 
@@ -944,7 +954,8 @@ static int can_do_hugetlb_shm(void)
 			can_do_mlock());
 }
 
-struct file *hugetlb_file_setup(const char *name, size_t size)
+struct file *hugetlb_file_setup(const char *name, size_t size,
+				unsigned long creat_flags)
 {
 	int error = -ENOMEM;
 	struct file *file;
@@ -955,11 +966,13 @@ struct file *hugetlb_file_setup(const char *name, size_t size)
 	if (!hugetlbfs_vfsmount)
 		return ERR_PTR(-ENOENT);
 
-	if (!can_do_hugetlb_shm())
-		return ERR_PTR(-EPERM);
+	if (!(creat_flags & HUGETLB_PRIVATE_INODE)) {
+		if (!can_do_hugetlb_shm())
+			return ERR_PTR(-EPERM);
 
-	if (!user_shm_lock(size, current->user))
-		return ERR_PTR(-ENOMEM);
+		if (!user_shm_lock(size, current->user))
+			return ERR_PTR(-ENOMEM);
+	}
 
 	root = hugetlbfs_vfsmount->mnt_root;
 	quick_string.name = name;
@@ -971,13 +984,15 @@ struct file *hugetlb_file_setup(const char *name, size_t size)
 
 	error = -ENOSPC;
 	inode = hugetlbfs_get_inode(root->d_sb, current->fsuid,
-				current->fsgid, S_IFREG | S_IRWXUGO, 0);
+				current->fsgid, S_IFREG | S_IRWXUGO, 0,
+				creat_flags);
 	if (!inode)
 		goto out_dentry;
 
 	error = -ENOMEM;
-	if (hugetlb_reserve_pages(inode, 0,
-			size >> huge_page_shift(hstate_inode(inode)), NULL))
+	if ((creat_flags & HUGETLB_RESERVE) &&
+		(hugetlb_reserve_pages(inode, 0,
+			size >> huge_page_shift(hstate_inode(inode)), NULL)))
 		goto out_inode;
 
 	d_instantiate(dentry, inode);
@@ -998,7 +1013,8 @@ out_inode:
 out_dentry:
 	dput(dentry);
 out_shm_unlock:
-	user_shm_unlock(size, current->user);
+	if (!(creat_flags & HUGETLB_PRIVATE_INODE))
+		user_shm_unlock(size, current->user);
 	return ERR_PTR(error);
 }
 
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index eed37d7..26ffed9 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -95,12 +95,20 @@ static inline unsigned long hugetlb_total_pages(void)
 #ifndef HPAGE_MASK
 #define HPAGE_MASK	PAGE_MASK		/* Keep the compiler happy */
 #define HPAGE_SIZE	PAGE_SIZE
+#endif
+
+#endif /* !CONFIG_HUGETLB_PAGE */
 
 /* to align the pointer to the (next) huge page boundary */
 #define HPAGE_ALIGN(addr)	ALIGN(addr, HPAGE_SIZE)
-#endif
 
-#endif /* !CONFIG_HUGETLB_PAGE */
+#define HUGETLB_PRIVATE_INODE	0x00000001UL	/* The file is being created on
+						 * the internal hugetlbfs mount
+						 * and is private to the
+						 * process */
+
+#define HUGETLB_RESERVE	0x00000002UL	/* Reserve the huge pages backed by the
+					 * new file */
 
 #ifdef CONFIG_HUGETLBFS
 struct hugetlbfs_config {
@@ -125,6 +133,7 @@ struct hugetlbfs_sb_info {
 struct hugetlbfs_inode_info {
 	struct shared_policy policy;
 	struct inode vfs_inode;
+	unsigned long creat_flags;
 };
 
 static inline struct hugetlbfs_inode_info *HUGETLBFS_I(struct inode *inode)
@@ -139,7 +148,8 @@ static inline struct hugetlbfs_sb_info *HUGETLBFS_SB(struct super_block *sb)
 
 extern const struct file_operations hugetlbfs_file_operations;
 extern struct vm_operations_struct hugetlb_vm_ops;
-struct file *hugetlb_file_setup(const char *name, size_t);
+struct file *hugetlb_file_setup(const char *name, size_t,
+				unsigned long creat_flags);
 int hugetlb_get_quota(struct address_space *mapping, long delta);
 void hugetlb_put_quota(struct address_space *mapping, long delta);
 
@@ -161,7 +171,7 @@ static inline void set_file_hugepages(struct file *file)
 
 #define is_file_hugepages(file)		0
 #define set_file_hugepages(file)	BUG()
-#define hugetlb_file_setup(name,size)	ERR_PTR(-ENOSYS)
+#define hugetlb_file_setup(name,size,creat_flags)	ERR_PTR(-ENOSYS)
 
 #endif /* !CONFIG_HUGETLBFS */
 
diff --git a/ipc/shm.c b/ipc/shm.c
index 2774bad..3b5849f 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -365,7 +365,7 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
 	sprintf (name, "SYSV%08x", key);
 	if (shmflg & SHM_HUGETLB) {
 		/* hugetlb_file_setup takes care of mlock user accounting */
-		file = hugetlb_file_setup(name, size);
+		file = hugetlb_file_setup(name, size, HUGETLB_RESERVE);
 		shp->mlock_user = current->user;
 	} else {
 		int acctflag = VM_ACCOUNT;
-- 
1.5.6.1



* [PATCH 3/5] Split boundary checking from body of do_munmap
  2008-07-28 19:17 [RFC] [PATCH 0/5 V2] Huge page backed user-space stacks Eric Munson
  2008-07-28 19:17 ` [PATCH 1/5 V2] Align stack boundaries based on personality Eric Munson
  2008-07-28 19:17 ` [PATCH 2/5 V2] Add shared and reservation control to hugetlb_file_setup Eric Munson
@ 2008-07-28 19:17 ` Eric Munson
  2008-07-28 19:17 ` [PATCH 4/5 V2] Build hugetlb backed process stacks Eric Munson
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 38+ messages in thread
From: Eric Munson @ 2008-07-28 19:17 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, linuxppc-dev, libhugetlbfs-devel, Eric Munson

Currently do_munmap pre-checks the address range to be unmapped against
the valid address range for the process.  However, during initial setup
the stack may actually lie outside this range; in particular, it may
initially be placed at the 64-bit stack address and later moved to the
normal 32-bit stack location.  In a later patch we will want to unmap
the stack as part of relocating it into huge pages.

This patch moves the bulk of do_munmap into __do_munmap which will not
be protected by the boundary checking.  When an area that would normally
fail at these checks needs to be unmapped (e.g. unmapping a stack that
was setup at 64 bit TASK_SIZE for a 32 bit process) __do_munmap should
be called directly.  do_munmap will continue to do the boundary checking
and will call __do_munmap as appropriate.

Signed-off-by: Eric Munson <ebmunson@us.ibm.com>

---
Based on 2.6.26-rc8-mm1

 include/linux/mm.h |    1 +
 mm/mmap.c          |   11 +++++++++--
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a4eeb3c..59c6f89 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1152,6 +1152,7 @@ out:
 	return ret;
 }
 
+extern int __do_munmap(struct mm_struct *, unsigned long, size_t);
 extern int do_munmap(struct mm_struct *, unsigned long, size_t);
 
 extern unsigned long do_brk(unsigned long, unsigned long);
diff --git a/mm/mmap.c b/mm/mmap.c
index 5b62e5d..4e56369 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1881,17 +1881,24 @@ int split_vma(struct mm_struct * mm, struct vm_area_struct * vma,
 	return 0;
 }
 
+int do_munmap(struct mm_struct *mm, unsigned long start, size_t len)
+{
+	if (start > TASK_SIZE || len > TASK_SIZE-start)
+		return -EINVAL;
+	return __do_munmap(mm, start, len);
+}
+
 /* Munmap is split into 2 main parts -- this part which finds
  * what needs doing, and the areas themselves, which do the
  * work.  This now handles partial unmappings.
  * Jeremy Fitzhardinge <jeremy@goop.org>
  */
-int do_munmap(struct mm_struct *mm, unsigned long start, size_t len)
+int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len)
 {
 	unsigned long end;
 	struct vm_area_struct *vma, *prev, *last;
 
-	if ((start & ~PAGE_MASK) || start > TASK_SIZE || len > TASK_SIZE-start)
+	if (start & ~PAGE_MASK)
 		return -EINVAL;
 
 	if ((len = PAGE_ALIGN(len)) == 0)
-- 
1.5.6.1



* [PATCH 4/5 V2] Build hugetlb backed process stacks
  2008-07-28 19:17 [RFC] [PATCH 0/5 V2] Huge page backed user-space stacks Eric Munson
                   ` (2 preceding siblings ...)
  2008-07-28 19:17 ` [PATCH 3/5] Split boundary checking from body of do_munmap Eric Munson
@ 2008-07-28 19:17 ` Eric Munson
  2008-07-28 20:37   ` Dave Hansen
  2008-07-28 19:17 ` [PATCH 5/5 V2] [PPC] Setup stack memory segment for hugetlb pages Eric Munson
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 38+ messages in thread
From: Eric Munson @ 2008-07-28 19:17 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, linuxppc-dev, libhugetlbfs-devel, Eric Munson

This patch allows a process's stack to be backed by huge pages on request.
The personality flag defined in a previous patch should be set before
exec is called for the target process to use a huge page backed stack.

When the hugetlb file that backs the stack is set up, it is sized to the
stack-size ulimit, or to 256 MB if the ulimit is unlimited.  The GROWSUP and
GROWSDOWN VM flags are turned off because a hugetlb-backed vma is not
resizable, so it is sized appropriately when created.  When a process
exceeds this stack size it receives a segfault, just as it would on
exceeding the ulimit.

Also certain architectures require special setup for a memory region before
huge pages can be used in that region.  This patch defines a function with
__attribute__ ((weak)) set that can be defined by these architectures to
do any necessary setup.  If it exists, it will be called right before the
hugetlb file is mmapped.

Signed-off-by: Eric Munson <ebmunson@us.ibm.com>

---
Based on 2.6.26-rc8-mm1

Changes from V1:
Add comment about not padding huge stacks
Break personality_page_align helper and personality flag into separate patch
Add move_to_huge_pages function that moves the stack onto huge pages
Add hugetlb_mm_setup weak function for archs that require special setup to
 use hugetlb pages
Rebase to 2.6.26-rc8-mm1

 fs/exec.c               |  194 ++++++++++++++++++++++++++++++++++++++++++++---
 include/linux/hugetlb.h |    5 +
 2 files changed, 187 insertions(+), 12 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index c99ba24..bf9ead2 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -50,6 +50,7 @@
 #include <linux/cn_proc.h>
 #include <linux/audit.h>
 #include <linux/hugetlb.h>
+#include <linux/mman.h>
 
 #include <asm/uaccess.h>
 #include <asm/mmu_context.h>
@@ -59,6 +60,8 @@
 #include <linux/kmod.h>
 #endif
 
+#define HUGE_STACK_MAX (256*1024*1024)
+
 #ifdef __alpha__
 /* for /sbin/loader handling in search_binary_handler() */
 #include <linux/a.out.h>
@@ -189,7 +192,12 @@ static struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos,
 		return NULL;
 
 	if (write) {
-		unsigned long size = bprm->vma->vm_end - bprm->vma->vm_start;
+		/*
+		 * Args are always placed at the high end of the stack space
+		 * so this calculation will give the proper size and it is
+		 * compatible with huge page stacks.
+		 */
+		unsigned long size = bprm->vma->vm_end - pos;
 		struct rlimit *rlim;
 
 		/*
@@ -255,7 +263,10 @@ static int __bprm_mm_init(struct linux_binprm *bprm)
 	 * configured yet.
 	 */
 	vma->vm_end = STACK_TOP_MAX;
-	vma->vm_start = vma->vm_end - PAGE_SIZE;
+	if (current->personality & HUGETLB_STACK)
+		vma->vm_start = vma->vm_end - HPAGE_SIZE;
+	else
+		vma->vm_start = vma->vm_end - PAGE_SIZE;
 
 	vma->vm_flags = VM_STACK_FLAGS;
 	vma->vm_page_prot = vm_get_page_prot(vma->vm_flags);
@@ -574,6 +585,156 @@ static int shift_arg_pages(struct vm_area_struct *vma, unsigned long shift)
 	return 0;
 }
 
+static struct file *hugetlb_stack_file(int stack_hpages)
+{
+	struct file *hugefile = NULL;
+
+	if (!stack_hpages) {
+		set_personality(current->personality & (~HUGETLB_STACK));
+		printk(KERN_DEBUG
+			"Stack rlimit set too low for huge page backed stack.\n");
+		return NULL;
+	}
+
+	hugefile = hugetlb_file_setup(HUGETLB_STACK_FILE,
+					HPAGE_SIZE * stack_hpages,
+					HUGETLB_PRIVATE_INODE);
+	if (unlikely(IS_ERR(hugefile))) {
+		/*
+		 * If huge pages are not available for this stack fall
+		 * fall back to normal pages for execution instead of
+		 * failing.
+		 */
+		printk(KERN_DEBUG
+			"Huge page backed stack unavailable for process %lu.\n",
+			(unsigned long)current->pid);
+		set_personality(current->personality & (~HUGETLB_STACK));
+		return NULL;
+	}
+	return hugefile;
+}
+
+static int move_to_huge_pages(struct linux_binprm *bprm,
+				struct vm_area_struct *vma, unsigned long shift)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct vm_area_struct *new_vma;
+	unsigned long old_end = vma->vm_end;
+	unsigned long old_start = vma->vm_start;
+	unsigned long new_end = old_end - shift;
+	unsigned long new_start, length;
+	unsigned long arg_size = new_end - bprm->p;
+	unsigned long flags = vma->vm_flags;
+	struct file *hugefile = NULL;
+	unsigned int stack_hpages = 0;
+	struct page **from_pages = NULL;
+	struct page **to_pages = NULL;
+	unsigned long num_pages = (arg_size / PAGE_SIZE) + 1;
+	int ret;
+	int i;
+
+#ifdef CONFIG_STACK_GROWSUP
+	/*
+	 * Huge page stacks are not currently supported on GROWSUP
+	 * archs.
+	 */
+	set_personality(current->personality & (~HUGETLB_STACK));
+#else
+	if (current->signal->rlim[RLIMIT_STACK].rlim_cur == _STK_LIM_MAX)
+		stack_hpages = HUGE_STACK_MAX / HPAGE_SIZE;
+	else
+		stack_hpages = current->signal->rlim[RLIMIT_STACK].rlim_cur /
+				HPAGE_SIZE;
+	hugefile = hugetlb_stack_file(stack_hpages);
+	if (!hugefile)
+		goto out_small_stack;
+
+	length = stack_hpages * HPAGE_SIZE;
+	new_start = new_end - length;
+
+	from_pages = kmalloc(num_pages * sizeof(struct page*), GFP_KERNEL);
+	to_pages = kmalloc(num_pages * sizeof(struct page*), GFP_KERNEL);
+	if (!from_pages || !to_pages)
+		goto out_small_stack;
+
+	ret = get_user_pages(current, mm, (old_end - arg_size) & PAGE_MASK,
+				num_pages, 0, 0, from_pages, NULL);
+	if (ret <= 0)
+		goto out_small_stack;
+
+	/*
+	 * __do_munmap is used here because the boundary checking done in
+	 * do_munmap will fail out every time where the kernel is 64 bit and the
+	 * target program is 32 bit as the stack will start at TASK_SIZE for the
+	 * 64 bit address space.
+	 */
+	ret = __do_munmap(mm, old_start, old_end - old_start);
+	if (ret)
+		goto out_small_stack;
+
+	ret = -EINVAL;
+	if (hugetlb_mm_setup)
+		hugetlb_mm_setup(mm, new_start, length);
+	if (IS_ERR_VALUE(do_mmap(hugefile, new_start, length,
+			PROT_READ | PROT_WRITE, MAP_FIXED | MAP_PRIVATE, 0)))
+		goto out_error;
+	/* We don't want to fput this if the mmap succeeded */
+	hugefile = NULL;
+
+	ret = get_user_pages(current, mm, (new_end - arg_size) & PAGE_MASK,
+				num_pages, 0, 0, to_pages, NULL);
+	if (ret <= 0) {
+		ret = -ENOMEM;
+		goto out_error;
+	}
+
+	for (i = 0; i < num_pages; i++) {
+		char *vfrom, *vto;
+		vfrom = kmap(from_pages[i]);
+		vto = kmap(to_pages[i]);
+		memcpy(vto, vfrom, PAGE_SIZE);
+		kunmap(from_pages[i]);
+		kunmap(to_pages[i]);
+		put_page(from_pages[i]);
+		put_page(to_pages[i]);
+	}
+
+	kfree(from_pages);
+	kfree(to_pages);
+	new_vma = find_vma(current->mm, new_start);
+	if (!new_vma)
+		return -ENOSPC;
+	new_vma->vm_flags |= flags;
+	new_vma->vm_flags &= ~(VM_GROWSUP|VM_GROWSDOWN);
+	new_vma->vm_page_prot = vm_get_page_prot(new_vma->vm_flags);
+
+	bprm->vma = new_vma;
+	return 0;
+
+out_error:
+	for (i = 0; i < num_pages; i++)
+		put_page(from_pages[i]);
+	if (hugefile)
+		fput(hugefile);
+	if (from_pages)
+		kfree(from_pages);
+	if (to_pages)
+		kfree(to_pages);
+	return ret;
+
+out_small_stack:
+	if (hugefile)
+		fput(hugefile);
+	if (from_pages)
+		kfree(from_pages);
+	if (to_pages)
+		kfree(to_pages);
+#endif /* !CONFIG_STACK_GROWSUP */
+	if (shift)
+		return shift_arg_pages(vma, shift);
+	return 0;
+}
+
 #define EXTRA_STACK_VM_PAGES	20	/* random */
 
 /*
@@ -640,23 +801,32 @@ int setup_arg_pages(struct linux_binprm *bprm,
 		goto out_unlock;
 	BUG_ON(prev != vma);
 
+	/* Move stack to hugetlb pages if requested */
+	if (current->personality & HUGETLB_STACK)
+		ret = move_to_huge_pages(bprm, vma, stack_shift);
 	/* Move stack pages down in memory. */
-	if (stack_shift) {
+	else if (stack_shift)
 		ret = shift_arg_pages(vma, stack_shift);
-		if (ret) {
-			up_write(&mm->mmap_sem);
-			return ret;
-		}
+
+	if (ret) {
+		up_write(&mm->mmap_sem);
+		return ret;
 	}
 
+	/*
+	 * Stack padding code is skipped for huge stacks because the vma
+	 * is not expandable when backed by a hugetlb file.
+	 */
+	if (!(current->personality & HUGETLB_STACK)) {
 #ifdef CONFIG_STACK_GROWSUP
-	stack_base = vma->vm_end + EXTRA_STACK_VM_PAGES * PAGE_SIZE;
+		stack_base = vma->vm_end + EXTRA_STACK_VM_PAGES * PAGE_SIZE;
 #else
-	stack_base = vma->vm_start - EXTRA_STACK_VM_PAGES * PAGE_SIZE;
+		stack_base = vma->vm_start - EXTRA_STACK_VM_PAGES * PAGE_SIZE;
 #endif
-	ret = expand_stack(vma, stack_base);
-	if (ret)
-		ret = -EFAULT;
+		ret = expand_stack(vma, stack_base);
+		if (ret)
+			ret = -EFAULT;
+	}
 
 out_unlock:
 	up_write(&mm->mmap_sem);
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 26ffed9..b4c88bb 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -110,6 +110,11 @@ static inline unsigned long hugetlb_total_pages(void)
 #define HUGETLB_RESERVE	0x00000002UL	/* Reserve the huge pages backed by the
 					 * new file */
 
+#define HUGETLB_STACK_FILE "hugetlb-stack"
+
+extern void hugetlb_mm_setup(struct mm_struct *mm, unsigned long addr,
+				unsigned long len) __attribute__ ((weak));
+
 #ifdef CONFIG_HUGETLBFS
 struct hugetlbfs_config {
 	uid_t   uid;
-- 
1.5.6.1



* [PATCH 5/5 V2] [PPC] Setup stack memory segment for hugetlb pages
  2008-07-28 19:17 [RFC] [PATCH 0/5 V2] Huge page backed user-space stacks Eric Munson
                   ` (3 preceding siblings ...)
  2008-07-28 19:17 ` [PATCH 4/5 V2] Build hugetlb backed process stacks Eric Munson
@ 2008-07-28 19:17 ` Eric Munson
  2008-07-28 20:33 ` [RFC] [PATCH 0/5 V2] Huge page backed user-space stacks Dave Hansen
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 38+ messages in thread
From: Eric Munson @ 2008-07-28 19:17 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, linuxppc-dev, libhugetlbfs-devel, Eric Munson

Currently the memory slice that holds the process stack is always initialized
to hold small pages.  This patch defines the weak function that is declared
in the previous patch to convert the stack slice to hugetlb pages.

Signed-off-by: Eric Munson <ebmunson@us.ibm.com>

---
Based on 2.6.26-rc8-mm1

Changes from V1:
Instead of setting the mm-wide page size to huge pages, set only the relevant
 slice psize using an arch defined weak function.

 arch/powerpc/mm/hugetlbpage.c |    6 ++++++
 arch/powerpc/mm/slice.c       |   11 +++++++++++
 include/asm-powerpc/hugetlb.h |    3 +++
 3 files changed, 20 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index fb42c4d..bd7f777 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -152,6 +152,12 @@ pmd_t *hpmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long addr,
 }
 #endif
 
+void hugetlb_mm_setup(struct mm_struct *mm, unsigned long addr,
+			unsigned long len)
+{
+	slice_convert_address(mm, addr, len, shift_to_mmu_psize(HPAGE_SHIFT));
+}
+
 /* Build list of addresses of gigantic pages.  This function is used in early
  * boot before the buddy or bootmem allocator is setup.
  */
diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c
index 583be67..d984733 100644
--- a/arch/powerpc/mm/slice.c
+++ b/arch/powerpc/mm/slice.c
@@ -30,6 +30,7 @@
 #include <linux/err.h>
 #include <linux/spinlock.h>
 #include <linux/module.h>
+#include <linux/hugetlb.h>
 #include <asm/mman.h>
 #include <asm/mmu.h>
 #include <asm/spu.h>
@@ -397,6 +398,16 @@ static unsigned long slice_find_area(struct mm_struct *mm, unsigned long len,
 #define MMU_PAGE_BASE	MMU_PAGE_4K
 #endif
 
+void slice_convert_address(struct mm_struct *mm, unsigned long addr,
+				unsigned long len, unsigned int psize)
+{
+	struct slice_mask mask;
+
+	mask = slice_range_to_mask(addr, len);
+	slice_convert(mm, mask, psize);
+	slice_flush_segments(mm);
+}
+
 unsigned long slice_get_unmapped_area(unsigned long addr, unsigned long len,
 				      unsigned long flags, unsigned int psize,
 				      int topdown, int use_cache)
diff --git a/include/asm-powerpc/hugetlb.h b/include/asm-powerpc/hugetlb.h
index 26f0d0a..10ef089 100644
--- a/include/asm-powerpc/hugetlb.h
+++ b/include/asm-powerpc/hugetlb.h
@@ -17,6 +17,9 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
 pte_t huge_ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,
 			      pte_t *ptep);
 
+void slice_convert_address(struct mm_struct *mm, unsigned long addr,
+				unsigned long len, unsigned int psize);
+
 /*
  * If the arch doesn't supply something else, assume that hugepage
  * size aligned regions are ok without further preparation.
-- 
1.5.6.1



* Re: [PATCH 1/5 V2] Align stack boundaries based on personality
  2008-07-28 19:17 ` [PATCH 1/5 V2] Align stack boundaries based on personality Eric Munson
@ 2008-07-28 20:09   ` Dave Hansen
  0 siblings, 0 replies; 38+ messages in thread
From: Dave Hansen @ 2008-07-28 20:09 UTC (permalink / raw)
  To: Eric Munson
  Cc: linux-mm, linux-kernel, linuxppc-dev, libhugetlbfs-devel, Andy Whitcroft

On Mon, 2008-07-28 at 12:17 -0700, Eric Munson wrote:
> 
> +static unsigned long personality_page_align(unsigned long addr)
> +{
> +       if (current->personality & HUGETLB_STACK)
> +#ifdef CONFIG_STACK_GROWSUP
> +               return HPAGE_ALIGN(addr);
> +#else
> +               return addr & HPAGE_MASK;
> +#endif
> +
> +       return PAGE_ALIGN(addr);
> +}
...
> -       stack_top = PAGE_ALIGN(stack_top);
> +       stack_top = personality_page_align(stack_top);

Just out of curiosity, why doesn't the existing small-page case seem to
care about the stack growing up/down?  Why do you need to care in the
large page case?

-- Dave


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC] [PATCH 0/5 V2] Huge page backed user-space stacks
  2008-07-28 19:17 [RFC] [PATCH 0/5 V2] Huge page backed user-space stacks Eric Munson
                   ` (4 preceding siblings ...)
  2008-07-28 19:17 ` [PATCH 5/5 V2] [PPC] Setup stack memory segment for hugetlb pages Eric Munson
@ 2008-07-28 20:33 ` Dave Hansen
  2008-07-28 21:23   ` Eric B Munson
  2008-07-30  8:41 ` Andrew Morton
  2008-07-30  8:43 ` Andrew Morton
  7 siblings, 1 reply; 38+ messages in thread
From: Dave Hansen @ 2008-07-28 20:33 UTC (permalink / raw)
  To: Eric Munson; +Cc: linux-mm, linux-kernel, linuxppc-dev, libhugetlbfs-devel

On Mon, 2008-07-28 at 12:17 -0700, Eric Munson wrote:
> 
> This patch stack introduces a personality flag that indicates the
> kernel
> should setup the stack as a hugetlbfs-backed region. A userspace
> utility
> may set this flag then exec a process whose stack is to be backed by
> hugetlb pages.

I didn't see it mentioned here, but these stacks are fixed-size, right?
They can't actually grow and are fixed in size at exec() time, right?

-- Dave


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/5 V2] Build hugetlb backed process stacks
  2008-07-28 19:17 ` [PATCH 4/5 V2] Build hugetlb backed process stacks Eric Munson
@ 2008-07-28 20:37   ` Dave Hansen
  0 siblings, 0 replies; 38+ messages in thread
From: Dave Hansen @ 2008-07-28 20:37 UTC (permalink / raw)
  To: Eric Munson; +Cc: linux-mm, linux-kernel, linuxppc-dev, libhugetlbfs-devel

On Mon, 2008-07-28 at 12:17 -0700, Eric Munson wrote:
> 
> +static int move_to_huge_pages(struct linux_binprm *bprm,
> +                               struct vm_area_struct *vma, unsigned
> long shift)
> +{
> +       struct mm_struct *mm = vma->vm_mm;
> +       struct vm_area_struct *new_vma;
> +       unsigned long old_end = vma->vm_end;
> +       unsigned long old_start = vma->vm_start;
> +       unsigned long new_end = old_end - shift;
> +       unsigned long new_start, length;
> +       unsigned long arg_size = new_end - bprm->p;
> +       unsigned long flags = vma->vm_flags;
> +       struct file *hugefile = NULL;
> +       unsigned int stack_hpages = 0;
> +       struct page **from_pages = NULL;
> +       struct page **to_pages = NULL;
> +       unsigned long num_pages = (arg_size / PAGE_SIZE) + 1;
> +       int ret;
> +       int i;
> +
> +#ifdef CONFIG_STACK_GROWSUP

Why do you have the #ifdef for the CONFIG_STACK_GROWSUP=y case in that
first patch if you don't support CONFIG_STACK_GROWSUP=y?

I think it might be worth some time to break this up a wee little bit.
16 local variables is a bit on the beefy side. :)

-- Dave


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC] [PATCH 0/5 V2] Huge page backed user-space stacks
  2008-07-28 20:33 ` [RFC] [PATCH 0/5 V2] Huge page backed user-space stacks Dave Hansen
@ 2008-07-28 21:23   ` Eric B Munson
  0 siblings, 0 replies; 38+ messages in thread
From: Eric B Munson @ 2008-07-28 21:23 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-mm, linux-kernel, linuxppc-dev, libhugetlbfs-devel

[-- Attachment #1: Type: text/plain, Size: 684 bytes --]

On Mon, 28 Jul 2008, Dave Hansen wrote:

> On Mon, 2008-07-28 at 12:17 -0700, Eric Munson wrote:
> > 
> > This patch stack introduces a personality flag that indicates the
> > kernel
> > should setup the stack as a hugetlbfs-backed region. A userspace
> > utility
> > may set this flag then exec a process whose stack is to be backed by
> > hugetlb pages.
> 
> I didn't see it mentioned here, but these stacks are fixed-size, right?
> They can't actually grow and are fixed in size at exec() time, right?
> 
> -- Dave

The stack VMA is a fixed size but the pages will be faulted in as needed.

-- 
Eric B Munson
IBM Linux Technology Center
ebmunson@us.ibm.com


[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC] [PATCH 0/5 V2] Huge page backed user-space stacks
  2008-07-28 19:17 [RFC] [PATCH 0/5 V2] Huge page backed user-space stacks Eric Munson
                   ` (5 preceding siblings ...)
  2008-07-28 20:33 ` [RFC] [PATCH 0/5 V2] Huge page backed user-space stacks Dave Hansen
@ 2008-07-30  8:41 ` Andrew Morton
  2008-07-30 15:04   ` Eric B Munson
  2008-07-30 15:08   ` Eric B Munson
  2008-07-30  8:43 ` Andrew Morton
  7 siblings, 2 replies; 38+ messages in thread
From: Andrew Morton @ 2008-07-30  8:41 UTC (permalink / raw)
  To: Eric Munson; +Cc: linux-mm, linux-kernel, linuxppc-dev, libhugetlbfs-devel

On Mon, 28 Jul 2008 12:17:10 -0700 Eric Munson <ebmunson@us.ibm.com> wrote:

> Certain workloads benefit if their data or text segments are backed by
> huge pages. The stack is no exception to this rule but there is no
> mechanism currently that allows the backing of a stack reliably with
> huge pages.  Doing this from userspace is excessively messy and has some
> awkward restrictions.  Particularly on POWER where 256MB of address space
> gets wasted if the stack is setup there.
> 
> This patch stack introduces a personality flag that indicates the kernel
> should setup the stack as a hugetlbfs-backed region. A userspace utility
> may set this flag then exec a process whose stack is to be backed by
> hugetlb pages.
> 
> Eric Munson (5):
>   Align stack boundaries based on personality
>   Add shared and reservation control to hugetlb_file_setup
>   Split boundary checking from body of do_munmap
>   Build hugetlb backed process stacks
>   [PPC] Setup stack memory segment for hugetlb pages
> 
>  arch/powerpc/mm/hugetlbpage.c |    6 +
>  arch/powerpc/mm/slice.c       |   11 ++
>  fs/exec.c                     |  209 ++++++++++++++++++++++++++++++++++++++---
>  fs/hugetlbfs/inode.c          |   52 +++++++----
>  include/asm-powerpc/hugetlb.h |    3 +
>  include/linux/hugetlb.h       |   22 ++++-
>  include/linux/mm.h            |    1 +
>  include/linux/personality.h   |    3 +
>  ipc/shm.c                     |    2 +-
>  mm/mmap.c                     |   11 ++-
>  10 files changed, 284 insertions(+), 36 deletions(-)

That all looks surprisingly straightforward.

Might there exist an x86 port which people can play with?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC] [PATCH 0/5 V2] Huge page backed user-space stacks
  2008-07-28 19:17 [RFC] [PATCH 0/5 V2] Huge page backed user-space stacks Eric Munson
                   ` (6 preceding siblings ...)
  2008-07-30  8:41 ` Andrew Morton
@ 2008-07-30  8:43 ` Andrew Morton
  2008-07-30 17:23   ` Mel Gorman
  7 siblings, 1 reply; 38+ messages in thread
From: Andrew Morton @ 2008-07-30  8:43 UTC (permalink / raw)
  To: Eric Munson; +Cc: linux-mm, linux-kernel, linuxppc-dev, libhugetlbfs-devel

On Mon, 28 Jul 2008 12:17:10 -0700 Eric Munson <ebmunson@us.ibm.com> wrote:

> Certain workloads benefit if their data or text segments are backed by
> huge pages.

oh.  As this is a performance patch, it would be much better if its
description contained some performance measurement results!  Please.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC] [PATCH 0/5 V2] Huge page backed user-space stacks
  2008-07-30  8:41 ` Andrew Morton
@ 2008-07-30 15:04   ` Eric B Munson
  2008-07-30 15:08   ` Eric B Munson
  1 sibling, 0 replies; 38+ messages in thread
From: Eric B Munson @ 2008-07-30 15:04 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, linux-kernel, linuxppc-dev, libhugetlbfs-devel

[-- Attachment #1: Type: text/plain, Size: 2181 bytes --]

On Wed, 30 Jul 2008, Andrew Morton wrote:

> On Mon, 28 Jul 2008 12:17:10 -0700 Eric Munson <ebmunson@us.ibm.com> wrote:
> 
> > Certain workloads benefit if their data or text segments are backed by
> > huge pages. The stack is no exception to this rule but there is no
> > mechanism currently that allows the backing of a stack reliably with
> > huge pages.  Doing this from userspace is excessively messy and has some
> > awkward restrictions.  Particularly on POWER where 256MB of address space
> > gets wasted if the stack is setup there.
> > 
> > This patch stack introduces a personality flag that indicates the kernel
> > should setup the stack as a hugetlbfs-backed region. A userspace utility
> > may set this flag then exec a process whose stack is to be backed by
> > hugetlb pages.
> > 
> > Eric Munson (5):
> >   Align stack boundaries based on personality
> >   Add shared and reservation control to hugetlb_file_setup
> >   Split boundary checking from body of do_munmap
> >   Build hugetlb backed process stacks
> >   [PPC] Setup stack memory segment for hugetlb pages
> > 
> >  arch/powerpc/mm/hugetlbpage.c |    6 +
> >  arch/powerpc/mm/slice.c       |   11 ++
> >  fs/exec.c                     |  209 ++++++++++++++++++++++++++++++++++++++---
> >  fs/hugetlbfs/inode.c          |   52 +++++++----
> >  include/asm-powerpc/hugetlb.h |    3 +
> >  include/linux/hugetlb.h       |   22 ++++-
> >  include/linux/mm.h            |    1 +
> >  include/linux/personality.h   |    3 +
> >  ipc/shm.c                     |    2 +-
> >  mm/mmap.c                     |   11 ++-
> >  10 files changed, 284 insertions(+), 36 deletions(-)
> 
> That all looks surprisingly straightforward.
> 
> Might there exist an x86 port which people can play with?
> 

I have tested these patches on x86, x86_64, and ppc64, but not yet on ia64.
There is a user space utility that I have been using for testing, which would
be included in libhugetlbfs if this is merged into the kernel.  I will send it
out as a reply to this thread; performance numbers are also on the way.

-- 
Eric B Munson
IBM Linux Technology Center
ebmunson@us.ibm.com


[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC] [PATCH 0/5 V2] Huge page backed user-space stacks
  2008-07-30  8:41 ` Andrew Morton
  2008-07-30 15:04   ` Eric B Munson
@ 2008-07-30 15:08   ` Eric B Munson
  1 sibling, 0 replies; 38+ messages in thread
From: Eric B Munson @ 2008-07-30 15:08 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, linux-kernel, linuxppc-dev, libhugetlbfs-devel

[-- Attachment #1: Type: text/plain, Size: 2985 bytes --]

/***************************************************************************
 *   User front end for using huge pages Copyright (C) 2008, IBM           *
 *                                                                         *
 *   This program is free software; you can redistribute it and/or modify  *
 *   it under the terms of the Lesser GNU General Public License as        *
 *   published by the Free Software Foundation; either version 2.1 of the  *
 *   License, or (at your option) any later version.                       *
 *                                                                         *
 *   This program is distributed in the hope that it will be useful,       *
 *   but WITHOUT ANY WARRANTY; without even the implied warranty of        *
 *   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the         *
 *   GNU Lesser General Public License for more details.                   *
 *                                                                         *
 *   You should have received a copy of the Lesser GNU General Public      *
 *   License along with this program; if not, write to the                 *
 *   Free Software Foundation, Inc.,                                       *
 *   59 Temple Place - Suite 330, Boston, MA  02111-1307, USA.             *
 ***************************************************************************/

#define _GNU_SOURCE /* for getopt_long; must precede the includes */

#include <stdlib.h>
#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>
#include <getopt.h>
#include <sys/personality.h>

/* Personality bit for huge page backed stack */
#ifndef HUGETLB_STACK
#define HUGETLB_STACK 0x0020000
#endif

void print_usage()
{
	fprintf(stderr, "hugectl [options] target\n");
	fprintf(stderr, "options:\n");
	fprintf(stderr, " --help,  -h  Prints this message.\n");
	fprintf(stderr,
		" --stack, -s  Attempts to execute target program with a hugetlb page backed stack.\n");
}

void set_huge_stack()
{
	char * err;
	unsigned long curr_per = personality(0xffffffff);
	if (personality(curr_per | HUGETLB_STACK) == -1) {
		err = strerror(errno);
		fprintf(stderr,
			"Error setting HUGE_STACK personality flag: '%s'\n",
			err);
		exit(-1);
	}
}

int main(int argc, char** argv)
{
	char opts [] = "+hs";
	int ret = 0, index = 0;
	struct option long_opts [] = {
		{"help",          0, 0, 'h'},
		{"stack",         0, 0, 's'},
		{0,               0, 0, 0},
	};

	if (argc < 2) {
		print_usage();
		return 0;
	}

	while (ret != -1) {
		ret = getopt_long(argc, argv, opts, long_opts, &index);
		switch (ret) {
		case 's':
			set_huge_stack();
			break;

		case '?':
		case 'h':
			print_usage();
			return 0;

		case -1:
			break;

		default:
			ret = -1;
			break;
		}
	}
	index = optind;

	if (execvp(argv[index], &argv[index]) == -1) {
		ret = errno;
		fprintf(stderr, "Error calling execvp: '%s'\n", strerror(ret));
		return ret;
	}

	return 0;
}


[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC] [PATCH 0/5 V2] Huge page backed user-space stacks
  2008-07-30  8:43 ` Andrew Morton
@ 2008-07-30 17:23   ` Mel Gorman
  2008-07-30 17:34     ` Andrew Morton
  0 siblings, 1 reply; 38+ messages in thread
From: Mel Gorman @ 2008-07-30 17:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Eric Munson, linux-mm, linux-kernel, linuxppc-dev, libhugetlbfs-devel

On (30/07/08 01:43), Andrew Morton didst pronounce:
> On Mon, 28 Jul 2008 12:17:10 -0700 Eric Munson <ebmunson@us.ibm.com> wrote:
> 
> > Certain workloads benefit if their data or text segments are backed by
> > huge pages.
> 
> oh.  As this is a performance patch, it would be much better if its
> description contained some performance measurement results!  Please.
> 

I ran these patches through STREAM (http://www.cs.virginia.edu/stream/).
STREAM itself was patched to allocate data from the stack instead of statically
for the test. They completed without any problem on x86, x86_64 and PPC64
and each test showed a performance gain from using hugepages.  I can post
the raw figures but they are not currently in an eye-friendly format. Here
are some plots of the data though;

x86: http://www.csn.ul.ie/~mel/postings/stack-backing-20080730/x86-stream-stack.ps
x86_64: http://www.csn.ul.ie/~mel/postings/stack-backing-20080730/x86_64-stream-stack.ps
ppc64-small: http://www.csn.ul.ie/~mel/postings/stack-backing-20080730/ppc64-small-stream-stack.ps
ppc64-large: http://www.csn.ul.ie/~mel/postings/stack-backing-20080730/ppc64-large-stream-stack.ps

The test was to run STREAM with different array sizes (plotted on X-axis)
and measure the average throughput (y-axis). In each case, backing the stack
with large pages yielded a performance gain.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC] [PATCH 0/5 V2] Huge page backed user-space stacks
  2008-07-30 17:23   ` Mel Gorman
@ 2008-07-30 17:34     ` Andrew Morton
  2008-07-30 19:30       ` Mel Gorman
                         ` (2 more replies)
  0 siblings, 3 replies; 38+ messages in thread
From: Andrew Morton @ 2008-07-30 17:34 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Eric Munson, linux-mm, linux-kernel, linuxppc-dev, libhugetlbfs-devel

On Wed, 30 Jul 2008 18:23:18 +0100 Mel Gorman <mel@csn.ul.ie> wrote:

> On (30/07/08 01:43), Andrew Morton didst pronounce:
> > On Mon, 28 Jul 2008 12:17:10 -0700 Eric Munson <ebmunson@us.ibm.com> wrote:
> > 
> > > Certain workloads benefit if their data or text segments are backed by
> > > huge pages.
> > 
> > oh.  As this is a performance patch, it would be much better if its
> > description contained some performance measurement results!  Please.
> > 
> 
> I ran these patches through STREAM (http://www.cs.virginia.edu/stream/).
> STREAM itself was patched to allocate data from the stack instead of statically
> for the test. They completed without any problem on x86, x86_64 and PPC64
> and each test showed a performance gain from using hugepages.  I can post
> the raw figures but they are not currently in an eye-friendly format. Here
> are some plots of the data though;
> 
> x86: http://www.csn.ul.ie/~mel/postings/stack-backing-20080730/x86-stream-stack.ps
> x86_64: http://www.csn.ul.ie/~mel/postings/stack-backing-20080730/x86_64-stream-stack.ps
> ppc64-small: http://www.csn.ul.ie/~mel/postings/stack-backing-20080730/ppc64-small-stream-stack.ps
> ppc64-large: http://www.csn.ul.ie/~mel/postings/stack-backing-20080730/ppc64-large-stream-stack.ps
> 
> The test was to run STREAM with different array sizes (plotted on X-axis)
> and measure the average throughput (y-axis). In each case, backing the stack
> with large pages yielded a performance gain.

So about a 10% speedup on x86 for most STREAM configurations.  Handy -
that's somewhat larger than most hugepage-conversions, iirc.

Do we expect that this change will be replicated in other
memory-intensive apps?  (I do).


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC] [PATCH 0/5 V2] Huge page backed user-space stacks
  2008-07-30 17:34     ` Andrew Morton
@ 2008-07-30 19:30       ` Mel Gorman
  2008-07-30 19:40         ` Christoph Lameter
  2008-07-30 20:07         ` Andrew Morton
  2008-07-31  6:04       ` Nick Piggin
  2008-08-06 18:49       ` Andi Kleen
  2 siblings, 2 replies; 38+ messages in thread
From: Mel Gorman @ 2008-07-30 19:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Eric Munson, linux-mm, linux-kernel, linuxppc-dev,
	libhugetlbfs-devel, Andrew Hastings

On (30/07/08 10:34), Andrew Morton didst pronounce:
> On Wed, 30 Jul 2008 18:23:18 +0100 Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > On (30/07/08 01:43), Andrew Morton didst pronounce:
> > > On Mon, 28 Jul 2008 12:17:10 -0700 Eric Munson <ebmunson@us.ibm.com> wrote:
> > > 
> > > > Certain workloads benefit if their data or text segments are backed by
> > > > huge pages.
> > > 
> > > oh.  As this is a performance patch, it would be much better if its
> > > description contained some performance measurement results!  Please.
> > > 
> > 
> > I ran these patches through STREAM (http://www.cs.virginia.edu/stream/).
> > STREAM itself was patched to allocate data from the stack instead of statically
> > for the test. They completed without any problem on x86, x86_64 and PPC64
> > and each test showed a performance gain from using hugepages.  I can post
> > the raw figures but they are not currently in an eye-friendly format. Here
> > are some plots of the data though;
> > 
> > x86: http://www.csn.ul.ie/~mel/postings/stack-backing-20080730/x86-stream-stack.ps
> > x86_64: http://www.csn.ul.ie/~mel/postings/stack-backing-20080730/x86_64-stream-stack.ps
> > ppc64-small: http://www.csn.ul.ie/~mel/postings/stack-backing-20080730/ppc64-small-stream-stack.ps
> > ppc64-large: http://www.csn.ul.ie/~mel/postings/stack-backing-20080730/ppc64-large-stream-stack.ps
> > 
> > The test was to run STREAM with different array sizes (plotted on X-axis)
> > and measure the average throughput (y-axis). In each case, backing the stack
> > with large pages yielded a performance gain.
> 
> So about a 10% speedup on x86 for most STREAM configurations.  Handy -
> that's somewhat larger than most hugepage-conversions, iirc.
> 

It is a bit. Usually, I expect around 5%.

> Do we expect that this change will be replicated in other
> memory-intensive apps?  (I do).
> 

I expect so. I know SpecCPU has some benchmarks that are stack-dependent and
would benefit from this patchset. I haven't experimented enough yet with other
workloads to give a decent estimate. I've added Andrew Hastings to the cc as
I believe he can make a good estimate of what sort of gains are to be had by
backing the stack with huge pages, based on experiments along those lines. Andrew?

With Eric's patch and libhugetlbfs, we can automatically back text/data[1],
malloc[2] and stacks without source modification. Fairly soon, libhugetlbfs
will also be able to override shmget() to add SHM_HUGETLB. That should cover
a lot of the memory-intensive apps without source modification.

[1] It can partially remap non-hugepage-aligned segments but ideally the
application would be relinked

[2] Allocated via the morecore hook in glibc

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC] [PATCH 0/5 V2] Huge page backed user-space stacks
  2008-07-30 19:30       ` Mel Gorman
@ 2008-07-30 19:40         ` Christoph Lameter
  2008-07-30 20:07         ` Andrew Morton
  1 sibling, 0 replies; 38+ messages in thread
From: Christoph Lameter @ 2008-07-30 19:40 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Eric Munson, linux-mm, linux-kernel, linuxppc-dev,
	libhugetlbfs-devel, Andrew Hastings

Mel Gorman wrote:

> With Eric's patch and libhugetlbfs, we can automatically back text/data[1],
> malloc[2] and stacks without source modification. Fairly soon, libhugetlbfs
> will also be able to override shmget() to add SHM_HUGETLB. That should cover
> a lot of the memory-intensive apps without source modification.

So we are quite far down the road to having a VM that supports two page sizes, 4k and 2M?


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC] [PATCH 0/5 V2] Huge page backed user-space stacks
  2008-07-30 19:30       ` Mel Gorman
  2008-07-30 19:40         ` Christoph Lameter
@ 2008-07-30 20:07         ` Andrew Morton
  2008-07-31 10:31           ` Mel Gorman
  1 sibling, 1 reply; 38+ messages in thread
From: Andrew Morton @ 2008-07-30 20:07 UTC (permalink / raw)
  To: Mel Gorman
  Cc: ebmunson, linux-mm, linux-kernel, linuxppc-dev, libhugetlbfs-devel, abh

On Wed, 30 Jul 2008 20:30:10 +0100
Mel Gorman <mel@csn.ul.ie> wrote:

> With Eric's patch and libhugetlbfs, we can automatically back text/data[1],
> malloc[2] and stacks without source modification. Fairly soon, libhugetlbfs
> will also be able to override shmget() to add SHM_HUGETLB. That should cover
> a lot of the memory-intensive apps without source modification.

The weak link in all of this still might be the need to reserve
hugepages and the unreliability of dynamically allocating them.

The dynamic allocation should be better nowadays, but I've lost track
of how reliable it really is.  What's our status there?

Thanks.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC] [PATCH 0/5 V2] Huge page backed user-space stacks
  2008-07-30 17:34     ` Andrew Morton
  2008-07-30 19:30       ` Mel Gorman
@ 2008-07-31  6:04       ` Nick Piggin
  2008-07-31  6:14         ` Andrew Morton
  2008-08-06 18:49       ` Andi Kleen
  2 siblings, 1 reply; 38+ messages in thread
From: Nick Piggin @ 2008-07-31  6:04 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Eric Munson, linux-mm, linux-kernel, linuxppc-dev,
	libhugetlbfs-devel

On Thursday 31 July 2008 03:34, Andrew Morton wrote:
> On Wed, 30 Jul 2008 18:23:18 +0100 Mel Gorman <mel@csn.ul.ie> wrote:
> > On (30/07/08 01:43), Andrew Morton didst pronounce:
> > > On Mon, 28 Jul 2008 12:17:10 -0700 Eric Munson <ebmunson@us.ibm.com> 
wrote:
> > > > Certain workloads benefit if their data or text segments are backed
> > > > by huge pages.
> > >
> > > oh.  As this is a performance patch, it would be much better if its
> > > description contained some performance measurement results!  Please.
> >
> > I ran these patches through STREAM (http://www.cs.virginia.edu/stream/).
> > STREAM itself was patched to allocate data from the stack instead of
> > statically for the test. They completed without any problem on x86,
> > x86_64 and PPC64 and each test showed a performance gain from using
> > hugepages.  I can post the raw figures but they are not currently in an
> > eye-friendly format. Here are some plots of the data though;
> >
> > x86: http://www.csn.ul.ie/~mel/postings/stack-backing-20080730/x86-stream-stack.ps
> > x86_64: http://www.csn.ul.ie/~mel/postings/stack-backing-20080730/x86_64-stream-stack.ps
> > ppc64-small: http://www.csn.ul.ie/~mel/postings/stack-backing-20080730/ppc64-small-stream-stack.ps
> > ppc64-large: http://www.csn.ul.ie/~mel/postings/stack-backing-20080730/ppc64-large-stream-stack.ps
> >
> > The test was to run STREAM with different array sizes (plotted on X-axis)
> > and measure the average throughput (y-axis). In each case, backing the
> > stack with large pages yielded a performance gain.
>
> So about a 10% speedup on x86 for most STREAM configurations.  Handy -
> that's somewhat larger than most hugepage-conversions, iirc.

Although it might be a bit unusual to have codes doing huge streaming
memory operations on stack memory...

We can see why IBM is so keen on their hugepages though :)


> Do we expect that this change will be replicated in other
> memory-intensive apps?  (I do).

Such as what? It would be nice to see some numbers with some HPC or java
or DBMS workload using this. Not that I dispute it will help some cases,
but 10% (or 20% for ppc) I guess is getting toward the best case, short
of a specifically written TLB thrasher.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC] [PATCH 0/5 V2] Huge page backed user-space stacks
  2008-07-31  6:04       ` Nick Piggin
@ 2008-07-31  6:14         ` Andrew Morton
  2008-07-31  6:26           ` Nick Piggin
  0 siblings, 1 reply; 38+ messages in thread
From: Andrew Morton @ 2008-07-31  6:14 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Mel Gorman, Eric Munson, linux-mm, linux-kernel, linuxppc-dev,
	libhugetlbfs-devel

On Thu, 31 Jul 2008 16:04:14 +1000 Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> > Do we expect that this change will be replicated in other
> > memory-intensive apps?  (I do).
> 
> Such as what? It would be nice to see some numbers with some HPC or java
> or DBMS workload using this. Not that I dispute it will help some cases,
> but 10% (or 20% for ppc) I guess is getting toward the best case, short
> of a specifically written TLB thrasher.

I didn't realise that STREAM was using vast amounts of automatic memory.
I'd assumed that it was using sane amounts of stack, but the stack TLB
slots were getting zapped by all the heap-memory activity.  Oh well.

I guess that effect is still there, but smaller.

I agree that few real-world apps are likely to see gains of this
order.  More benchmarks, please :)

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC] [PATCH 0/5 V2] Huge page backed user-space stacks
  2008-07-31  6:14         ` Andrew Morton
@ 2008-07-31  6:26           ` Nick Piggin
  2008-07-31 11:27             ` Mel Gorman
  0 siblings, 1 reply; 38+ messages in thread
From: Nick Piggin @ 2008-07-31  6:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Eric Munson, linux-mm, linux-kernel, linuxppc-dev,
	libhugetlbfs-devel

On Thursday 31 July 2008 16:14, Andrew Morton wrote:
> On Thu, 31 Jul 2008 16:04:14 +1000 Nick Piggin <nickpiggin@yahoo.com.au> 
wrote:
> > > Do we expect that this change will be replicated in other
> > > memory-intensive apps?  (I do).
> >
> > Such as what? It would be nice to see some numbers with some HPC or java
> > or DBMS workload using this. Not that I dispute it will help some cases,
> > but 10% (or 20% for ppc) I guess is getting toward the best case, short
> > of a specifically written TLB thrasher.
>
> I didn't realise that STREAM was using vast amounts of automatic memory.
> I'd assumed that it was using sane amounts of stack, but the stack TLB
> slots were getting zapped by all the heap-memory activity.  Oh well.

An easy mistake to make because that's probably how STREAM would normally
work. I think what Mel had done is to modify the stream kernel so as to
have it operate on arrays of stack memory.


> I guess that effect is still there, but smaller.

I imagine it should be, unless you're using a CPU with separate TLBs for
small and huge pages, and your large data set is mapped with huge pages,
in which case you might now introduce *new* TLB contention between the
stack and the dataset :)

Also, interestingly I have actually seen some CPUs whose memory operations
get significantly slower when operating on large pages than small (in the
case when there is full TLB coverage for both sizes). This would make
sense if the CPU only implements a fast L1 TLB for small pages.

So for the vast majority of workloads, where stacks are relatively small
(or slowly changing), and relatively hot, I suspect this could easily have
no benefit at best and slowdowns at worst.

But I'm not saying that as a reason not to merge it -- this is no
different from any other hugepage allocations and as usual they have to be
used selectively where they help.... I just wonder exactly where huge
stacks will help.


> I agree that few real-world apps are likely to see gains of this
> order.  More benchmarks, please :)

Would be nice, if just out of morbid curiosity :)

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC] [PATCH 0/5 V2] Huge page backed user-space stacks
  2008-07-30 20:07         ` Andrew Morton
@ 2008-07-31 10:31           ` Mel Gorman
  2008-08-04 21:10             ` Dave Hansen
  0 siblings, 1 reply; 38+ messages in thread
From: Mel Gorman @ 2008-07-31 10:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: ebmunson, linux-mm, linux-kernel, linuxppc-dev, libhugetlbfs-devel, abh

On (30/07/08 13:07), Andrew Morton didst pronounce:
> On Wed, 30 Jul 2008 20:30:10 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > With Eric's patch and libhugetlbfs, we can automatically back text/data[1],
> > malloc[2] and stacks without source modification. Fairly soon, libhugetlbfs
> > will also be able to override shmget() to add SHM_HUGETLB. That should cover
> > a lot of the memory-intensive apps without source modification.
> 
> The weak link in all of this still might be the need to reserve
> hugepages and the unreliability of dynamically allocating them.
> 
> The dynamic allocation should be better nowadays, but I've lost track
> of how reliable it really is.  What's our status there?
> 

We are a lot more reliable than we were although exact quantification is
difficult because it's workload dependent. For a long time, I've been able
to test bits and pieces with hugepages by allocating the pool at the time
I needed it even after days of uptime. Previously this required a reboot.

I've also been able to use the dynamic hugepage pool resizing effectively
and we track how much it is succeeding and failing in /proc/vmstat (see the
htlb fields) to watch for problems. Between that and /proc/pagetypeinfo, I am
expecting to be able to identify availability problems. As an administrator
can now set a minimum pool size and the maximum size of the pool (nr_hugepages
and nr_overcommit_hugepages), the configuration difficulties should be relaxed.

If it is found that anti-fragmentation can be broken down and pool
resizing starts failing after X amount of time on Y workloads, there is
still the option of using movablecore=BiggestPoolSizeIWillEverNeed
and writing 1 to /proc/sys/vm/hugepages_treat_as_movable so the hugepage
pool can grow/shrink reliably there.

Overall, it's in pretty good shape.

To be fair, one snag is that swap is almost required for pool
resizing to work as I never pushed to complete memory compaction
(http://lwn.net/Articles/238837/).  Hence, we depend on the workload to
have lots of filesystem-backed data for lumpy-reclaim to do its job, for
pool resizing to take place between batch jobs or for swap to be configured
even if it's just for the duration of a pool resize.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC] [PATCH 0/5 V2] Huge page backed user-space stacks
  2008-07-31  6:26           ` Nick Piggin
@ 2008-07-31 11:27             ` Mel Gorman
  2008-07-31 11:51               ` Nick Piggin
  0 siblings, 1 reply; 38+ messages in thread
From: Mel Gorman @ 2008-07-31 11:27 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Eric Munson, linux-mm, linux-kernel, linuxppc-dev,
	libhugetlbfs-devel

On (31/07/08 16:26), Nick Piggin didst pronounce:
> On Thursday 31 July 2008 16:14, Andrew Morton wrote:
> > On Thu, 31 Jul 2008 16:04:14 +1000 Nick Piggin <nickpiggin@yahoo.com.au> 
> wrote:
> > > > Do we expect that this change will be replicated in other
> > > > memory-intensive apps?  (I do).
> > >
> > > Such as what? It would be nice to see some numbers with some HPC or java
> > > or DBMS workload using this. Not that I dispute it will help some cases,
> > > but 10% (or 20% for ppc) I guess is getting toward the best case, short
> > > of a specifically written TLB thrasher.
> >
> > I didn't realise the STREAM is using vast amounts of automatic memory.
> > I'd assumed that it was using sane amounts of stack, but the stack TLB
> > slots were getting zapped by all the heap-memory activity.  Oh well.
> 
> An easy mistake to make because that's probably how STREAM would normally
> work. I think what Mel had done is to modify the stream kernel so as to
> have it operate on arrays of stack memory.
> 

Yes, I mentioned in the mail that STREAM was patched to use the stack for
its data. It was as much to show the patches working as advertised,
even though it is obviously an extreme case.

I had seen stack-hugepage-backing as something that would improve performance
in addition to something else as opposed to having to stand entirely on its
own. For example, I would expect many memory-intensive applications to gain
by just having malloc and stack backed more than backing either in isolation.

> > I guess that effect is still there, but smaller.
> 
> I imagine it should be, unless you're using a CPU with separate TLBs for
> small and huge pages, and your large data set is mapped with huge pages,
> in which case you might now introduce *new* TLB contention between the
> stack and the dataset :)

Yes, this can happen particularly on older CPUs. For example, the
Pentium III in my crash-test laptop reports

TLB and cache info:
01: Instruction TLB: 4KB pages, 4-way set assoc, 32 entries
02: Instruction TLB: 4MB pages, 4-way set assoc, 2 entries

so a workload that sparsely addressed memory (i.e. >= 4MB strides on each
reference) might suffer more TLB misses with large pages than with small.
It's hardly new that there is uncertainty around when, where and whether
hugepages are of benefit.

> Also, interestingly I have actually seen some CPUs whose memory operations
> get significantly slower when operating on large pages than small (in the
> case when there is full TLB coverage for both sizes). This would make
> sense if the CPU only implements a fast L1 TLB for small pages.
> 

It's also possible there is a micro-TLB involved that only supports small
pages. It's been the case for a while that what wins on one machine type
may lose on another.

> So for the vast majority of workloads, where stacks are relatively small
> (or slowly changing), and relatively hot, I suspect this could easily have
> no benefit at best and slowdowns at worst.
> 

I wouldn't expect an application with small stacks to request its stack
be backed by hugepages either. Ideally, it would be enabled because a
large enough number of DTLB misses were found to be in the stack,
although catching this sort of data is tricky.

> But I'm not saying that as a reason not to merge it -- this is no
> different from any other hugepage allocations and as usual they have to be
> used selectively where they help.... I just wonder exactly where huge
> stacks will help.
> 

Benchmark-wise, SPECcpu and SPEComp have stack-dependent benchmarks.
I would expect computations that partition problems with recursion to benefit,
as well as some JVMs that use the stack heavily (see how many docs suggest
setting ulimit -s unlimited). A bit out there, but stack-based languages would
stand to gain from this. The potential gap is for threaded apps, as there will
be stacks that are not the "main" stack.  Backing those with hugepages depends
on how they are allocated (malloc: easy; MAP_ANONYMOUS: not so much).

> > I agree that few real-world apps are likely to see gains of this
> > order.  More benchmarks, please :)
> 
> Would be nice, if just out of morbid curiosity :)
> 

Benchmarks will happen, they just take time, you know the way. The STREAM one
in the meantime is a "this works" demonstration of the effect. I'm hoping
Andrew Hastings will have figures at hand; I cc'd him elsewhere in the thread
for comment.

Thanks

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC] [PATCH 0/5 V2] Huge page backed user-space stacks
  2008-07-31 11:27             ` Mel Gorman
@ 2008-07-31 11:51               ` Nick Piggin
  2008-07-31 13:50                 ` Mel Gorman
  0 siblings, 1 reply; 38+ messages in thread
From: Nick Piggin @ 2008-07-31 11:51 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Eric Munson, linux-mm, linux-kernel, linuxppc-dev,
	libhugetlbfs-devel

On Thursday 31 July 2008 21:27, Mel Gorman wrote:
> On (31/07/08 16:26), Nick Piggin didst pronounce:

> > I imagine it should be, unless you're using a CPU with separate TLBs for
> > small and huge pages, and your large data set is mapped with huge pages,
> > in which case you might now introduce *new* TLB contention between the
> > stack and the dataset :)
>
> Yes, this can happen particularly on older CPUs. For example, the
> Pentium III in my crash-test laptop reports
>
> TLB and cache info:
> 01: Instruction TLB: 4KB pages, 4-way set assoc, 32 entries
> 02: Instruction TLB: 4MB pages, 4-way set assoc, 2 entries

Oh? Newer CPUs tend to have unified TLBs?


> > Also, interestingly I have actually seen some CPUs whose memory operations
> > get significantly slower when operating on large pages than small (in the
> > case when there is full TLB coverage for both sizes). This would make
> > sense if the CPU only implements a fast L1 TLB for small pages.
>
> It's also possible there is a micro-TLB involved that only supports small
> pages.

That is the case on a couple of contemporary CPUs I've tested with
(granted, they are engineering samples, but I don't expect that to be
the cause).


> > So for the vast majority of workloads, where stacks are relatively small
> > (or slowly changing), and relatively hot, I suspect this could easily
> > have no benefit at best and slowdowns at worst.
>
> I wouldn't expect an application with small stacks to request its stack
> to be backed by hugepages either. Ideally, it would be enabled because a
> large enough number of DTLB misses were found to be in the stack
> although catching this sort of data is tricky.

Sure, as I said, I have nothing against this functionality just because
it has the possibility to cause a regression. I was just pointing out
there are a few possibilities there, so it will take a particular type
of app to take advantage of it. Ie. it is not something you would ever
just enable "just in case the stack starts thrashing the TLB".


> > But I'm not saying that as a reason not to merge it -- this is no
> > different from any other hugepage allocations and as usual they have to
> > be used selectively where they help.... I just wonder exactly where huge
> > stacks will help.
>
> Benchmark wise, SPECcpu and SPEComp have stack-dependent benchmarks.
> Computations that partition problems with recursion I would expect to
> benefit as well as some JVMs that heavily use the stack (see how many docs
> suggest setting ulimit -s unlimited). Bit out there, but stack-based
> languages would stand to gain by this. The potential gap is for threaded
> apps as there will be stacks that are not the "main" stack.  Backing those
> with hugepages depends on how they are allocated (malloc, it's easy,
> MAP_ANONYMOUS not so much).

Oh good, then there should be lots of possibilities to demonstrate it.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC] [PATCH 0/5 V2] Huge page backed user-space stacks
  2008-07-31 11:51               ` Nick Piggin
@ 2008-07-31 13:50                 ` Mel Gorman
  2008-07-31 14:32                   ` Michael Ellerman
  0 siblings, 1 reply; 38+ messages in thread
From: Mel Gorman @ 2008-07-31 13:50 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Eric Munson, linux-mm, linux-kernel, linuxppc-dev,
	libhugetlbfs-devel

On (31/07/08 21:51), Nick Piggin didst pronounce:
> On Thursday 31 July 2008 21:27, Mel Gorman wrote:
> > On (31/07/08 16:26), Nick Piggin didst pronounce:
> 
> > > I imagine it should be, unless you're using a CPU with separate TLBs for
> > > small and huge pages, and your large data set is mapped with huge pages,
> > > in which case you might now introduce *new* TLB contention between the
> > > stack and the dataset :)
> >
> > Yes, this can happen particularly on older CPUs. For example, the
> > Pentium III in my crash-test laptop reports
> >
> > TLB and cache info:
> > 01: Instruction TLB: 4KB pages, 4-way set assoc, 32 entries
> > 02: Instruction TLB: 4MB pages, 4-way set assoc, 2 entries
> 
> Oh? Newer CPUs tend to have unified TLBs?
> 

I've seen more unified DTLBs (the ITLB tends to be split) than not, but it
could just be where I'm looking. For example, on the machine I'm writing
this on (Core Duo), it's

TLB and cache info:
51: Instruction TLB: 4KB and 2MB or 4MB pages, 128 entries
5b: Data TLB: 4KB and 4MB pages, 64 entries

The DTLB is unified there, but on my T60p laptop, where I guess they want
the CPU to use less power and cost less, it's

TLB info
 Instruction TLB: 4K pages, 4-way associative, 128 entries.
 Instruction TLB: 4MB pages, fully associative, 2 entries
 Data TLB: 4K pages, 4-way associative, 128 entries.
 Data TLB: 4MB pages, 4-way associative, 8 entries

So I would expect huge pages to be slower there than in other cases.
On one Xeon, I see 32 entries for huge pages and 256 for small pages, so
it's not straightforward to predict. On another Xeon, I see the DTLB is
64 entries, unified.

To make all this more complex, huge pages can be a win because less L2 cache
is consumed on page table information. The gains are then due to fewer accesses
to main memory and have less to do with TLB misses. So let's say we do have a
TLB that is set-associative with very few large page entries; it could still
end up winning because the increased usage of L2 offsets the increased TLB
misses. Predicting when huge pages are a win and when they are a loss is
just not particularly straightforward.

> 
> > > Also, interestingly I have actually seen some CPUs whose memory operations
> > > get significantly slower when operating on large pages than small (in the
> > > case when there is full TLB coverage for both sizes). This would make
> > > sense if the CPU only implements a fast L1 TLB for small pages.
> >
> > It's also possible there is a micro-TLB involved that only supports small
> > pages.
> 
> That is the case on a couple of contemporary CPUs I've tested with
> (granted, they are engineering samples, but I don't expect that to be
> the cause).
> 

I found it hard to determine whether the CPU I was using had a uTLB or not.
The manuals didn't cover the subject, but it was a theory as to why large
pages might be slower on a particular CPU. Whatever the reason, I'm ok
admitting that large pages can be slower on smaller data sets and in
other situations. It's not a major surprise.

> 
> > > So for the vast majority of workloads, where stacks are relatively small
> > > (or slowly changing), and relatively hot, I suspect this could easily
> > > have no benefit at best and slowdowns at worst.
> >
> > I wouldn't expect an application with small stacks to request its stack
> > to be backed by hugepages either. Ideally, it would be enabled because a
> > large enough number of DTLB misses were found to be in the stack
> > although catching this sort of data is tricky.
> 
> Sure, as I said, I have nothing against this functionality just because
> it has the possibility to cause a regression. I was just pointing out
> there are a few possibilities there, so it will take a particular type
> of app to take advantage of it. Ie. it is not something you would ever
> just enable "just in case the stack starts thrashing the TLB".
> 

No, it's something you'd enable because you know your app is using a lot
of stack. If you are lazy, you might do a test run of the app with it
enabled for the sake of curiosity and take the option that's faster :)

> 
> > > But I'm not saying that as a reason not to merge it -- this is no
> > > different from any other hugepage allocations and as usual they have to
> > > be used selectively where they help.... I just wonder exactly where huge
> > > stacks will help.
> >
> > Benchmark wise, SPECcpu and SPEComp have stack-dependent benchmarks.
> > Computations that partition problems with recursion I would expect to
> > benefit as well as some JVMs that heavily use the stack (see how many docs
> > suggest setting ulimit -s unlimited). Bit out there, but stack-based
> > languages would stand to gain by this. The potential gap is for threaded
> > apps as there will be stacks that are not the "main" stack.  Backing those
> > with hugepages depends on how they are allocated (malloc, it's easy,
> > MAP_ANONYMOUS not so much).
> 
> Oh good, then there should be lots of possibilities to demonstrate it.
> 

There should :)

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC] [PATCH 0/5 V2] Huge page backed user-space stacks
  2008-07-31 13:50                 ` Mel Gorman
@ 2008-07-31 14:32                   ` Michael Ellerman
  0 siblings, 0 replies; 38+ messages in thread
From: Michael Ellerman @ 2008-07-31 14:32 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Nick Piggin, linux-mm, libhugetlbfs-devel, linux-kernel,
	linuxppc-dev, Eric Munson, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 2059 bytes --]

On Thu, 2008-07-31 at 14:50 +0100, Mel Gorman wrote:
> On (31/07/08 21:51), Nick Piggin didst pronounce:
> > On Thursday 31 July 2008 21:27, Mel Gorman wrote:
> > > On (31/07/08 16:26), Nick Piggin didst pronounce:
> > 
> > > > I imagine it should be, unless you're using a CPU with separate TLBs for
> > > > small and huge pages, and your large data set is mapped with huge pages,
> > > > in which case you might now introduce *new* TLB contention between the
> > > > stack and the dataset :)
> > >
> > > Yes, this can happen particularly on older CPUs. For example, the
> > > Pentium III in my crash-test laptop reports
> > >
> > > TLB and cache info:
> > > 01: Instruction TLB: 4KB pages, 4-way set assoc, 32 entries
> > > 02: Instruction TLB: 4MB pages, 4-way set assoc, 2 entries
> > 
> > Oh? Newer CPUs tend to have unified TLBs?
> > 
> 
> I've seen more unified DTLBs (the ITLB tends to be split) than not, but it
> could just be where I'm looking. For example, on the machine I'm writing
> this on (Core Duo), it's
> 
> TLB and cache info:
> 51: Instruction TLB: 4KB and 2MB or 4MB pages, 128 entries
> 5b: Data TLB: 4KB and 4MB pages, 64 entries
> 
> DTLB is unified there but on my T60p laptop where I guess they want the CPU
> to be using less power and be cheaper, it's
> 
> TLB info
>  Instruction TLB: 4K pages, 4-way associative, 128 entries.
>  Instruction TLB: 4MB pages, fully associative, 2 entries
>  Data TLB: 4K pages, 4-way associative, 128 entries.
>  Data TLB: 4MB pages, 4-way associative, 8 entries

Clearly I've been living under a rock, but I didn't know one could get
such nicely formatted info.

In case I'm not the only one, a bit of googling turned up "x86info",
courtesy of davej - apt-get'able and presumably yum'able too.
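For anyone else under the same rock, a sketch of the commands involved
(package names are the era's Debian/Fedora ones; the -c flag per x86info's
own help output):

```shell
# x86info decodes the CPUID cache/TLB descriptors shown above
apt-get install x86info     # or: yum install x86info

# -c restricts the output to cache and TLB information
x86info -c
```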

cheers

-- 
Michael Ellerman
OzLabs, IBM Australia Development Lab

wwweb: http://michael.ellerman.id.au
phone: +61 2 6212 1183 (tie line 70 21183)

We do not inherit the earth from our ancestors,
we borrow it from our children. - S.M.A.R.T Person

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC] [PATCH 0/5 V2] Huge page backed user-space stacks
  2008-07-31 10:31           ` Mel Gorman
@ 2008-08-04 21:10             ` Dave Hansen
  2008-08-05 11:11               ` Mel Gorman
  0 siblings, 1 reply; 38+ messages in thread
From: Dave Hansen @ 2008-08-04 21:10 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, ebmunson, linux-mm, linux-kernel, linuxppc-dev,
	libhugetlbfs-devel, abh

On Thu, 2008-07-31 at 11:31 +0100, Mel Gorman wrote:
> We are a lot more reliable than we were although exact quantification is
> difficult because it's workload dependent. For a long time, I've been able
> to test bits and pieces with hugepages by allocating the pool at the time
> I needed it even after days of uptime. Previously this required a reboot.

This is also a pretty big expansion of fs/hugetlb/ use outside of the
filesystem itself.  It is hacking the existing shared memory
kernel-internal user to spit out effectively anonymous memory.

Where do we draw the line where we stop using the filesystem for this?
Other than the immediate code reuse, does it gain us anything?

I have to think that actually refactoring the filesystem code and making
it usable for really anonymous memory, then using *that* in these
patches would be a lot more sane.  Especially for someone that goes to
look at it in a year. :)

-- Dave


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC] [PATCH 0/5 V2] Huge page backed user-space stacks
  2008-08-04 21:10             ` Dave Hansen
@ 2008-08-05 11:11               ` Mel Gorman
  2008-08-05 16:12                 ` Dave Hansen
  0 siblings, 1 reply; 38+ messages in thread
From: Mel Gorman @ 2008-08-05 11:11 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andrew Morton, ebmunson, linux-mm, linux-kernel, linuxppc-dev,
	libhugetlbfs-devel, abh

On (04/08/08 14:10), Dave Hansen didst pronounce:
> On Thu, 2008-07-31 at 11:31 +0100, Mel Gorman wrote:
> > We are a lot more reliable than we were although exact quantification is
> > difficult because it's workload dependent. For a long time, I've been able
> > to test bits and pieces with hugepages by allocating the pool at the time
> > I needed it even after days of uptime. Previously this required a reboot.
> 
> This is also a pretty big expansion of fs/hugetlb/ use outside of the
> filesystem itself.  It is hacking the existing shared memory
> kernel-internal user to spit out effectively anonymous memory.
> 
> Where do we draw the line where we stop using the filesystem for this?
> Other than the immediate code reuse, does it gain us anything?
> 
> I have to think that actually refactoring the filesystem code and making
> it usable for really anonymous memory, then using *that* in these
> patches would be a lot more sane.  Especially for someone that goes to
> look at it in a year. :)
> 

See, that's great until you start dealing with MAP_SHARED|MAP_ANONYMOUS.
To get that right between children, you end up with something very fs-like
when the child needs to fault in a page that is already populated by the
parent. I strongly suspect we'd end up back at hugetlbfs backing it :/

If you were going to do such a thing, you'd end up converting something
like ramfs to hugetlbfs and sharing that.


-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC] [PATCH 0/5 V2] Huge page backed user-space stacks
  2008-08-05 11:11               ` Mel Gorman
@ 2008-08-05 16:12                 ` Dave Hansen
  2008-08-05 16:28                   ` Mel Gorman
  0 siblings, 1 reply; 38+ messages in thread
From: Dave Hansen @ 2008-08-05 16:12 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, ebmunson, linux-mm, linux-kernel, linuxppc-dev,
	libhugetlbfs-devel, abh

On Tue, 2008-08-05 at 12:11 +0100, Mel Gorman wrote:
> See, that's great until you start dealing with MAP_SHARED|MAP_ANONYMOUS.
> To get that right between children, you end up with something very fs-like
> when the child needs to fault in a page that is already populated by the
> parent. I strongly suspect we end up back at hugetlbfs backing it :/

Yeah, but the case I'm worried about is plain anonymous.  We already
have the fs to back SHARED|ANONYMOUS, and they're not really
anonymous. :)

This patch *really* needs anonymous pages, and it kinda shoehorns them
in with the filesystem.  Stacks aren't shared at all, so this is a
perfect example of where we can forget the fs, right?

-- Dave


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC] [PATCH 0/5 V2] Huge page backed user-space stacks
  2008-08-05 16:12                 ` Dave Hansen
@ 2008-08-05 16:28                   ` Mel Gorman
  2008-08-05 17:53                     ` Dave Hansen
  0 siblings, 1 reply; 38+ messages in thread
From: Mel Gorman @ 2008-08-05 16:28 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andrew Morton, ebmunson, linux-mm, linux-kernel, linuxppc-dev,
	libhugetlbfs-devel, abh

On (05/08/08 09:12), Dave Hansen didst pronounce:
> On Tue, 2008-08-05 at 12:11 +0100, Mel Gorman wrote:
> > See, that's great until you start dealing with MAP_SHARED|MAP_ANONYMOUS.
> > To get that right between children, you end up with something very fs-like
> > when the child needs to fault in a page that is already populated by the
> > parent. I strongly suspect we end up back at hugetlbfs backing it :/
> 
> Yeah, but the case I'm worried about is plain anonymous.  We already
> have the fs to back SHARED|ANONYMOUS, and they're not really
> anonymous. :)
> 
> This patch *really* needs anonymous pages, and it kinda shoehorns them
> in with the filesystem.  Stacks aren't shared at all, so this is a
> perfect example of where we can forget the fs, right?
> 

Ok sure, you could do direct inserts for MAP_PRIVATE, as conceptually it
suits this patch.  However, I don't see what you gain. By reusing hugetlbfs,
we get things like proper reservations, which we can do for MAP_PRIVATE these
days. Again, we could call that sort of thing directly if the reservation
layer were split out from hugetlbfs, but I still don't see the gain
for all that churn.

What am I missing?

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC] [PATCH 0/5 V2] Huge page backed user-space stacks
  2008-08-05 16:28                   ` Mel Gorman
@ 2008-08-05 17:53                     ` Dave Hansen
  2008-08-06  9:02                       ` Mel Gorman
  0 siblings, 1 reply; 38+ messages in thread
From: Dave Hansen @ 2008-08-05 17:53 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, ebmunson, linux-mm, linux-kernel, linuxppc-dev,
	libhugetlbfs-devel, abh

On Tue, 2008-08-05 at 17:28 +0100, Mel Gorman wrote:
> Ok sure, you could do direct inserts for MAP_PRIVATE as conceptually it
> suits this patch.  However, I don't see what you gain. By reusing hugetlbfs,
> we get things like proper reservations which we can do for MAP_PRIVATE these
> days. Again, we could call that sort of thing directly if the reservation
> layer was split out separate from hugetlbfs but I still don't see the gain
> for all that churn.
> 
> What am I missing?

This is good for getting us incremental functionality.  It is probably
the smallest amount of code to get it functional.

My concern is that we're going down a path that all large page usage
should be through the one and only filesystem.  Once we establish that
dependency, it is going to be awfully hard to undo it; just think of all
of the inherent behavior in hugetlbfs.  So, we better be sure that the
filesystem really is the way to go, especially if we're going to start
having other areas of the kernel depend on it internally.

That said, this particular patch doesn't appear *too* bound to hugetlb
itself.  But, some of its limitations *do* come from the filesystem,
like its inability to handle VM_GROWS...  

-- Dave


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC] [PATCH 0/5 V2] Huge page backed user-space stacks
  2008-08-05 17:53                     ` Dave Hansen
@ 2008-08-06  9:02                       ` Mel Gorman
  2008-08-06 19:50                         ` Dave Hansen
  0 siblings, 1 reply; 38+ messages in thread
From: Mel Gorman @ 2008-08-06  9:02 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andrew Morton, ebmunson, linux-mm, linux-kernel, linuxppc-dev,
	libhugetlbfs-devel, abh

On (05/08/08 10:53), Dave Hansen didst pronounce:
> On Tue, 2008-08-05 at 17:28 +0100, Mel Gorman wrote:
> > Ok sure, you could do direct inserts for MAP_PRIVATE as conceptually it
> > suits this patch.  However, I don't see what you gain. By reusing hugetlbfs,
> > we get things like proper reservations which we can do for MAP_PRIVATE these
> > days. Again, we could call that sort of thing directly if the reservation
> > layer was split out separate from hugetlbfs but I still don't see the gain
> > for all that churn.
> > 
> > What am I missing?
> 
> This is good for getting us incremental functionality.  It is probably
> the smallest amount of code to get it functional.
> 

I'm not keen on the idea of introducing another specialised path just for
stacks. Testing coverage is tricky enough as it is and problems still slip
through occasionally. Maybe going through hugetlbfs is less than ideal,
but at least it is a shared path.

> My concern is that we're going down a path that all large page usage
> should be through the one and only filesystem.  Once we establish that
> dependency, it is going to be awfully hard to undo it;

Not much harder than it is to write any alternative in the first place
:/

> just think of all
> of the inherent behavior in hugetlbfs.  So, we better be sure that the
> filesystem really is the way to go, especially if we're going to start
> having other areas of the kernel depend on it internally.
> 

So far, it is working out as a decent model. It is able to track reservations
and deal with the differences between SHARED and PRIVATE without massive
difficulties. While we could add another specialised path to directly insert
the pages into pagetables for private mappings, I find it hard to justify
adding more test coverage problems. There might be minimal gains to be had
in lock granularity but that's about it.

> That said, this particular patch doesn't appear *too* bound to hugetlb
> itself.  But, some of its limitations *do* come from the filesystem,
> like its inability to handle VM_GROWS...  
> 

The lack of VM_GROWSX is an issue, but on its own it does not justify
the amount of churn necessary to support direct pagetable insertions for
MAP_ANONYMOUS|MAP_PRIVATE. I think we'd need another case or two that would
really benefit from direct insertions to pagetables instead of hugetlbfs so
that the path would get adequately tested.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC] [PATCH 0/5 V2] Huge page backed user-space stacks
  2008-07-30 17:34     ` Andrew Morton
  2008-07-30 19:30       ` Mel Gorman
  2008-07-31  6:04       ` Nick Piggin
@ 2008-08-06 18:49       ` Andi Kleen
  2 siblings, 0 replies; 38+ messages in thread
From: Andi Kleen @ 2008-08-06 18:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, linuxppc-dev, libhugetlbfs-devel, linux-kernel, linux-mm

Andrew Morton <akpm@linux-foundation.org> writes:

> Do we expect that this change will be replicated in other
> memory-intensive apps?  (I do).

The catch with 2MB pages on x86 is that x86 CPUs generally have far
fewer 2MB TLB entries than 4K entries. So if you're unlucky and access
a lot of mappings, you might actually thrash more with them. That is
why they are not necessarily a universal win.

-Andi

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC] [PATCH 0/5 V2] Huge page backed user-space stacks
  2008-08-06  9:02                       ` Mel Gorman
@ 2008-08-06 19:50                         ` Dave Hansen
  2008-08-07 16:06                           ` Mel Gorman
  0 siblings, 1 reply; 38+ messages in thread
From: Dave Hansen @ 2008-08-06 19:50 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, ebmunson, linux-mm, linux-kernel, linuxppc-dev,
	libhugetlbfs-devel, abh

On Wed, 2008-08-06 at 10:02 +0100, Mel Gorman wrote:
> > That said, this particular patch doesn't appear *too* bound to hugetlb
> > itself.  But, some of its limitations *do* come from the filesystem,
> > like its inability to handle VM_GROWS...  
> 
> The lack of VM_GROWSX is an issue, but on its own it does not justify
> the amount of churn necessary to support direct pagetable insertions for
> MAP_ANONYMOUS|MAP_PRIVATE. I think we'd need another case or two that would
> really benefit from direct insertions to pagetables instead of hugetlbfs so
> that the path would get adequately tested.

I'm jumping around here a bit, but I'm trying to get to the core of what
my problem with these patches is.  I'll see if I can close the loop
here.

The main thing this set of patches does that I care about is take an
anonymous VMA and replace it with a hugetlb VMA.  It does this on a
special cue, but does it nonetheless.

This patch has crossed a line in that it is really the first
*replacement* of a normal VMA with a hugetlb VMA instead of the creation
of the VMAs at the user's request.  I'm really curious what the plan is
to follow up on this.  Will this stack stuff turn out to be one-off
code, or is this *the* route for getting transparent large pages in the
future?

Because of limitations like its inability to grow the VMA, I can't
imagine that this would be a generic mechanism that we can use
elsewhere.

-- Dave


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC] [PATCH 0/5 V2] Huge page backed user-space stacks
  2008-08-06 19:50                         ` Dave Hansen
@ 2008-08-07 16:06                           ` Mel Gorman
  2008-08-07 17:29                             ` Dave Hansen
  0 siblings, 1 reply; 38+ messages in thread
From: Mel Gorman @ 2008-08-07 16:06 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andrew Morton, ebmunson, linux-mm, linux-kernel, linuxppc-dev,
	libhugetlbfs-devel, abh

On (06/08/08 12:50), Dave Hansen didst pronounce:
> On Wed, 2008-08-06 at 10:02 +0100, Mel Gorman wrote:
> > > That said, this particular patch doesn't appear *too* bound to hugetlb
> > > itself.  But, some of its limitations *do* come from the filesystem,
> > > like its inability to handle VM_GROWS...  
> > 
> > The lack of VM_GROWSX is an issue, but on its own it does not justify
> > the amount of churn necessary to support direct pagetable insertions for
> > MAP_ANONYMOUS|MAP_PRIVATE. I think we'd need another case or two that would
> > really benefit from direct insertions to pagetables instead of hugetlbfs so
> > that the path would get adequately tested.
> 
> I'm jumping around here a bit, but I'm trying to get to the core of what
> my problem with these patches is.  I'll see if I can close the loop
> here.
> 
> The main thing this set of patches does that I care about is take an
> anonymous VMA and replace it with a hugetlb VMA.  It does this on a
> special cue, but does it nonetheless.
> 

This is not actually a new thing. For a long time now, it has been possible to
back malloc() with hugepages at a userspace level using the morecore glibc
hook. That is replacing anonymous memory with a file-backed VMA. It happens
in a different place but it's just as deliberate as backing the stack, and the
end result is very similar. As the file is RAM-based, it doesn't have the
same kinds of consequences, like dirty-page syncing, that you'd ordinarily
watch for when moving from anonymous to file-backed memory.

> This patch has crossed a line in that it is really the first
> *replacement* of a normal VMA with a hugetlb VMA instead of the creation
> of the VMAs at the user's request. 

We crossed that line with morecore; it's back there somewhere. We're just
doing it in the kernel this time because backing stacks with hugepages in
userspace turned out to be a hairy endeavour.

Properly supporting anonymous hugepages would either require larger
changes to the core or reimplementing yet more of mm/ in mm/hugetlb.c.
Neither is a particularly appealing approach, nor is it likely to be a
very popular one.

> I'm really curious what the plan is
> to follow up on this.  Will this stack stuff turn out to be one-off
> code, or is this *the* route for getting transparent large pages in the
> future?
> 

Conceivably, we could also implement MAP_LARGEPAGE for MAP_ANONYMOUS
which would use the same hugetlb_file_setup() as for shmem and stacks
with this patch. It would be a relatively straightforward patch if reusing
hugetlb_file_setup(), as the flags can be passed through almost verbatim. In
that case, hugetlbfs still makes a good fit without making direct pagetable
insertions necessary.

> Because of the limitations like its inability to grow the VMA, I can't
> imagine that this would be a generic mechanism that we can use
> elsewhere.
> 

What other than a stack even cares about VM_GROWSDOWN working? Besides,
VM_GROWSDOWN could be supported in a hugetlbfs file by mapping the end of
the file and moving the offset backwards (yeah ok, it ain't the prettiest
but it's less churn than introducing significantly different codepaths). It's
just not something that needs to be supported at first cut.

brk(), if you wanted to back it with hugepages, conceivably needs a resizing
VMA, but in that case it's growing up, so you can just extend the end of the
VMA and increase the size of the file.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [RFC] [PATCH 0/5 V2] Huge page backed user-space stacks
  2008-08-07 16:06                           ` Mel Gorman
@ 2008-08-07 17:29                             ` Dave Hansen
  2008-08-11  8:04                               ` Mel Gorman
  0 siblings, 1 reply; 38+ messages in thread
From: Dave Hansen @ 2008-08-07 17:29 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, ebmunson, linux-mm, linux-kernel, linuxppc-dev,
	libhugetlbfs-devel, abh

On Thu, 2008-08-07 at 17:06 +0100, Mel Gorman wrote:
> On (06/08/08 12:50), Dave Hansen didst pronounce:
> > The main thing this set of patches does that I care about is take an
> > anonymous VMA and replace it with a hugetlb VMA.  It does this on a
> > special cue, but does it nonetheless.
> 
> This is not actually a new thing. For a long time now, it has been possible to
> back malloc() with hugepages at a userspace level using the morecore glibc
> hook. That is replacing anonymous memory with a file-backed VMA. It happens
> in a different place but it's just as deliberate as backing the stack, and the
> end result is very similar. As the file is RAM-based, it doesn't have the
> same kinds of consequences, like dirty-page syncing, that you'd ordinarily
> watch for when moving from anonymous to file-backed memory.

Yes, it has already been done in userspace.  That's fine.  It isn't
adding any complexity to the kernel.  This is adding behavior that the
kernel has to support as well as complexity.

> > This patch has crossed a line in that it is really the first
> > *replacement* of a normal VMA with a hugetlb VMA instead of the creation
> > of the VMAs at the user's request. 
> 
> We crossed that line with morecore; it's back there somewhere. We're just
> doing it in the kernel this time because backing stacks with hugepages in
> userspace turned out to be a hairy endeavour.
> 
> Properly supporting anonymous hugepages would either require larger
> changes to the core or reimplementing yet more of mm/ in mm/hugetlb.c.
> Neither is a particularly appealing approach, nor is it likely to be a
> very popular one.

I agree.  It is always much harder to write code that can work
generically (and get it accepted) than just write the smallest possible
hack and stick it in fs/exec.c.

Could this patch at least get fixed up to look like it could be used
more generically?  Some code to look up and replace anonymous VMAs with
hugetlb-backed ones.

> > Because of the limitations like its inability to grow the VMA, I can't
> > imagine that this would be a generic mechanism that we can use
> > elsewhere.
> 
> What other than a stack even cares about VM_GROWSDOWN working? Besides,
> VM_GROWSDOWN could be supported in a hugetlbfs file by mapping the end of
> the file and moving the offset backwards (yeah ok, it ain't the prettiest
> but it's less churn than introducing significantly different codepaths). It's
> just not something that needs to be supported at first cut.
> 
> brk(), if you wanted to back it with hugepages, conceivably needs a resizing
> VMA, but in that case it's growing up, so you can just extend the end of the
> VMA and increase the size of the file.

I'm more worried about a small huge page size (say 64k) and not being
able to merge the VMAs.  I guess it could start in the *middle* of a
file and map both directions.

I guess you could always just have a single (very sparse) hugetlb file
per mm to do all of this 'anonymous' hugetlb memory stuff, and
just map its offsets 1:1 onto the process's virtual address space.
That would make sure you could always merge VMAs, no matter how they
grew together.

-- Dave



* Re: [RFC] [PATCH 0/5 V2] Huge page backed user-space stacks
  2008-08-07 17:29                             ` Dave Hansen
@ 2008-08-11  8:04                               ` Mel Gorman
  0 siblings, 0 replies; 38+ messages in thread
From: Mel Gorman @ 2008-08-11  8:04 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andrew Morton, ebmunson, linux-mm, linux-kernel, linuxppc-dev,
	libhugetlbfs-devel, abh

On (07/08/08 10:29), Dave Hansen didst pronounce:
> On Thu, 2008-08-07 at 17:06 +0100, Mel Gorman wrote:
> > On (06/08/08 12:50), Dave Hansen didst pronounce:
> > > The main thing this set of patches does that I care about is take an
> > > anonymous VMA and replace it with a hugetlb VMA.  It does this on a
> > > special cue, but does it nonetheless.
> > 
> > This is not actually a new thing. For a long time now, it has been possible to
> > back malloc() with hugepages at a userspace level using the morecore glibc
> > hook. That is replacing anonymous memory with a file-backed VMA. It happens
> > in a different place but it's just as deliberate as backing the stack, and the
> > end result is very similar. As the file is RAM-based, it doesn't have the
> > same kinds of consequences, like dirty-page syncing, that you'd ordinarily
> > watch for when moving from anonymous to file-backed memory.
> 
> Yes, it has already been done in userspace.  That's fine.  It isn't
> adding any complexity to the kernel.  This is adding behavior that the
> kernel has to support as well as complexity.
> 

The complexity is minimal and the progression logical.
hugetlb_file_setup() is the API shmem was using to create a file on an
internal mount suitable for MAP_SHARED. This patchset adds support for
MAP_PRIVATE and the additional complexity is a lot less than supporting
direct pagetable inserts.

> > > This patch has crossed a line in that it is really the first
> > > *replacement* of a normal VMA with a hugetlb VMA instead of the creation
> > > of the VMAs at the user's request. 
> > 
> > We crossed that line with morecore; it's back there somewhere. We're just
> > doing it in the kernel this time because backing stacks with hugepages in
> > userspace turned out to be a hairy endeavour.
> > 
> > Properly supporting anonymous hugepages would either require larger
> > changes to the core or reimplementing yet more of mm/ in mm/hugetlb.c.
> > Neither is a particularly appealing approach, nor is it likely to be a
> > very popular one.
> 
> I agree.  It is always much harder to write code that can work
> generically (and get it accepted) than just write the smallest possible
> hack and stick it in fs/exec.c.
> 
> Could this patch at least get fixed up to look like it could be used
> more generically?  Some code to look up and replace anonymous VMAs with
> hugetlb-backed ones.

Ok, this latter point can be looked into at least, although the
underlying principle may still be to use hugetlb_file_setup() rather than
direct pagetable insertions.

> > > Because of the limitations like its inability to grow the VMA, I can't
> > > imagine that this would be a generic mechanism that we can use
> > > elsewhere.
> > 
> > What other than a stack even cares about VM_GROWSDOWN working? Besides,
> > VM_GROWSDOWN could be supported in a hugetlbfs file by mapping the end of
> > the file and moving the offset backwards (yeah ok, it ain't the prettiest
> > but it's less churn than introducing significantly different codepaths). It's
> > just not something that needs to be supported at first cut.
> > 
> > brk(), if you wanted to back it with hugepages, conceivably needs a resizing
> > VMA, but in that case it's growing up, so you can just extend the end of the
> > VMA and increase the size of the file.
> 
> I'm more worried about a small huge page size (say 64k) and not being
> able to merge the VMAs.  I guess it could start in the *middle* of a
> file and map both directions.
> 
> I guess you could always just have a single (very sparse) hugetlb file
> per mm to do all of this 'anonymous' hugetlb memory stuff, and
> just map its offsets 1:1 onto the process's virtual address space.
> That would make sure you could always merge VMAs, no matter how they
> grew together.
> 

That's an interesting idea. It isn't as straightforward as it sounds
due to reservation tracking, but on the face of it, I can't see why it
couldn't be made to work.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


end of thread, other threads:[~2008-08-11  8:05 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-07-28 19:17 [RFC] [PATCH 0/5 V2] Huge page backed user-space stacks Eric Munson
2008-07-28 19:17 ` [PATCH 1/5 V2] Align stack boundaries based on personality Eric Munson
2008-07-28 20:09   ` Dave Hansen
2008-07-28 19:17 ` [PATCH 2/5 V2] Add shared and reservation control to hugetlb_file_setup Eric Munson
2008-07-28 19:17 ` [PATCH 3/5] Split boundary checking from body of do_munmap Eric Munson
2008-07-28 19:17 ` [PATCH 4/5 V2] Build hugetlb backed process stacks Eric Munson
2008-07-28 20:37   ` Dave Hansen
2008-07-28 19:17 ` [PATCH 5/5 V2] [PPC] Setup stack memory segment for hugetlb pages Eric Munson
2008-07-28 20:33 ` [RFC] [PATCH 0/5 V2] Huge page backed user-space stacks Dave Hansen
2008-07-28 21:23   ` Eric B Munson
2008-07-30  8:41 ` Andrew Morton
2008-07-30 15:04   ` Eric B Munson
2008-07-30 15:08   ` Eric B Munson
2008-07-30  8:43 ` Andrew Morton
2008-07-30 17:23   ` Mel Gorman
2008-07-30 17:34     ` Andrew Morton
2008-07-30 19:30       ` Mel Gorman
2008-07-30 19:40         ` Christoph Lameter
2008-07-30 20:07         ` Andrew Morton
2008-07-31 10:31           ` Mel Gorman
2008-08-04 21:10             ` Dave Hansen
2008-08-05 11:11               ` Mel Gorman
2008-08-05 16:12                 ` Dave Hansen
2008-08-05 16:28                   ` Mel Gorman
2008-08-05 17:53                     ` Dave Hansen
2008-08-06  9:02                       ` Mel Gorman
2008-08-06 19:50                         ` Dave Hansen
2008-08-07 16:06                           ` Mel Gorman
2008-08-07 17:29                             ` Dave Hansen
2008-08-11  8:04                               ` Mel Gorman
2008-07-31  6:04       ` Nick Piggin
2008-07-31  6:14         ` Andrew Morton
2008-07-31  6:26           ` Nick Piggin
2008-07-31 11:27             ` Mel Gorman
2008-07-31 11:51               ` Nick Piggin
2008-07-31 13:50                 ` Mel Gorman
2008-07-31 14:32                   ` Michael Ellerman
2008-08-06 18:49       ` Andi Kleen
