All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/7] [RFC] hugetlb: pagetable_operations API (V2)
@ 2007-03-19 20:05 ` Adam Litke
  0 siblings, 0 replies; 82+ messages in thread
From: Adam Litke @ 2007-03-19 20:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Adam Litke, Arjan van de Ven, William Lee Irwin III,
	Christoph Hellwig, Ken Chen, linux-mm, linux-kernel


Andrew, given the favorable review of these patches the last time around, would
you consider them for the -mm tree?  Does anyone else have any objections?

The page tables for hugetlb mappings are handled differently than page tables
for normal pages.  Rather than integrating multiple page size support into the
core VM (which would tremendously complicate the code) some hooks were created.
This allows hugetlb special cases to be handled "out of line" by a separate
interface.

Hugetlbfs was the huge page interface chosen.  At the time, large database
users were the only big users of huge pages and the hugetlbfs design meets
their needs pretty well.  Over time, hugetlbfs has been expanded to enable new
uses of huge page memory with varied results.  As features are added, the
semantics become a permanent part of the Linux API.  This makes maintenance of
hugetlbfs an increasingly difficult task and inhibits the addition of features
and functionality in support of ever-changing hardware.

To remedy the situation, I propose an API (currently called
pagetable_operations).  All of the current hugetlbfs-specific hooks are moved
into an operations struct that is attached to VMAs.  The end result is a more
explicit and IMO a cleaner interface between hugetlbfs and the core VM.  We are
then free to add other hugetlb interfaces (such as a /dev/zero-styled character
device) that can operate either in concert with or independent of hugetlbfs.

There should be no measurable performance impact for normal page users (we're
checking if pagetable_ops != NULL instead of checking for vm_flags &
VM_HUGETLB).  Of course we do increase the VMA size by one pointer.  For huge
pages, there is an added indirection for pt_op() calls.  This patch series does
not change the logic of the the hugetlbfs operations, just moves them into the
pagetable_operations struct.

I did some pretty basic benchmarking of these patches on ppc64, x86, and x86_64
to get a feel for the fast-path performance impact.  The following tables show
kernbench performance comparisons between a clean 2.6.20 kernel and one with my
patches applied.  These numbers seem well within statistical noise to me.

Changes since V1:
	- Made hugetlbfs_pagetable_ops const (Thanks Arjan)

--

KernBench Comparison (ppc64)
----------------------------
                       2.6.20-clean      2.6.20-pgtable_ops    pct. diff
User   CPU time              708.82                 708.59      0.03
System CPU time               62.50                  62.58     -0.13
Total  CPU time              771.32                 771.17      0.02
Elapsed    time              115.40                 115.35      0.04

KernBench Comparison (x86)
--------------------------
                       2.6.20-clean      2.6.20-pgtable_ops    pct. diff
User   CPU time             1382.62                1381.88      0.05
System CPU time              146.06                 146.86     -0.55
Total  CPU time             1528.68                1528.74     -0.00
Elapsed    time              394.92                 396.70     -0.45

KernBench Comparison (x86_64)
-----------------------------
                       2.6.20-clean      2.6.20-pgtable_ops    pct. diff
User   CPU time              559.39                 557.97      0.25
System CPU time               65.10                  66.17     -1.64
Total  CPU time              624.49                 624.14      0.06
Elapsed    time              158.54                 158.59     -0.03

The lack of a performance impact makes sense to me.  The following is a
simplified instruction comparison for each case:

2.6.20-clean                           2.6.20-pgtable_ops
-------------------                    --------------------
/* Load vm_flags */                    /* Load pagetable_ops pointer */
mov 	0x18(ecx),eax                  mov	0x48(ecx),eax
/* Test for VM_HUGETLB */              /* Test if it's NULL */
test 	$0x400000,eax                  test   eax,eax
/* If set, jump to call stub */        /* If so, jump away to main code */
jne 	c0148f04                       je	c0148ba1
...                                    /* Lookup the operation's function pointer */
/* copy_hugetlb_page_range call */     mov	0x4(eax),ebx
c0148f04:                              /* Test if it's NULL */
mov	0xffffff98(ebp),ecx            test   ebx,ebx
mov	0xffffff9c(ebp),edx            /* If so, jump away to main code */
mov	0xffffffa0(ebp),eax            je	c0148ba1
call	c01536e0                       /* pagetable operation call */
                                       mov	0xffffff9c(ebp),edx
				       mov	0xffffffa0(ebp),eax
				       call	*ebx

For the common case (vma->pagetable_ops == NULL), we do almost the same thing as the current code: load and test.  The third instruction is different in that we jump for the common case instead of jumping in the hugetlb case.  I don't think this is a big deal though.  If it is, would an unlikely() macro fix it?

^ permalink raw reply	[flat|nested] 82+ messages in thread

* [PATCH 0/7] [RFC] hugetlb: pagetable_operations API (V2)
@ 2007-03-19 20:05 ` Adam Litke
  0 siblings, 0 replies; 82+ messages in thread
From: Adam Litke @ 2007-03-19 20:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Adam Litke, Arjan van de Ven, William Lee Irwin III,
	Christoph Hellwig, Ken Chen, linux-mm, linux-kernel

Andrew, given the favorable review of these patches the last time around, would
you consider them for the -mm tree?  Does anyone else have any objections?

The page tables for hugetlb mappings are handled differently than page tables
for normal pages.  Rather than integrating multiple page size support into the
core VM (which would tremendously complicate the code) some hooks were created.
This allows hugetlb special cases to be handled "out of line" by a separate
interface.

Hugetlbfs was the huge page interface chosen.  At the time, large database
users were the only big users of huge pages and the hugetlbfs design meets
their needs pretty well.  Over time, hugetlbfs has been expanded to enable new
uses of huge page memory with varied results.  As features are added, the
semantics become a permanent part of the Linux API.  This makes maintenance of
hugetlbfs an increasingly difficult task and inhibits the addition of features
and functionality in support of ever-changing hardware.

To remedy the situation, I propose an API (currently called
pagetable_operations).  All of the current hugetlbfs-specific hooks are moved
into an operations struct that is attached to VMAs.  The end result is a more
explicit and IMO a cleaner interface between hugetlbfs and the core VM.  We are
then free to add other hugetlb interfaces (such as a /dev/zero-styled character
device) that can operate either in concert with or independent of hugetlbfs.

There should be no measurable performance impact for normal page users (we're
checking if pagetable_ops != NULL instead of checking for vm_flags &
VM_HUGETLB).  Of course we do increase the VMA size by one pointer.  For huge
pages, there is an added indirection for pt_op() calls.  This patch series does
not change the logic of the the hugetlbfs operations, just moves them into the
pagetable_operations struct.

I did some pretty basic benchmarking of these patches on ppc64, x86, and x86_64
to get a feel for the fast-path performance impact.  The following tables show
kernbench performance comparisons between a clean 2.6.20 kernel and one with my
patches applied.  These numbers seem well within statistical noise to me.

Changes since V1:
	- Made hugetlbfs_pagetable_ops const (Thanks Arjan)

--

KernBench Comparison (ppc64)
----------------------------
                       2.6.20-clean      2.6.20-pgtable_ops    pct. diff
User   CPU time              708.82                 708.59      0.03
System CPU time               62.50                  62.58     -0.13
Total  CPU time              771.32                 771.17      0.02
Elapsed    time              115.40                 115.35      0.04

KernBench Comparison (x86)
--------------------------
                       2.6.20-clean      2.6.20-pgtable_ops    pct. diff
User   CPU time             1382.62                1381.88      0.05
System CPU time              146.06                 146.86     -0.55
Total  CPU time             1528.68                1528.74     -0.00
Elapsed    time              394.92                 396.70     -0.45

KernBench Comparison (x86_64)
-----------------------------
                       2.6.20-clean      2.6.20-pgtable_ops    pct. diff
User   CPU time              559.39                 557.97      0.25
System CPU time               65.10                  66.17     -1.64
Total  CPU time              624.49                 624.14      0.06
Elapsed    time              158.54                 158.59     -0.03

The lack of a performance impact makes sense to me.  The following is a
simplified instruction comparison for each case:

2.6.20-clean                           2.6.20-pgtable_ops
-------------------                    --------------------
/* Load vm_flags */                    /* Load pagetable_ops pointer */
mov 	0x18(ecx),eax                  mov	0x48(ecx),eax
/* Test for VM_HUGETLB */              /* Test if it's NULL */
test 	$0x400000,eax                  test   eax,eax
/* If set, jump to call stub */        /* If so, jump away to main code */
jne 	c0148f04                       je	c0148ba1
...                                    /* Lookup the operation's function pointer */
/* copy_hugetlb_page_range call */     mov	0x4(eax),ebx
c0148f04:                              /* Test if it's NULL */
mov	0xffffff98(ebp),ecx            test   ebx,ebx
mov	0xffffff9c(ebp),edx            /* If so, jump away to main code */
mov	0xffffffa0(ebp),eax            je	c0148ba1
call	c01536e0                       /* pagetable operation call */
                                       mov	0xffffff9c(ebp),edx
				       mov	0xffffffa0(ebp),eax
				       call	*ebx

For the common case (vma->pagetable_ops == NULL), we do almost the same thing as the current code: load and test.  The third instruction is different in that we jump for the common case instead of jumping in the hugetlb case.  I don't think this is a big deal though.  If it is, would an unlikely() macro fix it?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* [PATCH 1/7] Introduce the pagetable_operations and associated helper macros.
  2007-03-19 20:05 ` Adam Litke
@ 2007-03-19 20:05   ` Adam Litke
  -1 siblings, 0 replies; 82+ messages in thread
From: Adam Litke @ 2007-03-19 20:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Adam Litke, Arjan van de Ven, William Lee Irwin III,
	Christoph Hellwig, Ken Chen, linux-mm, linux-kernel


Signed-off-by: Adam Litke <agl@us.ibm.com>
---

 include/linux/mm.h |   25 +++++++++++++++++++++++++
 1 files changed, 25 insertions(+), 0 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 60e0e4a..7089323 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -98,6 +98,7 @@ struct vm_area_struct {
 
 	/* Function pointers to deal with this struct. */
 	struct vm_operations_struct * vm_ops;
+	const struct pagetable_operations_struct * pagetable_ops;
 
 	/* Information about our backing store: */
 	unsigned long vm_pgoff;		/* Offset (within vm_file) in PAGE_SIZE
@@ -218,6 +219,30 @@ struct vm_operations_struct {
 };
 
 struct mmu_gather;
+
+struct pagetable_operations_struct {
+	int (*fault)(struct mm_struct *mm,
+		struct vm_area_struct *vma,
+		unsigned long address, int write_access);
+	int (*copy_vma)(struct mm_struct *dst, struct mm_struct *src,
+		struct vm_area_struct *vma);
+	int (*pin_pages)(struct mm_struct *mm, struct vm_area_struct *vma,
+		struct page **pages, struct vm_area_struct **vmas,
+		unsigned long *position, int *length, int i);
+	void (*change_protection)(struct vm_area_struct *vma,
+		unsigned long address, unsigned long end, pgprot_t newprot);
+	unsigned long (*unmap_page_range)(struct vm_area_struct *vma,
+		unsigned long address, unsigned long end, long *zap_work);
+	void (*free_pgtable_range)(struct mmu_gather **tlb,
+		unsigned long addr, unsigned long end,
+		unsigned long floor, unsigned long ceiling);
+};
+
+#define has_pt_op(vma, op) \
+	((vma)->pagetable_ops && (vma)->pagetable_ops->op)
+#define pt_op(vma, call) \
+	((vma)->pagetable_ops->call)
+
 struct inode;
 
 #define page_private(page)		((page)->private)

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 1/7] Introduce the pagetable_operations and associated helper macros.
@ 2007-03-19 20:05   ` Adam Litke
  0 siblings, 0 replies; 82+ messages in thread
From: Adam Litke @ 2007-03-19 20:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Adam Litke, Arjan van de Ven, William Lee Irwin III,
	Christoph Hellwig, Ken Chen, linux-mm, linux-kernel

Signed-off-by: Adam Litke <agl@us.ibm.com>
---

 include/linux/mm.h |   25 +++++++++++++++++++++++++
 1 files changed, 25 insertions(+), 0 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 60e0e4a..7089323 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -98,6 +98,7 @@ struct vm_area_struct {
 
 	/* Function pointers to deal with this struct. */
 	struct vm_operations_struct * vm_ops;
+	const struct pagetable_operations_struct * pagetable_ops;
 
 	/* Information about our backing store: */
 	unsigned long vm_pgoff;		/* Offset (within vm_file) in PAGE_SIZE
@@ -218,6 +219,30 @@ struct vm_operations_struct {
 };
 
 struct mmu_gather;
+
+struct pagetable_operations_struct {
+	int (*fault)(struct mm_struct *mm,
+		struct vm_area_struct *vma,
+		unsigned long address, int write_access);
+	int (*copy_vma)(struct mm_struct *dst, struct mm_struct *src,
+		struct vm_area_struct *vma);
+	int (*pin_pages)(struct mm_struct *mm, struct vm_area_struct *vma,
+		struct page **pages, struct vm_area_struct **vmas,
+		unsigned long *position, int *length, int i);
+	void (*change_protection)(struct vm_area_struct *vma,
+		unsigned long address, unsigned long end, pgprot_t newprot);
+	unsigned long (*unmap_page_range)(struct vm_area_struct *vma,
+		unsigned long address, unsigned long end, long *zap_work);
+	void (*free_pgtable_range)(struct mmu_gather **tlb,
+		unsigned long addr, unsigned long end,
+		unsigned long floor, unsigned long ceiling);
+};
+
+#define has_pt_op(vma, op) \
+	((vma)->pagetable_ops && (vma)->pagetable_ops->op)
+#define pt_op(vma, call) \
+	((vma)->pagetable_ops->call)
+
 struct inode;
 
 #define page_private(page)		((page)->private)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 2/7] copy_vma for hugetlbfs
  2007-03-19 20:05 ` Adam Litke
@ 2007-03-19 20:05   ` Adam Litke
  -1 siblings, 0 replies; 82+ messages in thread
From: Adam Litke @ 2007-03-19 20:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Adam Litke, Arjan van de Ven, William Lee Irwin III,
	Christoph Hellwig, Ken Chen, linux-mm, linux-kernel


Signed-off-by: Adam Litke <agl@us.ibm.com>
---

 fs/hugetlbfs/inode.c |    6 ++++++
 mm/memory.c          |    4 ++--
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 8c718a3..2452dde 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -36,6 +36,7 @@
 static const struct super_operations hugetlbfs_ops;
 static const struct address_space_operations hugetlbfs_aops;
 const struct file_operations hugetlbfs_file_operations;
+static const struct pagetable_operations_struct hugetlbfs_pagetable_ops;
 static const struct inode_operations hugetlbfs_dir_inode_operations;
 static const struct inode_operations hugetlbfs_inode_operations;
 
@@ -70,6 +71,7 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
 	 */
 	vma->vm_flags |= VM_HUGETLB | VM_RESERVED;
 	vma->vm_ops = &hugetlb_vm_ops;
+	vma->pagetable_ops = &hugetlbfs_pagetable_ops;
 
 	vma_len = (loff_t)(vma->vm_end - vma->vm_start);
 
@@ -563,6 +565,10 @@ const struct file_operations hugetlbfs_file_operations = {
 	.get_unmapped_area	= hugetlb_get_unmapped_area,
 };
 
+static const struct pagetable_operations_struct hugetlbfs_pagetable_ops = {
+	.copy_vma		= copy_hugetlb_page_range,
+};
+
 static const struct inode_operations hugetlbfs_dir_inode_operations = {
 	.create		= hugetlbfs_create,
 	.lookup		= simple_lookup,
diff --git a/mm/memory.c b/mm/memory.c
index e7066e7..69bb0b3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -602,8 +602,8 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 			return 0;
 	}
 
-	if (is_vm_hugetlb_page(vma))
-		return copy_hugetlb_page_range(dst_mm, src_mm, vma);
+	if (has_pt_op(vma, copy_vma))
+		return pt_op(vma, copy_vma)(dst_mm, src_mm, vma);
 
 	dst_pgd = pgd_offset(dst_mm, addr);
 	src_pgd = pgd_offset(src_mm, addr);

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 2/7] copy_vma for hugetlbfs
@ 2007-03-19 20:05   ` Adam Litke
  0 siblings, 0 replies; 82+ messages in thread
From: Adam Litke @ 2007-03-19 20:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Adam Litke, Arjan van de Ven, William Lee Irwin III,
	Christoph Hellwig, Ken Chen, linux-mm, linux-kernel

Signed-off-by: Adam Litke <agl@us.ibm.com>
---

 fs/hugetlbfs/inode.c |    6 ++++++
 mm/memory.c          |    4 ++--
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 8c718a3..2452dde 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -36,6 +36,7 @@
 static const struct super_operations hugetlbfs_ops;
 static const struct address_space_operations hugetlbfs_aops;
 const struct file_operations hugetlbfs_file_operations;
+static const struct pagetable_operations_struct hugetlbfs_pagetable_ops;
 static const struct inode_operations hugetlbfs_dir_inode_operations;
 static const struct inode_operations hugetlbfs_inode_operations;
 
@@ -70,6 +71,7 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
 	 */
 	vma->vm_flags |= VM_HUGETLB | VM_RESERVED;
 	vma->vm_ops = &hugetlb_vm_ops;
+	vma->pagetable_ops = &hugetlbfs_pagetable_ops;
 
 	vma_len = (loff_t)(vma->vm_end - vma->vm_start);
 
@@ -563,6 +565,10 @@ const struct file_operations hugetlbfs_file_operations = {
 	.get_unmapped_area	= hugetlb_get_unmapped_area,
 };
 
+static const struct pagetable_operations_struct hugetlbfs_pagetable_ops = {
+	.copy_vma		= copy_hugetlb_page_range,
+};
+
 static const struct inode_operations hugetlbfs_dir_inode_operations = {
 	.create		= hugetlbfs_create,
 	.lookup		= simple_lookup,
diff --git a/mm/memory.c b/mm/memory.c
index e7066e7..69bb0b3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -602,8 +602,8 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 			return 0;
 	}
 
-	if (is_vm_hugetlb_page(vma))
-		return copy_hugetlb_page_range(dst_mm, src_mm, vma);
+	if (has_pt_op(vma, copy_vma))
+		return pt_op(vma, copy_vma)(dst_mm, src_mm, vma);
 
 	dst_pgd = pgd_offset(dst_mm, addr);
 	src_pgd = pgd_offset(src_mm, addr);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 3/7] pin_pages for hugetlb
  2007-03-19 20:05 ` Adam Litke
@ 2007-03-19 20:05   ` Adam Litke
  -1 siblings, 0 replies; 82+ messages in thread
From: Adam Litke @ 2007-03-19 20:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Adam Litke, Arjan van de Ven, William Lee Irwin III,
	Christoph Hellwig, Ken Chen, linux-mm, linux-kernel


Signed-off-by: Adam Litke <agl@us.ibm.com>
---

 fs/hugetlbfs/inode.c |    1 +
 mm/memory.c          |    6 +++---
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 2452dde..d0b4b46 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -567,6 +567,7 @@ const struct file_operations hugetlbfs_file_operations = {
 
 static const struct pagetable_operations_struct hugetlbfs_pagetable_ops = {
 	.copy_vma		= copy_hugetlb_page_range,
+	.pin_pages		= follow_hugetlb_page,
 };
 
 static const struct inode_operations hugetlbfs_dir_inode_operations = {
diff --git a/mm/memory.c b/mm/memory.c
index 69bb0b3..01256cf 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1039,9 +1039,9 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 				|| !(vm_flags & vma->vm_flags))
 			return i ? : -EFAULT;
 
-		if (is_vm_hugetlb_page(vma)) {
-			i = follow_hugetlb_page(mm, vma, pages, vmas,
-						&start, &len, i);
+		if (has_pt_op(vma, pin_pages)) {
+			i = pt_op(vma, pin_pages)(mm, vma, pages,
+						vmas, &start, &len, i);
 			continue;
 		}
 

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 3/7] pin_pages for hugetlb
@ 2007-03-19 20:05   ` Adam Litke
  0 siblings, 0 replies; 82+ messages in thread
From: Adam Litke @ 2007-03-19 20:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Adam Litke, Arjan van de Ven, William Lee Irwin III,
	Christoph Hellwig, Ken Chen, linux-mm, linux-kernel

Signed-off-by: Adam Litke <agl@us.ibm.com>
---

 fs/hugetlbfs/inode.c |    1 +
 mm/memory.c          |    6 +++---
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 2452dde..d0b4b46 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -567,6 +567,7 @@ const struct file_operations hugetlbfs_file_operations = {
 
 static const struct pagetable_operations_struct hugetlbfs_pagetable_ops = {
 	.copy_vma		= copy_hugetlb_page_range,
+	.pin_pages		= follow_hugetlb_page,
 };
 
 static const struct inode_operations hugetlbfs_dir_inode_operations = {
diff --git a/mm/memory.c b/mm/memory.c
index 69bb0b3..01256cf 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1039,9 +1039,9 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 				|| !(vm_flags & vma->vm_flags))
 			return i ? : -EFAULT;
 
-		if (is_vm_hugetlb_page(vma)) {
-			i = follow_hugetlb_page(mm, vma, pages, vmas,
-						&start, &len, i);
+		if (has_pt_op(vma, pin_pages)) {
+			i = pt_op(vma, pin_pages)(mm, vma, pages,
+						vmas, &start, &len, i);
 			continue;
 		}
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 4/7] unmap_page_range for hugetlb
  2007-03-19 20:05 ` Adam Litke
@ 2007-03-19 20:05   ` Adam Litke
  -1 siblings, 0 replies; 82+ messages in thread
From: Adam Litke @ 2007-03-19 20:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Adam Litke, Arjan van de Ven, William Lee Irwin III,
	Christoph Hellwig, Ken Chen, linux-mm, linux-kernel


Signed-off-by: Adam Litke <agl@us.ibm.com>
---

 fs/hugetlbfs/inode.c    |    3 ++-
 include/linux/hugetlb.h |    4 ++--
 mm/hugetlb.c            |   12 ++++++++----
 mm/memory.c             |   10 ++++------
 4 files changed, 16 insertions(+), 13 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index d0b4b46..198efa7 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -289,7 +289,7 @@ hugetlb_vmtruncate_list(struct prio_tree_root *root, pgoff_t pgoff)
 			v_offset = 0;
 
 		__unmap_hugepage_range(vma,
-				vma->vm_start + v_offset, vma->vm_end);
+				vma->vm_start + v_offset, vma->vm_end, 0);
 	}
 }
 
@@ -568,6 +568,7 @@ const struct file_operations hugetlbfs_file_operations = {
 static const struct pagetable_operations_struct hugetlbfs_pagetable_ops = {
 	.copy_vma		= copy_hugetlb_page_range,
 	.pin_pages		= follow_hugetlb_page,
+	.unmap_page_range	= unmap_hugepage_range,
 };
 
 static const struct inode_operations hugetlbfs_dir_inode_operations = {
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 3f3e7a6..502c2f8 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -17,8 +17,8 @@ static inline int is_vm_hugetlb_page(struct vm_area_struct *vma)
 int hugetlb_sysctl_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *);
 int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *);
 int follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *, struct page **, struct vm_area_struct **, unsigned long *, int *, int);
-void unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long);
-void __unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long);
+unsigned long unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long, long *);
+void __unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long, long *);
 int hugetlb_prefault(struct address_space *, struct vm_area_struct *);
 int hugetlb_report_meminfo(char *);
 int hugetlb_report_node_meminfo(int, char *);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 36db012..d902fb9 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -356,7 +356,7 @@ nomem:
 }
 
 void __unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
-			    unsigned long end)
+			    unsigned long end, long *zap_work)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long address;
@@ -399,10 +399,13 @@ void __unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
 		list_del(&page->lru);
 		put_page(page);
 	}
+
+	if (zap_work)
+		*zap_work -= (end - start) / (HPAGE_SIZE / PAGE_SIZE);
 }
 
-void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
-			  unsigned long end)
+unsigned long unmap_hugepage_range(struct vm_area_struct *vma,
+			unsigned long start, unsigned long end, long *zap_work)
 {
 	/*
 	 * It is undesirable to test vma->vm_file as it should be non-null
@@ -414,9 +417,10 @@ void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
 	 */
 	if (vma->vm_file) {
 		spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
-		__unmap_hugepage_range(vma, start, end);
+		__unmap_hugepage_range(vma, start, end, zap_work);
 		spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock);
 	}
+	return end;
 }
 
 static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
diff --git a/mm/memory.c b/mm/memory.c
index 01256cf..a3bcaf3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -839,12 +839,10 @@ unsigned long unmap_vmas(struct mmu_gather **tlbp,
 				tlb_start_valid = 1;
 			}
 
-			if (unlikely(is_vm_hugetlb_page(vma))) {
-				unmap_hugepage_range(vma, start, end);
-				zap_work -= (end - start) /
-						(HPAGE_SIZE / PAGE_SIZE);
-				start = end;
-			} else
+			if (unlikely(has_pt_op(vma, unmap_page_range)))
+				start = pt_op(vma, unmap_page_range)
+						(vma, start, end, &zap_work);
+			else
 				start = unmap_page_range(*tlbp, vma,
 						start, end, &zap_work, details);
 

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 4/7] unmap_page_range for hugetlb
@ 2007-03-19 20:05   ` Adam Litke
  0 siblings, 0 replies; 82+ messages in thread
From: Adam Litke @ 2007-03-19 20:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Adam Litke, Arjan van de Ven, William Lee Irwin III,
	Christoph Hellwig, Ken Chen, linux-mm, linux-kernel

Signed-off-by: Adam Litke <agl@us.ibm.com>
---

 fs/hugetlbfs/inode.c    |    3 ++-
 include/linux/hugetlb.h |    4 ++--
 mm/hugetlb.c            |   12 ++++++++----
 mm/memory.c             |   10 ++++------
 4 files changed, 16 insertions(+), 13 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index d0b4b46..198efa7 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -289,7 +289,7 @@ hugetlb_vmtruncate_list(struct prio_tree_root *root, pgoff_t pgoff)
 			v_offset = 0;
 
 		__unmap_hugepage_range(vma,
-				vma->vm_start + v_offset, vma->vm_end);
+				vma->vm_start + v_offset, vma->vm_end, 0);
 	}
 }
 
@@ -568,6 +568,7 @@ const struct file_operations hugetlbfs_file_operations = {
 static const struct pagetable_operations_struct hugetlbfs_pagetable_ops = {
 	.copy_vma		= copy_hugetlb_page_range,
 	.pin_pages		= follow_hugetlb_page,
+	.unmap_page_range	= unmap_hugepage_range,
 };
 
 static const struct inode_operations hugetlbfs_dir_inode_operations = {
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 3f3e7a6..502c2f8 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -17,8 +17,8 @@ static inline int is_vm_hugetlb_page(struct vm_area_struct *vma)
 int hugetlb_sysctl_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *);
 int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *);
 int follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *, struct page **, struct vm_area_struct **, unsigned long *, int *, int);
-void unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long);
-void __unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long);
+unsigned long unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long, long *);
+void __unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long, long *);
 int hugetlb_prefault(struct address_space *, struct vm_area_struct *);
 int hugetlb_report_meminfo(char *);
 int hugetlb_report_node_meminfo(int, char *);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 36db012..d902fb9 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -356,7 +356,7 @@ nomem:
 }
 
 void __unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
-			    unsigned long end)
+			    unsigned long end, long *zap_work)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long address;
@@ -399,10 +399,13 @@ void __unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
 		list_del(&page->lru);
 		put_page(page);
 	}
+
+	if (zap_work)
+		*zap_work -= (end - start) / (HPAGE_SIZE / PAGE_SIZE);
 }
 
-void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
-			  unsigned long end)
+unsigned long unmap_hugepage_range(struct vm_area_struct *vma,
+			unsigned long start, unsigned long end, long *zap_work)
 {
 	/*
 	 * It is undesirable to test vma->vm_file as it should be non-null
@@ -414,9 +417,10 @@ void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
 	 */
 	if (vma->vm_file) {
 		spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
-		__unmap_hugepage_range(vma, start, end);
+		__unmap_hugepage_range(vma, start, end, zap_work);
 		spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock);
 	}
+	return end;
 }
 
 static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
diff --git a/mm/memory.c b/mm/memory.c
index 01256cf..a3bcaf3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -839,12 +839,10 @@ unsigned long unmap_vmas(struct mmu_gather **tlbp,
 				tlb_start_valid = 1;
 			}
 
-			if (unlikely(is_vm_hugetlb_page(vma))) {
-				unmap_hugepage_range(vma, start, end);
-				zap_work -= (end - start) /
-						(HPAGE_SIZE / PAGE_SIZE);
-				start = end;
-			} else
+			if (unlikely(has_pt_op(vma, unmap_page_range)))
+				start = pt_op(vma, unmap_page_range)
+						(vma, start, end, &zap_work);
+			else
 				start = unmap_page_range(*tlbp, vma,
 						start, end, &zap_work, details);
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 5/7] change_protection for hugetlb
  2007-03-19 20:05 ` Adam Litke
@ 2007-03-19 20:05   ` Adam Litke
  -1 siblings, 0 replies; 82+ messages in thread
From: Adam Litke @ 2007-03-19 20:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Adam Litke, Arjan van de Ven, William Lee Irwin III,
	Christoph Hellwig, Ken Chen, linux-mm, linux-kernel


Signed-off-by: Adam Litke <agl@us.ibm.com>
---

 fs/hugetlbfs/inode.c |    1 +
 mm/mprotect.c        |    5 +++--
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 198efa7..3de5d93 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -569,6 +569,7 @@ static const struct pagetable_operations_struct hugetlbfs_pagetable_ops = {
 	.copy_vma		= copy_hugetlb_page_range,
 	.pin_pages		= follow_hugetlb_page,
 	.unmap_page_range	= unmap_hugepage_range,
+	.change_protection	= hugetlb_change_protection,
 };
 
 static const struct inode_operations hugetlbfs_dir_inode_operations = {
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 3b8f3c0..172e204 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -201,8 +201,9 @@ success:
 		dirty_accountable = 1;
 	}
 
-	if (is_vm_hugetlb_page(vma))
-		hugetlb_change_protection(vma, start, end, vma->vm_page_prot);
+	if (has_pt_op(vma, change_protection))
+		pt_op(vma, change_protection)(vma, start, end,
+			vma->vm_page_prot);
 	else
 		change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable);
 	vm_stat_account(mm, oldflags, vma->vm_file, -nrpages);

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 5/7] change_protection for hugetlb
@ 2007-03-19 20:05   ` Adam Litke
  0 siblings, 0 replies; 82+ messages in thread
From: Adam Litke @ 2007-03-19 20:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Adam Litke, Arjan van de Ven, William Lee Irwin III,
	Christoph Hellwig, Ken Chen, linux-mm, linux-kernel

Signed-off-by: Adam Litke <agl@us.ibm.com>
---

 fs/hugetlbfs/inode.c |    1 +
 mm/mprotect.c        |    5 +++--
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 198efa7..3de5d93 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -569,6 +569,7 @@ static const struct pagetable_operations_struct hugetlbfs_pagetable_ops = {
 	.copy_vma		= copy_hugetlb_page_range,
 	.pin_pages		= follow_hugetlb_page,
 	.unmap_page_range	= unmap_hugepage_range,
+	.change_protection	= hugetlb_change_protection,
 };
 
 static const struct inode_operations hugetlbfs_dir_inode_operations = {
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 3b8f3c0..172e204 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -201,8 +201,9 @@ success:
 		dirty_accountable = 1;
 	}
 
-	if (is_vm_hugetlb_page(vma))
-		hugetlb_change_protection(vma, start, end, vma->vm_page_prot);
+	if (has_pt_op(vma, change_protection))
+		pt_op(vma, change_protection)(vma, start, end,
+			vma->vm_page_prot);
 	else
 		change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable);
 	vm_stat_account(mm, oldflags, vma->vm_file, -nrpages);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 6/7] free_pgtable_range for hugetlb
  2007-03-19 20:05 ` Adam Litke
@ 2007-03-19 20:06   ` Adam Litke
  -1 siblings, 0 replies; 82+ messages in thread
From: Adam Litke @ 2007-03-19 20:06 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Adam Litke, Arjan van de Ven, William Lee Irwin III,
	Christoph Hellwig, Ken Chen, linux-mm, linux-kernel


Signed-off-by: Adam Litke <agl@us.ibm.com>
---

 fs/hugetlbfs/inode.c |    1 +
 mm/memory.c          |    6 +++---
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 3de5d93..823a9e3 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -570,6 +570,7 @@ static const struct pagetable_operations_struct hugetlbfs_pagetable_ops = {
 	.pin_pages		= follow_hugetlb_page,
 	.unmap_page_range	= unmap_hugepage_range,
 	.change_protection	= hugetlb_change_protection,
+	.free_pgtable_range	= hugetlb_free_pgd_range,
 };
 
 static const struct inode_operations hugetlbfs_dir_inode_operations = {
diff --git a/mm/memory.c b/mm/memory.c
index a3bcaf3..d2f28e7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -279,15 +279,15 @@ void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *vma,
 		anon_vma_unlink(vma);
 		unlink_file_vma(vma);
 
-		if (is_vm_hugetlb_page(vma)) {
-			hugetlb_free_pgd_range(tlb, addr, vma->vm_end,
+		if (has_pt_op(vma, free_pgtable_range)) {
+			pt_op(vma, free_pgtable_range)(tlb, addr, vma->vm_end,
 				floor, next? next->vm_start: ceiling);
 		} else {
 			/*
 			 * Optimization: gather nearby vmas into one call down
 			 */
 			while (next && next->vm_start <= vma->vm_end + PMD_SIZE
-			       && !is_vm_hugetlb_page(next)) {
+			       && !has_pt_op(next, free_pgtable_range)) {
 				vma = next;
 				next = vma->vm_next;
 				anon_vma_unlink(vma);

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 6/7] free_pgtable_range for hugetlb
@ 2007-03-19 20:06   ` Adam Litke
  0 siblings, 0 replies; 82+ messages in thread
From: Adam Litke @ 2007-03-19 20:06 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Adam Litke, Arjan van de Ven, William Lee Irwin III,
	Christoph Hellwig, Ken Chen, linux-mm, linux-kernel

Signed-off-by: Adam Litke <agl@us.ibm.com>
---

 fs/hugetlbfs/inode.c |    1 +
 mm/memory.c          |    6 +++---
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 3de5d93..823a9e3 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -570,6 +570,7 @@ static const struct pagetable_operations_struct hugetlbfs_pagetable_ops = {
 	.pin_pages		= follow_hugetlb_page,
 	.unmap_page_range	= unmap_hugepage_range,
 	.change_protection	= hugetlb_change_protection,
+	.free_pgtable_range	= hugetlb_free_pgd_range,
 };
 
 static const struct inode_operations hugetlbfs_dir_inode_operations = {
diff --git a/mm/memory.c b/mm/memory.c
index a3bcaf3..d2f28e7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -279,15 +279,15 @@ void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *vma,
 		anon_vma_unlink(vma);
 		unlink_file_vma(vma);
 
-		if (is_vm_hugetlb_page(vma)) {
-			hugetlb_free_pgd_range(tlb, addr, vma->vm_end,
+		if (has_pt_op(vma, free_pgtable_range)) {
+			pt_op(vma, free_pgtable_range)(tlb, addr, vma->vm_end,
 				floor, next? next->vm_start: ceiling);
 		} else {
 			/*
 			 * Optimization: gather nearby vmas into one call down
 			 */
 			while (next && next->vm_start <= vma->vm_end + PMD_SIZE
-			       && !is_vm_hugetlb_page(next)) {
+			       && !has_pt_op(next, free_pgtable_range)) {
 				vma = next;
 				next = vma->vm_next;
 				anon_vma_unlink(vma);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 7/7] hugetlbfs fault handler
  2007-03-19 20:05 ` Adam Litke
@ 2007-03-19 20:06   ` Adam Litke
  -1 siblings, 0 replies; 82+ messages in thread
From: Adam Litke @ 2007-03-19 20:06 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Adam Litke, Arjan van de Ven, William Lee Irwin III,
	Christoph Hellwig, Ken Chen, linux-mm, linux-kernel


Signed-off-by: Adam Litke <agl@us.ibm.com>
---

 fs/hugetlbfs/inode.c |    1 +
 mm/hugetlb.c         |    4 +++-
 mm/memory.c          |    4 ++--
 3 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 823a9e3..29e65c2 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -571,6 +571,7 @@ static const struct pagetable_operations_struct hugetlbfs_pagetable_ops = {
 	.unmap_page_range	= unmap_hugepage_range,
 	.change_protection	= hugetlb_change_protection,
 	.free_pgtable_range	= hugetlb_free_pgd_range,
+	.fault			= hugetlb_fault,
 };
 
 static const struct inode_operations hugetlbfs_dir_inode_operations = {
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d902fb9..e0f0607 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -590,6 +590,8 @@ int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	unsigned long vaddr = *position;
 	int remainder = *length;
 
+	BUG_ON(!has_pt_op(vma, fault));
+
 	spin_lock(&mm->page_table_lock);
 	while (vaddr < vma->vm_end && remainder) {
 		pte_t *pte;
@@ -606,7 +608,7 @@ int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			int ret;
 
 			spin_unlock(&mm->page_table_lock);
-			ret = hugetlb_fault(mm, vma, vaddr, 0);
+			ret = pt_op(vma, fault)(mm, vma, vaddr, 0);
 			spin_lock(&mm->page_table_lock);
 			if (ret == VM_FAULT_MINOR)
 				continue;
diff --git a/mm/memory.c b/mm/memory.c
index d2f28e7..bd7c243 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2499,8 +2499,8 @@ int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	count_vm_event(PGFAULT);
 
-	if (unlikely(is_vm_hugetlb_page(vma)))
-		return hugetlb_fault(mm, vma, address, write_access);
+	if (unlikely(has_pt_op(vma, fault)))
+		return pt_op(vma, fault)(mm, vma, address, write_access);
 
 	pgd = pgd_offset(mm, address);
 	pud = pud_alloc(mm, pgd, address);

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 7/7] hugetlbfs fault handler
@ 2007-03-19 20:06   ` Adam Litke
  0 siblings, 0 replies; 82+ messages in thread
From: Adam Litke @ 2007-03-19 20:06 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Adam Litke, Arjan van de Ven, William Lee Irwin III,
	Christoph Hellwig, Ken Chen, linux-mm, linux-kernel

Signed-off-by: Adam Litke <agl@us.ibm.com>
---

 fs/hugetlbfs/inode.c |    1 +
 mm/hugetlb.c         |    4 +++-
 mm/memory.c          |    4 ++--
 3 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 823a9e3..29e65c2 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -571,6 +571,7 @@ static const struct pagetable_operations_struct hugetlbfs_pagetable_ops = {
 	.unmap_page_range	= unmap_hugepage_range,
 	.change_protection	= hugetlb_change_protection,
 	.free_pgtable_range	= hugetlb_free_pgd_range,
+	.fault			= hugetlb_fault,
 };
 
 static const struct inode_operations hugetlbfs_dir_inode_operations = {
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d902fb9..e0f0607 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -590,6 +590,8 @@ int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	unsigned long vaddr = *position;
 	int remainder = *length;
 
+	BUG_ON(!has_pt_op(vma, fault));
+
 	spin_lock(&mm->page_table_lock);
 	while (vaddr < vma->vm_end && remainder) {
 		pte_t *pte;
@@ -606,7 +608,7 @@ int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			int ret;
 
 			spin_unlock(&mm->page_table_lock);
-			ret = hugetlb_fault(mm, vma, vaddr, 0);
+			ret = pt_op(vma, fault)(mm, vma, vaddr, 0);
 			spin_lock(&mm->page_table_lock);
 			if (ret == VM_FAULT_MINOR)
 				continue;
diff --git a/mm/memory.c b/mm/memory.c
index d2f28e7..bd7c243 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2499,8 +2499,8 @@ int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	count_vm_event(PGFAULT);
 
-	if (unlikely(is_vm_hugetlb_page(vma)))
-		return hugetlb_fault(mm, vma, address, write_access);
+	if (unlikely(has_pt_op(vma, fault)))
+		return pt_op(vma, fault)(mm, vma, address, write_access);
 
 	pgd = pgd_offset(mm, address);
 	pud = pud_alloc(mm, pgd, address);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* Re: [PATCH 1/7] Introduce the pagetable_operations and associated helper macros.
  2007-03-19 20:05   ` Adam Litke
@ 2007-03-20 23:24     ` Dave Hansen
  -1 siblings, 0 replies; 82+ messages in thread
From: Dave Hansen @ 2007-03-20 23:24 UTC (permalink / raw)
  To: Adam Litke
  Cc: Andrew Morton, Arjan van de Ven, William Lee Irwin III,
	Christoph Hellwig, Ken Chen, linux-mm, linux-kernel

On Mon, 2007-03-19 at 13:05 -0700, Adam Litke wrote:
> 
> +#define has_pt_op(vma, op) \
> +       ((vma)->pagetable_ops && (vma)->pagetable_ops->op)
> +#define pt_op(vma, call) \
> +       ((vma)->pagetable_ops->call) 

Can you get rid of these macros?  I think they make it a wee bit harder
to read.  My brain doesn't properly parse the foo(arg)(bar) syntax.  

+       if (has_pt_op(vma, copy_vma))
+               return pt_op(vma, copy_vma)(dst_mm, src_mm, vma);

+       if (vma->pagetable_ops && vma->pagetable_ops->copy_vma)
+               return vma->pagetable_ops->copy_vma(dst_mm, src_mm, vma);

I guess it does lead to some longish lines.  Does it start looking
really nasty?

If you're going to have them, it might just be best to put a single
unlikely() around the macro definitions themselves to keep anybody from
having to open-code it for any of the users.  

-- Dave


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 1/7] Introduce the pagetable_operations and associated helper macros.
@ 2007-03-20 23:24     ` Dave Hansen
  0 siblings, 0 replies; 82+ messages in thread
From: Dave Hansen @ 2007-03-20 23:24 UTC (permalink / raw)
  To: Adam Litke
  Cc: Andrew Morton, Arjan van de Ven, William Lee Irwin III,
	Christoph Hellwig, Ken Chen, linux-mm, linux-kernel

On Mon, 2007-03-19 at 13:05 -0700, Adam Litke wrote:
> 
> +#define has_pt_op(vma, op) \
> +       ((vma)->pagetable_ops && (vma)->pagetable_ops->op)
> +#define pt_op(vma, call) \
> +       ((vma)->pagetable_ops->call) 

Can you get rid of these macros?  I think they make it a wee bit harder
to read.  My brain doesn't properly parse the foo(arg)(bar) syntax.  

+       if (has_pt_op(vma, copy_vma))
+               return pt_op(vma, copy_vma)(dst_mm, src_mm, vma);

+       if (vma->pagetable_ops && vma->pagetable_ops->copy_vma)
+               return vma->pagetable_ops->copy_vma(dst_mm, src_mm, vma);

I guess it does lead to some longish lines.  Does it start looking
really nasty?

If you're going to have them, it might just be best to put a single
unlikely() around the macro definitions themselves to keep anybody from
having to open-code it for any of the users.  

-- Dave

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 4/7] unmap_page_range for hugetlb
  2007-03-19 20:05   ` Adam Litke
@ 2007-03-20 23:27     ` Dave Hansen
  -1 siblings, 0 replies; 82+ messages in thread
From: Dave Hansen @ 2007-03-20 23:27 UTC (permalink / raw)
  To: Adam Litke
  Cc: Andrew Morton, Arjan van de Ven, William Lee Irwin III,
	Christoph Hellwig, Ken Chen, linux-mm, linux-kernel

On Mon, 2007-03-19 at 13:05 -0700, Adam Litke wrote:
> Signed-off-by: Adam Litke <agl@us.ibm.com>
> ---
> 
>  fs/hugetlbfs/inode.c    |    3 ++-
>  include/linux/hugetlb.h |    4 ++--
>  mm/hugetlb.c            |   12 ++++++++----
>  mm/memory.c             |   10 ++++------
>  4 files changed, 16 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index d0b4b46..198efa7 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -289,7 +289,7 @@ hugetlb_vmtruncate_list(struct prio_tree_root *root, pgoff_t pgoff)
>  			v_offset = 0;
> 
>  		__unmap_hugepage_range(vma,
> -				vma->vm_start + v_offset, vma->vm_end);
> +				vma->vm_start + v_offset, vma->vm_end, 0);
>  	}
>  }

Did you mean NULL instead of 0 here?

> @@ -568,6 +568,7 @@ const struct file_operations hugetlbfs_file_operations = {
>  static const struct pagetable_operations_struct hugetlbfs_pagetable_ops = {
>  	.copy_vma		= copy_hugetlb_page_range,
>  	.pin_pages		= follow_hugetlb_page,
> +	.unmap_page_range	= unmap_hugepage_range,
>  };
> 
>  static const struct inode_operations hugetlbfs_dir_inode_operations = {
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 3f3e7a6..502c2f8 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -17,8 +17,8 @@ static inline int is_vm_hugetlb_page(struct vm_area_struct *vma)
>  int hugetlb_sysctl_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *);
>  int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *);
>  int follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *, struct page **, struct vm_area_struct **, unsigned long *, int *, int);
> -void unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long);
> -void __unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long);
> +unsigned long unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long, long *);
> +void __unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long, long *);
>  int hugetlb_prefault(struct address_space *, struct vm_area_struct *);
>  int hugetlb_report_meminfo(char *);
>  int hugetlb_report_node_meminfo(int, char *);
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 36db012..d902fb9 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -356,7 +356,7 @@ nomem:
>  }
> 
>  void __unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
> -			    unsigned long end)
> +			    unsigned long end, long *zap_work)
>  {
>  	struct mm_struct *mm = vma->vm_mm;
>  	unsigned long address;
> @@ -399,10 +399,13 @@ void __unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
>  		list_del(&page->lru);
>  		put_page(page);
>  	}
> +
> +	if (zap_work)
> +		*zap_work -= (end - start) / (HPAGE_SIZE / PAGE_SIZE);
>  }
> 
> -void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
> -			  unsigned long end)
> +unsigned long unmap_hugepage_range(struct vm_area_struct *vma,
> +			unsigned long start, unsigned long end, long *zap_work)
>  {
>  	/*
>  	 * It is undesirable to test vma->vm_file as it should be non-null
> @@ -414,9 +417,10 @@ void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
>  	 */
>  	if (vma->vm_file) {
>  		spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
> -		__unmap_hugepage_range(vma, start, end);
> +		__unmap_hugepage_range(vma, start, end, zap_work);
>  		spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock);
>  	}
> +	return end;
>  }
> 
>  static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
> diff --git a/mm/memory.c b/mm/memory.c
> index 01256cf..a3bcaf3 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -839,12 +839,10 @@ unsigned long unmap_vmas(struct mmu_gather **tlbp,
>  				tlb_start_valid = 1;
>  			}
> 
> -			if (unlikely(is_vm_hugetlb_page(vma))) {
> -				unmap_hugepage_range(vma, start, end);
> -				zap_work -= (end - start) /
> -						(HPAGE_SIZE / PAGE_SIZE);
> -				start = end;
> -			} else
> +			if (unlikely(has_pt_op(vma, unmap_page_range)))
> +				start = pt_op(vma, unmap_page_range)
> +						(vma, start, end, &zap_work);
> +			else
>  				start = unmap_page_range(*tlbp, vma,
>  						start, end, &zap_work, details);

unmap_vmas() is the sole caller of unmap_page_range().  Would it make
sense to move the hook inside of unmap_page_range()?

-- Dave


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 4/7] unmap_page_range for hugetlb
@ 2007-03-20 23:27     ` Dave Hansen
  0 siblings, 0 replies; 82+ messages in thread
From: Dave Hansen @ 2007-03-20 23:27 UTC (permalink / raw)
  To: Adam Litke
  Cc: Andrew Morton, Arjan van de Ven, William Lee Irwin III,
	Christoph Hellwig, Ken Chen, linux-mm, linux-kernel

On Mon, 2007-03-19 at 13:05 -0700, Adam Litke wrote:
> Signed-off-by: Adam Litke <agl@us.ibm.com>
> ---
> 
>  fs/hugetlbfs/inode.c    |    3 ++-
>  include/linux/hugetlb.h |    4 ++--
>  mm/hugetlb.c            |   12 ++++++++----
>  mm/memory.c             |   10 ++++------
>  4 files changed, 16 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index d0b4b46..198efa7 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -289,7 +289,7 @@ hugetlb_vmtruncate_list(struct prio_tree_root *root, pgoff_t pgoff)
>  			v_offset = 0;
> 
>  		__unmap_hugepage_range(vma,
> -				vma->vm_start + v_offset, vma->vm_end);
> +				vma->vm_start + v_offset, vma->vm_end, 0);
>  	}
>  }

Did you mean NULL instead of 0 here?

> @@ -568,6 +568,7 @@ const struct file_operations hugetlbfs_file_operations = {
>  static const struct pagetable_operations_struct hugetlbfs_pagetable_ops = {
>  	.copy_vma		= copy_hugetlb_page_range,
>  	.pin_pages		= follow_hugetlb_page,
> +	.unmap_page_range	= unmap_hugepage_range,
>  };
> 
>  static const struct inode_operations hugetlbfs_dir_inode_operations = {
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 3f3e7a6..502c2f8 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -17,8 +17,8 @@ static inline int is_vm_hugetlb_page(struct vm_area_struct *vma)
>  int hugetlb_sysctl_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *);
>  int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *);
>  int follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *, struct page **, struct vm_area_struct **, unsigned long *, int *, int);
> -void unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long);
> -void __unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long);
> +unsigned long unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long, long *);
> +void __unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long, long *);
>  int hugetlb_prefault(struct address_space *, struct vm_area_struct *);
>  int hugetlb_report_meminfo(char *);
>  int hugetlb_report_node_meminfo(int, char *);
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 36db012..d902fb9 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -356,7 +356,7 @@ nomem:
>  }
> 
>  void __unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
> -			    unsigned long end)
> +			    unsigned long end, long *zap_work)
>  {
>  	struct mm_struct *mm = vma->vm_mm;
>  	unsigned long address;
> @@ -399,10 +399,13 @@ void __unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
>  		list_del(&page->lru);
>  		put_page(page);
>  	}
> +
> +	if (zap_work)
> +		*zap_work -= (end - start) / (HPAGE_SIZE / PAGE_SIZE);
>  }
> 
> -void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
> -			  unsigned long end)
> +unsigned long unmap_hugepage_range(struct vm_area_struct *vma,
> +			unsigned long start, unsigned long end, long *zap_work)
>  {
>  	/*
>  	 * It is undesirable to test vma->vm_file as it should be non-null
> @@ -414,9 +417,10 @@ void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
>  	 */
>  	if (vma->vm_file) {
>  		spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
> -		__unmap_hugepage_range(vma, start, end);
> +		__unmap_hugepage_range(vma, start, end, zap_work);
>  		spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock);
>  	}
> +	return end;
>  }
> 
>  static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
> diff --git a/mm/memory.c b/mm/memory.c
> index 01256cf..a3bcaf3 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -839,12 +839,10 @@ unsigned long unmap_vmas(struct mmu_gather **tlbp,
>  				tlb_start_valid = 1;
>  			}
> 
> -			if (unlikely(is_vm_hugetlb_page(vma))) {
> -				unmap_hugepage_range(vma, start, end);
> -				zap_work -= (end - start) /
> -						(HPAGE_SIZE / PAGE_SIZE);
> -				start = end;
> -			} else
> +			if (unlikely(has_pt_op(vma, unmap_page_range)))
> +				start = pt_op(vma, unmap_page_range)
> +						(vma, start, end, &zap_work);
> +			else
>  				start = unmap_page_range(*tlbp, vma,
>  						start, end, &zap_work, details);

unmap_vmas() is the sole caller of unmap_page_range().  Would it make
sense to move the hook inside of unmap_page_range()?

-- Dave

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 0/7] [RFC] hugetlb: pagetable_operations API (V2)
  2007-03-19 20:05 ` Adam Litke
@ 2007-03-20 23:50   ` Dave Hansen
  -1 siblings, 0 replies; 82+ messages in thread
From: Dave Hansen @ 2007-03-20 23:50 UTC (permalink / raw)
  To: Adam Litke
  Cc: Andrew Morton, Arjan van de Ven, William Lee Irwin III,
	Christoph Hellwig, Ken Chen, linux-mm, linux-kernel

On Mon, 2007-03-19 at 13:05 -0700, Adam Litke wrote:
> For the common case (vma->pagetable_ops == NULL), we do almost the
> same thing as the current code: load and test.  The third instruction
> is different in that we jump for the common case instead of jumping in
> the hugetlb case.  I don't think this is a big deal though.  If it is,
> would an unlikely() macro fix it? 

I wouldn't worry about micro-optimizing it at that level.  The CPU does
enough stuff under the covers that I wouldn't worry about it at all.

I wonder if the real differential impact (if any) is likely to come from
the pagetable_ops cacheline being hot or cold, since it is in a
different place in the structure than the flags.  But, from a quick
glance I see a few vm_ops references preceding pagetable_ops references,
so the pagetable_ops cacheline might already be hot most of the time.  

BTW, are there any other possible users for these things other than
large pages?

-- Dave


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 0/7] [RFC] hugetlb: pagetable_operations API (V2)
@ 2007-03-20 23:50   ` Dave Hansen
  0 siblings, 0 replies; 82+ messages in thread
From: Dave Hansen @ 2007-03-20 23:50 UTC (permalink / raw)
  To: Adam Litke
  Cc: Andrew Morton, Arjan van de Ven, William Lee Irwin III,
	Christoph Hellwig, Ken Chen, linux-mm, linux-kernel

On Mon, 2007-03-19 at 13:05 -0700, Adam Litke wrote:
> For the common case (vma->pagetable_ops == NULL), we do almost the
> same thing as the current code: load and test.  The third instruction
> is different in that we jump for the common case instead of jumping in
> the hugetlb case.  I don't think this is a big deal though.  If it is,
> would an unlikely() macro fix it? 

I wouldn't worry about micro-optimizing it at that level.  The CPU does
enough stuff under the covers that I wouldn't worry about it at all.

I wonder if the real differential impact (if any) is likely to come from
the pagetable_ops cacheline being hot or cold, since it is in a
different place in the structure than the flags.  But, from a quick
glance I see a few vm_ops references preceding pagetable_ops references,
so the pagetable_ops cacheline might already be hot most of the time.  

BTW, are there any other possible users for these things other than
large pages?

-- Dave

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 0/7] [RFC] hugetlb: pagetable_operations API (V2)
  2007-03-19 20:05 ` Adam Litke
@ 2007-03-21  1:17   ` William Lee Irwin III
  -1 siblings, 0 replies; 82+ messages in thread
From: William Lee Irwin III @ 2007-03-21  1:17 UTC (permalink / raw)
  To: Adam Litke
  Cc: Andrew Morton, Arjan van de Ven, Christoph Hellwig, Ken Chen,
	linux-mm, linux-kernel

On Mon, Mar 19, 2007 at 01:05:02PM -0700, Adam Litke wrote:
> Andrew, given the favorable review of these patches the last time
> around, would you consider them for the -mm tree?  Does anyone else
> have any objections?

We need a new round of commentary for how it should integrate with
Nick Piggin's fault handling patches given that both introduce very
similar ->fault() methods, albeit at different places and for different
purposes.

I think things weren't entirely wrapped up last time but there was
general approval in concept and code-level issues had been gotten past.
I've forgotten the conclusion of hch and arjan's commentary on making
the pagetable operations mandatory. ISTR they were all cosmetic affairs
like that or whether they should be part of ->vm_ops as opposed to
fundamental issues.

The last thing I'd want to do is hold things back, so by no means
delay merging etc. on account of this, but I am curious on several
points. First, is there any demonstrable overhead to mandatory indirect
calls for the pagetable operations? Second, can case analysis for e.g.
file-backed vs. anon and/or COW vs. shared be avoided by the use of
the indirect function call, or more specifically, to any beneficial
effect? Well, I rearranged the code in such a manner ca. 2.6.6 so I
know the rearrangement is possible, but not the performance impact vs.
modern kernels, if any, never mind how the code ends up looking in
modern kernels. Third, could you use lmbench or some such to get direct
fork() and fault handling microbenchmarks? Kernel compiles are too
close to macrobenchmarks to say anything concrete there apart from that
other issues (e.g. SMP load balancing, NUMA, lock contention, etc.)
dominate indirect calls. If you have the time or interest to explore
any of these areas, I'd be very interested in hearing the results.

One thing I would like to see for sure is dropping the has_pt_op()
and pt_op() macros. The Linux-native convention is to open-code the
function pointer fetches, and the non-native convention is to wrap
things like defaulting (though they actually do something more involved)
in the analogue of pt_op() for the purposes of things like extensible
sets of operations bordering on OOP-ish method tables. So this ends up
as some sort of hybrid convention without the functionality of the
non-native call wrappers and without the clarity of open-coding. My
personal preference is that the function pointer table be mandatory and
the call to the the function pointer be unconditional and the type
dispatch accomplished entirely through the function pointers, but I'm
not particularly insistent about that.


-- wli

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 0/7] [RFC] hugetlb: pagetable_operations API (V2)
@ 2007-03-21  1:17   ` William Lee Irwin III
  0 siblings, 0 replies; 82+ messages in thread
From: William Lee Irwin III @ 2007-03-21  1:17 UTC (permalink / raw)
  To: Adam Litke
  Cc: Andrew Morton, Arjan van de Ven, Christoph Hellwig, Ken Chen,
	linux-mm, linux-kernel

On Mon, Mar 19, 2007 at 01:05:02PM -0700, Adam Litke wrote:
> Andrew, given the favorable review of these patches the last time
> around, would you consider them for the -mm tree?  Does anyone else
> have any objections?

We need a new round of commentary for how it should integrate with
Nick Piggin's fault handling patches given that both introduce very
similar ->fault() methods, albeit at different places and for different
purposes.

I think things weren't entirely wrapped up last time but there was
general approval in concept and code-level issues had been gotten past.
I've forgotten the conclusion of hch and arjan's commentary on making
the pagetable operations mandatory. ISTR they were all cosmetic affairs
like that or whether they should be part of ->vm_ops as opposed to
fundamental issues.

The last thing I'd want to do is hold things back, so by no means
delay merging etc. on account of this, but I am curious on several
points. First, is there any demonstrable overhead to mandatory indirect
calls for the pagetable operations? Second, can case analysis for e.g.
file-backed vs. anon and/or COW vs. shared be avoided by the use of
the indirect function call, or more specifically, to any beneficial
effect? Well, I rearranged the code in such a manner ca. 2.6.6 so I
know the rearrangement is possible, but not the performance impact vs.
modern kernels, if any, never mind how the code ends up looking in
modern kernels. Third, could you use lmbench or some such to get direct
fork() and fault handling microbenchmarks? Kernel compiles are too
close to macrobenchmarks to say anything concrete there apart from that
other issues (e.g. SMP load balancing, NUMA, lock contention, etc.)
dominate indirect calls. If you have the time or interest to explore
any of these areas, I'd be very interested in hearing the results.

One thing I would like to see for sure is dropping the has_pt_op()
and pt_op() macros. The Linux-native convention is to open-code the
function pointer fetches, and the non-native convention is to wrap
things like defaulting (though they actually do something more involved)
in the analogue of pt_op() for the purposes of things like extensible
sets of operations bordering on OOP-ish method tables. So this ends up
as some sort of hybrid convention without the functionality of the
non-native call wrappers and without the clarity of open-coding. My
personal preference is that the function pointer table be mandatory and
the call to the the function pointer be unconditional and the type
dispatch accomplished entirely through the function pointers, but I'm
not particularly insistent about that.


-- wli

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 1/7] Introduce the pagetable_operations and associated helper macros.
  2007-03-19 20:05   ` Adam Litke
@ 2007-03-21  4:18     ` Nick Piggin
  -1 siblings, 0 replies; 82+ messages in thread
From: Nick Piggin @ 2007-03-21  4:18 UTC (permalink / raw)
  To: Adam Litke
  Cc: Andrew Morton, Arjan van de Ven, William Lee Irwin III,
	Christoph Hellwig, Ken Chen, linux-mm, linux-kernel

Adam Litke wrote:
> Signed-off-by: Adam Litke <agl@us.ibm.com>
> ---
> 
>  include/linux/mm.h |   25 +++++++++++++++++++++++++
>  1 files changed, 25 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 60e0e4a..7089323 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -98,6 +98,7 @@ struct vm_area_struct {
>  
>  	/* Function pointers to deal with this struct. */
>  	struct vm_operations_struct * vm_ops;
> +	const struct pagetable_operations_struct * pagetable_ops;
>  
>  	/* Information about our backing store: */
>  	unsigned long vm_pgoff;		/* Offset (within vm_file) in PAGE_SIZE

Can you remind me why this isn't in vm_ops?

Also, it is going to be hugepage-only, isn't it? So should the naming be
changed to reflect that? And #ifdef it...

> @@ -218,6 +219,30 @@ struct vm_operations_struct {
>  };
>  
>  struct mmu_gather;
> +
> +struct pagetable_operations_struct {
> +	int (*fault)(struct mm_struct *mm,
> +		struct vm_area_struct *vma,
> +		unsigned long address, int write_access);

I got dibs on fault ;)

My callback is a sanitised one that basically abstracts the details of the
virtual memory mapping away, so it is usable by drivers and filesystems.

You actually want to bypass the normal fault handling because it doesn't
know how to deal with your virtual memory mapping. Hmm, the best suggestion
I can come up with is handle_mm_fault... unless you can think of a better
name for me to use.

> +	int (*copy_vma)(struct mm_struct *dst, struct mm_struct *src,
> +		struct vm_area_struct *vma);
> +	int (*pin_pages)(struct mm_struct *mm, struct vm_area_struct *vma,
> +		struct page **pages, struct vm_area_struct **vmas,
> +		unsigned long *position, int *length, int i);
> +	void (*change_protection)(struct vm_area_struct *vma,
> +		unsigned long address, unsigned long end, pgprot_t newprot);
> +	unsigned long (*unmap_page_range)(struct vm_area_struct *vma,
> +		unsigned long address, unsigned long end, long *zap_work);
> +	void (*free_pgtable_range)(struct mmu_gather **tlb,
> +		unsigned long addr, unsigned long end,
> +		unsigned long floor, unsigned long ceiling);
> +};
> +
> +#define has_pt_op(vma, op) \
> +	((vma)->pagetable_ops && (vma)->pagetable_ops->op)
> +#define pt_op(vma, call) \
> +	((vma)->pagetable_ops->call)
> +
>  struct inode;
>  
>  #define page_private(page)		((page)->private)
> 
> --

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 1/7] Introduce the pagetable_operations and associated helper macros.
@ 2007-03-21  4:18     ` Nick Piggin
  0 siblings, 0 replies; 82+ messages in thread
From: Nick Piggin @ 2007-03-21  4:18 UTC (permalink / raw)
  To: Adam Litke
  Cc: Andrew Morton, Arjan van de Ven, William Lee Irwin III,
	Christoph Hellwig, Ken Chen, linux-mm, linux-kernel

Adam Litke wrote:
> Signed-off-by: Adam Litke <agl@us.ibm.com>
> ---
> 
>  include/linux/mm.h |   25 +++++++++++++++++++++++++
>  1 files changed, 25 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 60e0e4a..7089323 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -98,6 +98,7 @@ struct vm_area_struct {
>  
>  	/* Function pointers to deal with this struct. */
>  	struct vm_operations_struct * vm_ops;
> +	const struct pagetable_operations_struct * pagetable_ops;
>  
>  	/* Information about our backing store: */
>  	unsigned long vm_pgoff;		/* Offset (within vm_file) in PAGE_SIZE

Can you remind me why this isn't in vm_ops?

Also, it is going to be hugepage-only, isn't it? So should the naming be
changed to reflect that? And #ifdef it...

> @@ -218,6 +219,30 @@ struct vm_operations_struct {
>  };
>  
>  struct mmu_gather;
> +
> +struct pagetable_operations_struct {
> +	int (*fault)(struct mm_struct *mm,
> +		struct vm_area_struct *vma,
> +		unsigned long address, int write_access);

I got dibs on fault ;)

My callback is a sanitised one that basically abstracts the details of the
virtual memory mapping away, so it is usable by drivers and filesystems.

You actually want to bypass the normal fault handling because it doesn't
know how to deal with your virtual memory mapping. Hmm, the best suggestion
I can come up with is handle_mm_fault... unless you can think of a better
name for me to use.

> +	int (*copy_vma)(struct mm_struct *dst, struct mm_struct *src,
> +		struct vm_area_struct *vma);
> +	int (*pin_pages)(struct mm_struct *mm, struct vm_area_struct *vma,
> +		struct page **pages, struct vm_area_struct **vmas,
> +		unsigned long *position, int *length, int i);
> +	void (*change_protection)(struct vm_area_struct *vma,
> +		unsigned long address, unsigned long end, pgprot_t newprot);
> +	unsigned long (*unmap_page_range)(struct vm_area_struct *vma,
> +		unsigned long address, unsigned long end, long *zap_work);
> +	void (*free_pgtable_range)(struct mmu_gather **tlb,
> +		unsigned long addr, unsigned long end,
> +		unsigned long floor, unsigned long ceiling);
> +};
> +
> +#define has_pt_op(vma, op) \
> +	((vma)->pagetable_ops && (vma)->pagetable_ops->op)
> +#define pt_op(vma, call) \
> +	((vma)->pagetable_ops->call)
> +
>  struct inode;
>  
>  #define page_private(page)		((page)->private)
> 
> --

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 1/7] Introduce the pagetable_operations and associated helper macros.
  2007-03-21  4:18     ` Nick Piggin
@ 2007-03-21  4:52       ` William Lee Irwin III
  -1 siblings, 0 replies; 82+ messages in thread
From: William Lee Irwin III @ 2007-03-21  4:52 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Adam Litke, Andrew Morton, Arjan van de Ven, Christoph Hellwig,
	Ken Chen, linux-mm, linux-kernel

Adam Litke wrote:
>>  	struct vm_operations_struct * vm_ops;
>> +	const struct pagetable_operations_struct * pagetable_ops;

On Wed, Mar 21, 2007 at 03:18:30PM +1100, Nick Piggin wrote:
> Can you remind me why this isn't in vm_ops?
> Also, it is going to be hugepage-only, isn't it? So should the naming be
> changed to reflect that? And #ifdef it...

ISTR potential ppc64 users coming out of the woodwork for something I
didn't recognize the name of, but I may be confusing that with your
patch. I can implement additional users (and useful ones at that)
needing this in particular if desired.


Adam Litke wrote:
>> +struct pagetable_operations_struct {
>> +	int (*fault)(struct mm_struct *mm,

On Wed, Mar 21, 2007 at 03:18:30PM +1100, Nick Piggin wrote:
> I got dibs on fault ;)
> My callback is a sanitised one that basically abstracts the details of the
> virtual memory mapping away, so it is usable by drivers and filesystems.
> You actually want to bypass the normal fault handling because it doesn't
> know how to deal with your virtual memory mapping. Hmm, the best suggestion
> I can come up with is handle_mm_fault... unless you can think of a better
> name for me to use.

Two fault handling methods callbacks raise an eyebrow over here at least.
I was vaguely hoping for unification of the fault handling callbacks.


-- wli

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 1/7] Introduce the pagetable_operations and associated helper macros.
@ 2007-03-21  4:52       ` William Lee Irwin III
  0 siblings, 0 replies; 82+ messages in thread
From: William Lee Irwin III @ 2007-03-21  4:52 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Adam Litke, Andrew Morton, Arjan van de Ven, Christoph Hellwig,
	Ken Chen, linux-mm, linux-kernel

Adam Litke wrote:
>>  	struct vm_operations_struct * vm_ops;
>> +	const struct pagetable_operations_struct * pagetable_ops;

On Wed, Mar 21, 2007 at 03:18:30PM +1100, Nick Piggin wrote:
> Can you remind me why this isn't in vm_ops?
> Also, it is going to be hugepage-only, isn't it? So should the naming be
> changed to reflect that? And #ifdef it...

ISTR potential ppc64 users coming out of the woodwork for something I
didn't recognize the name of, but I may be confusing that with your
patch. I can implement additional users (and useful ones at that)
needing this in particular if desired.


Adam Litke wrote:
>> +struct pagetable_operations_struct {
>> +	int (*fault)(struct mm_struct *mm,

On Wed, Mar 21, 2007 at 03:18:30PM +1100, Nick Piggin wrote:
> I got dibs on fault ;)
> My callback is a sanitised one that basically abstracts the details of the
> virtual memory mapping away, so it is usable by drivers and filesystems.
> You actually want to bypass the normal fault handling because it doesn't
> know how to deal with your virtual memory mapping. Hmm, the best suggestion
> I can come up with is handle_mm_fault... unless you can think of a better
> name for me to use.

Two fault handling methods callbacks raise an eyebrow over here at least.
I was vaguely hoping for unification of the fault handling callbacks.


-- wli

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 1/7] Introduce the pagetable_operations and associated helper macros.
  2007-03-21  4:52       ` William Lee Irwin III
@ 2007-03-21  5:07         ` Nick Piggin
  -1 siblings, 0 replies; 82+ messages in thread
From: Nick Piggin @ 2007-03-21  5:07 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Adam Litke, Andrew Morton, Arjan van de Ven, Christoph Hellwig,
	Ken Chen, linux-mm, linux-kernel

William Lee Irwin III wrote:
> Adam Litke wrote:
> 
>>> 	struct vm_operations_struct * vm_ops;
>>>+	const struct pagetable_operations_struct * pagetable_ops;
> 
> 
> On Wed, Mar 21, 2007 at 03:18:30PM +1100, Nick Piggin wrote:
> 
>>Can you remind me why this isn't in vm_ops?
>>Also, it is going to be hugepage-only, isn't it? So should the naming be
>>changed to reflect that? And #ifdef it...
> 
> 
> ISTR potential ppc64 users coming out of the woodwork for something I
> didn't recognize the name of, but I may be confusing that with your
> patch. I can implement additional users (and useful ones at that)
> needing this in particular if desired.

Yes I would be interested in seeing useful additional users of this
that cannot use our regular virtual memory, before making it a general
thing.

I just don't want to see proliferation of these things, if possible.

> Adam Litke wrote:
> 
>>>+struct pagetable_operations_struct {
>>>+	int (*fault)(struct mm_struct *mm,
> 
> 
> On Wed, Mar 21, 2007 at 03:18:30PM +1100, Nick Piggin wrote:
> 
>>I got dibs on fault ;)
>>My callback is a sanitised one that basically abstracts the details of the
>>virtual memory mapping away, so it is usable by drivers and filesystems.
>>You actually want to bypass the normal fault handling because it doesn't
>>know how to deal with your virtual memory mapping. Hmm, the best suggestion
>>I can come up with is handle_mm_fault... unless you can think of a better
>>name for me to use.
> 
> 
> Two fault handling methods callbacks raise an eyebrow over here at least.
> I was vaguely hoping for unification of the fault handling callbacks.

I don't know if it would be so clean to do that as they are at different levels.
Adam's fault is before the VM translation (and bypasses it), and mine is after.

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 1/7] Introduce the pagetable_operations and associated helper macros.
@ 2007-03-21  5:07         ` Nick Piggin
  0 siblings, 0 replies; 82+ messages in thread
From: Nick Piggin @ 2007-03-21  5:07 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Adam Litke, Andrew Morton, Arjan van de Ven, Christoph Hellwig,
	Ken Chen, linux-mm, linux-kernel

William Lee Irwin III wrote:
> Adam Litke wrote:
> 
>>> 	struct vm_operations_struct * vm_ops;
>>>+	const struct pagetable_operations_struct * pagetable_ops;
> 
> 
> On Wed, Mar 21, 2007 at 03:18:30PM +1100, Nick Piggin wrote:
> 
>>Can you remind me why this isn't in vm_ops?
>>Also, it is going to be hugepage-only, isn't it? So should the naming be
>>changed to reflect that? And #ifdef it...
> 
> 
> ISTR potential ppc64 users coming out of the woodwork for something I
> didn't recognize the name of, but I may be confusing that with your
> patch. I can implement additional users (and useful ones at that)
> needing this in particular if desired.

Yes I would be interested in seeing useful additional users of this
that cannot use our regular virtual memory, before making it a general
thing.

I just don't want to see proliferation of these things, if possible.

> Adam Litke wrote:
> 
>>>+struct pagetable_operations_struct {
>>>+	int (*fault)(struct mm_struct *mm,
> 
> 
> On Wed, Mar 21, 2007 at 03:18:30PM +1100, Nick Piggin wrote:
> 
>>I got dibs on fault ;)
>>My callback is a sanitised one that basically abstracts the details of the
>>virtual memory mapping away, so it is usable by drivers and filesystems.
>>You actually want to bypass the normal fault handling because it doesn't
>>know how to deal with your virtual memory mapping. Hmm, the best suggestion
>>I can come up with is handle_mm_fault... unless you can think of a better
>>name for me to use.
> 
> 
> Two fault handling methods callbacks raise an eyebrow over here at least.
> I was vaguely hoping for unification of the fault handling callbacks.

I don't know if it would be so clean to do that as they are at different levels.
Adam's fault is before the VM translation (and bypasses it), and mine is after.

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 1/7] Introduce the pagetable_operations and associated helper macros.
  2007-03-21  5:07         ` Nick Piggin
@ 2007-03-21  5:41           ` William Lee Irwin III
  -1 siblings, 0 replies; 82+ messages in thread
From: William Lee Irwin III @ 2007-03-21  5:41 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Adam Litke, Andrew Morton, Arjan van de Ven, Christoph Hellwig,
	Ken Chen, linux-mm, linux-kernel

William Lee Irwin III wrote:
>> ISTR potential ppc64 users coming out of the woodwork for something I
>> didn't recognize the name of, but I may be confusing that with your
>> patch. I can implement additional users (and useful ones at that)
>> needing this in particular if desired.

On Wed, Mar 21, 2007 at 04:07:43PM +1100, Nick Piggin wrote:
> Yes I would be interested in seeing useful additional users of this
> that cannot use our regular virtual memory, before making it a general
> thing.
> I just don't want to see proliferation of these things, if possible.

I'm tied up elsewhere so I won't get to it in a timely fashion. Maybe
in a few weeks I can start up on the first two of the bunch.


William Lee Irwin III wrote:
>> Two fault handling methods callbacks raise an eyebrow over here at least.
>> I was vaguely hoping for unification of the fault handling callbacks.

On Wed, Mar 21, 2007 at 04:07:43PM +1100, Nick Piggin wrote:
> I don't know if it would be so clean to do that as they are at different 
> levels.
> Adam's fault is before the VM translation (and bypasses it), and mine is 
> after.

Not much of a VM translation; it's just a lookup through the software
mocked-up structures on everything save i386, x86_64, and some m68k where
they're the same thing only with hardware walkers (ISTR ia64's being
firmware a la Alpha despite the "HPW" name, though I could be wrong)
reliant on them. The drivers/etc. could just as easily use helper
functions to carry out the lookup, thereby accomplishing the
unification. There's nothing particularly fundamental about a pte
lookup. Normal arches that do software TLB refill could just as easily
consult the radix trees dangled off struct address_space or any old
data structure floating around the kernel with enough information to
translate user virtual addresses to the physical addresses they need to
fill the TLB with, and there are other kernels that literally do things
like that.

Basically, drop in to the ->fault() callback with no attempt at a pte
lookup. The drivers using the standard pagetable format can call helper
functions to do all the gruntwork surrounding that for them. Then the
more sophisticated drivers can do the necessary work by hand.

But others should really be consulted on this point. My notions in/around
this area tend to be outside the mainstream. I can anticipate that the
two ->fault() functions will look strange to people, but not what
alternatives would be most idiomatic to mainstream Linux conventions.


-- wli

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 1/7] Introduce the pagetable_operations and associated helper macros.
@ 2007-03-21  5:41           ` William Lee Irwin III
  0 siblings, 0 replies; 82+ messages in thread
From: William Lee Irwin III @ 2007-03-21  5:41 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Adam Litke, Andrew Morton, Arjan van de Ven, Christoph Hellwig,
	Ken Chen, linux-mm, linux-kernel

William Lee Irwin III wrote:
>> ISTR potential ppc64 users coming out of the woodwork for something I
>> didn't recognize the name of, but I may be confusing that with your
>> patch. I can implement additional users (and useful ones at that)
>> needing this in particular if desired.

On Wed, Mar 21, 2007 at 04:07:43PM +1100, Nick Piggin wrote:
> Yes I would be interested in seeing useful additional users of this
> that cannot use our regular virtual memory, before making it a general
> thing.
> I just don't want to see proliferation of these things, if possible.

I'm tied up elsewhere so I won't get to it in a timely fashion. Maybe
in a few weeks I can start up on the first two of the bunch.


William Lee Irwin III wrote:
>> Two fault handling methods callbacks raise an eyebrow over here at least.
>> I was vaguely hoping for unification of the fault handling callbacks.

On Wed, Mar 21, 2007 at 04:07:43PM +1100, Nick Piggin wrote:
> I don't know if it would be so clean to do that as they are at different 
> levels.
> Adam's fault is before the VM translation (and bypasses it), and mine is 
> after.

Not much of a VM translation; it's just a lookup through the software
mocked-up structures on everything save i386, x86_64, and some m68k where
they're the same thing only with hardware walkers (ISTR ia64's being
firmware a la Alpha despite the "HPW" name, though I could be wrong)
reliant on them. The drivers/etc. could just as easily use helper
functions to carry out the lookup, thereby accomplishing the
unification. There's nothing particularly fundamental about a pte
lookup. Normal arches that do software TLB refill could just as easily
consult the radix trees dangled off struct address_space or any old
data structure floating around the kernel with enough information to
translate user virtual addresses to the physical addresses they need to
fill the TLB with, and there are other kernels that literally do things
like that.

Basically, drop in to the ->fault() callback with no attempt at a pte
lookup. The drivers using the standard pagetable format can call helper
functions to do all the gruntwork surrounding that for them. Then the
more sophisticated drivers can do the necessary work by hand.

But others should really be consulted on this point. My notions in/around
this area tend to be outside the mainstream. I can anticipate that the
two ->fault() functions will look strange to people, but not what
alternatives would be most idiomatic to mainstream Linux conventions.


-- wli

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 1/7] Introduce the pagetable_operations and associated helper macros.
  2007-03-21  5:41           ` William Lee Irwin III
@ 2007-03-21  6:51             ` Nick Piggin
  -1 siblings, 0 replies; 82+ messages in thread
From: Nick Piggin @ 2007-03-21  6:51 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Adam Litke, Andrew Morton, Arjan van de Ven, Christoph Hellwig,
	Ken Chen, linux-mm, linux-kernel, Linus Torvalds

William Lee Irwin III wrote:
> William Lee Irwin III wrote:
> 
>>>ISTR potential ppc64 users coming out of the woodwork for something I
>>>didn't recognize the name of, but I may be confusing that with your
>>>patch. I can implement additional users (and useful ones at that)
>>>needing this in particular if desired.
> 
> 
> On Wed, Mar 21, 2007 at 04:07:43PM +1100, Nick Piggin wrote:
> 
>>Yes I would be interested in seeing useful additional users of this
>>that cannot use our regular virtual memory, before making it a general
>>thing.
>>I just don't want to see proliferation of these things, if possible.
> 
> 
> I'm tied up elsewhere so I won't get to it in a timely fashion. Maybe
> in a few weeks I can start up on the first two of the bunch.

Care to give us a hint? :)


> William Lee Irwin III wrote:
> 
>>>Two fault handling methods callbacks raise an eyebrow over here at least.
>>>I was vaguely hoping for unification of the fault handling callbacks.
> 
> 
> On Wed, Mar 21, 2007 at 04:07:43PM +1100, Nick Piggin wrote:
> 
>>I don't know if it would be so clean to do that as they are at different 
>>levels.
>>Adam's fault is before the VM translation (and bypasses it), and mine is 
>>after.
> 
> 
> Not much of a VM translation; it's just a lookup through the software
> mocked-up structures on everything save i386, x86_64, and some m68k where
> they're the same thing only with hardware walkers (ISTR ia64's being
> firmware a la Alpha despite the "HPW" name, though I could be wrong)

Well the vma+pagetables *are* our VM translation data structure. It is
a good data structure. The Gelato/UNSW guys experimenting with changing
this have basically said they haven't yet got anything that beats it.

I would be opposed to anything that bypasses that unless a) it is not
applicable to the VM as a whole, and b) it is really worth it
(hugepages was a reasonable exception).


> reliant on them. The drivers/etc. could just as easily use helper
> functions to carry out the lookup, thereby accomplishing the
> unification. There's nothing particularly fundamental about a pte
> lookup.

Yeah you could, but it looks back to front to me.

The VM tells the filesystem that the machine took a fault at virtual
address X, then the filesystem asks the VM what pgoff that is, then
tells the VM to install the corresponding page to vaddr X.

With my ->fault, the VM asks the filesystem to give the page that
corresponds to vaddr X, then installs it into that vaddr.


> Normal arches that do software TLB refill could just as easily
> consult the radix trees dangled off struct address_space or any old
> data structure floating around the kernel with enough information to
> translate user virtual addresses to the physical addresses they need to
> fill the TLB with, and there are other kernels that literally do things
> like that.

Sure it *could* be done, but it may not be very nice, given Linux's
design. And you definitely need _something_ other than just the
pagecache radix-tree, because the VM needs to know who maps the page.

So if, for your backing store, you use a small hash table and evict old
entries like powerpc, you'll constantly be faulting in and out pages
from the VM's high level view of the address space. That isn't a really
cheap operation. It takes at least:

read_lock_irq(mapping->tree_lock);
radix_tree_lookup()
read_unlock_irq(mapping->tree_lock);
lock_page()
atomic_add(page->_count)
atomic_add(page->_mapcount)
unlock_page()

atomic_add_negative(page->_mapcount)
atomic_dec_and_test(page->_count)

Compared to our current page table walk which is just a single locked
op + barrier for the spinlock + radix tree walk.


If you had a very large hash table (ia64 long mode, maybe?), then you
may have slightly fewer high level faults, but range based operations
are going to take a whole lot of cache misses, aren't they? Especially
for small processes.

Not that I wouldn't be happy to be proven wrong, but I don't think it
should be something that sneaks in under these pagetable operations.
IMO.

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 1/7] Introduce the pagetable_operations and associated helper macros.
@ 2007-03-21  6:51             ` Nick Piggin
  0 siblings, 0 replies; 82+ messages in thread
From: Nick Piggin @ 2007-03-21  6:51 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Adam Litke, Andrew Morton, Arjan van de Ven, Christoph Hellwig,
	Ken Chen, linux-mm, linux-kernel, Linus Torvalds

William Lee Irwin III wrote:
> William Lee Irwin III wrote:
> 
>>>ISTR potential ppc64 users coming out of the woodwork for something I
>>>didn't recognize the name of, but I may be confusing that with your
>>>patch. I can implement additional users (and useful ones at that)
>>>needing this in particular if desired.
> 
> 
> On Wed, Mar 21, 2007 at 04:07:43PM +1100, Nick Piggin wrote:
> 
>>Yes I would be interested in seeing useful additional users of this
>>that cannot use our regular virtual memory, before making it a general
>>thing.
>>I just don't want to see proliferation of these things, if possible.
> 
> 
> I'm tied up elsewhere so I won't get to it in a timely fashion. Maybe
> in a few weeks I can start up on the first two of the bunch.

Care to give us a hint? :)


> William Lee Irwin III wrote:
> 
>>>Two fault handling methods callbacks raise an eyebrow over here at least.
>>>I was vaguely hoping for unification of the fault handling callbacks.
> 
> 
> On Wed, Mar 21, 2007 at 04:07:43PM +1100, Nick Piggin wrote:
> 
>>I don't know if it would be so clean to do that as they are at different 
>>levels.
>>Adam's fault is before the VM translation (and bypasses it), and mine is 
>>after.
> 
> 
> Not much of a VM translation; it's just a lookup through the software
> mocked-up structures on everything save i386, x86_64, and some m68k where
> they're the same thing only with hardware walkers (ISTR ia64's being
> firmware a la Alpha despite the "HPW" name, though I could be wrong)

Well the vma+pagetables *are* our VM translation data structure. It is
a good data structure. The Gelato/UNSW guys experimenting with changing
this have basically said they haven't yet got anything that beats it.

I would be opposed to anything that bypasses that unless a) it is not
applicable to the VM as a whole, and b) it is really worth it
(hugepages was a reasonable exception).


> reliant on them. The drivers/etc. could just as easily use helper
> functions to carry out the lookup, thereby accomplishing the
> unification. There's nothing particularly fundamental about a pte
> lookup.

Yeah you could, but it looks back to front to me.

The VM tells the filesystem that the machine took a fault at virtual
address X, then the filesystem asks the VM what pgoff that is, then
tells the VM to install the corresponding page to vaddr X.

With my ->fault, the VM asks the filesystem to give the page that
corresponds to vaddr X, then installs it into that vaddr.


> Normal arches that do software TLB refill could just as easily
> consult the radix trees dangled off struct address_space or any old
> data structure floating around the kernel with enough information to
> translate user virtual addresses to the physical addresses they need to
> fill the TLB with, and there are other kernels that literally do things
> like that.

Sure it *could* be done, but it may not be very nice, given Linux's
design. And you definitely need _something_ other than just the
pagecache radix-tree, because the VM needs to know who maps the page.

So if, for your backing store, you use a small hash table and evict old
entries like powerpc, you'll constantly be faulting in and out pages
from the VM's high level view of the address space. That isn't a really
cheap operation. It takes at least:

read_lock_irq(mapping->tree_lock);
radix_tree_lookup()
read_unlock_irq(mapping->tree_lock);
lock_page()
atomic_add(page->_count)
atomic_add(page->_mapcount)
unlock_page()

atomic_add_negative(page->_mapcount)
atomic_dec_and_test(page->_count)

Compared to our current page table walk which is just a single locked
op + barrier for the spinlock + radix tree walk.


If you had a very large hash table (ia64 long mode, maybe?), then you
may have slightly fewer high level faults, but range based operations
are going to take a whole lot of cache misses, aren't they? Especially
for small processes.

Not that I wouldn't be happy to be proven wrong, but I don't think it
should be something that sneaks in under these pagetable operations.
IMO.

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 1/7] Introduce the pagetable_operations and associated helper macros.
  2007-03-21  6:51             ` Nick Piggin
@ 2007-03-21  7:36               ` Nick Piggin
  -1 siblings, 0 replies; 82+ messages in thread
From: Nick Piggin @ 2007-03-21  7:36 UTC (permalink / raw)
  To: Nick Piggin
  Cc: William Lee Irwin III, Adam Litke, Andrew Morton,
	Arjan van de Ven, Christoph Hellwig, Ken Chen, linux-mm,
	linux-kernel, Linus Torvalds

Nick Piggin wrote:

> Yeah you could, but it looks back to front to me.
> 
> The VM tells the filesystem that the machine took a fault at virtual
> address X, then the filesystem asks the VM what pgoff that is, then
> tells the VM to install the corresponding page to vaddr X.
> 
> With my ->fault, the VM asks the filesystem to give the page that
> corresponds to vaddr X, then installs it into that vaddr.

Err, sorry, that's what the current ->nopage does. It is then still
up to the filesystem to do the vaddr to pgoff conversion.

My fault patches of course just ask the filesystem for the page at
a given pgoff.

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 1/7] Introduce the pagetable_operations and associated helper macros.
@ 2007-03-21  7:36               ` Nick Piggin
  0 siblings, 0 replies; 82+ messages in thread
From: Nick Piggin @ 2007-03-21  7:36 UTC (permalink / raw)
  To: Nick Piggin
  Cc: William Lee Irwin III, Adam Litke, Andrew Morton,
	Arjan van de Ven, Christoph Hellwig, Ken Chen, linux-mm,
	linux-kernel, Linus Torvalds

Nick Piggin wrote:

> Yeah you could, but it looks back to front to me.
> 
> The VM tells the filesystem that the machine took a fault at virtual
> address X, then the filesystem asks the VM what pgoff that is, then
> tells the VM to install the corresponding page to vaddr X.
> 
> With my ->fault, the VM asks the filesystem to give the page that
> corresponds to vaddr X, then installs it into that vaddr.

Err, sorry, that's what the current ->nopage does. It is then still
up to the filesystem to do the vaddr to pgoff conversion.

My fault patches of course just ask the filesystem for the page at
a given pgoff.

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 1/7] Introduce the pagetable_operations and associated helper macros.
  2007-03-21  6:51             ` Nick Piggin
@ 2007-03-21 10:46               ` William Lee Irwin III
  -1 siblings, 0 replies; 82+ messages in thread
From: William Lee Irwin III @ 2007-03-21 10:46 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Adam Litke, Andrew Morton, Arjan van de Ven, Christoph Hellwig,
	Ken Chen, linux-mm, linux-kernel, Linus Torvalds

William Lee Irwin III wrote:
>> I'm tied up elsewhere so I won't get to it in a timely fashion. Maybe
>> in a few weeks I can start up on the first two of the bunch.

On Wed, Mar 21, 2007 at 05:51:23PM +1100, Nick Piggin wrote:
> Care to give us a hint? :)

The first is something DISM-like. I've not made up my mind on the
second, but the shopping catalogue of feature requests I've done
nothing about for some time that want this is long.


William Lee Irwin III wrote:
>> Not much of a VM translation; it's just a lookup through the
>> software mocked-up structures on everything save i386, x86_64, and
>> some m68k where they're the same thing only with hardware walkers
>> (ISTR ia64's being firmware a la Alpha despite the "HPW" name,
>> though I could be wrong)

On Wed, Mar 21, 2007 at 05:51:23PM +1100, Nick Piggin wrote:
> Well the vma+pagetables *are* our VM translation data structure. It is
> a good data structure. The Gelato/UNSW guys experimenting with changing
> this have basically said they haven't yet got anything that beats it.
> I would be opposed to anything that bypasses that unless a) it is not
> applicable to the VM as a whole, and b) it is really worth it
> (hugepages was a reasonable exception).

Maybe anticipating the conventional Linux approach to this wasn't as
difficult as I supposed. ;)


William Lee Irwin III wrote:
>> reliant on them. The drivers/etc. could just as easily use helper
>> functions to carry out the lookup, thereby accomplishing the
>> unification. There's nothing particularly fundamental about a pte
>> lookup.

On Wed, Mar 21, 2007 at 05:51:23PM +1100, Nick Piggin wrote:
> Yeah you could, but it looks back to front to me.
> The VM tells the filesystem that the machine took a fault at virtual
> address X, then the filesystem asks the VM what pgoff that is, then
> tells the VM to install the corresponding page to vaddr X.
> With my ->fault, the VM asks the filesystem to give the page that
> corresponds to vaddr X, then installs it into that vaddr.

I'm aware of what is now done and the minor modification accomplished
by your ->fault(). Maybe I've even written something like this before
that I never posted. It's obvious what I'm on about and that my
thoughts here are too divergent to fly. Others should chime in with
more Linux-native ideas about what's to be done here.


William Lee Irwin III wrote:
>> Normal arches that do software TLB refill could just as easily
>> consult the radix trees dangled off struct address_space or any old
>> data structure floating around the kernel with enough information to
>> translate user virtual addresses to the physical addresses they need to
>> fill the TLB with, and there are other kernels that literally do things
>> like that.

On Wed, Mar 21, 2007 at 05:51:23PM +1100, Nick Piggin wrote:
> Sure it *could* be done, but it may not be very nice, given Linux's
> design. And you definitely need _something_ other than just the
> pagecache radix-tree, because the VM needs to know who maps the page.
> So if, for your backing store, you use a small hash table and evict old
> entries like powerpc, you'll constantly be faulting in and out pages
> from the VM's high level view of the address space. That isn't a really
> cheap operation. It takes at least:
[long list of locking operations snipped]
> Compared to our current page table walk which is just a single locked
> op + barrier for the spinlock + radix tree walk.
> If you had a very large hash table (ia64 long mode, maybe?), then you
> may have slightly fewer high level faults, but range based operations
> are going to take a whole lot of cache misses, aren't they? Especially
> for small processes.
> Not that I wouldn't be happy to be proven wrong, but I don't think it
> should be something that sneaks in under these pagetable operations.
> IMO.

I'll presume that was not for my benefit; if so, it was superfluous.

The example I gave was to show how far things could diverge from Linux'
conventions. Every single locking operation cited for Linux didn't
apply to the kernel I was thinking of due to its lockless pagecache
analogue, its lack of a direct equivalent of struct page, and its use
of different lifetime-bounding protocols from reference counting.
Things like page replacement didn't rely on things that would disturb
all that. It all worked out quite well for that kernel. So not only
can it be done other ways, but those ways are indeed efficient.

It should be clear from the above that retrofitting Linux to do similar
is effectively impossible. (Well, if you think you can pull off removing
struct page in favor of no direct equivalent and bounding the lifetimes
of page-sized chunks of memory by shooting down all references using
knowledge of who could possibly be hanging onto them in Linux, feel
free to attempt such a retrofit, and I'll send you a case of Scotch
whisky if you can get it to boot and run a major database benchmark
without crashing regardless of whether it's merged.)

In any event, let's not talk too much at cross-purposes. I'm deferring
to others on this, as I said. Someone else will doubtless come at this
from a direction that gibes better with Linux-native conventions.


-- wli

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 1/7] Introduce the pagetable_operations and associated helper macros.
@ 2007-03-21 10:46               ` William Lee Irwin III
  0 siblings, 0 replies; 82+ messages in thread
From: William Lee Irwin III @ 2007-03-21 10:46 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Adam Litke, Andrew Morton, Arjan van de Ven, Christoph Hellwig,
	Ken Chen, linux-mm, linux-kernel, Linus Torvalds

William Lee Irwin III wrote:
>> I'm tied up elsewhere so I won't get to it in a timely fashion. Maybe
>> in a few weeks I can start up on the first two of the bunch.

On Wed, Mar 21, 2007 at 05:51:23PM +1100, Nick Piggin wrote:
> Care to give us a hint? :)

The first is something DISM-like. I've not made up my mind on the
second, but the shopping catalogue of feature requests I've done
nothing about for some time that want this is long.


William Lee Irwin III wrote:
>> Not much of a VM translation; it's just a lookup through the
>> software mocked-up structures on everything save i386, x86_64, and
>> some m68k where they're the same thing only with hardware walkers
>> (ISTR ia64's being firmware a la Alpha despite the "HPW" name,
>> though I could be wrong)

On Wed, Mar 21, 2007 at 05:51:23PM +1100, Nick Piggin wrote:
> Well the vma+pagetables *are* our VM translation data structure. It is
> a good data structure. The Gelato/UNSW guys experimenting with changing
> this have basically said they haven't yet got anything that beats it.
> I would be opposed to anything that bypasses that unless a) it is not
> applicable to the VM as a whole, and b) it is really worth it
> (hugepages was a reasonable exception).

Maybe anticipating the conventional Linux approach to this wasn't as
difficult as I supposed. ;)


William Lee Irwin III wrote:
>> reliant on them. The drivers/etc. could just as easily use helper
>> functions to carry out the lookup, thereby accomplishing the
>> unification. There's nothing particularly fundamental about a pte
>> lookup.

On Wed, Mar 21, 2007 at 05:51:23PM +1100, Nick Piggin wrote:
> Yeah you could, but it looks back to front to me.
> The VM tells the filesystem that the machine took a fault at virtual
> address X, then the filesystem asks the VM what pgoff that is, then
> tells the VM to install the corresponding page to vaddr X.
> With my ->fault, the VM asks the filesystem to give the page that
> corresponds to vaddr X, then installs it into that vaddr.

I'm aware of what is now done and the minor modification accomplished
by your ->fault(). Maybe I've even written something like this before
that I never posted. It's obvious what I'm on about and that my
thoughts here are too divergent to fly. Others should chime in with
more Linux-native ideas about what's to be done here.


William Lee Irwin III wrote:
>> Normal arches that do software TLB refill could just as easily
>> consult the radix trees dangled off struct address_space or any old
>> data structure floating around the kernel with enough information to
>> translate user virtual addresses to the physical addresses they need to
>> fill the TLB with, and there are other kernels that literally do things
>> like that.

On Wed, Mar 21, 2007 at 05:51:23PM +1100, Nick Piggin wrote:
> Sure it *could* be done, but it may not be very nice, given Linux's
> design. And you definitely need _something_ other than just the
> pagecache radix-tree, because the VM needs to know who maps the page.
> So if, for your backing store, you use a small hash table and evict old
> entries like powerpc, you'll constantly be faulting in and out pages
> from the VM's high level view of the address space. That isn't a really
> cheap operation. It takes at least:
[long list of locking operations snipped]
> Compared to our current page table walk which is just a single locked
> op + barrier for the spinlock + radix tree walk.
> If you had a very large hash table (ia64 long mode, maybe?), then you
> may have slightly fewer high level faults, but range based operations
> are going to take a whole lot of cache misses, aren't they? Especially
> for small processes.
> Not that I wouldn't be happy to be proven wrong, but I don't think it
> should be something that sneaks in under these pagetable operations.
> IMO.

I'll presume that was not for my benefit; if so, it was superfluous.

The example I gave was to show how far things could diverge from Linux'
conventions. Every single locking operation cited for Linux didn't
apply to the kernel I was thinking of due to its lockless pagecache
analogue, its lack of a direct equivalent of struct page, and its use
of different lifetime-bounding protocols from reference counting.
Things like page replacement didn't rely on things that would disturb
all that. It all worked out quite well for that kernel. So not only
can it be done other ways, but those ways are indeed efficient.

It should be clear from the above that retrofitting Linux to do similar
is effectively impossible. (Well, if you think you can pull off removing
struct page in favor of no direct equivalent and bounding the lifetimes
of page-sized chunks of memory by shooting down all references using
knowledge of who could possibly be hanging onto them in Linux, feel
free to attempt such a retrofit, and I'll send you a case of Scotch
whisky if you can get it to boot and run a major database benchmark
without crashing regardless of whether it's merged.)

In any event, let's not talk too much at cross-purposes. I'm deferring
to others on this, as I said. Someone else will doubtless come at this
from a direction that gibes better with Linux-native conventions.


-- wli

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 1/7] Introduce the pagetable_operations and associated helper macros.
  2007-03-20 23:24     ` Dave Hansen
@ 2007-03-21 14:50       ` Adam Litke
  -1 siblings, 0 replies; 82+ messages in thread
From: Adam Litke @ 2007-03-21 14:50 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andrew Morton, Arjan van de Ven, William Lee Irwin III,
	Christoph Hellwig, Ken Chen, linux-mm, linux-kernel

On Tue, 2007-03-20 at 16:24 -0700, Dave Hansen wrote:
> On Mon, 2007-03-19 at 13:05 -0700, Adam Litke wrote:
> > 
> > +#define has_pt_op(vma, op) \
> > +       ((vma)->pagetable_ops && (vma)->pagetable_ops->op)
> > +#define pt_op(vma, call) \
> > +       ((vma)->pagetable_ops->call) 
> 
> Can you get rid of these macros?  I think they make it a wee bit harder
> to read.  My brain doesn't properly parse the foo(arg)(bar) syntax.  
> 
> +       if (has_pt_op(vma, copy_vma))
> +               return pt_op(vma, copy_vma)(dst_mm, src_mm, vma);
> 
> +       if (vma->pagetable_ops && vma->pagetable_ops->copy_vma)
> +               return vma->pagetable_ops->copy_vma(dst_mm, src_mm, vma);
> 
> I guess it does lead to some longish lines.  Does it start looking
> really nasty?

Yeah, it starts to look pretty bad.  Some of these calls are in code
that is already indented several times.

> If you're going to have them, it might just be best to put a single
> unlikely() around the macro definitions themselves to keep anybody from
> having to open-code it for any of the users.  

It should be pretty easy to wrap has_pt_op() with an unlikely().  Good
suggestion.

-- 
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 1/7] Introduce the pagetable_operations and associated helper macros.
@ 2007-03-21 14:50       ` Adam Litke
  0 siblings, 0 replies; 82+ messages in thread
From: Adam Litke @ 2007-03-21 14:50 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andrew Morton, Arjan van de Ven, William Lee Irwin III,
	Christoph Hellwig, Ken Chen, linux-mm, linux-kernel

On Tue, 2007-03-20 at 16:24 -0700, Dave Hansen wrote:
> On Mon, 2007-03-19 at 13:05 -0700, Adam Litke wrote:
> > 
> > +#define has_pt_op(vma, op) \
> > +       ((vma)->pagetable_ops && (vma)->pagetable_ops->op)
> > +#define pt_op(vma, call) \
> > +       ((vma)->pagetable_ops->call) 
> 
> Can you get rid of these macros?  I think they make it a wee bit harder
> to read.  My brain doesn't properly parse the foo(arg)(bar) syntax.  
> 
> +       if (has_pt_op(vma, copy_vma))
> +               return pt_op(vma, copy_vma)(dst_mm, src_mm, vma);
> 
> +       if (vma->pagetable_ops && vma->pagetable_ops->copy_vma)
> +               return vma->pagetable_ops->copy_vma(dst_mm, src_mm, vma);
> 
> I guess it does lead to some longish lines.  Does it start looking
> really nasty?

Yeah, it starts to look pretty bad.  Some of these calls are in code
that is already indented several times.

> If you're going to have them, it might just be best to put a single
> unlikely() around the macro definitions themselves to keep anybody from
> having to open-code it for any of the users.  

It should be pretty easy to wrap has_pt_op() with an unlikely().  Good
suggestion.

-- 
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 1/7] Introduce the pagetable_operations and associated helper macros.
  2007-03-21 14:50       ` Adam Litke
@ 2007-03-21 15:05         ` Arjan van de Ven
  -1 siblings, 0 replies; 82+ messages in thread
From: Arjan van de Ven @ 2007-03-21 15:05 UTC (permalink / raw)
  To: Adam Litke
  Cc: Dave Hansen, Andrew Morton, William Lee Irwin III,
	Christoph Hellwig, Ken Chen, linux-mm, linux-kernel

On Wed, 2007-03-21 at 09:50 -0500, Adam Litke wrote:
> On Tue, 2007-03-20 at 16:24 -0700, Dave Hansen wrote:
> > On Mon, 2007-03-19 at 13:05 -0700, Adam Litke wrote:
> > > 
> > > +#define has_pt_op(vma, op) \
> > > +       ((vma)->pagetable_ops && (vma)->pagetable_ops->op)
> > > +#define pt_op(vma, call) \
> > > +       ((vma)->pagetable_ops->call) 
> > 
> > Can you get rid of these macros?  I think they make it a wee bit harder
> > to read.  My brain doesn't properly parse the foo(arg)(bar) syntax.  
> > 
> > +       if (has_pt_op(vma, copy_vma))
> > +               return pt_op(vma, copy_vma)(dst_mm, src_mm, vma);
> > 
> > +       if (vma->pagetable_ops && vma->pagetable_ops->copy_vma)
> > +               return vma->pagetable_ops->copy_vma(dst_mm, src_mm, vma);
> > 
> > I guess it does lead to some longish lines.  Does it start looking
> > really nasty?
> 
> Yeah, it starts to look pretty bad.  Some of these calls are in code
> that is already indented several times.

can we just make sure these things are never NULL in the first place?
would obsolete a lot of the checks, which are also runtime overhead as
well!
-- 
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 1/7] Introduce the pagetable_operations and associated helper macros.
@ 2007-03-21 15:05         ` Arjan van de Ven
  0 siblings, 0 replies; 82+ messages in thread
From: Arjan van de Ven @ 2007-03-21 15:05 UTC (permalink / raw)
  To: Adam Litke
  Cc: Dave Hansen, Andrew Morton, William Lee Irwin III,
	Christoph Hellwig, Ken Chen, linux-mm, linux-kernel

On Wed, 2007-03-21 at 09:50 -0500, Adam Litke wrote:
> On Tue, 2007-03-20 at 16:24 -0700, Dave Hansen wrote:
> > On Mon, 2007-03-19 at 13:05 -0700, Adam Litke wrote:
> > > 
> > > +#define has_pt_op(vma, op) \
> > > +       ((vma)->pagetable_ops && (vma)->pagetable_ops->op)
> > > +#define pt_op(vma, call) \
> > > +       ((vma)->pagetable_ops->call) 
> > 
> > Can you get rid of these macros?  I think they make it a wee bit harder
> > to read.  My brain doesn't properly parse the foo(arg)(bar) syntax.  
> > 
> > +       if (has_pt_op(vma, copy_vma))
> > +               return pt_op(vma, copy_vma)(dst_mm, src_mm, vma);
> > 
> > +       if (vma->pagetable_ops && vma->pagetable_ops->copy_vma)
> > +               return vma->pagetable_ops->copy_vma(dst_mm, src_mm, vma);
> > 
> > I guess it does lead to some longish lines.  Does it start looking
> > really nasty?
> 
> Yeah, it starts to look pretty bad.  Some of these calls are in code
> that is already indented several times.

can we just make sure these things are never NULL in the first place?
would obsolete a lot of the checks, which are also runtime overhead as
well!
-- 
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 1/7] Introduce the pagetable_operations and associated helper macros.
  2007-03-21  4:18     ` Nick Piggin
@ 2007-03-21 15:17       ` Adam Litke
  -1 siblings, 0 replies; 82+ messages in thread
From: Adam Litke @ 2007-03-21 15:17 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Arjan van de Ven, William Lee Irwin III,
	Christoph Hellwig, Ken Chen, linux-mm, linux-kernel

On Wed, 2007-03-21 at 15:18 +1100, Nick Piggin wrote:
> Adam Litke wrote:
> > Signed-off-by: Adam Litke <agl@us.ibm.com>
> > ---
> > 
> >  include/linux/mm.h |   25 +++++++++++++++++++++++++
> >  1 files changed, 25 insertions(+), 0 deletions(-)
> > 
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 60e0e4a..7089323 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -98,6 +98,7 @@ struct vm_area_struct {
> >  
> >  	/* Function pointers to deal with this struct. */
> >  	struct vm_operations_struct * vm_ops;
> > +	const struct pagetable_operations_struct * pagetable_ops;
> >  
> >  	/* Information about our backing store: */
> >  	unsigned long vm_pgoff;		/* Offset (within vm_file) in PAGE_SIZE
> 
> Can you remind me why this isn't in vm_ops?

We didn't want to bloat the size of the vm_ops struct for all of its
users.

> Also, it is going to be hugepage-only, isn't it? So should the naming be
> changed to reflect that? And #ifdef it...

They are doing some interesting things on Cell that could take advantage
of this.

> > @@ -218,6 +219,30 @@ struct vm_operations_struct {
> >  };
> >  
> >  struct mmu_gather;
> > +
> > +struct pagetable_operations_struct {
> > +	int (*fault)(struct mm_struct *mm,
> > +		struct vm_area_struct *vma,
> > +		unsigned long address, int write_access);
> 
> I got dibs on fault ;)
> 
> My callback is a sanitised one that basically abstracts the details of the
> virtual memory mapping away, so it is usable by drivers and filesystems.
> 
> You actually want to bypass the normal fault handling because it doesn't
> know how to deal with your virtual memory mapping. Hmm, the best suggestion
> I can come up with is handle_mm_fault... unless you can think of a better
> name for me to use.

How about I use handle_pte_fault?

-- 
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 1/7] Introduce the pagetable_operations and associated helper macros.
@ 2007-03-21 15:17       ` Adam Litke
  0 siblings, 0 replies; 82+ messages in thread
From: Adam Litke @ 2007-03-21 15:17 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Arjan van de Ven, William Lee Irwin III,
	Christoph Hellwig, Ken Chen, linux-mm, linux-kernel

On Wed, 2007-03-21 at 15:18 +1100, Nick Piggin wrote:
> Adam Litke wrote:
> > Signed-off-by: Adam Litke <agl@us.ibm.com>
> > ---
> > 
> >  include/linux/mm.h |   25 +++++++++++++++++++++++++
> >  1 files changed, 25 insertions(+), 0 deletions(-)
> > 
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 60e0e4a..7089323 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -98,6 +98,7 @@ struct vm_area_struct {
> >  
> >  	/* Function pointers to deal with this struct. */
> >  	struct vm_operations_struct * vm_ops;
> > +	const struct pagetable_operations_struct * pagetable_ops;
> >  
> >  	/* Information about our backing store: */
> >  	unsigned long vm_pgoff;		/* Offset (within vm_file) in PAGE_SIZE
> 
> Can you remind me why this isn't in vm_ops?

We didn't want to bloat the size of the vm_ops struct for all of its
users.

> Also, it is going to be hugepage-only, isn't it? So should the naming be
> changed to reflect that? And #ifdef it...

They are doing some interesting things on Cell that could take advantage
of this.

> > @@ -218,6 +219,30 @@ struct vm_operations_struct {
> >  };
> >  
> >  struct mmu_gather;
> > +
> > +struct pagetable_operations_struct {
> > +	int (*fault)(struct mm_struct *mm,
> > +		struct vm_area_struct *vma,
> > +		unsigned long address, int write_access);
> 
> I got dibs on fault ;)
> 
> My callback is a sanitised one that basically abstracts the details of the
> virtual memory mapping away, so it is usable by drivers and filesystems.
> 
> You actually want to bypass the normal fault handling because it doesn't
> know how to deal with your virtual memory mapping. Hmm, the best suggestion
> I can come up with is handle_mm_fault... unless you can think of a better
> name for me to use.

How about I use handle_pte_fault?

-- 
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 0/7] [RFC] hugetlb: pagetable_operations API (V2)
  2007-03-19 20:05 ` Adam Litke
@ 2007-03-21 15:55   ` Hugh Dickins
  -1 siblings, 0 replies; 82+ messages in thread
From: Hugh Dickins @ 2007-03-21 15:55 UTC (permalink / raw)
  To: Adam Litke
  Cc: Andrew Morton, Arjan van de Ven, William Lee Irwin III,
	Christoph Hellwig, Ken Chen, linux-mm, linux-kernel

On Mon, 19 Mar 2007, Adam Litke wrote:
> Andrew, given the favorable review of these patches the last time around, would
> you consider them for the -mm tree?  Does anyone else have any objections?

I quite fail to understand the enthusiasm for these patches.  All they
do is make the already ugly interfaces to hugetlb more obscure than at
present, and open the door to even uglier stuff later.  Don't you need
to wait for at least one other user of these interfaces to emerge,
to get a better idea of whether they're appropriate?

Hugh

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 0/7] [RFC] hugetlb: pagetable_operations API (V2)
@ 2007-03-21 15:55   ` Hugh Dickins
  0 siblings, 0 replies; 82+ messages in thread
From: Hugh Dickins @ 2007-03-21 15:55 UTC (permalink / raw)
  To: Adam Litke
  Cc: Andrew Morton, Arjan van de Ven, William Lee Irwin III,
	Christoph Hellwig, Ken Chen, linux-mm, linux-kernel

On Mon, 19 Mar 2007, Adam Litke wrote:
> Andrew, given the favorable review of these patches the last time around, would
> you consider them for the -mm tree?  Does anyone else have any objections?

I quite fail to understand the enthusiasm for these patches.  All they
do is make the already ugly interfaces to hugetlb more obscure than at
present, and open the door to even uglier stuff later.  Don't you need
to wait for at least one other user of these interfaces to emerge,
to get a better idea of whether they're appropriate?

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 1/7] Introduce the pagetable_operations and associated helper macros.
  2007-03-21 15:17       ` Adam Litke
@ 2007-03-21 16:00         ` Christoph Hellwig
  -1 siblings, 0 replies; 82+ messages in thread
From: Christoph Hellwig @ 2007-03-21 16:00 UTC (permalink / raw)
  To: Adam Litke
  Cc: Nick Piggin, Andrew Morton, Arjan van de Ven,
	William Lee Irwin III, Christoph Hellwig, Ken Chen, linux-mm,
	linux-kernel

On Wed, Mar 21, 2007 at 10:17:40AM -0500, Adam Litke wrote:
> > Also, it is going to be hugepage-only, isn't it? So should the naming be
> > changed to reflect that? And #ifdef it...
> 
> They are doing some interesting things on Cell that could take advantage
> of this.

That would be new to me.  What we need on Cell is fixing up the
get_unmapped_area mess which Ben is working on now.

And let me once again repeat that I don't like this at all.  I'll
rather have a few ugly ifdefs in strategic places than a big object
oriented mess like this with just a single user.


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 1/7] Introduce the pagetable_operations and associated helper macros.
@ 2007-03-21 16:00         ` Christoph Hellwig
  0 siblings, 0 replies; 82+ messages in thread
From: Christoph Hellwig @ 2007-03-21 16:00 UTC (permalink / raw)
  To: Adam Litke
  Cc: Nick Piggin, Andrew Morton, Arjan van de Ven,
	William Lee Irwin III, Christoph Hellwig, Ken Chen, linux-mm,
	linux-kernel

On Wed, Mar 21, 2007 at 10:17:40AM -0500, Adam Litke wrote:
> > Also, it is going to be hugepage-only, isn't it? So should the naming be
> > changed to reflect that? And #ifdef it...
> 
> They are doing some interesting things on Cell that could take advantage
> of this.

That would be new to me.  What we need on Cell is fixing up the
get_unmapped_area mess which Ben is working on now.

And let me once again repeat that I don't like this at all.  I'll
rather have a few ugly ifdefs in strategic places than a big object
oriented mess like this with just a single user.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 0/7] [RFC] hugetlb: pagetable_operations API (V2)
  2007-03-21 15:55   ` Hugh Dickins
@ 2007-03-21 16:01     ` Christoph Hellwig
  -1 siblings, 0 replies; 82+ messages in thread
From: Christoph Hellwig @ 2007-03-21 16:01 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Adam Litke, Andrew Morton, Arjan van de Ven,
	William Lee Irwin III, Christoph Hellwig, Ken Chen, linux-mm,
	linux-kernel

On Wed, Mar 21, 2007 at 03:55:54PM +0000, Hugh Dickins wrote:
> On Mon, 19 Mar 2007, Adam Litke wrote:
> > Andrew, given the favorable review of these patches the last time around, would
> > you consider them for the -mm tree?  Does anyone else have any objections?
> 
> I quite fail to understand the enthusiasm for these patches.  All they
> do is make the already ugly interfaces to hugetlb more obscure than at
> present, and open the door to even uglier stuff later.  Don't you need
> to wait for at least one other user of these interfaces to emerge,
> to get a better idea of whether they're appropriate?

*nod*

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 0/7] [RFC] hugetlb: pagetable_operations API (V2)
@ 2007-03-21 16:01     ` Christoph Hellwig
  0 siblings, 0 replies; 82+ messages in thread
From: Christoph Hellwig @ 2007-03-21 16:01 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Adam Litke, Andrew Morton, Arjan van de Ven,
	William Lee Irwin III, Christoph Hellwig, Ken Chen, linux-mm,
	linux-kernel

On Wed, Mar 21, 2007 at 03:55:54PM +0000, Hugh Dickins wrote:
> On Mon, 19 Mar 2007, Adam Litke wrote:
> > Andrew, given the favorable review of these patches the last time around, would
> > you consider them for the -mm tree?  Does anyone else have any objections?
> 
> I quite fail to understand the enthusiasm for these patches.  All they
> do is make the already ugly interfaces to hugetlb more obscure than at
> present, and open the door to even uglier stuff later.  Don't you need
> to wait for at least one other user of these interfaces to emerge,
> to get a better idea of whether they're appropriate?

*nod*

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 0/7] [RFC] hugetlb: pagetable_operations API (V2)
  2007-03-21 15:55   ` Hugh Dickins
  (?)
  (?)
@ 2007-03-21 16:23   ` William Lee Irwin III
  2007-03-21 17:08     ` Hugh Dickins
  -1 siblings, 1 reply; 82+ messages in thread
From: William Lee Irwin III @ 2007-03-21 16:23 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Adam Litke, Andrew Morton, Arjan van de Ven, Christoph Hellwig,
	Ken Chen, linux-mm, linux-kernel

On Mon, 19 Mar 2007, Adam Litke wrote:
>> Andrew, given the favorable review of these patches the last time
>> around, would you consider them for the -mm tree?  Does anyone else
>> have any objections?

On Wed, Mar 21, 2007 at 03:55:54PM +0000, Hugh Dickins wrote:
> I quite fail to understand the enthusiasm for these patches.  All they
> do is make the already ugly interfaces to hugetlb more obscure than at
> present, and open the door to even uglier stuff later.  Don't you need
> to wait for at least one other user of these interfaces to emerge,
> to get a better idea of whether they're appropriate?

The lack of an interface of this sort has essentially blocked the
development of some of them.

What sort of uglier stuff are you concerned about this enabling? My
wild guess is precisely the prospective users in my queue of features
to implement that I've neglected on account of the lack of such an
interface. It might be a good idea for me to take whatever distaste for
them exists into account before belting out the code for them.


-- wli

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 0/7] [RFC] hugetlb: pagetable_operations API (V2)
  2007-03-21 16:23   ` William Lee Irwin III
@ 2007-03-21 17:08     ` Hugh Dickins
  2007-03-21 17:42       ` William Lee Irwin III
  0 siblings, 1 reply; 82+ messages in thread
From: Hugh Dickins @ 2007-03-21 17:08 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Adam Litke, Andrew Morton, Arjan van de Ven, Christoph Hellwig,
	Ken Chen, linux-mm, linux-kernel

On Wed, 21 Mar 2007, William Lee Irwin III wrote:
> 
> What sort of uglier stuff are you concerned about this enabling? My
> wild guess is precisely the prospective users in my queue of features
> to implement that I've neglected on account of the lack of such an
> interface. It might be a good idea for me to take whatever distaste for
> them exists into account before belting out the code for them.

I'll probably hate it whatever it is, don't worry about me ;)  I'm
rather weary of all the grand mm schemes people are toting these days.

I guess my distaste is for letting a door open, for all kinds of
drivers (or odd corners of architectures etc) to take control of
the mm in ways we've never anticipated, and which we'll forever
after be stumbling over.

It would be a good idea for you to belt out the code for a few of them,
to give everyone else an idea of what's to be let in through this door:
I haven't the faintest idea what's envisaged.

Seeing what you have, maybe everyone will react that of course
Adam's page table ops are the way to go, or maybe the reverse.

I would be rather surprised if the adhoc collection of divergences
you found necessary for hugetlb turn out to have a fundamental
applicability.  Perhaps we'll need subvert_the_core_ops.

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 0/7] [RFC] hugetlb: pagetable_operations API (V2)
  2007-03-21 17:08     ` Hugh Dickins
@ 2007-03-21 17:42       ` William Lee Irwin III
  0 siblings, 0 replies; 82+ messages in thread
From: William Lee Irwin III @ 2007-03-21 17:42 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Adam Litke, Andrew Morton, Arjan van de Ven, Christoph Hellwig,
	Ken Chen, linux-mm, linux-kernel

On Wed, 21 Mar 2007, William Lee Irwin III wrote:
>> What sort of uglier stuff are you concerned about this enabling? My
>> wild guess is precisely the prospective users in my queue of features
>> to implement that I've neglected on account of the lack of such an
>> interface. It might be a good idea for me to take whatever distaste for
>> them exists into account before belting out the code for them.

On Wed, Mar 21, 2007 at 05:08:00PM +0000, Hugh Dickins wrote:
> I'll probably hate it whatever it is, don't worry about me ;)  I'm
> rather weary of all the grand mm schemes people are toting these days.

Breathe a sigh of relief; I've retired from the world of grand schemes.
That much should be clear from my patch output over the past few years.
A driver in its own corner of the tree wouldn't have been so ambitious.


On Wed, Mar 21, 2007 at 05:08:00PM +0000, Hugh Dickins wrote:
> I guess my distaste is for letting a door open, for all kinds of
> drivers (or odd corners of architectures etc) to take control of
> the mm in ways we've never anticipated, and which we'll forever
> after be stumbling over.
> It would be a good idea for you to belt out the code for a few of them,
> to give everyone else an idea of what's to be let in through this door:
> I haven't the faintest idea what's envisaged.

After consulting with Adam Litke it appears I should remain silent on
this front.


On Wed, Mar 21, 2007 at 05:08:00PM +0000, Hugh Dickins wrote:
> Seeing what you have, maybe everyone will react that of course
> Adam's page table ops are the way to go, or maybe the reverse.
> I would be rather surprised if the adhoc collection of divergences
> you found necessary for hugetlb turn out to have a fundamental
> applicability.  Perhaps we'll need subvert_the_core_ops.

The fact something can be done with them doesn't mean it has to be
done. At least in some corners of industry, the mere opportunity
to do things considered distasteful doesn't outweigh courtesy. Users
doing distasteful things need not be merged, nor shipped or supported,
regardless of what infrastructure they could utilize.


-- wli

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* pagetable_ops: Hugetlb character device example
  2007-03-19 20:05 ` Adam Litke
@ 2007-03-21 19:43   ` Adam Litke
  -1 siblings, 0 replies; 82+ messages in thread
From: Adam Litke @ 2007-03-21 19:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Arjan van de Ven, William Lee Irwin III, Christoph Hellwig,
	Ken Chen, linux-mm, linux-kernel

The main reason I am advocating a set of pagetable_operations is to
enable the development of a new hugetlb interface.  During the hugetlb
BOFS at OLS last year, we talked about a character device that would
behave like /dev/zero.  Many of the people were talking about how they
just wanted to create MAP_PRIVATE hugetlb mappings without all the fuss
about the hugetlbfs filesystem.  /dev/zero is a familiar interface for
getting anonymous memory so bringing that model to huge pages would make
programming for anonymous huge pages easier.

The pagetable_operations API opens up possibilities to do some
additional (and completely sane) things.  For example, I have a patch
that alters the character device code below to make use of a hugetlb
ZERO_PAGE.  This eliminates almost all the up-front fault time, allowing
pages to be COW'ed only when first written to.  We cannot do things like
this with hugetlbfs anymore because we have a set of complex semantics
to preserve.

The following patch is an example of what a simple pagetable_operations
consumer could look like.  It does depend on some other cleanups I am
working on (removal of is_file_hugepages(), ...hugetlbfs/inode.c vs.
mm/hugetlb.c separation, etc).  So it is unlikely to apply to any trees
you may have.  I do think it makes a useful illustration of what
legitimate things can be done with a pagetable_operations interface.

commit be72df1c616fb662693a8d4410ce3058f20c71f3
Author: Adam Litke <agl@us.ibm.com>
Date:   Tue Feb 13 14:18:21 2007 -0800

diff --git a/drivers/char/Makefile b/drivers/char/Makefile
index fc11063..c5e755b 100644
--- a/drivers/char/Makefile
+++ b/drivers/char/Makefile
@@ -100,6 +100,7 @@ obj-$(CONFIG_IPMI_HANDLER)	+= ipmi/
 
 obj-$(CONFIG_HANGCHECK_TIMER)	+= hangcheck-timer.o
 obj-$(CONFIG_TCG_TPM)		+= tpm/
+obj-$(CONFIG_HUGETLB_PAGE)	+= page.o
 
 # Files generated that shall be removed upon make clean
 clean-files := consolemap_deftbl.c defkeymap.c
diff --git a/drivers/char/page.c b/drivers/char/page.c
new file mode 100644
index 0000000..e903028
--- /dev/null
+++ b/drivers/char/page.c
@@ -0,0 +1,133 @@
+#include <linux/mm.h>
+#include <linux/mman.h>
+#include <linux/init.h>
+#include <linux/device.h>
+#include <linux/fs.h>
+#include <linux/pagemap.h>
+#include <linux/hugetlb.h>
+
+static const struct {
+	unsigned int    minor;
+	char            *name;
+	umode_t         mode;
+} devlist[] = {
+	{1, "page-huge", S_IRUGO | S_IWUGO},
+};
+
+static struct page *page_nopage(struct vm_area_struct *vma,
+			unsigned long address, int *unused)
+{
+	BUG();
+	return NULL;
+}
+
+static struct vm_operations_struct page_vm_ops = {
+	.nopage	= page_nopage,
+};
+
+static int page_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+			unsigned long address, int write_access)
+{
+	pte_t *ptep;
+	pte_t entry, new_entry;
+	int ret;
+	static DEFINE_MUTEX(hugetlb_instantiation_mutex);
+
+	ptep = huge_pte_alloc(mm, address);
+	if (!ptep)
+		return VM_FAULT_OOM;
+
+	mutex_lock(&hugetlb_instantiation_mutex);
+	entry = *ptep;
+	if (pte_none(entry)) {
+		struct page *page;
+
+		page = alloc_huge_page(vma, address);
+		if (!page)
+			return VM_FAULT_OOM;
+		clear_huge_page(page, address);
+
+		ret = VM_FAULT_MINOR;
+		spin_lock(&mm->page_table_lock);
+		if (!pte_none(*ptep))
+			goto out;
+		add_mm_counter(mm, file_rss, HPAGE_SIZE / PAGE_SIZE);
+		new_entry = make_huge_pte(vma, page, 0);
+		set_huge_pte_at(mm, address, ptep, new_entry);
+		goto out;
+	}
+
+	spin_lock(&mm->page_table_lock);
+	/* Check for a racing update before calling hugetlb_cow */
+	if (likely(pte_same(entry, *ptep)))
+		if (write_access && !pte_write(entry))
+			ret = hugetlb_cow(mm, vma, address, ptep, entry);
+
+out:
+	spin_unlock(&mm->page_table_lock);
+	mutex_unlock(&hugetlb_instantiation_mutex);
+	return ret;
+}
+
+
+static struct pagetable_operations_struct page_pagetable_ops = {
+	.copy_vma		= copy_hugetlb_page_range,
+	.pin_pages		= follow_hugetlb_page,
+	.unmap_page_range	= unmap_hugepage_range,
+	.change_protection	= hugetlb_change_protection,
+	.free_pgtable_range	= hugetlb_free_pgd_range,
+	.fault			= page_fault,
+};
+
+static int page_mmap(struct file * file, struct vm_area_struct *vma)
+{
+	if (vma->vm_flags & VM_SHARED)
+		return -EINVAL;
+
+	if (vma->vm_pgoff)
+		return -EINVAL;
+
+	if (vma->vm_start & ~HPAGE_MASK)
+		return -EINVAL;
+
+	if (vma->vm_end & ~HPAGE_MASK)
+		return -EINVAL;
+
+	if (vma->vm_end - vma->vm_start < HPAGE_SIZE)
+		return -EINVAL;
+
+	vma->vm_flags |= (VM_HUGETLB | VM_RESERVED);
+	vma->vm_ops = &page_vm_ops;
+	vma->pagetable_ops = &page_pagetable_ops;
+
+	return 0;
+}
+
+const struct file_operations page_file_operations = {
+	.mmap			= page_mmap,
+	.get_unmapped_area	= hugetlb_get_unmapped_area,
+	.prepare_unmapped_area	= prepare_hugepage_range,
+};
+
+static struct class *page_class;
+
+static int __init chr_dev_init(void)
+{
+	int major, i;
+
+	printk("Initializing page devices...");
+	major = register_chrdev(0, "page", &page_file_operations);
+	if (major <= 0)
+		printk("failed\n");
+	else
+		printk("(%i:0)\n", major);
+
+	page_class = class_create(THIS_MODULE, "page");
+	for (i = 0; i < ARRAY_SIZE(devlist); i++)
+		class_device_create(page_class, NULL,
+			MKDEV(major, devlist[i].minor),
+			NULL, devlist[i].name);
+
+	return 0;
+}
+fs_initcall(chr_dev_init);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 4fc0bca..edd4944 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -590,6 +590,7 @@ int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	BUG_ON(!has_pt_op(vma, fault));
 
+	BUG_ON(!has_pt_op(vma,fault));
 	spin_lock(&mm->page_table_lock);
 	while (vaddr < vma->vm_end && remainder) {
 		pte_t *pte;

-- 
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* pagetable_ops: Hugetlb character device example
@ 2007-03-21 19:43   ` Adam Litke
  0 siblings, 0 replies; 82+ messages in thread
From: Adam Litke @ 2007-03-21 19:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Arjan van de Ven, William Lee Irwin III, Christoph Hellwig,
	Ken Chen, linux-mm, linux-kernel

The main reason I am advocating a set of pagetable_operations is to
enable the development of a new hugetlb interface.  During the hugetlb
BOFS at OLS last year, we talked about a character device that would
behave like /dev/zero.  Many of the people were talking about how they
just wanted to create MAP_PRIVATE hugetlb mappings without all the fuss
about the hugetlbfs filesystem.  /dev/zero is a familiar interface for
getting anonymous memory so bringing that model to huge pages would make
programming for anonymous huge pages easier.

The pagetable_operations API opens up possibilities to do some
additional (and completely sane) things.  For example, I have a patch
that alters the character device code below to make use of a hugetlb
ZERO_PAGE.  This eliminates almost all the up-front fault time, allowing
pages to be COW'ed only when first written to.  We cannot do things like
this with hugetlbfs anymore because we have a set of complex semantics
to preserve.

The following patch is an example of what a simple pagetable_operations
consumer could look like.  It does depend on some other cleanups I am
working on (removal of is_file_hugepages(), ...hugetlbfs/inode.c vs.
mm/hugetlb.c separation, etc).  So it is unlikely to apply to any trees
you may have.  I do think it makes a useful illustration of what
legitimate things can be done with a pagetable_operations interface.

commit be72df1c616fb662693a8d4410ce3058f20c71f3
Author: Adam Litke <agl@us.ibm.com>
Date:   Tue Feb 13 14:18:21 2007 -0800

diff --git a/drivers/char/Makefile b/drivers/char/Makefile
index fc11063..c5e755b 100644
--- a/drivers/char/Makefile
+++ b/drivers/char/Makefile
@@ -100,6 +100,7 @@ obj-$(CONFIG_IPMI_HANDLER)	+= ipmi/
 
 obj-$(CONFIG_HANGCHECK_TIMER)	+= hangcheck-timer.o
 obj-$(CONFIG_TCG_TPM)		+= tpm/
+obj-$(CONFIG_HUGETLB_PAGE)	+= page.o
 
 # Files generated that shall be removed upon make clean
 clean-files := consolemap_deftbl.c defkeymap.c
diff --git a/drivers/char/page.c b/drivers/char/page.c
new file mode 100644
index 0000000..e903028
--- /dev/null
+++ b/drivers/char/page.c
@@ -0,0 +1,133 @@
+#include <linux/mm.h>
+#include <linux/mman.h>
+#include <linux/init.h>
+#include <linux/device.h>
+#include <linux/fs.h>
+#include <linux/pagemap.h>
+#include <linux/hugetlb.h>
+
+static const struct {
+	unsigned int    minor;
+	char            *name;
+	umode_t         mode;
+} devlist[] = {
+	{1, "page-huge", S_IRUGO | S_IWUGO},
+};
+
+static struct page *page_nopage(struct vm_area_struct *vma,
+			unsigned long address, int *unused)
+{
+	BUG();
+	return NULL;
+}
+
+static struct vm_operations_struct page_vm_ops = {
+	.nopage	= page_nopage,
+};
+
+static int page_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+			unsigned long address, int write_access)
+{
+	pte_t *ptep;
+	pte_t entry, new_entry;
+	int ret;
+	static DEFINE_MUTEX(hugetlb_instantiation_mutex);
+
+	ptep = huge_pte_alloc(mm, address);
+	if (!ptep)
+		return VM_FAULT_OOM;
+
+	mutex_lock(&hugetlb_instantiation_mutex);
+	entry = *ptep;
+	if (pte_none(entry)) {
+		struct page *page;
+
+		page = alloc_huge_page(vma, address);
+		if (!page)
+			return VM_FAULT_OOM;
+		clear_huge_page(page, address);
+
+		ret = VM_FAULT_MINOR;
+		spin_lock(&mm->page_table_lock);
+		if (!pte_none(*ptep))
+			goto out;
+		add_mm_counter(mm, file_rss, HPAGE_SIZE / PAGE_SIZE);
+		new_entry = make_huge_pte(vma, page, 0);
+		set_huge_pte_at(mm, address, ptep, new_entry);
+		goto out;
+	}
+
+	spin_lock(&mm->page_table_lock);
+	/* Check for a racing update before calling hugetlb_cow */
+	if (likely(pte_same(entry, *ptep)))
+		if (write_access && !pte_write(entry))
+			ret = hugetlb_cow(mm, vma, address, ptep, entry);
+
+out:
+	spin_unlock(&mm->page_table_lock);
+	mutex_unlock(&hugetlb_instantiation_mutex);
+	return ret;
+}
+
+
+static struct pagetable_operations_struct page_pagetable_ops = {
+	.copy_vma		= copy_hugetlb_page_range,
+	.pin_pages		= follow_hugetlb_page,
+	.unmap_page_range	= unmap_hugepage_range,
+	.change_protection	= hugetlb_change_protection,
+	.free_pgtable_range	= hugetlb_free_pgd_range,
+	.fault			= page_fault,
+};
+
+static int page_mmap(struct file * file, struct vm_area_struct *vma)
+{
+	if (vma->vm_flags & VM_SHARED)
+		return -EINVAL;
+
+	if (vma->vm_pgoff)
+		return -EINVAL;
+
+	if (vma->vm_start & ~HPAGE_MASK)
+		return -EINVAL;
+
+	if (vma->vm_end & ~HPAGE_MASK)
+		return -EINVAL;
+
+	if (vma->vm_end - vma->vm_start < HPAGE_SIZE)
+		return -EINVAL;
+
+	vma->vm_flags |= (VM_HUGETLB | VM_RESERVED);
+	vma->vm_ops = &page_vm_ops;
+	vma->pagetable_ops = &page_pagetable_ops;
+
+	return 0;
+}
+
+const struct file_operations page_file_operations = {
+	.mmap			= page_mmap,
+	.get_unmapped_area	= hugetlb_get_unmapped_area,
+	.prepare_unmapped_area	= prepare_hugepage_range,
+};
+
+static struct class *page_class;
+
+static int __init chr_dev_init(void)
+{
+	int major, i;
+
+	printk("Initializing page devices...");
+	major = register_chrdev(0, "page", &page_file_operations);
+	if (major <= 0)
+		printk("failed\n");
+	else
+		printk("(%i:0)\n", major);
+
+	page_class = class_create(THIS_MODULE, "page");
+	for (i = 0; i < ARRAY_SIZE(devlist); i++)
+		class_device_create(page_class, NULL,
+			MKDEV(major, devlist[i].minor),
+			NULL, devlist[i].name);
+
+	return 0;
+}
+fs_initcall(chr_dev_init);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 4fc0bca..edd4944 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -590,6 +590,7 @@ int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	BUG_ON(!has_pt_op(vma, fault));
 
+	BUG_ON(!has_pt_op(vma,fault));
 	spin_lock(&mm->page_table_lock);
 	while (vaddr < vma->vm_end && remainder) {
 		pte_t *pte;

-- 
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* Re: pagetable_ops: Hugetlb character device example
  2007-03-21 19:43   ` Adam Litke
  (?)
@ 2007-03-21 19:51   ` Valdis.Kletnieks
  2007-03-21 20:26       ` Adam Litke
  2007-03-21 22:26       ` William Lee Irwin III
  -1 siblings, 2 replies; 82+ messages in thread
From: Valdis.Kletnieks @ 2007-03-21 19:51 UTC (permalink / raw)
  To: Adam Litke
  Cc: Andrew Morton, Arjan van de Ven, William Lee Irwin III,
	Christoph Hellwig, Ken Chen, linux-mm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 229 bytes --]

On Wed, 21 Mar 2007 14:43:48 CDT, Adam Litke said:
> The main reason I am advocating a set of pagetable_operations is to
> enable the development of a new hugetlb interface.

Do you have an exit strategy for the *old* interface?

[-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --]

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: pagetable_ops: Hugetlb character device example
  2007-03-21 19:51   ` Valdis.Kletnieks
@ 2007-03-21 20:26       ` Adam Litke
  2007-03-21 22:26       ` William Lee Irwin III
  1 sibling, 0 replies; 82+ messages in thread
From: Adam Litke @ 2007-03-21 20:26 UTC (permalink / raw)
  To: Valdis.Kletnieks
  Cc: Andrew Morton, Arjan van de Ven, William Lee Irwin III,
	Christoph Hellwig, Ken Chen, linux-mm, linux-kernel

On Wed, 2007-03-21 at 15:51 -0400, Valdis.Kletnieks@vt.edu wrote:
> On Wed, 21 Mar 2007 14:43:48 CDT, Adam Litke said:
> > The main reason I am advocating a set of pagetable_operations is to
> > enable the development of a new hugetlb interface.
> 
> Do you have an exit strategy for the *old* interface?

Not really.  Hugetlbfs needs to be kept around for a number of reasons.
It was designed to support MAP_SHARED mappings and IPC shm segments.  It
is probably still the best interface for those jobs.  Of course
hugetlbfs has lots of users so we must preserve the interface for them.

But... once hugetlbfs is abstracted behind pagetable_operations, you
would have the option of configuring it out of the kernel without losing
access to huge pages by other means (such as the character device).

-- 
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: pagetable_ops: Hugetlb character device example
@ 2007-03-21 20:26       ` Adam Litke
  0 siblings, 0 replies; 82+ messages in thread
From: Adam Litke @ 2007-03-21 20:26 UTC (permalink / raw)
  To: Valdis.Kletnieks
  Cc: Andrew Morton, Arjan van de Ven, William Lee Irwin III,
	Christoph Hellwig, Ken Chen, linux-mm, linux-kernel

On Wed, 2007-03-21 at 15:51 -0400, Valdis.Kletnieks@vt.edu wrote:
> On Wed, 21 Mar 2007 14:43:48 CDT, Adam Litke said:
> > The main reason I am advocating a set of pagetable_operations is to
> > enable the development of a new hugetlb interface.
> 
> Do you have an exit strategy for the *old* interface?

Not really.  Hugetlbfs needs to be kept around for a number of reasons.
It was designed to support MAP_SHARED mappings and IPC shm segments.  It
is probably still the best interface for those jobs.  Of course
hugetlbfs has lots of users so we must preserve the interface for them.

But... once hugetlbfs is abstracted behind pagetable_operations, you
would have the option of configuring it out of the kernel without losing
access to huge pages by other means (such as the character device).

-- 
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: pagetable_ops: Hugetlb character device example
  2007-03-21 19:51   ` Valdis.Kletnieks
@ 2007-03-21 22:26       ` William Lee Irwin III
  2007-03-21 22:26       ` William Lee Irwin III
  1 sibling, 0 replies; 82+ messages in thread
From: William Lee Irwin III @ 2007-03-21 22:26 UTC (permalink / raw)
  To: Valdis.Kletnieks
  Cc: Adam Litke, Andrew Morton, Arjan van de Ven, Christoph Hellwig,
	Ken Chen, linux-mm, linux-kernel

On Wed, 21 Mar 2007 14:43:48 CDT, Adam Litke said:
>> The main reason I am advocating a set of pagetable_operations is to
>> enable the development of a new hugetlb interface.

On Wed, Mar 21, 2007 at 03:51:31PM -0400, Valdis.Kletnieks@vt.edu wrote:
> Do you have an exit strategy for the *old* interface?

Hello.

My exit strategy was to make hugetlbfs an alias for ramfs when ramfs
acquired the necessary functionality until expand-on-mmap() was merged.
That would've allowed rm -rf fs/hugetlbfs/ outright. A compatibility
wrapper for expand-on-mmap() around ramfs once ramfs acquires the
necessary functionality is now the exit strategy.

Given current opinions on general multiple pagesize support, by means of
which the ramfs functionality is/was intended to be implemented, that
time may well be "never."

Character device analogues of /dev/zero are not replacements for the
filesystem. Few or no transitions of existing users to such are
possible. It primarily enables new users who really need anonymous
hugetlb, such as numerical applications. The need for a filesystem
namespace and persisting across process creation and destruction will
not be eliminated by character devices.


-- wli

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: pagetable_ops: Hugetlb character device example
@ 2007-03-21 22:26       ` William Lee Irwin III
  0 siblings, 0 replies; 82+ messages in thread
From: William Lee Irwin III @ 2007-03-21 22:26 UTC (permalink / raw)
  To: Valdis.Kletnieks
  Cc: Adam Litke, Andrew Morton, Arjan van de Ven, Christoph Hellwig,
	Ken Chen, linux-mm, linux-kernel

On Wed, 21 Mar 2007 14:43:48 CDT, Adam Litke said:
>> The main reason I am advocating a set of pagetable_operations is to
>> enable the development of a new hugetlb interface.

On Wed, Mar 21, 2007 at 03:51:31PM -0400, Valdis.Kletnieks@vt.edu wrote:
> Do you have an exit strategy for the *old* interface?

Hello.

My exit strategy was to make hugetlbfs an alias for ramfs when ramfs
acquired the necessary functionality until expand-on-mmap() was merged.
That would've allowed rm -rf fs/hugetlbfs/ outright. A compatibility
wrapper for expand-on-mmap() around ramfs once ramfs acquires the
necessary functionality is now the exit strategy.

Given current opinions on general multiple pagesize support, by means of
which the ramfs functionality is/was intended to be implemented, that
time may well be "never."

Character device analogues of /dev/zero are not replacements for the
filesystem. Few or no transitions of existing users to such are
possible. It primarily enables new users who really need anonymous
hugetlb, such as numerical applications. The need for a filesystem
namespace and persisting across process creation and destruction will
not be eliminated by character devices.


-- wli

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: pagetable_ops: Hugetlb character device example
  2007-03-21 22:26       ` William Lee Irwin III
@ 2007-03-21 22:53         ` Matt Mackall
  -1 siblings, 0 replies; 82+ messages in thread
From: Matt Mackall @ 2007-03-21 22:53 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Valdis.Kletnieks, Adam Litke, Andrew Morton, Arjan van de Ven,
	Christoph Hellwig, Ken Chen, linux-mm, linux-kernel

On Wed, Mar 21, 2007 at 03:26:59PM -0700, William Lee Irwin III wrote:
> On Wed, 21 Mar 2007 14:43:48 CDT, Adam Litke said:
> >> The main reason I am advocating a set of pagetable_operations is to
> >> enable the development of a new hugetlb interface.
> 
> On Wed, Mar 21, 2007 at 03:51:31PM -0400, Valdis.Kletnieks@vt.edu wrote:
> > Do you have an exit strategy for the *old* interface?
> 
> Hello.
> 
> My exit strategy was to make hugetlbfs an alias for ramfs when ramfs
> acquired the necessary functionality until expand-on-mmap() was merged.
> That would've allowed rm -rf fs/hugetlbfs/ outright. A compatibility
> wrapper for expand-on-mmap() around ramfs once ramfs acquires the
> necessary functionality is now the exit strategy.

Can you describe what ramfs needs here in a bit more detail?

If it's non-trivial, I'd rather see any new functionality go into
shmfs/tmpfs, as ramfs has done a good job at staying a minimal fs thus
far.

-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: pagetable_ops: Hugetlb character device example
@ 2007-03-21 22:53         ` Matt Mackall
  0 siblings, 0 replies; 82+ messages in thread
From: Matt Mackall @ 2007-03-21 22:53 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Valdis.Kletnieks, Adam Litke, Andrew Morton, Arjan van de Ven,
	Christoph Hellwig, Ken Chen, linux-mm, linux-kernel

On Wed, Mar 21, 2007 at 03:26:59PM -0700, William Lee Irwin III wrote:
> On Wed, 21 Mar 2007 14:43:48 CDT, Adam Litke said:
> >> The main reason I am advocating a set of pagetable_operations is to
> >> enable the development of a new hugetlb interface.
> 
> On Wed, Mar 21, 2007 at 03:51:31PM -0400, Valdis.Kletnieks@vt.edu wrote:
> > Do you have an exit strategy for the *old* interface?
> 
> Hello.
> 
> My exit strategy was to make hugetlbfs an alias for ramfs when ramfs
> acquired the necessary functionality until expand-on-mmap() was merged.
> That would've allowed rm -rf fs/hugetlbfs/ outright. A compatibility
> wrapper for expand-on-mmap() around ramfs once ramfs acquires the
> necessary functionality is now the exit strategy.

Can you describe what ramfs needs here in a bit more detail?

If it's non-trivial, I'd rather see any new functionality go into
shmfs/tmpfs, as ramfs has done a good job at staying a minimal fs thus
far.

-- 
Mathematics is the supreme nostalgia of our time.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 1/7] Introduce the pagetable_operations and associated helper macros.
  2007-03-21 15:17       ` Adam Litke
@ 2007-03-21 23:02         ` Nick Piggin
  -1 siblings, 0 replies; 82+ messages in thread
From: Nick Piggin @ 2007-03-21 23:02 UTC (permalink / raw)
  To: Adam Litke
  Cc: Andrew Morton, Arjan van de Ven, William Lee Irwin III,
	Christoph Hellwig, Ken Chen, linux-mm, linux-kernel

Adam Litke wrote:
> On Wed, 2007-03-21 at 15:18 +1100, Nick Piggin wrote:
> 
>>Adam Litke wrote:

>>>diff --git a/include/linux/mm.h b/include/linux/mm.h
>>>index 60e0e4a..7089323 100644
>>>--- a/include/linux/mm.h
>>>+++ b/include/linux/mm.h
>>>@@ -98,6 +98,7 @@ struct vm_area_struct {
>>> 
>>> 	/* Function pointers to deal with this struct. */
>>> 	struct vm_operations_struct * vm_ops;
>>>+	const struct pagetable_operations_struct * pagetable_ops;
>>> 
>>> 	/* Information about our backing store: */
>>> 	unsigned long vm_pgoff;		/* Offset (within vm_file) in PAGE_SIZE
>>
>>Can you remind me why this isn't in vm_ops?
> 
> 
> We didn't want to bloat the size of the vm_ops struct for all of its
> users.

But vmas are surely far more numerous than vm_ops, aren't they?

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 1/7] Introduce the pagetable_operations and associated helper macros.
@ 2007-03-21 23:02         ` Nick Piggin
  0 siblings, 0 replies; 82+ messages in thread
From: Nick Piggin @ 2007-03-21 23:02 UTC (permalink / raw)
  To: Adam Litke
  Cc: Andrew Morton, Arjan van de Ven, William Lee Irwin III,
	Christoph Hellwig, Ken Chen, linux-mm, linux-kernel

Adam Litke wrote:
> On Wed, 2007-03-21 at 15:18 +1100, Nick Piggin wrote:
> 
>>Adam Litke wrote:

>>>diff --git a/include/linux/mm.h b/include/linux/mm.h
>>>index 60e0e4a..7089323 100644
>>>--- a/include/linux/mm.h
>>>+++ b/include/linux/mm.h
>>>@@ -98,6 +98,7 @@ struct vm_area_struct {
>>> 
>>> 	/* Function pointers to deal with this struct. */
>>> 	struct vm_operations_struct * vm_ops;
>>>+	const struct pagetable_operations_struct * pagetable_ops;
>>> 
>>> 	/* Information about our backing store: */
>>> 	unsigned long vm_pgoff;		/* Offset (within vm_file) in PAGE_SIZE
>>
>>Can you remind me why this isn't in vm_ops?
> 
> 
> We didn't want to bloat the size of the vm_ops struct for all of its
> users.

But vmas are surely far more numerous than vm_ops, aren't they?

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 1/7] Introduce the pagetable_operations and associated helper macros.
  2007-03-21 16:00         ` Christoph Hellwig
@ 2007-03-21 23:03           ` Nick Piggin
  -1 siblings, 0 replies; 82+ messages in thread
From: Nick Piggin @ 2007-03-21 23:03 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Adam Litke, Andrew Morton, Arjan van de Ven,
	William Lee Irwin III, Ken Chen, linux-mm, linux-kernel

Christoph Hellwig wrote:
> On Wed, Mar 21, 2007 at 10:17:40AM -0500, Adam Litke wrote:
> 
>>>Also, it is going to be hugepage-only, isn't it? So should the naming be
>>>changed to reflect that? And #ifdef it...
>>
>>They are doing some interesting things on Cell that could take advantage
>>of this.
> 
> 
> That would be new to me.  What we need on Cell is fixing up the
> get_unmapped_area mess which Ben is working on now.
> 
> And let me once again repeat that I don't like this at all.  I'll
> rather have a few ugly ifdefs in strategic places than a big object
> oriented mess like this with just a single user.

I think I agree that we'd need more than one user for this.

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 1/7] Introduce the pagetable_operations and associated helper macros.
@ 2007-03-21 23:03           ` Nick Piggin
  0 siblings, 0 replies; 82+ messages in thread
From: Nick Piggin @ 2007-03-21 23:03 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Adam Litke, Andrew Morton, Arjan van de Ven,
	William Lee Irwin III, Ken Chen, linux-mm, linux-kernel

Christoph Hellwig wrote:
> On Wed, Mar 21, 2007 at 10:17:40AM -0500, Adam Litke wrote:
> 
>>>Also, it is going to be hugepage-only, isn't it? So should the naming be
>>>changed to reflect that? And #ifdef it...
>>
>>They are doing some interesting things on Cell that could take advantage
>>of this.
> 
> 
> That would be new to me.  What we need on Cell is fixing up the
> get_unmapped_area mess which Ben is working on now.
> 
> And let me once again repeat that I don't like this at all.  I'll
> rather have a few ugly ifdefs in strategic places than a big object
> oriented mess like this with just a single user.

I think I agree that we'd need more than one user for this.

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 1/7] Introduce the pagetable_operations and associated helper macros.
  2007-03-21 23:02         ` Nick Piggin
@ 2007-03-21 23:32           ` William Lee Irwin III
  -1 siblings, 0 replies; 82+ messages in thread
From: William Lee Irwin III @ 2007-03-21 23:32 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Adam Litke, Andrew Morton, Arjan van de Ven, Christoph Hellwig,
	Ken Chen, linux-mm, linux-kernel

Adam Litke wrote:
>> We didn't want to bloat the size of the vm_ops struct for all of its
>> users.

On Thu, Mar 22, 2007 at 10:02:07AM +1100, Nick Piggin wrote:
> But vmas are surely far more numerous than vm_ops, aren't they?

It should be clarified that the pointer to the operations structure
in once-per-mmap() vmas is a bigger overhead than once-per-driver
function pointers in the vm_ops structure.


-- wli

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 1/7] Introduce the pagetable_operations and associated helper macros.
@ 2007-03-21 23:32           ` William Lee Irwin III
  0 siblings, 0 replies; 82+ messages in thread
From: William Lee Irwin III @ 2007-03-21 23:32 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Adam Litke, Andrew Morton, Arjan van de Ven, Christoph Hellwig,
	Ken Chen, linux-mm, linux-kernel

Adam Litke wrote:
>> We didn't want to bloat the size of the vm_ops struct for all of its
>> users.

On Thu, Mar 22, 2007 at 10:02:07AM +1100, Nick Piggin wrote:
> But vmas are surely far more numerous than vm_ops, aren't they?

It should be clarified that the pointer to the operations structure
in once-per-mmap() vmas is a bigger overhead than once-per-driver
function pointers in the vm_ops structure.


-- wli

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: pagetable_ops: Hugetlb character device example
  2007-03-21 22:53         ` Matt Mackall
@ 2007-03-21 23:35           ` William Lee Irwin III
  -1 siblings, 0 replies; 82+ messages in thread
From: William Lee Irwin III @ 2007-03-21 23:35 UTC (permalink / raw)
  To: Matt Mackall
  Cc: Valdis.Kletnieks, Adam Litke, Andrew Morton, Arjan van de Ven,
	Christoph Hellwig, Ken Chen, linux-mm, linux-kernel

On Wed, Mar 21, 2007 at 03:26:59PM -0700, William Lee Irwin III wrote:
>> My exit strategy was to make hugetlbfs an alias for ramfs when ramfs
>> acquired the necessary functionality until expand-on-mmap() was merged.
>> That would've allowed rm -rf fs/hugetlbfs/ outright. A compatibility
>> wrapper for expand-on-mmap() around ramfs once ramfs acquires the
>> necessary functionality is now the exit strategy.

On Wed, Mar 21, 2007 at 05:53:48PM -0500, Matt Mackall wrote:
> Can you describe what ramfs needs here in a bit more detail?
> If it's non-trivial, I'd rather see any new functionality go into
> shmfs/tmpfs, as ramfs has done a good job at staying a minimal fs thus
> far.

I was referring to fully-general multiple pagesize support. ramfs
would inherit the functionality by virtue of generic pagecache and TLB
handling in such an arrangement. It doesn't make sense to modify ramfs
as a special case; hugetlb is as it stands a ramfs special-cased for
such purposes.


-- wli

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: pagetable_ops: Hugetlb character device example
@ 2007-03-21 23:35           ` William Lee Irwin III
  0 siblings, 0 replies; 82+ messages in thread
From: William Lee Irwin III @ 2007-03-21 23:35 UTC (permalink / raw)
  To: Matt Mackall
  Cc: Valdis.Kletnieks, Adam Litke, Andrew Morton, Arjan van de Ven,
	Christoph Hellwig, Ken Chen, linux-mm, linux-kernel

On Wed, Mar 21, 2007 at 03:26:59PM -0700, William Lee Irwin III wrote:
>> My exit strategy was to make hugetlbfs an alias for ramfs when ramfs
>> acquired the necessary functionality until expand-on-mmap() was merged.
>> That would've allowed rm -rf fs/hugetlbfs/ outright. A compatibility
>> wrapper for expand-on-mmap() around ramfs once ramfs acquires the
>> necessary functionality is now the exit strategy.

On Wed, Mar 21, 2007 at 05:53:48PM -0500, Matt Mackall wrote:
> Can you describe what ramfs needs here in a bit more detail?
> If it's non-trivial, I'd rather see any new functionality go into
> shmfs/tmpfs, as ramfs has done a good job at staying a minimal fs thus
> far.

I was referring to fully-general multiple pagesize support. ramfs
would inherit the functionality by virtue of generic pagecache and TLB
handling in such an arrangement. It doesn't make sense to modify ramfs
as a special case; hugetlb is as it stands a ramfs special-cased for
such purposes.


-- wli

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: pagetable_ops: Hugetlb character device example
  2007-03-21 23:35           ` William Lee Irwin III
@ 2007-03-22  0:31             ` Matt Mackall
  -1 siblings, 0 replies; 82+ messages in thread
From: Matt Mackall @ 2007-03-22  0:31 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Valdis.Kletnieks, Adam Litke, Andrew Morton, Arjan van de Ven,
	Christoph Hellwig, Ken Chen, linux-mm, linux-kernel

On Wed, Mar 21, 2007 at 04:35:28PM -0700, William Lee Irwin III wrote:
> On Wed, Mar 21, 2007 at 03:26:59PM -0700, William Lee Irwin III wrote:
> >> My exit strategy was to make hugetlbfs an alias for ramfs when ramfs
> >> acquired the necessary functionality until expand-on-mmap() was merged.
> >> That would've allowed rm -rf fs/hugetlbfs/ outright. A compatibility
> >> wrapper for expand-on-mmap() around ramfs once ramfs acquires the
> >> necessary functionality is now the exit strategy.
> 
> On Wed, Mar 21, 2007 at 05:53:48PM -0500, Matt Mackall wrote:
> > Can you describe what ramfs needs here in a bit more detail?
> > If it's non-trivial, I'd rather see any new functionality go into
> > shmfs/tmpfs, as ramfs has done a good job at staying a minimal fs thus
> > far.
> 
> I was referring to fully-general multiple pagesize support. ramfs
> would inherit the functionality by virtue of generic pagecache and TLB
> handling in such an arrangement. It doesn't make sense to modify ramfs
> as a special case; hugetlb is as it stands a ramfs special-cased for
> such purposes.

Ahh, I see.

Good luck!

-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: pagetable_ops: Hugetlb character device example
@ 2007-03-22  0:31             ` Matt Mackall
  0 siblings, 0 replies; 82+ messages in thread
From: Matt Mackall @ 2007-03-22  0:31 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Valdis.Kletnieks, Adam Litke, Andrew Morton, Arjan van de Ven,
	Christoph Hellwig, Ken Chen, linux-mm, linux-kernel

On Wed, Mar 21, 2007 at 04:35:28PM -0700, William Lee Irwin III wrote:
> On Wed, Mar 21, 2007 at 03:26:59PM -0700, William Lee Irwin III wrote:
> >> My exit strategy was to make hugetlbfs an alias for ramfs when ramfs
> >> acquired the necessary functionality until expand-on-mmap() was merged.
> >> That would've allowed rm -rf fs/hugetlbfs/ outright. A compatibility
> >> wrapper for expand-on-mmap() around ramfs once ramfs acquires the
> >> necessary functionality is now the exit strategy.
> 
> On Wed, Mar 21, 2007 at 05:53:48PM -0500, Matt Mackall wrote:
> > Can you describe what ramfs needs here in a bit more detail?
> > If it's non-trivial, I'd rather see any new functionality go into
> > shmfs/tmpfs, as ramfs has done a good job at staying a minimal fs thus
> > far.
> 
> I was referring to fully-general multiple pagesize support. ramfs
> would inherit the functionality by virtue of generic pagecache and TLB
> handling in such an arrangement. It doesn't make sense to modify ramfs
> as a special case; hugetlb is as it stands a ramfs special-cased for
> such purposes.

Ahh, I see.

Good luck!

-- 
Mathematics is the supreme nostalgia of our time.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: pagetable_ops: Hugetlb character device example
  2007-03-21 19:43   ` Adam Litke
@ 2007-03-22 10:38     ` Christoph Hellwig
  -1 siblings, 0 replies; 82+ messages in thread
From: Christoph Hellwig @ 2007-03-22 10:38 UTC (permalink / raw)
  To: Adam Litke
  Cc: Andrew Morton, Arjan van de Ven, William Lee Irwin III,
	Christoph Hellwig, Ken Chen, linux-mm, linux-kernel

On Wed, Mar 21, 2007 at 02:43:48PM -0500, Adam Litke wrote:
> The main reason I am advocating a set of pagetable_operations is to
> enable the development of a new hugetlb interface.  During the hugetlb
> BOFS at OLS last year, we talked about a character device that would
> behave like /dev/zero.  Many of the people were talking about how they
> just wanted to create MAP_PRIVATE hugetlb mappings without all the fuss
> about the hugetlbfs filesystem.  /dev/zero is a familiar interface for
> getting anonymous memory so bringing that model to huge pages would make
> programming for anonymous huge pages easier.

That is a very laudable goal, but an utterly wrong way to get there.
Despite Linus' veto a while ago what we really want is support for transparent
super pages.  Adding random pointer indirections where we had the direct
hugetlb calls before isn't helpful for that at all.  As a start you might
want to make a clear destinction between core hugetlb code and the
filesystem interface to it without all the useless indirections.  That
should get you as far as your char dev interface.  But over the long
term the core VM needs to deal with multiple (and probably not just two)
page sizes.  Given that the code to deal with different sized pages is
essentially the same just on different units on most architectures cries
for a better method to implement this than adding random function indirection
that point to mostly identical code.


And your driver is the best example of why we utterly don't want
a page_table operations interface.  The last thing we want is random
driver taking over core VM functionality.  The right way would be to a
filesystem/driver to tell (or maybe just give hints) which page size
to use for this mapping.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: pagetable_ops: Hugetlb character device example
@ 2007-03-22 10:38     ` Christoph Hellwig
  0 siblings, 0 replies; 82+ messages in thread
From: Christoph Hellwig @ 2007-03-22 10:38 UTC (permalink / raw)
  To: Adam Litke
  Cc: Andrew Morton, Arjan van de Ven, William Lee Irwin III,
	Christoph Hellwig, Ken Chen, linux-mm, linux-kernel

On Wed, Mar 21, 2007 at 02:43:48PM -0500, Adam Litke wrote:
> The main reason I am advocating a set of pagetable_operations is to
> enable the development of a new hugetlb interface.  During the hugetlb
> BOFS at OLS last year, we talked about a character device that would
> behave like /dev/zero.  Many of the people were talking about how they
> just wanted to create MAP_PRIVATE hugetlb mappings without all the fuss
> about the hugetlbfs filesystem.  /dev/zero is a familiar interface for
> getting anonymous memory so bringing that model to huge pages would make
> programming for anonymous huge pages easier.

That is a very laudable goal, but an utterly wrong way to get there.
Despite Linus' veto a while ago what we really want is support for transparent
super pages.  Adding random pointer indirections where we had the direct
hugetlb calls before isn't helpful for that at all.  As a start you might
want to make a clear destinction between core hugetlb code and the
filesystem interface to it without all the useless indirections.  That
should get you as far as your char dev interface.  But over the long
term the core VM needs to deal with multiple (and probably not just two)
page sizes.  Given that the code to deal with different sized pages is
essentially the same just on different units on most architectures cries
for a better method to implement this than adding random function indirection
that point to mostly identical code.


And your driver is the best example of why we utterly don't want
a page_table operations interface.  The last thing we want is random
driver taking over core VM functionality.  The right way would be to a
filesystem/driver to tell (or maybe just give hints) which page size
to use for this mapping.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: pagetable_ops: Hugetlb character device example
  2007-03-22 10:38     ` Christoph Hellwig
@ 2007-03-22 15:42       ` Mel Gorman
  -1 siblings, 0 replies; 82+ messages in thread
From: Mel Gorman @ 2007-03-22 15:42 UTC (permalink / raw)
  To: Christoph Hellwig, Adam Litke, Andrew Morton, Arjan van de Ven,
	William Lee Irwin III, Ken Chen, linux-mm, linux-kernel

On (22/03/07 10:38), Christoph Hellwig didst pronounce:
> On Wed, Mar 21, 2007 at 02:43:48PM -0500, Adam Litke wrote:
> > The main reason I am advocating a set of pagetable_operations is to
> > enable the development of a new hugetlb interface.  During the hugetlb
> > BOFS at OLS last year, we talked about a character device that would
> > behave like /dev/zero.  Many of the people were talking about how they
> > just wanted to create MAP_PRIVATE hugetlb mappings without all the fuss
> > about the hugetlbfs filesystem.  /dev/zero is a familiar interface for
> > getting anonymous memory so bringing that model to huge pages would make
> > programming for anonymous huge pages easier.
> 
> That is a very laudable goal, but an utterly wrong way to get there.
> Despite Linus' veto a while ago what we really want is support for transparent
> super pages.

A year ago, I may have agreed with you. However, Linus not only veto'd it but
stamped on it repeatadly at VM Summit. He couldn't have made it clearer if
he wore a t-shirt a hat and held up a neon sign. The assertion at the time
was that variable page support of any sort had to be outside of the core VM
because automatic support will get it wrong in some cases and makes the core
VM harder to understand (because it's super-clear at the moment). Others
attending agreed with the position. That position rules out drivers or
filesystems giving hints about superpage sizes in the foreseeable future.

What they did not have any problem with was providing better interfaces to
program against as long as they were on the side of the VM like hugetlbfs
and not in the core. The character device for private mappings is an
example of an interface that is easier to program against than hugetlbfs.
It's far easier for an application to mmap a file at a fixed location than
trying to discover if hugetlbfs is mounted or not. However, to support that
sort of interface, there needs to be a way of telling the VM to call the an
alternative pagetable handler - hence Adam's patches.

Someone with sufficient energy could try implementing variable page support
entirely as a device using Adam's interface. If it turned out to be a
good idea, then another push could be made for transparent support later.
As it is, transparent superpage support is a also bit of a bitch for Power
and IA64. Power because in many cases (not all), pages of two different
sizes cannot be in the same virtual address range. IA64 has issues because
with the *current* pagetable implementation, hugepages are limited to fixed
address ranges. These sort of issues alone make transparent support in the
kernel a non-trivial problem.

> Adding random pointer indirections where we had the direct
> hugetlb calls before isn't helpful for that at all. 

They aren't random, they are pretty specific. Also, even when paths like fault
is entered, the cost of an indirect call is insignificant in comparison to
the page allocation, clearing the page and updating page tables.

In Adam's current patches, the indirect call only happens when a driver is
using the pagetable ops. In the tests I looked at, the cost of the branch
could only be detected on an instruction-level profile and even the branch
cost was pretty damn tiny. If it was a case that indirect calls always took
place, it *might* be a bit more noticable but still nothing in comparison
to the cost of the remainder of the operation.

> As a start you might
> want to make a clear destinction between core hugetlb code and the
> filesystem interface to it without all the useless indirections. 

The indirect calls are about supporting interfaces to userspace. In practice,
the hugetlbfs interface, the shared memory interface and the character device
interface would share a large amount of core code.  Admittadly that code
could do with restructuring because it's all mangled together at the moment.

The core hugetlb code as you call it is mainly dealing with page cache and
huge page pool management. The filesystem layer is relatively thin on top of
it. With Adams pagetable abstraction, it would make more sense to restructuring
the huge page code and separate out core-support-for-superpages from hugetlbfs.

> That
> should get you as far as your char dev interface. 

No, it wouldn't. Restructing the current code would allow better sharing
between interfaces but that's it. At the end of the restructuring, we'd still
need a way of saying "this VMA should be using some but not all the hugetlb
code over there even though I'm not hugetlbfs". At that point, we'd be back
at the pagetable ops abstraction.

> But over the long
> term the core VM needs to deal with multiple (and probably not just two)
> page sizes.  Given that the code to deal with different sized pages is
> essentially the same just on different units on most architectures cries
> for a better method to implement this than adding random function indirection
> that point to mostly identical code.
> 

Internally, a semi-sane way of supporting multiple page sizes would be
to have one internal VFS mount per page size and using the hugetlbfs page
cache management code.  Currently, HugetlbFS is basically a wrapper around
an internal VFS mount whose pages happen to be a specific size.

That said, variable page sizes is a different problem to the one Adam is
addressing here. In fact, someone with sufficient energy could implement a
variable page device behind Adam's abstraction just to see if it worked in
practice or not.

Restructering to support something like variable page support and more than
one interface would look something like;

HugetlbFS Interface	Shared Memory Interface		Char Device
	|                         |                          |
	|                         |               |-----------------
	|                         |               |                |
Internal hugetlbfs mount   Internal mount       mount            mount
 size HugePageSize         size HugPageSize   HugePageSize   different size
        |                         |               |                |
	|-----------------------------------------------------------
	                          |
		          Hugeage Reservation tracking
			          |
			  Hugepage pool management
			          |
		           Page allocator
	

But if this nice arrangement existed today, Adams patches would still be
needed to make it usable.

> And your driver is the best example of why we utterly don't want
> a page_table operations interface.  The last thing we want is random
> driver taking over core VM functionality. 

Who said anything about random? If a new driver of any sort shows up and
using pagetable_ops, the developer will certainly be asked what they are
doing that for.

> The right way would be to a
> filesystem/driver to tell (or maybe just give hints) which page size
> to use for this mapping.
> 

If a driver wants to "tell" what pagesize to use, they can override the ops
to call the appropriate hugetlb code. Hints will be damn near impossible to
get right in all cases.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: pagetable_ops: Hugetlb character device example
@ 2007-03-22 15:42       ` Mel Gorman
  0 siblings, 0 replies; 82+ messages in thread
From: Mel Gorman @ 2007-03-22 15:42 UTC (permalink / raw)
  To: Christoph Hellwig, Adam Litke, Andrew Morton, Arjan van de Ven,
	William Lee Irwin III, Ken Chen, linux-mm, linux-kernel

On (22/03/07 10:38), Christoph Hellwig didst pronounce:
> On Wed, Mar 21, 2007 at 02:43:48PM -0500, Adam Litke wrote:
> > The main reason I am advocating a set of pagetable_operations is to
> > enable the development of a new hugetlb interface.  During the hugetlb
> > BOFS at OLS last year, we talked about a character device that would
> > behave like /dev/zero.  Many of the people were talking about how they
> > just wanted to create MAP_PRIVATE hugetlb mappings without all the fuss
> > about the hugetlbfs filesystem.  /dev/zero is a familiar interface for
> > getting anonymous memory so bringing that model to huge pages would make
> > programming for anonymous huge pages easier.
> 
> That is a very laudable goal, but an utterly wrong way to get there.
> Despite Linus' veto a while ago what we really want is support for transparent
> super pages.

A year ago, I may have agreed with you. However, Linus not only veto'd it but
stamped on it repeatadly at VM Summit. He couldn't have made it clearer if
he wore a t-shirt a hat and held up a neon sign. The assertion at the time
was that variable page support of any sort had to be outside of the core VM
because automatic support will get it wrong in some cases and makes the core
VM harder to understand (because it's super-clear at the moment). Others
attending agreed with the position. That position rules out drivers or
filesystems giving hints about superpage sizes in the foreseeable future.

What they did not have any problem with was providing better interfaces to
program against as long as they were on the side of the VM like hugetlbfs
and not in the core. The character device for private mappings is an
example of an interface that is easier to program against than hugetlbfs.
It's far easier for an application to mmap a file at a fixed location than
trying to discover if hugetlbfs is mounted or not. However, to support that
sort of interface, there needs to be a way of telling the VM to call the an
alternative pagetable handler - hence Adam's patches.

Someone with sufficient energy could try implementing variable page support
entirely as a device using Adam's interface. If it turned out to be a
good idea, then another push could be made for transparent support later.
As it is, transparent superpage support is a also bit of a bitch for Power
and IA64. Power because in many cases (not all), pages of two different
sizes cannot be in the same virtual address range. IA64 has issues because
with the *current* pagetable implementation, hugepages are limited to fixed
address ranges. These sort of issues alone make transparent support in the
kernel a non-trivial problem.

> Adding random pointer indirections where we had the direct
> hugetlb calls before isn't helpful for that at all. 

They aren't random, they are pretty specific. Also, even when paths like fault
is entered, the cost of an indirect call is insignificant in comparison to
the page allocation, clearing the page and updating page tables.

In Adam's current patches, the indirect call only happens when a driver is
using the pagetable ops. In the tests I looked at, the cost of the branch
could only be detected on an instruction-level profile and even the branch
cost was pretty damn tiny. If it was a case that indirect calls always took
place, it *might* be a bit more noticable but still nothing in comparison
to the cost of the remainder of the operation.

> As a start you might
> want to make a clear destinction between core hugetlb code and the
> filesystem interface to it without all the useless indirections. 

The indirect calls are about supporting interfaces to userspace. In practice,
the hugetlbfs interface, the shared memory interface and the character device
interface would share a large amount of core code.  Admittadly that code
could do with restructuring because it's all mangled together at the moment.

The core hugetlb code as you call it is mainly dealing with page cache and
huge page pool management. The filesystem layer is relatively thin on top of
it. With Adams pagetable abstraction, it would make more sense to restructuring
the huge page code and separate out core-support-for-superpages from hugetlbfs.

> That
> should get you as far as your char dev interface. 

No, it wouldn't. Restructing the current code would allow better sharing
between interfaces but that's it. At the end of the restructuring, we'd still
need a way of saying "this VMA should be using some but not all the hugetlb
code over there even though I'm not hugetlbfs". At that point, we'd be back
at the pagetable ops abstraction.

> But over the long
> term the core VM needs to deal with multiple (and probably not just two)
> page sizes.  Given that the code to deal with different sized pages is
> essentially the same just on different units on most architectures cries
> for a better method to implement this than adding random function indirection
> that point to mostly identical code.
> 

Internally, a semi-sane way of supporting multiple page sizes would be
to have one internal VFS mount per page size and using the hugetlbfs page
cache management code.  Currently, HugetlbFS is basically a wrapper around
an internal VFS mount whose pages happen to be a specific size.

That said, variable page sizes is a different problem to the one Adam is
addressing here. In fact, someone with sufficient energy could implement a
variable page device behind Adam's abstraction just to see if it worked in
practice or not.

Restructering to support something like variable page support and more than
one interface would look something like;

HugetlbFS Interface	Shared Memory Interface		Char Device
	|                         |                          |
	|                         |               |-----------------
	|                         |               |                |
Internal hugetlbfs mount   Internal mount       mount            mount
 size HugePageSize         size HugPageSize   HugePageSize   different size
        |                         |               |                |
	|-----------------------------------------------------------
	                          |
		          Hugeage Reservation tracking
			          |
			  Hugepage pool management
			          |
		           Page allocator
	

But if this nice arrangement existed today, Adams patches would still be
needed to make it usable.

> And your driver is the best example of why we utterly don't want
> a page_table operations interface.  The last thing we want is random
> driver taking over core VM functionality. 

Who said anything about random? If a new driver of any sort shows up and
using pagetable_ops, the developer will certainly be asked what they are
doing that for.

> The right way would be to a
> filesystem/driver to tell (or maybe just give hints) which page size
> to use for this mapping.
> 

If a driver wants to "tell" what pagesize to use, they can override the ops
to call the appropriate hugetlb code. Hints will be damn near impossible to
get right in all cases.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: pagetable_ops: Hugetlb character device example
  2007-03-22 15:42       ` Mel Gorman
@ 2007-03-22 18:15         ` Christoph Hellwig
  -1 siblings, 0 replies; 82+ messages in thread
From: Christoph Hellwig @ 2007-03-22 18:15 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Christoph Hellwig, Adam Litke, Andrew Morton, Arjan van de Ven,
	William Lee Irwin III, Ken Chen, linux-mm, linux-kernel

On Thu, Mar 22, 2007 at 03:42:27PM +0000, Mel Gorman wrote:
> A year ago, I may have agreed with you. However, Linus not only veto'd it but
> stamped on it repeatadly at VM Summit. He couldn't have made it clearer if
> he wore a t-shirt a hat and held up a neon sign. The assertion at the time
> was that variable page support of any sort had to be outside of the core VM
> because automatic support will get it wrong in some cases and makes the core
> VM harder to understand (because it's super-clear at the moment). Others
> attending agreed with the position. That position rules out drivers or
> filesystems giving hints about superpage sizes in the foreseeable future.

Actually I think the only way to get it right is to do it in the core
(or inm the architecture code for the really nasty bits of course), but
then again this isn't the point I want to make here..

> What they did not have any problem with was providing better interfaces to
> program against as long as they were on the side of the VM like hugetlbfs
> and not in the core. The character device for private mappings is an
> example of an interface that is easier to program against than hugetlbfs.
> It's far easier for an application to mmap a file at a fixed location than
> trying to discover if hugetlbfs is mounted or not. However, to support that
> sort of interface, there needs to be a way of telling the VM to call the an
> alternative pagetable handler - hence Adam's patches.

.. and this is where we get into problems.  There should be no need to
use all kinds of pseudo-OO obsfucation to get there.  A VMA flag that
means 'this is hugetlb backed anonymous memory' is much nicer to archive
this.  Because it makes clear there is exactly one special case here
and no carte blanche for drivers to do whatever they want.  I would prefer
to even get rid of that single special case as mentioned above, but I'm
definitly set dead against at making this special case totally open for
random bits of the kernel to mess with.

> Someone with sufficient energy could try implementing variable page support
> entirely as a device using Adam's interface. 

Hopefully not, doing this in a driver would be utterly braindead and
certainly not mergeable.


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: pagetable_ops: Hugetlb character device example
@ 2007-03-22 18:15         ` Christoph Hellwig
  0 siblings, 0 replies; 82+ messages in thread
From: Christoph Hellwig @ 2007-03-22 18:15 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Christoph Hellwig, Adam Litke, Andrew Morton, Arjan van de Ven,
	William Lee Irwin III, Ken Chen, linux-mm, linux-kernel

On Thu, Mar 22, 2007 at 03:42:27PM +0000, Mel Gorman wrote:
> A year ago, I may have agreed with you. However, Linus not only veto'd it but
> stamped on it repeatadly at VM Summit. He couldn't have made it clearer if
> he wore a t-shirt a hat and held up a neon sign. The assertion at the time
> was that variable page support of any sort had to be outside of the core VM
> because automatic support will get it wrong in some cases and makes the core
> VM harder to understand (because it's super-clear at the moment). Others
> attending agreed with the position. That position rules out drivers or
> filesystems giving hints about superpage sizes in the foreseeable future.

Actually I think the only way to get it right is to do it in the core
(or inm the architecture code for the really nasty bits of course), but
then again this isn't the point I want to make here..

> What they did not have any problem with was providing better interfaces to
> program against as long as they were on the side of the VM like hugetlbfs
> and not in the core. The character device for private mappings is an
> example of an interface that is easier to program against than hugetlbfs.
> It's far easier for an application to mmap a file at a fixed location than
> trying to discover if hugetlbfs is mounted or not. However, to support that
> sort of interface, there needs to be a way of telling the VM to call the an
> alternative pagetable handler - hence Adam's patches.

.. and this is where we get into problems.  There should be no need to
use all kinds of pseudo-OO obsfucation to get there.  A VMA flag that
means 'this is hugetlb backed anonymous memory' is much nicer to archive
this.  Because it makes clear there is exactly one special case here
and no carte blanche for drivers to do whatever they want.  I would prefer
to even get rid of that single special case as mentioned above, but I'm
definitly set dead against at making this special case totally open for
random bits of the kernel to mess with.

> Someone with sufficient energy could try implementing variable page support
> entirely as a device using Adam's interface. 

Hopefully not, doing this in a driver would be utterly braindead and
certainly not mergeable.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: pagetable_ops: Hugetlb character device example
  2007-03-22 18:15         ` Christoph Hellwig
@ 2007-03-23 14:57           ` Mel Gorman
  -1 siblings, 0 replies; 82+ messages in thread
From: Mel Gorman @ 2007-03-23 14:57 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Adam Litke, Andrew Morton, Arjan van de Ven,
	William Lee Irwin III, Ken Chen, Linux Memory Management List,
	Linux Kernel Mailing List


On Thu, 22 Mar 2007, Christoph Hellwig wrote:

> On Thu, Mar 22, 2007 at 03:42:27PM +0000, Mel Gorman wrote:
> > A year ago, I may have agreed with you. However, Linus not only veto'd
> > it but
> > stamped on it repeatadly at VM Summit. He couldn't have made it clearer
> > if
> > he wore a t-shirt a hat and held up a neon sign. The assertion at the
> > time
> > was that variable page support of any sort had to be outside of the core
> > VM
> > because automatic support will get it wrong in some cases and makes the
> > core
> > VM harder to understand (because it's super-clear at the moment). Others
> > attending agreed with the position. That position rules out drivers or
> > filesystems giving hints about superpage sizes in the foreseeable
> > future.
> 
> Actually I think the only way to get it right is to do it in the core
> (or inm the architecture code for the really nasty bits of course), but
> then again this isn't the point I want to make here..
>

Maybe ultimatly it's the right thing to do, more can be done "on the side"
until such time as it's really worth handling the complexity in the core VM.

> > What they did not have any problem with was providing better interfaces
> > to
> > program against as long as they were on the side of the VM like
> > hugetlbfs
> > and not in the core. The character device for private mappings is an
> > example of an interface that is easier to program against than
> > hugetlbfs.
> > It's far easier for an application to mmap a file at a fixed location
> > than
> > trying to discover if hugetlbfs is mounted or not. However, to support
> > that
> > sort of interface, there needs to be a way of telling the VM to call the
> > an
> > alternative pagetable handler - hence Adam's patches.
> 
> .. and this is where we get into problems.  There should be no need to
> use all kinds of pseudo-OO obsfucation to get there.  A VMA flag that
> means 'this is hugetlb backed anonymous memory' is much nicer to archive
> this.

Except in the example he posted, the fault handler for the char device and
the hugetlbfs case are doing slightly different things. The fault handler
for hugetlbfs assumes the existance of a file mapping where as the char
device is inserting the page directly. Having the bit is not enough for the
core is not enough to determine that a slightly different fault handler was
needed.

With the current code, it is almost impossible for a driver with a different
pagetable layout to express different semantics to hugetlbfs with respects
to how their pagetables are setup. Altering hugetlbfs much is very difficult
because any alteration becomes global in nature. The ops would allow a
experimental drivers and interfaces to be developed without breaking
existing users of hugetlbfs and let us figure out things like "Is it worth
supporting 1GiB pages in Opterons" without breaking everything else in the
process.

> Because it makes clear there is exactly one special case here
> and no carte blanche for drivers to do whatever they want.

Drivers can already cause all sorts of mayhem through the existing hooks if
they are perverse enough. It is never encouraged of course but nothing
prevents them.

> I would prefer
> to even get rid of that single special case as mentioned above, but I'm
> definitly set dead against at making this special case totally open for
> random bits of the kernel to mess with.
>

As kernel memory is already backed by huge tlb entries in many cases, random
drivers should have little or no interest in doing anything mad with
pagetable ops. All it gets them is pain and entertaining posts from the
mailing list.

That said, your main objection seems to be opening to door to arbitrary
drivers to change the pagetable ops. I can see your point and Hughs on why
this could lead to some hilarity down the road, particularly if out-of-tree
drivers entering into the mess so how about the following;

Instead of having a vma->pagetable_ops with a structure of pointers,
it would be a simple integer  into a fixed list of pagetable operation
handlers. Something like....

#define PAGETABLE_OP_DEFAULT      0
#define PAGETABLE_OP_HUGETLB_FS   1
#define PAGETABLE_OP_HUGETLB_CHAR 2

struct pagetable_operations_struct[] pagetable_ops_lookup_table = {
 	/* PAGETABLE_OP_DEFAULT assuming we always used the table */
 	{
 		.fault = handle_pte_fault
 		....
 	},

 	/* PAGETABLE_OP_HUGETLB_FS */
 	{
 		.fault    = hugetlb_fault
 		.copy_vma = copy_hugetlb_page_range
 		......
 	},

 	/* PAGETABLE_OP_HUGETLB_CHAR */
 	{
 		.fault = whatever
 	}
};

Drivers would only be able to set an index in the VMA for this table.
The lookup would be about the same cost as what is currently there. However,
random drivers cannot mess with the pagetable ops - they would have to be
known by the core. Experimental drivers would have to update the table but
that shouldn't be an issue. Out-of-tree drivers would have no ability to
mess here at all which is a good thing.

> > Someone with sufficient energy could try implementing variable page
> > support
> > entirely as a device using Adam's interface.
> 
> Hopefully not, doing this in a driver would be utterly braindead and
> certainly not mergeable.
>

Indeed, but exposing superpages only through a magic filesystem with special
semantics doesn't win the nobel prize either. The point is to have the
ability to develop alternatives to try them out without breaking existing
users of hugepages. How else can it be shown that further churn in the
core superpage support would be worth it?

All of that said, while I was writing this mail up Ken Chen posted a patch 
for a char device that sits on top of hugetlbfs that looks promising. That 
can be kicked that around a bit and put the pagetable ops on the 
back-burner until such time as there is a driver of interest that really 
needs to do something different semantically different to hugetlbfs with 
respects to page tables.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: pagetable_ops: Hugetlb character device example
@ 2007-03-23 14:57           ` Mel Gorman
  0 siblings, 0 replies; 82+ messages in thread
From: Mel Gorman @ 2007-03-23 14:57 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Adam Litke, Andrew Morton, Arjan van de Ven,
	William Lee Irwin III, Ken Chen, Linux Memory Management List,
	Linux Kernel Mailing List

On Thu, 22 Mar 2007, Christoph Hellwig wrote:

> On Thu, Mar 22, 2007 at 03:42:27PM +0000, Mel Gorman wrote:
> > A year ago, I may have agreed with you. However, Linus not only veto'd
> > it but
> > stamped on it repeatadly at VM Summit. He couldn't have made it clearer
> > if
> > he wore a t-shirt a hat and held up a neon sign. The assertion at the
> > time
> > was that variable page support of any sort had to be outside of the core
> > VM
> > because automatic support will get it wrong in some cases and makes the
> > core
> > VM harder to understand (because it's super-clear at the moment). Others
> > attending agreed with the position. That position rules out drivers or
> > filesystems giving hints about superpage sizes in the foreseeable
> > future.
> 
> Actually I think the only way to get it right is to do it in the core
> (or inm the architecture code for the really nasty bits of course), but
> then again this isn't the point I want to make here..
>

Maybe ultimatly it's the right thing to do, more can be done "on the side"
until such time as it's really worth handling the complexity in the core VM.

> > What they did not have any problem with was providing better interfaces
> > to
> > program against as long as they were on the side of the VM like
> > hugetlbfs
> > and not in the core. The character device for private mappings is an
> > example of an interface that is easier to program against than
> > hugetlbfs.
> > It's far easier for an application to mmap a file at a fixed location
> > than
> > trying to discover if hugetlbfs is mounted or not. However, to support
> > that
> > sort of interface, there needs to be a way of telling the VM to call the
> > an
> > alternative pagetable handler - hence Adam's patches.
> 
> .. and this is where we get into problems.  There should be no need to
> use all kinds of pseudo-OO obsfucation to get there.  A VMA flag that
> means 'this is hugetlb backed anonymous memory' is much nicer to archive
> this.

Except in the example he posted, the fault handler for the char device and
the hugetlbfs case are doing slightly different things. The fault handler
for hugetlbfs assumes the existance of a file mapping where as the char
device is inserting the page directly. Having the bit is not enough for the
core is not enough to determine that a slightly different fault handler was
needed.

With the current code, it is almost impossible for a driver with a different
pagetable layout to express different semantics to hugetlbfs with respects
to how their pagetables are setup. Altering hugetlbfs much is very difficult
because any alteration becomes global in nature. The ops would allow a
experimental drivers and interfaces to be developed without breaking
existing users of hugetlbfs and let us figure out things like "Is it worth
supporting 1GiB pages in Opterons" without breaking everything else in the
process.

> Because it makes clear there is exactly one special case here
> and no carte blanche for drivers to do whatever they want.

Drivers can already cause all sorts of mayhem through the existing hooks if
they are perverse enough. It is never encouraged of course but nothing
prevents them.

> I would prefer
> to even get rid of that single special case as mentioned above, but I'm
> definitly set dead against at making this special case totally open for
> random bits of the kernel to mess with.
>

As kernel memory is already backed by huge tlb entries in many cases, random
drivers should have little or no interest in doing anything mad with
pagetable ops. All it gets them is pain and entertaining posts from the
mailing list.

That said, your main objection seems to be opening to door to arbitrary
drivers to change the pagetable ops. I can see your point and Hughs on why
this could lead to some hilarity down the road, particularly if out-of-tree
drivers entering into the mess so how about the following;

Instead of having a vma->pagetable_ops with a structure of pointers,
it would be a simple integer  into a fixed list of pagetable operation
handlers. Something like....

#define PAGETABLE_OP_DEFAULT      0
#define PAGETABLE_OP_HUGETLB_FS   1
#define PAGETABLE_OP_HUGETLB_CHAR 2

struct pagetable_operations_struct[] pagetable_ops_lookup_table = {
 	/* PAGETABLE_OP_DEFAULT assuming we always used the table */
 	{
 		.fault = handle_pte_fault
 		....
 	},

 	/* PAGETABLE_OP_HUGETLB_FS */
 	{
 		.fault    = hugetlb_fault
 		.copy_vma = copy_hugetlb_page_range
 		......
 	},

 	/* PAGETABLE_OP_HUGETLB_CHAR */
 	{
 		.fault = whatever
 	}
};

Drivers would only be able to set an index in the VMA for this table.
The lookup would be about the same cost as what is currently there. However,
random drivers cannot mess with the pagetable ops - they would have to be
known by the core. Experimental drivers would have to update the table but
that shouldn't be an issue. Out-of-tree drivers would have no ability to
mess here at all which is a good thing.

> > Someone with sufficient energy could try implementing variable page
> > support
> > entirely as a device using Adam's interface.
> 
> Hopefully not, doing this in a driver would be utterly braindead and
> certainly not mergeable.
>

Indeed, but exposing superpages only through a magic filesystem with special
semantics doesn't win the nobel prize either. The point is to have the
ability to develop alternatives to try them out without breaking existing
users of hugepages. How else can it be shown that further churn in the
core superpage support would be worth it?

All of that said, while I was writing this mail up Ken Chen posted a patch 
for a char device that sits on top of hugetlbfs that looks promising. That 
can be kicked that around a bit and put the pagetable ops on the 
back-burner until such time as there is a driver of interest that really 
needs to do something different semantically different to hugetlbfs with 
respects to page tables.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* [PATCH 7/7] hugetlbfs fault handler
  2007-02-19 18:31 [PATCH 0/7] [RFC] hugetlb: pagetable_operations API Adam Litke
@ 2007-02-19 18:32   ` Adam Litke
  0 siblings, 0 replies; 82+ messages in thread
From: Adam Litke @ 2007-02-19 18:32 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, agl


Signed-off-by: Adam Litke <agl@us.ibm.com>
---

 fs/hugetlbfs/inode.c |    1 +
 mm/hugetlb.c         |    4 +++-
 mm/memory.c          |    4 ++--
 3 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 3461f9b..1de73c1 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -568,6 +568,7 @@ static struct pagetable_operations_struct hugetlbfs_pagetable_ops = {
 	.unmap_page_range	= unmap_hugepage_range,
 	.change_protection	= hugetlb_change_protection,
 	.free_pgtable_range	= hugetlb_free_pgd_range,
+	.fault			= hugetlb_fault,
 };
 
 static struct inode_operations hugetlbfs_dir_inode_operations = {
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3bcc0db..63de369 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -588,6 +588,8 @@ int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	unsigned long vaddr = *position;
 	int remainder = *length;
 
+	BUG_ON(!has_pt_op(vma, fault));
+
 	spin_lock(&mm->page_table_lock);
 	while (vaddr < vma->vm_end && remainder) {
 		pte_t *pte;
@@ -604,7 +606,7 @@ int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			int ret;
 
 			spin_unlock(&mm->page_table_lock);
-			ret = hugetlb_fault(mm, vma, vaddr, 0);
+			ret = pt_op(vma, fault)(mm, vma, vaddr, 0);
 			spin_lock(&mm->page_table_lock);
 			if (ret == VM_FAULT_MINOR)
 				continue;
diff --git a/mm/memory.c b/mm/memory.c
index 442e6b2..c8e17b4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2455,8 +2455,8 @@ int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	count_vm_event(PGFAULT);
 
-	if (unlikely(is_vm_hugetlb_page(vma)))
-		return hugetlb_fault(mm, vma, address, write_access);
+	if (unlikely(has_pt_op(vma, fault)))
+		return pt_op(vma, fault)(mm, vma, address, write_access);
 
 	pgd = pgd_offset(mm, address);
 	pud = pud_alloc(mm, pgd, address);

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 7/7] hugetlbfs fault handler
@ 2007-02-19 18:32   ` Adam Litke
  0 siblings, 0 replies; 82+ messages in thread
From: Adam Litke @ 2007-02-19 18:32 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, agl

Signed-off-by: Adam Litke <agl@us.ibm.com>
---

 fs/hugetlbfs/inode.c |    1 +
 mm/hugetlb.c         |    4 +++-
 mm/memory.c          |    4 ++--
 3 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 3461f9b..1de73c1 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -568,6 +568,7 @@ static struct pagetable_operations_struct hugetlbfs_pagetable_ops = {
 	.unmap_page_range	= unmap_hugepage_range,
 	.change_protection	= hugetlb_change_protection,
 	.free_pgtable_range	= hugetlb_free_pgd_range,
+	.fault			= hugetlb_fault,
 };
 
 static struct inode_operations hugetlbfs_dir_inode_operations = {
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3bcc0db..63de369 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -588,6 +588,8 @@ int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	unsigned long vaddr = *position;
 	int remainder = *length;
 
+	BUG_ON(!has_pt_op(vma, fault));
+
 	spin_lock(&mm->page_table_lock);
 	while (vaddr < vma->vm_end && remainder) {
 		pte_t *pte;
@@ -604,7 +606,7 @@ int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			int ret;
 
 			spin_unlock(&mm->page_table_lock);
-			ret = hugetlb_fault(mm, vma, vaddr, 0);
+			ret = pt_op(vma, fault)(mm, vma, vaddr, 0);
 			spin_lock(&mm->page_table_lock);
 			if (ret == VM_FAULT_MINOR)
 				continue;
diff --git a/mm/memory.c b/mm/memory.c
index 442e6b2..c8e17b4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2455,8 +2455,8 @@ int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	count_vm_event(PGFAULT);
 
-	if (unlikely(is_vm_hugetlb_page(vma)))
-		return hugetlb_fault(mm, vma, address, write_access);
+	if (unlikely(has_pt_op(vma, fault)))
+		return pt_op(vma, fault)(mm, vma, address, write_access);
 
 	pgd = pgd_offset(mm, address);
 	pud = pud_alloc(mm, pgd, address);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 82+ messages in thread

end of thread, other threads:[~2007-03-23 14:57 UTC | newest]

Thread overview: 82+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-03-19 20:05 [PATCH 0/7] [RFC] hugetlb: pagetable_operations API (V2) Adam Litke
2007-03-19 20:05 ` Adam Litke
2007-03-19 20:05 ` [PATCH 1/7] Introduce the pagetable_operations and associated helper macros Adam Litke
2007-03-19 20:05   ` Adam Litke
2007-03-20 23:24   ` Dave Hansen
2007-03-20 23:24     ` Dave Hansen
2007-03-21 14:50     ` Adam Litke
2007-03-21 14:50       ` Adam Litke
2007-03-21 15:05       ` Arjan van de Ven
2007-03-21 15:05         ` Arjan van de Ven
2007-03-21  4:18   ` Nick Piggin
2007-03-21  4:18     ` Nick Piggin
2007-03-21  4:52     ` William Lee Irwin III
2007-03-21  4:52       ` William Lee Irwin III
2007-03-21  5:07       ` Nick Piggin
2007-03-21  5:07         ` Nick Piggin
2007-03-21  5:41         ` William Lee Irwin III
2007-03-21  5:41           ` William Lee Irwin III
2007-03-21  6:51           ` Nick Piggin
2007-03-21  6:51             ` Nick Piggin
2007-03-21  7:36             ` Nick Piggin
2007-03-21  7:36               ` Nick Piggin
2007-03-21 10:46             ` William Lee Irwin III
2007-03-21 10:46               ` William Lee Irwin III
2007-03-21 15:17     ` Adam Litke
2007-03-21 15:17       ` Adam Litke
2007-03-21 16:00       ` Christoph Hellwig
2007-03-21 16:00         ` Christoph Hellwig
2007-03-21 23:03         ` Nick Piggin
2007-03-21 23:03           ` Nick Piggin
2007-03-21 23:02       ` Nick Piggin
2007-03-21 23:02         ` Nick Piggin
2007-03-21 23:32         ` William Lee Irwin III
2007-03-21 23:32           ` William Lee Irwin III
2007-03-19 20:05 ` [PATCH 2/7] copy_vma for hugetlbfs Adam Litke
2007-03-19 20:05   ` Adam Litke
2007-03-19 20:05 ` [PATCH 3/7] pin_pages for hugetlb Adam Litke
2007-03-19 20:05   ` Adam Litke
2007-03-19 20:05 ` [PATCH 4/7] unmap_page_range " Adam Litke
2007-03-19 20:05   ` Adam Litke
2007-03-20 23:27   ` Dave Hansen
2007-03-20 23:27     ` Dave Hansen
2007-03-19 20:05 ` [PATCH 5/7] change_protection " Adam Litke
2007-03-19 20:05   ` Adam Litke
2007-03-19 20:06 ` [PATCH 6/7] free_pgtable_range " Adam Litke
2007-03-19 20:06   ` Adam Litke
2007-03-19 20:06 ` [PATCH 7/7] hugetlbfs fault handler Adam Litke
2007-03-19 20:06   ` Adam Litke
2007-03-20 23:50 ` [PATCH 0/7] [RFC] hugetlb: pagetable_operations API (V2) Dave Hansen
2007-03-20 23:50   ` Dave Hansen
2007-03-21  1:17 ` William Lee Irwin III
2007-03-21  1:17   ` William Lee Irwin III
2007-03-21 15:55 ` Hugh Dickins
2007-03-21 15:55   ` Hugh Dickins
2007-03-21 16:01   ` Christoph Hellwig
2007-03-21 16:01     ` Christoph Hellwig
2007-03-21 16:23   ` William Lee Irwin III
2007-03-21 17:08     ` Hugh Dickins
2007-03-21 17:42       ` William Lee Irwin III
2007-03-21 19:43 ` pagetable_ops: Hugetlb character device example Adam Litke
2007-03-21 19:43   ` Adam Litke
2007-03-21 19:51   ` Valdis.Kletnieks
2007-03-21 20:26     ` Adam Litke
2007-03-21 20:26       ` Adam Litke
2007-03-21 22:26     ` William Lee Irwin III
2007-03-21 22:26       ` William Lee Irwin III
2007-03-21 22:53       ` Matt Mackall
2007-03-21 22:53         ` Matt Mackall
2007-03-21 23:35         ` William Lee Irwin III
2007-03-21 23:35           ` William Lee Irwin III
2007-03-22  0:31           ` Matt Mackall
2007-03-22  0:31             ` Matt Mackall
2007-03-22 10:38   ` Christoph Hellwig
2007-03-22 10:38     ` Christoph Hellwig
2007-03-22 15:42     ` Mel Gorman
2007-03-22 15:42       ` Mel Gorman
2007-03-22 18:15       ` Christoph Hellwig
2007-03-22 18:15         ` Christoph Hellwig
2007-03-23 14:57         ` Mel Gorman
2007-03-23 14:57           ` Mel Gorman
  -- strict thread matches above, loose matches on Subject: below --
2007-02-19 18:31 [PATCH 0/7] [RFC] hugetlb: pagetable_operations API Adam Litke
2007-02-19 18:32 ` [PATCH 7/7] hugetlbfs fault handler Adam Litke
2007-02-19 18:32   ` Adam Litke

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.