linux-kernel.vger.kernel.org archive mirror
* RFC: COW for hugepages
@ 2004-04-07  7:42 David Gibson
  2004-04-07  7:53 ` Andrew Morton
                   ` (3 more replies)
  0 siblings, 4 replies; 22+ messages in thread
From: David Gibson @ 2004-04-07  7:42 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel
  Cc: Benjamin Herrenschmidt, Anton Blanchard, Paul Mackerras, linuxppc64-dev

Currently the kernel does not implement copy-on-write for huge pages -
in fact any sort of page fault on a hugepage results in a SIGBUS.
This means that hugepages *always* have MAP_SHARED semantics, even if
MAP_PRIVATE is requested, and in particular that they are always
shared across a fork().  Particularly when using hugetlbfs as just a
source of quasi-anonymous memory, those are rather strange semantics.

So, the patch below adds COW support for hugepages.  It has all the
necessary generic code changes and the arch-specific code for PPC64
(because that's what I have to hand to test on).  Other architectures
will just get SIGBUS on any hugepage fault, as before.

This does introduce an inconsistency into hugepage semantics, in that
the initial hugepage mapping is always prefaulted (so you know at
mmap() time whether or not you were able to get enough hugepage
memory), whereas after a fork(), or on a MAP_PRIVATE mapping its
possible to get a SIGBUS if a COW fails due to a lack of hugepages.
The inconsistency is odd, but I think it's actually the best semantics
we can get simply, because:

- We want to prefault normal hugepage mappings and fail early, because
the normal use will be some big program grabbing a huge pile at
startup.  We'd rather it just fail to get the mapping if there aren't
enough hugepages, so it can request a smaller amount or die
gracefully, rather than getting 3/4 of the way through its big
number-crunching job (say) and then falling over with a SIGBUS.

- We don't want to prefault pages on fork(), because of the (probably
common) case of some big hugepage-using program calling a little shell
function with system().  If we copy all the hugepages on fork(), a) it
will be very slow and b) it could fail if there aren't enough
hugepages, even though the child is actually just about to exec() and
won't need any of them.

Doing the COW for hugepages turns out not to be terribly difficult.
Is there any reason not to apply this patch?
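
To make the semantic difference concrete, here is a trivial
(illustrative only) test; it assumes hugetlbfs is mounted at /mnt/huge
and at least two free hugepages, and error handling is omitted:

	/* Without hugepage COW the child's store below either aliases the
	 * parent's page (MAP_SHARED-like semantics) or takes a SIGBUS;
	 * with the patch it behaves like ordinary MAP_PRIVATE memory. */
	#include <stdio.h>
	#include <fcntl.h>
	#include <unistd.h>
	#include <sys/mman.h>
	#include <sys/wait.h>

	#define HPAGE_SIZE	(16UL * 1024 * 1024)	/* 16M hugepages on ppc64 */

	int main(void)
	{
		int fd = open("/mnt/huge/cowtest", O_CREAT | O_RDWR, 0600);
		char *p = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE, fd, 0);

		p[0] = 'P';			/* parent touches the hugepage */
		if (fork() == 0) {
			p[0] = 'C';		/* child write should trigger COW */
			_exit(0);
		}
		wait(NULL);
		printf("parent sees '%c'\n", p[0]);	/* 'P' with private semantics */
		return 0;
	}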

Index: working-2.6/arch/ppc64/mm/hugetlbpage.c
===================================================================
--- working-2.6.orig/arch/ppc64/mm/hugetlbpage.c	2004-04-07 13:49:56.000000000 +1000
+++ working-2.6/arch/ppc64/mm/hugetlbpage.c	2004-04-07 15:12:01.546762168 +1000
@@ -117,6 +117,16 @@
 #define hugepte_page(x)	pfn_to_page(hugepte_pfn(x))
 #define hugepte_none(x)	(!(hugepte_val(x) & _HUGEPAGE_PFN))
 
+#define hugepte_write(x) (hugepte_val(x) & _HUGEPAGE_RW)
+#define hugepte_same(A,B) \
+	(((hugepte_val(A) ^ hugepte_val(B)) & ~_HUGEPAGE_HPTEFLAGS) == 0)
+
+static inline hugepte_t hugepte_mkwrite(hugepte_t pte)
+{
+	hugepte_val(pte) |= _HUGEPAGE_RW;
+	return pte;
+}
+
 
 static void free_huge_page(struct page *page);
 static void flush_hash_hugepage(mm_context_t context, unsigned long ea,
@@ -324,6 +334,16 @@
 	struct page *ptepage;
 	unsigned long addr = vma->vm_start;
 	unsigned long end = vma->vm_end;
+	cpumask_t tmp;
+	int cow;
+	int local;
+
+	/* XXX are there races with checking cpu_vm_mask? - Anton */
+	tmp = cpumask_of_cpu(smp_processor_id());
+	if (cpus_equal(vma->vm_mm->cpu_vm_mask, tmp))
+		local = 1;
+
+	cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
 
 	while (addr < end) {
 		BUG_ON(! in_hugepage_area(src->context, addr));
@@ -334,6 +354,17 @@
 			return -ENOMEM;
 
 		src_pte = hugepte_offset(src, addr);
+
+		if (cow) {
+			entry = __hugepte(hugepte_update(src_pte, 
+							 _HUGEPAGE_RW
+							 | _HUGEPAGE_HPTEFLAGS,
+							 0));
+			if ((addr % HPAGE_SIZE) == 0)
+				flush_hash_hugepage(src->context, addr,
+						    entry, local);
+		}
+
 		entry = *src_pte;
 		
 		if ((addr % HPAGE_SIZE) == 0) {
@@ -507,12 +538,16 @@
 	struct mm_struct *mm = current->mm;
 	unsigned long addr;
 	int ret = 0;
+	int writable;
 
 	WARN_ON(!is_vm_hugetlb_page(vma));
 	BUG_ON((vma->vm_start % HPAGE_SIZE) != 0);
 	BUG_ON((vma->vm_end % HPAGE_SIZE) != 0);
 
 	spin_lock(&mm->page_table_lock);
+
+	writable = (vma->vm_flags & VM_WRITE) && (vma->vm_flags & VM_SHARED);
+
 	for (addr = vma->vm_start; addr < vma->vm_end; addr += HPAGE_SIZE) {
 		unsigned long idx;
 		hugepte_t *pte = hugepte_alloc(mm, addr);
@@ -542,15 +577,25 @@
 				ret = -ENOMEM;
 				goto out;
 			}
-			ret = add_to_page_cache(page, mapping, idx, GFP_ATOMIC);
-			unlock_page(page);
+			/* This is a new page, all full of zeroes.  If
+			 * we're MAP_SHARED, the page needs to go into
+			 * the page cache.  If it's MAP_PRIVATE it
+			 * might as well be made "anonymous" now or
+			 * we'll just have to copy it on the first
+			 * write. */
+			if (vma->vm_flags & VM_SHARED) {
+				ret = add_to_page_cache(page, mapping, idx, GFP_ATOMIC);
+				unlock_page(page);
+			} else {
+				writable = (vma->vm_flags & VM_WRITE);
+			}
 			if (ret) {
 				hugetlb_put_quota(mapping);
 				free_huge_page(page);
 				goto out;
 			}
 		}
-		setup_huge_pte(mm, page, pte, vma->vm_flags & VM_WRITE);
+		setup_huge_pte(mm, page, pte, writable);
 	}
 out:
 	spin_unlock(&mm->page_table_lock);
@@ -701,7 +746,7 @@
 	 * prevented then send the problem up to do_page_fault.
 	 */
 	is_write = access & _PAGE_RW;
-	if (unlikely(is_write && !(hugepte_val(*ptep) & _HUGEPAGE_RW)))
+	if (unlikely(is_write && !hugepte_write(*ptep)))
 		return 1;
 
 	/*
@@ -944,6 +989,121 @@
 }
 EXPORT_SYMBOL(hugetlb_total_pages);
 
+static int hugepage_cow(struct mm_struct *mm, struct vm_area_struct *vma,
+			unsigned long address, hugepte_t *ptep, hugepte_t pte)
+{
+	struct page *old_page, *new_page;
+	int i;
+	cpumask_t tmp;
+	int local;
+
+	BUG_ON(!pfn_valid(hugepte_pfn(*ptep)));
+
+	old_page = hugepte_page(*ptep);
+
+	/* XXX are there races with checking cpu_vm_mask? - Anton */
+	tmp = cpumask_of_cpu(smp_processor_id());
+	if (cpus_equal(vma->vm_mm->cpu_vm_mask, tmp))
+		local = 1;
+
+	/* If no-one else is actually using this page, avoid the copy
+	 * and just make the page writable */
+	if (!TestSetPageLocked(old_page)) {
+		int avoidcopy = (page_count(old_page) == 1);
+		unlock_page(old_page);
+		if (avoidcopy) {
+			for (i = 0; i < HUGEPTE_BATCH_SIZE; i++)
+				set_hugepte(ptep+i, hugepte_mkwrite(pte));
+			
+
+			pte = __hugepte(hugepte_update(ptep, _HUGEPAGE_HPTEFLAGS, 0));
+			if (hugepte_val(pte) & _HUGEPAGE_HASHPTE)
+				flush_hash_hugepage(mm->context, address,
+						    pte, local);
+			spin_unlock(&mm->page_table_lock);
+			return VM_FAULT_MINOR;
+		}
+	}
+
+	page_cache_get(old_page);
+
+	spin_unlock(&mm->page_table_lock);
+
+	new_page = alloc_hugetlb_page();
+	if (! new_page) {
+		page_cache_release(old_page);
+
+		/* Logically this is OOM, not a SIGBUS, but an OOM
+		 * could cause the kernel to go killing other
+		 * processes which won't help the hugepage situation
+		 * at all (?) */
+		return VM_FAULT_SIGBUS;
+	}
+
+	for (i = 0; i < HPAGE_SIZE/PAGE_SIZE; i++)
+		copy_user_highpage(new_page + i, old_page + i, address + i*PAGE_SIZE);
+
+	spin_lock(&mm->page_table_lock);
+
+	/* XXX are there races with checking cpu_vm_mask? - Anton */
+	tmp = cpumask_of_cpu(smp_processor_id());
+	if (cpus_equal(vma->vm_mm->cpu_vm_mask, tmp))
+		local = 1;
+
+	ptep = hugepte_offset(mm, address);
+	if (hugepte_same(*ptep, pte)) {
+		/* Break COW */
+		for (i = 0; i < HUGEPTE_BATCH_SIZE; i++)
+			hugepte_update(ptep, ~0,
+				       hugepte_val(mk_hugepte(new_page, 1)));
+
+		if (hugepte_val(pte) & _HUGEPAGE_HASHPTE)
+			flush_hash_hugepage(mm->context, address,
+					    pte, local);
+
+		/* Make the old page be freed below */
+		new_page = old_page;
+	}
+	page_cache_release(new_page);
+	page_cache_release(old_page);
+	spin_unlock(&mm->page_table_lock);
+	return VM_FAULT_MINOR;
+}
+
+int handle_hugetlb_mm_fault(struct mm_struct *mm, struct vm_area_struct * vma,
+			    unsigned long address, int write_access)
+{
+	hugepte_t *ptep;
+	int rc = VM_FAULT_SIGBUS;
+
+	spin_lock(&mm->page_table_lock);
+
+	ptep = hugepte_offset(mm, address & HPAGE_MASK);
+
+	if ( (! ptep) || hugepte_none(*ptep))
+		goto fail;
+
+	/* Otherwise, there ought to be a real hugepte here */
+	BUG_ON(hugepte_bad(*ptep));
+
+	rc = VM_FAULT_MINOR;
+
+	if (! (write_access && !hugepte_write(*ptep))) {
+		printk(KERN_WARNING "Unexpected hugepte fault (wr=%d hugepte=%08x)\n",
+		     write_access, hugepte_val(*ptep));
+		goto fail;
+	}
+		
+	/* The only faults we should actually get are COWs */
+	/* this drops the page_table_lock */
+	return hugepage_cow(mm, vma, address, ptep, *ptep); 
+	
+ fail:
+	spin_unlock(&mm->page_table_lock);
+
+	return rc;
+}
+
 /*
  * We cannot handle pagefaults against hugetlb pages at all.  They cause
  * handle_mm_fault() to try to instantiate regular-sized pages in the
Index: working-2.6/include/asm-ppc64/page.h
===================================================================
--- working-2.6.orig/include/asm-ppc64/page.h	2004-04-06 11:32:00.000000000 +1000
+++ working-2.6/include/asm-ppc64/page.h	2004-04-07 15:12:01.548761864 +1000
@@ -55,6 +55,7 @@
 	  && touches_hugepage_low_range((addr), (len))))
 #define hugetlb_free_pgtables free_pgtables
 #define HAVE_ARCH_HUGETLB_UNMAPPED_AREA
+#define ARCH_HANDLES_HUGEPAGE_FAULTS
 
 #define in_hugepage_area(context, addr) \
 	((cur_cpu_spec->cpu_features & CPU_FTR_16M_PAGE) && \
Index: working-2.6/include/linux/hugetlb.h
===================================================================
--- working-2.6.orig/include/linux/hugetlb.h	2004-04-05 14:04:37.000000000 +1000
+++ working-2.6/include/linux/hugetlb.h	2004-04-07 15:12:01.557760496 +1000
@@ -50,6 +50,13 @@
 int prepare_hugepage_range(unsigned long addr, unsigned long len);
 #endif
 
+#ifndef ARCH_HANDLES_HUGEPAGE_FAULTS
+#define handle_hugetlb_mm_fault(mm, vma, a, w)	(VM_FAULT_SIGBUS)
+#else
+int handle_hugetlb_mm_fault(struct mm_struct *mm, struct vm_area_struct * vma,
+			    unsigned long address, int write_access);
+#endif
+
 #else /* !CONFIG_HUGETLB_PAGE */
 
 static inline int is_vm_hugetlb_page(struct vm_area_struct *vma)
Index: working-2.6/mm/memory.c
===================================================================
--- working-2.6.orig/mm/memory.c	2004-03-16 11:31:35.000000000 +1100
+++ working-2.6/mm/memory.c	2004-04-07 15:12:01.562759736 +1000
@@ -1619,7 +1619,8 @@
 	inc_page_state(pgfault);
 
 	if (is_vm_hugetlb_page(vma))
-		return VM_FAULT_SIGBUS;	/* mapping truncation does this. */
+		/* mapping truncation can do this. */
+		return handle_hugetlb_mm_fault(mm, vma, address, write_access);
 
 	/*
 	 * We need the page table lock to synchronize with kswapd



-- 
David Gibson			| For every complex problem there is a
david AT gibson.dropbear.id.au	| solution which is simple, neat and
				| wrong.
http://www.ozlabs.org/people/dgibson


* Re: RFC: COW for hugepages
  2004-04-07  7:42 RFC: COW for hugepages David Gibson
@ 2004-04-07  7:53 ` Andrew Morton
  2004-04-07  8:03   ` Anton Blanchard
  2004-04-07  8:29   ` Benjamin Herrenschmidt
  2004-04-07  9:00 ` Zoltan Menyhart
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 22+ messages in thread
From: Andrew Morton @ 2004-04-07  7:53 UTC (permalink / raw)
  To: David Gibson; +Cc: linux-kernel, benh, anton, paulus, linuxppc64-dev

David Gibson <david@gibson.dropbear.id.au> wrote:
>
> Doing the COW for hugepages turns out not to be terribly difficult.
>  Is there any reason not to apply this patch?

Not much, except that it adds stuff to the kernel.

Does anyone actually have a real-world need for the feature?


* Re: RFC: COW for hugepages
  2004-04-07  7:53 ` Andrew Morton
@ 2004-04-07  8:03   ` Anton Blanchard
  2004-04-07  8:29   ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 22+ messages in thread
From: Anton Blanchard @ 2004-04-07  8:03 UTC (permalink / raw)
  To: Andrew Morton; +Cc: David Gibson, linux-kernel, benh, paulus, linuxppc64-dev

 
> Not much, except that it adds stuff to the kernel.
> 
> Does anyone actually have a real-world need for the feature?

I've been playing with a malloc library that uses largepages. It needs
largepage COW to work.

Anton


* Re: RFC: COW for hugepages
  2004-04-07  7:53 ` Andrew Morton
  2004-04-07  8:03   ` Anton Blanchard
@ 2004-04-07  8:29   ` Benjamin Herrenschmidt
  2004-04-08  1:57     ` David Gibson
  1 sibling, 1 reply; 22+ messages in thread
From: Benjamin Herrenschmidt @ 2004-04-07  8:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Gibson, Linux Kernel list, Anton Blanchard, Paul Mackerras,
	linuxppc64-dev

On Wed, 2004-04-07 at 17:53, Andrew Morton wrote:
> David Gibson <david@gibson.dropbear.id.au> wrote:
> >
> > Doing the COW for hugepages turns out not to be terribly difficult.
> >  Is there any reason not to apply this patch?
> 
> Not much, except that it adds stuff to the kernel.
> 
> Does anyone actually have a real-world need for the feature?

Yup, porting some apps to use hugepages, when those apps typically
rely on fork & COW semantics.  Also, implicit use of hugepages
(usually via a malloc override library).

Ben.




* Re: RFC: COW for hugepages
  2004-04-07  7:42 RFC: COW for hugepages David Gibson
  2004-04-07  7:53 ` Andrew Morton
@ 2004-04-07  9:00 ` Zoltan Menyhart
  2004-04-08  1:53   ` David Gibson
  2004-04-07 12:34 ` Andi Kleen
  2004-04-07 15:21 ` Dave Hansen
  3 siblings, 1 reply; 22+ messages in thread
From: Zoltan Menyhart @ 2004-04-07  9:00 UTC (permalink / raw)
  To: David Gibson
  Cc: Andrew Morton, linux-kernel, Benjamin Herrenschmidt,
	Anton Blanchard, Paul Mackerras, linuxppc64-dev

David,

Why not just add a flag for a VMA telling if you want / do not want to
copy it on "fork()" ? E.g.:

dup_mmap():

	for (mpnt = current->mm->mmap ; mpnt ; mpnt = mpnt->vm_next) {

		if (mpnt->vm_flags & VM_HUGETLB_DONT_COPY)
			<do nothing>
	}

Regards,

Zoltán Menyhárt


* Re: RFC: COW for hugepages
  2004-04-07  7:42 RFC: COW for hugepages David Gibson
  2004-04-07  7:53 ` Andrew Morton
  2004-04-07  9:00 ` Zoltan Menyhart
@ 2004-04-07 12:34 ` Andi Kleen
  2004-04-07 14:27   ` Anton Blanchard
                     ` (2 more replies)
  2004-04-07 15:21 ` Dave Hansen
  3 siblings, 3 replies; 22+ messages in thread
From: Andi Kleen @ 2004-04-07 12:34 UTC (permalink / raw)
  To: David Gibson; +Cc: akpm, linux-kernel, benh, anton, paulus, linuxppc64-dev

On Wed, 7 Apr 2004 17:42:39 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> Currently the kernel does not implement copy-on-write for huge pages -
> in fact any sort of page fault on a hugepage results in a SIGBUS.
> This means that hugepages *always* have MAP_SHARED semantics, even if
> MAP_PRIVATE is requested, and in particular that they are always
> shared across a fork().  Particularly when using hugetlbfs as just a
> source of quasi-anonymous memory, those are rather strange semantics.

[...]

Implementing this for ppc64 only is just wrong.  Before you do this,
I would suggest factoring out the common code in the various hugetlbpage
implementations and then implementing it in common code.

-Andi


* Re: RFC: COW for hugepages
  2004-04-07 12:34 ` Andi Kleen
@ 2004-04-07 14:27   ` Anton Blanchard
  2004-04-07 14:50     ` Andi Kleen
  2004-04-08  1:11   ` Benjamin Herrenschmidt
  2004-04-08  1:50   ` David Gibson
  2 siblings, 1 reply; 22+ messages in thread
From: Anton Blanchard @ 2004-04-07 14:27 UTC (permalink / raw)
  To: Andi Kleen; +Cc: David Gibson, akpm, linux-kernel, benh, paulus, linuxppc64-dev

 
> Implementing this for ppc64 only is just wrong. Before you do this 
> I would suggest to factor out the common code in the various hugetlbpage
> implementations and then implement it in common code.

I could say a similar thing about your i386-specific largepage modifications
in the NUMA API :)

We should probably look at making lots of the arch-specific hugetlb code
common, but I'm not sure we want that to become a prerequisite for merging
the NUMA API and hugepage COW.

Anton


* Re: RFC: COW for hugepages
  2004-04-07 14:27   ` Anton Blanchard
@ 2004-04-07 14:50     ` Andi Kleen
  2004-04-07 21:41       ` Anton Blanchard
  0 siblings, 1 reply; 22+ messages in thread
From: Andi Kleen @ 2004-04-07 14:50 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: david, akpm, linux-kernel, benh, paulus, linuxppc64-dev

On Thu, 8 Apr 2004 00:27:48 +1000
Anton Blanchard <anton@samba.org> wrote:

>  
> > Implementing this for ppc64 only is just wrong. Before you do this 
> > I would suggest to factor out the common code in the various hugetlbpage
> > implementations and then implement it in common code.
> 
> I could say a similar thing about your i386 specific largepage modifications
> in the NUMA api :)

All they did was modify the code to do lazy faulting.  That is
architecture-specific.

(And add the mpol code, but that was pretty minor.)

COW is a different thing though.

> 
> We should probably look at making lots of the arch specific hugetlb code
> common but im not sure we want that to become a prerequisite for merging
> NUMA API and hugepage COW.

That would just make the merging later harder. Making it common first and then
adding features would be better.

-Andi


* Re: RFC: COW for hugepages
  2004-04-07  7:42 RFC: COW for hugepages David Gibson
                   ` (2 preceding siblings ...)
  2004-04-07 12:34 ` Andi Kleen
@ 2004-04-07 15:21 ` Dave Hansen
  2004-04-08  3:22   ` David Gibson
  3 siblings, 1 reply; 22+ messages in thread
From: Dave Hansen @ 2004-04-07 15:21 UTC (permalink / raw)
  To: David Gibson
  Cc: Andrew Morton, Linux Kernel Mailing List, Benjamin Herrenschmidt,
	Anton Blanchard, Paul Mackerras, PPC64 External List

On Wed, 2004-04-07 at 00:42, David Gibson wrote:
> +	/* XXX are there races with checking cpu_vm_mask? - Anton */
> +	tmp = cpumask_of_cpu(smp_processor_id());
> +	if (cpus_equal(vma->vm_mm->cpu_vm_mask, tmp))
> +		local = 1;
> +
> +	cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;

You're under the pagetable lock for that mm, so you at least don't have
to worry about preemption.  But that definitely deserves a comment on
why you didn't disable preemption.

Also, as for racing with other users of cpu_vm_mask, there aren't any
users that clear other CPUs' bits apart from initialization.

One thing that I didn't see in this patch is any check of capabilities
or permissions.  What if a privileged user sets up a single page huge
page, then does a setuid() to a lower privilege level?  Is that not a
valid way to get hugetlb pages to an unprivileged user?  That other user
is free to go fork and write to the pages to their heart's content,
consuming as many huge pages as are present in the system.

It looks to me like dup_mmap() drops VM_LOCKED on mlock()'d VMAs at fork
time.  Do we need to do the same for hugetlb pages?  
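
For reference, the dup_mmap() behaviour referred to here is, roughly,
the line in kernel/fork.c that clears the flag on the child's copy of
each VMA:

	tmp->vm_flags &= ~VM_LOCKED;	/* mlock() state is not inherited */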

-- Dave



* Re: RFC: COW for hugepages
  2004-04-07 14:50     ` Andi Kleen
@ 2004-04-07 21:41       ` Anton Blanchard
  2004-04-07 21:47         ` Andi Kleen
  0 siblings, 1 reply; 22+ messages in thread
From: Anton Blanchard @ 2004-04-07 21:41 UTC (permalink / raw)
  To: Andi Kleen; +Cc: david, akpm, linux-kernel, benh, paulus, linuxppc64-dev

 
> All they did was to modify the code to lazy faulting. That is
> architecture specific
> 
> (and add the mpol code, but that was pretty minor) 
> 
> COW is a different thing though.

Why is it architecture-specific?  I don't understand why you can't make
lazy faulting common code.
 
Anton


* Re: RFC: COW for hugepages
  2004-04-07 21:41       ` Anton Blanchard
@ 2004-04-07 21:47         ` Andi Kleen
  0 siblings, 0 replies; 22+ messages in thread
From: Andi Kleen @ 2004-04-07 21:47 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: david, akpm, linux-kernel, benh, paulus, linuxppc64-dev

On Thu, 8 Apr 2004 07:41:16 +1000
Anton Blanchard <anton@samba.org> wrote:

>  
> > All they did was to modify the code to lazy faulting. That is
> > architecture specific
> > 
> > (and add the mpol code, but that was pretty minor) 
> > 
> > COW is a different thing though.
> 
> Why is it architecture specific? I dont understand why you cant make
> lazy faulting common code.

My understanding was that you needed special magic on IA64 and PPC64. If that's
wrong you can probably do it, but you may need new TLB flush primitives.

-Andi


* Re: RFC: COW for hugepages
  2004-04-07 12:34 ` Andi Kleen
  2004-04-07 14:27   ` Anton Blanchard
@ 2004-04-08  1:11   ` Benjamin Herrenschmidt
  2004-04-08  2:01     ` Andrew Morton
  2004-04-08  1:50   ` David Gibson
  2 siblings, 1 reply; 22+ messages in thread
From: Benjamin Herrenschmidt @ 2004-04-08  1:11 UTC (permalink / raw)
  To: Andi Kleen
  Cc: David Gibson, Andrew Morton, Linux Kernel list, Anton Blanchard,
	Paul Mackerras, linuxppc64-dev


> Implementing this for ppc64 only is just wrong. Before you do this 
> I would suggest to factor out the common code in the various hugetlbpage
> implementations and then implement it in common code.

Have you actually looked at it, and at how huge pages are implemented
on the various architectures?

Honestly, I don't think we have any common abstraction on things
like hugeptes etc... actually, archs aren't even required to use
PTEs at all.

I don't see how we can make that code arch-neutral, at least not
without a major redesign of the whole large-page mechanism.

Ben.




* Re: RFC: COW for hugepages
  2004-04-07 12:34 ` Andi Kleen
  2004-04-07 14:27   ` Anton Blanchard
  2004-04-08  1:11   ` Benjamin Herrenschmidt
@ 2004-04-08  1:50   ` David Gibson
  2 siblings, 0 replies; 22+ messages in thread
From: David Gibson @ 2004-04-08  1:50 UTC (permalink / raw)
  To: Andi Kleen; +Cc: akpm, linux-kernel, benh, anton, paulus, linuxppc64-dev

On Wed, Apr 07, 2004 at 02:34:47PM +0200, Andi Kleen wrote:
> On Wed, 7 Apr 2004 17:42:39 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > Currently the kernel does not implement copy-on-write for huge pages -
> > in fact any sort of page fault on a hugepage results in a SIGBUS.
> > This means that hugepages *always* have MAP_SHARED semantics, even if
> > MAP_PRIVATE is requested, and in particular that they are always
> > shared across a fork().  Particularly when using hugetlbfs as just a
> > source of quasi-anonymous memory, those are rather strange semantics.
> 
> [...]
> 
> Implementing this for ppc64 only is just wrong. Before you do this 
> I would suggest to factor out the common code in the various hugetlbpage
> implementations and then implement it in common code.

Yes, I know it needs to be implemented for the other archs, but I
wanted to get some feedback on the concept before diving into the
other arch details to figure out how to implement it there.

-- 
David Gibson			| For every complex problem there is a
david AT gibson.dropbear.id.au	| solution which is simple, neat and
				| wrong.
http://www.ozlabs.org/people/dgibson


* Re: RFC: COW for hugepages
  2004-04-07  9:00 ` Zoltan Menyhart
@ 2004-04-08  1:53   ` David Gibson
  2004-04-08 11:10     ` Zoltan Menyhart
  0 siblings, 1 reply; 22+ messages in thread
From: David Gibson @ 2004-04-08  1:53 UTC (permalink / raw)
  To: Zoltan.Menyhart
  Cc: Andrew Morton, linux-kernel, Benjamin Herrenschmidt,
	Anton Blanchard, Paul Mackerras, linuxppc64-dev

On Wed, Apr 07, 2004 at 11:00:43AM +0200, Zoltan Menyhart wrote:
> David,
> 
> Why not just add a flag for a VMA telling if you want / do not want to
> copy it on "fork()" ? E.g.:
> 
> dup_mmap():
> 
> 	for (= current->mm->mmap ; mpnt ; mpnt = mpnt->vm_next) {
> 
> 		if (mpnt->vm_flags & VM_HUGETLB_DONT_COPY)
> 			<do nothing>
> 	}
> 

Um.. why would that be useful?

-- 
David Gibson			| For every complex problem there is a
david AT gibson.dropbear.id.au	| solution which is simple, neat and
				| wrong.
http://www.ozlabs.org/people/dgibson


* Re: RFC: COW for hugepages
  2004-04-07  8:29   ` Benjamin Herrenschmidt
@ 2004-04-08  1:57     ` David Gibson
  0 siblings, 0 replies; 22+ messages in thread
From: David Gibson @ 2004-04-08  1:57 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Andrew Morton, Linux Kernel list, Anton Blanchard,
	Paul Mackerras, linuxppc64-dev

On Wed, Apr 07, 2004 at 06:29:54PM +1000, Benjamin Herrenschmidt wrote:
> On Wed, 2004-04-07 at 17:53, Andrew Morton wrote:
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >
> > > Doing the COW for hugepages turns out not to be terribly difficult.
> > >  Is there any reason not to apply this patch?
> > 
> > Not much, except that it adds stuff to the kernel.
> > 
> > Does anyone actually have a real-world need for the feature?
> 
> Yup, porting some apps to use hugepages, when those apps
> rely on fork & cow semantics typically. Also, implicit use of
> hugepages (usually via a malloc override library).

I also have some experimental patches for putting ELF segments into
large pages (for HPC FORTRAN monsters with massive arrays in the BSS).
MAP_PRIVATE semantics (and hence COW) are a clear prerequisite for
that.

-- 
David Gibson			| For every complex problem there is a
david AT gibson.dropbear.id.au	| solution which is simple, neat and
				| wrong.
http://www.ozlabs.org/people/dgibson


* Re: RFC: COW for hugepages
  2004-04-08  1:11   ` Benjamin Herrenschmidt
@ 2004-04-08  2:01     ` Andrew Morton
  2004-04-08  3:09       ` David Gibson
  0 siblings, 1 reply; 22+ messages in thread
From: Andrew Morton @ 2004-04-08  2:01 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: ak, david, linux-kernel, anton, paulus, linuxppc64-dev

Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
>
> 
> > Implementing this for ppc64 only is just wrong. Before you do this 
> > I would suggest to factor out the common code in the various hugetlbpage
> > implementations and then implement it in common code.
> 
> Have you actually looked at it and how huge pages are implemented
> on the various architectures ?
> 
> Honestly, I don't think we have any common abstraction on things
> like hugepte's etc... actually, archs aren't even required to use
> PTEs at all.
> 
> I don't see how we can make that code arch-neutral, at least not
> without a major redesign of the whole large pages mecanism.

I don't see much in the COW code which is ppc64-specific.  All the hardware
needs to do is to provide a way to make the big pages readonly.  With a bit
of an abstraction for the TLB manipulation in there it should be pretty
straightforward.

Certainly worth the attempt, no?


* Re: RFC: COW for hugepages
  2004-04-08  2:01     ` Andrew Morton
@ 2004-04-08  3:09       ` David Gibson
  2004-04-08  3:24         ` Andrew Morton
  0 siblings, 1 reply; 22+ messages in thread
From: David Gibson @ 2004-04-08  3:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Benjamin Herrenschmidt, ak, linux-kernel, anton, paulus, linuxppc64-dev

On Wed, Apr 07, 2004 at 07:01:26PM -0700, Andrew Morton wrote:
> Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> >
> > 
> > > Implementing this for ppc64 only is just wrong. Before you do this 
> > > I would suggest to factor out the common code in the various hugetlbpage
> > > implementations and then implement it in common code.
> > 
> > Have you actually looked at it and how huge pages are implemented
> > on the various architectures ?
> > 
> > Honestly, I don't think we have any common abstraction on things
> > like hugepte's etc... actually, archs aren't even required to use
> > PTEs at all.
> > 
> > I don't see how we can make that code arch-neutral, at least not
> > without a major redesign of the whole large pages mecanism.
> 
> I don't see much in the COW code which is ppc64-specific.  All the hardware
> needs to do is to provide a way to make the big pages readonly.  With a bit
> of an abstraction for the TLB manipulation in there it should be pretty
> straightforward.
> 
> Certainly worth the attempt, no?

Yes, you have a point.  However, doing it in a cross-arch way will
require building more of a shared abstraction for hugepage pte
entries than exists currently.  And that will mean making significant
changes to all the archs to create that abstraction.  I don't know
enough about the other archs to be confident of debugging such
changes, but I'll see what I can do.

That should let the actual handle_mm_fault->hugepage_cow codepath be
shared.  However, how the hugepage ptes fit in with the rest of the
pagetables varies from arch to arch - on ppc64 we're considering
putting hugepages into their own entirely separate pagetables, even -
so anything that actually walks the pagetables (like
copy_hugetlb_page_range()) will still have to be arch-specific.
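
For what it's worth, the split might look something like the sketch
below.  All of the names in it are made up for illustration (some archs
already have similar, but not identical, helpers):

	/* per-arch hooks: find, write-protect and (re)install a hugepage
	 * PTE, and flush whatever TLB/hash state the arch keeps for it */
	pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr);
	void huge_pte_wrprotect(struct mm_struct *mm, unsigned long addr,
				pte_t *ptep);
	void set_huge_pte(struct mm_struct *mm, struct vm_area_struct *vma,
			  struct page *page, pte_t *ptep, int writable);
	void flush_huge_tlb_page(struct vm_area_struct *vma, unsigned long addr);

	/* then the generic part of the COW path, very roughly: */
	static int hugepage_cow(struct mm_struct *mm, struct vm_area_struct *vma,
				unsigned long addr, pte_t *ptep)
	{
		struct page *old = pte_page(*ptep);
		struct page *new = alloc_huge_page();	/* hypothetical allocator */

		if (!new)
			return VM_FAULT_SIGBUS;
		copy_huge_page(new, old, addr);		/* hypothetical copy helper */
		set_huge_pte(mm, vma, new, ptep, 1);
		flush_huge_tlb_page(vma, addr);
		put_page(old);
		return VM_FAULT_MINOR;
	}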

-- 
David Gibson			| For every complex problem there is a
david AT gibson.dropbear.id.au	| solution which is simple, neat and
				| wrong.
http://www.ozlabs.org/people/dgibson


* Re: RFC: COW for hugepages
  2004-04-07 15:21 ` Dave Hansen
@ 2004-04-08  3:22   ` David Gibson
  0 siblings, 0 replies; 22+ messages in thread
From: David Gibson @ 2004-04-08  3:22 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andrew Morton, Linux Kernel Mailing List, Benjamin Herrenschmidt,
	Anton Blanchard, Paul Mackerras, PPC64 External List

On Wed, Apr 07, 2004 at 08:21:11AM -0700, Dave Hansen wrote:
> On Wed, 2004-04-07 at 00:42, David Gibson wrote:
> > +	/* XXX are there races with checking cpu_vm_mask? - Anton */
> > +	tmp = cpumask_of_cpu(smp_processor_id());
> > +	if (cpus_equal(vma->vm_mm->cpu_vm_mask, tmp))
> > +		local = 1;
> > +
> > +	cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
> 
> You're under the pagetable lock for that mm, so you at least don't have
> to worry about preemption.  But, that definitely at least deserves a
> comment on why you didn't disable preemption.
> 
> Also, for racing with other users of cpu_vm_mask, there aren't any other
> users that clear other cpus' bits other than initialization.  

Yes.  That comment was actually copied from code in the analogous
normal page path (from where it has since been removed).  I've altered
it appropriately.

> One thing that I didn't see in this patch is any check of capabilities
> or permissions.  What if a privileged user sets up a single page huge
> page, then does a setuid() to a lower privilege level?  Is that not a
> valid way to get hugetlb pages to an unprivileged user?  That other user
> is free to go fork and write to the pages to their heart's content,
> consuming as many huge pages as are present in the system.

That seems to me like a "don't do that" situation (you can avoid the
security problem by only giving MAP_SHARED pages to the unprivileged
process).  I don't see a better way to solve this without implementing
some sort of rlimit system for hugepages, which seems like overkill:
as far as I can tell, all the real applications for hugepages (huge
databases, HPC, benchmarks) tend to be the sorts of cases where the
hugepage-using program is the only thing you really care about on the
system.

> It looks to me like dup_mmap() drops VM_LOCKED on mlock()'d VMAs at fork
> time.  Do we need to do the same for hugetlb pages?  

Given that all huge pages are unswappable and therefore effectively
mlock()ed, I don't think so.

-- 
David Gibson			| For every complex problem there is a
david AT gibson.dropbear.id.au	| solution which is simple, neat and
				| wrong.
http://www.ozlabs.org/people/dgibson


* Re: RFC: COW for hugepages
  2004-04-08  3:09       ` David Gibson
@ 2004-04-08  3:24         ` Andrew Morton
  2004-04-08  3:56           ` David Gibson
  0 siblings, 1 reply; 22+ messages in thread
From: Andrew Morton @ 2004-04-08  3:24 UTC (permalink / raw)
  To: David Gibson; +Cc: benh, ak, linux-kernel, anton, paulus, linuxppc64-dev

David Gibson <david@gibson.dropbear.id.au> wrote:
>
> > I don't see much in the COW code which is ppc64-specific.  All the hardware
>  > needs to do is to provide a way to make the big pages readonly.  With a bit
>  > of an abstraction for the TLB manipulation in there it should be pretty
>  > straightforward.
>  > 
>  > Certainly worth the attempt, no?
> 
>  Yes, you have a point.  However doing it in a cross-arch way will
>  require building more of a shared abstraction about hugepage pte
>  entries that exists currently.  And that will mean making significant
>  changes to all the archs to create that abstraction.  I don't know
>  enough about the other archs to be confident of debugging such
>  changes, but I'll see what I can do.

Well, the first step is to consolidate the existing duplication in 2.6.5
before thinking about new features.  That's largely a cut-n-paste job which
I've been meaning to get onto but alas have not.  I don't want to dump it
on you just because you want to tend to your COWs, so if you have other
things to do, please let me know.

We could use the `weak' attribute in mm/hugetlbpage.c for those cases where
one arch really needs to do something different.
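
For instance (sketch only, with a made-up symbol name), the common file
could carry a weak default that any arch is free to override with its
own strong definition, with no #ifdef needed:

	/* mm/hugetlbpage.c: default, used unless an arch overrides it */
	int __attribute__((weak))
	hugetlb_arch_fault(struct mm_struct *mm, struct vm_area_struct *vma,
			   unsigned long address, int write_access)
	{
		return VM_FAULT_SIGBUS;	/* no hugepage COW on this arch */
	}

	/* arch/ppc64/mm/hugetlbpage.c: a non-weak definition of the same
	 * symbol here wins at link time and replaces the default */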



* Re: RFC: COW for hugepages
  2004-04-08  3:24         ` Andrew Morton
@ 2004-04-08  3:56           ` David Gibson
  0 siblings, 0 replies; 22+ messages in thread
From: David Gibson @ 2004-04-08  3:56 UTC (permalink / raw)
  To: Andrew Morton; +Cc: benh, ak, linux-kernel, anton, paulus, linuxppc64-dev

On Wed, Apr 07, 2004 at 08:24:43PM -0700, Andrew Morton wrote:
> David Gibson <david@gibson.dropbear.id.au> wrote:
> >
> > > I don't see much in the COW code which is ppc64-specific.  All the hardware
> >  > needs to do is to provide a way to make the big pages readonly.  With a bit
> >  > of an abstraction for the TLB manipulation in there it should be pretty
> >  > straightforward.
> >  > 
> >  > Certainly worth the attempt, no?
> > 
> >  Yes, you have a point.  However doing it in a cross-arch way will
> >  require building more of a shared abstraction about hugepage pte
> >  entries that exists currently.  And that will mean making significant
> >  changes to all the archs to create that abstraction.  I don't know
> >  enough about the other archs to be confident of debugging such
> >  changes, but I'll see what I can do.
> 
> Well the first step is to consolidate the existing duplication in 2.6.5
> before thinking about new features.  That's largely a cut-n-paste job which
> I've been meaning to get onto but alas have not.  I don't want to dump it
> on you just because you want to tend to your COWs so if you have other
> things to do, please let me know.

Well, I do have other things to do, so I'll try to look at the
consolidation when I get a chance.

> We could use the `weak' attribute in mm/hugetlbpage.c for those cases where
> one arch really needs to do something different.

Yes, that's an idea.

-- 
David Gibson			| For every complex problem there is a
david AT gibson.dropbear.id.au	| solution which is simple, neat and
				| wrong.
http://www.ozlabs.org/people/dgibson


* Re: RFC: COW for hugepages
  2004-04-08  1:53   ` David Gibson
@ 2004-04-08 11:10     ` Zoltan Menyhart
  2004-04-08 14:19       ` David Gibson
  0 siblings, 1 reply; 22+ messages in thread
From: Zoltan Menyhart @ 2004-04-08 11:10 UTC (permalink / raw)
  To: David Gibson
  Cc: Andrew Morton, linux-kernel, Benjamin Herrenschmidt,
	Anton Blanchard, Paul Mackerras, linuxppc64-dev

David Gibson wrote:

> > Why not just add a flag for a VMA telling if you want / do not want to
> > copy it on "fork()" ? E.g.:
> >
> > dup_mmap():
> >
> >       for (mpnt = current->mm->mmap ; mpnt ; mpnt = mpnt->vm_next) {
> >
> >               if (mpnt->vm_flags & VM_HUGETLB_DONT_COPY)
> >                       <do nothing>
> >       }
> >
> 
> Um.. why would that be useful?

I think there are 2 major cases:

- A big hugepage-using program creates threads to take advantage
  of all of the CPUs => use clone2(...CLONE_VM...), it works today.
- Another big hugepage-using program calling a little shell function
  with system() => just skip the VMA of the huge page area in
	do_fork():
		copy_process():
			copy_mm():
				dup_mmap()
  The child will have no copy of the huge page area. No problem, it will
  exec() soon, and the stack, the usual data, the malloc()'ed data, etc.
  are not in the huge page area => exec() will work correctly.

I do not think we need a COW of the huge pages.

Regards,

Zoltán Menyhárt


* Re: RFC: COW for hugepages
  2004-04-08 11:10     ` Zoltan Menyhart
@ 2004-04-08 14:19       ` David Gibson
  0 siblings, 0 replies; 22+ messages in thread
From: David Gibson @ 2004-04-08 14:19 UTC (permalink / raw)
  To: Zoltan.Menyhart
  Cc: Andrew Morton, linux-kernel, Benjamin Herrenschmidt,
	Anton Blanchard, Paul Mackerras, linuxppc64-dev

On Thu, Apr 08, 2004 at 01:10:51PM +0200, Zoltan Menyhart wrote:
> David Gibson wrote:
> 
> > > Why not just add a flag for a VMA telling if you want / do not want to
> > > copy it on "fork()" ? E.g.:
> > >
> > > dup_mmap():
> > >
> > >       for (mpnt = current->mm->mmap ; mpnt ; mpnt = mpnt->vm_next) {
> > >
> > >               if (mpnt->vm_flags & VM_HUGETLB_DONT_COPY)
> > >                       <do nothing>
> > >       }
> > >
> > 
> > Um.. why would that be useful?
> 
> I think there are 2 major cases:
> 
> - A big hugepage-using program creates threads to take advantage
>   of all of the CPUs => use clone2(...CLONEVM...), it works today.
> - Another big hugepage-using program calling a little shell function
>   with system() => just skip the VMA of the huge page area in
> 	do_fork():
> 		copy_process():
> 			copy_mm():
> 				dup_mmap()
>   The child will have no copy of the huge page area. No problem, it will
>   exec() soon, and the stack, the usual data, the malloc()'ed data, etc.
>   are not in the huge page area => exec() will will work correctly.

> I do not think we need a COW of the huge pages.

Both of these cases work already.  The fork() will copy the huge
PTEs, but that's no big deal since all the actual pages are shared.
COW isn't supposed to help either case - it's for cases where we
really do need MAP_PRIVATE semantics.  That's particularly important
for things we might do in the future where large pages are allocated
automatically according to certain heuristics.  In that case we
certainly can't make the pages silently have different semantics from
normal anonymous memory.

-- 
David Gibson			| For every complex problem there is a
david AT gibson.dropbear.id.au	| solution which is simple, neat and
				| wrong.
http://www.ozlabs.org/people/dgibson

