All of lore.kernel.org
 help / color / mirror / Atom feed
* [patch] mm: mremap correct rmap accounting
@ 2007-01-23 14:19 Nick Piggin
  2007-01-23 20:55 ` Hugh Dickins
  0 siblings, 1 reply; 25+ messages in thread
From: Nick Piggin @ 2007-01-23 14:19 UTC (permalink / raw)
  To: Linux Memory Management, Hugh Dickins, Andrew Morton, Ralf Baechle

[-- Attachment #1: Type: text/plain, Size: 228 bytes --]

Just spotted this possible issue when searching for another bug. Am I right
in my thinking? I haven't dug out my x86-multi-ZERO_PAGE-patch to verify
(or actually test the patch :P).

Comments?

Nick

-- 
SUSE Labs, Novell Inc.


[-- Attachment #2: rmap-mremap-fix.patch --]
[-- Type: text/plain, Size: 2294 bytes --]

When mremap()ing virtual addresses, some architectures (read: MIPS) switches
underlying pages if encountering ZERO_PAGE(old_vaddr) != ZERO_PAGE(new_vaddr).

The problem is that the refcount and mapcount remain on the old page, while
the actual pte is switched to the new one. This would counter underruns and
confuse the rmap code.

Fix it by actually moving accounting info to the new page. Would it be neater
to do this in move_pte? maybe rmap.c? (nick mumbles something about not
accounting ZERO_PAGE()s)

Signed-off-by: Nick Piggin <npiggin@suse.de>


Index: linux-2.6/mm/mremap.c
===================================================================
--- linux-2.6.orig/mm/mremap.c	2007-01-24 01:00:53.000000000 +1100
+++ linux-2.6/mm/mremap.c	2007-01-24 01:01:16.000000000 +1100
@@ -18,6 +18,7 @@
 #include <linux/highmem.h>
 #include <linux/security.h>
 #include <linux/syscalls.h>
+#include <linux/rmap.h>
 
 #include <asm/uaccess.h>
 #include <asm/cacheflush.h>
@@ -72,7 +73,7 @@ static void move_ptes(struct vm_area_str
 {
 	struct address_space *mapping = NULL;
 	struct mm_struct *mm = vma->vm_mm;
-	pte_t *old_pte, *new_pte, pte;
+	pte_t *old_pte, *new_pte;
 	spinlock_t *old_ptl, *new_ptl;
 
 	if (vma->vm_file) {
@@ -102,12 +103,28 @@ static void move_ptes(struct vm_area_str
 
 	for (; old_addr < old_end; old_pte++, old_addr += PAGE_SIZE,
 				   new_pte++, new_addr += PAGE_SIZE) {
+		pte_t new, old;
+
 		if (pte_none(*old_pte))
 			continue;
-		pte = ptep_clear_flush(vma, old_addr, old_pte);
+		old = ptep_clear_flush(vma, old_addr, old_pte);
 		/* ZERO_PAGE can be dependant on virtual addr */
-		pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr);
-		set_pte_at(mm, new_addr, new_pte, pte);
+		new = move_pte(old, new_vma->vm_page_prot, old_addr, new_addr);
+		if (unlikely(pte_pfn(old) != pte_pfn(new))) {
+			struct page *page;
+			/* must be different ZERO_PAGE()es. Update accounting */
+
+			page = vm_normal_page(vma, old_addr, old);
+			BUG_ON(page != ZERO_PAGE(old_addr));
+			put_page(page);
+			page_remove_rmap(page, vma);
+
+			page = vm_normal_page(new_vma, new_addr, new);
+			BUG_ON(page != ZERO_PAGE(new_addr));
+			get_page(page);
+			page_add_file_rmap(page);
+		}
+		set_pte_at(mm, new_addr, new_pte, new);
 	}
 
 	arch_leave_lazy_mmu_mode();

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch] mm: mremap correct rmap accounting
  2007-01-23 14:19 [patch] mm: mremap correct rmap accounting Nick Piggin
@ 2007-01-23 20:55 ` Hugh Dickins
  2007-01-23 23:49   ` Nick Piggin
  2007-01-29  3:31   ` Nick Piggin
  0 siblings, 2 replies; 25+ messages in thread
From: Hugh Dickins @ 2007-01-23 20:55 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Linux Memory Management, Andrew Morton, Ralf Baechle

On Wed, 24 Jan 2007, Nick Piggin wrote:
> 
> When mremap()ing virtual addresses, some architectures (read: MIPS) switches
> underlying pages if encountering ZERO_PAGE(old_vaddr) != ZERO_PAGE(new_vaddr).
> 
> The problem is that the refcount and mapcount remain on the old page, while
> the actual pte is switched to the new one. This would counter underruns and
> confuse the rmap code.

Good point.  Nasty.

> 
> Fix it by actually moving accounting info to the new page. Would it be neater
> to do this in move_pte? maybe rmap.c? (nick mumbles something about not
> accounting ZERO_PAGE()s)

Tiresome, I can quite see why it brings you to mumbling.

Though it looks right, I do hate the patch cluttering up move_ptes()
like that: will the compiler be able to work out that that "unlikely"
means impossible (and optimize away the code) on all arches but MIPS?
Even if it can, I'd rather not see it there.

Could you make the MIPS move_pte() a proper function, say in
arch/mips/mm/init.c next to setup_zero_pages(), and do that tiresome
stuff there - should then be able to assume ZERO_PAGEs and skip the
BUG_ON embellishments.

Utter nit-of-nits: my sense of symmetry prefers that you put_page()
after page_remove_rmap() instead of before.

Hugh

> 
> Signed-off-by: Nick Piggin <npiggin@suse.de>
> 
> 
> Index: linux-2.6/mm/mremap.c
> ===================================================================
> --- linux-2.6.orig/mm/mremap.c	2007-01-24 01:00:53.000000000 +1100
> +++ linux-2.6/mm/mremap.c	2007-01-24 01:01:16.000000000 +1100
> @@ -18,6 +18,7 @@
>  #include <linux/highmem.h>
>  #include <linux/security.h>
>  #include <linux/syscalls.h>
> +#include <linux/rmap.h>
>  
>  #include <asm/uaccess.h>
>  #include <asm/cacheflush.h>
> @@ -72,7 +73,7 @@ static void move_ptes(struct vm_area_str
>  {
>  	struct address_space *mapping = NULL;
>  	struct mm_struct *mm = vma->vm_mm;
> -	pte_t *old_pte, *new_pte, pte;
> +	pte_t *old_pte, *new_pte;
>  	spinlock_t *old_ptl, *new_ptl;
>  
>  	if (vma->vm_file) {
> @@ -102,12 +103,28 @@ static void move_ptes(struct vm_area_str
>  
>  	for (; old_addr < old_end; old_pte++, old_addr += PAGE_SIZE,
>  				   new_pte++, new_addr += PAGE_SIZE) {
> +		pte_t new, old;
> +
>  		if (pte_none(*old_pte))
>  			continue;
> -		pte = ptep_clear_flush(vma, old_addr, old_pte);
> +		old = ptep_clear_flush(vma, old_addr, old_pte);
>  		/* ZERO_PAGE can be dependant on virtual addr */
> -		pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr);
> -		set_pte_at(mm, new_addr, new_pte, pte);
> +		new = move_pte(old, new_vma->vm_page_prot, old_addr, new_addr);
> +		if (unlikely(pte_pfn(old) != pte_pfn(new))) {
> +			struct page *page;
> +			/* must be different ZERO_PAGE()es. Update accounting */
> +
> +			page = vm_normal_page(vma, old_addr, old);
> +			BUG_ON(page != ZERO_PAGE(old_addr));
> +			put_page(page);
> +			page_remove_rmap(page, vma);
> +
> +			page = vm_normal_page(new_vma, new_addr, new);
> +			BUG_ON(page != ZERO_PAGE(new_addr));
> +			get_page(page);
> +			page_add_file_rmap(page);
> +		}
> +		set_pte_at(mm, new_addr, new_pte, new);
>  	}
>  
>  	arch_leave_lazy_mmu_mode();

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch] mm: mremap correct rmap accounting
  2007-01-23 20:55 ` Hugh Dickins
@ 2007-01-23 23:49   ` Nick Piggin
  2007-01-29  3:31   ` Nick Piggin
  1 sibling, 0 replies; 25+ messages in thread
From: Nick Piggin @ 2007-01-23 23:49 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Linux Memory Management, Andrew Morton, Ralf Baechle

Hugh Dickins wrote:
> On Wed, 24 Jan 2007, Nick Piggin wrote:
> 
>>When mremap()ing virtual addresses, some architectures (read: MIPS) switches
>>underlying pages if encountering ZERO_PAGE(old_vaddr) != ZERO_PAGE(new_vaddr).
>>
>>The problem is that the refcount and mapcount remain on the old page, while
>>the actual pte is switched to the new one. This would counter underruns and
>>confuse the rmap code.
> 
> 
> Good point.  Nasty.
> 
> 
>>Fix it by actually moving accounting info to the new page. Would it be neater
>>to do this in move_pte? maybe rmap.c? (nick mumbles something about not
>>accounting ZERO_PAGE()s)
> 
> 
> Tiresome, I can quite see why it brings you to mumbling.
> 
> Though it looks right, I do hate the patch cluttering up move_ptes()
> like that: will the compiler be able to work out that that "unlikely"
> means impossible (and optimize away the code) on all arches but MIPS?
> Even if it can, I'd rather not see it there.

Yeah it doesn't look quite right.

> Could you make the MIPS move_pte() a proper function, say in
> arch/mips/mm/init.c next to setup_zero_pages(), and do that tiresome
> stuff there - should then be able to assume ZERO_PAGEs and skip the
> BUG_ON embellishments.

The only thing I was thinking of was if another arch comes along and
does the same thing. Also tried to keep such rmap specifics in mm/.

But if you're happy to do it that way, then I'm happy with it!

> Utter nit-of-nits: my sense of symmetry prefers that you put_page()
> after page_remove_rmap() instead of before.

I prefer that way too, now you mention it ;)

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch] mm: mremap correct rmap accounting
  2007-01-23 20:55 ` Hugh Dickins
  2007-01-23 23:49   ` Nick Piggin
@ 2007-01-29  3:31   ` Nick Piggin
  2007-01-29  6:40     ` Andrew Morton
  2007-01-29 19:08     ` Hugh Dickins
  1 sibling, 2 replies; 25+ messages in thread
From: Nick Piggin @ 2007-01-29  3:31 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Linux Memory Management, Andrew Morton, Ralf Baechle, Linus Torvalds

[-- Attachment #1: Type: text/plain, Size: 1603 bytes --]

Hugh Dickins wrote:
> On Wed, 24 Jan 2007, Nick Piggin wrote:
> 
>>When mremap()ing virtual addresses, some architectures (read: MIPS) switches
>>underlying pages if encountering ZERO_PAGE(old_vaddr) != ZERO_PAGE(new_vaddr).
>>
>>The problem is that the refcount and mapcount remain on the old page, while
>>the actual pte is switched to the new one. This would counter underruns and
>>confuse the rmap code.
> 
> 
> Good point.  Nasty.
> 
> 
>>Fix it by actually moving accounting info to the new page. Would it be neater
>>to do this in move_pte? maybe rmap.c? (nick mumbles something about not
>>accounting ZERO_PAGE()s)
> 
> 
> Tiresome, I can quite see why it brings you to mumbling.
> 
> Though it looks right, I do hate the patch cluttering up move_ptes()
> like that: will the compiler be able to work out that that "unlikely"
> means impossible (and optimize away the code) on all arches but MIPS?
> Even if it can, I'd rather not see it there.
> 
> Could you make the MIPS move_pte() a proper function, say in
> arch/mips/mm/init.c next to setup_zero_pages(), and do that tiresome
> stuff there - should then be able to assume ZERO_PAGEs and skip the
> BUG_ON embellishments.
> 
> Utter nit-of-nits: my sense of symmetry prefers that you put_page()
> after page_remove_rmap() instead of before.

OK, how's this one?

Not tested on MIPS, but the same move_pte compiled on i386 here.

I sent Ralf a little test program that should eventually free a ZERO_PAGE
if it is run a few times (with a non-zero zero_page_mask). Do you have
time to confirm, Ralf?

Thanks,
Nick

-- 
SUSE Labs, Novell Inc.

[-- Attachment #2: rmap-mremap-fix.patch --]
[-- Type: text/plain, Size: 5406 bytes --]

When mremap()ing virtual addresses, some architectures (read: MIPS) switches
underlying pages if encountering ZERO_PAGE(old_vaddr) != ZERO_PAGE(new_vaddr).

The problem is that the refcount and mapcount remain on the old page, while
the actual pte is switched to the new one. This would counter underruns and
confuse the rmap code.

Fix it by actually moving accounting info to the new page in the MIPS
architecture code.

Also, clean up mremap slightly, so we can see where the old and the new pte
is being used.

Signed-off-by: Nick Piggin <npiggin@suse.de>


Index: linux-2.6/mm/mremap.c
===================================================================
--- linux-2.6.orig/mm/mremap.c	2007-01-24 01:00:53.000000000 +1100
+++ linux-2.6/mm/mremap.c	2007-01-29 14:17:13.000000000 +1100
@@ -72,7 +72,7 @@ static void move_ptes(struct vm_area_str
 {
 	struct address_space *mapping = NULL;
 	struct mm_struct *mm = vma->vm_mm;
-	pte_t *old_pte, *new_pte, pte;
+	pte_t *old_pte, *new_pte;
 	spinlock_t *old_ptl, *new_ptl;
 
 	if (vma->vm_file) {
@@ -102,12 +102,17 @@ static void move_ptes(struct vm_area_str
 
 	for (; old_addr < old_end; old_pte++, old_addr += PAGE_SIZE,
 				   new_pte++, new_addr += PAGE_SIZE) {
+		pte_t new, old;
+
 		if (pte_none(*old_pte))
 			continue;
-		pte = ptep_clear_flush(vma, old_addr, old_pte);
-		/* ZERO_PAGE can be dependant on virtual addr */
-		pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr);
-		set_pte_at(mm, new_addr, new_pte, pte);
+		old = ptep_clear_flush(vma, old_addr, old_pte);
+		/*
+		 * ZERO_PAGE can be dependant on virtual addr, so pte may
+		 * actually have to be changed
+		 */
+		new = move_pte(old, new_vma->vm_page_prot, vma, old_addr, new_addr);
+		set_pte_at(mm, new_addr, new_pte, new);
 	}
 
 	arch_leave_lazy_mmu_mode();
Index: linux-2.6/arch/mips/mm/init.c
===================================================================
--- linux-2.6.orig/arch/mips/mm/init.c	2007-01-29 13:57:07.000000000 +1100
+++ linux-2.6/arch/mips/mm/init.c	2007-01-29 14:28:20.000000000 +1100
@@ -25,6 +25,7 @@
 #include <linux/swap.h>
 #include <linux/proc_fs.h>
 #include <linux/pfn.h>
+#include <linux/rmap.h>
 
 #include <asm/bootinfo.h>
 #include <asm/cachectl.h>
@@ -103,6 +104,41 @@ unsigned long setup_zero_pages(void)
 	return 1UL << order;
 }
 
+
+pte_t move_pte(pte_t pte, pgprot_t prot, struct vm_area_struct *old_vma,
+			unsigned long old_addr, unsigned long new_addr)
+{
+	unsigned long pfn;
+	struct page *old_page, *new_page;
+ 	pte_t newpte = pte;
+
+	if (!pte_present(pte))
+		goto out;
+
+	pfn = pte_pfn(pte);
+	if (!pfn_valid(pfn))
+		goto out;
+
+	old_page = vm_normal_page(old_vma, old_addr, pte);
+	if (likely(old_page != ZERO_PAGE(old_addr)))
+		goto out;
+
+	new_page = ZERO_PAGE(new_addr);
+	if (old_page == new_page)
+		goto out;
+
+	get_page(new_page);
+	page_add_file_rmap(new_page);
+
+	page_remove_rmap(old_page, old_vma);
+	put_page(old_page);
+
+	newpte = mk_pte(new_page, prot);
+
+out:
+	return newpte;
+}
+
 /*
  * These are almost like kmap_atomic / kunmap_atmic except they take an
  * additional address argument as the hint.
Index: linux-2.6/include/asm-generic/pgtable.h
===================================================================
--- linux-2.6.orig/include/asm-generic/pgtable.h	2007-01-29 13:56:10.000000000 +1100
+++ linux-2.6/include/asm-generic/pgtable.h	2007-01-29 14:19:06.000000000 +1100
@@ -159,7 +159,14 @@ static inline void ptep_set_wrprotect(st
 #endif
 
 #ifndef __HAVE_ARCH_MOVE_PTE
-#define move_pte(pte, prot, old_addr, new_addr)	(pte)
+/*
+ * Those architectures that define __HAVE_ARCH_MOVE_PTE must take care
+ * to update the rmap information on a vm_normal_page (which includes
+ * ZERO_PAGE), if they update the pte to some other page here.
+ *
+ * See MIPS for an example.
+ */
+#define move_pte(pte, prot, old_vma, old_addr, new_addr)	(pte)
 #endif
 
 /*
Index: linux-2.6/include/asm-mips/pgtable.h
===================================================================
--- linux-2.6.orig/include/asm-mips/pgtable.h	2007-01-29 13:55:38.000000000 +1100
+++ linux-2.6/include/asm-mips/pgtable.h	2007-01-29 14:28:20.000000000 +1100
@@ -70,14 +70,8 @@ extern unsigned long zero_page_mask;
 	(virt_to_page((void *)(empty_zero_page + (((unsigned long)(vaddr)) & zero_page_mask))))
 
 #define __HAVE_ARCH_MOVE_PTE
-#define move_pte(pte, prot, old_addr, new_addr)				\
-({									\
- 	pte_t newpte = (pte);						\
-	if (pte_present(pte) && pfn_valid(pte_pfn(pte)) &&		\
-			pte_page(pte) == ZERO_PAGE(old_addr))		\
-		newpte = mk_pte(ZERO_PAGE(new_addr), (prot));		\
-	newpte;								\
-})
+extern pte_t move_pte(pte_t pte, pgprot_t prot, struct vm_area_struct *old_vma,
+			unsigned long old_addr, unsigned long new_addr);
 
 extern void paging_init(void);
 
Index: linux-2.6/include/asm-sparc64/pgtable.h
===================================================================
--- linux-2.6.orig/include/asm-sparc64/pgtable.h	2007-01-29 14:10:38.000000000 +1100
+++ linux-2.6/include/asm-sparc64/pgtable.h	2007-01-29 14:11:32.000000000 +1100
@@ -691,7 +691,7 @@ static inline void set_pte_at(struct mm_
 
 #ifdef DCACHE_ALIASING_POSSIBLE
 #define __HAVE_ARCH_MOVE_PTE
-#define move_pte(pte, prot, old_addr, new_addr)				\
+#define move_pte(pte, prot, old_vma, old_addr, new_addr)		\
 ({									\
  	pte_t newpte = (pte);						\
 	if (tlb_type != hypervisor && pte_present(pte)) {		\

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch] mm: mremap correct rmap accounting
  2007-01-29  3:31   ` Nick Piggin
@ 2007-01-29  6:40     ` Andrew Morton
  2007-01-29  6:57       ` Nick Piggin
  2007-01-29 19:08     ` Hugh Dickins
  1 sibling, 1 reply; 25+ messages in thread
From: Andrew Morton @ 2007-01-29  6:40 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Hugh Dickins, Linux Memory Management, Ralf Baechle, Linus Torvalds

On Mon, 29 Jan 2007 14:31:07 +1100
Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> When mremap()ing virtual addresses, some architectures (read: MIPS) switches
> underlying pages if encountering ZERO_PAGE(old_vaddr) != ZERO_PAGE(new_vaddr).
> 
> The problem is that the refcount and mapcount remain on the old page, while
> the actual pte is switched to the new one. This would counter underruns and
> confuse the rmap code.

umm, that sounds fairly fatal.  For how long has this bug been present?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch] mm: mremap correct rmap accounting
  2007-01-29  6:40     ` Andrew Morton
@ 2007-01-29  6:57       ` Nick Piggin
  0 siblings, 0 replies; 25+ messages in thread
From: Nick Piggin @ 2007-01-29  6:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Linux Memory Management, Ralf Baechle, Linus Torvalds

Andrew Morton wrote:
> On Mon, 29 Jan 2007 14:31:07 +1100
> Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> 
> 
>>When mremap()ing virtual addresses, some architectures (read: MIPS) switches
>>underlying pages if encountering ZERO_PAGE(old_vaddr) != ZERO_PAGE(new_vaddr).
>>
>>The problem is that the refcount and mapcount remain on the old page, while
>>the actual pte is switched to the new one. This would counter underruns and
>>confuse the rmap code.
> 
> 
> umm, that sounds fairly fatal.  For how long has this bug been present?
> 

It would be. We have a catch to prevent ZERO_PAGE from actually getting freed,
but it would spew warnings and eventually go bug in the page_mapcount underflow
check.

It has been around for quite a few releases, so either we don't have many R4000
or R4400 SC and MC CPUs running a recent kernel, or this type of mremap is
pretty rare, or both. We get far more reports of mapcount underflow on x86 than
MIPS!

If Ralf acks then I think it could go into the next release.

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch] mm: mremap correct rmap accounting
  2007-01-29  3:31   ` Nick Piggin
  2007-01-29  6:40     ` Andrew Morton
@ 2007-01-29 19:08     ` Hugh Dickins
  2007-01-29 19:27       ` Linus Torvalds
  1 sibling, 1 reply; 25+ messages in thread
From: Hugh Dickins @ 2007-01-29 19:08 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linux Memory Management, Andrew Morton, Ralf Baechle, Linus Torvalds

On Mon, 29 Jan 2007, Nick Piggin wrote:
> 
> OK, how's this one?

Grudging okay - so irritating to have to do this!

We have different ideas of what's a good cleanup
(pte -> old/new), but if you wish.

I'd much rather you'd got rid of move_pte's prot argument,
that can be taken from the old_vma you're now having to pass
(because of your debug additions to page_remove_rmap).

Grudging okay: thanks for fixing it.

Hugh

> 
> Not tested on MIPS, but the same move_pte compiled on i386 here.
> 
> I sent Ralf a little test program that should eventually free a ZERO_PAGE
> if it is run a few times (with a non-zero zero_page_mask). Do you have
> time to confirm, Ralf?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch] mm: mremap correct rmap accounting
  2007-01-29 19:08     ` Hugh Dickins
@ 2007-01-29 19:27       ` Linus Torvalds
  2007-01-29 20:03         ` Andrew Morton
  2007-01-29 20:10         ` Hugh Dickins
  0 siblings, 2 replies; 25+ messages in thread
From: Linus Torvalds @ 2007-01-29 19:27 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Nick Piggin, Linux Memory Management, Andrew Morton, Ralf Baechle


On Mon, 29 Jan 2007, Hugh Dickins wrote:

> On Mon, 29 Jan 2007, Nick Piggin wrote:
> > 
> > OK, how's this one?
> 
> Grudging okay - so irritating to have to do this!

I really hate it. Like REALLY REALLY hate it.

This just seems really stupid.

How about making the zero-page on MIPS be PageCompound(), and then have 
all the sub-pages just point to the first page - that's how compound pages 
work anyway.

Then the mapcount/refcount is all done on the compound page, and none of 
these problems occur.

Hmm? 

		Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch] mm: mremap correct rmap accounting
  2007-01-29 19:27       ` Linus Torvalds
@ 2007-01-29 20:03         ` Andrew Morton
  2007-01-29 20:18           ` Linus Torvalds
  2007-01-29 20:10         ` Hugh Dickins
  1 sibling, 1 reply; 25+ messages in thread
From: Andrew Morton @ 2007-01-29 20:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Nick Piggin, Linux Memory Management, Ralf Baechle

On Mon, 29 Jan 2007 11:27:56 -0800 (PST)
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> 
> 
> On Mon, 29 Jan 2007, Hugh Dickins wrote:
> 
> > On Mon, 29 Jan 2007, Nick Piggin wrote:
> > > 
> > > OK, how's this one?
> > 
> > Grudging okay - so irritating to have to do this!
> 
> I really hate it. Like REALLY REALLY hate it.
> 
> This just seems really stupid.
> 
> How about making the zero-page on MIPS be PageCompound(), and then have 
> all the sub-pages just point to the first page - that's how compound pages 
> work anyway.
> 

Can we convert those bits of mips to just have a single zero-page, like
everyone else?

Is that trick a correctness thing, or a performance thing?  If the latter,
how useful is it, and how common are the chips which use it?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch] mm: mremap correct rmap accounting
  2007-01-29 19:27       ` Linus Torvalds
  2007-01-29 20:03         ` Andrew Morton
@ 2007-01-29 20:10         ` Hugh Dickins
  2007-01-29 20:22           ` Linus Torvalds
  1 sibling, 1 reply; 25+ messages in thread
From: Hugh Dickins @ 2007-01-29 20:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Linux Memory Management, Andrew Morton, Ralf Baechle

On Mon, 29 Jan 2007, Linus Torvalds wrote:
> On Mon, 29 Jan 2007, Hugh Dickins wrote:
> > On Mon, 29 Jan 2007, Nick Piggin wrote:
> > > 
> > > OK, how's this one?
> > 
> > Grudging okay - so irritating to have to do this!
> 
> I really hate it. Like REALLY REALLY hate it.
> 
> This just seems really stupid.
> 
> How about making the zero-page on MIPS be PageCompound(), and then have 
> all the sub-pages just point to the first page - that's how compound pages 
> work anyway.
> 
> Then the mapcount/refcount is all done on the compound page, and none of 
> these problems occur.

Beautiful idea, an entirely appropriate use of PageCompound().

But it won't quite work as is, since only page_count() is diverted
via PageCompound(): page_mapcount() works on exactly the page given.
So the MIPS ZERO_PAGEs could still hit the page_remove_rmap() BUG.

Agreed that's a surprising divergence: but it's worked fine to date,
and I'm hesitant to change it in a hurry, need to pause to consider
the ramifications.

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch] mm: mremap correct rmap accounting
  2007-01-29 20:03         ` Andrew Morton
@ 2007-01-29 20:18           ` Linus Torvalds
  2007-01-29 21:27             ` Ralf Baechle
  0 siblings, 1 reply; 25+ messages in thread
From: Linus Torvalds @ 2007-01-29 20:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Nick Piggin, Linux Memory Management, Ralf Baechle


On Mon, 29 Jan 2007, Andrew Morton wrote:
> 
> Can we convert those bits of mips to just have a single zero-page, like
> everyone else?
> 
> Is that trick a correctness thing, or a performance thing?  If the latter,
> how useful is it, and how common are the chips which use it?

It was a performance thing, iirc. Apparently a fairly big deal: pages 
kicking each other out of the cache due to idiotic cache design. But I 
forget any details.

MIPS in general is a f*cking pain in the *ss. They have a few chips that 
are sane, but just an incredible amount of totally braindamaged ones. 
They're not the only ones with virtual caches, but they're certainly 
well-represented there. Sad.

			Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch] mm: mremap correct rmap accounting
  2007-01-29 20:10         ` Hugh Dickins
@ 2007-01-29 20:22           ` Linus Torvalds
  2007-01-29 20:38             ` Hugh Dickins
  0 siblings, 1 reply; 25+ messages in thread
From: Linus Torvalds @ 2007-01-29 20:22 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Nick Piggin, Linux Memory Management, Andrew Morton, Ralf Baechle


On Mon, 29 Jan 2007, Hugh Dickins wrote:
> 
> But it won't quite work as is, since only page_count() is diverted
> via PageCompound(): page_mapcount() works on exactly the page given.
> So the MIPS ZERO_PAGEs could still hit the page_remove_rmap() BUG.

Ok.

> Agreed that's a surprising divergence: but it's worked fine to date,
> and I'm hesitant to change it in a hurry, need to pause to consider
> the ramifications.

How about:
 - the current code has worked so far, so there is no way in hell I'll 
   take a patch for some odd-ball architecture for a theoretical problem 
   that nobody else cares about for 2.6.20 *anyway* and hasn't been 
   reported until now (considering that it's apparently been around since 
   rmap went in).

 - somebody who cares would explore trying to make page_mapcount work like 
   page_count, and see if this is a viable approach.

 - at worst, we can do the Andrew thing, and just forget about the 
   ZERO_PAGE() multi-page thing. It may well be that even MIPS people 
   don't care any more, if the particular uarch version that was helped 
   most isn't very common any more (we can always hope..)

Hmm?

		Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch] mm: mremap correct rmap accounting
  2007-01-29 20:22           ` Linus Torvalds
@ 2007-01-29 20:38             ` Hugh Dickins
  2007-01-29 21:24               ` Hugh Dickins
  0 siblings, 1 reply; 25+ messages in thread
From: Hugh Dickins @ 2007-01-29 20:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Linux Memory Management, Andrew Morton, Ralf Baechle

On Mon, 29 Jan 2007, Linus Torvalds wrote:
> 
> How about:
>  - the current code has worked so far, so there is no way in hell I'll 
>    take a patch for some odd-ball architecture for a theoretical problem 
>    that nobody else cares about for 2.6.20 *anyway* and hasn't been 
>    reported until now (considering that it's apparently been around since 
>    rmap went in).
> 
>  - somebody who cares would explore trying to make page_mapcount work like 
>    page_count, and see if this is a viable approach.
> 
>  - at worst, we can do the Andrew thing, and just forget about the 
>    ZERO_PAGE() multi-page thing. It may well be that even MIPS people 
>    don't care any more, if the particular uarch version that was helped 
>    most isn't very common any more (we can always hope..)

I'm currently thinking the right answer will be to keep the MIPS
multiple ZERO_PAGEs, for their performance benefit, but ditch the
way MIPS moves from one ZERO_PAGE to another in move_pte, trying
desperately to retain that performance benefit across mremap.

It never used to do that, it was something Nick noticed and put
into 2.6.14.  I've the ghost of a memory that at that time he
had some code which was actually relying on ZERO_PAGE(vaddr)
always being the expected ZERO_PAGE for that vaddr.
I'm about to search my old mailboxes.

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch] mm: mremap correct rmap accounting
  2007-01-29 20:38             ` Hugh Dickins
@ 2007-01-29 21:24               ` Hugh Dickins
  2007-01-30  1:00                 ` Nick Piggin
                                   ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: Hugh Dickins @ 2007-01-29 21:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Linux Memory Management, Andrew Morton,
	Ralf Baechle, Carsten Otte

On Mon, 29 Jan 2007, Hugh Dickins wrote:
> 
> I'm currently thinking the right answer will be to keep the MIPS
> multiple ZERO_PAGEs, for their performance benefit, but ditch the
> way MIPS moves from one ZERO_PAGE to another in move_pte, trying
> desperately to retain that performance benefit across mremap.
> 
> It never used to do that, it was something Nick noticed and put
> into 2.6.14.  I've the ghost of a memory that at that time he
> had some code which was actually relying on ZERO_PAGE(vaddr)
> always being the expected ZERO_PAGE for that vaddr.
> I'm about to search my old mailboxes.

Ah, it wasn't any special patch of Nick's that needed it for
correctness, it was filemap_xip and its use the ZERO_PAGE(address)
(to avoid unnecessary page allocations): with page_check_address()
an interface for checking just a single page, not a range of them.

Could make it loop over them all, but a quicker patch would be as
below.  I've no idea if the intersection of filemap_xip users and
MIPS users is the empty set or more interesting.  But I'd prefer
you don't just slam in the patch, better have an opinion from
Carsten and/or Nick first.


Nick Piggin points out that page accounting on MIPS multiple ZERO_PAGEs
is not maintained by its move_pte, and could lead to freeing a ZERO_PAGE.
Instead of complicating that move_pte, just forget the minor optimization
when mremapping, and change the one thing which needed it for correctness
- filemap_xip use ZERO_PAGE(0) throughout instead of according to address.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
---

 include/asm-mips/pgtable.h |   10 ----------
 mm/filemap_xip.c           |    4 ++--
 mm/mremap.c                |    1 -
 3 files changed, 2 insertions(+), 13 deletions(-)

--- 2.6.20-rc6/include/asm-mips/pgtable.h	2007-01-25 08:25:19.000000000 +0000
+++ linux/include/asm-mips/pgtable.h	2007-01-29 20:57:35.000000000 +0000
@@ -69,16 +69,6 @@ extern unsigned long zero_page_mask;
 #define ZERO_PAGE(vaddr) \
 	(virt_to_page((void *)(empty_zero_page + (((unsigned long)(vaddr)) & zero_page_mask))))
 
-#define __HAVE_ARCH_MOVE_PTE
-#define move_pte(pte, prot, old_addr, new_addr)				\
-({									\
- 	pte_t newpte = (pte);						\
-	if (pte_present(pte) && pfn_valid(pte_pfn(pte)) &&		\
-			pte_page(pte) == ZERO_PAGE(old_addr))		\
-		newpte = mk_pte(ZERO_PAGE(new_addr), (prot));		\
-	newpte;								\
-})
-
 extern void paging_init(void);
 
 /*
--- 2.6.20-rc6/mm/filemap_xip.c	2007-01-25 08:25:27.000000000 +0000
+++ linux/mm/filemap_xip.c	2007-01-29 20:57:35.000000000 +0000
@@ -183,7 +183,7 @@ __xip_unmap (struct address_space * mapp
 		address = vma->vm_start +
 			((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
 		BUG_ON(address < vma->vm_start || address >= vma->vm_end);
-		page = ZERO_PAGE(address);
+		page = ZERO_PAGE(0);
 		pte = page_check_address(page, mm, address, &ptl);
 		if (pte) {
 			/* Nuke the page table entry. */
@@ -246,7 +246,7 @@ xip_file_nopage(struct vm_area_struct * 
 		__xip_unmap(mapping, pgoff);
 	} else {
 		/* not shared and writable, use ZERO_PAGE() */
-		page = ZERO_PAGE(address);
+		page = ZERO_PAGE(0);
 	}
 
 out:
--- 2.6.20-rc6/mm/mremap.c	2006-11-29 21:57:37.000000000 +0000
+++ linux/mm/mremap.c	2007-01-29 20:57:35.000000000 +0000
@@ -105,7 +105,6 @@ static void move_ptes(struct vm_area_str
 		if (pte_none(*old_pte))
 			continue;
 		pte = ptep_clear_flush(vma, old_addr, old_pte);
-		/* ZERO_PAGE can be dependant on virtual addr */
 		pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr);
 		set_pte_at(mm, new_addr, new_pte, pte);
 	}

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch] mm: mremap correct rmap accounting
  2007-01-29 20:18           ` Linus Torvalds
@ 2007-01-29 21:27             ` Ralf Baechle
  0 siblings, 0 replies; 25+ messages in thread
From: Ralf Baechle @ 2007-01-29 21:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Hugh Dickins, Nick Piggin,
	Linux Memory Management, Maciej W. Rozycki

On Mon, Jan 29, 2007 at 12:18:29PM -0800, Linus Torvalds wrote:

(adding Maciej who is using R4000 for much of his MIPS hacking to cc ...)

> > Can we convert those bits of mips to just have a single zero-page, like
> > everyone else?
> > 
> > Is that trick a correctness thing, or a performance thing?  If the latter,
> > how useful is it, and how common are the chips which use it?
> 
> It was a performance thing, iirc. Apparently a fairly big deal: pages 
> kicking each other out of the cache due to idiotic cache design. But I 
> forget any details.

R4000 and R4400 SC and MC versions suffer from virtual aliases but have a
hardware mechanism to detect them and will throw a virtual coherency
exception if so.  So in theory for these processors the entire burden of
handling cache aliases can be left to like ten lines of exception handling
code.  A real world machine may encounter millions of these exceptions in
a relativly short time which sucks performancewise, so aliasing avoidance
by careful selection of addresses and cacheflushes is needed.

So to make that more explicit, it's not needed for correctness but it's
a performance - and sometimes a fairly big one - thing.

(The easy solution for the issue would be raising the pagesize to the next
higher supported values, 16kB or 64kB.  Now for the true idiocy of this
exception-based scheme - it happens that the hardware checks three bits for
aliasing as if the CPU had 32kB direct mapped caches even though these
types only have 8kB (R4000) rsp 16kB (R4400) primary caches.  This means
ramping up the page size to 16kB wouldn't suffice - it would have to be
64kB to eleminate aliases which is impractical for memory reasons on many
systems.

Anyway, no other MIPS processor has this "virtual coherency exception"
and so I don't have an affected system at hand right now.  So I hacked
a 2.6.20-rc6 kernel for another machine to do R4x00 style ZERO_PAGE
handling - and the system did survive a quick 10min testing just fine
until a test case which Nick had mailed to me killed the system as
predicted.  I blame obsoletion of the chips and lucky choice of workload
for this issue only ever having been reported once and that report was
probably ignored because it was sufficiently obscure.

> MIPS in general is a f*cking pain in the *ss. They have a few chips that 
> are sane, but just an incredible amount of totally braindamaged ones. 
> They're not the only ones with virtual caches, but they're certainly 
> well-represented there. Sad.

Eh...  MIPS isn't the architecture with VIVT data caches.  It's other
arch maintainers that get their sleep robbed by truly st00pid caches.

  Ralf

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch] mm: mremap correct rmap accounting
  2007-01-29 21:24               ` Hugh Dickins
@ 2007-01-30  1:00                 ` Nick Piggin
  2007-01-30 14:24                 ` Carsten Otte
  2007-01-30 15:47                 ` Carsten Otte
  2 siblings, 0 replies; 25+ messages in thread
From: Nick Piggin @ 2007-01-30  1:00 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Linus Torvalds, Linux Memory Management, Andrew Morton,
	Ralf Baechle, Carsten Otte

Hugh Dickins wrote:

> Ah, it wasn't any special patch of Nick's that needed it for
> correctness, it was filemap_xip and its use the ZERO_PAGE(address)
> (to avoid unnecessary page allocations): with page_check_address()
> an interface for checking just a single page, not a range of them.
> 
> Could make it loop over them all, but a quicker patch would be as
> below.  I've no idea if the intersection of filemap_xip users and
> MIPS users is the empty set or more interesting.  But I'd prefer
> you don't just slam in the patch, better have an opinion from
> Carsten and/or Nick first.

Yeah you could do that. I found it interesting that there is no way
in hell Linus will take an architecture-specific patch for this
odd-ball architecture and its obscure problem, but would rather fix
it in generic code.

My opinion? If we do keep the multiple zero pages thing, then I would
prefer keep them coherent with their virtual addresses to prevent
future surprises. Ditching multiple zero pages completely would be
ideal, from a core mm/ point of view, of course.

I agree my patch is ugly for having to do refcount and rmap work
outside mm/. But I don't see anything wrong with it as the minimal
correctness fix that we could later reevaluate.

> 
> 
> Nick Piggin points out that page accounting on MIPS multiple ZERO_PAGEs
> is not maintained by its move_pte, and could lead to freeing a ZERO_PAGE.
> Instead of complicating that move_pte, just forget the minor optimization
> when mremapping, and change the one thing which needed it for correctness
> - filemap_xip use ZERO_PAGE(0) throughout instead of according to address.
> 
> Signed-off-by: Hugh Dickins <hugh@veritas.com>
> ---
> 
>  include/asm-mips/pgtable.h |   10 ----------
>  mm/filemap_xip.c           |    4 ++--
>  mm/mremap.c                |    1 -
>  3 files changed, 2 insertions(+), 13 deletions(-)
> 
> --- 2.6.20-rc6/include/asm-mips/pgtable.h	2007-01-25 08:25:19.000000000 +0000
> +++ linux/include/asm-mips/pgtable.h	2007-01-29 20:57:35.000000000 +0000
> @@ -69,16 +69,6 @@ extern unsigned long zero_page_mask;
>  #define ZERO_PAGE(vaddr) \
>  	(virt_to_page((void *)(empty_zero_page + (((unsigned long)(vaddr)) & zero_page_mask))))
>  
> -#define __HAVE_ARCH_MOVE_PTE
> -#define move_pte(pte, prot, old_addr, new_addr)				\
> -({									\
> - 	pte_t newpte = (pte);						\
> -	if (pte_present(pte) && pfn_valid(pte_pfn(pte)) &&		\
> -			pte_page(pte) == ZERO_PAGE(old_addr))		\
> -		newpte = mk_pte(ZERO_PAGE(new_addr), (prot));		\
> -	newpte;								\
> -})
> -
>  extern void paging_init(void);
>  
>  /*
> --- 2.6.20-rc6/mm/filemap_xip.c	2007-01-25 08:25:27.000000000 +0000
> +++ linux/mm/filemap_xip.c	2007-01-29 20:57:35.000000000 +0000
> @@ -183,7 +183,7 @@ __xip_unmap (struct address_space * mapp
>  		address = vma->vm_start +
>  			((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
>  		BUG_ON(address < vma->vm_start || address >= vma->vm_end);
> -		page = ZERO_PAGE(address);
> +		page = ZERO_PAGE(0);
>  		pte = page_check_address(page, mm, address, &ptl);
>  		if (pte) {
>  			/* Nuke the page table entry. */
> @@ -246,7 +246,7 @@ xip_file_nopage(struct vm_area_struct * 
>  		__xip_unmap(mapping, pgoff);
>  	} else {
>  		/* not shared and writable, use ZERO_PAGE() */
> -		page = ZERO_PAGE(address);
> +		page = ZERO_PAGE(0);
>  	}
>  
>  out:
> --- 2.6.20-rc6/mm/mremap.c	2006-11-29 21:57:37.000000000 +0000
> +++ linux/mm/mremap.c	2007-01-29 20:57:35.000000000 +0000
> @@ -105,7 +105,6 @@ static void move_ptes(struct vm_area_str
>  		if (pte_none(*old_pte))
>  			continue;
>  		pte = ptep_clear_flush(vma, old_addr, old_pte);
> -		/* ZERO_PAGE can be dependant on virtual addr */
>  		pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr);
>  		set_pte_at(mm, new_addr, new_pte, pte);
>  	}
> 
> --

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch] mm: mremap correct rmap accounting
  2007-01-29 21:24               ` Hugh Dickins
  2007-01-30  1:00                 ` Nick Piggin
@ 2007-01-30 14:24                 ` Carsten Otte
  2007-01-30 16:41                   ` Ralf Baechle
  2007-01-30 15:47                 ` Carsten Otte
  2 siblings, 1 reply; 25+ messages in thread
From: Carsten Otte @ 2007-01-30 14:24 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Linus Torvalds, Nick Piggin, Linux Memory Management,
	Andrew Morton, Ralf Baechle

Hugh Dickins wrote:
> I've no idea if the intersection of filemap_xip users and
> MIPS users is the empty set or more interesting.
As far as I can tell, the intersection is zero. There is no block 
device driver one could use for XIP on mips platforms.

Carsten

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch] mm: mremap correct rmap accounting
  2007-01-29 21:24               ` Hugh Dickins
  2007-01-30  1:00                 ` Nick Piggin
  2007-01-30 14:24                 ` Carsten Otte
@ 2007-01-30 15:47                 ` Carsten Otte
  2007-01-30 22:04                   ` Hugh Dickins
  2 siblings, 1 reply; 25+ messages in thread
From: Carsten Otte @ 2007-01-30 15:47 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Linus Torvalds, Nick Piggin, Linux Memory Management,
	Andrew Morton, Ralf Baechle

Hugh Dickins wrote:
> Could make it loop over them all, but a quicker patch would be as
> below.  I've no idea if the intersection of filemap_xip users and
> MIPS users is the empty set or more interesting.  But I'd prefer
> you don't just slam in the patch, better have an opinion from
> Carsten and/or Nick first.
Took me some time to catch up on this thread, sorry for that. Yea, I 
think xip can be implemented correctly that it works on mips when we 
loop over all zero pages on unmap. Let me try to come up with a patch 
for that.

Carsten

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch] mm: mremap correct rmap accounting
  2007-01-30 14:24                 ` Carsten Otte
@ 2007-01-30 16:41                   ` Ralf Baechle
  2007-01-30 17:35                     ` Carsten Otte
  0 siblings, 1 reply; 25+ messages in thread
From: Ralf Baechle @ 2007-01-30 16:41 UTC (permalink / raw)
  To: carsteno
  Cc: Hugh Dickins, Linus Torvalds, Nick Piggin,
	Linux Memory Management, Andrew Morton

On Tue, Jan 30, 2007 at 03:24:32PM +0100, Carsten Otte wrote:

> Hugh Dickins wrote:
> >I've no idea if the intersection of filemap_xip users and
> >MIPS users is the empty set or more interesting.
> As far as I can tell, the intersection is zero. There is no block 
> device driver one could use for XIP on mips platforms.

How about flash ;-)

XIP is mostly an embedded feature.  The affected MIPS R4[40]00[SM]C
processors are desktop and server processors so there almost
fundamentally is no overlap.

  Ralf

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch] mm: mremap correct rmap accounting
  2007-01-30 16:41                   ` Ralf Baechle
@ 2007-01-30 17:35                     ` Carsten Otte
  0 siblings, 0 replies; 25+ messages in thread
From: Carsten Otte @ 2007-01-30 17:35 UTC (permalink / raw)
  To: Ralf Baechle
  Cc: carsteno, Hugh Dickins, Linus Torvalds, Nick Piggin,
	Linux Memory Management, Andrew Morton

Ralf Baechle wrote:
> How about flash ;-)
The current xip implementation does not work with flash. It only works 
with ram/rom, and we use it in a virtual environment to share data 
between multiple virtual machines.
After a long discussion with Joern Engel and David Woodhouse, I 
believe we came up with an idea how we would extend the existing xip 
support to work with memory technology device. But that is not 
implemented, and not work in progress.

> XIP is mostly an embedded feature.  The affected MIPS R4[40]00[SM]C
> processors are desktop and server processors so there almost
> fundamentally is no overlap.
Actually, only few people consider s390 as embedded platform. And s390 
is the only exploiter of xip to date.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch] mm: mremap correct rmap accounting
  2007-01-30 15:47                 ` Carsten Otte
@ 2007-01-30 22:04                   ` Hugh Dickins
  2007-01-31 13:51                     ` Carsten Otte
  2007-01-31 13:59                     ` Carsten Otte
  0 siblings, 2 replies; 25+ messages in thread
From: Hugh Dickins @ 2007-01-30 22:04 UTC (permalink / raw)
  To: Carsten Otte
  Cc: Linus Torvalds, Nick Piggin, Linux Memory Management,
	Andrew Morton, Ralf Baechle

On Tue, 30 Jan 2007, Carsten Otte wrote:
> Hugh Dickins wrote:
> > Could make it loop over them all, but a quicker patch would be as
> > below.  I've no idea if the intersection of filemap_xip users and
> > MIPS users is the empty set or more interesting.  But I'd prefer
> > you don't just slam in the patch, better have an opinion from
> > Carsten and/or Nick first.
> Took me some time to catch up on this thread, sorry for that.

No problem.  And sorry for being so slow to reply to you,
there seems to have been an assault on my home connectivity.

I think it's now clear that XIP won't be impacted at all by my
ZERO_PAGE(0) change, and that's the patch Linus should put in
for 2.6.20 (given how much he disliked Nick's patch to maintain
the different zeropage counts across mremap move).  Ah, good,
that's now gone into his tree since last I looked.

> Yea, I think xip can be implemented correctly that it works on
> mips when we loop over all zero pages on unmap. Let me try to
> come up with a patch for that.

Sorry, no, I didn't mean for you to try that, and there appears
to be no need for it at all, for the foreseeable future anyway:
please don't waste your time on that.

But there is a change which I now think you do need to make,
for 2.6.21 - let it not distract attention from the pagecount
correctness issue we've been discussing so far.  Something I
should have noticed when I first looked at your clever use of
the ZERO_PAGE, but have only noticed now.  Doesn't it clash
with the clever use of the ZERO_PAGE when reading /dev/zero
(see read_zero_pagealigned in drivers/char/mem.c)?

Consider two PROT_READ|PROT_WRITE,MAP_PRIVATE mappings of a
four-page hole in a XIP file.  One of them just readfaults the
four pages in (and is given ZERO_PAGE for each), the other has
four pages read from /dev/zero into it (which also maps the
ZERO_PAGE into its four ptes).

Then imagine that non-zero data is written to the first page of
that hole, by a write syscall, or through a PROT_WRITE,MAP_SHARED
mapping.  __xip_unmap will replace the first ZERO_PAGE in each of
the MAP_PRIVATE mappings by the new non-zero data page.  Which is
correct for the first mapping which just did readfaults, but wrong
for the second mapping which has overwritten by reading /dev/zero
- those pages ought to remain zeroed, never seeing the later data.

I've never much liked the read_zero_pagealigned cleverness, seems
"too clever by half" as we say, and this overlooked clash shows why.
But I'm also scared to remove any long-established optimization(?),
for fear of impacting some unknown workload.

So, if you're to retain your share-one-page-for-holes cleverness,
I think you need to switch over to allocating a page of your own
for it, instead of using the ZERO_PAGE.

Or have I got it wrong?  A simple test should show.

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch] mm: mremap correct rmap accounting
  2007-01-30 22:04                   ` Hugh Dickins
@ 2007-01-31 13:51                     ` Carsten Otte
  2007-01-31 13:59                     ` Carsten Otte
  1 sibling, 0 replies; 25+ messages in thread
From: Carsten Otte @ 2007-01-31 13:51 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Carsten Otte, Linus Torvalds, Nick Piggin,
	Linux Memory Management, Andrew Morton, Ralf Baechle

Hugh Dickins wrote:
> I think it's now clear that XIP won't be impacted at all by my
> ZERO_PAGE(0) change, and that's the patch Linus should put in
> for 2.6.20 (given how much he disliked Nick's patch to maintain
> the different zeropage counts across mremap move).  Ah, good,
> that's now gone into his tree since last I looked.
Good to know this issue is solved. Thank you.

> But there is a change which I now think you do need to make,
> for 2.6.21 - let it not distract attention from the pagecount
> correctness issue we've been discussing so far.  Something I
> should have noticed when I first looked at your clever use of
> the ZERO_PAGE, but have only noticed now.  Doesn't it clash
> with the clever use of the ZERO_PAGE when reading /dev/zero
> (see read_zero_pagealigned in drivers/char/mem.c)?
> 
> Consider two PROT_READ|PROT_WRITE,MAP_PRIVATE mappings of a
> four-page hole in a XIP file.  One of them just readfaults the
> four pages in (and is given ZERO_PAGE for each), the other has
> four pages read from /dev/zero into it (which also maps the
> ZERO_PAGE into its four ptes).
> 
> Then imagine that non-zero data is written to the first page of
> that hole, by a write syscall, or through a PROT_WRITE,MAP_SHARED
> mapping.  __xip_unmap will replace the first ZERO_PAGE in each of
> the MAP_PRIVATE mappings by the new non-zero data page.  Which is
> correct for the first mapping which just did readfaults, but wrong
> for the second mapping which has overwritten by reading /dev/zero
> - those pages ought to remain zeroed, never seeing the later data.
Wait a second, I fail to see why those pages in the second mapping 
should remain zeroed and never see the later data; they do reflect the 
file content that was mmap()ed there correctly at all times. I would 
expect this behavior, and would expect to see both mappings reflect 
this data. Am I missing something?

> Or have I got it wrong?  A simple test should show.
The test is simple indeed, will do that - and compare page cache 
backed file behavior with xip file.

Carsten

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch] mm: mremap correct rmap accounting
  2007-01-30 22:04                   ` Hugh Dickins
  2007-01-31 13:51                     ` Carsten Otte
@ 2007-01-31 13:59                     ` Carsten Otte
  2007-01-31 16:31                       ` Hugh Dickins
  1 sibling, 1 reply; 25+ messages in thread
From: Carsten Otte @ 2007-01-31 13:59 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Carsten Otte, Linus Torvalds, Nick Piggin,
	Linux Memory Management, Andrew Morton, Ralf Baechle

Hugh Dickins wrote:
> But there is a change which I now think you do need to make,
> for 2.6.21 - let it not distract attention from the pagecount
> correctness issue we've been discussing so far.  Something I
> should have noticed when I first looked at your clever use of
> the ZERO_PAGE, but have only noticed now.  Doesn't it clash
> with the clever use of the ZERO_PAGE when reading /dev/zero
> (see read_zero_pagealigned in drivers/char/mem.c)?
> 
> Consider two PROT_READ|PROT_WRITE,MAP_PRIVATE mappings of a
> four-page hole in a XIP file.  One of them just readfaults the
> four pages in (and is given ZERO_PAGE for each), the other has
> four pages read from /dev/zero into it (which also maps the
> ZERO_PAGE into its four ptes).
> 
> Then imagine that non-zero data is written to the first page of
> that hole, by a write syscall, or through a PROT_WRITE,MAP_SHARED
> mapping.  __xip_unmap will replace the first ZERO_PAGE in each of
> the MAP_PRIVATE mappings by the new non-zero data page.  Which is
> correct for the first mapping which just did readfaults, but wrong
> for the second mapping which has overwritten by reading /dev/zero
> - those pages ought to remain zeroed, never seeing the later data.
Nasty. I got your point now, but as far as I can see we're still in-spec:

MAP_PRIVATE
Create a private copy-on-write mapping.  Stores to the region do not
affect the original file.  It is unspecified whether changes made to
the file after the mmap call are visible in the mapped region.

A fix could be to use my own empty page instead of the ZERO_PAGE for 
xip. At least we have different behavior with/without xip here, 
therefore I agree that this requires fixing.

Carsten

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch] mm: mremap correct rmap accounting
  2007-01-31 13:59                     ` Carsten Otte
@ 2007-01-31 16:31                       ` Hugh Dickins
  2007-02-01 16:21                         ` Carsten Otte
  0 siblings, 1 reply; 25+ messages in thread
From: Hugh Dickins @ 2007-01-31 16:31 UTC (permalink / raw)
  To: Carsten Otte
  Cc: Linus Torvalds, Nick Piggin, Linux Memory Management,
	Andrew Morton, Ralf Baechle

On Wed, 31 Jan 2007, Carsten Otte wrote:
> Nasty. I got your point now, but as far as I can see we're still in-spec:
> 
> MAP_PRIVATE
> Create a private copy-on-write mapping.  Stores to the region do not
> affect the original file.  It is unspecified whether changes made to
> the file after the mmap call are visible in the mapped region.

I agree that last sentence _appears_ to give you a let out.  I believe
its intention is to address the case where one page has been faulted
in and written to by the app, the next page is unfaulted then modified
by some other means and then faulted into the app for the first time:
that page will contain the mods made to the underlying object, even
though they were made after the private copy of the previous page was
taken (which copy should never show later mods to the underlying object).
Whereas if the mapping were mmap'ed with MAP_LOCKED (or mlocked), all
pages would be faulted in immediately, and subsequent mods to the
underlying object never seen at all.

Whatever the wording, I don't know of any application which is happy
for the modifications it makes to a MAP_PRIVATE mapping to disappear
without warning - except when it actually asks for that behaviour by
calling madvise(start, len, MADV_DONTNEED).

> 
> A fix could be to use my own empty page instead of the ZERO_PAGE for xip.

Exactly, that's what I was meaning.  Just take care to do enough
but not too much locking when allocating that page: for example
	if (!our_zero_page) {
		allocate a zeroed page;
		spin_lock;
		if (!our_zero_page)
			our_zero_page = page just allocated;
		else
			free page just allocted;
		spin_unlock;
	}

> At least we have different behavior with/without xip here, therefore I
> agree that this requires fixing.

Yes, if your testing shows that it really does behave as I suspect.

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch] mm: mremap correct rmap accounting
  2007-01-31 16:31                       ` Hugh Dickins
@ 2007-02-01 16:21                         ` Carsten Otte
  0 siblings, 0 replies; 25+ messages in thread
From: Carsten Otte @ 2007-02-01 16:21 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Carsten Otte, Linus Torvalds, Nick Piggin,
	Linux Memory Management, Andrew Morton, Ralf Baechle

Hugh Dickins wrote:
 > I agree that last sentence _appears_ to give you a let out.  I
 > believe its intention is to address the case where one page has been
 > faulted in and written to by the app, the next page is unfaulted
 > then modified by some other means and then faulted into the app for
 > the first time:
 > that page will contain the mods made to the underlying object, even
 > though they were made after the private copy of the previous page
 > was taken (which copy should never show later mods to the underlying
 > object).
 > Whereas if the mapping were mmap'ed with MAP_LOCKED (or mlocked),
 > all pages would be faulted in immediately, and subsequent mods to
 > the underlying object never seen at all.
I see. Makes sense to interpret the spec that way.

 > Whatever the wording, I don't know of any application which is happy
 > for the modifications it makes to a MAP_PRIVATE mapping to disappear
 > without warning - except when it actually asks for that behaviour by
 > calling madvise(start, len, MADV_DONTNEED).
I am not happy with any visibility of xip to userland via filesystem 
syscalls. It is supposed to be transparent for userspace.

 > Yes, if your testing shows that it really does behave as I suspect.
It does indeed. My little test program shows it:
- ext2 in regular operation:
current state: read-faulted sparse mappings
content of area1:
content of area2:
current state: read zero into area1
content of area1:
content of area2:
current state: write data via sys_write
content of area1:
content of area2: this change was written using sys_write

- ext with xip:
current state: read-faulted sparse mappings
content of area1:
content of area2:
current state: read zero into area1
content of area1:
content of area2:
current state: write data via sys_write
content of area1: this change was written using sys_write
content of area2: this change was written using sys_write

That proves your suspicion. I will submit a fix. Thank you for 
pointing me at it.

Carsten

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2007-02-01 16:21 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-01-23 14:19 [patch] mm: mremap correct rmap accounting Nick Piggin
2007-01-23 20:55 ` Hugh Dickins
2007-01-23 23:49   ` Nick Piggin
2007-01-29  3:31   ` Nick Piggin
2007-01-29  6:40     ` Andrew Morton
2007-01-29  6:57       ` Nick Piggin
2007-01-29 19:08     ` Hugh Dickins
2007-01-29 19:27       ` Linus Torvalds
2007-01-29 20:03         ` Andrew Morton
2007-01-29 20:18           ` Linus Torvalds
2007-01-29 21:27             ` Ralf Baechle
2007-01-29 20:10         ` Hugh Dickins
2007-01-29 20:22           ` Linus Torvalds
2007-01-29 20:38             ` Hugh Dickins
2007-01-29 21:24               ` Hugh Dickins
2007-01-30  1:00                 ` Nick Piggin
2007-01-30 14:24                 ` Carsten Otte
2007-01-30 16:41                   ` Ralf Baechle
2007-01-30 17:35                     ` Carsten Otte
2007-01-30 15:47                 ` Carsten Otte
2007-01-30 22:04                   ` Hugh Dickins
2007-01-31 13:51                     ` Carsten Otte
2007-01-31 13:59                     ` Carsten Otte
2007-01-31 16:31                       ` Hugh Dickins
2007-02-01 16:21                         ` Carsten Otte

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.