* [patch][rfc] 0/5: remove PageReserved
@ 2005-06-23  7:05 ` Nick Piggin
  0 siblings, 0 replies; 32+ messages in thread
From: Nick Piggin @ 2005-06-23  7:05 UTC (permalink / raw)
  To: linux-kernel, Linux Memory Management; +Cc: Hugh Dickins, Badari Pulavarty

Hi,
The following set of patches removes PageReserved from core kernel
code, and clears the way for the page flag to be removed completely
once all arch/ code stops using it. Driver conversions are mostly
trivial, but will need auditing.

Arch maintainers and driver writers will need to help get that done.
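
To give an idea of what a typical driver conversion looks like, here is a
minimal sketch (the function and names are hypothetical, not taken from this
series): instead of relying on PageReserved to keep its pages away from the
VM, a driver marks the whole mapping VM_RESERVED in its mmap handler:

static int mydrv_mmap(struct file *file, struct vm_area_struct *vma)
{
	/* Tell the core VM these pages are driver-managed: no rmap,
	 * no refcount-based freeing, leave them alone on unmap. */
	vma->vm_flags |= VM_RESERVED;

	/* ... then install the ptes, e.g. via remap_pfn_range() or by
	 * returning the pages from a ->nopage() handler ... */
	return 0;
}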

Actually, only patches 4 and 5 are really required - the first 3 are
very minor things I noticed along the way (but I'm putting them in
this series because they clash with the later patches).

Not quite ready for merging yet, although it probably will be after the
next round of comments. It boots and runs on i386, ppc64 and ia64; it
has not been tested elsewhere.

-- 
SUSE Labs, Novell Inc.

* [patch][rfc] 1/5: comment for mm/rmap.c
  2005-06-23  7:05 ` Nick Piggin
@ 2005-06-23  7:06 ` Nick Piggin
  2005-06-23  7:06   ` [patch][rfc] 2/5: micro optimisation " Nick Piggin
  -1 siblings, 1 reply; 32+ messages in thread
From: Nick Piggin @ 2005-06-23  7:06 UTC (permalink / raw)
  To: linux-kernel, Linux Memory Management; +Cc: Hugh Dickins, Badari Pulavarty

[-- Attachment #1: Type: text/plain, Size: 4 bytes --]

1/5

[-- Attachment #2: mm-comment-rmap.patch --]
[-- Type: text/plain, Size: 709 bytes --]

Make it clear that VM_RESERVED pages reaching this point are a bug;
the test is not there because they are expected.

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c
+++ linux-2.6/mm/rmap.c
@@ -532,6 +532,8 @@ static int try_to_unmap_one(struct page 
 	 * If the page is mlock()d, we cannot swap it out.
 	 * If it's recently referenced (perhaps page_referenced
 	 * skipped over this mm) then we should reactivate it.
+	 *
+	 * Pages belonging to VM_RESERVED regions should not happen here.
 	 */
 	if ((vma->vm_flags & (VM_LOCKED|VM_RESERVED)) ||
 			ptep_clear_flush_young(vma, address, pte)) {

* [patch][rfc] 2/5: micro optimisation for mm/rmap.c
  2005-06-23  7:06 ` [patch][rfc] 1/5: comment for mm/rmap.c Nick Piggin
@ 2005-06-23  7:06   ` Nick Piggin
  2005-06-23  7:07     ` [patch][rfc] 3/5: remove atomic bitop when freeing page Nick Piggin
  2005-06-23  7:26       ` William Lee Irwin III
  0 siblings, 2 replies; 32+ messages in thread
From: Nick Piggin @ 2005-06-23  7:06 UTC (permalink / raw)
  To: linux-kernel, Linux Memory Management; +Cc: Hugh Dickins, Badari Pulavarty

[-- Attachment #1: Type: text/plain, Size: 4 bytes --]

2/5

[-- Attachment #2: mm-microopt-rmap.patch --]
[-- Type: text/plain, Size: 1394 bytes --]

Micro-optimise page_add_anon_rmap: these expressions are needed only in
the taken branch of the if() statement, but the compiler cannot move them
inside the branch because atomic_inc_and_test is a barrier, so move them
by hand.

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c
+++ linux-2.6/mm/rmap.c
@@ -442,22 +442,23 @@ int page_referenced(struct page *page, i
 void page_add_anon_rmap(struct page *page,
 	struct vm_area_struct *vma, unsigned long address)
 {
-	struct anon_vma *anon_vma = vma->anon_vma;
-	pgoff_t index;
-
 	BUG_ON(PageReserved(page));
-	BUG_ON(!anon_vma);
 
 	inc_mm_counter(vma->vm_mm, anon_rss);
 
-	anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
-	index = (address - vma->vm_start) >> PAGE_SHIFT;
-	index += vma->vm_pgoff;
-	index >>= PAGE_CACHE_SHIFT - PAGE_SHIFT;
-
 	if (atomic_inc_and_test(&page->_mapcount)) {
-		page->index = index;
+		struct anon_vma *anon_vma = vma->anon_vma;
+		pgoff_t index;
+
+		BUG_ON(!anon_vma);
+		anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
 		page->mapping = (struct address_space *) anon_vma;
+
+		index = (address - vma->vm_start) >> PAGE_SHIFT;
+		index += vma->vm_pgoff;
+		index >>= PAGE_CACHE_SHIFT - PAGE_SHIFT;
+		page->index = index;
+
 		inc_page_state(nr_mapped);
 	}
 	/* else checking page index and mapping is racy */

* [patch][rfc] 3/5: remove atomic bitop when freeing page
  2005-06-23  7:06   ` [patch][rfc] 2/5: micro optimisation " Nick Piggin
@ 2005-06-23  7:07     ` Nick Piggin
  2005-06-23  7:07       ` [patch][rfc] 4/5: remap ZERO_PAGE mappings Nick Piggin
  2005-06-23  7:26       ` William Lee Irwin III
  1 sibling, 1 reply; 32+ messages in thread
From: Nick Piggin @ 2005-06-23  7:07 UTC (permalink / raw)
  To: linux-kernel, Linux Memory Management; +Cc: Hugh Dickins, Badari Pulavarty

[-- Attachment #1: Type: text/plain, Size: 4 bytes --]

3/5

[-- Attachment #2: mm-remove-atomic-bitop.patch --]
[-- Type: text/plain, Size: 1223 bytes --]

This bitop does not need to be atomic because it is performed when
there are no remaining references to the page (i.e. the page is being
freed).
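
For reference, here is a generic C sketch (real architectures implement both
variants in asm) of the difference being exploited:

/* Non-atomic: a plain load/modify/store. Safe only while nobody else can
 * touch page->flags concurrently - true here, since the page is being
 * freed and has no remaining users. */
static inline void __clear_bit(int nr, volatile unsigned long *addr)
{
	unsigned long mask = 1UL << (nr % BITS_PER_LONG);
	unsigned long *p = (unsigned long *)addr + nr / BITS_PER_LONG;

	*p &= ~mask;
}

/* clear_bit() does the same update atomically (e.g. with a lock-prefixed
 * instruction on SMP x86), which is the cost the freeing path no longer
 * needs to pay. */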

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -329,7 +329,7 @@ static inline void free_pages_check(cons
 			1 << PG_writeback )))
 		bad_page(function, page);
 	if (PageDirty(page))
-		ClearPageDirty(page);
+		__ClearPageDirty(page);
 }
 
 /*
Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h
+++ linux-2.6/include/linux/page-flags.h
@@ -194,6 +194,7 @@ extern void __mod_page_state(unsigned lo
 #define SetPageDirty(page)	set_bit(PG_dirty, &(page)->flags)
 #define TestSetPageDirty(page)	test_and_set_bit(PG_dirty, &(page)->flags)
 #define ClearPageDirty(page)	clear_bit(PG_dirty, &(page)->flags)
+#define __ClearPageDirty(page)	__clear_bit(PG_dirty, &(page)->flags)
 #define TestClearPageDirty(page) test_and_clear_bit(PG_dirty, &(page)->flags)
 
 #define SetPageLRU(page)	set_bit(PG_lru, &(page)->flags)

* Re: [patch][rfc] 4/5: remap ZERO_PAGE mappings
  2005-06-23  7:07     ` [patch][rfc] 3/5: remove atomic bitop when freeing page Nick Piggin
@ 2005-06-23  7:07       ` Nick Piggin
  2005-06-23  7:08         ` [patch][rfc] 5/5: core remove PageReserved Nick Piggin
  0 siblings, 1 reply; 32+ messages in thread
From: Nick Piggin @ 2005-06-23  7:07 UTC (permalink / raw)
  To: linux-kernel, Linux Memory Management; +Cc: Hugh Dickins, Badari Pulavarty

[-- Attachment #1: Type: text/plain, Size: 4 bytes --]

4/5

[-- Attachment #2: mm-remap-ZERO_PAGE-mappings.patch --]
[-- Type: text/plain, Size: 1081 bytes --]

Remap ZERO_PAGE ptes when remapping memory. This is currently just an
optimisation for MIPS, the only architecture with multiple zero pages:
it now retains the zero page mapping it needs for good cache performance,
and do_wp_page is now always able to correctly detect and optimise
zero-page COW faults.

In future, this becomes required in order to always be able to detect
whether a pte points to a ZERO_PAGE using only the (pte, vaddr) pair.
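
For illustration, the test the series wants to be able to make from just the
pte and the virtual address could be wrapped up as below. The helper is
hypothetical (not part of the patch); it simply names the check the mremap
hunk performs inline:

static inline int pte_maps_zero_page(pte_t pte, unsigned long vaddr)
{
	return pfn_valid(pte_pfn(pte)) &&
		pte_page(pte) == ZERO_PAGE(vaddr);
}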

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/mm/mremap.c
===================================================================
--- linux-2.6.orig/mm/mremap.c
+++ linux-2.6/mm/mremap.c
@@ -141,6 +141,10 @@ move_one_page(struct vm_area_struct *vma
 			if (dst) {
 				pte_t pte;
 				pte = ptep_clear_flush(vma, old_addr, src);
+				/* ZERO_PAGE can be dependent on virtual addr */
+				if (pfn_valid(pte_pfn(pte)) &&
+					pte_page(pte) == ZERO_PAGE(old_addr))
+					pte = pte_wrprotect(mk_pte(ZERO_PAGE(new_addr), new_vma->vm_page_prot));
 				set_pte_at(mm, new_addr, dst, pte);
 			} else
 				error = -ENOMEM;

* [patch][rfc] 5/5: core remove PageReserved
  2005-06-23  7:07       ` [patch][rfc] 4/5: remap ZERO_PAGE mappings Nick Piggin
@ 2005-06-23  7:08         ` Nick Piggin
  2005-06-23  9:51             ` William Lee Irwin III
  0 siblings, 1 reply; 32+ messages in thread
From: Nick Piggin @ 2005-06-23  7:08 UTC (permalink / raw)
  To: linux-kernel, Linux Memory Management; +Cc: Hugh Dickins, Badari Pulavarty

[-- Attachment #1: Type: text/plain, Size: 4 bytes --]

5/5

[-- Attachment #2: mm-core-remove-PageReserved.patch --]
[-- Type: text/plain, Size: 24220 bytes --]

Remove PageReserved() calls from core code by tightening VM_RESERVED
handling in mm/ to cover PageReserved functionality.

PageReserved special casing is removed from get_page and put_page.

All setting and clearing of PageReserved is retained, and it is now
flagged in the page_alloc checks to help ensure we don't introduce
any refcount-based freeing of Reserved pages.

MAP_PRIVATE, PROT_WRITE mappings of VM_RESERVED regions are tentatively
being deprecated. We never handled them completely correctly anyway, and
they are difficult (though not impossible) to handle nicely; support could
be reintroduced in future if required (Hugh has a proof of concept).
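
As a concrete illustration of what is being deprecated (a hypothetical
userspace example - /dev/mydevice stands in for any driver whose mmap handler
marks the vma VM_RESERVED), the private, writable mapping below would now be
refused with EACCES and would log the warning added to mm/mmap.c:

#include <fcntl.h>
#include <sys/mman.h>

int main(void)
{
	int fd = open("/dev/mydevice", O_RDWR);
	void *p = mmap(0, 4096, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE, fd, 0);

	return p == MAP_FAILED;	/* with this series: fails, errno == EACCES */
}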

Once PageReserved() calls are removed from all arch/ and driver code,
the Set and Clear calls, and the PG_reserved bit can be trivially
removed.

Many thanks to Hugh Dickins for input.

Signed-off-by: Nick Piggin <npiggin@suse.de>


Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -156,7 +156,8 @@ extern unsigned int kobjsize(const void 
 
 #define VM_DONTCOPY	0x00020000      /* Do not copy this vma on fork */
 #define VM_DONTEXPAND	0x00040000	/* Cannot expand with mremap() */
-#define VM_RESERVED	0x00080000	/* Don't unmap it from swap_out */
+#define VM_RESERVED	0x00080000	/* Pages in region aren't managed with regular pagecache routines (replaces PageReserved), don't unmap it */
+
 #define VM_ACCOUNT	0x00100000	/* Is a VM accounted object */
 #define VM_HUGETLB	0x00400000	/* Huge TLB Page VM */
 #define VM_NONLINEAR	0x00800000	/* Is non-linear (remap_file_pages) */
@@ -337,7 +338,7 @@ static inline void get_page(struct page 
 
 static inline void put_page(struct page *page)
 {
-	if (!PageReserved(page) && put_page_testzero(page))
+	if (put_page_testzero(page))
 		__page_cache_release(page);
 }
 
@@ -618,6 +619,7 @@ void install_arg_page(struct vm_area_str
 
 int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, unsigned long start,
 		int len, int write, int force, struct page **pages, struct vm_area_struct **vmas);
+void print_invalid_pfn(const char *, pte_t, unsigned long, unsigned long);
 
 int __set_page_dirty_buffers(struct page *page);
 int __set_page_dirty_nobuffers(struct page *page);
Index: linux-2.6/mm/madvise.c
===================================================================
--- linux-2.6.orig/mm/madvise.c
+++ linux-2.6/mm/madvise.c
@@ -122,7 +122,7 @@ static long madvise_dontneed(struct vm_a
 			     unsigned long start, unsigned long end)
 {
 	*prev = vma;
-	if ((vma->vm_flags & VM_LOCKED) || is_vm_hugetlb_page(vma))
+	if ((vma->vm_flags & (VM_LOCKED|VM_RESERVED)) || is_vm_hugetlb_page(vma))
 		return -EINVAL;
 
 	if (unlikely(vma->vm_flags & VM_NONLINEAR)) {
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -333,6 +333,21 @@ out:
 }
 
 /*
+ * This function is called to print an error when a pte in a
+ * !VM_RESERVED region is found pointing to an invalid pfn (which
+ * should never happen).
+ *
+ * The calling function must still handle the error.
+ */
+void print_invalid_pfn(const char *errfunc, pte_t pte,
+				unsigned long vm_flags, unsigned long vaddr)
+{
+	printk(KERN_ERR "%s: pte does not point to valid memory. "
+		"process = %s, pte = %08lx, vm_flags = %lx, vaddr = %lx\n",
+		errfunc, current->comm, (long)pte_val(pte), vm_flags, vaddr);
+}
+
+/*
  * copy one vm_area from one task to the other. Assumes the page tables
  * already present in the new task to be cleared in the whole range
  * covered by this vma.
@@ -361,25 +376,29 @@ copy_one_pte(struct mm_struct *dst_mm, s
 				spin_unlock(&mmlist_lock);
 			}
 		}
-		set_pte_at(dst_mm, addr, dst_pte, pte);
-		return;
+		goto out_set_pte;
 	}
 
+	/* If the region is VM_RESERVED, the mapping is not
+	 * mapped via rmap - duplicate the pte as is.
+	 */
+	if (vm_flags & VM_RESERVED)
+		goto out_set_pte;
+
+	/* If the pte points outside of valid memory but
+	 * the region is not VM_RESERVED, we have a problem.
+	 */
 	pfn = pte_pfn(pte);
-	/* the pte points outside of valid memory, the
-	 * mapping is assumed to be good, meaningful
-	 * and not mapped via rmap - duplicate the
-	 * mapping as is.
-	 */
-	page = NULL;
-	if (pfn_valid(pfn))
-		page = pfn_to_page(pfn);
-
-	if (!page || PageReserved(page)) {
-		set_pte_at(dst_mm, addr, dst_pte, pte);
-		return;
+	if (unlikely(!pfn_valid(pfn))) {
+		print_invalid_pfn("copy_one_pte", pte, vm_flags, addr);
+		goto out_set_pte; /* try to do something sane */
 	}
 
+	page = pfn_to_page(pfn);
+	/* Mappings to zero pages aren't covered by rmap either. */
+	if (page == ZERO_PAGE(addr))
+		goto out_set_pte;
+
 	/*
 	 * If it's a COW mapping, write protect it both
 	 * in the parent and the child
@@ -400,8 +419,9 @@ copy_one_pte(struct mm_struct *dst_mm, s
 	inc_mm_counter(dst_mm, rss);
 	if (PageAnon(page))
 		inc_mm_counter(dst_mm, anon_rss);
-	set_pte_at(dst_mm, addr, dst_pte, pte);
 	page_dup_rmap(page);
+out_set_pte:
+	set_pte_at(dst_mm, addr, dst_pte, pte);
 }
 
 static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
@@ -514,7 +534,8 @@ int copy_page_range(struct mm_struct *ds
 	return 0;
 }
 
-static void zap_pte_range(struct mmu_gather *tlb, pmd_t *pmd,
+static void zap_pte_range(struct mmu_gather *tlb,
+				struct vm_area_struct *vma, pmd_t *pmd,
 				unsigned long addr, unsigned long end,
 				struct zap_details *details)
 {
@@ -528,10 +549,15 @@ static void zap_pte_range(struct mmu_gat
 		if (pte_present(ptent)) {
 			struct page *page = NULL;
 			unsigned long pfn = pte_pfn(ptent);
-			if (pfn_valid(pfn)) {
-				page = pfn_to_page(pfn);
-				if (PageReserved(page))
-					page = NULL;
+			if (!(vma->vm_flags & VM_RESERVED)) {
+				if (unlikely(!pfn_valid(pfn))) {
+					print_invalid_pfn("zap_pte_range",
+						 ptent, vma->vm_flags, addr);
+				} else {
+					page = pfn_to_page(pfn);
+					if (page == ZERO_PAGE(addr))
+						page = NULL;
+				}
 			}
 			if (unlikely(details) && page) {
 				/*
@@ -584,7 +610,8 @@ static void zap_pte_range(struct mmu_gat
 	pte_unmap(pte - 1);
 }
 
-static inline void zap_pmd_range(struct mmu_gather *tlb, pud_t *pud,
+static inline void zap_pmd_range(struct mmu_gather *tlb,
+				struct vm_area_struct *vma, pud_t *pud,
 				unsigned long addr, unsigned long end,
 				struct zap_details *details)
 {
@@ -596,11 +623,12 @@ static inline void zap_pmd_range(struct 
 		next = pmd_addr_end(addr, end);
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
-		zap_pte_range(tlb, pmd, addr, next, details);
+		zap_pte_range(tlb, vma, pmd, addr, next, details);
 	} while (pmd++, addr = next, addr != end);
 }
 
-static inline void zap_pud_range(struct mmu_gather *tlb, pgd_t *pgd,
+static inline void zap_pud_range(struct mmu_gather *tlb,
+				struct vm_area_struct *vma, pgd_t *pgd,
 				unsigned long addr, unsigned long end,
 				struct zap_details *details)
 {
@@ -612,7 +640,7 @@ static inline void zap_pud_range(struct 
 		next = pud_addr_end(addr, end);
 		if (pud_none_or_clear_bad(pud))
 			continue;
-		zap_pmd_range(tlb, pud, addr, next, details);
+		zap_pmd_range(tlb, vma, pud, addr, next, details);
 	} while (pud++, addr = next, addr != end);
 }
 
@@ -633,7 +661,7 @@ static void unmap_page_range(struct mmu_
 		next = pgd_addr_end(addr, end);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
-		zap_pud_range(tlb, pgd, addr, next, details);
+		zap_pud_range(tlb, vma, pgd, addr, next, details);
 	} while (pgd++, addr = next, addr != end);
 	tlb_end_vma(tlb, vma);
 }
@@ -980,8 +1008,7 @@ int get_user_pages(struct task_struct *t
 			if (pages) {
 				pages[i] = page;
 				flush_dcache_page(page);
-				if (!PageReserved(page))
-					page_cache_get(page);
+				page_cache_get(page);
 			}
 			if (vmas)
 				vmas[i] = vma;
@@ -1085,8 +1112,7 @@ static int remap_pte_range(struct mm_str
 		return -ENOMEM;
 	do {
 		BUG_ON(!pte_none(*pte));
-		if (!pfn_valid(pfn) || PageReserved(pfn_to_page(pfn)))
-			set_pte_at(mm, addr, pte, pfn_pte(pfn, prot));
+		set_pte_at(mm, addr, pte, pfn_pte(pfn, prot));
 		pfn++;
 	} while (pte++, addr += PAGE_SIZE, addr != end);
 	pte_unmap(pte - 1);
@@ -1225,6 +1251,8 @@ static int do_wp_page(struct mm_struct *
 	unsigned long pfn = pte_pfn(pte);
 	pte_t entry;
 
+	BUG_ON(vma->vm_flags & VM_RESERVED);
+
 	if (unlikely(!pfn_valid(pfn))) {
 		/*
 		 * This should really halt the system so it can be debugged or
@@ -1232,9 +1260,8 @@ static int do_wp_page(struct mm_struct *
 		 * data, but for the moment just pretend this is OOM.
 		 */
 		pte_unmap(page_table);
-		printk(KERN_ERR "do_wp_page: bogus page at address %08lx\n",
-				address);
 		spin_unlock(&mm->page_table_lock);
+		print_invalid_pfn("do_wp_page", pte, vma->vm_flags, address);
 		return VM_FAULT_OOM;
 	}
 	old_page = pfn_to_page(pfn);
@@ -1259,13 +1286,16 @@ static int do_wp_page(struct mm_struct *
 	/*
 	 * Ok, we need to copy. Oh, well..
 	 */
-	if (!PageReserved(old_page))
+	if (old_page == ZERO_PAGE(address))
+		old_page = NULL;
+	else
 		page_cache_get(old_page);
+
 	spin_unlock(&mm->page_table_lock);
 
 	if (unlikely(anon_vma_prepare(vma)))
 		goto no_new_page;
-	if (old_page == ZERO_PAGE(address)) {
+	if (old_page == NULL) {
 		new_page = alloc_zeroed_user_highpage(vma, address);
 		if (!new_page)
 			goto no_new_page;
@@ -1281,12 +1311,13 @@ static int do_wp_page(struct mm_struct *
 	spin_lock(&mm->page_table_lock);
 	page_table = pte_offset_map(pmd, address);
 	if (likely(pte_same(*page_table, pte))) {
-		if (PageAnon(old_page))
-			dec_mm_counter(mm, anon_rss);
-		if (PageReserved(old_page))
+		if (old_page == NULL)
 			inc_mm_counter(mm, rss);
-		else
+		else {
 			page_remove_rmap(old_page);
+			if (PageAnon(old_page))
+				dec_mm_counter(mm, anon_rss);
+		}
 		flush_cache_page(vma, address, pfn);
 		break_cow(vma, new_page, address, page_table);
 		lru_cache_add_active(new_page);
@@ -1296,13 +1327,16 @@ static int do_wp_page(struct mm_struct *
 		new_page = old_page;
 	}
 	pte_unmap(page_table);
-	page_cache_release(new_page);
-	page_cache_release(old_page);
 	spin_unlock(&mm->page_table_lock);
+	if (old_page) {
+		page_cache_release(new_page);
+		page_cache_release(old_page);
+	}
 	return VM_FAULT_MINOR;
 
 no_new_page:
-	page_cache_release(old_page);
+	if (old_page)
+		page_cache_release(old_page);
 	return VM_FAULT_OOM;
 }
 
@@ -1739,7 +1773,7 @@ do_anonymous_page(struct mm_struct *mm, 
 	struct page * page = ZERO_PAGE(addr);
 
 	/* Read-only mapping of ZERO_PAGE. */
-	entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot));
+	entry = pte_wrprotect(mk_pte(page, vma->vm_page_prot));
 
 	/* ..except if it's a write access */
 	if (write_access) {
@@ -1878,9 +1912,6 @@ retry:
 	 */
 	/* Only go through if we didn't race with anybody else... */
 	if (pte_none(*page_table)) {
-		if (!PageReserved(new_page))
-			inc_mm_counter(mm, rss);
-
 		flush_icache_page(vma, new_page);
 		entry = mk_pte(new_page, vma->vm_page_prot);
 		if (write_access)
@@ -1889,8 +1920,10 @@ retry:
 		if (anon) {
 			lru_cache_add_active(new_page);
 			page_add_anon_rmap(new_page, vma, address);
-		} else
+		} else if (!(vma->vm_flags & VM_RESERVED)) {
 			page_add_file_rmap(new_page);
+			inc_mm_counter(mm, rss);
+		}
 		pte_unmap(page_table);
 	} else {
 		/* One of our sibling threads was faster, back out. */
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -113,7 +113,8 @@ static void bad_page(const char *functio
 			1 << PG_reclaim |
 			1 << PG_slab    |
 			1 << PG_swapcache |
-			1 << PG_writeback);
+			1 << PG_writeback |
+			1 << PG_reserved );
 	set_page_count(page, 0);
 	reset_page_mapcount(page);
 	page->mapping = NULL;
@@ -243,7 +244,6 @@ static inline int page_is_buddy(struct p
 {
        if (PagePrivate(page)           &&
            (page_order(page) == order) &&
-           !PageReserved(page)         &&
             page_count(page) == 0)
                return 1;
        return 0;
@@ -326,7 +326,8 @@ static inline void free_pages_check(cons
 			1 << PG_reclaim	|
 			1 << PG_slab	|
 			1 << PG_swapcache |
-			1 << PG_writeback )))
+			1 << PG_writeback |
+			1 << PG_reserved )))
 		bad_page(function, page);
 	if (PageDirty(page))
 		__ClearPageDirty(page);
@@ -454,7 +455,8 @@ static void prep_new_page(struct page *p
 			1 << PG_reclaim	|
 			1 << PG_slab    |
 			1 << PG_swapcache |
-			1 << PG_writeback )))
+			1 << PG_writeback |
+			1 << PG_reserved )))
 		bad_page(__FUNCTION__, page);
 
 	page->flags &= ~(1 << PG_uptodate | 1 << PG_error |
@@ -1017,7 +1019,7 @@ void __pagevec_free(struct pagevec *pvec
 
 fastcall void __free_pages(struct page *page, unsigned int order)
 {
-	if (!PageReserved(page) && put_page_testzero(page)) {
+	if (put_page_testzero(page)) {
 		if (order == 0)
 			free_hot_page(page);
 		else
@@ -1654,7 +1656,7 @@ void __init memmap_init_zone(unsigned lo
 
 	for (page = start; page < (start + size); page++) {
 		set_page_zone(page, NODEZONE(nid, zone));
-		set_page_count(page, 0);
+		set_page_count(page, 1);
 		reset_page_mapcount(page);
 		SetPageReserved(page);
 		INIT_LIST_HEAD(&page->lru);
Index: linux-2.6/mm/swap.c
===================================================================
--- linux-2.6.orig/mm/swap.c
+++ linux-2.6/mm/swap.c
@@ -48,7 +48,7 @@ void put_page(struct page *page)
 		}
 		return;
 	}
-	if (!PageReserved(page) && put_page_testzero(page))
+	if (put_page_testzero(page))
 		__page_cache_release(page);
 }
 EXPORT_SYMBOL(put_page);
@@ -215,7 +215,7 @@ void release_pages(struct page **pages, 
 		struct page *page = pages[i];
 		struct zone *pagezone;
 
-		if (PageReserved(page) || !put_page_testzero(page))
+		if (!put_page_testzero(page))
 			continue;
 
 		pagezone = page_zone(page);
Index: linux-2.6/mm/fremap.c
===================================================================
--- linux-2.6.orig/mm/fremap.c
+++ linux-2.6/mm/fremap.c
@@ -29,18 +29,21 @@ static inline void zap_pte(struct mm_str
 		return;
 	if (pte_present(pte)) {
 		unsigned long pfn = pte_pfn(pte);
+		struct page *page;
 
 		flush_cache_page(vma, addr, pfn);
 		pte = ptep_clear_flush(vma, addr, ptep);
-		if (pfn_valid(pfn)) {
-			struct page *page = pfn_to_page(pfn);
-			if (!PageReserved(page)) {
-				if (pte_dirty(pte))
-					set_page_dirty(page);
-				page_remove_rmap(page);
-				page_cache_release(page);
-				dec_mm_counter(mm, rss);
-			}
+		if (unlikely(!pfn_valid(pfn))) {
+			print_invalid_pfn("zap_pte", pte, vma->vm_flags, addr);
+			return;
+		}
+		page = pfn_to_page(pfn);
+		if (page != ZERO_PAGE(addr)) {
+			if (pte_dirty(pte))
+				set_page_dirty(page);
+			page_remove_rmap(page);
+			dec_mm_counter(mm, rss);
+			page_cache_release(page);
 		}
 	} else {
 		if (!pte_file(pte))
@@ -65,6 +68,8 @@ int install_page(struct mm_struct *mm, s
 	pgd_t *pgd;
 	pte_t pte_val;
 
+	BUG_ON(vma->vm_flags & VM_RESERVED);
+
 	pgd = pgd_offset(mm, addr);
 	spin_lock(&mm->page_table_lock);
 	
@@ -122,6 +127,8 @@ int install_file_pte(struct mm_struct *m
 	pgd_t *pgd;
 	pte_t pte_val;
 
+	BUG_ON(vma->vm_flags & VM_RESERVED);
+
 	pgd = pgd_offset(mm, addr);
 	spin_lock(&mm->page_table_lock);
 	
Index: linux-2.6/mm/msync.c
===================================================================
--- linux-2.6.orig/mm/msync.c
+++ linux-2.6/mm/msync.c
@@ -37,11 +37,12 @@ static void sync_pte_range(struct vm_are
 		if (!pte_maybe_dirty(*pte))
 			continue;
 		pfn = pte_pfn(*pte);
-		if (!pfn_valid(pfn))
+		if (unlikely(!pfn_valid(pfn))) {
+			print_invalid_pfn("sync_pte_range", *pte,
+							vma->vm_flags, addr);
 			continue;
+		}
 		page = pfn_to_page(pfn);
-		if (PageReserved(page))
-			continue;
 
 		if (ptep_clear_flush_dirty(vma, addr, pte) ||
 		    page_test_and_clear_dirty(page))
@@ -149,6 +150,9 @@ static int msync_interval(struct vm_area
 	if ((flags & MS_INVALIDATE) && (vma->vm_flags & VM_LOCKED))
 		return -EBUSY;
 
+	if (vma->vm_flags & VM_RESERVED)
+		return -EINVAL;
+
 	if (file && (vma->vm_flags & VM_SHARED)) {
 		filemap_sync(vma, addr, end);
 
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c
+++ linux-2.6/mm/rmap.c
@@ -442,8 +442,6 @@ int page_referenced(struct page *page, i
 void page_add_anon_rmap(struct page *page,
 	struct vm_area_struct *vma, unsigned long address)
 {
-	BUG_ON(PageReserved(page));
-
 	inc_mm_counter(vma->vm_mm, anon_rss);
 
 	if (atomic_inc_and_test(&page->_mapcount)) {
@@ -473,8 +471,7 @@ void page_add_anon_rmap(struct page *pag
 void page_add_file_rmap(struct page *page)
 {
 	BUG_ON(PageAnon(page));
-	if (!pfn_valid(page_to_pfn(page)) || PageReserved(page))
-		return;
+	BUG_ON(!pfn_valid(page_to_pfn(page)));
 
 	if (atomic_inc_and_test(&page->_mapcount))
 		inc_page_state(nr_mapped);
@@ -488,8 +485,6 @@ void page_add_file_rmap(struct page *pag
  */
 void page_remove_rmap(struct page *page)
 {
-	BUG_ON(PageReserved(page));
-
 	if (atomic_add_negative(-1, &page->_mapcount)) {
 		BUG_ON(page_mapcount(page) < 0);
 		/*
@@ -647,13 +642,14 @@ static void try_to_unmap_cluster(unsigne
 			continue;
 
 		pfn = pte_pfn(*pte);
-		if (!pfn_valid(pfn))
+		if (unlikely(!pfn_valid(pfn))) {
+			print_invalid_pfn("try_to_unmap_cluster", *pte,
+							vma->vm_flags, address);
 			continue;
+		}
 
 		page = pfn_to_page(pfn);
 		BUG_ON(PageAnon(page));
-		if (PageReserved(page))
-			continue;
 
 		if (ptep_clear_flush_young(vma, address, pte))
 			continue;
@@ -816,7 +812,6 @@ int try_to_unmap(struct page *page)
 {
 	int ret;
 
-	BUG_ON(PageReserved(page));
 	BUG_ON(!PageLocked(page));
 
 	if (PageAnon(page))
Index: linux-2.6/drivers/scsi/sg.c
===================================================================
--- linux-2.6.orig/drivers/scsi/sg.c
+++ linux-2.6/drivers/scsi/sg.c
@@ -1887,9 +1887,10 @@ st_unmap_user_pages(struct scatterlist *
 	int i;
 
 	for (i=0; i < nr_pages; i++) {
-		if (dirtied && !PageReserved(sgl[i].page))
+		if (dirtied)
 			SetPageDirty(sgl[i].page);
 		/* unlock_page(sgl[i].page); */
+		/* FIXME: XXX don't dirty/unmap VM_RESERVED regions? */
 		/* FIXME: cache flush missing for rw==READ
 		 * FIXME: call the correct reference counting function
 		 */
Index: linux-2.6/drivers/scsi/st.c
===================================================================
--- linux-2.6.orig/drivers/scsi/st.c
+++ linux-2.6/drivers/scsi/st.c
@@ -4435,8 +4435,9 @@ static int sgl_unmap_user_pages(struct s
 	int i;
 
 	for (i=0; i < nr_pages; i++) {
-		if (dirtied && !PageReserved(sgl[i].page))
+		if (dirtied)
 			SetPageDirty(sgl[i].page);
+		/* FIXME: XXX don't dirty/unmap VM_RESERVED regions? */
 		/* FIXME: cache flush missing for rw==READ
 		 * FIXME: call the correct reference counting function
 		 */
Index: linux-2.6/sound/core/pcm_native.c
===================================================================
--- linux-2.6.orig/sound/core/pcm_native.c
+++ linux-2.6/sound/core/pcm_native.c
@@ -2942,8 +2942,7 @@ static struct page * snd_pcm_mmap_status
 		return NOPAGE_OOM;
 	runtime = substream->runtime;
 	page = virt_to_page(runtime->status);
-	if (!PageReserved(page))
-		get_page(page);
+	get_page(page);
 	if (type)
 		*type = VM_FAULT_MINOR;
 	return page;
@@ -2985,8 +2984,7 @@ static struct page * snd_pcm_mmap_contro
 		return NOPAGE_OOM;
 	runtime = substream->runtime;
 	page = virt_to_page(runtime->control);
-	if (!PageReserved(page))
-		get_page(page);
+	get_page(page);
 	if (type)
 		*type = VM_FAULT_MINOR;
 	return page;
@@ -3059,8 +3057,7 @@ static struct page *snd_pcm_mmap_data_no
 		vaddr = runtime->dma_area + offset;
 		page = virt_to_page(vaddr);
 	}
-	if (!PageReserved(page))
-		get_page(page);
+	get_page(page);
 	if (type)
 		*type = VM_FAULT_MINOR;
 	return page;
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c
+++ linux-2.6/mm/mmap.c
@@ -1073,6 +1073,17 @@ munmap_back:
 		error = file->f_op->mmap(file, vma);
 		if (error)
 			goto unmap_and_free_vma;
+		if ((vma->vm_flags & (VM_SHARED | VM_WRITE | VM_RESERVED))
+						== (VM_WRITE | VM_RESERVED)) {
+			printk(KERN_WARNING "program %s is using MAP_PRIVATE, "
+				"PROT_WRITE mmap of VM_RESERVED memory, which "
+				"is deprecated. Please report this to "
+				"linux-kernel@vger.kernel.org\n",current->comm);
+			if (vma->vm_ops && vma->vm_ops->close)
+				vma->vm_ops->close(vma);
+			error = -EACCES;
+			goto unmap_and_free_vma;
+		}
 	} else if (vm_flags & VM_SHARED) {
 		error = shmem_zero_setup(vma);
 		if (error)
Index: linux-2.6/mm/mprotect.c
===================================================================
--- linux-2.6.orig/mm/mprotect.c
+++ linux-2.6/mm/mprotect.c
@@ -131,6 +131,14 @@ mprotect_fixup(struct vm_area_struct *vm
 				return -ENOMEM;
 			newflags |= VM_ACCOUNT;
 		}
+		if (oldflags & VM_RESERVED) {
+			BUG_ON(oldflags & VM_WRITE);
+			printk(KERN_WARNING "program %s is using MAP_PRIVATE, "
+				"PROT_WRITE mprotect of VM_RESERVED memory, "
+				"which is deprecated. Please report this to "
+				"linux-kernel@vger.kernel.org\n",current->comm);
+			return -EACCES;
+		}
 	}
 
 	newprot = protection_map[newflags & 0xf];
Index: linux-2.6/mm/bootmem.c
===================================================================
--- linux-2.6.orig/mm/bootmem.c
+++ linux-2.6/mm/bootmem.c
@@ -286,6 +286,7 @@ static unsigned long __init free_all_boo
 				if (j + 16 < BITS_PER_LONG)
 					prefetchw(page + j + 16);
 				__ClearPageReserved(page + j);
+				set_page_count(page + j, 0);
 			}
 			__free_pages(page, order);
 			i += BITS_PER_LONG;
Index: linux-2.6/mm/mempolicy.c
===================================================================
--- linux-2.6.orig/mm/mempolicy.c
+++ linux-2.6/mm/mempolicy.c
@@ -253,8 +253,10 @@ static int check_pte_range(struct mm_str
 		if (!pte_present(*pte))
 			continue;
 		pfn = pte_pfn(*pte);
-		if (!pfn_valid(pfn))
+		if (unlikely(!pfn_valid(pfn))) {
+			print_invalid_pfn("check_pte_range", *pte, -1UL, addr);
 			continue;
+		}
 		nid = pfn_to_nid(pfn);
 		if (!test_bit(nid, nodes))
 			break;
@@ -326,6 +328,8 @@ check_range(struct mm_struct *mm, unsign
 	first = find_vma(mm, start);
 	if (!first)
 		return ERR_PTR(-EFAULT);
+	if (first->vm_flags & VM_RESERVED)
+		return ERR_PTR(-EACCES);
 	prev = NULL;
 	for (vma = first; vma && vma->vm_start < end; vma = vma->vm_next) {
 		if (!vma->vm_next && vma->vm_end < end)
Index: linux-2.6/kernel/power/swsusp.c
===================================================================
--- linux-2.6.orig/kernel/power/swsusp.c
+++ linux-2.6/kernel/power/swsusp.c
@@ -433,15 +433,12 @@ static int save_highmem_zone(struct zone
 			continue;
 		page = pfn_to_page(pfn);
 		/*
+		 * PageReserved(page) -
 		 * This condition results from rvmalloc() sans vmalloc_32()
 		 * and architectural memory reservations. This should be
 		 * corrected eventually when the cases giving rise to this
 		 * are better understood.
 		 */
-		if (PageReserved(page)) {
-			printk("highmem reserved page?!\n");
-			continue;
-		}
 		BUG_ON(PageNosave(page));
 		if (PageNosaveFree(page))
 			continue;
@@ -527,13 +524,8 @@ static int saveable(struct zone * zone, 
 		return 0;
 
 	page = pfn_to_page(pfn);
-	BUG_ON(PageReserved(page) && PageNosave(page));
 	if (PageNosave(page))
 		return 0;
-	if (PageReserved(page) && pfn_is_nosave(pfn)) {
-		pr_debug("[nosave pfn 0x%lx]", pfn);
-		return 0;
-	}
 	if (PageNosaveFree(page))
 		return 0;
 
Index: linux-2.6/arch/ppc64/kernel/vdso.c
===================================================================
--- linux-2.6.orig/arch/ppc64/kernel/vdso.c
+++ linux-2.6/arch/ppc64/kernel/vdso.c
@@ -175,13 +175,6 @@ static struct page * vdso_vma_nopage(str
 	if (address < vma->vm_start || address > vma->vm_end)
 		return NOPAGE_SIGBUS;
 
-	/*
-	 * Last page is systemcfg, special handling here, no get_page() a
-	 * this is a reserved page
-	 */
-	if ((vma->vm_end - address) <= PAGE_SIZE)
-		return virt_to_page(systemcfg);
-
 	pg = virt_to_page(vbase + offset);
 	get_page(pg);
 	DBG(" ->page count: %d\n", page_count(pg));

* Re: [patch][rfc] 2/5: micro optimisation for mm/rmap.c
  2005-06-23  7:06   ` [patch][rfc] 2/5: micro optimisation " Nick Piggin
@ 2005-06-23  7:26       ` William Lee Irwin III
  1 sibling, 0 replies; 32+ messages in thread
From: William Lee Irwin III @ 2005-06-23  7:26 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-kernel, Linux Memory Management, Hugh Dickins, Badari Pulavarty

On Thu, Jun 23, 2005 at 05:06:35PM +1000, Nick Piggin wrote:
> +		index = (address - vma->vm_start) >> PAGE_SHIFT;
> +		index += vma->vm_pgoff;
> +		index >>= PAGE_CACHE_SHIFT - PAGE_SHIFT;
> +		page->index = index;

linear_page_index()
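
For reference: at the time, linear_page_index() was defined in
include/linux/pagemap.h roughly as follows (it is exactly the computation
open-coded in the quoted hunk):

static inline pgoff_t linear_page_index(struct vm_area_struct *vma,
					unsigned long address)
{
	pgoff_t pgoff = (address - vma->vm_start) >> PAGE_SHIFT;
	pgoff += vma->vm_pgoff;
	return pgoff >> (PAGE_CACHE_SHIFT - PAGE_SHIFT);
}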


-- wli

* Re: [patch][rfc] 2/5: micro optimisation for mm/rmap.c
  2005-06-23  7:26       ` William Lee Irwin III
@ 2005-06-23  7:33       ` Nick Piggin
  -1 siblings, 0 replies; 32+ messages in thread
From: Nick Piggin @ 2005-06-23  7:33 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: linux-kernel, Linux Memory Management, Hugh Dickins, Badari Pulavarty

[-- Attachment #1: Type: text/plain, Size: 347 bytes --]

William Lee Irwin III wrote:
> On Thu, Jun 23, 2005 at 05:06:35PM +1000, Nick Piggin wrote:
> 
>>+		index = (address - vma->vm_start) >> PAGE_SHIFT;
>>+		index += vma->vm_pgoff;
>>+		index >>= PAGE_CACHE_SHIFT - PAGE_SHIFT;
>>+		page->index = index;
> 
> 
> linear_page_index()
> 

Ah indeed it is, thanks. I'll queue this up as patch 2.5, then?


[-- Attachment #2: mm-cleanup-rmap.patch --]
[-- Type: text/plain, Size: 751 bytes --]

Use linear_page_index in mm/rmap.c, as noted by Bill Irwin.

Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c
+++ linux-2.6/mm/rmap.c
@@ -448,16 +448,11 @@ void page_add_anon_rmap(struct page *pag
 
 	if (atomic_inc_and_test(&page->_mapcount)) {
 		struct anon_vma *anon_vma = vma->anon_vma;
-		pgoff_t index;
-
 		BUG_ON(!anon_vma);
 		anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
 		page->mapping = (struct address_space *) anon_vma;
 
-		index = (address - vma->vm_start) >> PAGE_SHIFT;
-		index += vma->vm_pgoff;
-		index >>= PAGE_CACHE_SHIFT - PAGE_SHIFT;
-		page->index = index;
+		page->index = linear_page_index(vma, address);
 
 		inc_page_state(nr_mapped);
 	}

* Re: [patch][rfc] 5/5: core remove PageReserved
  2005-06-23  7:08         ` [patch][rfc] 5/5: core remove PageReserved Nick Piggin
@ 2005-06-23  9:51             ` William Lee Irwin III
  0 siblings, 0 replies; 32+ messages in thread
From: William Lee Irwin III @ 2005-06-23  9:51 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-kernel, Linux Memory Management, Hugh Dickins, Badari Pulavarty

On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote:
> @@ -337,7 +338,7 @@ static inline void get_page(struct page 
>  static inline void put_page(struct page *page)
>  {
> -	if (!PageReserved(page) && put_page_testzero(page))
> +	if (put_page_testzero(page))
>  		__page_cache_release(page);
>  }

No sweep before this? I'm not so sure.


On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote:
> @@ -514,7 +534,8 @@ int copy_page_range(struct mm_struct *ds
>  	return 0;
>  }
>  
> -static void zap_pte_range(struct mmu_gather *tlb, pmd_t *pmd,
> +static void zap_pte_range(struct mmu_gather *tlb,
> +				struct vm_area_struct *vma, pmd_t *pmd,
>  				unsigned long addr, unsigned long end,
>  				struct zap_details *details)
>  {

As exciting as this is, !!(vma->vm_flags & VM_RESERVED) could trivially
go into struct zap_details without excess args or diff.


On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote:
> @@ -1225,6 +1251,8 @@ static int do_wp_page(struct mm_struct *
>  	unsigned long pfn = pte_pfn(pte);
>  	pte_t entry;
>  
> +	BUG_ON(vma->vm_flags & VM_RESERVED);
> +
>  	if (unlikely(!pfn_valid(pfn))) {
>  		/*
>  		 * This should really halt the system so it can be debugged or

!pfn_valid(pfn) is banned when !(vma->vm_flags & VM_RESERVED); here we
BUG_ON the precondition then handle !pfn_valid(pfn) in the old manner
where some new infrastructure has been erected for reporting such errors.


On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote:
> @@ -1259,13 +1286,16 @@ static int do_wp_page(struct mm_struct *
>  	/*
>  	 * Ok, we need to copy. Oh, well..
>  	 */
> -	if (!PageReserved(old_page))
> +	if (old_page == ZERO_PAGE(address))
> +		old_page = NULL;
> +	else
>  		page_cache_get(old_page);
> +
>  	spin_unlock(&mm->page_table_lock);
>  
>  	if (unlikely(anon_vma_prepare(vma)))
>  		goto no_new_page;
> -	if (old_page == ZERO_PAGE(address)) {
> +	if (old_page == NULL) {
>  		new_page = alloc_zeroed_user_highpage(vma, address);
>  		if (!new_page)
>  			goto no_new_page;

There are some micro-optimizations mixed in with this and some
subsequent do_wp_page() alterations.


On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote:
> @@ -1654,7 +1656,7 @@ void __init memmap_init_zone(unsigned lo
>  
>  	for (page = start; page < (start + size); page++) {
>  		set_page_zone(page, NODEZONE(nid, zone));
> -		set_page_count(page, 0);
> +		set_page_count(page, 1);
>  		reset_page_mapcount(page);
>  		SetPageReserved(page);
>  		INIT_LIST_HEAD(&page->lru);

The refcounting and PG_reserved activity in memmap_init_zone() is
superfluous. bootmem.c does all the necessary accounting internally.


On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote:
> Index: linux-2.6/mm/fremap.c
> ===================================================================
> --- linux-2.6.orig/mm/fremap.c
> +++ linux-2.6/mm/fremap.c
> @@ -29,18 +29,21 @@ static inline void zap_pte(struct mm_str
>  		return;
>  	if (pte_present(pte)) {
>  		unsigned long pfn = pte_pfn(pte);
> +		struct page *page;
>  
>  		flush_cache_page(vma, addr, pfn);
>  		pte = ptep_clear_flush(vma, addr, ptep);
> -		if (pfn_valid(pfn)) {
> -			struct page *page = pfn_to_page(pfn);
> -			if (!PageReserved(page)) {
> -				if (pte_dirty(pte))
> -					set_page_dirty(page);
> -				page_remove_rmap(page);
> -				page_cache_release(page);
> -				dec_mm_counter(mm, rss);
> -			}
> +		if (unlikely(!pfn_valid(pfn))) {
> +			print_invalid_pfn("zap_pte", pte, vma->vm_flags, addr);
> +			return;
> +		}
> +		page = pfn_to_page(pfn);
> +		if (page != ZERO_PAGE(addr)) {
> +			if (pte_dirty(pte))
> +				set_page_dirty(page);
> +			page_remove_rmap(page);
> +			dec_mm_counter(mm, rss);
> +			page_cache_release(page);
>  		}
>  	} else {
>  		if (!pte_file(pte))

There is no error returned here to be handled by the caller.


On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote:
> @@ -65,6 +68,8 @@ int install_page(struct mm_struct *mm, s
>  	pgd_t *pgd;
>  	pte_t pte_val;
>  
> +	BUG_ON(vma->vm_flags & VM_RESERVED);
> +
>  	pgd = pgd_offset(mm, addr);
>  	spin_lock(&mm->page_table_lock);
>  	
> @@ -122,6 +127,8 @@ int install_file_pte(struct mm_struct *m
>  	pgd_t *pgd;
>  	pte_t pte_val;
>  
> +	BUG_ON(vma->vm_flags & VM_RESERVED);
> +
>  	pgd = pgd_offset(mm, addr);
>  	spin_lock(&mm->page_table_lock);

This has no effect but to artificially constrain the interface. There
is no a priori reason to avoid the use of install_page() to deposit
mappings to normal pages in VM_RESERVED vmas. It's only the reverse
being "banned" here. Similar comments also apply to zap_pte().


On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote:
> Index: linux-2.6/drivers/scsi/sg.c
> ===================================================================
> --- linux-2.6.orig/drivers/scsi/sg.c
> +++ linux-2.6/drivers/scsi/sg.c
> @@ -1887,9 +1887,10 @@ st_unmap_user_pages(struct scatterlist *
>  	int i;
>  
>  	for (i=0; i < nr_pages; i++) {
> -		if (dirtied && !PageReserved(sgl[i].page))
> +		if (dirtied)
>  			SetPageDirty(sgl[i].page);
>  		/* unlock_page(sgl[i].page); */
> +		/* FIXME: XXX don't dirty/unmap VM_RESERVED regions? */
>  		/* FIXME: cache flush missing for rw==READ
>  		 * FIXME: call the correct reference counting function
>  		 */

An answer should be devised for this. My numerous SCSI CD-ROM devices
(I have 5 across several different machines of several different arches)
are rather unlikely to be happy with /* FIXME: XXX ... as an answer.


On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote:
> Index: linux-2.6/drivers/scsi/st.c
> ===================================================================
> --- linux-2.6.orig/drivers/scsi/st.c
> +++ linux-2.6/drivers/scsi/st.c
> @@ -4435,8 +4435,9 @@ static int sgl_unmap_user_pages(struct s
>  	int i;
>  
>  	for (i=0; i < nr_pages; i++) {
> -		if (dirtied && !PageReserved(sgl[i].page))
> +		if (dirtied)
>  			SetPageDirty(sgl[i].page);
> +		/* FIXME: XXX don't dirty/unmap VM_RESERVED regions? */
>  		/* FIXME: cache flush missing for rw==READ
>  		 * FIXME: call the correct reference counting function
>  		 */

Mutatis mutandis for my SCSI tape drive.


On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote:
> Index: linux-2.6/sound/core/pcm_native.c
> ===================================================================
> --- linux-2.6.orig/sound/core/pcm_native.c
> +++ linux-2.6/sound/core/pcm_native.c
> @@ -2942,8 +2942,7 @@ static struct page * snd_pcm_mmap_status
>  		return NOPAGE_OOM;
>  	runtime = substream->runtime;
>  	page = virt_to_page(runtime->status);
> -	if (!PageReserved(page))
> -		get_page(page);
> +	get_page(page);
>  	if (type)
>  		*type = VM_FAULT_MINOR;
>  	return page;

snd_malloc_pages() marks all pages it allocates PG_reserved. This
merits some commentary, and likely the removal of the superfluous
PG_reserved usage.
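
A paraphrased sketch of the behaviour referred to here (not the actual ALSA
source): the allocator sets PG_reserved on every page it hands out, which is
why these nopage handlers used to skip get_page() on them.

static void *snd_malloc_pages_sketch(size_t size, unsigned int gfp_flags)
{
	void *res = (void *)__get_free_pages(gfp_flags, get_order(size));
	struct page *page, *last;

	if (!res)
		return NULL;
	page = virt_to_page(res);
	last = page + (1 << get_order(size));
	for (; page < last; page++)
		SetPageReserved(page);	/* keeps the old refcounting away */
	return res;
}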


On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote:
> Index: linux-2.6/mm/mmap.c
> ===================================================================
> --- linux-2.6.orig/mm/mmap.c
> +++ linux-2.6/mm/mmap.c
> @@ -1073,6 +1073,17 @@ munmap_back:
>  		error = file->f_op->mmap(file, vma);
>  		if (error)
>  			goto unmap_and_free_vma;
> +		if ((vma->vm_flags & (VM_SHARED | VM_WRITE | VM_RESERVED))
> +						== (VM_WRITE | VM_RESERVED)) {
> +			printk(KERN_WARNING "program %s is using MAP_PRIVATE, "
> +				"PROT_WRITE mmap of VM_RESERVED memory, which "
> +				"is deprecated. Please report this to "
> +				"linux-kernel@vger.kernel.org\n",current->comm);
> +			if (vma->vm_ops && vma->vm_ops->close)
> +				vma->vm_ops->close(vma);
> +			error = -EACCES;
> +			goto unmap_and_free_vma;
> +		}
>  	} else if (vm_flags & VM_SHARED) {
>  		error = shmem_zero_setup(vma);
>  		if (error)

This is user-triggerable where the driver honors mmap() protection
flags blindly.


On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote:
> Index: linux-2.6/mm/bootmem.c
> ===================================================================
> --- linux-2.6.orig/mm/bootmem.c
> +++ linux-2.6/mm/bootmem.c
> @@ -286,6 +286,7 @@ static unsigned long __init free_all_boo
>  				if (j + 16 < BITS_PER_LONG)
>  					prefetchw(page + j + 16);
>  				__ClearPageReserved(page + j);
> +				set_page_count(page + j, 0);
>  			}
>  			__free_pages(page, order);
>  			i += BITS_PER_LONG;

ibid re: bootmem


On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote:
> Index: linux-2.6/kernel/power/swsusp.c
> ===================================================================
> --- linux-2.6.orig/kernel/power/swsusp.c
> +++ linux-2.6/kernel/power/swsusp.c
> @@ -433,15 +433,12 @@ static int save_highmem_zone(struct zone
>  			continue;
>  		page = pfn_to_page(pfn);
>  		/*
> +		 * PageReserved(page) -
>  		 * This condition results from rvmalloc() sans vmalloc_32()
>  		 * and architectural memory reservations. This should be
>  		 * corrected eventually when the cases giving rise to this
>  		 * are better understood.
>  		 */
> -		if (PageReserved(page)) {
> -			printk("highmem reserved page?!\n");
> -			continue;
> -		}
>  		BUG_ON(PageNosave(page));
>  		if (PageNosaveFree(page))
>  			continue;

This behavioral change needs to be commented on. It is beneficial in
general, but there are additional difficulties when memory holes are
unintentionally covered by mem_map[]; it is likely to triple-fault on
such holes.


On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote:
> @@ -527,13 +524,8 @@ static int saveable(struct zone * zone, 
>  		return 0;
>  
>  	page = pfn_to_page(pfn);
> -	BUG_ON(PageReserved(page) && PageNosave(page));
>  	if (PageNosave(page))
>  		return 0;
> -	if (PageReserved(page) && pfn_is_nosave(pfn)) {
> -		pr_debug("[nosave pfn 0x%lx]", pfn);
> -		return 0;
> -	}
>  	if (PageNosaveFree(page))
>  		return 0;

The pfn_is_nosave() check must stand barring justification of why
unintentionally saving (and hence restoring) the page is tolerable.


-- wli

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [patch][rfc] 5/5: core remove PageReserved
@ 2005-06-23  9:51             ` William Lee Irwin III
  0 siblings, 0 replies; 32+ messages in thread
From: William Lee Irwin III @ 2005-06-23  9:51 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-kernel, Linux Memory Management, Hugh Dickins, Badari Pulavarty

On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote:
> @@ -337,7 +338,7 @@ static inline void get_page(struct page 
>  static inline void put_page(struct page *page)
>  {
> -	if (!PageReserved(page) && put_page_testzero(page))
> +	if (put_page_testzero(page))
>  		__page_cache_release(page);
>  }

No sweep before this? I'm not so sure.


On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote:
> @@ -514,7 +534,8 @@ int copy_page_range(struct mm_struct *ds
>  	return 0;
>  }
>  
> -static void zap_pte_range(struct mmu_gather *tlb, pmd_t *pmd,
> +static void zap_pte_range(struct mmu_gather *tlb,
> +				struct vm_area_struct *vma, pmd_t *pmd,
>  				unsigned long addr, unsigned long end,
>  				struct zap_details *details)
>  {

As exciting as this is, !!(vma->vm_flags & VM_RESERVED) could trivially
go into struct zap_details without excess args or diff.


On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote:
> @@ -1225,6 +1251,8 @@ static int do_wp_page(struct mm_struct *
>  	unsigned long pfn = pte_pfn(pte);
>  	pte_t entry;
>  
> +	BUG_ON(vma->vm_flags & VM_RESERVED);
> +
>  	if (unlikely(!pfn_valid(pfn))) {
>  		/*
>  		 * This should really halt the system so it can be debugged or

!pfn_valid(pfn) is banned when !(vma->vm_flags & VM_RESERVED); here we
BUG_ON the precondition then handle !pfn_valid(pfn) in the old manner
where some new infrastructure has been erected for reporting such errors.


On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote:
> @@ -1259,13 +1286,16 @@ static int do_wp_page(struct mm_struct *
>  	/*
>  	 * Ok, we need to copy. Oh, well..
>  	 */
> -	if (!PageReserved(old_page))
> +	if (old_page == ZERO_PAGE(address))
> +		old_page = NULL;
> +	else
>  		page_cache_get(old_page);
> +
>  	spin_unlock(&mm->page_table_lock);
>  
>  	if (unlikely(anon_vma_prepare(vma)))
>  		goto no_new_page;
> -	if (old_page == ZERO_PAGE(address)) {
> +	if (old_page == NULL) {
>  		new_page = alloc_zeroed_user_highpage(vma, address);
>  		if (!new_page)
>  			goto no_new_page;

There are some micro-optimizations mixed in with this and some
subsequent do_wp_page() alterations.


On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote:
> @@ -1654,7 +1656,7 @@ void __init memmap_init_zone(unsigned lo
>  
>  	for (page = start; page < (start + size); page++) {
>  		set_page_zone(page, NODEZONE(nid, zone));
> -		set_page_count(page, 0);
> +		set_page_count(page, 1);
>  		reset_page_mapcount(page);
>  		SetPageReserved(page);
>  		INIT_LIST_HEAD(&page->lru);

The refcounting and PG_reserved activity in memmap_init_zone() is
superfluous. bootmem.c does all the necessary accounting internally.


On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote:
> Index: linux-2.6/mm/fremap.c
> ===================================================================
> --- linux-2.6.orig/mm/fremap.c
> +++ linux-2.6/mm/fremap.c
> @@ -29,18 +29,21 @@ static inline void zap_pte(struct mm_str
>  		return;
>  	if (pte_present(pte)) {
>  		unsigned long pfn = pte_pfn(pte);
> +		struct page *page;
>  
>  		flush_cache_page(vma, addr, pfn);
>  		pte = ptep_clear_flush(vma, addr, ptep);
> -		if (pfn_valid(pfn)) {
> -			struct page *page = pfn_to_page(pfn);
> -			if (!PageReserved(page)) {
> -				if (pte_dirty(pte))
> -					set_page_dirty(page);
> -				page_remove_rmap(page);
> -				page_cache_release(page);
> -				dec_mm_counter(mm, rss);
> -			}
> +		if (unlikely(!pfn_valid(pfn))) {
> +			print_invalid_pfn("zap_pte", pte, vma->vm_flags, addr);
> +			return;
> +		}
> +		page = pfn_to_page(pfn);
> +		if (page != ZERO_PAGE(addr)) {
> +			if (pte_dirty(pte))
> +				set_page_dirty(page);
> +			page_remove_rmap(page);
> +			dec_mm_counter(mm, rss);
> +			page_cache_release(page);
>  		}
>  	} else {
>  		if (!pte_file(pte))

There is no error returned here to be handled by the caller.


On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote:
> @@ -65,6 +68,8 @@ int install_page(struct mm_struct *mm, s
>  	pgd_t *pgd;
>  	pte_t pte_val;
>  
> +	BUG_ON(vma->vm_flags & VM_RESERVED);
> +
>  	pgd = pgd_offset(mm, addr);
>  	spin_lock(&mm->page_table_lock);
>  	
> @@ -122,6 +127,8 @@ int install_file_pte(struct mm_struct *m
>  	pgd_t *pgd;
>  	pte_t pte_val;
>  
> +	BUG_ON(vma->vm_flags & VM_RESERVED);
> +
>  	pgd = pgd_offset(mm, addr);
>  	spin_lock(&mm->page_table_lock);

This has no effect but to artificially constrain the interface. There
is no a priori reason to avoid the use of install_page() to deposit
mappings to normal pages in VM_RESERVED vmas. It's only the reverse
being "banned" here. Similar comments also apply to zap_pte().


On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote:
> Index: linux-2.6/drivers/scsi/sg.c
> ===================================================================
> --- linux-2.6.orig/drivers/scsi/sg.c
> +++ linux-2.6/drivers/scsi/sg.c
> @@ -1887,9 +1887,10 @@ st_unmap_user_pages(struct scatterlist *
>  	int i;
>  
>  	for (i=0; i < nr_pages; i++) {
> -		if (dirtied && !PageReserved(sgl[i].page))
> +		if (dirtied)
>  			SetPageDirty(sgl[i].page);
>  		/* unlock_page(sgl[i].page); */
> +		/* FIXME: XXX don't dirty/unmap VM_RESERVED regions? */
>  		/* FIXME: cache flush missing for rw==READ
>  		 * FIXME: call the correct reference counting function
>  		 */

An answer should be devised for this. My numerous SCSI CD-ROM devices
(I have 5 across several different machines of several different arches)
are rather unlikely to be happy with /* FIXME: XXX ... as an answer.


On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote:
> Index: linux-2.6/drivers/scsi/st.c
> ===================================================================
> --- linux-2.6.orig/drivers/scsi/st.c
> +++ linux-2.6/drivers/scsi/st.c
> @@ -4435,8 +4435,9 @@ static int sgl_unmap_user_pages(struct s
>  	int i;
>  
>  	for (i=0; i < nr_pages; i++) {
> -		if (dirtied && !PageReserved(sgl[i].page))
> +		if (dirtied)
>  			SetPageDirty(sgl[i].page);
> +		/* FIXME: XXX don't dirty/unmap VM_RESERVED regions? */
>  		/* FIXME: cache flush missing for rw==READ
>  		 * FIXME: call the correct reference counting function
>  		 */

Mutatis mutandis for my SCSI tape drive.


On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote:
> Index: linux-2.6/sound/core/pcm_native.c
> ===================================================================
> --- linux-2.6.orig/sound/core/pcm_native.c
> +++ linux-2.6/sound/core/pcm_native.c
> @@ -2942,8 +2942,7 @@ static struct page * snd_pcm_mmap_status
>  		return NOPAGE_OOM;
>  	runtime = substream->runtime;
>  	page = virt_to_page(runtime->status);
> -	if (!PageReserved(page))
> -		get_page(page);
> +	get_page(page);
>  	if (type)
>  		*type = VM_FAULT_MINOR;
>  	return page;

snd_malloc_pages() marks all pages it allocates PG_reserved. This
merits some commentary, and likely the removal of the superfluous
PG_reserved usage.


On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote:
> Index: linux-2.6/mm/mmap.c
> ===================================================================
> --- linux-2.6.orig/mm/mmap.c
> +++ linux-2.6/mm/mmap.c
> @@ -1073,6 +1073,17 @@ munmap_back:
>  		error = file->f_op->mmap(file, vma);
>  		if (error)
>  			goto unmap_and_free_vma;
> +		if ((vma->vm_flags & (VM_SHARED | VM_WRITE | VM_RESERVED))
> +						== (VM_WRITE | VM_RESERVED)) {
> +			printk(KERN_WARNING "program %s is using MAP_PRIVATE, "
> +				"PROT_WRITE mmap of VM_RESERVED memory, which "
> +				"is deprecated. Please report this to "
> +				"linux-kernel@vger.kernel.org\n",current->comm);
> +			if (vma->vm_ops && vma->vm_ops->close)
> +				vma->vm_ops->close(vma);
> +			error = -EACCES;
> +			goto unmap_and_free_vma;
> +		}
>  	} else if (vm_flags & VM_SHARED) {
>  		error = shmem_zero_setup(vma);
>  		if (error)

This is user-triggerable where the driver honors mmap() protection
flags blindly.


On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote:
> Index: linux-2.6/mm/bootmem.c
> ===================================================================
> --- linux-2.6.orig/mm/bootmem.c
> +++ linux-2.6/mm/bootmem.c
> @@ -286,6 +286,7 @@ static unsigned long __init free_all_boo
>  				if (j + 16 < BITS_PER_LONG)
>  					prefetchw(page + j + 16);
>  				__ClearPageReserved(page + j);
> +				set_page_count(page + j, 0);
>  			}
>  			__free_pages(page, order);
>  			i += BITS_PER_LONG;

ibid re: bootmem


On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote:
> Index: linux-2.6/kernel/power/swsusp.c
> ===================================================================
> --- linux-2.6.orig/kernel/power/swsusp.c
> +++ linux-2.6/kernel/power/swsusp.c
> @@ -433,15 +433,12 @@ static int save_highmem_zone(struct zone
>  			continue;
>  		page = pfn_to_page(pfn);
>  		/*
> +		 * PageReserved(page) -
>  		 * This condition results from rvmalloc() sans vmalloc_32()
>  		 * and architectural memory reservations. This should be
>  		 * corrected eventually when the cases giving rise to this
>  		 * are better understood.
>  		 */
> -		if (PageReserved(page)) {
> -			printk("highmem reserved page?!\n");
> -			continue;
> -		}
>  		BUG_ON(PageNosave(page));
>  		if (PageNosaveFree(page))
>  			continue;

This behavioral change needs to be commented on. There are some additional
difficulties when memory holes are unintentionally covered by mem_map[];
It is beneficial otherwise. It's likely to triplefault on such holes.


On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote:
> @@ -527,13 +524,8 @@ static int saveable(struct zone * zone, 
>  		return 0;
>  
>  	page = pfn_to_page(pfn);
> -	BUG_ON(PageReserved(page) && PageNosave(page));
>  	if (PageNosave(page))
>  		return 0;
> -	if (PageReserved(page) && pfn_is_nosave(pfn)) {
> -		pr_debug("[nosave pfn 0x%lx]", pfn);
> -		return 0;
> -	}
>  	if (PageNosaveFree(page))
>  		return 0;

The pfn_is_nosave() check must stand barring justification of why
unintentionally saving (and hence restoring) the page is tolerable.


-- wli

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [patch][rfc] 5/5: core remove PageReserved
  2005-06-23  9:51             ` William Lee Irwin III
@ 2005-06-23 10:32               ` Nick Piggin
  -1 siblings, 0 replies; 32+ messages in thread
From: Nick Piggin @ 2005-06-23 10:32 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: linux-kernel, Linux Memory Management, Hugh Dickins, Badari Pulavarty

William Lee Irwin III wrote:
> On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote:
> 
>>@@ -337,7 +338,7 @@ static inline void get_page(struct page 
>> static inline void put_page(struct page *page)
>> {
>>-	if (!PageReserved(page) && put_page_testzero(page))
>>+	if (put_page_testzero(page))
>> 		__page_cache_release(page);
>> }
> 
> 
> No sweep before this? I'm not so sure.
> 

Sweep what? There is a warning that will get triggered in the case
we free a PageReserved page through this path.

> 
> On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote:
> 
>>@@ -514,7 +534,8 @@ int copy_page_range(struct mm_struct *ds
>> 	return 0;
>> }
>> 
>>-static void zap_pte_range(struct mmu_gather *tlb, pmd_t *pmd,
>>+static void zap_pte_range(struct mmu_gather *tlb,
>>+				struct vm_area_struct *vma, pmd_t *pmd,
>> 				unsigned long addr, unsigned long end,
>> 				struct zap_details *details)
>> {
> 
> 
> As exciting as this is, !!(vma->vm_flags & VM_RESERVED) could trivially
> go into struct zap_details without excess args or diff.
> 

Actually, it isn't trivial. I thought of trying that.

> 
> On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote:
> 
>>@@ -1225,6 +1251,8 @@ static int do_wp_page(struct mm_struct *
>> 	unsigned long pfn = pte_pfn(pte);
>> 	pte_t entry;
>> 
>>+	BUG_ON(vma->vm_flags & VM_RESERVED);
>>+
>> 	if (unlikely(!pfn_valid(pfn))) {
>> 		/*
>> 		 * This should really halt the system so it can be debugged or
> 
> 
> !pfn_valid(pfn) is banned when !(vma->vm_flags & VM_RESERVED); here we
> BUG_ON the precondition then handle !pfn_valid(pfn) in the old manner
> where some new infrastructure has been erected for reporting such errors.
> 

No, it uses print_invalid_pfn too.

> 
> On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote:
> 
>>@@ -1259,13 +1286,16 @@ static int do_wp_page(struct mm_struct *
>> 	/*
>> 	 * Ok, we need to copy. Oh, well..
>> 	 */
>>-	if (!PageReserved(old_page))
>>+	if (old_page == ZERO_PAGE(address))
>>+		old_page = NULL;
>>+	else
>> 		page_cache_get(old_page);
>>+
>> 	spin_unlock(&mm->page_table_lock);
>> 
>> 	if (unlikely(anon_vma_prepare(vma)))
>> 		goto no_new_page;
>>-	if (old_page == ZERO_PAGE(address)) {
>>+	if (old_page == NULL) {
>> 		new_page = alloc_zeroed_user_highpage(vma, address);
>> 		if (!new_page)
>> 			goto no_new_page;
> 
> 
> There are some micro-optimizations mixed in with this and some
> subsequent do_wp_page() alterations.
> 

We no longer page_cache_get ZERO_PAGE, ever. Surely it is acceptable
that I changed it in a more optimal way than it was?

> 
> On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote:
> 
>>@@ -1654,7 +1656,7 @@ void __init memmap_init_zone(unsigned lo
>> 
>> 	for (page = start; page < (start + size); page++) {
>> 		set_page_zone(page, NODEZONE(nid, zone));
>>-		set_page_count(page, 0);
>>+		set_page_count(page, 1);
>> 		reset_page_mapcount(page);
>> 		SetPageReserved(page);
>> 		INIT_LIST_HEAD(&page->lru);
> 
> 
> The refcounting and PG_reserved activity in memmap_init_zone() is
> superfluous. bootmem.c does all the necessary accounting internally.
> 

Well not superfluous yet. It is kept around to support all the arch
code that still uses it (not much, mainly reserved memory reporting).
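
The sort of thing I mean is the boot-time reserved-page accounting,
roughly (a sketch, not any particular arch verbatim):

static void __init sketch_count_reserved(void)
{
        int reservedpages = 0;
        unsigned long pfn;

        for (pfn = 0; pfn < max_low_pfn; pfn++)
                if (pfn_valid(pfn) && PageReserved(pfn_to_page(pfn)))
                        reservedpages++;
        printk(KERN_INFO "%d reserved pages\n", reservedpages);
}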

> 
> On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote:
> 
>>Index: linux-2.6/mm/fremap.c
>>===================================================================
>>--- linux-2.6.orig/mm/fremap.c
>>+++ linux-2.6/mm/fremap.c
>>@@ -29,18 +29,21 @@ static inline void zap_pte(struct mm_str
>> 		return;
>> 	if (pte_present(pte)) {
>> 		unsigned long pfn = pte_pfn(pte);
>>+		struct page *page;
>> 
>> 		flush_cache_page(vma, addr, pfn);
>> 		pte = ptep_clear_flush(vma, addr, ptep);
>>-		if (pfn_valid(pfn)) {
>>-			struct page *page = pfn_to_page(pfn);
>>-			if (!PageReserved(page)) {
>>-				if (pte_dirty(pte))
>>-					set_page_dirty(page);
>>-				page_remove_rmap(page);
>>-				page_cache_release(page);
>>-				dec_mm_counter(mm, rss);
>>-			}
>>+		if (unlikely(!pfn_valid(pfn))) {
>>+			print_invalid_pfn("zap_pte", pte, vma->vm_flags, addr);
>>+			return;
>>+		}
>>+		page = pfn_to_page(pfn);
>>+		if (page != ZERO_PAGE(addr)) {
>>+			if (pte_dirty(pte))
>>+				set_page_dirty(page);
>>+			page_remove_rmap(page);
>>+			dec_mm_counter(mm, rss);
>>+			page_cache_release(page);
>> 		}
>> 	} else {
>> 		if (!pte_file(pte))
> 
> 
> There is no error returned here to be handled by the caller.
> 

That's OK, the pte has been cleared. Nothing else we can do.

> 
> On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote:
> 
>>@@ -65,6 +68,8 @@ int install_page(struct mm_struct *mm, s
>> 	pgd_t *pgd;
>> 	pte_t pte_val;
>> 
>>+	BUG_ON(vma->vm_flags & VM_RESERVED);
>>+
>> 	pgd = pgd_offset(mm, addr);
>> 	spin_lock(&mm->page_table_lock);
>> 	
>>@@ -122,6 +127,8 @@ int install_file_pte(struct mm_struct *m
>> 	pgd_t *pgd;
>> 	pte_t pte_val;
>> 
>>+	BUG_ON(vma->vm_flags & VM_RESERVED);
>>+
>> 	pgd = pgd_offset(mm, addr);
>> 	spin_lock(&mm->page_table_lock);
> 
> 
> This has no effect but to artificially constrain the interface. There
> is no a priori reason to avoid the use of install_page() to deposit
> mappings to normal pages in VM_RESERVED vmas. It's only the reverse
> being "banned" here. Similar comments also apply to zap_pte().
> 

No, install_page is playing with the page (eg. page_add_file_rmap)
which is explicity banned even before my PageReserved removal. It
is unclear that this ever safely worked for normal pages, and will
hit NULL page dereferences if trying to do it with iomem.

> 
> On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote:
> 
>>Index: linux-2.6/drivers/scsi/sg.c
>>===================================================================
>>--- linux-2.6.orig/drivers/scsi/sg.c
>>+++ linux-2.6/drivers/scsi/sg.c
>>@@ -1887,9 +1887,10 @@ st_unmap_user_pages(struct scatterlist *
>> 	int i;
>> 
>> 	for (i=0; i < nr_pages; i++) {
>>-		if (dirtied && !PageReserved(sgl[i].page))
>>+		if (dirtied)
>> 			SetPageDirty(sgl[i].page);
>> 		/* unlock_page(sgl[i].page); */
>>+		/* FIXME: XXX don't dirty/unmap VM_RESERVED regions? */
>> 		/* FIXME: cache flush missing for rw==READ
>> 		 * FIXME: call the correct reference counting function
>> 		 */
> 
> 
> An answer should be devised for this. My numerous SCSI CD-ROM devices
> (I have 5 across several different machines of several different arches)
> are rather unlikely to be happy with /* FIXME: XXX ... as an answer.
> 

The worst it will do is dirty a VM_RESERVED page. So it is going
to work unless you're thinking about doing something crazy like
mmap /dev/mem and send your CDROM some pages from there. But yeah,
I have to find someone who knows what they're doing to look at this.

> 
> On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote:
> 
>>Index: linux-2.6/drivers/scsi/st.c
>>===================================================================
>>--- linux-2.6.orig/drivers/scsi/st.c
>>+++ linux-2.6/drivers/scsi/st.c
>>@@ -4435,8 +4435,9 @@ static int sgl_unmap_user_pages(struct s
>> 	int i;
>> 
>> 	for (i=0; i < nr_pages; i++) {
>>-		if (dirtied && !PageReserved(sgl[i].page))
>>+		if (dirtied)
>> 			SetPageDirty(sgl[i].page);
>>+		/* FIXME: XXX don't dirty/unmap VM_RESERVED regions? */
>> 		/* FIXME: cache flush missing for rw==READ
>> 		 * FIXME: call the correct reference counting function
>> 		 */
> 
> 
> Mutatis mutandis for my SCSI tape drive.
> 
> 
> On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote:
> 
>>Index: linux-2.6/sound/core/pcm_native.c
>>===================================================================
>>--- linux-2.6.orig/sound/core/pcm_native.c
>>+++ linux-2.6/sound/core/pcm_native.c
>>@@ -2942,8 +2942,7 @@ static struct page * snd_pcm_mmap_status
>> 		return NOPAGE_OOM;
>> 	runtime = substream->runtime;
>> 	page = virt_to_page(runtime->status);
>>-	if (!PageReserved(page))
>>-		get_page(page);
>>+	get_page(page);
>> 	if (type)
>> 		*type = VM_FAULT_MINOR;
>> 	return page;
> 
> 
> snd_malloc_pages() marks all pages it allocates PG_reserved. This
> merits some commentary, and likely the removal of the superfluous
> PG_reserved usage.
> 

Sure, but not in this patch. The aim here is just to eliminate special
casing of refcounting. Other PG_reserved usage can stay around for the
moment (and is actually good for catching errors).

> 
> On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote:
> 
>>Index: linux-2.6/mm/mmap.c
>>===================================================================
>>--- linux-2.6.orig/mm/mmap.c
>>+++ linux-2.6/mm/mmap.c
>>@@ -1073,6 +1073,17 @@ munmap_back:
>> 		error = file->f_op->mmap(file, vma);
>> 		if (error)
>> 			goto unmap_and_free_vma;
>>+		if ((vma->vm_flags & (VM_SHARED | VM_WRITE | VM_RESERVED))
>>+						== (VM_WRITE | VM_RESERVED)) {
>>+			printk(KERN_WARNING "program %s is using MAP_PRIVATE, "
>>+				"PROT_WRITE mmap of VM_RESERVED memory, which "
>>+				"is deprecated. Please report this to "
>>+				"linux-kernel@vger.kernel.org\n",current->comm);
>>+			if (vma->vm_ops && vma->vm_ops->close)
>>+				vma->vm_ops->close(vma);
>>+			error = -EACCES;
>>+			goto unmap_and_free_vma;
>>+		}
>> 	} else if (vm_flags & VM_SHARED) {
>> 		error = shmem_zero_setup(vma);
>> 		if (error)
> 
> 
> This is user-triggerable where the driver honors mmap() protection
> flags blindly.
> 

If the user is allowed write access to VM_RESERVED memory, then I
suspect there is a lot worse they can do than flood the log.

But the check isn't going to stay around forever.

> 
> On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote:
> 
>>Index: linux-2.6/mm/bootmem.c
>>===================================================================
>>--- linux-2.6.orig/mm/bootmem.c
>>+++ linux-2.6/mm/bootmem.c
>>@@ -286,6 +286,7 @@ static unsigned long __init free_all_boo
>> 				if (j + 16 < BITS_PER_LONG)
>> 					prefetchw(page + j + 16);
>> 				__ClearPageReserved(page + j);
>>+				set_page_count(page + j, 0);
>> 			}
>> 			__free_pages(page, order);
>> 			i += BITS_PER_LONG;
> 
> 
> ibid re: bootmem
> 

Not yet.

> 
> On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote:
> 
>>Index: linux-2.6/kernel/power/swsusp.c
>>===================================================================
>>--- linux-2.6.orig/kernel/power/swsusp.c
>>+++ linux-2.6/kernel/power/swsusp.c
>>@@ -433,15 +433,12 @@ static int save_highmem_zone(struct zone
>> 			continue;
>> 		page = pfn_to_page(pfn);
>> 		/*
>>+		 * PageReserved(page) -
>> 		 * This condition results from rvmalloc() sans vmalloc_32()
>> 		 * and architectural memory reservations. This should be
>> 		 * corrected eventually when the cases giving rise to this
>> 		 * are better understood.
>> 		 */
>>-		if (PageReserved(page)) {
>>-			printk("highmem reserved page?!\n");
>>-			continue;
>>-		}
>> 		BUG_ON(PageNosave(page));
>> 		if (PageNosaveFree(page))
>> 			continue;
> 
> 
> This behavioral change needs to be commented on. There are some additional
> difficulties when memory holes are unintentionally covered by mem_map[];
> It is beneficial otherwise. It's likely to triplefault on such holes.
> 

It seems the author of this code themselves didn't really understand
what was going on here, so I'm buggered if I can be bothered :)

Remember though, PageReserved can stay around for as long as we need,
so this hunk can be trivially reverted if it is an immediate problem.

> 
> On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote:
> 
>>@@ -527,13 +524,8 @@ static int saveable(struct zone * zone, 
>> 		return 0;
>> 
>> 	page = pfn_to_page(pfn);
>>-	BUG_ON(PageReserved(page) && PageNosave(page));
>> 	if (PageNosave(page))
>> 		return 0;
>>-	if (PageReserved(page) && pfn_is_nosave(pfn)) {
>>-		pr_debug("[nosave pfn 0x%lx]", pfn);
>>-		return 0;
>>-	}
>> 	if (PageNosaveFree(page))
>> 		return 0;
> 
> 
> The pfn_is_nosave() check must stand barring justification of why
> unintentionally saving (and hence restoring) the page is tolerable.
> 

I thought the PageNosave should catch that, and the next line is
just a debug check. But I'm looking for someone who *actually* knows
how swsusp works, if anyone would like to volunteer :)

Thanks very much for the review!

-- 
SUSE Labs, Novell Inc.

Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [patch][rfc] 5/5: core remove PageReserved
  2005-06-23 10:32               ` Nick Piggin
@ 2005-06-23 22:08                 ` William Lee Irwin III
  -1 siblings, 0 replies; 32+ messages in thread
From: William Lee Irwin III @ 2005-06-23 22:08 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-kernel, Linux Memory Management, Hugh Dickins, Badari Pulavarty

William Lee Irwin III wrote:
>> As exciting as this is, !!(vma->vm_flags & VM_RESERVED) could trivially
>> go into struct zap_details without excess args or diff.

On Thu, Jun 23, 2005 at 08:32:05PM +1000, Nick Piggin wrote:
> Actually, it isn't trivial. I thought of trying that.

You're probably thinking of the zap_page_range() vs. unmap_page_range()
topmost-level entrypoints as being the only places to set the flag in
details. Unconditionally clobbering the field in details in
zap_pud_range() should resolve any issues associated with that.

This is obviously a style and/or diff-minimization issue.
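
Concretely, the sketch I have in mind is something like this (field
and helper names are made up):

/* hypothetical extra field hanging off struct zap_details */
struct zap_details_sketch {
        int reserved_vma;       /* !!(vma->vm_flags & VM_RESERVED) */
};

/* set once per vma from the upper-level walk, so zap_pte_range()
 * needs no new vma argument */
static inline void zap_note_vma(struct zap_details_sketch *details,
                                struct vm_area_struct *vma)
{
        details->reserved_vma = !!(vma->vm_flags & VM_RESERVED);
}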


William Lee Irwin III wrote:
>> The refcounting and PG_reserved activity in memmap_init_zone() is
>> superfluous. bootmem.c does all the necessary accounting internally.

On Thu, Jun 23, 2005 at 08:32:05PM +1000, Nick Piggin wrote:
> Well not superfluous yet. It is kept around to support all the arch
> code that still uses it (not much, mainly reserved memory reporting).

That activity is outside memmap_init_zone(). Specifically, the page
structures are not used for anything whatsoever before bootmem puts
them into the buddy allocators. It is not particularly interesting
or difficult to defer even the initialization to the precise point
in time bootmem cares to perform buddy bitmap insertion. This goes
for all fields of struct page. What's notable about PG_reserved and
refcounts here is that memmap_init_zone() goes about flipping bits one
way where free_all_bootmem_core() undoes all its work.
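
Condensing the hunks quoted earlier in the thread, the per-page round
trip is essentially:

        /* memmap_init_zone(), for every struct page: */
        set_page_count(page, 1);
        SetPageReserved(page);

        /* free_all_bootmem_core(), before handing the same page to
         * the buddy allocator: */
        __ClearPageReserved(page);
        set_page_count(page, 0);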


William Lee Irwin III wrote:
>> There is no error returned here to be handled by the caller.

On Thu, Jun 23, 2005 at 08:32:05PM +1000, Nick Piggin wrote:
> That's OK, the pte has been cleared. Nothing else we can do.

This is called in the interior of a loop, which may be beneficial to
terminate if this intended semantic is to be enforced. Furthermore, no
error is propagated to the caller, which is not the desired effect in
the stated error reporting scheme. So the code is inconsistent with
explicitly stated intention.


William Lee Irwin III wrote:
>> This has no effect but to artificially constrain the interface. There
>> is no a priori reason to avoid the use of install_page() to deposit
>> mappings to normal pages in VM_RESERVED vmas. It's only the reverse
>> being "banned" here. Similar comments also apply to zap_pte().

On Thu, Jun 23, 2005 at 08:32:05PM +1000, Nick Piggin wrote:
> No, install_page is playing with the page (eg. page_add_file_rmap)
> which is explicitly banned even before my PageReserved removal. It
> is unclear that this ever safely worked for normal pages, and will
> hit NULL page dereferences if trying to do it with iomem.

You are going on about the fact that install_page() can't be used on
memory outside mem_map[] as it requires a page structure, and can't be
used on reserved pages because page_add_file_rmap() will BUG. This case
is not being discussed.

The issue at stake is inserting normal pages into a VM_RESERVED vma.
These will arise as e.g. kernel-allocated buffers managed by normal
reference counting. remap_pfn_range() can't do it; it refuses to
operate on "valid memory". install_page() now won't do it; it refuses
to touch a VM_RESERVED vma. So this creates a giant semantic hole,
and potentially breaks working code (i.e. if you were going to do
this you would need not only a replacement but also a sweep to adjust
all the drivers doing it or prove their nonexistence).
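
The sort of setup at stake looks roughly like this (names invented): a
normally refcounted kernel buffer exported through a vma the driver has
marked VM_RESERVED:

static char *drv_vbuf;          /* vmalloc()ed elsewhere */

static struct page *drv_nopage(struct vm_area_struct *vma,
                               unsigned long address, int *type)
{
        unsigned long off = address - vma->vm_start;
        struct page *page = vmalloc_to_page(drv_vbuf + off);

        get_page(page);         /* an ordinary refcounted page */
        if (type)
                *type = VM_FAULT_MINOR;
        return page;
}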


William Lee Irwin III wrote:
>> An answer should be devised for this. My numerous SCSI CD-ROM devices
>> (I have 5 across several different machines of several different arches)
>> are rather unlikely to be happy with /* FIXME: XXX ... as an answer.

On Thu, Jun 23, 2005 at 08:32:05PM +1000, Nick Piggin wrote:
> The worst it will do is dirty a VM_RESERVED page. So it is going
> to work unless you're thinking about doing something crazy like
> mmap /dev/mem and send your CDROM some pages from there. But yeah,
> I have to find someone who knows what they're doing to look at this.

Then you should replace FIXME: XXX with this explanation. By and large
the presence of "FIXME: XXX" is a sign there is a gigantic hole in the
code. It should definitely not be done with new code, but rather,
exclusively confined to documenting discoveries of preexisting brokenness.
After all, if a patch is introducing broken code, why would we merge it?
Best to adjust the commentary and avoid that question altogether.

There are actually larger questions about this than the reserved page
handling. If e.g. pagecache pages need to be dirtied the raw bitflag
toggling is probably not how it should be done.


William Lee Irwin III wrote:
>> snd_malloc_pages() marks all pages it allocates PG_reserved. This
>> merits some commentary, and likely the removal of the superfluous
>> PG_reserved usage.

On Thu, Jun 23, 2005 at 08:32:05PM +1000, Nick Piggin wrote:
> Sure, but not in this patch. The aim here is just to eliminate special
> casing of refcounting. Other PG_reserved usage can stay around for the
> moment (and is actually good for catching errors).

Unfortunately for this scheme, it's very much a case of putting the
cart before the horse. PG_reserved is toggled at random in this driver
after this change, to no useful effect (debugging or otherwise). And
this really goes for the whole affair. Diddling the core first is just
going to create bugs. Converting the users first is the way these
things need to be done. When complete, nothing needs the core flags
twiddling anymore and you just nuke the flag twiddling from the core.


William Lee Irwin III wrote:
>> This is user-triggerable where the driver honors mmap() protection
>> flags blindly.

On Thu, Jun 23, 2005 at 08:32:05PM +1000, Nick Piggin wrote:
> If the user is allowed write access to VM_RESERVED memory, then I
> suspect there is a lot worse they can do than flood the log.
> But the check isn't going to stay around forever.

This doesn't really do a whole lot of good for the unwitting user
who invokes a privileged application relying on such kernel behavior.
User-visible changes need to be taken on with somewhat more care
(okay, vastly more, along with long-term backward compatibility).


William Lee Irwin III wrote:
>> This behavioral change needs to be commented on. There are some additional
>> difficulties when memory holes are unintentionally covered by mem_map[];
>> It is beneficial otherwise. It's likely to triplefault on such holes.

On Thu, Jun 23, 2005 at 08:32:05PM +1000, Nick Piggin wrote:
> It seems the author of this code themselves didn't really understand
> what was going on here, so I'm buggered if I can be bothered :)
> Remember though, PageReserved can stay around for as long as we need,
> so this hunk can be trivially reverted if it is an immediate problem.

This doesn't really fly. A fair number of drivers are poorly-understood
and numerous gyrations have to be gone through to avoid breaking their
possible assumptions until at long last clarifications are made (that
is, if they're ever made). swsusp is not demotable to below their
status on a whim.

The failure mode must also be considered. Users are highly unlikely to
generate comprehensible bugreports from triplefaults.

It's also rather easy to see this case will arise on every non-NUMA
ia32 machine with > 4GB RAM, and that we dare not attempt to save
or restore such pages during suspend/resume cycles.


William Lee Irwin III wrote:
>> The pfn_is_nosave() check must stand barring justification of why
>> unintentionally saving (and hence restoring) the page is tolerable.

On Thu, Jun 23, 2005 at 08:32:05PM +1000, Nick Piggin wrote:
> I thought the PageNosave should catch that, and the next line is
> just a debug check. But I'm looking for someone who *actually* knows
> how swsusp works, if anyone would like to volunteer :)

This one is readily observable. It's doing a bounds check on the pfn.
This corresponds to the __nosave region being assigned struct pages
to cover it. We can't get by with accidentally saving and restoring it.
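
Roughly, what pfn_is_nosave() does is this (a sketch from memory, not
the exact swsusp source):

static int pfn_is_nosave(unsigned long pfn)
{
        unsigned long begin = __pa(&__nosave_begin) >> PAGE_SHIFT;
        unsigned long end = PAGE_ALIGN(__pa(&__nosave_end)) >> PAGE_SHIFT;

        /* is the pfn inside the __nosave region? */
        return pfn >= begin && pfn < end;
}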


-- wli

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [patch][rfc] 5/5: core remove PageReserved
  2005-06-23 22:08                 ` William Lee Irwin III
@ 2005-06-23 23:21                   ` Nick Piggin
  -1 siblings, 0 replies; 32+ messages in thread
From: Nick Piggin @ 2005-06-23 23:21 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: linux-kernel, Linux Memory Management, Hugh Dickins, Badari Pulavarty

William Lee Irwin III wrote:
> William Lee Irwin III wrote:
> 
>>>As exciting as this is, !!(vma->vm_flags & VM_RESERVED) could trivially
>>>go into struct zap_details without excess args or diff.
> 
> 
> On Thu, Jun 23, 2005 at 08:32:05PM +1000, Nick Piggin wrote:
> 
>>Actually, it isn't trivial. I thought of trying that.
> 
> 
> You're probably thinking of the zap_page_range() vs. unmap_page_range()
> topmost-level entrypoints as being the only places to set the flag in
> details. Unconditionally clobbering the field in details in
> zap_pud_range() should resolve any issues associated with that.
> 
> This is obviously a style and/or diff-minimization issue.
> 

I'm thinking of exit_mmap and unmap_vmas, that don't pass in a
details field at all, and all the tests for details being marked
unlikely. I ended up thinking it was less ugly this way.

> 
> William Lee Irwin III wrote:
> 
>>>The refcounting and PG_reserved activity in memmap_init_zone() is
>>>superfluous. bootmem.c does all the necessary accounting internally.
> 
> 
> On Thu, Jun 23, 2005 at 08:32:05PM +1000, Nick Piggin wrote:
> 
>>Well not superfluous yet. It is kept around to support all the arch
>>code that still uses it (not much, mainly reserved memory reporting).
> 
> 
> That activity is outside memmap_init_zone(). Specifically, the page
> structures are not used for anything whatsoever before bootmem puts
> them into the buddy allocators. It is not particularly interesting
> or difficult to defer even the initialization to the precise point
> in time bootmem cares to perform buddy bitmap insertion. This goes
> for all fields of struct page. What's notable about PG_reserved and
> refcounts here is that memmap_init_zone() goes about flipping bits one
> way where free_all_bootmem_core() undoes all its work.
> 

This patch doesn't care how it works, that would be for a later patch.

> 
> William Lee Irwin III wrote:
> 
>>>There is no error returned here to be handled by the caller.
> 
> 
> On Thu, Jun 23, 2005 at 08:32:05PM +1000, Nick Piggin wrote:
> 
>>That's OK, the pte has been cleared. Nothing else we can do.
> 
> 
> This is called in the interior of a loop, which may be beneficial to
> terminate if this intended semantic is to be enforced. Furthermore, no
> error is propagated to the caller, which is not the desired effect in
> the stated error reporting scheme. So the code is inconsistent with
> explicitly stated intention.
> 

No, the error reporting scheme says it doesn't handle any error,
that is all. What we have here in terms of behaviour is exactly
what used to happen, that is, do something sane-ish on error.
Changing behaviour would be outside the scope of this patch, but
be my guest.

> 
> William Lee Irwin III wrote:
> 
>>>This has no effect but to artificially constrain the interface. There
>>>is no a priori reason to avoid the use of install_page() to deposit
>>>mappings to normal pages in VM_RESERVED vmas. It's only the reverse
>>>being "banned" here. Similar comments also apply to zap_pte().
> 
> 
> On Thu, Jun 23, 2005 at 08:32:05PM +1000, Nick Piggin wrote:
> 
>>No, install_page is playing with the page (eg. page_add_file_rmap)
>>which is explicitly banned even before my PageReserved removal. It
>>is unclear that this ever safely worked for normal pages, and will
>>hit NULL page dereferences if trying to do it with iomem.
> 
> 
> You are going on about the fact that install_page() can't be used on
> memory outside mem_map[] as it requires a page structure, and can't be
> used on reserved pages because page_add_file_rmap() will BUG. This case
> is not being discussed.
> 

And that it isn't allowed to touch struct page of physical pages
in a VM_RESERVED region.

> The issue at stake is inserting normal pages into a VM_RESERVED vma.
> These will arise as e.g. kernel-allocated buffers managed by normal
> reference counting. remap_pfn_range() can't do it; it refuses to
> operate on "valid memory". install_page() now won't do it; it refuses
> to touch a VM_RESERVED vma. So this creates a giant semantic hole,
> and potentially breaks working code (i.e. if you were going to do
> this you would need not only a replacement but also a sweep to adjust
> all the drivers doing it or prove their nonexistence).
> 

I think you'll find that remap_pfn_range will be happy to operate
on valid memory, and that any driver trying to use install_page
on VM_RESERVED probably needs fixing anyway.
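
For example, something along these lines already works for a
physically contiguous kernel buffer (a sketch, names made up):

static void *drv_dmabuf;        /* from __get_free_pages(), contiguous */

static int drv_mmap(struct file *file, struct vm_area_struct *vma)
{
        unsigned long pfn = virt_to_phys(drv_dmabuf) >> PAGE_SHIFT;
        unsigned long size = vma->vm_end - vma->vm_start;

        return remap_pfn_range(vma, vma->vm_start, pfn, size,
                               vma->vm_page_prot);
}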

> 
> William Lee Irwin III wrote:
> 
>>>An answer should be devised for this. My numerous SCSI CD-ROM devices
>>>(I have 5 across several different machines of several different arches)
>>>are rather unlikely to be happy with /* FIXME: XXX ... as an answer.
> 
> 
> On Thu, Jun 23, 2005 at 08:32:05PM +1000, Nick Piggin wrote:
> 
>>The worst it will do is dirty a VM_RESERVED page. So it is going
>>to work unless you're thinking about doing something crazy like
>>mmap /dev/mem and send your CDROM some pages from there. But yeah,
>>I have to find someone who knows what they're doing to look at this.
> 
> 
> Then you should replace FIXME: XXX with this explanation. By and large
> the presence of "FIXME: XXX" is a sign there is a gigantic hole in the
> code. It should definitely not be done with new code, but rather,
> exclusively confined to documenting discoveries of preexisting brokenness.
> After all, if a patch is introducing broken code, why would we merge it?
> Best to adjust the commentary and avoid that question altogether.
> 

We wouldn't merge it. Hence this is an rfc and I explicitly said
it is not for merging.

> There are actually larger questions about this than the reserved page
> handling. If e.g. pagecache pages need to be dirtied the raw bitflag
> toggling is probably not how it should be done.
> 

Yep.

> 
> William Lee Irwin III wrote:
> 
>>>snd_malloc_pages() marks all pages it allocates PG_reserved. This
>>>merits some commentary, and likely the removal of the superfluous
>>>PG_reserved usage.
> 
> 
> On Thu, Jun 23, 2005 at 08:32:05PM +1000, Nick Piggin wrote:
> 
>>Sure, but not in this patch. The aim here is just to eliminate special
>>casing of refcounting. Other PG_reserved usage can stay around for the
>>moment (and is actually good for catching errors).
> 
> 
> Unfortunately for this scheme, it's very much a case of putting the
> cart before the horse. PG_reserved is toggled at random in this driver
> after this change, to no useful effect (debugging or otherwise). And
> this really goes for the whole affair. Diddling the core first is just
> going to create bugs. Converting the users first is the way these
> things need to be done. When complete, nothing needs the core flags
> twiddling anymore and you just nuke the flag twiddling from the core.
> 

I'm sorry, I don't see how 'diddling' the core will create bugs.

This is a fine way to do it, and "converting" users first (whatever
that means) is not possible because VM_RESERVED handling in core
code is not up to the task of replacing PageReserved without this
patch.

> 
> William Lee Irwin III wrote:
> 
>>>This is user-triggerable where the driver honors mmap() protection
>>>flags blindly.
> 
> 
> On Thu, Jun 23, 2005 at 08:32:05PM +1000, Nick Piggin wrote:
> 
>>If the user is allowed write access to VM_RESERVED memory, then I
>>suspect there is a lot worse they can do than flood the log.
>>But the check isn't going to stay around forever.
> 
> 
> This doesn't really do a whole lot of good for the unwitting user
> who invokes a privileged application relying on such kernel behavior.
> User-visible changes need to be taken on with somewhat more care
> (okay, vastly more, along with long-term backward compatibility).
> 

It does a great deal of good, because they can tell us about it and
we'll fix it. Search the kernel sources and you'll find other
examples that look almost exactly like this one.

> 
> William Lee Irwin III wrote:
> 
>>>This behavioral change needs to be commented on. There are some additional
>>>difficulties when memory holes are unintentionally covered by mem_map[];
>>>It is beneficial otherwise. It's likely to triplefault on such holes.
> 
> 
> On Thu, Jun 23, 2005 at 08:32:05PM +1000, Nick Piggin wrote:
> 
>>It seems the author of this code themselves didn't really understand
>>what was going on here, so I'm buggered if I can be bothered :)
>>Remember though, PageReserved can stay around for as long as we need,
>>so this hunk can be trivially reverted if it is an immediate problem.
> 
> 
> This doesn't really fly. A fair number of drivers are poorly-understood
> and numerous gyrations have to be gone through to avoid breaking their
> possible assumptions until at long last clarifications are made (that
> is, if they're ever made). swsusp is not demotable to below their
> status on a whim.
> 

Yep, that's why I'm going to ask some swsusp developers to have
a look at it.

I wouldn't pretend to be able to fix every bug everywhere in the
kernel myself.

Thanks,
Nick

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [patch][rfc] 5/5: core remove PageReserved
  2005-06-23 23:21                   ` Nick Piggin
@ 2005-06-24  0:59                     ` William Lee Irwin III
  -1 siblings, 0 replies; 32+ messages in thread
From: William Lee Irwin III @ 2005-06-24  0:59 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-kernel, Linux Memory Management, Hugh Dickins, Badari Pulavarty

William Lee Irwin III wrote:
>> You're probably thinking of the zap_page_range() vs. unmap_page_range()
>> topmost-level entrypoints as being the only places to set the flag in
>> details. Unconditionally clobbering the field in details in
>> zap_pud_range() should resolve any issues associated with that.
>> This is obviously a style and/or diff-minimization issue.

On Fri, Jun 24, 2005 at 09:21:31AM +1000, Nick Piggin wrote:
> I'm thinking of exit_mmap and unmap_vmas, that don't pass in a
> details field at all, and all the tests for details being marked
> unlikely. I ended up thinking it was less ugly this way.

This doesn't pose any particular difficulty, but so noted since that's
what you had in mind.


William Lee Irwin III wrote:
>> That activity is outside memmap_init_zone(). Specifically, the page
>> structures are not used for anything whatsoever before bootmem puts
>> them into the buddy allocators. It is not particularly interesting
>> or difficult to defer even the initialization to the precise point
>> in time bootmem cares to perform buddy bitmap insertion. This goes
>> for all fields of struct page. What's notable about PG_reserved and
>> refcounts here is that memmap_init_zone() goes about flipping bits one
>> way where free_all_bootmem_core() undoes all its work.

On Fri, Jun 24, 2005 at 09:21:31AM +1000, Nick Piggin wrote:
> This patch doesn't care how it works, that would be for a later patch.

The general gist of all this is that the patch doesn't cover anywhere
near enough ground, and so the above illustrates that the usage
in/around bootmem is among the easiest of usages to remove. I have
implemented what I'm talking about on several independent occasions.
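(For readers without the source to hand, the "flipping one way / undoing" pattern being described is roughly the following; reconstructed from memory of 2.6-era mm/page_alloc.c and mm/bootmem.c, so details may differ.)

#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/page-flags.h>

/* roughly what memmap_init_zone() does for every page at boot */
static void sketch_init_one_page(struct page *page)
{
        set_page_count(page, 1);
        reset_page_mapcount(page);
        SetPageReserved(page);
}

/* ...which free_all_bootmem_core() then undoes for each free page */
static void sketch_release_one_page(struct page *page)
{
        ClearPageReserved(page);
        set_page_count(page, 1);
        __free_page(page);
}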


William Lee Irwin III wrote:
>> This is called in the interior of a loop, which may be beneficial to
>> terminate if this intended semantic is to be enforced. Furthermore, no
>> error is propagated to the caller, which is not the desired effect in
>> the stated error reporting scheme. So the code is inconsistent with
>> explicitly stated intention.

On Fri, Jun 24, 2005 at 09:21:31AM +1000, Nick Piggin wrote:
> No, the error reporting scheme says it doesn't handle any error,
> that is all. What we have here in terms of behaviour is exactly
> what used to happen, that is - do something saneish on error.
> Changing behaviour would be outside the scope of this patch, but
> be my guest.

Some places BUG(), some back out with an error return, others blithely
proceed. This kind of inconsistency will broadly confuse callers of
the APIs.


William Lee Irwin III wrote:
>> You are going on about the fact that install_page() can't be used on
>> memory outside mem_map[] as it requires a page structure, and can't be
>> used on reserved pages because page_add_file_rmap() will BUG. This case
>> is not being discussed.

On Fri, Jun 24, 2005 at 09:21:31AM +1000, Nick Piggin wrote:
> And that it isn't allowed to touch struct page of physical pages
> in a VM_RESERVED region.

non sequitur


William Lee Irwin III wrote:
>> The issue at stake is inserting normal pages into a VM_RESERVED vma.
>> These will arise as e.g. kernel-allocated buffers managed by normal
>> reference counting. remap_pfn_range() can't do it; it refuses to
>> operate on "valid memory". install_page() now won't do it; it refuses
>> to touch a VM_RESERVED vma. So this creates a giant semantic hole,
>> and potentially breaks working code (i.e. if you were going to do
>> this you would need not only a replacement but also a sweep to adjust
>> all the drivers doing it or prove their nonexistence).

On Fri, Jun 24, 2005 at 09:21:31AM +1000, Nick Piggin wrote:
> I think you'll find that remap_pfn_range will be happy to operate
> on valid memory, and that any driver trying to use install_page
> on VM_RESERVED probably needs fixing anyway.

install_page() of a !PageReserved() page in a VM_RESERVED vma is
neither broken now nor does it merit going BUG().

/dev/mem was disallowed from mapping ordinary kernel memory for a
reason (though I disagree with it), so the removal of the
pfn_valid()/PageReserved() checks can't be blithely done like in
your patch. Other arrangements must be made.
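(The checks in question are the per-pte guard underneath remap_pfn_range(); a simplified sketch, from memory of 2.6-era mm/memory.c and with a hypothetical helper name, looks like this.)

#include <linux/mm.h>
#include <asm/pgtable.h>

/*
 * A pte is only installed when the target is *not* ordinary managed
 * memory, which is why remap_pfn_range() "refuses to operate on valid
 * memory" and why /dev/mem cannot hand out ordinary kernel pages
 * through it.
 */
static void sketch_remap_one_pfn(struct mm_struct *mm, pte_t *pte,
                                 unsigned long addr, unsigned long pfn,
                                 pgprot_t prot)
{
        if (!pfn_valid(pfn) || PageReserved(pfn_to_page(pfn)))
                set_pte_at(mm, addr, pte, pfn_pte(pfn, prot));
        /* otherwise the pfn is silently skipped */
}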


William Lee Irwin III wrote:
>> Unfortunately for this scheme, it's very much a case of putting the
>> cart before the horse. PG_reserved is toggled at random in this driver
>> after this change, to no useful effect (debugging or otherwise). And
>> this really goes for the whole affair. Diddling the core first is just
>> going to create bugs. Converting the users first is the way these
>> things need to be done. When complete, nothing needs the core flags
>> twiddling anymore and you just nuke the flag twiddling from the core.

On Fri, Jun 24, 2005 at 09:21:31AM +1000, Nick Piggin wrote:
> I'm sorry, I don't see how 'diddling' the core will create bugs.
> This is a fine way to do it, and "converting" users first (whatever
> that means)
[cut text included in full in the following quote block]

This is going way too far. Someone please deal with this.


On Fri, Jun 24, 2005 at 09:21:31AM +1000, Nick Piggin wrote:
[continued]
> This is a fine way to do it, and "converting" users first (whatever
> that means) is not possible because VM_RESERVED handling in core
> code is not up to the task of replacing PageReserved without this
> patch.

You aren't replacing any of the PG_reserved usage with VM_RESERVED
in any of these patches. The primary motive of all this is AFAICT
little more than getting the PG_reserved check out of put_page(),
drivers and arches be damned.
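(For context, the put_page() check being referred to is roughly the one below; compound-page handling omitted, reconstructed from memory of 2.6-era mm/swap.c.)

#include <linux/mm.h>
#include <linux/pagemap.h>

void put_page(struct page *page)
{
        /* reserved pages are never refcounted here... */
        if (!PageReserved(page) && put_page_testzero(page))
                __page_cache_release(page);
        /*
         * ...and with the series applied the PageReserved() test is
         * dropped, so a reserved page must never reach put_page().
         */
}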


William Lee Irwin III wrote:
>> This doesn't really fly. A fair number of drivers are poorly-understood
>> and numerous gyrations have to be gone through to avoid breaking their
>> possible assumptions until at long last clarifications are made (that
>> is, if they're ever made). swsusp is not demotable to below their
>> status on a whim.

On Fri, Jun 24, 2005 at 09:21:31AM +1000, Nick Piggin wrote:
> Yep, that's why I'm going to ask some swsusp developers to have
> a look at it.

Do you have any idea who wrote the comment you referred to?


On Fri, Jun 24, 2005 at 09:21:31AM +1000, Nick Piggin wrote:
> I wouldn't pretend to be able to fix every bug everywhere in the
> kernel myself.

I've got bad news for you. You will need to be willing and able to
read and alter the source of things as unglamorous as crappy ancient
drivers and obscure dogslow architectures, and to do so carefully enough
that things continue to work when you're changing broadly-used code. You
will never have the hardware to test anywhere near 10% of it on.

Worse yet, when you make the changes, you bear the burden of sweeps.
The fact they break when you make your change does not mean they were
buggy. It means you broke them. Sure, best effort is all you can ever
do, but this is far from anything like that.

If you can't handle the sweep, you have no business writing the patch.

Someone please take this over. None of this will be heeded so long as
I'm the one saying it.


-- wli

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [patch][rfc] 5/5: core remove PageReserved
  2005-06-24  0:59                     ` William Lee Irwin III
@ 2005-06-24  1:17                       ` Nick Piggin
  -1 siblings, 0 replies; 32+ messages in thread
From: Nick Piggin @ 2005-06-24  1:17 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: linux-kernel, Linux Memory Management, Hugh Dickins, Badari Pulavarty

William Lee Irwin III wrote:

> On Fri, Jun 24, 2005 at 09:21:31AM +1000, Nick Piggin wrote:
> 
>>This patch doesn't care how it works, that would be for a later patch.
> 
> 
> The general gist of all this is that the patch doesn't cover anywhere
> near enough ground, and so the above illustrates that the usage
> in/around bootmem is among the easiest of usages to remove. I have
> implemented what I'm talking about on several independent occasions.
> 

I don't think you understand what ground the patch covers.
It removes PageReserved() queries from core code, leaving
all PG_reserved flags and other uses of it intact so it
doesn't cause mass breakage.

Things can then be looked at and fixed up properly at a slower
pace while the patch is in -mm, for example.

> 
> William Lee Irwin III wrote:
> 
>>>This is called in the interior of a loop, which may be beneficial to
>>>terminate if this intended semantic is to be enforced. Furthermore, no
>>>error is propagated to the caller, which is not the desired effect in
>>>the stated error reporting scheme. So the code is inconsistent with
>>>explicitly stated intention.
> 
> 
> On Fri, Jun 24, 2005 at 09:21:31AM +1000, Nick Piggin wrote:
> 
>>No, the error reporting scheme says it doesn't handle any error,
>>that is all. What we have here in terms of behaviour is exactly
>>what used to happen, that is - do something saneish on error.
>>Changing behaviour would be outside the scope of this patch, but
>>be my guest.
> 
> 
> Some places BUG(), some back out with an error return, others blithely
> proceed. This kind of inconsistency will broadly confuse callers of
> the APIs.
> 

I don't think any places BUG(), but regardless, this patch doesn't
deal with changing behaviour, just reporting of the problem.

> 
> William Lee Irwin III wrote:
> 
>>>You are going on about the fact that install_page() can't be used on
>>>memory outside mem_map[] as it requires a page structure, and can't be
>>>used on reserved pages because page_add_file_rmap() will BUG. This case
>>>is not being discussed.
> 
> 
> On Fri, Jun 24, 2005 at 09:21:31AM +1000, Nick Piggin wrote:
> 
>>And that it isn't allowed to touch struct page of physical pages
>>in a VM_RESERVED region.
> 
> 
> non sequitur
> 

Huh? That is why it is broken. Previously, PageReserved pages
*were not* accounted with the normal rmap and friends in core
code, so you can't just do it in here and hope it works.
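(The "not accounted" behaviour meant here is the early-out in the 2.6-era file rmap path, sketched below from memory; the exact guards may differ between releases.)

#include <linux/mm.h>
#include <linux/rmap.h>

void page_add_file_rmap(struct page *page)
{
        BUG_ON(PageAnon(page));
        /* reserved and non-mem_map pages were simply never counted */
        if (!pfn_valid(page_to_pfn(page)) || PageReserved(page))
                return;

        if (atomic_inc_and_test(&page->_mapcount))
                inc_page_state(nr_mapped);
}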

> 
> William Lee Irwin III wrote:
> 
>>>The issue at stake is inserting normal pages into a VM_RESERVED vma.
>>>These will arise as e.g. kernel-allocated buffers managed by normal
>>>reference counting. remap_pfn_range() can't do it; it refuses to
>>>operate on "valid memory". install_page() now won't do it; it refuses
>>>to touch a VM_RESERVED vma. So this creates a giant semantic hole,
>>>and potentially breaks working code (i.e. if you were going to do
>>>this you would need not only a replacement but also a sweep to adjust
>>>all the drivers doing it or prove their nonexistence).
> 
> 
> On Fri, Jun 24, 2005 at 09:21:31AM +1000, Nick Piggin wrote:
> 
>>I think you'll find that remap_pfn_range will be happy to operate
>>on valid memory, and that any driver trying to use install_page
>>on VM_RESERVED probably needs fixing anyway.
> 
> 
> install_page() of a !PageReserved() page in a VM_RESERVED vma is
> neither broken now nor does it merit going BUG().
> 

Hugh says it is broken, so you can ask him the details.

> /dev/mem was disallowed from mapping ordinary kernel memory for a
> reason (though I disagree with it), so the removal of the
> pfn_valid()/PageReserved() checks can't be blithely done like in
> your patch. Other arrangements must be made.
> 
> 
> William Lee Irwin III wrote:
> 
>>>Unfortunately for this scheme, it's very much a case of putting the
>>>cart before the horse. PG_reserved is toggled at random in this driver
>>>after this change, to no useful effect (debugging or otherwise). And
>>>this really goes for the whole affair. Diddling the core first is just
>>>going to create bugs. Converting the users first is the way these
>>>things need to be done. When complete, nothing needs the core flags
>>>twiddling anymore and you just nuke the flag twiddling from the core.
> 
> 
> On Fri, Jun 24, 2005 at 09:21:31AM +1000, Nick Piggin wrote:
> 
>>I'm sorry, I don't see how 'diddling' the core will create bugs.
>>This is a fine way to do it, and "converting" users first (whatever
>>that means)
> 
> [cut text included in full in the following quote block]
> 
> This is going way too far. Someone please deal with this.
> 

No, just tell me how it might magically create bugs in drivers
that didn't exist in the first place?

> 
> On Fri, Jun 24, 2005 at 09:21:31AM +1000, Nick Piggin wrote:
> [continued]
> 
>>This is a fine way to do it, and "converting" users first (whatever
>>that means) is not possible because VM_RESERVED handling in core
>>code is not up to the task of replacing PageReserved without this
>>patch.
> 
> 
> You aren't replacing any of the PG_reserved usage with VM_RESERVED
> in any of these patches. The primary motive of all this is AFAICT
> little more than getting the PG_reserved check out of put_page(),
> drivers and arches be damned.
> 

Excuse me? Drivers work, arches work.

And how might you "fix" them first, without the necessary core
code in place?

Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [patch][rfc] 5/5: core remove PageReserved
  2005-06-24  0:59                     ` William Lee Irwin III
@ 2005-06-24  1:25                       ` Nick Piggin
  -1 siblings, 0 replies; 32+ messages in thread
From: Nick Piggin @ 2005-06-24  1:25 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: linux-kernel, Linux Memory Management, Hugh Dickins, Badari Pulavarty

William Lee Irwin III wrote:

>>I wouldn't pretend to be able to fix every bug everywhere in the
>>kernel myself.
> 

> If you can't handle the sweep, you have no business writing the patch.
> 

Let me just say that where I have introduced incompatibilities
I will try to look through drivers and arch code. I can't look
through and find everything that is already buggy though.

Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [patch][rfc] 5/5: core remove PageReserved
  2005-06-24  1:17                       ` Nick Piggin
@ 2005-06-24  1:47                         ` Nick Piggin
  -1 siblings, 0 replies; 32+ messages in thread
From: Nick Piggin @ 2005-06-24  1:47 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: linux-kernel, Linux Memory Management, Hugh Dickins, Badari Pulavarty

Nick Piggin wrote:
> William Lee Irwin III wrote:
> 

>> On Fri, Jun 24, 2005 at 09:21:31AM +1000, Nick Piggin wrote:
>>
>>> I'm sorry, I don't see how 'diddling' the core will create bugs.
>>> This is a fine way to do it, and "converting" users first (whatever
>>> that means)
>>
>>
>> [cut text included in full in the following quote block]
>>
>> This is going way too far. Someone please deal with this.
>>
> 
> No, just tell me how it might magically create bugs in drivers
> that didn't exist in the first place?


Sorry, I got a bit carried away. That remark wasn't helpful.

What I mean is: if you see an aspect of this change that will
cause breakage in previously correct drivers, please raise
your specific concerns with me.

I have tried to be careful about this, and put in BUG()s/warnings
where problems in already broken code are magnified.

Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [patch][rfc] 5/5: core remove PageReserved
  2005-06-23  9:51             ` William Lee Irwin III
@ 2005-06-24  4:50               ` Andrew Morton
  -1 siblings, 0 replies; 32+ messages in thread
From: Andrew Morton @ 2005-06-24  4:50 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: nickpiggin, linux-kernel, linux-mm, hugh, pbadari, linux-scsi

William Lee Irwin III <wli@holomorphy.com> wrote:
>
>  On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote:
>  > Index: linux-2.6/drivers/scsi/sg.c
>  > ===================================================================
>  > --- linux-2.6.orig/drivers/scsi/sg.c
>  > +++ linux-2.6/drivers/scsi/sg.c
>  > @@ -1887,9 +1887,10 @@ st_unmap_user_pages(struct scatterlist *
>  >  	int i;
>  >  
>  >  	for (i=0; i < nr_pages; i++) {
>  > -		if (dirtied && !PageReserved(sgl[i].page))
>  > +		if (dirtied)
>  >  			SetPageDirty(sgl[i].page);
>  >  		/* unlock_page(sgl[i].page); */
>  > +		/* FIXME: XXX don't dirty/unmap VM_RESERVED regions? */
>  >  		/* FIXME: cache flush missing for rw==READ
>  >  		 * FIXME: call the correct reference counting function
>  >  		 */
> 
>  An answer should be devised for this. My numerous SCSI CD-ROM devices
>  (I have 5 across several different machines of several different arches)
>  are rather unlikely to be happy with /* FIXME: XXX ... as an answer.
> 
> 
>  On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote:
>  > Index: linux-2.6/drivers/scsi/st.c
>  > ===================================================================
>  > --- linux-2.6.orig/drivers/scsi/st.c
>  > +++ linux-2.6/drivers/scsi/st.c
>  > @@ -4435,8 +4435,9 @@ static int sgl_unmap_user_pages(struct s
>  >  	int i;
>  >  
>  >  	for (i=0; i < nr_pages; i++) {
>  > -		if (dirtied && !PageReserved(sgl[i].page))
>  > +		if (dirtied)
>  >  			SetPageDirty(sgl[i].page);
>  > +		/* FIXME: XXX don't dirty/unmap VM_RESERVED regions? */
>  >  		/* FIXME: cache flush missing for rw==READ
>  >  		 * FIXME: call the correct reference counting function
>  >  		 */
> 
>  Mutatis mutandis for my SCSI tape drive.

This scsi code is already rather wrong.  There isn't much point in just
setting PG_dirty and leaving the page marked as clean in the radix tree. 
As it is we'll lose data if the user reads it into a MAP_SHARED memory
buffer.

set_page_dirty_lock() should be used here.  That can sleep.

<looks>

The above two functions are called under write_lock_irqsave() (at least)
and might be called from irq context (dunno).  So we cannot use
set_page_dirty_lock() and we don't have a ref on the page's inode.  We
could use set_page_dirty() and be racy against page reclaim.

But to get all this correct (and it's very incorrect now) we'd need to punt
the page dirtying up to process context, along the lines of
bio_check_pages_dirty().
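(A sketch of that "punt to process context" option, assuming a driver-private structure to carry the pages; the structure, function name and queuing comments are made up for illustration, not taken from bio_check_pages_dirty() itself.)

#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

struct deferred_dirty {
        struct work_struct work;
        struct page **pages;
        int nr_pages;
};

/* runs in process context, where set_page_dirty_lock() may sleep */
static void deferred_dirty_fn(void *data)
{
        struct deferred_dirty *dd = data;
        int i;

        for (i = 0; i < dd->nr_pages; i++) {
                set_page_dirty_lock(dd->pages[i]);
                page_cache_release(dd->pages[i]);
        }
        kfree(dd);
}

/* the irq-context caller would only do something like:
 *      INIT_WORK(&dd->work, deferred_dirty_fn, dd);
 *      schedule_work(&dd->work);
 */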

Or, if st_unmap_user_pages() and sgl_unmap_user_pages() are not called from
irq context then we should arrange for them to be called without locks held
and use set_page_dirty_lock().
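(And a minimal sketch of that second option, assuming the unmap path really can be called from process context with no locks held; the helper name and the plain page-array calling convention are made up for illustration.)

#include <linux/mm.h>
#include <linux/pagemap.h>

static void sketch_release_user_pages(struct page **pages, int nr_pages,
                                      int dirtied)
{
        int i;

        for (i = 0; i < nr_pages; i++) {
                if (dirtied)
                        set_page_dirty_lock(pages[i]);  /* may sleep */
                page_cache_release(pages[i]);
        }
}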


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [patch][rfc] 5/5: core remove PageReserved
  2005-06-24  4:50               ` Andrew Morton
@ 2005-06-24  8:24                 ` William Lee Irwin III
  -1 siblings, 0 replies; 32+ messages in thread
From: William Lee Irwin III @ 2005-06-24  8:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: nickpiggin, linux-kernel, linux-mm, hugh, pbadari, linux-scsi

William Lee Irwin III <wli@holomorphy.com> wrote:
>>  An answer should be devised for this. My numerous SCSI CD-ROM devices
>>  (I have 5 across several different machines of several different arches)
>>  are rather unlikely to be happy with /* FIXME: XXX ... as an answer.
[...]
>>  Mutatis mutandis for my SCSI tape drive.

On Thu, Jun 23, 2005 at 09:50:11PM -0700, Andrew Morton wrote:
> This scsi code is already rather wrong.  There isn't much point in just
> setting PG_dirty and leaving the page marked as clean in the radix tree. 
> As it is we'll lose data if the user reads it into a MAP_SHARED memory
> buffer.
> set_page_dirty_lock() should be used here.  That can sleep.
> The above two functions are called under write_lock_irqsave() (at least)
> and might be called from irq context (dunno).  So we cannot use
> set_page_dirty_lock() and we don't have a ref on the page's inode.  We
> could use set_page_dirty() and be racy against page reclaim.
> But to get all this correct (and it's very incorrect now) we'd need to punt
> the page dirtying up to process context, along the lines of
> bio_check_pages_dirty().
> Or, if st_unmap_user_pages() and sgl_unmap_user_pages() are not called from
> irq context then we should arrange for them to be called without locks held
> and use set_page_dirty_lock().

This all sounds very reasonable. I was originally more concerned about
the new FIXME getting introduced, but this sounds like a good way to
resolve the preexisting FIXMEs surrounding all this.


-- wli

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [patch][rfc] 5/5: core remove PageReserved
  2005-06-24  4:50               ` Andrew Morton
@ 2005-06-26  8:41                 ` Nick Piggin
  -1 siblings, 0 replies; 32+ messages in thread
From: Nick Piggin @ 2005-06-26  8:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: William Lee Irwin III, linux-kernel, linux-mm, hugh, pbadari, linux-scsi

Andrew Morton wrote:
> William Lee Irwin III <wli@holomorphy.com> wrote:

>> Mutatis mutandis for my SCSI tape drive.
> 
> 

OK, for the VM_RESERVED case, it looks like it won't be much of a problem,
because get_user_pages faults on VM_IO regions (a flag that remap_pfn_range,
which is used by mem.c and most drivers, already sets). So this code will
simply not encounter VM_RESERVED regions. Obviously get_user_pages should
be made to check explicitly for VM_RESERVED too, but the point is that
introducing such a check will not overly restrict drivers.
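(The explicit check alluded to would be on the order of the sketch below, sitting next to the existing VM_IO test in get_user_pages(); the wrapper function is made up for illustration.)

#include <linux/mm.h>

/* reject vmas that get_user_pages() must not touch */
static int sketch_gup_vma_ok(struct vm_area_struct *vma,
                             unsigned int required_flags)
{
        if (!vma || (vma->vm_flags & (VM_IO | VM_RESERVED)))
                return 0;
        return (vma->vm_flags & required_flags) == required_flags;
}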

[snip SetPageDirty is wrong]

Not that this helps the existing bug...

-- 
SUSE Labs, Novell Inc.

Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2005-06-26  8:41 UTC | newest]

Thread overview: 32+ messages
2005-06-23  7:05 [patch][rfc] 0/5: remove PageReserved Nick Piggin
2005-06-23  7:06 ` [patch][rfc] 1/5: comment for mm/rmap.c Nick Piggin
2005-06-23  7:06   ` [patch][rfc] 2/5: micro optimisation " Nick Piggin
2005-06-23  7:07     ` [patch][rfc] 3/5: remove atomic bitop when freeing page Nick Piggin
2005-06-23  7:07       ` [patch][rfc] 4/5: remap ZERO_PAGE mappings Nick Piggin
2005-06-23  7:08         ` [patch][rfc] 5/5: core remove PageReserved Nick Piggin
2005-06-23  9:51           ` William Lee Irwin III
2005-06-23 10:32             ` Nick Piggin
2005-06-23 22:08               ` William Lee Irwin III
2005-06-23 23:21                 ` Nick Piggin
2005-06-24  0:59                   ` William Lee Irwin III
2005-06-24  1:17                     ` Nick Piggin
2005-06-24  1:47                       ` Nick Piggin
2005-06-24  1:25                     ` Nick Piggin
2005-06-24  4:50             ` Andrew Morton
2005-06-24  8:24               ` William Lee Irwin III
2005-06-26  8:41               ` Nick Piggin
2005-06-23  7:26     ` [patch][rfc] 2/5: micro optimisation for mm/rmap.c William Lee Irwin III
2005-06-23  7:33       ` Nick Piggin
