linux-kernel.vger.kernel.org archive mirror
* [PATCH] swap 01/13 no SWAP_ERROR
@ 2003-03-25 22:10 Hugh Dickins
  2003-03-25 22:12 ` [PATCH] swap 02/13 !CONFIG_SWAP try_to_unmap Hugh Dickins
                   ` (11 more replies)
  0 siblings, 12 replies; 22+ messages in thread
From: Hugh Dickins @ 2003-03-25 22:10 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

First of thirteen miscellaneous independent patches in the swap/rmap
area; several touch the same files, so they are numbered in sequence.
Many, but not all, of these fixes were extracted from the reverted
anobjrmap patch.
Based upon 2.5.65-mm4, the aggregate has diffstat:

 include/linux/rmap-locking.h |    2 
 include/linux/swap.h         |   18 ++----
 mm/fremap.c                  |   12 ++--
 mm/mmap.c                    |    8 --
 mm/rmap.c                    |   68 ++++++++++++++-----------
 mm/shmem.c                   |   26 ++++++---
 mm/swap_state.c              |    4 -
 mm/swapfile.c                |  115 ++++++++++++++++++++++++++-----------------
 mm/vmscan.c                  |   14 ++---
 9 files changed, 151 insertions(+), 116 deletions(-)

swap 01/13 no SWAP_ERROR
Delete unused SWAP_ERROR and non-existent page_over_rsslimit().

--- 2.5.65-mm4/include/linux/swap.h	Wed Mar  5 07:26:34 2003
+++ swap01/include/linux/swap.h	Tue Mar 25 20:42:56 2003
@@ -174,13 +174,11 @@
 					struct pte_chain *));
 void FASTCALL(page_remove_rmap(struct page *, pte_t *));
 int FASTCALL(try_to_unmap(struct page *));
-int FASTCALL(page_over_rsslimit(struct page *));
 
 /* return values of try_to_unmap */
 #define	SWAP_SUCCESS	0
 #define	SWAP_AGAIN	1
 #define	SWAP_FAIL	2
-#define	SWAP_ERROR	3
 
 /* linux/mm/shmem.c */
 extern int shmem_unuse(swp_entry_t entry, struct page *page);
--- 2.5.65-mm4/mm/rmap.c	Sun Mar 23 10:30:15 2003
+++ swap01/mm/rmap.c	Tue Mar 25 20:42:56 2003
@@ -677,7 +677,6 @@
  * SWAP_SUCCESS	- we succeeded in removing all mappings
  * SWAP_AGAIN	- we missed a trylock, try again later
  * SWAP_FAIL	- the page is unswappable
- * SWAP_ERROR	- an error occurred
  */
 int try_to_unmap(struct page * page)
 {
@@ -754,9 +753,6 @@
 			case SWAP_FAIL:
 				ret = SWAP_FAIL;
 				goto out;
-			case SWAP_ERROR:
-				ret = SWAP_ERROR;
-				goto out;
 			}
 		}
 	}
--- 2.5.65-mm4/mm/vmscan.c	Tue Feb 18 02:14:32 2003
+++ swap01/mm/vmscan.c	Tue Mar 25 20:42:56 2003
@@ -284,7 +284,6 @@
 		 */
 		if (page_mapped(page) && mapping) {
 			switch (try_to_unmap(page)) {
-			case SWAP_ERROR:
 			case SWAP_FAIL:
 				pte_chain_unlock(page);
 				goto activate_locked;



* [PATCH] swap 02/13 !CONFIG_SWAP try_to_unmap
  2003-03-25 22:10 [PATCH] swap 01/13 no SWAP_ERROR Hugh Dickins
@ 2003-03-25 22:12 ` Hugh Dickins
  2003-03-25 22:13 ` [PATCH] swap 03/13 add_to_swap_cache Hugh Dickins
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 22+ messages in thread
From: Hugh Dickins @ 2003-03-25 22:12 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

Raised the #endif CONFIG_SWAP in shrink_list: it was excluding
try_to_unmap of file pages.  I suspect !CONFIG_MMU relied on that
to suppress try_to_unmap, so added a SWAP_FAIL stub for it.

--- swap01/include/linux/swap.h	Tue Mar 25 20:42:56 2003
+++ swap02/include/linux/swap.h	Tue Mar 25 20:43:07 2003
@@ -175,19 +175,18 @@
 void FASTCALL(page_remove_rmap(struct page *, pte_t *));
 int FASTCALL(try_to_unmap(struct page *));
 
-/* return values of try_to_unmap */
-#define	SWAP_SUCCESS	0
-#define	SWAP_AGAIN	1
-#define	SWAP_FAIL	2
-
 /* linux/mm/shmem.c */
 extern int shmem_unuse(swp_entry_t entry, struct page *page);
-
 #else
-#define page_referenced(page) \
-	TestClearPageReferenced(page)
+#define page_referenced(page)	TestClearPageReferenced(page)
+#define try_to_unmap(page)	SWAP_FAIL
 #endif /* CONFIG_MMU */
 
+/* return values of try_to_unmap */
+#define	SWAP_SUCCESS	0
+#define	SWAP_AGAIN	1
+#define	SWAP_FAIL	2
+
 #ifdef CONFIG_SWAP
 /* linux/mm/page_io.c */
 extern int swap_readpage(struct file *, struct page *);
--- swap01/mm/vmscan.c	Tue Mar 25 20:42:56 2003
+++ swap02/mm/vmscan.c	Tue Mar 25 20:43:07 2003
@@ -277,6 +277,7 @@
 			pte_chain_lock(page);
 			mapping = page->mapping;
 		}
+#endif /* CONFIG_SWAP */
 
 		/*
 		 * The page is mapped into the page tables of one or more
@@ -294,7 +295,6 @@
 				; /* try to free the page below */
 			}
 		}
-#endif /* CONFIG_SWAP */
 		pte_chain_unlock(page);
 
 		/*



* [PATCH] swap 03/13 add_to_swap_cache
  2003-03-25 22:10 [PATCH] swap 01/13 no SWAP_ERROR Hugh Dickins
  2003-03-25 22:12 ` [PATCH] swap 02/13 !CONFIG_SWAP try_to_unmap Hugh Dickins
@ 2003-03-25 22:13 ` Hugh Dickins
  2003-03-25 22:13 ` [PATCH] swap 04/13 page_convert_anon -ENOMEM Hugh Dickins
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 22+ messages in thread
From: Hugh Dickins @ 2003-03-25 22:13 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

Make add_to_swap_cache static: it's only used by read_swap_cache_async;
and since that has just done a GFP_HIGHUSER allocation, surely it's
better for add_to_swap_cache to use GFP_KERNEL than GFP_ATOMIC.

--- swap02/include/linux/swap.h	Tue Mar 25 20:43:07 2003
+++ swap03/include/linux/swap.h	Tue Mar 25 20:43:18 2003
@@ -197,7 +197,6 @@
 extern struct address_space swapper_space;
 #define total_swapcache_pages  swapper_space.nrpages
 extern void show_swap_cache_info(void);
-extern int add_to_swap_cache(struct page *, swp_entry_t);
 extern int add_to_swap(struct page *);
 extern void __delete_from_swap_cache(struct page *);
 extern void delete_from_swap_cache(struct page *);
--- swap02/mm/swap_state.c	Wed Mar  5 07:26:34 2003
+++ swap03/mm/swap_state.c	Tue Mar 25 20:43:18 2003
@@ -68,7 +68,7 @@
 		swap_cache_info.noent_race, swap_cache_info.exist_race);
 }
 
-int add_to_swap_cache(struct page *page, swp_entry_t entry)
+static int add_to_swap_cache(struct page *page, swp_entry_t entry)
 {
 	int error;
 
@@ -78,7 +78,7 @@
 		INC_CACHE_INFO(noent_race);
 		return -ENOENT;
 	}
-	error = add_to_page_cache(page, &swapper_space, entry.val, GFP_ATOMIC);
+	error = add_to_page_cache(page, &swapper_space, entry.val, GFP_KERNEL);
 	/*
 	 * Anon pages are already on the LRU, we don't run lru_cache_add here.
 	 */



* [PATCH] swap 04/13 page_convert_anon -ENOMEM
  2003-03-25 22:10 [PATCH] swap 01/13 no SWAP_ERROR Hugh Dickins
  2003-03-25 22:12 ` [PATCH] swap 02/13 !CONFIG_SWAP try_to_unmap Hugh Dickins
  2003-03-25 22:13 ` [PATCH] swap 03/13 add_to_swap_cache Hugh Dickins
@ 2003-03-25 22:13 ` Hugh Dickins
  2003-03-25 22:14 ` [PATCH] swap 05/13 page_convert_anon unlocking Hugh Dickins
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 22+ messages in thread
From: Hugh Dickins @ 2003-03-25 22:13 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

objrmap page_convert_anon tries to allocate pte_chains, so must allow
for failure (even GFP_KERNEL can fail when OOM-killed); and move its
prototype to join the rest of the rmap.c prototypes in swap.h.

--- swap03/include/linux/rmap-locking.h	Sun Mar 23 10:30:14 2003
+++ swap04/include/linux/rmap-locking.h	Tue Mar 25 20:43:29 2003
@@ -45,5 +45,3 @@
 	if (pte_chain)
 		__pte_chain_free(pte_chain);
 }
-
-void page_convert_anon(struct page *page);
--- swap03/include/linux/swap.h	Tue Mar 25 20:43:18 2003
+++ swap04/include/linux/swap.h	Tue Mar 25 20:43:29 2003
@@ -175,6 +175,8 @@
 void FASTCALL(page_remove_rmap(struct page *, pte_t *));
 int FASTCALL(try_to_unmap(struct page *));
 
+int page_convert_anon(struct page *);
+
 /* linux/mm/shmem.c */
 extern int shmem_unuse(swp_entry_t entry, struct page *page);
 #else
--- swap03/mm/fremap.c	Sun Mar 23 10:30:15 2003
+++ swap04/mm/fremap.c	Tue Mar 25 20:43:29 2003
@@ -69,8 +69,10 @@
 	pgidx = (addr - vma->vm_start) >> PAGE_SHIFT;
 	pgidx += vma->vm_pgoff;
 	pgidx >>= PAGE_CACHE_SHIFT - PAGE_SHIFT;
-	if (!PageAnon(page) && (page->index != pgidx))
-		page_convert_anon(page);
+	if (!PageAnon(page) && (page->index != pgidx)) {
+		if (page_convert_anon(page) < 0)
+			goto err_free;
+	}
 
 	pgd = pgd_offset(mm, addr);
 	spin_lock(&mm->page_table_lock);
@@ -95,12 +97,10 @@
 	if (flush)
 		flush_tlb_page(vma, addr);
 
-	spin_unlock(&mm->page_table_lock);
-	pte_chain_free(pte_chain);
-	return 0;
-
+	err = 0;
 err_unlock:
 	spin_unlock(&mm->page_table_lock);
+err_free:
 	pte_chain_free(pte_chain);
 err:
 	return err;
--- swap03/mm/rmap.c	Tue Mar 25 20:42:56 2003
+++ swap04/mm/rmap.c	Tue Mar 25 20:43:29 2003
@@ -774,7 +774,7 @@
  * of pte_chain structures to ensure that it can complete without releasing
  * the lock.
  */
-void page_convert_anon(struct page *page)
+int page_convert_anon(struct page *page)
 {
 	struct address_space *mapping = page->mapping;
 	struct vm_area_struct *vma;
@@ -783,6 +783,7 @@
 	pte_addr_t pte_paddr;
 	int mapcount;
 	int index = 0;
+	int err = 0;
 
 	if (PageAnon(page))
 		goto out;
@@ -795,6 +796,15 @@
 	if (mapcount > 1) {
 		for (; index < mapcount; index += NRPTE) {
 			ptec = pte_chain_alloc(GFP_KERNEL);
+			if (!ptec) {
+				while (pte_chain) {
+					ptec = pte_chain->next;
+					pte_chain_free(pte_chain);
+					pte_chain = ptec;
+				}
+				err = -ENOMEM;
+				goto out;
+			}
 			ptec->next = pte_chain;
 			pte_chain = ptec;
 		}
@@ -867,7 +877,7 @@
 	pte_chain_unlock(page);
 	up(&mapping->i_shared_sem);
 out:
-	return;
+	return err;
 }
 
 /**



* [PATCH] swap 05/13 page_convert_anon unlocking
  2003-03-25 22:10 [PATCH] swap 01/13 no SWAP_ERROR Hugh Dickins
                   ` (2 preceding siblings ...)
  2003-03-25 22:13 ` [PATCH] swap 04/13 page_convert_anon -ENOMEM Hugh Dickins
@ 2003-03-25 22:14 ` Hugh Dickins
  2003-03-25 22:15 ` [PATCH] swap 06/13 wrap below vm_start Hugh Dickins
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 22+ messages in thread
From: Hugh Dickins @ 2003-03-25 22:14 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

objrmap page_convert_anon was missing a pte_chain_unlock and an up of
i_shared_sem.

--- swap04/mm/rmap.c	Tue Mar 25 20:43:29 2003
+++ swap05/mm/rmap.c	Tue Mar 25 20:43:40 2003
@@ -818,6 +818,7 @@
 	 */
 	if (mapcount < page->pte.mapcount) {
 		pte_chain_unlock(page);
+		up(&mapping->i_shared_sem);
 		goto retry;
 	} else if ((mapcount > page->pte.mapcount) && (mapcount > 1)) {
 		mapcount = page->pte.mapcount;
@@ -833,7 +834,7 @@
 	SetPageAnon(page);
 
 	if (mapcount == 0)
-		goto out;
+		goto out_unlock;
 	else if (mapcount == 1) {
 		SetPageDirect(page);
 		page->pte.direct = 0;



* [PATCH] swap 06/13 wrap below vm_start
  2003-03-25 22:10 [PATCH] swap 01/13 no SWAP_ERROR Hugh Dickins
                   ` (3 preceding siblings ...)
  2003-03-25 22:14 ` [PATCH] swap 05/13 page_convert_anon unlocking Hugh Dickins
@ 2003-03-25 22:15 ` Hugh Dickins
  2003-03-25 22:17 ` [PATCH] swap 07/13 objrmap page_table_lock Hugh Dickins
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 22+ messages in thread
From: Hugh Dickins @ 2003-03-25 22:15 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

objrmap must check in case the address computation wrapped below
vma->vm_start.

--- swap05/mm/rmap.c	Tue Mar 25 20:43:40 2003
+++ swap06/mm/rmap.c	Tue Mar 25 20:43:51 2003
@@ -100,12 +100,8 @@
 	unsigned long address;
 
 	loffset = (page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT));
-	if (loffset < vma->vm_pgoff)
-		goto out;
-
 	address = vma->vm_start + ((loffset - vma->vm_pgoff) << PAGE_SHIFT);
-
-	if (address >= vma->vm_end)
+	if (address < vma->vm_start || address >= vma->vm_end)
 		goto out;
 
 	pgd = pgd_offset(mm, address);



* [PATCH] swap 07/13 objrmap page_table_lock
  2003-03-25 22:10 [PATCH] swap 01/13 no SWAP_ERROR Hugh Dickins
                   ` (4 preceding siblings ...)
  2003-03-25 22:15 ` [PATCH] swap 06/13 wrap below vm_start Hugh Dickins
@ 2003-03-25 22:17 ` Hugh Dickins
  2003-03-25 22:18 ` [PATCH] swap 08/13 rmap comments Hugh Dickins
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 22+ messages in thread
From: Hugh Dickins @ 2003-03-25 22:17 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

A recent update removed the page_table_lock from objrmap.  That was
simply wrong: it's necessary to guard mm->rss (in try_to_unmap), and
it's necessary to ensure the page tables don't get freed beneath us
(in which case any modification of the pte, by page_referenced or
by try_to_unmap, might be corrupting a freed and reused page).

Then restore the original SWAP_AGAIN semantics in try_to_unmap:
simplest for the inner levels to return SWAP_AGAIN or SWAP_FAIL,
and for the outer level to decide SWAP_SUCCESS if all pages were
unmapped.  Stop searching once all have been unmapped.

page_convert_anon left for the moment with FIXMEs rather than
page_table_lock.  I believe it can be done without page_table_lock
(which would be problematic there), but have noticed several other
bugs there, more patches will follow at a later date (though I'd
much rather anobjrmap, which handles all this so much more cleanly).

--- swap06/mm/rmap.c	Tue Mar 25 20:43:51 2003
+++ swap07/mm/rmap.c	Tue Mar 25 20:44:02 2003
@@ -143,9 +143,13 @@
 static int
 page_referenced_obj_one(struct vm_area_struct *vma, struct page *page)
 {
+	struct mm_struct *mm = vma->vm_mm;
 	pte_t *pte;
 	int referenced = 0;
 
+	if (!spin_trylock(&mm->page_table_lock))
+		return 1;
+
 	pte = find_pte(vma, page, NULL);
 	if (pte) {
 		if (ptep_test_and_clear_young(pte))
@@ -153,6 +157,7 @@
 		pte_unmap(pte);
 	}
 
+	spin_unlock(&mm->page_table_lock);
 	return referenced;
 }
 
@@ -489,7 +494,10 @@
 	unsigned long address;
 	pte_t *pte;
 	pte_t pteval;
-	int ret = SWAP_SUCCESS;
+	int ret = SWAP_AGAIN;
+
+	if (!spin_trylock(&mm->page_table_lock))
+		return ret;
 
 	pte = find_pte(vma, page, &address);
 	if (!pte)
@@ -518,6 +526,7 @@
 	pte_unmap(pte);
 
 out:
+	spin_unlock(&mm->page_table_lock);
 	return ret;
 }
 
@@ -538,7 +547,7 @@
 {
 	struct address_space *mapping = page->mapping;
 	struct vm_area_struct *vma;
-	int ret = SWAP_SUCCESS;
+	int ret = SWAP_AGAIN;
 
 	if (!mapping)
 		BUG();
@@ -547,23 +556,20 @@
 		BUG();
 
 	if (down_trylock(&mapping->i_shared_sem))
-		return SWAP_AGAIN;
+		return ret;
 	
 	list_for_each_entry(vma, &mapping->i_mmap, shared) {
 		ret = try_to_unmap_obj_one(vma, page);
-		if (ret != SWAP_SUCCESS)
+		if (ret == SWAP_FAIL || !page->pte.mapcount)
 			goto out;
 	}
 
 	list_for_each_entry(vma, &mapping->i_mmap_shared, shared) {
 		ret = try_to_unmap_obj_one(vma, page);
-		if (ret != SWAP_SUCCESS)
+		if (ret == SWAP_FAIL || !page->pte.mapcount)
 			goto out;
 	}
 
-	if (page->pte.mapcount)
-		BUG();
-
 out:
 	up(&mapping->i_shared_sem);
 	return ret;
@@ -753,8 +759,10 @@
 		}
 	}
 out:
-	if (!page_mapped(page))
+	if (!page_mapped(page)) {
 		dec_page_state(nr_mapped);
+		ret = SWAP_SUCCESS;
+	}
 	return ret;
 }
 
@@ -839,6 +847,7 @@
 
 	index = NRPTE-1;
 	list_for_each_entry(vma, &mapping->i_mmap, shared) {
+		/* FIXME: unsafe without page_table_lock */
 		pte = find_pte(vma, page, NULL);
 		if (pte) {
 			pte_paddr = ptep_to_paddr(pte);
@@ -855,6 +864,7 @@
 		}
 	}
 	list_for_each_entry(vma, &mapping->i_mmap_shared, shared) {
+		/* FIXME: unsafe without page_table_lock */
 		pte = find_pte(vma, page, NULL);
 		if (pte) {
 			pte_paddr = ptep_to_paddr(pte);



* [PATCH] swap 08/13 rmap comments
  2003-03-25 22:10 [PATCH] swap 01/13 no SWAP_ERROR Hugh Dickins
                   ` (5 preceding siblings ...)
  2003-03-25 22:17 ` [PATCH] swap 07/13 objrmap page_table_lock Hugh Dickins
@ 2003-03-25 22:18 ` Hugh Dickins
  2003-03-25 22:19 ` [PATCH] swap 09/13 tmpfs truncation Hugh Dickins
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 22+ messages in thread
From: Hugh Dickins @ 2003-03-25 22:18 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

Update a few locking comments in rmap.c.

--- swap07/mm/rmap.c	Tue Mar 25 20:44:02 2003
+++ swap08/mm/rmap.c	Tue Mar 25 20:44:13 2003
@@ -14,8 +14,8 @@
 /*
  * Locking:
  * - the page->pte.chain is protected by the PG_chainlock bit,
- *   which nests within the zone->lru_lock, then the
- *   mm->page_table_lock, and then the page lock.
+ *   which nests within the mm->page_table_lock,
+ *   which nests within the page lock.
  * - because swapout locking is opposite to the locking order
  *   in the page fault path, the swapout path uses trylocks
  *   on the mm->page_table_lock
@@ -584,9 +584,8 @@
  * table entry mapping a page. Because locking order here is opposite
  * to the locking order used by the page fault path, we use trylocks.
  * Locking:
- *	zone->lru_lock			page_launder()
- *	    page lock			page_launder(), trylock
- *		pte_chain_lock		page_launder()
+ *	    page lock			shrink_list(), trylock
+ *		pte_chain_lock		shrink_list()
  *		    mm->page_table_lock	try_to_unmap_one(), trylock
  */
 static int FASTCALL(try_to_unmap_one(struct page *, pte_addr_t));
@@ -673,8 +672,8 @@
  * @page: the page to get unmapped
  *
  * Tries to remove all the page table entries which are mapping this
- * page, used in the pageout path.  Caller must hold zone->lru_lock
- * and the page lock.  Return values are:
+ * page, used in the pageout path.  Caller must hold the page lock
+ * and its pte chain lock.  Return values are:
  *
  * SWAP_SUCCESS	- we succeeded in removing all mappings
  * SWAP_AGAIN	- we missed a trylock, try again later



* [PATCH] swap 09/13 tmpfs truncation
  2003-03-25 22:10 [PATCH] swap 01/13 no SWAP_ERROR Hugh Dickins
                   ` (6 preceding siblings ...)
  2003-03-25 22:18 ` [PATCH] swap 08/13 rmap comments Hugh Dickins
@ 2003-03-25 22:19 ` Hugh Dickins
  2003-03-25 22:20 ` [PATCH] swap 10/13 tmpfs atomics Hugh Dickins
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 22+ messages in thread
From: Hugh Dickins @ 2003-03-25 22:19 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

Recent testing has shown that swapoff can sneak a page back into the
tmpfs page cache after truncate_inode_pages has cleaned it, before
shmem_truncate resets next_index to stop that: hence the
BUG_ON(inode->i_blocks) in shmem_delete_inode.  So call
truncate_inode_pages again to be safe.

--- swap08/mm/shmem.c	Sun Mar 23 10:30:15 2003
+++ swap09/mm/shmem.c	Tue Mar 25 20:44:24 2003
@@ -486,6 +486,16 @@
 	}
 done2:
 	BUG_ON(info->swapped > info->next_index);
+	if (inode->i_mapping->nrpages) {
+		/*
+		 * Call truncate_inode_pages again: racing shmem_unuse_inode
+		 * may have swizzled a page in from swap since vmtruncate or
+		 * generic_delete_inode did it, before we lowered next_index.
+		 */
+		spin_unlock(&info->lock);
+		truncate_inode_pages(inode->i_mapping, inode->i_size);
+		spin_lock(&info->lock);
+	}
 	shmem_recalc_inode(inode);
 	spin_unlock(&info->lock);
 }



* [PATCH] swap 10/13 tmpfs atomics
  2003-03-25 22:10 [PATCH] swap 01/13 no SWAP_ERROR Hugh Dickins
                   ` (7 preceding siblings ...)
  2003-03-25 22:19 ` [PATCH] swap 09/13 tmpfs truncation Hugh Dickins
@ 2003-03-25 22:20 ` Hugh Dickins
  2003-03-26  0:50   ` Andrew Morton
  2003-03-25 22:21 ` [PATCH] swap 11/13 fix unuse_pmd fixme Hugh Dickins
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 22+ messages in thread
From: Hugh Dickins @ 2003-03-25 22:20 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

move_from_swap_cache and add_to_page_cache_lru are using GFP_ATOMIC,
which can easily fail in an intermittent way.  Rude if shmem_getpage
then fails with -ENOMEM: instead cond_resched to let kswapd in, and
repeat.

--- swap09/mm/shmem.c	Tue Mar 25 20:44:24 2003
+++ swap10/mm/shmem.c	Tue Mar 25 20:44:35 2003
@@ -838,8 +838,7 @@
 			SetPageUptodate(filepage);
 			set_page_dirty(filepage);
 			swap_free(swap);
-		} else if (!(error = move_from_swap_cache(
-				swappage, idx, mapping))) {
+		} else if (move_from_swap_cache(swappage, idx, mapping) == 0) {
 			shmem_swp_set(info, entry, 0);
 			shmem_swp_unmap(entry);
 			spin_unlock(&info->lock);
@@ -850,8 +849,8 @@
 			spin_unlock(&info->lock);
 			unlock_page(swappage);
 			page_cache_release(swappage);
-			if (error != -EEXIST)
-				goto failed;
+			/* let kswapd refresh zone for GFP_ATOMICs */
+			cond_resched();
 			goto repeat;
 		}
 	} else if (sgp == SGP_READ && !filepage) {
@@ -897,15 +896,16 @@
 				swap = *entry;
 				shmem_swp_unmap(entry);
 			}
-			if (error || swap.val ||
-			    (error = add_to_page_cache_lru(
-					filepage, mapping, idx, GFP_ATOMIC))) {
+			if (error || swap.val || 0 != add_to_page_cache_lru(
+					filepage, mapping, idx, GFP_ATOMIC)) {
 				spin_unlock(&info->lock);
 				page_cache_release(filepage);
 				shmem_free_block(inode);
 				filepage = NULL;
-				if (error != -EEXIST)
+				if (error)
 					goto failed;
+				/* let kswapd refresh zone for GFP_ATOMICs */
+				cond_resched();
 				goto repeat;
 			}
 		}



* [PATCH] swap 11/13 fix unuse_pmd fixme
  2003-03-25 22:10 [PATCH] swap 01/13 no SWAP_ERROR Hugh Dickins
                   ` (8 preceding siblings ...)
  2003-03-25 22:20 ` [PATCH] swap 10/13 tmpfs atomics Hugh Dickins
@ 2003-03-25 22:21 ` Hugh Dickins
  2003-03-25 22:38   ` William Lee Irwin III
  2003-03-25 22:22 ` [PATCH] swap 12/13 vm_enough_memory double counts Hugh Dickins
  2003-03-25 22:23 ` [PATCH] swap 13/13 may_enter_fs? Hugh Dickins
  11 siblings, 1 reply; 22+ messages in thread
From: Hugh Dickins @ 2003-03-25 22:21 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

try_to_unuse drop mmlist_lock across unuse_process (with pretty dance
of atomic_incs and mmputs of various mmlist markers, and a polite new
cond_resched there), so unuse_process can pte_chain_alloc(GFP_KERNEL)
and pass that down and down and down and down to unuse_pte: which
cannot succeed more than once on a given mm (make that explicit by
returning back up once succeeded).  Preliminary checks moved up from
unuse_pte to unuse_pmd, and done more efficiently (avoid that extra
pte_file test added recently), swapoff spends far too long in here.
Updated locking comments and references to try_to_swap_out.

--- swap10/mm/swapfile.c	Sun Mar 23 10:30:15 2003
+++ swap11/mm/swapfile.c	Tue Mar 25 20:44:46 2003
@@ -377,42 +377,34 @@
  * share this swap entry, so be cautious and let do_wp_page work out
  * what to do if a write is requested later.
  */
-/* mmlist_lock and vma->vm_mm->page_table_lock are held */
+/* vma->vm_mm->page_table_lock is held */
 static void
 unuse_pte(struct vm_area_struct *vma, unsigned long address, pte_t *dir,
 	swp_entry_t entry, struct page *page, struct pte_chain **pte_chainp)
 {
-	pte_t pte = *dir;
-
-	if (pte_file(pte))
-		return;
-	if (likely(pte_to_swp_entry(pte).val != entry.val))
-		return;
-	if (unlikely(pte_none(pte) || pte_present(pte)))
-		return;
+	vma->vm_mm->rss++;
 	get_page(page);
 	set_pte(dir, pte_mkold(mk_pte(page, vma->vm_page_prot)));
 	SetPageAnon(page);
 	*pte_chainp = page_add_rmap(page, dir, *pte_chainp);
 	swap_free(entry);
-	++vma->vm_mm->rss;
 }
 
-/* mmlist_lock and vma->vm_mm->page_table_lock are held */
-static void unuse_pmd(struct vm_area_struct * vma, pmd_t *dir,
+/* vma->vm_mm->page_table_lock is held */
+static int unuse_pmd(struct vm_area_struct * vma, pmd_t *dir,
 	unsigned long address, unsigned long size, unsigned long offset,
-	swp_entry_t entry, struct page* page)
+	swp_entry_t entry, struct page *page, struct pte_chain **pte_chainp)
 {
 	pte_t * pte;
 	unsigned long end;
-	struct pte_chain *pte_chain = NULL;
+	pte_t swp_pte = swp_entry_to_pte(entry);
 
 	if (pmd_none(*dir))
-		return;
+		return 0;
 	if (pmd_bad(*dir)) {
 		pmd_ERROR(*dir);
 		pmd_clear(dir);
-		return;
+		return 0;
 	}
 	pte = pte_offset_map(dir, address);
 	offset += address & PMD_MASK;
@@ -422,33 +414,36 @@
 		end = PMD_SIZE;
 	do {
 		/*
-		 * FIXME: handle pte_chain_alloc() failures
+		 * swapoff spends a _lot_ of time in this loop!
+		 * Test inline before going to call unuse_pte.
 		 */
-		if (pte_chain == NULL)
-			pte_chain = pte_chain_alloc(GFP_ATOMIC);
-		unuse_pte(vma, offset+address-vma->vm_start,
-				pte, entry, page, &pte_chain);
+		if (unlikely(pte_same(*pte, swp_pte))) {
+			unuse_pte(vma, offset + address, pte,
+					entry, page, pte_chainp);
+			pte_unmap(pte);
+			return 1;
+		}
 		address += PAGE_SIZE;
 		pte++;
 	} while (address && (address < end));
 	pte_unmap(pte - 1);
-	pte_chain_free(pte_chain);
+	return 0;
 }
 
-/* mmlist_lock and vma->vm_mm->page_table_lock are held */
-static void unuse_pgd(struct vm_area_struct * vma, pgd_t *dir,
+/* vma->vm_mm->page_table_lock is held */
+static int unuse_pgd(struct vm_area_struct * vma, pgd_t *dir,
 	unsigned long address, unsigned long size,
-	swp_entry_t entry, struct page* page)
+	swp_entry_t entry, struct page *page, struct pte_chain **pte_chainp)
 {
 	pmd_t * pmd;
 	unsigned long offset, end;
 
 	if (pgd_none(*dir))
-		return;
+		return 0;
 	if (pgd_bad(*dir)) {
 		pgd_ERROR(*dir);
 		pgd_clear(dir);
-		return;
+		return 0;
 	}
 	pmd = pmd_offset(dir, address);
 	offset = address & PGDIR_MASK;
@@ -459,32 +454,42 @@
 	if (address >= end)
 		BUG();
 	do {
-		unuse_pmd(vma, pmd, address, end - address, offset, entry,
-			  page);
+		if (unuse_pmd(vma, pmd, address, end - address,
+				offset, entry, page, pte_chainp))
+			return 1;
 		address = (address + PMD_SIZE) & PMD_MASK;
 		pmd++;
 	} while (address && (address < end));
+	return 0;
 }
 
-/* mmlist_lock and vma->vm_mm->page_table_lock are held */
-static void unuse_vma(struct vm_area_struct * vma, pgd_t *pgdir,
-			swp_entry_t entry, struct page* page)
+/* vma->vm_mm->page_table_lock is held */
+static int unuse_vma(struct vm_area_struct * vma, pgd_t *pgdir,
+	swp_entry_t entry, struct page *page, struct pte_chain **pte_chainp)
 {
 	unsigned long start = vma->vm_start, end = vma->vm_end;
 
 	if (start >= end)
 		BUG();
 	do {
-		unuse_pgd(vma, pgdir, start, end - start, entry, page);
+		if (unuse_pgd(vma, pgdir, start, end - start,
+				entry, page, pte_chainp))
+			return 1;
 		start = (start + PGDIR_SIZE) & PGDIR_MASK;
 		pgdir++;
 	} while (start && (start < end));
+	return 0;
 }
 
-static void unuse_process(struct mm_struct * mm,
+static int unuse_process(struct mm_struct * mm,
 			swp_entry_t entry, struct page* page)
 {
 	struct vm_area_struct* vma;
+	struct pte_chain *pte_chain;
+
+	pte_chain = pte_chain_alloc(GFP_KERNEL);
+	if (!pte_chain)
+		return -ENOMEM;
 
 	/*
 	 * Go through process' page directory.
@@ -492,10 +497,12 @@
 	spin_lock(&mm->page_table_lock);
 	for (vma = mm->mmap; vma; vma = vma->vm_next) {
 		pgd_t * pgd = pgd_offset(mm, vma->vm_start);
-		unuse_vma(vma, pgd, entry, page);
+		if (unuse_vma(vma, pgd, entry, page, &pte_chain))
+			break;
 	}
 	spin_unlock(&mm->page_table_lock);
-	return;
+	pte_chain_free(pte_chain);
+	return 0;
 }
 
 /*
@@ -639,36 +646,54 @@
 			if (start_mm == &init_mm)
 				shmem = shmem_unuse(entry, page);
 			else
-				unuse_process(start_mm, entry, page);
+				retval = unuse_process(start_mm, entry, page);
 		}
 		if (*swap_map > 1) {
 			int set_start_mm = (*swap_map >= swcount);
 			struct list_head *p = &start_mm->mmlist;
 			struct mm_struct *new_start_mm = start_mm;
+			struct mm_struct *prev_mm = start_mm;
 			struct mm_struct *mm;
 
+			atomic_inc(&new_start_mm->mm_users);
+			atomic_inc(&prev_mm->mm_users);
 			spin_lock(&mmlist_lock);
-			while (*swap_map > 1 &&
+			while (*swap_map > 1 && !retval &&
 					(p = p->next) != &start_mm->mmlist) {
 				mm = list_entry(p, struct mm_struct, mmlist);
+				atomic_inc(&mm->mm_users);
+				spin_unlock(&mmlist_lock);
+				mmput(prev_mm);
+				prev_mm = mm;
+
+				cond_resched();
+
 				swcount = *swap_map;
-				if (mm == &init_mm) {
+				if (swcount <= 1)
+					;
+				else if (mm == &init_mm) {
 					set_start_mm = 1;
-					spin_unlock(&mmlist_lock);
 					shmem = shmem_unuse(entry, page);
-					spin_lock(&mmlist_lock);
 				} else
-					unuse_process(mm, entry, page);
+					retval = unuse_process(mm, entry, page);
 				if (set_start_mm && *swap_map < swcount) {
+					mmput(new_start_mm);
+					atomic_inc(&mm->mm_users);
 					new_start_mm = mm;
 					set_start_mm = 0;
 				}
+				spin_lock(&mmlist_lock);
 			}
-			atomic_inc(&new_start_mm->mm_users);
 			spin_unlock(&mmlist_lock);
+			mmput(prev_mm);
 			mmput(start_mm);
 			start_mm = new_start_mm;
 		}
+		if (retval) {
+			unlock_page(page);
+			page_cache_release(page);
+			break;
+		}
 
 		/*
 		 * How could swap count reach 0x7fff when the maximum
@@ -692,7 +717,7 @@
 
 		/*
 		 * If a reference remains (rare), we would like to leave
-		 * the page in the swap cache; but try_to_swap_out could
+		 * the page in the swap cache; but try_to_unmap could
 		 * then re-duplicate the entry once we drop page lock,
 		 * so we might loop indefinitely; also, that page could
 		 * not be swapped out to other storage meanwhile.  So:
@@ -728,7 +753,7 @@
 		/*
 		 * So we could skip searching mms once swap count went
 		 * to 1, we did not mark any present ptes as dirty: must
-		 * mark page dirty so try_to_swap_out will preserve it.
+		 * mark page dirty so shrink_list will preserve it.
 		 */
 		SetPageDirty(page);
 		unlock_page(page);



* [PATCH] swap 12/13 vm_enough_memory double counts
  2003-03-25 22:10 [PATCH] swap 01/13 no SWAP_ERROR Hugh Dickins
                   ` (9 preceding siblings ...)
  2003-03-25 22:21 ` [PATCH] swap 11/13 fix unuse_pmd fixme Hugh Dickins
@ 2003-03-25 22:22 ` Hugh Dickins
  2003-03-25 22:23 ` [PATCH] swap 13/13 may_enter_fs? Hugh Dickins
  11 siblings, 0 replies; 22+ messages in thread
From: Hugh Dickins @ 2003-03-25 22:22 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

Stop vm_enough_memory double counting total_swapcache_pages: it dates
from the days when we didn't free swap when freeing swapcache page.

--- swap11/mm/mmap.c	Sun Mar 23 10:30:15 2003
+++ swap12/mm/mmap.c	Tue Mar 25 20:44:57 2003
@@ -82,14 +82,6 @@
 		free += nr_swap_pages;
 
 		/*
-		 * This double-counts: the nrpages are both in the
-		 * page-cache and in the swapper space. At the same time,
-		 * this compensates for the swap-space over-allocation
-		 * (ie "nr_swap_pages" being too small).
-		 */
-		free += total_swapcache_pages;
-
-		/*
 		 * The code below doesn't account for free space in the
 		 * inode and dentry slab cache, slab cache fragmentation,
 		 * inodes and dentries which will become freeable under



* [PATCH] swap 13/13 may_enter_fs?
  2003-03-25 22:10 [PATCH] swap 01/13 no SWAP_ERROR Hugh Dickins
                   ` (10 preceding siblings ...)
  2003-03-25 22:22 ` [PATCH] swap 12/13 vm_enough_memory double counts Hugh Dickins
@ 2003-03-25 22:23 ` Hugh Dickins
  2003-03-26  1:12   ` Andrew Morton
  11 siblings, 1 reply; 22+ messages in thread
From: Hugh Dickins @ 2003-03-25 22:23 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

shrink_list's may_enter_fs (may_write_page would be a better name)
currently reflects that swapcache page I/O won't suffer from FS
complications, so can be done if __GFP_IO without __GFP_FS; but
the same is true of a tmpfs page (which is just this stage away
from being a swapcache page), so check bdi->memory_backed instead.

--- swap12/mm/vmscan.c	Tue Mar 25 20:43:07 2003
+++ swap13/mm/vmscan.c	Tue Mar 25 20:45:08 2003
@@ -235,7 +235,6 @@
 	pagevec_init(&freed_pvec, 1);
 	while (!list_empty(page_list)) {
 		struct page *page;
-		int may_enter_fs;
 
 		page = list_entry(page_list->prev, struct page, lru);
 		list_del(&page->lru);
@@ -248,8 +247,6 @@
 			(*nr_mapped)++;
 
 		BUG_ON(PageActive(page));
-		may_enter_fs = (gfp_mask & __GFP_FS) ||
-				(PageSwapCache(page) && (gfp_mask & __GFP_IO));
 
 		if (PageWriteback(page))
 			goto keep_locked;
@@ -315,15 +312,19 @@
 		 * See swapfile.c:page_queue_congested().
 		 */
 		if (PageDirty(page)) {
+			struct backing_dev_info *bdi;
+
 			if (!is_page_cache_freeable(page))
 				goto keep_locked;
 			if (!mapping)
 				goto keep_locked;
 			if (mapping->a_ops->writepage == NULL)
 				goto activate_locked;
-			if (!may_enter_fs)
+			bdi = mapping->backing_dev_info;
+			if (!(gfp_mask & (bdi->memory_backed?
+					__GFP_IO: __GFP_FS)))
 				goto keep_locked;
-			if (!may_write_to_queue(mapping->backing_dev_info))
+			if (!may_write_to_queue(bdi))
 				goto keep_locked;
 			write_lock(&mapping->page_lock);
 			if (test_clear_page_dirty(page)) {



* Re: [PATCH] swap 11/13 fix unuse_pmd fixme
  2003-03-25 22:21 ` [PATCH] swap 11/13 fix unuse_pmd fixme Hugh Dickins
@ 2003-03-25 22:38   ` William Lee Irwin III
  0 siblings, 0 replies; 22+ messages in thread
From: William Lee Irwin III @ 2003-03-25 22:38 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Andrew Morton, linux-kernel

On Tue, Mar 25, 2003 at 10:21:51PM +0000, Hugh Dickins wrote:
> try_to_unuse drop mmlist_lock across unuse_process (with pretty dance
> of atomic_incs and mmputs of various mmlist markers, and a polite new
> cond_resched there), so unuse_process can pte_chain_alloc(GFP_KERNEL)
> and pass that down and down and down and down to unuse_pte: which
> cannot succeed more than once on a given mm (make that explicit by
> returning back up once succeeded).  Preliminary checks moved up from
> unuse_pte to unuse_pmd, and done more efficiently (avoid that extra
> pte_file test added recently), swapoff spends far too long in here.
> Updated locking comments and references to try_to_swap_out.

Aha, the unlock_page() and the contorted loop breakout were wrong
in my patch. Well, that and the cleanup of fossilized comments dating
back to the Silurian is nice.


-- wli


* Re: [PATCH] swap 10/13 tmpfs atomics
  2003-03-25 22:20 ` [PATCH] swap 10/13 tmpfs atomics Hugh Dickins
@ 2003-03-26  0:50   ` Andrew Morton
  0 siblings, 0 replies; 22+ messages in thread
From: Andrew Morton @ 2003-03-26  0:50 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: linux-kernel

Hugh Dickins <hugh@veritas.com> wrote:
>
> move_from_swap_cache and add_to_page_cache_lru are using GFP_ATOMIC,
> which can easily fail in an intermittent way.  Rude if shmem_getpage
> then fails with -ENOMEM: cond_resched to let kswapd in, and repeat.

I think the preferred way of waiting for the IO system and kswapd to catch up
is blk_congestion_wait().  Because it waits for the "right" amount of time.

I'll make that change.

And yes, the name is silly: "what if it's not block-backed"?  It hasn't
caused any problems yet, but maybe one day we'll need to find a way for
network-backed filesystems to deliver a wakeup to sleepers there.



* Re: [PATCH] swap 13/13 may_enter_fs?
  2003-03-25 22:23 ` [PATCH] swap 13/13 may_enter_fs? Hugh Dickins
@ 2003-03-26  1:12   ` Andrew Morton
  2003-03-26 17:22     ` Hugh Dickins
  2003-03-26 18:06     ` Bill Davidsen
  0 siblings, 2 replies; 22+ messages in thread
From: Andrew Morton @ 2003-03-26  1:12 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: linux-kernel

Hugh Dickins <hugh@veritas.com> wrote:
>
> shrink_list's may_enter_fs (may_write_page would be a better name)
> currently reflects that swapcache page I/O won't suffer from FS
> complications, so can be done if __GFP_IO without __GFP_FS; but
> the same is true of a tmpfs page (which is just this stage away
> from being a swapcache page), so check bdi->memory_backed instead.
> 
> ...
>
> +			if (!(gfp_mask & (bdi->memory_backed?
> +					__GFP_IO: __GFP_FS)))
>  				goto keep_locked;

Barf.  I haven't used a question mark operator in ten years, and this is a
fine demonstration of why ;)

I think a feasibly comprehensible transformation would be:

	/*
	 * A comment goes here
	 */
	if (bdi->memory_backed)
		may_enter_fs = gfp_mask & __GFP_IO;
	else
		may_enter_fs = gfp_mask & __GFP_FS;


That being said, this is a bit presumptuous.  PF_MEMALLOC will protect us
from infinite recursion but there are other reasons for GFP_NOFS.

For example, a memory-backed filesystem may be trying to allocate GFP_NOFS
memory while holding filesystem locks which are taken by its writepage.

How about adding a new field to backing_dev_info for this case?  Damned if I
can think of a name for it though.





* Re: [PATCH] swap 13/13 may_enter_fs?
  2003-03-26  1:12   ` Andrew Morton
@ 2003-03-26 17:22     ` Hugh Dickins
  2003-03-27  0:31       ` Andrew Morton
  2003-03-26 18:06     ` Bill Davidsen
  1 sibling, 1 reply; 22+ messages in thread
From: Hugh Dickins @ 2003-03-26 17:22 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

On Tue, 25 Mar 2003, Andrew Morton wrote:
> 
> For example, a memory-backed filesystem may be trying to allocate GFP_NOFS
> memory while holding filesystem locks which are taken by its writepage.
> 
> How about adding a new field to backing_dev_info for this case?  Damned if I
> can think of a name for it though.

I think you're overcomplicating it.  Of the existing memory_backed bdis
(ramdisk, hugetlbfs, ramfs, sysfs, tmpfs, swap) only two have non-NULL
writepage, and both of those two go (indirectly and directly) to swap
(and neither holds FS lock while waiting to allocate memory).

If we were looking for a correct solution, I don't think backing_dev_info
would be the right place: we're talking about GFP_ needed for writepage,
which should be specified in the struct address_space filled in by the
FS: I think it's more a limitation of the FS than its backing device.

But I don't really want to add another obscure gfp_mask initialization
all over the place.  So how about this?  As I say, "swap_backed" is
(for the foreseeable future) unnecessary, but does help to correct
mistaken impressions arising from the term "memory_backed".

I've conceded to your delicate sensibility by using if-else rather
than ?: (I used it rather too compactly there, but see nothing wrong
with the construct in general); but you'll be disgusted at my defiance
in not adding a comment in the place you indicated - what would I say?

Hugh

--- 2.5.66-mm1/mm/shmem.c	Wed Mar 26 11:51:01 2003
+++ swap13/mm/shmem.c	Wed Mar 26 16:35:08 2003
@@ -123,7 +123,8 @@
 
 static struct backing_dev_info shmem_backing_dev_info = {
 	.ra_pages	= 0,	/* No readahead */
-	.memory_backed	= 1,	/* Does not contribute to dirty memory */
+	.memory_backed	= 1,	/* Do not count its dirty pages in nr_dirty */
+	.swap_backed	= 1,	/* Its memory_backed writepage goes to swap */
 };
 
 LIST_HEAD (shmem_inodes);
--- 2.5.66-mm1/mm/swap_state.c	Wed Mar 26 11:51:01 2003
+++ swap13/mm/swap_state.c	Wed Mar 26 16:35:08 2003
@@ -27,7 +27,8 @@
 
 static struct backing_dev_info swap_backing_dev_info = {
 	.ra_pages	= 0,	/* No readahead */
-	.memory_backed	= 1,	/* Does not contribute to dirty memory */
+	.memory_backed	= 1,	/* Do not count its dirty pages in nr_dirty */
+	.swap_backed	= 1,	/* Its memory_backed writepage goes to swap */
 };
 
 extern struct address_space_operations swap_aops;
--- 2.5.66-mm1/mm/vmscan.c	Wed Mar 26 11:51:01 2003
+++ swap13/mm/vmscan.c	Wed Mar 26 16:35:08 2003
@@ -235,7 +235,6 @@
 	pagevec_init(&freed_pvec, 1);
 	while (!list_empty(page_list)) {
 		struct page *page;
-		int may_enter_fs;
 
 		page = list_entry(page_list->prev, struct page, lru);
 		list_del(&page->lru);
@@ -248,8 +247,6 @@
 			(*nr_mapped)++;
 
 		BUG_ON(PageActive(page));
-		may_enter_fs = (gfp_mask & __GFP_FS) ||
-				(PageSwapCache(page) && (gfp_mask & __GFP_IO));
 
 		if (PageWriteback(page))
 			goto keep_locked;
@@ -315,15 +312,23 @@
 		 * See swapfile.c:page_queue_congested().
 		 */
 		if (PageDirty(page)) {
+			struct backing_dev_info *bdi;
+			unsigned int gfp_needed_for_writepage;
+
 			if (!is_page_cache_freeable(page))
 				goto keep_locked;
 			if (!mapping)
 				goto keep_locked;
 			if (mapping->a_ops->writepage == NULL)
 				goto activate_locked;
-			if (!may_enter_fs)
+			bdi = mapping->backing_dev_info;
+			if (bdi->swap_backed)
+				gfp_needed_for_writepage = __GFP_IO;
+			else
+				gfp_needed_for_writepage = __GFP_FS;
+			if (!(gfp_mask & gfp_needed_for_writepage))
 				goto keep_locked;
-			if (!may_write_to_queue(mapping->backing_dev_info))
+			if (!may_write_to_queue(bdi))
 				goto keep_locked;
 			write_lock(&mapping->page_lock);
 			if (test_clear_page_dirty(page)) {
--- 2.5.66-mm1/include/linux/backing-dev.h	Wed Mar 26 11:50:36 2003
+++ swap13/include/linux/backing-dev.h	Wed Mar 26 16:35:08 2003
@@ -23,7 +23,9 @@
 struct backing_dev_info {
 	unsigned long ra_pages;	/* max readahead in PAGE_CACHE_SIZE units */
 	unsigned long state;	/* Always use atomic bitops on this */
-	int memory_backed;	/* Cannot clean pages with writepage */
+	unsigned int
+		memory_backed:1,/* Do not count its dirty pages in nr_dirty */
+		swap_backed:1;	/* Its memory_backed writepage goes to swap */
 };
 
 extern struct backing_dev_info default_backing_dev_info;



* Re: [PATCH] swap 13/13 may_enter_fs?
  2003-03-26  1:12   ` Andrew Morton
  2003-03-26 17:22     ` Hugh Dickins
@ 2003-03-26 18:06     ` Bill Davidsen
  2003-03-26 18:42       ` Hugh Dickins
  1 sibling, 1 reply; 22+ messages in thread
From: Bill Davidsen @ 2003-03-26 18:06 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Hugh Dickins, linux-kernel

On Tue, 25 Mar 2003, Andrew Morton wrote:

> Hugh Dickins <hugh@veritas.com> wrote:
> >
> > shrink_list's may_enter_fs (may_write_page would be a better name)
> > currently reflects that swapcache page I/O won't suffer from FS
> > complications, so can be done if __GFP_IO without __GFP_FS; but
> > the same is true of a tmpfs page (which is just this stage away
> > from being a swapcache page), so check bdi->memory_backed instead.
> > 
> > ...
> >
> > +			if (!(gfp_mask & (bdi->memory_backed?
> > +					__GFP_IO: __GFP_FS)))
> >  				goto keep_locked;
> 
> Barf.  I haven't used a question mark operator in ten years, and this is a
> fine demonstration of why ;)
> 
> I think a feasibly comprehensible transformation would be:
> 
> 	/*
> 	 * A comment goes here
> 	 */
> 	if (bdi->memory_backed)
> 		may_enter_fs = gfp_mask & __GFP_IO;
> 	else
> 		may_enter_fs = gfp_mask & __GFP_FS;

Unless there's a subtle difference in functionality here that I'm missing,
you are doing the same thing in a larger and slower way, and the logic is
less clear.

Is there some benefit I'm missing?

-- 
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.



* Re: [PATCH] swap 13/13 may_enter_fs?
  2003-03-26 18:06     ` Bill Davidsen
@ 2003-03-26 18:42       ` Hugh Dickins
  2003-03-27 14:55         ` Bill Davidsen
  0 siblings, 1 reply; 22+ messages in thread
From: Hugh Dickins @ 2003-03-26 18:42 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: Andrew Morton, linux-kernel

On Wed, 26 Mar 2003, Bill Davidsen wrote:
> 
> Unless there's a subtle difference in functionality here that I'm missing,
> you are doing the same thing in a larger and slower way, and the logic is
> less clear.
> 
> Is there some benefit I'm missing?

No, it's just that Andrew finds the logic clearer when written out his way.

Hugh



* Re: [PATCH] swap 13/13 may_enter_fs?
  2003-03-26 17:22     ` Hugh Dickins
@ 2003-03-27  0:31       ` Andrew Morton
  2003-03-27 15:21         ` Hugh Dickins
  0 siblings, 1 reply; 22+ messages in thread
From: Andrew Morton @ 2003-03-27  0:31 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: linux-kernel

Hugh Dickins <hugh@veritas.com> wrote:
>
> On Tue, 25 Mar 2003, Andrew Morton wrote:
> > 
> > For example, a memory-backed filesystem may be trying to allocate GFP_NOFS
> > memory while holding filesystem locks which are taken by its writepage.
> > 
> > How about adding a new field to backing_dev_info for this case?  Damned if I
> > can think of a name for it though.
> 
> I think you're overcomplicating it.  Of the existing memory_backed bdis
> (ramdisk, hugetlbfs, ramfs, sysfs, tmpfs, swap) only two have non-NULL
> writepage, and both of those two go (indirectly and directly) to swap
> (and neither holds FS lock while waiting to allocate memory).

But this is a much nicer patch.  Thanks for doing all this btw.  I was
barfing at ?:, not your code ;)

> If we were looking for a correct solution, I don't think backing_dev_info
> would be the right place: we're talking about GFP_ needed for writepage,
> which should be specified in the struct address_space filled in by the
> FS: I think it's more a limitation of the FS than its backing device.

Good point.

> +			bdi = mapping->backing_dev_info;
> +			if (bdi->swap_backed)
> +				gfp_needed_for_writepage = __GFP_IO;
> +			else
> +				gfp_needed_for_writepage = __GFP_FS;
> +			if (!(gfp_mask & gfp_needed_for_writepage))

This is inaccurate?  shmem_writepage() performs no IO and could/should be
called even for GFP_NOIO allocations.

It's probably not very important but if we're going to make a change it may
as well be the right one.

Could you live with

	if (bdi->has_special_writepage)
		gfp_needed_for_writepage = bfi->gfp_needed_for_writepage;

?  So swap_backing_dev_info uses __GFP_IO and shmem_backing_dev_info() (which
> is completely atomic) uses zero?

Yeah, it's a bit awkward.  I'm OK with the special-casing.  Both swap and
tmpfs _are_ special, and unique.  Recognising that fact in vmscan.c is
reasonable.  ->gfp_needed_for_writepage should probably be in the superblock,
but that's just too far away.

> -	int memory_backed;	/* Cannot clean pages with writepage */
> +	unsigned int
> +		memory_backed:1,/* Do not count its dirty pages in nr_dirty */
> +		swap_backed:1;	/* Its memory_backed writepage goes to swap */
>  };

Hard call.  It is a tradeoff between icache misses and dcache misses. 
Obviously that is trivia in this case.



* Re: [PATCH] swap 13/13 may_enter_fs?
  2003-03-26 18:42       ` Hugh Dickins
@ 2003-03-27 14:55         ` Bill Davidsen
  0 siblings, 0 replies; 22+ messages in thread
From: Bill Davidsen @ 2003-03-27 14:55 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Andrew Morton, linux-kernel

On Wed, 26 Mar 2003, Hugh Dickins wrote:

> On Wed, 26 Mar 2003, Bill Davidsen wrote:
> > 
> > Unless there's a subtle difference in functionality here that I'm missing,
> > you are doing the same thing in a larger and slower way, and the logic is
> > less clear.
> > 
> > Is there some benefit I'm missing?
> 
> No, it's just that Andrew finds the logic clearer when written out his way.

Looking at the code generated from these fragments, I don't think gcc shares
that opinion ;-) Actually I find it more obvious with the ? notation, and it
prevents someone in the future changing the logic in one line of code when
it needs to change in both.

Oh well, I expect the ? form to stay, since it uses less cache.

-- 
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.



* Re: [PATCH] swap 13/13 may_enter_fs?
  2003-03-27  0:31       ` Andrew Morton
@ 2003-03-27 15:21         ` Hugh Dickins
  0 siblings, 0 replies; 22+ messages in thread
From: Hugh Dickins @ 2003-03-27 15:21 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

On Wed, 26 Mar 2003, Andrew Morton wrote:
> Hugh Dickins <hugh@veritas.com> wrote:
> 
> But this is a much nicer patch.  Thanks for doing all this btw.  I was
> barfing at ?:, not your code ;)

Phew!  I'll make more use of the ;) operator in future then.

> > If we were looking for a correct solution, I don't think backing_dev_info
> > would be the right place: we're talking about GFP_ needed for writepage,
> > which should be specified in the struct address_space filled in by the
> > FS: I think it's more a limitation of the FS than its backing device.
> 
> Good point.

Well, I'm beginning to wonder.

> > +			bdi = mapping->backing_dev_info;
> > +			if (bdi->swap_backed)
> > +				gfp_needed_for_writepage = __GFP_IO;
> > +			else
> > +				gfp_needed_for_writepage = __GFP_FS;
> > +			if (!(gfp_mask & gfp_needed_for_writepage))
> 
> This is inaccurate?  shmem_writepage() performs no IO and could/should be
> called even for GFP_NOIO allocations.

Yes, you're right.  I was going to quibble that we might sometime
change shmem_writepage to call swap_writepage directly, instead of
giving the page another spin round the lru.  But that's not how it
is at present, and I think your GFP_NOIO shmem_writepage is better.

> It's probably not very important but if we're going to make a change it may
> as well be the right one.
> 
> Could you live with
> 
> 	if (bdi->has_special_writepage)
> 	gfp_needed_for_writepage = bdi->gfp_needed_for_writepage;
> 
> ?  So swap_backing_dev_info uses __GFP_IO and shmem_backing_dev_info() (which
> is completely atomic) uses zero?
> 
> Yeah, it's a bit awkward.  I'm OK with the special-casing.  Both swap and
> tmpfs _are_ special, and unique.  Recognising that fact in vmscan.c is
> reasonable.  ->gfp_needed_for_writepage should probably be in the superblock,
> but that's just too far away.

As you say, it's not very important, and I could live with it.
But I'm feeling we've not yet arrived at the right (even contingently
right) answer.  Let's hold off from making any such change for now.

Does a block device writepage need __GFP_FS?  Oh, looks like it uses
bufferheads, that's a sure sign it does, isn't it?

Did mempools make __GFP_IO redundant?  Perhaps for some devices and
not for others.  (mempool_alloc doesn't try the waiting allocation
if __GFP_FS is not set.)

I've a feeling both filesystem and backing device might like to
contribute to the mask.  And it's always worth considering where
loop might fit in with this too (I do have a bdi->over_loop in
one of my loop deadlock patches).

> > -	int memory_backed;	/* Cannot clean pages with writepage */
> > +	unsigned int
> > +		memory_backed:1,/* Do not count its dirty pages in nr_dirty */
> > +		swap_backed:1;	/* Its memory_backed writepage goes to swap */
> >  };
> 
> Hard call.  It is a tradeoff between icache misses and dcache misses. 
> Obviously that is trivia in this case.

Now you've lost me so completely, that I can't even tell if your
icache and dcache here are instruction and data or inode and dentry!

Hugh



end of thread, other threads:[~2003-03-27 15:08 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2003-03-25 22:10 [PATCH] swap 01/13 no SWAP_ERROR Hugh Dickins
2003-03-25 22:12 ` [PATCH] swap 02/13 !CONFIG_SWAP try_to_unmap Hugh Dickins
2003-03-25 22:13 ` [PATCH] swap 03/13 add_to_swap_cache Hugh Dickins
2003-03-25 22:13 ` [PATCH] swap 04/13 page_convert_anon -ENOMEM Hugh Dickins
2003-03-25 22:14 ` [PATCH] swap 05/13 page_convert_anon unlocking Hugh Dickins
2003-03-25 22:15 ` [PATCH] swap 06/13 wrap below vm_start Hugh Dickins
2003-03-25 22:17 ` [PATCH] swap 07/13 objrmap page_table_lock Hugh Dickins
2003-03-25 22:18 ` [PATCH] swap 08/13 rmap comments Hugh Dickins
2003-03-25 22:19 ` [PATCH] swap 09/13 tmpfs truncation Hugh Dickins
2003-03-25 22:20 ` [PATCH] swap 10/13 tmpfs atomics Hugh Dickins
2003-03-26  0:50   ` Andrew Morton
2003-03-25 22:21 ` [PATCH] swap 11/13 fix unuse_pmd fixme Hugh Dickins
2003-03-25 22:38   ` William Lee Irwin III
2003-03-25 22:22 ` [PATCH] swap 12/13 vm_enough_memory double counts Hugh Dickins
2003-03-25 22:23 ` [PATCH] swap 13/13 may_enter_fs? Hugh Dickins
2003-03-26  1:12   ` Andrew Morton
2003-03-26 17:22     ` Hugh Dickins
2003-03-27  0:31       ` Andrew Morton
2003-03-27 15:21         ` Hugh Dickins
2003-03-26 18:06     ` Bill Davidsen
2003-03-26 18:42       ` Hugh Dickins
2003-03-27 14:55         ` Bill Davidsen
