* [rfc][patch] swap: virtual swap readahead
@ 2009-05-27 15:05 Johannes Weiner
  2009-05-27 17:32 ` Rik van Riel
                   ` (3 more replies)
  0 siblings, 4 replies; 7+ messages in thread
From: Johannes Weiner @ 2009-05-27 15:05 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, linux-kernel, Hugh Dickins, Rik van Riel

The current swap readahead implementation reads a physically
contiguous group of swap slots around the faulting page to take
advantage of the disk head's position and in the hope that the
surrounding pages will be needed soon as well.

This works as long as the physical swap slot order approximates the
LRU order decently, otherwise it wastes memory and IO bandwidth to
read in pages that are unlikely to be needed soon.

However, the physical swap slot layout diverges from the LRU order
with increasing swap activity, i.e. high memory pressure situations,
and this is exactly the situation where swapin should not waste any
memory or IO bandwidth as both are the most contended resources at
this point.

This patch makes swap-in base its readaround window on the virtual
proximity of pages in the faulting VMA, as an indicator for pages
needed in the near future, while still taking physical locality of
swap slots into account.

This has the advantage of reading in big batches when the LRU order
matches the swap slot order while automatically throttling readahead
when the system is thrashing and swap slots are no longer nicely
grouped by LRU order.
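
In condensed form, the readaround in the patch below works like this
(the page table walk and pte locking are folded into a made-up
swap_pte_lookup() helper here, purely for illustration):

	int cluster = 1 << page_cluster;
	int window = cluster << PAGE_SHIFT;
	/* physical clamp around the faulting slot */
	unsigned long pmin = swp_offset(entry) & ~(cluster - 1);
	unsigned long pmax = pmin + cluster;
	/* virtual window around the faulting address */
	unsigned long start = addr & ~(window - 1);
	unsigned long end = start + window;
	unsigned long pos;

	for (pos = start; pos < end; pos += PAGE_SIZE) {
		swp_entry_t swp;

		if (!swap_pte_lookup(vma, pos, &swp))	/* made-up helper */
			continue;
		if (swp_type(swp) != swp_type(entry))
			continue;
		if (swp_offset(swp) < pmin || swp_offset(swp) > pmax)
			continue;
		read_swap_cache_async(swp, gfp_mask, vma, pos);
	}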

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Rik van Riel <riel@redhat.com>
---
 mm/swap_state.c |   80 +++++++++++++++++++++++++++++++++++++++----------------
 1 files changed, 57 insertions(+), 23 deletions(-)

qsbench, 20 runs each, 1.7GB RAM, 2GB swap, 4 cores:

         "mean (standard deviation) median"

All values are given in seconds.  I used a t-test to make sure that
the compared runs differ with at least 95% statistical confidence,
given the number of samples, arithmetic means and standard
deviations.
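
For reference, the test statistic is the two-sample t (assuming the
unequal-variance Welch form here, which matches the differing
standard deviations below):

	t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}

with n1 = n2 = 20 runs per kernel; results are only called different
when |t| exceeds the 95% critical value.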

1 x 2048M
vanilla: 391.25 ( 71.76) 384.56
vswapra: 445.55 ( 83.19) 415.41

	This is an actual regression.  I am not yet quite sure why
	this happens, and I am undecided on whether one humongous
	active vma in the system is a common workload.

	It's also the only regression I found, with qsbench anyway.  I
	started out with powers of two and tweaked the parameters
	until the results between the two kernel versions differed.

2 x 1024M
vanilla: 384.25 ( 75.00) 423.08
vswapra: 290.26 ( 31.38) 299.51

4 x 540M
vanilla: 553.91 (100.02) 554.57
vswapra: 336.58 ( 52.49) 331.52

8 x 280M
vanilla: 561.08 ( 82.36) 583.12
vswapra: 319.13 ( 43.17) 307.69

16 x 128M
vanilla: 285.51 (113.20) 236.62
vswapra: 214.24 ( 62.37) 214.15

	All these show a nice improvement in performance and runtime
	stability.

The missing shmem support is a big TODO; I will try to find time to
tackle it if the overall idea is not rejected in the first place.

diff --git a/mm/swap_state.c b/mm/swap_state.c
index 3ecea98..8f8daaa 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -336,11 +336,6 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
  *
  * Returns the struct page for entry and addr, after queueing swapin.
  *
- * Primitive swap readahead code. We simply read an aligned block of
- * (1 << page_cluster) entries in the swap area. This method is chosen
- * because it doesn't cost us any seek time.  We also make sure to queue
- * the 'original' request together with the readahead ones...
- *
  * This has been extended to use the NUMA policies from the mm triggering
  * the readahead.
  *
@@ -349,27 +344,66 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 			struct vm_area_struct *vma, unsigned long addr)
 {
-	int nr_pages;
-	struct page *page;
-	unsigned long offset;
-	unsigned long end_offset;
-
-	/*
-	 * Get starting offset for readaround, and number of pages to read.
-	 * Adjust starting address by readbehind (for NUMA interleave case)?
-	 * No, it's very unlikely that swap layout would follow vma layout,
-	 * more likely that neighbouring swap pages came from the same node:
-	 * so use the same "addr" to choose the same node for each swap read.
-	 */
-	nr_pages = valid_swaphandles(entry, &offset);
-	for (end_offset = offset + nr_pages; offset < end_offset; offset++) {
-		/* Ok, do the async read-ahead now */
-		page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
-						gfp_mask, vma, addr);
+	int cluster = 1 << page_cluster;
+	int window = cluster << PAGE_SHIFT;
+	unsigned long start, pos, end;
+	unsigned long pmin, pmax;
+
+	/* XXX: fix this for shmem */
+	if (!vma || !vma->vm_mm)
+		goto nora;
+
+	/* Physical range to read from */
+	pmin = swp_offset(entry) & ~(cluster - 1);
+	pmax = pmin + cluster;
+
+	/* Virtual range to read from */
+	start = addr & ~(window - 1);
+	end = start + window;
+
+	for (pos = start; pos < end; pos += PAGE_SIZE) {
+		struct page *page;
+		swp_entry_t swp;
+		spinlock_t *ptl;
+		pgd_t *pgd;
+		pud_t *pud;
+		pmd_t *pmd;
+		pte_t *pte;
+
+		pgd = pgd_offset(vma->vm_mm, pos);
+		if (!pgd_present(*pgd))
+			continue;
+		pud = pud_offset(pgd, pos);
+		if (!pud_present(*pud))
+			continue;
+		pmd = pmd_offset(pud, pos);
+		if (!pmd_present(*pmd))
+			continue;
+		pte = pte_offset_map_lock(vma->vm_mm, pmd, pos, &ptl);
+		if (!is_swap_pte(*pte)) {
+			pte_unmap_unlock(pte, ptl);
+			continue;
+		}
+		swp = pte_to_swp_entry(*pte);
+		pte_unmap_unlock(pte, ptl);
+
+		if (swp_type(swp) != swp_type(entry))
+			continue;
+		/*
+		 * Dont move the disk head too far away.  This also
+		 * throttles readahead while thrashing, where virtual
+		 * order diverges more and more from physical order.
+		 */
+		if (swp_offset(swp) > pmax)
+			continue;
+		if (swp_offset(swp) < pmin)
+			continue;
+		page = read_swap_cache_async(swp, gfp_mask, vma, pos);
 		if (!page)
-			break;
+			continue;
 		page_cache_release(page);
 	}
 	lru_add_drain();	/* Push any new pages onto the LRU now */
+nora:
 	return read_swap_cache_async(entry, gfp_mask, vma, addr);
 }
-- 
1.6.3



* Re: [rfc][patch] swap: virtual swap readahead
  2009-05-27 15:05 [rfc][patch] swap: virtual swap readahead Johannes Weiner
@ 2009-05-27 17:32 ` Rik van Riel
  2009-05-27 21:48 ` Andrew Morton
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 7+ messages in thread
From: Rik van Riel @ 2009-05-27 17:32 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Andrew Morton, linux-mm, linux-kernel, Hugh Dickins

Johannes Weiner wrote:

> This patch makes swap-in base its readaround window on the virtual
> proximity of pages in the faulting VMA, as an indicator for pages
> needed in the near future, while still taking physical locality of
> swap slots into account.
> 
> This has the advantage of reading in big batches when the LRU order
> matches the swap slot order while automatically throttling readahead
> when the system is thrashing and swap slots are no longer nicely
> grouped by LRU order.

This is a nice simple implementation of proper
swapin readahead.  The performance results are
surprisingly good.

I suspect the performance oddity you see with
single-process qsbench might be due to qsbench
having a weird access pattern that just happens
to benefit from getting pages back into memory
in LRU order - not something that I expect to
be common, so not a concern.

> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
> Cc: Rik van Riel <riel@redhat.com>

Reviewed-by: Rik van Riel <riel@redhat.com>


* Re: [rfc][patch] swap: virtual swap readahead
  2009-05-27 15:05 [rfc][patch] swap: virtual swap readahead Johannes Weiner
  2009-05-27 17:32 ` Rik van Riel
@ 2009-05-27 21:48 ` Andrew Morton
  2009-05-28  0:14   ` Johannes Weiner
  2009-06-01  8:05 ` Andi Kleen
  2009-06-08  7:52 ` Wu Fengguang
  3 siblings, 1 reply; 7+ messages in thread
From: Andrew Morton @ 2009-05-27 21:48 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: linux-mm, linux-kernel, hugh.dickins, riel

On Wed, 27 May 2009 17:05:46 +0200
Johannes Weiner <hannes@cmpxchg.org> wrote:

> The current swap readahead implementation reads a physically
> contiguous group of swap slots around the faulting page to take
> advantage of the disk head's position and in the hope that the
> surrounding pages will be needed soon as well.
> 
> This works as long as the physical swap slot order approximates the
> LRU order decently, otherwise it wastes memory and IO bandwidth to
> read in pages that are unlikely to be needed soon.
> 
> However, the physical swap slot layout diverges from the LRU order
> with increasing swap activity, i.e. high memory pressure situations,
> and this is exactly the situation where swapin should not waste any
> memory or IO bandwidth as both are the most contended resources at
> this point.
> 
> This patch makes swap-in base its readaround window on the virtual
> proximity of pages in the faulting VMA, as an indicator for pages
> needed in the near future, while still taking physical locality of
> swap slots into account.
> 
> This has the advantage of reading in big batches when the LRU order
> matches the swap slot order while automatically throttling readahead
> when the system is thrashing and swap slots are no longer nicely
> grouped by LRU order.
> 

Well.  It would be better to _not_ shrink readaround, but to make it
read the right pages (see below).

Or perhaps the readaround size is just too large.  I did spend some
time playing with its size back in the dark ages and ended up deciding
that the current setting is OK, but that was across a range of
workloads.

Did you try simply decreasing the cluster size and seeing if that had a
similar effect upon this workload?



Back in 2002 or thereabouts I had a patch <rummage, rummage.  Appended>
which does this the other way.  It attempts to ensure that swap space
is allocated so that virtually contiguous pages get physically
contiguous blocks on disk.  So that when swapspace readaround does its
thing, the blocks which it reads are populating pages which are
virtually "close" to the page which got the major fault.

Unfortunately I wasn't able to demonstrate much performance benefit
from it and didn't get around to working out why.

iirc, the way it worked was: divide swap into 1MB hunks.  When we
decide to add an anon page to swapcache, grab a 1MB hunk of swap and
then add the pages which are virtual neighbours of the target page to
swapcache as well.

Obviously the algorithm could be tweaked/tuned/fixed, but the idea
seems sound - the cost of reading a contiguous hunk of blocks is not a
lot more than reading the single block.

Maybe it's something you might like to have a think about.
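
In condensed form, the allocation in the appended patch works out to
this (lifted from scan_swap_map() below, minus the first-fit scan):

	/* 1MB hunks: CHUNK_SHIFT = 20 - PAGE_SHIFT, CHUNK_MASK = -1UL << CHUNK_SHIFT */
	nchunks = si->max >> CHUNK_SHIFT;
	chunk = hash_long((unsigned long)cookie + (index & CHUNK_MASK),
			  BITS_PER_LONG) % nchunks;
	block = (chunk << CHUNK_SHIFT) + (index & ~CHUNK_MASK);
	/* then scan forward from 'block' for the first free swap_map slot */

so pages with the same cookie and the same upper index bits (i.e.
virtual neighbours) end up in the same hunk, at offsets matching
their in-vma order.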

> The missing shmem support is a big TODO, I will try to find time to
> tackle this when the overall idea is not refused in the first place.

heh, OK.

> - * Primitive swap readahead code. We simply read an aligned block of
> - * (1 << page_cluster) entries in the swap area. This method is chosen
> - * because it doesn't cost us any seek time.  We also make sure to queue
> - * the 'original' request together with the readahead ones...
> - *
> -	/*
> -	 * Get starting offset for readaround, and number of pages to read.
> -	 * Adjust starting address by readbehind (for NUMA interleave case)?
> -	 * No, it's very unlikely that swap layout would follow vma layout,
> -	 * more likely that neighbouring swap pages came from the same node:
> -	 * so use the same "addr" to choose the same node for each swap read.
> -	 */

The patch deletes the old design description but doesn't add a
description of the new design :(



 include/linux/swap.h  |    6 +--
 kernel/power/swsusp.c |    2 -
 mm/shmem.c            |    4 +-
 mm/swap_state.c       |    8 +++-
 mm/swapfile.c         |   98 ++++++++++++++++++++------------------------------
 mm/vmscan.c           |    5 ++
 6 files changed, 56 insertions(+), 67 deletions(-)

diff -puN include/linux/swap.h~swapspace-layout-improvements include/linux/swap.h
--- 25/include/linux/swap.h~swapspace-layout-improvements	2005-05-02 23:36:30.000000000 -0700
+++ 25-akpm/include/linux/swap.h	2005-05-02 23:36:30.000000000 -0700
@@ -193,7 +193,7 @@ extern int rw_swap_page_sync(int, swp_en
 extern struct address_space swapper_space;
 #define total_swapcache_pages  swapper_space.nrpages
 extern void show_swap_cache_info(void);
-extern int add_to_swap(struct page *);
+extern int add_to_swap(struct page *page, void *cookie, pgoff_t index);
 extern void __delete_from_swap_cache(struct page *);
 extern void delete_from_swap_cache(struct page *);
 extern int move_to_swap_cache(struct page *, swp_entry_t);
@@ -209,7 +209,7 @@ extern long total_swap_pages;
 extern unsigned int nr_swapfiles;
 extern struct swap_info_struct swap_info[];
 extern void si_swapinfo(struct sysinfo *);
-extern swp_entry_t get_swap_page(void);
+extern swp_entry_t get_swap_page(void *cookie, pgoff_t index);
 extern int swap_duplicate(swp_entry_t);
 extern int valid_swaphandles(swp_entry_t, unsigned long *);
 extern void swap_free(swp_entry_t);
@@ -276,7 +276,7 @@ static inline int remove_exclusive_swap_
 	return 0;
 }
 
-static inline swp_entry_t get_swap_page(void)
+static inline swp_entry_t get_swap_page(void *cookie, pgoff_t index)
 {
 	swp_entry_t entry;
 	entry.val = 0;
diff -puN kernel/power/swsusp.c~swapspace-layout-improvements kernel/power/swsusp.c
--- 25/kernel/power/swsusp.c~swapspace-layout-improvements	2005-05-02 23:36:30.000000000 -0700
+++ 25-akpm/kernel/power/swsusp.c	2005-05-02 23:36:30.000000000 -0700
@@ -240,7 +240,7 @@ static int write_page(unsigned long addr
 	swp_entry_t entry;
 	int error = 0;
 
-	entry = get_swap_page();
+	entry = get_swap_page(NULL, swp_offset(*loc));
 	if (swp_offset(entry) && 
 	    swapfile_used[swp_type(entry)] == SWAPFILE_SUSPEND) {
 		error = rw_swap_page_sync(WRITE, entry,
diff -puN mm/shmem.c~swapspace-layout-improvements mm/shmem.c
--- 25/mm/shmem.c~swapspace-layout-improvements	2005-05-02 23:36:30.000000000 -0700
+++ 25-akpm/mm/shmem.c	2005-05-02 23:36:30.000000000 -0700
@@ -812,7 +812,7 @@ static int shmem_writepage(struct page *
 	struct shmem_inode_info *info;
 	swp_entry_t *entry, swap;
 	struct address_space *mapping;
-	unsigned long index;
+	pgoff_t index;
 	struct inode *inode;
 
 	BUG_ON(!PageLocked(page));
@@ -824,7 +824,7 @@ static int shmem_writepage(struct page *
 	info = SHMEM_I(inode);
 	if (info->flags & VM_LOCKED)
 		goto redirty;
-	swap = get_swap_page();
+	swap = get_swap_page(mapping, index);
 	if (!swap.val)
 		goto redirty;
 
diff -puN mm/swapfile.c~swapspace-layout-improvements mm/swapfile.c
--- 25/mm/swapfile.c~swapspace-layout-improvements	2005-05-02 23:36:30.000000000 -0700
+++ 25-akpm/mm/swapfile.c	2005-05-02 23:36:30.000000000 -0700
@@ -13,6 +13,7 @@
 #include <linux/kernel_stat.h>
 #include <linux/swap.h>
 #include <linux/vmalloc.h>
+#include <linux/hash.h>
 #include <linux/pagemap.h>
 #include <linux/namei.h>
 #include <linux/shm.h>
@@ -84,71 +85,52 @@ void swap_unplug_io_fn(struct backing_de
 	up_read(&swap_unplug_sem);
 }
 
-static inline int scan_swap_map(struct swap_info_struct *si)
-{
-	unsigned long offset;
-	/* 
-	 * We try to cluster swap pages by allocating them
-	 * sequentially in swap.  Once we've allocated
-	 * SWAPFILE_CLUSTER pages this way, however, we resort to
-	 * first-free allocation, starting a new cluster.  This
-	 * prevents us from scattering swap pages all over the entire
-	 * swap partition, so that we reduce overall disk seek times
-	 * between swap pages.  -- sct */
-	if (si->cluster_nr) {
-		while (si->cluster_next <= si->highest_bit) {
-			offset = si->cluster_next++;
-			if (si->swap_map[offset])
-				continue;
-			si->cluster_nr--;
-			goto got_page;
-		}
-	}
-	si->cluster_nr = SWAPFILE_CLUSTER;
+int akpm;
 
-	/* try to find an empty (even not aligned) cluster. */
-	offset = si->lowest_bit;
- check_next_cluster:
-	if (offset+SWAPFILE_CLUSTER-1 <= si->highest_bit)
-	{
-		unsigned long nr;
-		for (nr = offset; nr < offset+SWAPFILE_CLUSTER; nr++)
-			if (si->swap_map[nr])
-			{
-				offset = nr+1;
-				goto check_next_cluster;
-			}
-		/* We found a completly empty cluster, so start
-		 * using it.
-		 */
-		goto got_page;
-	}
-	/* No luck, so now go finegrined as usual. -Andrea */
-	for (offset = si->lowest_bit; offset <= si->highest_bit ; offset++) {
-		if (si->swap_map[offset])
+/*
+ * We divide the swapdev into 1024 kilobyte chunks.  We use the cookie and the
+ * upper bits of the index to select a chunk and the rest of the index as the
+ * offset into the selected chunk.
+ */
+#define CHUNK_SHIFT	(20 - PAGE_SHIFT)
+#define CHUNK_MASK	(-1UL << CHUNK_SHIFT)
+
+static int
+scan_swap_map(struct swap_info_struct *si, void *cookie, pgoff_t index)
+{
+	unsigned long chunk;
+	unsigned long nchunks;
+	unsigned long block;
+	unsigned long scan;
+
+	nchunks = si->max >> CHUNK_SHIFT;
+	chunk = 0;
+	if (nchunks)
+		chunk = hash_long((unsigned long)cookie + (index & CHUNK_MASK),
+					BITS_PER_LONG) % nchunks;
+
+	block = (chunk << CHUNK_SHIFT) + (index & ~CHUNK_MASK);
+
+	for (scan = 0; scan < si->max; scan++, block++) {
+		if (block == si->max)
+			block = 0;
+		if (block == 0)
 			continue;
-		si->lowest_bit = offset+1;
-	got_page:
-		if (offset == si->lowest_bit)
-			si->lowest_bit++;
-		if (offset == si->highest_bit)
-			si->highest_bit--;
-		if (si->lowest_bit > si->highest_bit) {
-			si->lowest_bit = si->max;
-			si->highest_bit = 0;
-		}
-		si->swap_map[offset] = 1;
+		if (si->swap_map[block])
+			continue;
+		si->swap_map[block] = 1;
 		si->inuse_pages++;
 		nr_swap_pages--;
-		si->cluster_next = offset+1;
-		return offset;
+		if (akpm)
+			printk("cookie:%p, index:%lu, chunk:%lu nchunks:%lu "
+				"block:%lu\n",
+				cookie, index, chunk, nchunks, block);
+		return block;
 	}
-	si->lowest_bit = si->max;
-	si->highest_bit = 0;
 	return 0;
 }
 
-swp_entry_t get_swap_page(void)
+swp_entry_t get_swap_page(void *cookie, pgoff_t index)
 {
 	struct swap_info_struct * p;
 	unsigned long offset;
@@ -167,7 +149,7 @@ swp_entry_t get_swap_page(void)
 		p = &swap_info[type];
 		if ((p->flags & SWP_ACTIVE) == SWP_ACTIVE) {
 			swap_device_lock(p);
-			offset = scan_swap_map(p);
+			offset = scan_swap_map(p, cookie, index);
 			swap_device_unlock(p);
 			if (offset) {
 				entry = swp_entry(type,offset);
diff -puN mm/swap_state.c~swapspace-layout-improvements mm/swap_state.c
--- 25/mm/swap_state.c~swapspace-layout-improvements	2005-05-02 23:36:30.000000000 -0700
+++ 25-akpm/mm/swap_state.c	2005-05-02 23:36:30.000000000 -0700
@@ -139,8 +139,12 @@ void __delete_from_swap_cache(struct pag
  *
  * Allocate swap space for the page and add the page to the
  * swap cache.  Caller needs to hold the page lock. 
+ *
+ * We attempt to lay pages out on swap so that virtually-contiguous pages are
+ * contiguous on-disk.  To do this we utilise page->index (offset into vma) and
+ * page->mapping (the anon_vma's address).
  */
-int add_to_swap(struct page * page)
+int add_to_swap(struct page *page, void *cookie, pgoff_t index)
 {
 	swp_entry_t entry;
 	int err;
@@ -149,7 +153,7 @@ int add_to_swap(struct page * page)
 		BUG();
 
 	for (;;) {
-		entry = get_swap_page();
+		entry = get_swap_page(cookie, index);
 		if (!entry.val)
 			return 0;
 
diff -puN mm/vmscan.c~swapspace-layout-improvements mm/vmscan.c
--- 25/mm/vmscan.c~swapspace-layout-improvements	2005-05-02 23:36:30.000000000 -0700
+++ 25-akpm/mm/vmscan.c	2005-05-02 23:36:30.000000000 -0700
@@ -408,7 +408,10 @@ static int shrink_list(struct list_head 
 		 * Try to allocate it some swap space here.
 		 */
 		if (PageAnon(page) && !PageSwapCache(page)) {
-			if (!add_to_swap(page))
+			void *cookie = page->mapping;
+			pgoff_t index = page->index;
+
+			if (!add_to_swap(page, cookie, index))
 				goto activate_locked;
 		}
 #endif /* CONFIG_SWAP */
_



* Re: [rfc][patch] swap: virtual swap readahead
  2009-05-27 21:48 ` Andrew Morton
@ 2009-05-28  0:14   ` Johannes Weiner
  0 siblings, 0 replies; 7+ messages in thread
From: Johannes Weiner @ 2009-05-28  0:14 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, linux-kernel, hugh.dickins, riel

On Wed, May 27, 2009 at 02:48:51PM -0700, Andrew Morton wrote:
> On Wed, 27 May 2009 17:05:46 +0200
> Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > The current swap readahead implementation reads a physically
> > contiguous group of swap slots around the faulting page to take
> > advantage of the disk head's position and in the hope that the
> > surrounding pages will be needed soon as well.
> > 
> > This works as long as the physical swap slot order approximates the
> > LRU order decently, otherwise it wastes memory and IO bandwidth to
> > read in pages that are unlikely to be needed soon.
> > 
> > However, the physical swap slot layout diverges from the LRU order
> > with increasing swap activity, i.e. high memory pressure situations,
> > and this is exactly the situation where swapin should not waste any
> > memory or IO bandwidth as both are the most contended resources at
> > this point.
> > 
> > This patch makes swap-in base its readaround window on the virtual
> > proximity of pages in the faulting VMA, as an indicator for pages
> > needed in the near future, while still taking physical locality of
> > swap slots into account.
> > 
> > This has the advantage of reading in big batches when the LRU order
> > matches the swap slot order while automatically throttling readahead
> > when the system is thrashing and swap slots are no longer nicely
> > grouped by LRU order.
> > 
> 
> Well.  It would be better to _not_ shrink readaround, but to make it
> read the right pages (see below).
> 
> Or perhaps the readaround size is just too large.  I did spend some
> time playing with its size back in the dark ages and ended up deciding
> that the current setting is OK, but that was across a range of
> workloads.
> 
> Did you try simply decreasing the cluster size and seeing if that had a
> similar effect upon this workload?

No, I will try that.

> Back in 2002 or thereabouts I had a patch <rummage, rummage.  Appended>
> which does this the other way.  It attempts to ensure that swap space
> is allocated so that virtually contiguous pages get physically
> contiguous blocks on disk.  So that when swapspace readaround does its
> thing, the blocks which it reads are populating pages which are
> virtually "close" to the page which got the major fault.
> 
> Unfortunately I wasn't able to demonstrate much performance benefit
> from it and didn't get around to working out why.

I did something similar once: I broke swap space down into contiguous
clusters sized 1 << page_cluster that were maintained on
free/partial/full lists per swap device.  Then every anon vma got a
group of clusters that backed its pages.  I think it's best described
as extent-based backing of anon VMAs.  I didn't hash offsets into the
swap device; I just allocated a new cluster from the freelist if there
wasn't already one for that particular part of the vma and then used
page->index & some_mask for the static in-cluster offset.

But IIRC it was defeated by the extra seeking that came with the
internal fragmentation of the extents and separation of lru-related
pages (see below).
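
Roughly like this (from memory, the names are made up for
illustration and are not from the actual patch):

	struct swap_cluster {
		struct list_head list;		/* on the device's free/partial/full list */
		unsigned long first_slot;	/* start of 1 << page_cluster slots */
		unsigned int nr_used;		/* allocated slots in this cluster */
	};

	/* static slot for a page within its vma's cluster: */
	slot = cluster->first_slot + (page->index & ((1 << page_cluster) - 1));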

> iirc, the way it worked was: divide swap into 1MB hunks.  When we
> decide to add an anon page to swapcache, grab a 1MB hunk of swap and
> then add the pages which are virtual neighbours of the target page to
> swapcache as well.
> 
> Obviously the algorithm could be tweaked/tuned/fixed, but the idea
> seems sound - the cost of reading a contiguous hunk of blocks is not a
> lot more than reading the single block.
> 
> Maybe it's something you might like to have a think about.

I gave up on the idea because I think the VMA order is a good hint,
but not a sound basis for laying out swap slots exclusively.  The
reason is that one slice of LRU pages with the same access frequency
(at scan-time granularity) doesn't come from one VMA only but from
several, so strict VMA grouping physically separates LRU-related
pages on swap and thus unavoidably adds holes between data that is
used at similar frequencies.

I think this is a realistic memory state:

	vma 1:		 [a b c d e f g]
	vma 2:		 [h i j k l m n]
	LRU/swap layout: [c d e i j k l]

A major fault on page d would readahead [c d e] with my patch (and
maybe also i and j with the current readahead algorithm).

With swap pages explicitly vma-grouped instead, it could now very
well read [a b c d] or [d e f g].  That is fine unless we need that
memory for pages like i, j, k and l, which are likely to be needed
earlier than a, b, f and g.

And because swap space is hashed into by a rather arbitrary value
like the anon vma address, with the slots for [a b], [f g] and h
(all of different access frequency) in between, you might now have
quite some distance between [c d e] and [i j k l], which are likely
to be used together.

I think our current swap allocation layout is great at keeping things
compact.  But it will not always keep the LRU order intact, which I
found very hard to fix without moving the performance hits someplace
else.

> > - * Primitive swap readahead code. We simply read an aligned block of
> > - * (1 << page_cluster) entries in the swap area. This method is chosen
> > - * because it doesn't cost us any seek time.  We also make sure to queue
> > - * the 'original' request together with the readahead ones...
> > - *
> > -	/*
> > -	 * Get starting offset for readaround, and number of pages to read.
> > -	 * Adjust starting address by readbehind (for NUMA interleave case)?
> > -	 * No, it's very unlikely that swap layout would follow vma layout,
> > -	 * more likely that neighbouring swap pages came from the same node:
> > -	 * so use the same "addr" to choose the same node for each swap read.
> > -	 */
> 
> The patch deletes the old design description but doesn't add a
> description of the new design :(

Bad indeed, I will fix it up.

Thanks for your time,

	Hannes


* Re: [rfc][patch] swap: virtual swap readahead
  2009-05-27 15:05 [rfc][patch] swap: virtual swap readahead Johannes Weiner
  2009-05-27 17:32 ` Rik van Riel
  2009-05-27 21:48 ` Andrew Morton
@ 2009-06-01  8:05 ` Andi Kleen
  2009-06-08  7:52 ` Wu Fengguang
  3 siblings, 0 replies; 7+ messages in thread
From: Andi Kleen @ 2009-06-01  8:05 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, linux-mm, linux-kernel, Hugh Dickins, Rik van Riel

Johannes Weiner <hannes@cmpxchg.org> writes:
>
> This patch makes swap-in base its readaround window on the virtual
> proximity of pages in the faulting VMA, as an indicator for pages
> needed in the near future, while still taking physical locality of
> swap slots into account.

I think it's a good idea, something that needed fixing in Linux forever.

Now if we can only start swapping out in larger clusters too.

> +		if (swp_type(swp) != swp_type(entry))
> +			continue;
> +		/*
> +		 * Dont move the disk head too far away.  This also
> +		 * throttles readahead while thrashing, where virtual
> +		 * order diverges more and more from physical order.
> +		 */
> +		if (swp_offset(swp) > pmax)
> +			continue;
> +		if (swp_offset(swp) < pmin)
> +			continue;
> +		page = read_swap_cache_async(swp, gfp_mask, vma, pos);

It would be a good idea then to fix r_s_c_a() to pass down the VMA
and use alloc_page_vma() down below, so that NUMA policy is preserved
across swapin.

I originally tried this when I did the NUMA policy code, but then Hugh
pointed out it was useless because the prefetched pages are not
necessarily from this VMA anyway. With your virtual readahead it would
make sense again.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [rfc][patch] swap: virtual swap readahead
  2009-05-27 15:05 [rfc][patch] swap: virtual swap readahead Johannes Weiner
                   ` (2 preceding siblings ...)
  2009-06-01  8:05 ` Andi Kleen
@ 2009-06-08  7:52 ` Wu Fengguang
  2009-06-08 17:58   ` Johannes Weiner
  3 siblings, 1 reply; 7+ messages in thread
From: Wu Fengguang @ 2009-06-08  7:52 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, linux-mm, linux-kernel, Hugh Dickins, Rik van Riel

[-- Attachment #1: Type: text/plain, Size: 6763 bytes --]

On Wed, May 27, 2009 at 05:05:46PM +0200, Johannes Weiner wrote:
> The current swap readahead implementation reads a physically
> contiguous group of swap slots around the faulting page to take
> advantage of the disk head's position and in the hope that the
> surrounding pages will be needed soon as well.
> 
> This works as long as the physical swap slot order approximates the
> LRU order decently, otherwise it wastes memory and IO bandwidth to
> read in pages that are unlikely to be needed soon.
> 
> However, the physical swap slot layout diverges from the LRU order
> with increasing swap activity, i.e. high memory pressure situations,
> and this is exactly the situation where swapin should not waste any
> memory or IO bandwidth as both are the most contended resources at
> this point.
> 
> This patch makes swap-in base its readaround window on the virtual
> proximity of pages in the faulting VMA, as an indicator for pages
> needed in the near future, while still taking physical locality of
> swap slots into account.
> 
> This has the advantage of reading in big batches when the LRU order
> matches the swap slot order while automatically throttling readahead
> when the system is thrashing and swap slots are no longer nicely
> grouped by LRU order.

Hi Johannes,

You may want to test the patch against a real desktop :)
The attached scripts can do that. I also have the setup to
test it out conveniently, so if you send me the latest patch..

Thanks,
Fengguang

> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
> Cc: Rik van Riel <riel@redhat.com>
> ---
>  mm/swap_state.c |   80 +++++++++++++++++++++++++++++++++++++++----------------
>  1 files changed, 57 insertions(+), 23 deletions(-)
> 
> qsbench, 20 runs each, 1.7GB RAM, 2GB swap, 4 cores:
> 
>          "mean (standard deviation) median"
> 
> All values are given in seconds.  I used a t-test to make sure there
> is a statistical difference of at least 95% probability in the
> compared runs for the given number of samples, arithmetic mean and
> standard deviation.
> 
> 1 x 2048M
> vanilla: 391.25 ( 71.76) 384.56
> vswapra: 445.55 ( 83.19) 415.41
> 
> 	This is an actual regression.  I am not yet quite sure why
> 	this happens and I am undecided whether one humonguous active
> 	vma in the system is a common workload.
> 
> 	It's also the only regression I found, with qsbench anyway.  I
> 	started out with powers of two and tweaked the parameters
> 	until the results between the two kernel versions differed.
> 
> 2 x 1024M
> vanilla: 384.25 ( 75.00) 423.08
> vswapra: 290.26 ( 31.38) 299.51
> 
> 4 x 540M
> vanilla: 553.91 (100.02) 554.57
> vswapra: 336.58 ( 52.49) 331.52
> 
> 8 x 280M
> vanilla: 561.08 ( 82.36) 583.12
> vswapra: 319.13 ( 43.17) 307.69
> 
> 16 x 128M
> vanilla: 285.51 (113.20) 236.62
> vswapra: 214.24 ( 62.37) 214.15
> 
> 	All these show a nice improvement in performance and runtime
> 	stability.
> 
> The missing shmem support is a big TODO, I will try to find time to
> tackle this when the overall idea is not refused in the first place.
> 
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 3ecea98..8f8daaa 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -336,11 +336,6 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
>   *
>   * Returns the struct page for entry and addr, after queueing swapin.
>   *
> - * Primitive swap readahead code. We simply read an aligned block of
> - * (1 << page_cluster) entries in the swap area. This method is chosen
> - * because it doesn't cost us any seek time.  We also make sure to queue
> - * the 'original' request together with the readahead ones...
> - *
>   * This has been extended to use the NUMA policies from the mm triggering
>   * the readahead.
>   *
> @@ -349,27 +344,66 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
>  struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
>  			struct vm_area_struct *vma, unsigned long addr)
>  {
> -	int nr_pages;
> -	struct page *page;
> -	unsigned long offset;
> -	unsigned long end_offset;
> -
> -	/*
> -	 * Get starting offset for readaround, and number of pages to read.
> -	 * Adjust starting address by readbehind (for NUMA interleave case)?
> -	 * No, it's very unlikely that swap layout would follow vma layout,
> -	 * more likely that neighbouring swap pages came from the same node:
> -	 * so use the same "addr" to choose the same node for each swap read.
> -	 */
> -	nr_pages = valid_swaphandles(entry, &offset);
> -	for (end_offset = offset + nr_pages; offset < end_offset; offset++) {
> -		/* Ok, do the async read-ahead now */
> -		page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
> -						gfp_mask, vma, addr);
> +	int cluster = 1 << page_cluster;
> +	int window = cluster << PAGE_SHIFT;
> +	unsigned long start, pos, end;
> +	unsigned long pmin, pmax;
> +
> +	/* XXX: fix this for shmem */
> +	if (!vma || !vma->vm_mm)
> +		goto nora;
> +
> +	/* Physical range to read from */
> +	pmin = swp_offset(entry) & ~(cluster - 1);
> +	pmax = pmin + cluster;
> +
> +	/* Virtual range to read from */
> +	start = addr & ~(window - 1);
> +	end = start + window;
> +
> +	for (pos = start; pos < end; pos += PAGE_SIZE) {
> +		struct page *page;
> +		swp_entry_t swp;
> +		spinlock_t *ptl;
> +		pgd_t *pgd;
> +		pud_t *pud;
> +		pmd_t *pmd;
> +		pte_t *pte;
> +
> +		pgd = pgd_offset(vma->vm_mm, pos);
> +		if (!pgd_present(*pgd))
> +			continue;
> +		pud = pud_offset(pgd, pos);
> +		if (!pud_present(*pud))
> +			continue;
> +		pmd = pmd_offset(pud, pos);
> +		if (!pmd_present(*pmd))
> +			continue;
> +		pte = pte_offset_map_lock(vma->vm_mm, pmd, pos, &ptl);
> +		if (!is_swap_pte(*pte)) {
> +			pte_unmap_unlock(pte, ptl);
> +			continue;
> +		}
> +		swp = pte_to_swp_entry(*pte);
> +		pte_unmap_unlock(pte, ptl);
> +
> +		if (swp_type(swp) != swp_type(entry))
> +			continue;
> +		/*
> +		 * Dont move the disk head too far away.  This also
> +		 * throttles readahead while thrashing, where virtual
> +		 * order diverges more and more from physical order.
> +		 */
> +		if (swp_offset(swp) > pmax)
> +			continue;
> +		if (swp_offset(swp) < pmin)
> +			continue;
> +		page = read_swap_cache_async(swp, gfp_mask, vma, pos);
>  		if (!page)
> -			break;
> +			continue;
>  		page_cache_release(page);
>  	}
>  	lru_add_drain();	/* Push any new pages onto the LRU now */
> +nora:
>  	return read_swap_cache_async(entry, gfp_mask, vma, addr);
>  }
> -- 
> 1.6.3
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

[-- Attachment #2: run-many-x-apps.sh --]
[-- Type: application/x-sh, Size: 1707 bytes --]

[-- Attachment #3: test-mmap-exec-prot.sh --]
[-- Type: application/x-sh, Size: 203 bytes --]


* Re: [rfc][patch] swap: virtual swap readahead
  2009-06-08  7:52 ` Wu Fengguang
@ 2009-06-08 17:58   ` Johannes Weiner
  0 siblings, 0 replies; 7+ messages in thread
From: Johannes Weiner @ 2009-06-08 17:58 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, linux-mm, linux-kernel, Hugh Dickins, Rik van Riel

On Mon, Jun 08, 2009 at 03:52:46PM +0800, Wu Fengguang wrote:
> On Wed, May 27, 2009 at 05:05:46PM +0200, Johannes Weiner wrote:
> > The current swap readahead implementation reads a physically
> > contiguous group of swap slots around the faulting page to take
> > advantage of the disk head's position and in the hope that the
> > surrounding pages will be needed soon as well.
> > 
> > This works as long as the physical swap slot order approximates the
> > LRU order decently, otherwise it wastes memory and IO bandwidth to
> > read in pages that are unlikely to be needed soon.
> > 
> > However, the physical swap slot layout diverges from the LRU order
> > with increasing swap activity, i.e. high memory pressure situations,
> > and this is exactly the situation where swapin should not waste any
> > memory or IO bandwidth as both are the most contended resources at
> > this point.
> > 
> > This patch makes swap-in base its readaround window on the virtual
> > proximity of pages in the faulting VMA, as an indicator for pages
> > needed in the near future, while still taking physical locality of
> > swap slots into account.
> > 
> > This has the advantage of reading in big batches when the LRU order
> > matches the swap slot order while automatically throttling readahead
> > when the system is thrashing and swap slots are no longer nicely
> > grouped by LRU order.
> 
> Hi Johannes,
> 
> You may want to test the patch against a real desktop :)
> The attached scripts can do that. I also have the setup to
> test it out conveniently, so if you send me the latest patch..

Thanks a bunch for the offer!  I'm just now incorporating Hugh's
feedback and hope I will be back soon with the next version.  I will
let you know, for sure.

	Hannes


