* [PATCH -mm] make swapin readahead skip over holes
@ 2012-01-09 23:10 Rik van Riel
  2012-01-09 23:49 ` KOSAKI Motohiro
  2012-01-11 16:51 ` Mel Gorman
  0 siblings, 2 replies; 14+ messages in thread
From: Rik van Riel @ 2012-01-09 23:10 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, akpm, Mel Gorman, KOSAKI Motohiro, Johannes Weiner

Ever since abandoning the virtual scan of processes, for scalability
reasons, swap space has been a little more fragmented than before.
This can lead to the situation where a large memory user is killed,
swap space ends up full of "holes" and swapin readahead is totally
ineffective.

On my home system, after killing a leaky firefox it took over an
hour to page just under 2GB of memory back in, slowing the virtual
machines down to a crawl.

This patch makes swapin readahead simply skip over holes, instead
of stopping at them.  This allows the system to swap things back in
at rates of several MB/second, instead of a few hundred kB/second.

The checks done in valid_swaphandles are already done in 
read_swap_cache_async as well, allowing us to remove a fair amount
of code.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 include/linux/swap.h |    2 +-
 mm/swap_state.c      |   17 ++++----------
 mm/swapfile.c        |   56 +++++++++++--------------------------------------
 3 files changed, 19 insertions(+), 56 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 1e22e12..6e1282e 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -328,7 +328,7 @@ extern long total_swap_pages;
 extern void si_swapinfo(struct sysinfo *);
 extern swp_entry_t get_swap_page(void);
 extern swp_entry_t get_swap_page_of_type(int);
-extern int valid_swaphandles(swp_entry_t, unsigned long *);
+extern void get_swap_cluster(swp_entry_t, unsigned long *, unsigned long *);
 extern int add_swap_count_continuation(swp_entry_t, gfp_t);
 extern void swap_shmem_alloc(swp_entry_t);
 extern int swap_duplicate(swp_entry_t);
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 78cc4d1..36501b3 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -373,25 +373,18 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 			struct vm_area_struct *vma, unsigned long addr)
 {
-	int nr_pages;
 	struct page *page;
-	unsigned long offset;
+	unsigned long offset = swp_offset(entry);
 	unsigned long end_offset;
 
-	/*
-	 * Get starting offset for readaround, and number of pages to read.
-	 * Adjust starting address by readbehind (for NUMA interleave case)?
-	 * No, it's very unlikely that swap layout would follow vma layout,
-	 * more likely that neighbouring swap pages came from the same node:
-	 * so use the same "addr" to choose the same node for each swap read.
-	 */
-	nr_pages = valid_swaphandles(entry, &offset);
-	for (end_offset = offset + nr_pages; offset < end_offset; offset++) {
+	get_swap_cluster(entry, &offset, &end_offset);
+
+	for (; offset <= end_offset ; offset++) {
 		/* Ok, do the async read-ahead now */
 		page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
 						gfp_mask, vma, addr);
 		if (!page)
-			break;
+			continue;
 		page_cache_release(page);
 	}
 	lru_add_drain();	/* Push any new pages onto the LRU now */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index b1cd120..8dae1ca 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2289,55 +2289,25 @@ int swapcache_prepare(swp_entry_t entry)
 }
 
 /*
- * swap_lock prevents swap_map being freed. Don't grab an extra
- * reference on the swaphandle, it doesn't matter if it becomes unused.
+ * Return a swap cluster sized and aligned block around offset.
  */
-int valid_swaphandles(swp_entry_t entry, unsigned long *offset)
+void get_swap_cluster(swp_entry_t entry, unsigned long *begin,
+			unsigned long *end)
 {
 	struct swap_info_struct *si;
-	int our_page_cluster = page_cluster;
-	pgoff_t target, toff;
-	pgoff_t base, end;
-	int nr_pages = 0;
-
-	if (!our_page_cluster)	/* no readahead */
-		return 0;
-
-	si = swap_info[swp_type(entry)];
-	target = swp_offset(entry);
-	base = (target >> our_page_cluster) << our_page_cluster;
-	end = base + (1 << our_page_cluster);
-	if (!base)		/* first page is swap header */
-		base++;
+	unsigned long offset = swp_offset(entry);
 
 	spin_lock(&swap_lock);
-	if (end > si->max)	/* don't go beyond end of map */
-		end = si->max;
-
-	/* Count contiguous allocated slots above our target */
-	for (toff = target; ++toff < end; nr_pages++) {
-		/* Don't read in free or bad pages */
-		if (!si->swap_map[toff])
-			break;
-		if (swap_count(si->swap_map[toff]) == SWAP_MAP_BAD)
-			break;
-	}
-	/* Count contiguous allocated slots below our target */
-	for (toff = target; --toff >= base; nr_pages++) {
-		/* Don't read in free or bad pages */
-		if (!si->swap_map[toff])
-			break;
-		if (swap_count(si->swap_map[toff]) == SWAP_MAP_BAD)
-			break;
-	}
+	si = swap_info[swp_type(entry)];
+	/* Round the begin down to a page_cluster boundary. */
+	offset = (offset >> page_cluster) << page_cluster;
+	*begin = offset;
+	/* Round the end up, but not beyond the end of the swap device. */
+	offset = offset + (1 << page_cluster);
+	if (offset > si->max)
+		offset = si->max;
+	*end = offset;
 	spin_unlock(&swap_lock);
-
-	/*
-	 * Indicate starting offset, and return number of pages to get:
-	 * if only 1, say 0, since there's then no readahead to be done.
-	 */
-	*offset = ++toff;
-	return nr_pages? ++nr_pages: 0;
 }
 
 /*


* Re: [PATCH -mm] make swapin readahead skip over holes
  2012-01-09 23:10 [PATCH -mm] make swapin readahead skip over holes Rik van Riel
@ 2012-01-09 23:49 ` KOSAKI Motohiro
  2012-01-10  3:09   ` Rik van Riel
  2012-01-11 16:51 ` Mel Gorman
  1 sibling, 1 reply; 14+ messages in thread
From: KOSAKI Motohiro @ 2012-01-09 23:49 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-mm, linux-kernel, akpm, Mel Gorman, Johannes Weiner

(1/9/12 6:10 PM), Rik van Riel wrote:
> Ever since abandoning the virtual scan of processes, for scalability
> reasons, swap space has been a little more fragmented than before.
> This can lead to the situation where a large memory user is killed,
> swap space ends up full of "holes" and swapin readahead is totally
> ineffective.
>
> On my home system, after killing a leaky firefox it took over an
> hour to page just under 2GB of memory back in, slowing the virtual
> machines down to a crawl.
>
> This patch makes swapin readahead simply skip over holes, instead
> of stopping at them.  This allows the system to swap things back in
> at rates of several MB/second, instead of a few hundred kB/second.

If I understand correctly, this patch has

Pros
  - increased IO throughput
Cons
  - an increased risk of picking up unrelated swap entries during readahead


The changelog explains the former but not the latter. I'm a bit
hesitant now.


* Re: [PATCH -mm] make swapin readahead skip over holes
  2012-01-09 23:49 ` KOSAKI Motohiro
@ 2012-01-10  3:09   ` Rik van Riel
  2012-01-11  7:14     ` KOSAKI Motohiro
  0 siblings, 1 reply; 14+ messages in thread
From: Rik van Riel @ 2012-01-10  3:09 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: linux-mm, linux-kernel, akpm, Mel Gorman, Johannes Weiner

On 01/09/2012 06:49 PM, KOSAKI Motohiro wrote:
> (1/9/12 6:10 PM), Rik van Riel wrote:
>> Ever since abandoning the virtual scan of processes, for scalability
>> reasons, swap space has been a little more fragmented than before.
>> This can lead to the situation where a large memory user is killed,
>> swap space ends up full of "holes" and swapin readahead is totally
>> ineffective.
>>
>> On my home system, after killing a leaky firefox it took over an
>> hour to page just under 2GB of memory back in, slowing the virtual
>> machines down to a crawl.
>>
>> This patch makes swapin readahead simply skip over holes, instead
>> of stopping at them. This allows the system to swap things back in
>> at rates of several MB/second, instead of a few hundred kB/second.
>
> If I understand correctly, this patch has
>
> Pros
> - increased IO throughput

By about a factor of 3-10 in my tests here.

> Cons
> - an increased risk of picking up unrelated swap entries during readahead

I do not believe there is a very large risk of this, because
since we introduced rmap, we have been placing unrelated
pages right next to each other in swap.

This is also why, since 2.6.28, the kernel places newly swapped
in pages on the INACTIVE_ANON list, where they should not
displace the working set.

Another factor is that swapping on modern systems is often a
temporary thing. During a load spike, things get swapped out
and run slowly. After the load spike is over, or some memory
hog process got killed, we want the system to recover to normal
performance as soon as possible.  This often involves swapping
everything back into memory.

-- 
All rights reversed


* Re: [PATCH -mm] make swapin readahead skip over holes
  2012-01-10  3:09   ` Rik van Riel
@ 2012-01-11  7:14     ` KOSAKI Motohiro
  2012-01-11  8:01       ` KOSAKI Motohiro
  2012-01-11 23:15       ` Andrew Morton
  0 siblings, 2 replies; 14+ messages in thread
From: KOSAKI Motohiro @ 2012-01-11  7:14 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-mm, linux-kernel, akpm, Mel Gorman, Johannes Weiner

2012/1/9 Rik van Riel <riel@redhat.com>:
> On 01/09/2012 06:49 PM, KOSAKI Motohiro wrote:
>>
>> (1/9/12 6:10 PM), Rik van Riel wrote:
>>>
>>> Ever since abandoning the virtual scan of processes, for scalability
>>> reasons, swap space has been a little more fragmented than before.
>>> This can lead to the situation where a large memory user is killed,
>>> swap space ends up full of "holes" and swapin readahead is totally
>>> ineffective.
>>>
>>> On my home system, after killing a leaky firefox it took over an
>>> hour to page just under 2GB of memory back in, slowing the virtual
>>> machines down to a crawl.
>>>
>>> This patch makes swapin readahead simply skip over holes, instead
>>> of stopping at them. This allows the system to swap things back in
>>> at rates of several MB/second, instead of a few hundred kB/second.
>>
>>
>> If I understand correctly, this patch has
>>
>> Pros
>> - increased IO throughput
>
>
> By about a factor of 3-10 in my tests here.
>
>
>> Cons
>> - an increased risk of picking up unrelated swap entries during readahead
>
>
> I do not believe there is a very large risk of this, because
> since we introduced rmap, we have been placing unrelated
> pages right next to each other in swap.
>
> This is also why, since 2.6.28, the kernel places newly swapped
> in pages on the INACTIVE_ANON list, where they should not
> displace the working set.
>
> Another factor is that swapping on modern systems is often a
> temporary thing. During a load spike, things get swapped out
> and run slowly. After the load spike is over, or some memory
> hog process got killed, we want the system to recover to normal
> performance as soon as possible.  This often involves swapping
> everything back into memory.

Hmmm.... OK, I have to agree with this.
But if so, skipping holes is not the best way. I think we should always
make one big IO, even if the swap cluster has some holes. One big IO is
usually faster than multiple small IOs, isn't it?

Also, I doubt the current page_cluster default is the best value nowadays.


* Re: [PATCH -mm] make swapin readahead skip over holes
  2012-01-11  7:14     ` KOSAKI Motohiro
@ 2012-01-11  8:01       ` KOSAKI Motohiro
  2012-01-11  8:05         ` KOSAKI Motohiro
  2012-01-11 14:11         ` John Stoffel
  2012-01-11 23:15       ` Andrew Morton
  1 sibling, 2 replies; 14+ messages in thread
From: KOSAKI Motohiro @ 2012-01-11  8:01 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-mm, linux-kernel, akpm, Mel Gorman, Johannes Weiner

> Also, I doubt the current page_cluster default is the best value nowadays.

I meant that the current average HDD spec is:
 - average seek time: 8.5ms
 - sequential access performance: about 60MB/sec

so we can eat a free lunch of up to 7MB ~= 60(MB/sec) * 1000 / 8.5(ms).


* Re: [PATCH -mm] make swapin readahead skip over holes
  2012-01-11  8:01       ` KOSAKI Motohiro
@ 2012-01-11  8:05         ` KOSAKI Motohiro
  2012-01-11 14:11         ` John Stoffel
  1 sibling, 0 replies; 14+ messages in thread
From: KOSAKI Motohiro @ 2012-01-11  8:05 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-mm, linux-kernel, akpm, Mel Gorman, Johannes Weiner

2012/1/11 KOSAKI Motohiro <kosaki.motohiro@gmail.com>:
>> Also, I doubt the current page_cluster default is the best value nowadays.
>
> I meant that the current average HDD spec is:
>  - average seek time: 8.5ms
>  - sequential access performance: about 60MB/sec
>
> so we can eat a free lunch of up to 7MB ~= 60(MB/sec) * 1000 / 8.5(ms).

Bah! I'm a moron.
The correct formula is:

500KB = 60(MB/sec) / 1000(msec/sec) * 8.5(ms)
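
For concreteness, a compilable back-of-the-envelope sketch of that
arithmetic (the 60MB/sec and 8.5ms figures are just the assumptions
above, not measurements):

#include <stdio.h>

int main(void)
{
	double throughput_mb = 60.0;	/* sequential throughput, MB/sec */
	double seek_ms = 8.5;		/* average seek time, ms */

	/* Data transferred in the time one average seek takes:
	 * 60(MB/sec) / 1000(msec/sec) * 8.5(ms) ~= 0.51MB ~= 500KB. */
	double free_lunch_kb = throughput_mb / 1000.0 * seek_ms * 1024.0;

	printf("break-even readahead: ~%.0f KB\n", free_lunch_kb);
	return 0;
}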


* Re: [PATCH -mm] make swapin readahead skip over holes
  2012-01-11  8:01       ` KOSAKI Motohiro
  2012-01-11  8:05         ` KOSAKI Motohiro
@ 2012-01-11 14:11         ` John Stoffel
  2012-01-11 19:17           ` Christoph Lameter
  1 sibling, 1 reply; 14+ messages in thread
From: John Stoffel @ 2012-01-11 14:11 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Rik van Riel, linux-mm, linux-kernel, akpm, Mel Gorman, Johannes Weiner

>>>>> "KOSAKI" == KOSAKI Motohiro <kosaki.motohiro@gmail.com> writes:

>> Also, I doubt the current page_cluster default is the best value nowadays.
KOSAKI> I meant that the current average HDD spec is:
KOSAKI>  - average seek time: 8.5ms
KOSAKI>  - sequential access performance: about 60MB/sec

KOSAKI> so we can eat a free lunch of up to 7MB ~= 60(MB/sec) * 1000 / 8.5(ms).

What if the disk is busy doing other writeout or readin during this
time?  You can't assume you have the full disk bandwidth available,
especially when you hit a swap storm like this.

John


* Re: [PATCH -mm] make swapin readahead skip over holes
  2012-01-09 23:10 [PATCH -mm] make swapin readahead skip over holes Rik van Riel
  2012-01-09 23:49 ` KOSAKI Motohiro
@ 2012-01-11 16:51 ` Mel Gorman
  2012-01-11 19:23   ` Rik van Riel
  1 sibling, 1 reply; 14+ messages in thread
From: Mel Gorman @ 2012-01-11 16:51 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-mm, linux-kernel, akpm, KOSAKI Motohiro, Johannes Weiner

On Mon, Jan 09, 2012 at 06:10:23PM -0500, Rik van Riel wrote:
> Ever since abandoning the virtual scan of processes, for scalability
> reasons, swap space has been a little more fragmented than before.
> This can lead to the situation where a large memory user is killed,
> swap space ends up full of "holes" and swapin readahead is totally
> ineffective.
> 
> On my home system, after killing a leaky firefox it took over an
> hour to page just under 2GB of memory back in, slowing the virtual
> machines down to a crawl.
> 
> This patch makes swapin readahead simply skip over holes, instead
> of stopping at them.  This allows the system to swap things back in
> at rates of several MB/second, instead of a few hundred kB/second.
> 

Nice.

> The checks done in valid_swaphandles are already done in 
> read_swap_cache_async as well, allowing us to remove a fair amount
> of code.
> 
> Signed-off-by: Rik van Riel <riel@redhat.com>
> ---
>  include/linux/swap.h |    2 +-
>  mm/swap_state.c      |   17 ++++----------
>  mm/swapfile.c        |   56 +++++++++++--------------------------------------
>  3 files changed, 19 insertions(+), 56 deletions(-)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 1e22e12..6e1282e 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -328,7 +328,7 @@ extern long total_swap_pages;
>  extern void si_swapinfo(struct sysinfo *);
>  extern swp_entry_t get_swap_page(void);
>  extern swp_entry_t get_swap_page_of_type(int);
> -extern int valid_swaphandles(swp_entry_t, unsigned long *);
> +extern void get_swap_cluster(swp_entry_t, unsigned long *, unsigned long *);
>  extern int add_swap_count_continuation(swp_entry_t, gfp_t);
>  extern void swap_shmem_alloc(swp_entry_t);
>  extern int swap_duplicate(swp_entry_t);
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 78cc4d1..36501b3 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -373,25 +373,18 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
>  struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
>  			struct vm_area_struct *vma, unsigned long addr)
>  {
> -	int nr_pages;
>  	struct page *page;
> -	unsigned long offset;
> +	unsigned long offset = swp_offset(entry);
>  	unsigned long end_offset;
>  
> -	/*
> -	 * Get starting offset for readaround, and number of pages to read.
> -	 * Adjust starting address by readbehind (for NUMA interleave case)?
> -	 * No, it's very unlikely that swap layout would follow vma layout,
> -	 * more likely that neighbouring swap pages came from the same node:
> -	 * so use the same "addr" to choose the same node for each swap read.
> -	 */
> -	nr_pages = valid_swaphandles(entry, &offset);
> -	for (end_offset = offset + nr_pages; offset < end_offset; offset++) {
> +	get_swap_cluster(entry, &offset, &end_offset);
> +
> +	for (; offset <= end_offset ; offset++) {
>  		/* Ok, do the async read-ahead now */
>  		page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
>  						gfp_mask, vma, addr);
>  		if (!page)
> -			break;
> +			continue;
>  		page_cache_release(page);
>  	}

For a heavily fragmented swap file, this will result in more IO and
the gamble is that pages nearby are needed soon. You say your virtual
machines swapin faster and that does not surprise me. I also expect
they need the data so it's a net win.

There is a possibility that, under memory pressure, swapping in
more pages will cause more memory pressure (increased swapin causing
clean page cache discards and pageout) and be an overall loss. This may
be a net loss in some cases, such as where the working set size is just
over physical memory and the increased swapin causes a problem. I doubt
this case is common, but it is worth bearing in mind if future bug
reports complain about increased swap activity.

>  	lru_add_drain();	/* Push any new pages onto the LRU now */
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index b1cd120..8dae1ca 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -2289,55 +2289,25 @@ int swapcache_prepare(swp_entry_t entry)
>  }
>  
>  /*
> - * swap_lock prevents swap_map being freed. Don't grab an extra
> - * reference on the swaphandle, it doesn't matter if it becomes unused.
> + * Return a swap cluster sized and aligned block around offset.
>   */
> -int valid_swaphandles(swp_entry_t entry, unsigned long *offset)
> +void get_swap_cluster(swp_entry_t entry, unsigned long *begin,
> +			unsigned long *end)
>  {
>  	struct swap_info_struct *si;
> -	int our_page_cluster = page_cluster;
> -	pgoff_t target, toff;
> -	pgoff_t base, end;
> -	int nr_pages = 0;
> -
> -	if (!our_page_cluster)	/* no readahead */
> -		return 0;
> -
> -	si = swap_info[swp_type(entry)];
> -	target = swp_offset(entry);
> -	base = (target >> our_page_cluster) << our_page_cluster;
> -	end = base + (1 << our_page_cluster);
> -	if (!base)		/* first page is swap header */
> -		base++;
> +	unsigned long offset = swp_offset(entry);
>  
>  	spin_lock(&swap_lock);
> -	if (end > si->max)	/* don't go beyond end of map */
> -		end = si->max;
> -
> -	/* Count contiguous allocated slots above our target */
> -	for (toff = target; ++toff < end; nr_pages++) {
> -		/* Don't read in free or bad pages */
> -		if (!si->swap_map[toff])
> -			break;
> -		if (swap_count(si->swap_map[toff]) == SWAP_MAP_BAD)
> -			break;
> -	}
> -	/* Count contiguous allocated slots below our target */
> -	for (toff = target; --toff >= base; nr_pages++) {
> -		/* Don't read in free or bad pages */
> -		if (!si->swap_map[toff])
> -			break;
> -		if (swap_count(si->swap_map[toff]) == SWAP_MAP_BAD)
> -			break;
> -	}
> +	si = swap_info[swp_type(entry)];
> +	/* Round the begin down to a page_cluster boundary. */
> +	offset = (offset >> page_cluster) << page_cluster;

Minor nit but it would feel more natural to me to see

offset & ~((1 << page_cluster) - 1)

but I understand that you are reusing the existing code.
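
For the record, a quick standalone check (plain userspace C, nothing
kernel-specific) that the two forms agree:

#include <assert.h>

int main(void)
{
	unsigned long offset = 12345;
	int page_cluster = 3;	/* the usual default readahead order */

	/* Shifting down and back up clears the low page_cluster bits... */
	unsigned long a = (offset >> page_cluster) << page_cluster;
	/* ...as does masking those bits off directly. */
	unsigned long b = offset & ~((1UL << page_cluster) - 1);

	assert(a == b);		/* both yield 12344 */
	return 0;
}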

As an aside, page_cluster is statically calculated at swap_setup and
it is a trivial calculation. Is it worth having it in bss? At the
very least it should be read_mostly. Currently it is oblivious to
memory hotplug, but as the value only changes for machines with < 16M,
it doesn't matter. Would it make sense to just get rid of page_cluster
too?
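
For reference, the whole of that calculation amounts to roughly this
(paraphrased from memory, from mm/swap.c of this era):

void __init swap_setup(void)
{
	unsigned long megs = totalram_pages >> (20 - PAGE_SHIFT);

	/* Use a smaller cluster for small-memory machines. */
	if (megs < 16)
		page_cluster = 2;
	else
		page_cluster = 3;
}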

> +	*begin = offset;
> +	/* Round the end up, but not beyond the end of the swap device. */
> +	offset = offset + (1 << page_cluster);
> +	if (offset > si->max)
> +		offset = si->max;
> +	*end = offset;
>  	spin_unlock(&swap_lock);
> -
> -	/*
> -	 * Indicate starting offset, and return number of pages to get:
> -	 * if only 1, say 0, since there's then no readahead to be done.
> -	 */
> -	*offset = ++toff;
> -	return nr_pages? ++nr_pages: 0;
>  }

This section deletes code which is nice but there is a
problem. Your changelog says that this is duplicating the effort of
read_swap_cache_async() which is true but what it does is

1. a swap cache lookup which will probably fail
2. alloc_page_vma()
3. radix_tree_preload()
4. swapcache_prepare
   - calls __swap_duplicate()
   - finds the hole, bails out

That's a lot of work before the hole is found. Would it be worth
doing a racy check in swapin_readahead without swap lock held before
calling read_swap_cache_async()?
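
Something along these lines, as an untested sketch against this patch's
readahead loop (not a tested patch):

	/* Untested sketch of the racy pre-check suggested above: peek at
	 * the swap map without swap_lock and skip obvious holes before
	 * paying for the allocation work in read_swap_cache_async(). */
	struct swap_info_struct *si = swap_info[swp_type(entry)];

	for (; offset <= end_offset; offset++) {
		/* Racy read: a stale value costs at most one wasted try. */
		if (!si->swap_map[offset])
			continue;
		page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
						gfp_mask, vma, addr);
		if (!page)
			continue;
		page_cache_release(page);
	}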

>  
>  /*
> 

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH -mm] make swapin readahead skip over holes
  2012-01-11 14:11         ` John Stoffel
@ 2012-01-11 19:17           ` Christoph Lameter
  2012-01-11 21:03             ` John Stoffel
  0 siblings, 1 reply; 14+ messages in thread
From: Christoph Lameter @ 2012-01-11 19:17 UTC (permalink / raw)
  To: John Stoffel
  Cc: KOSAKI Motohiro, Rik van Riel, linux-mm, linux-kernel, akpm,
	Mel Gorman, Johannes Weiner

On Wed, 11 Jan 2012, John Stoffel wrote:

> KOSAKI> so we can eat a free lunch of up to 7MB ~= 60(MB/sec) * 1000 / 8.5(ms).
>
> What if the disk is busy doing other writeout or readin during this
> time?  You can't assume you have the full disk bandwidth available,
> esp when you hit a swap storm like this.

The assumptions by Kosaki are quite conservative.

What if one did not get a disk from the garbage heap but instead has a
state of the art storage cluster or simply an SSD (in particular relevant
now since HDs are in short supply given the situation in Asia)?



* Re: [PATCH -mm] make swapin readahead skip over holes
  2012-01-11 16:51 ` Mel Gorman
@ 2012-01-11 19:23   ` Rik van Riel
  2012-01-12 14:33     ` Mel Gorman
  0 siblings, 1 reply; 14+ messages in thread
From: Rik van Riel @ 2012-01-11 19:23 UTC (permalink / raw)
  To: Mel Gorman; +Cc: linux-mm, linux-kernel, akpm, KOSAKI Motohiro, Johannes Weiner

On 01/11/2012 11:51 AM, Mel Gorman wrote:
> On Mon, Jan 09, 2012 at 06:10:23PM -0500, Rik van Riel wrote:

>> +	get_swap_cluster(entry, &offset, &end_offset);
>> +
>> +	for (; offset <= end_offset ; offset++) {
>>   		/* Ok, do the async read-ahead now */
>>   		page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
>>   						gfp_mask, vma, addr);
>>   		if (!page)
>> -			break;
>> +			continue;
>>   		page_cache_release(page);
>>   	}
>
> For a heavily fragmented swap file, this will result in more IO and
> the gamble is that pages nearby are needed soon. You say your virtual
> machines swapin faster and that does not surprise me. I also expect
> they need the data so it's a net win.

More IO operations, yes.  However, IO operations from nearby
blocks can often be done without incurring extra disk seeks,
because the disk head is already in the right place.

This seems to be borne out by the fact that I saw swapin
rates increase from maybe 200-300kB/s to 5-15MB/s...

Even on some SSDs it could avoid some bank switches, though
of course there I would expect the effect to be much less
pronounced.

> There is a possibility that, under memory pressure, swapping in
> more pages will cause more memory pressure (increased swapin causing
> clean page cache discards and pageout) and be an overall loss. This may
> be a net loss in some cases, such as where the working set size is just
> over physical memory and the increased swapin causes a problem. I doubt
> this case is common, but it is worth bearing in mind if future bug
> reports complain about increased swap activity.

True, there may be workloads that benefit from a smaller
page-cluster. The fact that the recently swapped in pages
are all put on the inactive anon list should help protect
the working set, too.

Another alternative may be a time based decision. If we
have swapped something out recently, go with a less
aggressive swapin readahead.

That would automatically give us fast swapin readahead
when in "memory hog just exited, let the system recover"
mode, and conservative swapin readahead in your situation.
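
Purely hypothetical, but the heuristic could be as simple as the sketch
below; last_swapout_jiffies is an invented variable that would have to
be maintained at swapout time:

static unsigned long last_swapout_jiffies;

static int swapin_readahead_order(void)
{
	/* No swapout in the last few seconds: recovery mode, use the
	 * full readahead window. */
	if (time_after(jiffies, last_swapout_jiffies + 5 * HZ))
		return page_cluster;
	/* Recent swapout: likely still under pressure, back off. */
	return max(page_cluster - 2, 0);
}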

However, that could still hurt badly if the system is just
moving the working set from one part of a program to another.

I suspect we will be faster off by having faster swap IO,
which this patch seems to provide.

>> -	si = swap_info[swp_type(entry)];
>> -	target = swp_offset(entry);
>> -	base = (target >> our_page_cluster) << our_page_cluster;
>> -	end = base + (1 << our_page_cluster);
>> -	if (!base)		/* first page is swap header */
>> -		base++;

>> +	si = swap_info[swp_type(entry)];
>> +	/* Round the begin down to a page_cluster boundary. */
>> +	offset = (offset >> page_cluster) << page_cluster;
>
> Minor nit but it would feel more natural to me to see
>
> offset & ~((1 << page_cluster) - 1)
>
> but I understand that you are reusing the existing code.

Sure, I can do that.

While I'm there, I can also add that if (!base) base++
thing back in :)

>> +	*begin = offset;
>> +	/* Round the end up, but not beyond the end of the swap device. */
>> +	offset = offset + (1 << page_cluster);
>> +	if (offset > si->max)
>> +		offset = si->max;
>> +	*end = offset;
>>   	spin_unlock(&swap_lock);
>> -
>> -	/*
>> -	 * Indicate starting offset, and return number of pages to get:
>> -	 * if only 1, say 0, since there's then no readahead to be done.
>> -	 */
>> -	*offset = ++toff;
>> -	return nr_pages? ++nr_pages: 0;
>>   }
>
> This section deletes code which is nice but there is a
> problem. Your changelog says that this is duplicating the effort of
> read_swap_cache_async() which is true but what it does is
>
> 1. a swap cache lookup which will probably fail
> 2. alloc_page_vma()
> 3. radix_tree_preload()
> 4. swapcache_prepare
>     - calls __swap_duplicate()
>     - finds the hole, bails out
>
> That's a lot of work before the hole is found. Would it be worth
> doing a racy check in swapin_readahead without swap lock held before
> calling read_swap_cache_async()?

The problem is that without the swap_lock held, the swap_info
struct may disappear completely because of the swapin_readahead
happening concurrently with a swapoff.

I suspect that the CPU time spent doing 1-4 above will be
negligible compared to the amount of time spent doing disk IO,
but if there turns out to be a problem it should be possible
to move the swap hole identification closer to the top of
read_swap_cache_async().


* Re: [PATCH -mm] make swapin readahead skip over holes
  2012-01-11 19:17           ` Christoph Lameter
@ 2012-01-11 21:03             ` John Stoffel
  2012-01-11 22:25               ` KOSAKI Motohiro
  0 siblings, 1 reply; 14+ messages in thread
From: John Stoffel @ 2012-01-11 21:03 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: John Stoffel, KOSAKI Motohiro, Rik van Riel, linux-mm,
	linux-kernel, akpm, Mel Gorman, Johannes Weiner

>>>>> "Christoph" == Christoph Lameter <cl@linux.com> writes:

Christoph> On Wed, 11 Jan 2012, John Stoffel wrote:
KOSAKI> so we can eat a free lunch of up to 7MB ~= 60(MB/sec) * 1000 / 8.5(ms).
>> 
>> What if the disk is busy doing other writeout or readin during this
>> time?  You can't assume you have the full disk bandwidth available,
>> especially when you hit a swap storm like this.

Christoph> The assumptions by Kosaki are quite conservative.

Just checking

Christoph> What if one did not get a disk from the garbage heap but
Christoph> instead has a state of the art storage cluster or simply an
Christoph> SSD (in particular relevant now since HDs are in short
Christoph> supply given the situation in Asia)?

I don't know, I was just trying to make sure he thinks about disks
which are slower than he expects, since there are lots of them still
out there.  

John



* Re: [PATCH -mm] make swapin readahead skip over holes
  2012-01-11 21:03             ` John Stoffel
@ 2012-01-11 22:25               ` KOSAKI Motohiro
  0 siblings, 0 replies; 14+ messages in thread
From: KOSAKI Motohiro @ 2012-01-11 22:25 UTC (permalink / raw)
  To: John Stoffel
  Cc: Christoph Lameter, Rik van Riel, linux-mm, linux-kernel, akpm,
	Mel Gorman, Johannes Weiner

> Christoph> The assumptions by Kosaki are quite conservative.
>
> Just checking
>
> Christoph> What if one did not get a disk from the garbage heap but
> Christoph> instead has a state of the art storage cluster or simply an
> Christoph> SSD (in particular relevant now since HDs are in short
> Christoph> supply given the situation in Asia)?
>
> I don't know, I was just trying to make sure he thinks about disks
> which are slower than he expects, since there are lots of them still
> out there.

If you have a rotating disk, the bottleneck is almost always IOPS, not
disk bandwidth. At least when a system is swapping in, I can't imagine
it being bottlenecked on disk bandwidth. Therefore we can eat a free
lunch if and only if we don't increase the number of IOs.

Conversely, if you have much richer IO devices, it's simpler: you don't
need to worry about a few MB/s of swap IO.

* Re: [PATCH -mm] make swapin readahead skip over holes
  2012-01-11  7:14     ` KOSAKI Motohiro
  2012-01-11  8:01       ` KOSAKI Motohiro
@ 2012-01-11 23:15       ` Andrew Morton
  1 sibling, 0 replies; 14+ messages in thread
From: Andrew Morton @ 2012-01-11 23:15 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Rik van Riel, linux-mm, linux-kernel, Mel Gorman, Johannes Weiner

On Wed, 11 Jan 2012 02:14:32 -0500
KOSAKI Motohiro <kosaki.motohiro@gmail.com> wrote:

> > Another factor is that swapping on modern systems is often a
> > temporary thing. During a load spike, things get swapped out
> > and run slowly. After the load spike is over, or some memory
> > hog process got killed, we want the system to recover to normal
> > performance as soon as possible.  This often involves swapping
> > everything back into memory.
> 
> Hmmm.... OK, I have to agree with this.
> But if so, skipping holes is not the best way. I think we should always
> make one big IO, even if the swap cluster has some holes. One big IO is
> usually faster than multiple small IOs, isn't it?

Not necessarily.  If we have two requests at the disk for blocks 0-3
and 8-11, one would hope that the disk is smart enough to read both
ranges within a single rotation.

If the kernel were to recognise this situation and request the entire
12 blocks then we'd see lower command overhead but higher transfer
costs.
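
Back of the envelope, with the 60MB/sec figure from earlier in the
thread and 4KB pages, the extra transfer cost is tiny next to a seek:

#include <stdio.h>

int main(void)
{
	double throughput_kb = 60.0 * 1024;	/* ~60MB/sec in KB/sec */
	double extra_kb = 4 * 4;		/* blocks 4-7: four 4KB pages */

	/* Extra transfer time, vs ~8.5ms for an extra seek. */
	printf("extra transfer: %.2f ms\n",
	       extra_kb / throughput_kb * 1000.0);
	return 0;
}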

Still, Rik's testing shows that either approach would be superior to
what we have at present, which is to not read blocks 8-11 at all!


It sounds like Rik will be doing a v2 with some minor updates?


* Re: [PATCH -mm] make swapin readahead skip over holes
  2012-01-11 19:23   ` Rik van Riel
@ 2012-01-12 14:33     ` Mel Gorman
  0 siblings, 0 replies; 14+ messages in thread
From: Mel Gorman @ 2012-01-12 14:33 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-mm, linux-kernel, akpm, KOSAKI Motohiro, Johannes Weiner

On Wed, Jan 11, 2012 at 02:23:58PM -0500, Rik van Riel wrote:
> On 01/11/2012 11:51 AM, Mel Gorman wrote:
> >On Mon, Jan 09, 2012 at 06:10:23PM -0500, Rik van Riel wrote:
> 
> >>+	get_swap_cluster(entry, &offset, &end_offset);
> >>+
> >>+	for (; offset <= end_offset ; offset++) {
> >>  		/* Ok, do the async read-ahead now */
> >>  		page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
> >>  						gfp_mask, vma, addr);
> >>  		if (!page)
> >>-			break;
> >>+			continue;
> >>  		page_cache_release(page);
> >>  	}
> >
> >For a heavily fragmented swap file, this will result in more IO and
> >the gamble is that pages nearby are needed soon. You say your virtual
> >machines swapin faster and that does not surprise me. I also expect
> >they need the data so it's a net win.
> 
> More IO operations, yes.  However, IO operations from nearby
> blocks can often be done without incurring extra disk seeks,
> because the disk head is already in the right place.
> 

I understand that, but the result still has to go somewhere in memory.
The cost in terms of IO speed might be effectively zero, but it
still consumes memory resources.

> This seems to be borne out by the fact that I saw swapin
> rates increase from maybe 200-300kB/s to 5-15MB/s...
> 
> Even on some SSDs it could avoid some bank switches, though
> of course there I would expect the effect to be much less
> pronounced.
> 
> >There is a possibility that, under memory pressure, swapping in
> >more pages will cause more memory pressure (increased swapin causing
> >clean page cache discards and pageout) and be an overall loss. This may
> >be a net loss in some cases, such as where the working set size is just
> >over physical memory and the increased swapin causes a problem. I doubt
> >this case is common, but it is worth bearing in mind if future bug
> >reports complain about increased swap activity.
> 
> True, there may be workloads that benefit from a smaller
> page-cluster. The fact that the recently swapped in pages
> are all put on the inactive anon list should help protect
> the working set, too.
> 

True to some extent.

> Another alternative may be a time based decision. If we
> have swapped something out recently, go with a less
> aggressive swapin readahead.
> 

Maybe, but while I know I brought up the problem that swapin might
increase swapout, it would be best to have an example of how that can
happen before looking at heuristics.

> That would automatically give us fast swapin readahead
> when in "memory hog just exited, let the system recover"
> mode, and conservative swapin readahead in your situation.
> 
> However, that could still hurt badly if the system is just
> moving the working set from one part of a program to another.
> 
> I suspect we will be faster off by having faster swap IO,
> which this patch seems to provide.
> 

More than likely.

> >>-	si = swap_info[swp_type(entry)];
> >>-	target = swp_offset(entry);
> >>-	base = (target >> our_page_cluster) << our_page_cluster;
> >>-	end = base + (1 << our_page_cluster);
> >>-	if (!base)		/* first page is swap header */
> >>-		base++;
> 
> >>+	si = swap_info[swp_type(entry)];
> >>+	/* Round the begin down to a page_cluster boundary. */
> >>+	offset = (offset >> page_cluster) << page_cluster;
> >
> >Minor nit but it would feel more natural to me to see
> >
> >offset & ~((1 << page_cluster) - 1)
> >
> >but I understand that you are reusing the existing code.
> 
> Sure, I can do that.
> 
> While I'm there, I can also add that if (!base) base++
> thing back in :)
> 

Whoops, I missed that!

> >>+	*begin = offset;
> >>+	/* Round the end up, but not beyond the end of the swap device. */
> >>+	offset = offset + (1 << page_cluster);
> >>+	if (offset > si->max)
> >>+		offset = si->max;
> >>+	*end = offset;
> >>  	spin_unlock(&swap_lock);
> >>-
> >>-	/*
> >>-	 * Indicate starting offset, and return number of pages to get:
> >>-	 * if only 1, say 0, since there's then no readahead to be done.
> >>-	 */
> >>-	*offset = ++toff;
> >>-	return nr_pages? ++nr_pages: 0;
> >>  }
> >
> >This section deletes code which is nice but there is a
> >problem. Your changelog says that this is duplicating the effort of
> >read_swap_cache_async() which is true but what it does is
> >
> >1. a swap cache lookup which will probably fail
> >2. alloc_page_vma()
> >3. radix_tree_preload()
> >4. swapcache_prepare
> >    - calls __swap_duplicate()
> >    - finds the hole, bails out
> >
> >That's a lot of work before the hole is found. Would it be worth
> >doing a racy check in swapin_readahead without swap lock held before
> >calling read_swap_cache_async()?
> 
> The problem is that without the swap_lock held, the swap_info
> struct may disappear completely because of the swapin_readahead
> happening concurrently with a swapoff.
> 

hmm, I considered that when I wrote the suggestion and was making
an assumption that the presence of swap pages in the area would
effectively pin the swap_info struct.

I did think of the race where swapoff is operating on the very last page
that swapin is reading and removes the map at the same time. We make a
racy check against effectively random memory and one of three results
might happen:

1. DEBUG_PAGEALLOC was enabled and we blow up; ok, this one is bad, if
   somewhat of a corner case

2. False positive count - we fall through and try to read the page,
   find the swap page no longer exists and bail gracefully

3. False zero count - we do not read in the swap page. It's not a big
   deal if swapin_readahead fails to read in a page it should have
   (swapoff has brought it in anyway).

Basically, I did not see any major downside to doing a racy check. If
1 above is a serious problem, then yes, we have to take the swap_lock,
but that's still cheaper than allocating needlessly.

> I suspect that the CPU time spent doing 1-4 above will be
> negligible compared to the amount of time spent doing disk IO,

Negligible in comparison to the IO, sure, but it still seems like we
could be doing a lot of allocation work (alloc_page_vma and
radix_tree_preload) for no good reason.

> but if there turns out to be a problem it should be possible
> to move the swap hole identification closer to the top of
> read_swap_cache_async().
> 

There is that; it'll be tricky to spot that we're needlessly allocating
pages, though. Not the end of the world; what you have still improves
things.

-- 
Mel Gorman
SUSE Labs

