* [PATCH 0/2] swap: improve swap I/O rate
@ 2012-05-14 11:58 ehrhardt
  2012-05-14 11:58 ` [PATCH 1/2] swap: allow swap readahead to be merged ehrhardt
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: ehrhardt @ 2012-05-14 11:58 UTC (permalink / raw)
  To: linux-mm; +Cc: axboe, Ehrhardt Christian

From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>

In a memory overcommitment scenario with KVM I ran into a lot of waits for
swap. While checking the I/O done on the swap disks I found almost all I/Os
to be done as single page 4k requests, despite the fact that swap-in is a
batch of 1<<page-cluster pages via swap readahead and swap-out is a list of
pages written in shrink_page_list.

[1/2 swap-in improvement]
The readahead patch shows improvements of up to 50% in swap throughput and
much happier guest systems; even when running at comparable throughput it
saves a lot of I/Os per second, leaving resources in the SAN for other
consumers.

[2/2 documentation]
While doing so I also realized that the documentation for
/proc/sys/vm/page-cluster no longer matches the code.

[missing patch #3]
I tried to get a similar patch working for swap-out in shrink_page_list. It
worked in functional terms, but the additional merging was negligible.
Maybe the cond_resched triggers much more often than I expected; I'm open to
suggestions regarding improving the pageout I/O sizes as well. A minimal
sketch of that experiment is shown below.
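
For illustration, a minimal sketch of what such a swap-out plugging
experiment could look like. The helper below is hypothetical - it is not the
patch that was tried, and the real shrink_page_list() is far more involved -
but it shows the blk_start_plug()/blk_finish_plug() pattern around a
writeout batch:

#include <linux/blkdev.h>
#include <linux/list.h>
#include <linux/mm.h>

/* Hypothetical sketch: plug around the whole writeout batch so the
 * individual 4k swap writes get a chance to merge before dispatch. */
static void writeout_batch_plugged(struct list_head *page_list)
{
	struct page *page, *next;
	struct blk_plug plug;

	blk_start_plug(&plug);
	list_for_each_entry_safe(page, next, page_list, lru) {
		/* ... pageout()/swap write submission would happen here ... */
	}
	blk_finish_plug(&plug);
}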

Kind regards,
Christian Ehrhardt


Christian Ehrhardt (2):
  swap: allow swap readahead to be merged
  documentation: update how page-cluster affects swap I/O

 Documentation/sysctl/vm.txt |   12 ++++++++++--
 mm/swap_state.c             |    5 +++++
 2 files changed, 15 insertions(+), 2 deletions(-)


* [PATCH 1/2] swap: allow swap readahead to be merged
  2012-05-14 11:58 [PATCH 0/2] swap: improve swap I/O rate ehrhardt
@ 2012-05-14 11:58 ` ehrhardt
  2012-05-15  4:38   ` Minchan Kim
  2012-05-15 17:43   ` Rik van Riel
  2012-05-14 11:58 ` [PATCH 2/2] documentation: update how page-cluster affects swap I/O ehrhardt
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 14+ messages in thread
From: ehrhardt @ 2012-05-14 11:58 UTC (permalink / raw)
  To: linux-mm; +Cc: axboe, Christian Ehrhardt

From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>

Swap readahead works fine, but the I/O to disk is almost always done in
page-sized requests, despite the fact that readahead submits 1<<page-cluster
pages at a time.
On older kernels the old per-device plugging behavior might have captured
this and merged the requests, but currently it all comes down to many more
I/Os than required.

On a single device this might not be an issue, but as soon as a server runs
on shared SAN resources, saving I/Os not only improves swap-in throughput but
also lowers resource utilization.

With a load running KVM under heavy memory overcommitment (the hot memory
is 1.5 times the host memory), swapping throughput improves significantly
and the load feels more responsive as well as achieving more throughput.

In a test setup with 16 swap disks, running blktrace on one of those disks
shows the improved merging:
Prior:
Reads Queued:     560,888,    2,243MiB  Writes Queued:     226,242,  904,968KiB
Read Dispatches:  544,701,    2,243MiB  Write Dispatches:  159,318,  904,968KiB
Reads Requeued:         0               Writes Requeued:         0
Reads Completed:  544,716,    2,243MiB  Writes Completed:  159,321,  904,980KiB
Read Merges:       16,187,   64,748KiB  Write Merges:       61,744,  246,976KiB
IO unplugs:       149,614               Timer unplugs:       2,940

With the patch:
Reads Queued:     734,315,    2,937MiB  Writes Queued:     300,188,    1,200MiB
Read Dispatches:  214,972,    2,937MiB  Write Dispatches:  215,176,    1,200MiB
Reads Requeued:         0               Writes Requeued:         0
Reads Completed:  214,971,    2,937MiB  Writes Completed:  215,177,    1,200MiB
Read Merges:      519,343,    2,077MiB  Write Merges:       73,325,  293,300KiB
IO unplugs:       337,130               Timer unplugs:      11,184

Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
---
 mm/swap_state.c |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/mm/swap_state.c b/mm/swap_state.c
index 4c5ff7f..c85b559 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -14,6 +14,7 @@
 #include <linux/init.h>
 #include <linux/pagemap.h>
 #include <linux/backing-dev.h>
+#include <linux/blkdev.h>
 #include <linux/pagevec.h>
 #include <linux/migrate.h>
 #include <linux/page_cgroup.h>
@@ -376,6 +377,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	unsigned long offset = swp_offset(entry);
 	unsigned long start_offset, end_offset;
 	unsigned long mask = (1UL << page_cluster) - 1;
+	struct blk_plug plug;
 
 	/* Read a page_cluster sized and aligned cluster around offset. */
 	start_offset = offset & ~mask;
@@ -383,6 +385,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	if (!start_offset)	/* First page is swap header. */
 		start_offset++;
 
+	blk_start_plug(&plug);
 	for (offset = start_offset; offset <= end_offset ; offset++) {
 		/* Ok, do the async read-ahead now */
 		page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
@@ -391,6 +394,8 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 			continue;
 		page_cache_release(page);
 	}
+	blk_finish_plug(&plug);
+
 	lru_add_drain();	/* Push any new pages onto the LRU now */
 	return read_swap_cache_async(entry, gfp_mask, vma, addr);
 }
-- 
1.7.0.4


* [PATCH 2/2] documentation: update how page-cluster affects swap I/O
  2012-05-14 11:58 [PATCH 0/2] swap: improve swap I/O rate ehrhardt
  2012-05-14 11:58 ` [PATCH 1/2] swap: allow swap readahead to be merged ehrhardt
@ 2012-05-14 11:58 ` ehrhardt
  2012-05-15  4:48   ` Minchan Kim
  2012-05-15  4:59 ` [PATCH 0/2] swap: improve swap I/O rate Minchan Kim
  2012-05-15 18:24 ` Jens Axboe
  3 siblings, 1 reply; 14+ messages in thread
From: ehrhardt @ 2012-05-14 11:58 UTC (permalink / raw)
  To: linux-mm; +Cc: axboe, Christian Ehrhardt

From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>

Fix the documentation of /proc/sys/vm/page-cluster to match the behavior of
the code and add some comments about what the tunable changes in that
behavior.

Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
---
 Documentation/sysctl/vm.txt |   12 ++++++++++--
 1 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 96f0ee8..4d87dc0 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -574,16 +574,24 @@ of physical RAM.  See above.
 
 page-cluster
 
-page-cluster controls the number of pages which are written to swap in
-a single attempt.  The swap I/O size.
+page-cluster controls the number of pages up to which consecutive pages (if
+available) are read in from swap in a single attempt. This is the swap
+counterpart to page cache readahead.
+The mentioned consecutivity is not in terms of virtual/physical addresses,
+but consecutive on swap space - that means they were swapped out together.
 
 It is a logarithmic value - setting it to zero means "1 page", setting
 it to 1 means "2 pages", setting it to 2 means "4 pages", etc.
+Zero disables swap readahead completely.
 
 The default value is three (eight pages at a time).  There may be some
 small benefits in tuning this to a different value if your workload is
 swap-intensive.
 
+Lower values mean lower latencies for initial faults, but at the same time
+extra faults and I/O delays for following faults if they would have been part of
+that consecutive pages readahead would have brought in.
+
 =============================================================
 
 panic_on_oom
-- 
1.7.0.4
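
For illustration only (not part of the patch): a small user-space sketch of
the logarithmic mapping described above. The path /proc/sys/vm/page-cluster
is the standard sysctl file; the program itself is purely illustrative.

#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/sys/vm/page-cluster", "r");
	int cluster;

	if (!f) {
		perror("/proc/sys/vm/page-cluster");
		return 1;
	}
	if (fscanf(f, "%d", &cluster) != 1) {
		fclose(f);
		return 1;
	}
	fclose(f);

	/* 0 -> 1 page (readahead effectively disabled),
	 * 3 (the default) -> 8 pages per swap-in attempt */
	printf("page-cluster=%d => up to %d pages per swap-in attempt\n",
	       cluster, 1 << cluster);
	return 0;
}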


* Re: [PATCH 1/2] swap: allow swap readahead to be merged
  2012-05-14 11:58 ` [PATCH 1/2] swap: allow swap readahead to be merged ehrhardt
@ 2012-05-15  4:38   ` Minchan Kim
  2012-05-15 17:43   ` Rik van Riel
  1 sibling, 0 replies; 14+ messages in thread
From: Minchan Kim @ 2012-05-15  4:38 UTC (permalink / raw)
  To: ehrhardt; +Cc: linux-mm, axboe, Hugh Dickins, Rik van Riel

On 05/14/2012 08:58 PM, ehrhardt@linux.vnet.ibm.com wrote:

> From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
> 
> Swap readahead works fine, but the I/O to disk is almost always done in
> page-sized requests, despite the fact that readahead submits 1<<page-cluster
> pages at a time.
> On older kernels the old per-device plugging behavior might have captured
> this and merged the requests, but currently it all comes down to many more
> I/Os than required.
> 
> On a single device this might not be an issue, but as soon as a server runs
> on shared SAN resources, saving I/Os not only improves swap-in throughput but
> also lowers resource utilization.
> 
> With a load running KVM under heavy memory overcommitment (the hot memory
> is 1.5 times the host memory), swapping throughput improves significantly
> and the load feels more responsive as well as achieving more throughput.
> 
> In a test setup with 16 swap disks, running blktrace on one of those disks
> shows the improved merging:
> Prior:
> Reads Queued:     560,888,    2,243MiB  Writes Queued:     226,242,  904,968KiB
> Read Dispatches:  544,701,    2,243MiB  Write Dispatches:  159,318,  904,968KiB
> Reads Requeued:         0               Writes Requeued:         0
> Reads Completed:  544,716,    2,243MiB  Writes Completed:  159,321,  904,980KiB
> Read Merges:       16,187,   64,748KiB  Write Merges:       61,744,  246,976KiB
> IO unplugs:       149,614               Timer unplugs:       2,940
> 
> With the patch:
> Reads Queued:     734,315,    2,937MiB  Writes Queued:     300,188,    1,200MiB
> Read Dispatches:  214,972,    2,937MiB  Write Dispatches:  215,176,    1,200MiB
> Reads Requeued:         0               Writes Requeued:         0
> Reads Completed:  214,971,    2,937MiB  Writes Completed:  215,177,    1,200MiB
> Read Merges:      519,343,    2,077MiB  Write Merges:       73,325,  293,300KiB
> IO unplugs:       337,130               Timer unplugs:      11,184
> 
> Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>

Reviewed-by: Minchan Kim <minchan@kernel.org>

It does make sense to me.

> ---
>  mm/swap_state.c |    5 +++++
>  1 files changed, 5 insertions(+), 0 deletions(-)
> 
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 4c5ff7f..c85b559 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -14,6 +14,7 @@
>  #include <linux/init.h>
>  #include <linux/pagemap.h>
>  #include <linux/backing-dev.h>
> +#include <linux/blkdev.h>
>  #include <linux/pagevec.h>
>  #include <linux/migrate.h>
>  #include <linux/page_cgroup.h>
> @@ -376,6 +377,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
>  	unsigned long offset = swp_offset(entry);
>  	unsigned long start_offset, end_offset;
>  	unsigned long mask = (1UL << page_cluster) - 1;
> +	struct blk_plug plug;
>  
>  	/* Read a page_cluster sized and aligned cluster around offset. */
>  	start_offset = offset & ~mask;
> @@ -383,6 +385,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
>  	if (!start_offset)	/* First page is swap header. */
>  		start_offset++;
>  
> +	blk_start_plug(&plug);
>  	for (offset = start_offset; offset <= end_offset ; offset++) {
>  		/* Ok, do the async read-ahead now */
>  		page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
> @@ -391,6 +394,8 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
>  			continue;
>  		page_cache_release(page);
>  	}
> +	blk_finish_plug(&plug);
> +
>  	lru_add_drain();	/* Push any new pages onto the LRU now */
>  	return read_swap_cache_async(entry, gfp_mask, vma, addr);
>  }



-- 
Kind regards,
Minchan Kim


* Re: [PATCH 2/2] documentation: update how page-cluster affects swap I/O
  2012-05-14 11:58 ` [PATCH 2/2] documentation: update how page-cluster affects swap I/O ehrhardt
@ 2012-05-15  4:48   ` Minchan Kim
  2012-05-21  7:24     ` Christian Ehrhardt
  0 siblings, 1 reply; 14+ messages in thread
From: Minchan Kim @ 2012-05-15  4:48 UTC (permalink / raw)
  To: ehrhardt; +Cc: linux-mm, axboe, Rik van Riel, Hugh Dickins, Andrew Morton

On 05/14/2012 08:58 PM, ehrhardt@linux.vnet.ibm.com wrote:

> From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
> 
> Fix the documentation of /proc/sys/vm/page-cluster to match the behavior of
> the code and add some comments about what the tunable changes in that
> behavior.
> 
> Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
> ---
>  Documentation/sysctl/vm.txt |   12 ++++++++++--
>  1 files changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> index 96f0ee8..4d87dc0 100644
> --- a/Documentation/sysctl/vm.txt
> +++ b/Documentation/sysctl/vm.txt
> @@ -574,16 +574,24 @@ of physical RAM.  See above.
>  
>  page-cluster
>  
> -page-cluster controls the number of pages which are written to swap in
> -a single attempt.  The swap I/O size.
> +page-cluster controls the number of pages up to which consecutive pages (if
> +available) are read in from swap in a single attempt. This is the swap


"If available" would be wrong in next kernel because recently Rik submit following patch,

mm: make swapin readahead skip over holes
http://marc.info/?l=linux-mm&m=132743264912987&w=4


> +counterpart to page cache readahead.
> +The mentioned consecutivity is not in terms of virtual/physical addresses,
> +but consecutive on swap space - that means they were swapped out together.
>  
>  It is a logarithmic value - setting it to zero means "1 page", setting
>  it to 1 means "2 pages", setting it to 2 means "4 pages", etc.
> +Zero disables swap readahead completely.
>  
>  The default value is three (eight pages at a time).  There may be some
>  small benefits in tuning this to a different value if your workload is
>  swap-intensive.
>  
> +Lower values mean lower latencies for initial faults, but at the same time
> +extra faults and I/O delays for following faults if they would have been part of
> +that consecutive pages readahead would have brought in.
> +
>  =============================================================
>  
>  panic_on_oom


Otherwise, looks good to me.

-- 
Kind regards,
Minchan Kim


* Re: [PATCH 0/2] swap: improve swap I/O rate
  2012-05-14 11:58 [PATCH 0/2] swap: improve swap I/O rate ehrhardt
  2012-05-14 11:58 ` [PATCH 1/2] swap: allow swap readahead to be merged ehrhardt
  2012-05-14 11:58 ` [PATCH 2/2] documentation: update how page-cluster affects swap I/O ehrhardt
@ 2012-05-15  4:59 ` Minchan Kim
  2012-05-21  7:51   ` Christian Ehrhardt
  2012-05-15 18:24 ` Jens Axboe
  3 siblings, 1 reply; 14+ messages in thread
From: Minchan Kim @ 2012-05-15  4:59 UTC (permalink / raw)
  To: ehrhardt; +Cc: linux-mm, axboe, Andrew Morton, Hugh Dickins, Rik van Riel

On 05/14/2012 08:58 PM, ehrhardt@linux.vnet.ibm.com wrote:

> From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
> 
> In a memory overcommitment scenario with KVM I ran into a lot of waits for
> swap. While checking the I/O done on the swap disks I found almost all I/Os
> to be done as single page 4k requests, despite the fact that swap-in is a
> batch of 1<<page-cluster pages via swap readahead and swap-out is a list of
> pages written in shrink_page_list.
> 
> [1/2 swap-in improvement]
> The readahead patch shows improvements of up to 50% in swap throughput and
> much happier guest systems; even when running at comparable throughput it
> saves a lot of I/Os per second, leaving resources in the SAN for other
> consumers.
> 
> [2/2 documentation]
> While doing so I also realized that the documentation for
> /proc/sys/vm/page-cluster no longer matches the code.
> 
> [missing patch #3]
> I tried to get a similar patch working for swap-out in shrink_page_list. It
> worked in functional terms, but the additional merging was negligible.


I think we have already done it.
Look at shrink_mem_cgroup_zone, which ends up calling shrink_page_list, so we
have already applied I/O plugging.

> Maybe the cond_resched triggers much more often than I expected; I'm open to
> suggestions regarding improving the pageout I/O sizes as well.


We could enhance write-out by batching, as ext4_bio_write_page does.
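
For illustration, a rough sketch of that general idea using the ~3.4-era bio
API: collect several consecutive swap pages into one write bio instead of
submitting one bio per page. The helper and its calling convention are
invented for this example; it is neither ext4 code nor a proposed patch.

#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/mm.h>

/* Hypothetical: submit 'nr' consecutive pages as a single write bio. */
static void submit_swap_write_batch(struct block_device *bdev,
				    sector_t sector, struct page **pages,
				    int nr, bio_end_io_t *end_io)
{
	struct bio *bio = bio_alloc(GFP_NOIO, nr);
	int i;

	bio->bi_bdev = bdev;
	bio->bi_sector = sector;
	bio->bi_end_io = end_io;

	for (i = 0; i < nr; i++) {
		/* stop at the first page the bio cannot take */
		if (bio_add_page(bio, pages[i], PAGE_SIZE, 0) < PAGE_SIZE)
			break;
	}

	submit_bio(WRITE, bio);
}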

> 
> Kind regards,
> Christian Ehrhardt
> 
> 
> Christian Ehrhardt (2):
>   swap: allow swap readahead to be merged
>   documentation: update how page-cluster affects swap I/O
> 
>  Documentation/sysctl/vm.txt |   12 ++++++++++--
>  mm/swap_state.c             |    5 +++++
>  2 files changed, 15 insertions(+), 2 deletions(-)
> 



-- 
Kind regards,
Minchan Kim


* Re: [PATCH 1/2] swap: allow swap readahead to be merged
  2012-05-14 11:58 ` [PATCH 1/2] swap: allow swap readahead to be merged ehrhardt
  2012-05-15  4:38   ` Minchan Kim
@ 2012-05-15 17:43   ` Rik van Riel
  1 sibling, 0 replies; 14+ messages in thread
From: Rik van Riel @ 2012-05-15 17:43 UTC (permalink / raw)
  To: ehrhardt; +Cc: linux-mm, axboe

On 05/14/2012 07:58 AM, ehrhardt@linux.vnet.ibm.com wrote:
> From: Christian Ehrhardt<ehrhardt@linux.vnet.ibm.com>
>
> Swap readahead works fine, but the I/O to disk is almost always done in
> page-sized requests, despite the fact that readahead submits 1<<page-cluster
> pages at a time.
> On older kernels the old per-device plugging behavior might have captured
> this and merged the requests, but currently it all comes down to many more
> I/Os than required.
>
> On a single device this might not be an issue, but as soon as a server runs
> on shared SAN resources, saving I/Os not only improves swap-in throughput but
> also lowers resource utilization.

> Signed-off-by: Christian Ehrhardt<ehrhardt@linux.vnet.ibm.com>

Acked-by: Rik van Riel <riel@redhat.com>


* Re: [PATCH 0/2] swap: improve swap I/O rate
  2012-05-14 11:58 [PATCH 0/2] swap: improve swap I/O rate ehrhardt
                   ` (2 preceding siblings ...)
  2012-05-15  4:59 ` [PATCH 0/2] swap: improve swap I/O rate Minchan Kim
@ 2012-05-15 18:24 ` Jens Axboe
  3 siblings, 0 replies; 14+ messages in thread
From: Jens Axboe @ 2012-05-15 18:24 UTC (permalink / raw)
  To: ehrhardt; +Cc: linux-mm

On 2012-05-14 13:58, ehrhardt@linux.vnet.ibm.com wrote:
> From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
> 
> In a memory overcommitment scenario with KVM I ran into a lot of waits for
> swap. While checking the I/O done on the swap disks I found almost all I/Os
> to be done as single page 4k requests, despite the fact that swap-in is a
> batch of 1<<page-cluster pages via swap readahead and swap-out is a list of
> pages written in shrink_page_list.
> 
> [1/2 swap-in improvement]
> The readahead patch shows improvements of up to 50% in swap throughput and
> much happier guest systems; even when running at comparable throughput it
> saves a lot of I/Os per second, leaving resources in the SAN for other
> consumers.
> 
> [2/2 documentation]
> While doing so I also realized that the documentation for
> /proc/sys/vm/page-cluster no longer matches the code.
> 
> [missing patch #3]
> I tried to get a similar patch working for swap-out in shrink_page_list. It
> worked in functional terms, but the additional merging was negligible.
> Maybe the cond_resched triggers much more often than I expected; I'm open to
> suggestions regarding improving the pageout I/O sizes as well.
> 
> Kind regards,
> Christian Ehrhardt
> 
> 
> Christian Ehrhardt (2):
>   swap: allow swap readahead to be merged
>   documentation: update how page-cluster affects swap I/O

Looks good to me, you can add my acked-by to both of them.

-- 
Jens Axboe


* Re: [PATCH 2/2] documentation: update how page-cluster affects swap I/O
  2012-05-15  4:48   ` Minchan Kim
@ 2012-05-21  7:24     ` Christian Ehrhardt
  0 siblings, 0 replies; 14+ messages in thread
From: Christian Ehrhardt @ 2012-05-21  7:24 UTC (permalink / raw)
  To: Minchan Kim; +Cc: linux-mm, axboe, Rik van Riel, Hugh Dickins, Andrew Morton



On 05/15/2012 06:48 AM, Minchan Kim wrote:
> On 05/14/2012 08:58 PM, ehrhardt@linux.vnet.ibm.com wrote:
>
>> From: Christian Ehrhardt<ehrhardt@linux.vnet.ibm.com>
>>
>> Fix of the documentation of /proc/sys/vm/page-cluster to match the behavior of
>> the code and add some comments about what the tunable will change in that
>> behavior.
>>
>> Signed-off-by: Christian Ehrhardt<ehrhardt@linux.vnet.ibm.com>
>> ---
>>   Documentation/sysctl/vm.txt |   12 ++++++++++--
>>   1 files changed, 10 insertions(+), 2 deletions(-)
>>
>> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
>> index 96f0ee8..4d87dc0 100644
>> --- a/Documentation/sysctl/vm.txt
>> +++ b/Documentation/sysctl/vm.txt
>> @@ -574,16 +574,24 @@ of physical RAM.  See above.
>>
>>   page-cluster
>>
>> -page-cluster controls the number of pages which are written to swap in
>> -a single attempt.  The swap I/O size.
>> +page-cluster controls the number of pages up to which consecutive pages (if
>> +available) are read in from swap in a single attempt. This is the swap
>
>
> "If available" would be wrong in next kernel because recently Rik submit following patch,
>
> mm: make swapin readahead skip over holes
> http://marc.info/?l=linux-mm&m=132743264912987&w=4
>
>

You're right - it's not severely wrong, but if we are fixing the
documentation we can do it right.
I'll send a 2nd version of the patch series with this adapted and all
the acks I got so far added.

-- 

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance


* Re: [PATCH 0/2] swap: improve swap I/O rate
  2012-05-15  4:59 ` [PATCH 0/2] swap: improve swap I/O rate Minchan Kim
@ 2012-05-21  7:51   ` Christian Ehrhardt
  2012-05-21  8:46     ` Minchan Kim
  0 siblings, 1 reply; 14+ messages in thread
From: Christian Ehrhardt @ 2012-05-21  7:51 UTC (permalink / raw)
  To: Minchan Kim; +Cc: linux-mm, axboe, Andrew Morton, Hugh Dickins, Rik van Riel

[...]

>> [missing patch #3]
>> I tried to get a similar patch working for swap-out in shrink_page_list. It
>> worked in functional terms, but the additional merging was negligible.
>
>
> I think we have already done it.
> Look at shrink_mem_cgroup_zone, which ends up calling shrink_page_list, so we
> have already applied I/O plugging.
>

I saw that code and it is part of the kernel I used to test my patches.
But despite that code and my additional experiments with plug/unplug in
shrink_page_list, the effective I/O size of swap writes stays at almost 4k.

So far I can tell you that the plugs in shrink_page_list and
shrink_mem_cgroup_zone aren't sufficient - at least for my case.
You saw the blktrace summaries in my first mail; an excerpt of a write
submission stream looks like this:

  94,4   10      465     0.023520923   116  A   W 28868648 + 8 <- (94,5) 28868456
  94,5   10      466     0.023521173   116  Q   W 28868648 + 8 [kswapd0]
  94,5   10      467     0.023522048   116  G   W 28868648 + 8 [kswapd0]
  94,5   10      468     0.023522235   116  P   N [kswapd0]
  94,5   10      469     0.023759892   116  I   W 28868648 + 8 ( 237844) [kswapd0]
  94,5   10      470     0.023760079   116  U   N [kswapd0] 1
  94,5   10      471     0.023760360   116  D   W 28868648 + 8 ( 468) [kswapd0]
  94,4   10      472     0.023891235   116  A   W 28868656 + 8 <- (94,5) 28868464
  94,5   10      473     0.023891454   116  Q   W 28868656 + 8 [kswapd0]
  94,5   10      474     0.023892110   116  G   W 28868656 + 8 [kswapd0]
  94,5   10      475     0.023944610   116  I   W 28868656 + 8 ( 52500) [kswapd0]
  94,5   10      476     0.023944735   116  U   N [kswapd0] 1
  94,5   10      477     0.023944892   116  D   W 28868656 + 8 ( 282) [kswapd0]
  94,5   16       19     0.024023192 16033  C   W 28868648 + 8 ( 262832) [0]
  94,5   24       37     0.024196752 14526  C   W 28868656 + 8 ( 251860) [0]
[...]

But we can split this discussion from my other two patches and I would 
be happy to provide my test environment for further tests if there are 
new suggestions/patches/...

>> Maybe the cond_resched triggers much more often than I expected; I'm open to
>> suggestions regarding improving the pageout I/O sizes as well.
>
>
> We could enhance write-out by batching, as ext4_bio_write_page does.
>

Do you mean the changes brought by "bd2d0210 ext4: use bio layer instead 
of buffer layer in mpage_da_submit_io" ?



-- 

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance


* Re: [PATCH 0/2] swap: improve swap I/O rate
  2012-05-21  7:51   ` Christian Ehrhardt
@ 2012-05-21  8:46     ` Minchan Kim
  0 siblings, 0 replies; 14+ messages in thread
From: Minchan Kim @ 2012-05-21  8:46 UTC (permalink / raw)
  To: Christian Ehrhardt
  Cc: linux-mm, axboe, Andrew Morton, Hugh Dickins, Rik van Riel

On 05/21/2012 04:51 PM, Christian Ehrhardt wrote:

> [...]
> 
>>> [missing patch #3]
>>> I tried to get a similar patch working for swap-out in
>>> shrink_page_list. It
>>> worked in functional terms, but the additional merging was negligible.
>>
>>
>> I think we have already done it.
>> Look at shrink_mem_cgroup_zone, which ends up calling shrink_page_list,
>> so we have already applied I/O plugging.
>>
> 
> I saw that code and it is part of the kernel I used to test my patches.
> But despite that code and my additional experiments of plug/unplug in
> shrink_page_list the effective I/O size of swap write stays at almost 4k.


I meant that your plugging in shrink_page_list is redundant.

> 
> Thereby so far I can tell you that the plugs in shrink_page_list and
> shrink_mem_cgroup_zone aren't sufficient - at least for my case.


Yeb.

> You saw the blocktrace summaries in my first mail, an excerpt of a write
> submission stream looks like that:
> 
>  94,4   10      465     0.023520923   116  A   W 28868648 + 8 <- (94,5)
> 28868456
>  94,5   10      466     0.023521173   116  Q   W 28868648 + 8 [kswapd0]
>  94,5   10      467     0.023522048   116  G   W 28868648 + 8 [kswapd0]
>  94,5   10      468     0.023522235   116  P   N [kswapd0]
>  94,5   10      469     0.023759892   116  I   W 28868648 + 8 ( 237844)
> [kswapd0]
>  94,5   10      470     0.023760079   116  U   N [kswapd0] 1
>  94,5   10      471     0.023760360   116  D   W 28868648 + 8 ( 468)
> [kswapd0]
>  94,4   10      472     0.023891235   116  A   W 28868656 + 8 <- (94,5)
> 28868464
>  94,5   10      473     0.023891454   116  Q   W 28868656 + 8 [kswapd0]
>  94,5   10      474     0.023892110   116  G   W 28868656 + 8 [kswapd0]
>  94,5   10      475     0.023944610   116  I   W 28868656 + 8 ( 52500)
> [kswapd0]
>  94,5   10      476     0.023944735   116  U   N [kswapd0] 1
>  94,5   10      477     0.023944892   116  D   W 28868656 + 8 ( 282)
> [kswapd0]
>  94,5   16       19     0.024023192 16033  C   W 28868648 + 8 ( 262832) [0]
>  94,5   24       37     0.024196752 14526  C   W 28868656 + 8 ( 251860) [0]
> [...]
> 
> But we can split this discussion from my other two patches and I would
> be happy to provide my test environment for further tests if there are
> new suggestions/patches/...
> 
>>> Maybe the cond_resched triggers much more often than I expected; I'm
>>> open to
>>> suggestions regarding improving the pageout I/O sizes as well.
>>
>>
>> We could enhance write-out by batching, as ext4_bio_write_page does.
>>
> 
> Do you mean the changes brought by "bd2d0210 ext4: use bio layer instead
> of buffer layer in mpage_da_submit_io" ?


Yeb, I think it's helpful for your case but it's not trivial to implement it, IMHO.

> 
> 
> 



-- 
Kind regards,
Minchan Kim


* [PATCH 2/2] documentation: update how page-cluster affects swap I/O
  2012-06-04  8:33 [PATCH 0/2] swap: improve swap I/O rate - V2 ehrhardt
@ 2012-06-04  8:33 ` ehrhardt
  0 siblings, 0 replies; 14+ messages in thread
From: ehrhardt @ 2012-06-04  8:33 UTC (permalink / raw)
  To: linux-mm, akpm; +Cc: axboe, hughd, minchan, Christian Ehrhardt

From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>

Fix the documentation of /proc/sys/vm/page-cluster to match the behavior of
the code and add some comments about what the tunable changes in that
behavior.

Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Minchan Kim <minchan@kernel.org>

---
 Documentation/sysctl/vm.txt |   12 ++++++++++--
 1 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 96f0ee8..4d87dc0 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -574,16 +574,24 @@ of physical RAM.  See above.
 
 page-cluster
 
-page-cluster controls the number of pages which are written to swap in
-a single attempt.  The swap I/O size.
+page-cluster controls the number of pages up to which consecutive pages
+are read in from swap in a single attempt. This is the swap counterpart
+to page cache readahead.
+The mentioned consecutivity is not in terms of virtual/physical addresses,
+but consecutive on swap space - that means they were swapped out together.
 
 It is a logarithmic value - setting it to zero means "1 page", setting
 it to 1 means "2 pages", setting it to 2 means "4 pages", etc.
+Zero disables swap readahead completely.
 
 The default value is three (eight pages at a time).  There may be some
 small benefits in tuning this to a different value if your workload is
 swap-intensive.
 
+Lower values mean lower latencies for initial faults, but at the same time
+extra faults and I/O delays for following faults if they would have been part of
+that consecutive pages readahead would have brought in.
+
 =============================================================
 
 panic_on_oom
-- 
1.7.0.4


* Re: [PATCH 2/2] documentation: update how page-cluster affects swap I/O
  2012-05-21  8:09 ` [PATCH 2/2] documentation: update how page-cluster affects swap I/O ehrhardt
@ 2012-05-21  8:48   ` Minchan Kim
  0 siblings, 0 replies; 14+ messages in thread
From: Minchan Kim @ 2012-05-21  8:48 UTC (permalink / raw)
  To: ehrhardt; +Cc: linux-mm, axboe

On 05/21/2012 05:09 PM, ehrhardt@linux.vnet.ibm.com wrote:

> From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
> 
> Fix the documentation of /proc/sys/vm/page-cluster to match the behavior of
> the code and add some comments about what the tunable changes in that
> behavior.
> 
> Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
> Acked-by: Jens Axboe <axboe@kernel.dk>


Reviewed-by: Minchan Kim <minchan@kernel.org>

-- 

Kind regards,
Minchan Kim


* [PATCH 2/2] documentation: update how page-cluster affects swap I/O
  2012-05-21  8:09 [PATCH 0/2] swap: improve swap I/O rate - V2 ehrhardt
@ 2012-05-21  8:09 ` ehrhardt
  2012-05-21  8:48   ` Minchan Kim
  0 siblings, 1 reply; 14+ messages in thread
From: ehrhardt @ 2012-05-21  8:09 UTC (permalink / raw)
  To: linux-mm; +Cc: axboe, Christian Ehrhardt

From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>

Fix the documentation of /proc/sys/vm/page-cluster to match the behavior of
the code and add some comments about what the tunable changes in that
behavior.

Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
---
 Documentation/sysctl/vm.txt |   12 ++++++++++--
 1 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 96f0ee8..4d87dc0 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -574,16 +574,24 @@ of physical RAM.  See above.
 
 page-cluster
 
-page-cluster controls the number of pages which are written to swap in
-a single attempt.  The swap I/O size.
+page-cluster controls the number of pages up to which consecutive pages
+are read in from swap in a single attempt. This is the swap counterpart
+to page cache readahead.
+The mentioned consecutivity is not in terms of virtual/physical addresses,
+but consecutive on swap space - that means they were swapped out together.
 
 It is a logarithmic value - setting it to zero means "1 page", setting
 it to 1 means "2 pages", setting it to 2 means "4 pages", etc.
+Zero disables swap readahead completely.
 
 The default value is three (eight pages at a time).  There may be some
 small benefits in tuning this to a different value if your workload is
 swap-intensive.
 
+Lower values mean lower latencies for initial faults, but at the same time
+extra faults and I/O delays for following faults if they would have been part of
+that consecutive pages readahead would have brought in.
+
 =============================================================
 
 panic_on_oom
-- 
1.7.0.4


Thread overview: 14+ messages
2012-05-14 11:58 [PATCH 0/2] swap: improve swap I/O rate ehrhardt
2012-05-14 11:58 ` [PATCH 1/2] swap: allow swap readahead to be merged ehrhardt
2012-05-15  4:38   ` Minchan Kim
2012-05-15 17:43   ` Rik van Riel
2012-05-14 11:58 ` [PATCH 2/2] documentation: update how page-cluster affects swap I/O ehrhardt
2012-05-15  4:48   ` Minchan Kim
2012-05-21  7:24     ` Christian Ehrhardt
2012-05-15  4:59 ` [PATCH 0/2] swap: improve swap I/O rate Minchan Kim
2012-05-21  7:51   ` Christian Ehrhardt
2012-05-21  8:46     ` Minchan Kim
2012-05-15 18:24 ` Jens Axboe
2012-05-21  8:09 [PATCH 0/2] swap: improve swap I/O rate - V2 ehrhardt
2012-05-21  8:09 ` [PATCH 2/2] documentation: update how page-cluster affects swap I/O ehrhardt
2012-05-21  8:48   ` Minchan Kim
2012-06-04  8:33 [PATCH 0/2] swap: improve swap I/O rate - V2 ehrhardt
2012-06-04  8:33 ` [PATCH 2/2] documentation: update how page-cluster affects swap I/O ehrhardt
