* [PATCH 0/2] swap: improve swap I/O rate
@ 2012-05-14 11:58 ehrhardt
2012-05-14 11:58 ` [PATCH 1/2] swap: allow swap readahead to be merged ehrhardt
` (3 more replies)
0 siblings, 4 replies; 14+ messages in thread
From: ehrhardt @ 2012-05-14 11:58 UTC (permalink / raw)
To: linux-mm; +Cc: axboe, Ehrhardt Christian
From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
In a memory overcommitment scenario with KVM I ran into a lot of waits for
swap. While checking the I/O done on the swap disks I found almost all I/Os
to be done as single-page 4k requests, despite the fact that swap-in is a
batch of 1<<page-cluster pages via swap readahead and swap-out is a list of
pages written in shrink_page_list.
[1/2 swap-in improvement]
The read patch shows improvements of up to 50% in swap throughput, much happier
guest systems, and even when running with comparable throughput a lot of I/Os
per second saved, leaving resources in the SAN for other consumers.
[2/2 documentation]
While doing so I also realized that the documentation for
/proc/sys/vm/page-cluster no longer matches the code.
[missing patch #3]
I tried to get a similar patch working for swap-out in shrink_page_list. It
worked in functional terms, but the additional merging was negligible.
Maybe cond_resched triggers much more often than I expected; I'm open for
suggestions regarding improving the pageout I/O sizes as well.
Kind regards,
Christian Ehrhardt
Christian Ehrhardt (2):
swap: allow swap readahead to be merged
documentation: update how page-cluster affects swap I/O
Documentation/sysctl/vm.txt | 12 ++++++++++--
mm/swap_state.c | 5 +++++
2 files changed, 15 insertions(+), 2 deletions(-)
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* [PATCH 1/2] swap: allow swap readahead to be merged
2012-05-14 11:58 [PATCH 0/2] swap: improve swap I/O rate ehrhardt
@ 2012-05-14 11:58 ` ehrhardt
2012-05-15 4:38 ` Minchan Kim
2012-05-15 17:43 ` Rik van Riel
2012-05-14 11:58 ` [PATCH 2/2] documentation: update how page-cluster affects swap I/O ehrhardt
` (2 subsequent siblings)
3 siblings, 2 replies; 14+ messages in thread
From: ehrhardt @ 2012-05-14 11:58 UTC (permalink / raw)
To: linux-mm; +Cc: axboe, Christian Ehrhardt
From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Swap readahead works fine, but the I/O to disk is almost always done in
page-size requests, despite the fact that readahead submits
1<<page-cluster pages at a time.
On older kernels the old per-device plugging behavior might have captured
this and merged the requests, but currently it all comes down to many more
I/Os than required.
On a single device this might not be an issue, but as soon as a server runs
on shared SAN resources, saving I/Os not only improves swap-in throughput but
also provides lower resource utilization.
With a load running KVM in a lot of memory overcommitment (the hot memory
is 1.5 times the host memory), swapping throughput improves significantly
and the load feels more responsive as well as achieves more throughput.
In a test setup with 16 swap disks running blocktrace on one of those disks
shows the improved merging:
Prior:
Reads Queued: 560,888, 2,243MiB Writes Queued: 226,242, 904,968KiB
Read Dispatches: 544,701, 2,243MiB Write Dispatches: 159,318, 904,968KiB
Reads Requeued: 0 Writes Requeued: 0
Reads Completed: 544,716, 2,243MiB Writes Completed: 159,321, 904,980KiB
Read Merges: 16,187, 64,748KiB Write Merges: 61,744, 246,976KiB
IO unplugs: 149,614 Timer unplugs: 2,940
With the patch:
Reads Queued: 734,315, 2,937MiB Writes Queued: 300,188, 1,200MiB
Read Dispatches: 214,972, 2,937MiB Write Dispatches: 215,176, 1,200MiB
Reads Requeued: 0 Writes Requeued: 0
Reads Completed: 214,971, 2,937MiB Writes Completed: 215,177, 1,200MiB
Read Merges: 519,343, 2,077MiB Write Merges: 73,325, 293,300KiB
IO unplugs: 337,130 Timer unplugs: 11,184
Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
---
mm/swap_state.c | 5 +++++
1 files changed, 5 insertions(+), 0 deletions(-)
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 4c5ff7f..c85b559 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -14,6 +14,7 @@
#include <linux/init.h>
#include <linux/pagemap.h>
#include <linux/backing-dev.h>
+#include <linux/blkdev.h>
#include <linux/pagevec.h>
#include <linux/migrate.h>
#include <linux/page_cgroup.h>
@@ -376,6 +377,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
unsigned long offset = swp_offset(entry);
unsigned long start_offset, end_offset;
unsigned long mask = (1UL << page_cluster) - 1;
+ struct blk_plug plug;
/* Read a page_cluster sized and aligned cluster around offset. */
start_offset = offset & ~mask;
@@ -383,6 +385,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
if (!start_offset) /* First page is swap header. */
start_offset++;
+ blk_start_plug(&plug);
for (offset = start_offset; offset <= end_offset ; offset++) {
/* Ok, do the async read-ahead now */
page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
@@ -391,6 +394,8 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
continue;
page_cache_release(page);
}
+ blk_finish_plug(&plug);
+
lru_add_drain(); /* Push any new pages onto the LRU now */
return read_swap_cache_async(entry, gfp_mask, vma, addr);
}
--
1.7.0.4
* [PATCH 2/2] documentation: update how page-cluster affects swap I/O
2012-05-14 11:58 [PATCH 0/2] swap: improve swap I/O rate ehrhardt
2012-05-14 11:58 ` [PATCH 1/2] swap: allow swap readahead to be merged ehrhardt
@ 2012-05-14 11:58 ` ehrhardt
2012-05-15 4:48 ` Minchan Kim
2012-05-15 4:59 ` [PATCH 0/2] swap: improve swap I/O rate Minchan Kim
2012-05-15 18:24 ` Jens Axboe
3 siblings, 1 reply; 14+ messages in thread
From: ehrhardt @ 2012-05-14 11:58 UTC (permalink / raw)
To: linux-mm; +Cc: axboe, Christian Ehrhardt
From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Fix the documentation of /proc/sys/vm/page-cluster to match the behavior of
the code, and add some comments about how the tunable changes that
behavior.
Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
---
Documentation/sysctl/vm.txt | 12 ++++++++++--
1 files changed, 10 insertions(+), 2 deletions(-)
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 96f0ee8..4d87dc0 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -574,16 +574,24 @@ of physical RAM. See above.
page-cluster
-page-cluster controls the number of pages which are written to swap in
-a single attempt. The swap I/O size.
+page-cluster controls the number of pages up to which consecutive pages (if
+available) are read in from swap in a single attempt. This is the swap
+counterpart to page cache readahead.
+The mentioned consecutivity is not in terms of virtual/physical addresses,
+but consecutive on swap space - that means they were swapped out together.
It is a logarithmic value - setting it to zero means "1 page", setting
it to 1 means "2 pages", setting it to 2 means "4 pages", etc.
+Zero disables swap readahead completely.
The default value is three (eight pages at a time). There may be some
small benefits in tuning this to a different value if your workload is
swap-intensive.
+Lower values mean lower latencies for initial faults, but at the same time
+extra faults and I/O delays for following faults if they would have been part
+of the consecutive pages that readahead would have brought in.
+
=============================================================
panic_on_oom
--
1.7.0.4
* Re: [PATCH 1/2] swap: allow swap readahead to be merged
2012-05-14 11:58 ` [PATCH 1/2] swap: allow swap readahead to be merged ehrhardt
@ 2012-05-15 4:38 ` Minchan Kim
2012-05-15 17:43 ` Rik van Riel
1 sibling, 0 replies; 14+ messages in thread
From: Minchan Kim @ 2012-05-15 4:38 UTC (permalink / raw)
To: ehrhardt; +Cc: linux-mm, axboe, Hugh Dickins, Rik van Riel
On 05/14/2012 08:58 PM, ehrhardt@linux.vnet.ibm.com wrote:
> From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
>
> Swap readahead works fine, but the I/O to disk is almost always done in
> page-size requests, despite the fact that readahead submits
> 1<<page-cluster pages at a time.
> On older kernels the old per-device plugging behavior might have captured
> this and merged the requests, but currently it all comes down to many more
> I/Os than required.
>
> On a single device this might not be an issue, but as soon as a server runs
> on shared SAN resources, saving I/Os not only improves swap-in throughput but
> also provides lower resource utilization.
>
> With a load running KVM in a lot of memory overcommitment (the hot memory
> is 1.5 times the host memory), swapping throughput improves significantly
> and the load feels more responsive as well as achieves more throughput.
>
> In a test setup with 16 swap disks running blocktrace on one of those disks
> shows the improved merging:
> Prior:
> Reads Queued: 560,888, 2,243MiB Writes Queued: 226,242, 904,968KiB
> Read Dispatches: 544,701, 2,243MiB Write Dispatches: 159,318, 904,968KiB
> Reads Requeued: 0 Writes Requeued: 0
> Reads Completed: 544,716, 2,243MiB Writes Completed: 159,321, 904,980KiB
> Read Merges: 16,187, 64,748KiB Write Merges: 61,744, 246,976KiB
> IO unplugs: 149,614 Timer unplugs: 2,940
>
> With the patch:
> Reads Queued: 734,315, 2,937MiB Writes Queued: 300,188, 1,200MiB
> Read Dispatches: 214,972, 2,937MiB Write Dispatches: 215,176, 1,200MiB
> Reads Requeued: 0 Writes Requeued: 0
> Reads Completed: 214,971, 2,937MiB Writes Completed: 215,177, 1,200MiB
> Read Merges: 519,343, 2,077MiB Write Merges: 73,325, 293,300KiB
> IO unplugs: 337,130 Timer unplugs: 11,184
>
> Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
It does make sense to me.
> ---
> mm/swap_state.c | 5 +++++
> 1 files changed, 5 insertions(+), 0 deletions(-)
>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 4c5ff7f..c85b559 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -14,6 +14,7 @@
> #include <linux/init.h>
> #include <linux/pagemap.h>
> #include <linux/backing-dev.h>
> +#include <linux/blkdev.h>
> #include <linux/pagevec.h>
> #include <linux/migrate.h>
> #include <linux/page_cgroup.h>
> @@ -376,6 +377,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> unsigned long offset = swp_offset(entry);
> unsigned long start_offset, end_offset;
> unsigned long mask = (1UL << page_cluster) - 1;
> + struct blk_plug plug;
>
> /* Read a page_cluster sized and aligned cluster around offset. */
> start_offset = offset & ~mask;
> @@ -383,6 +385,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> if (!start_offset) /* First page is swap header. */
> start_offset++;
>
> + blk_start_plug(&plug);
> for (offset = start_offset; offset <= end_offset ; offset++) {
> /* Ok, do the async read-ahead now */
> page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
> @@ -391,6 +394,8 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> continue;
> page_cache_release(page);
> }
> + blk_finish_plug(&plug);
> +
> lru_add_drain(); /* Push any new pages onto the LRU now */
> return read_swap_cache_async(entry, gfp_mask, vma, addr);
> }
--
Kind regards,
Minchan Kim
* Re: [PATCH 2/2] documentation: update how page-cluster affects swap I/O
2012-05-14 11:58 ` [PATCH 2/2] documentation: update how page-cluster affects swap I/O ehrhardt
@ 2012-05-15 4:48 ` Minchan Kim
2012-05-21 7:24 ` Christian Ehrhardt
0 siblings, 1 reply; 14+ messages in thread
From: Minchan Kim @ 2012-05-15 4:48 UTC (permalink / raw)
To: ehrhardt; +Cc: linux-mm, axboe, Rik van Riel, Hugh Dickins, Andrew Morton
On 05/14/2012 08:58 PM, ehrhardt@linux.vnet.ibm.com wrote:
> From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
>
> Fix of the documentation of /proc/sys/vm/page-cluster to match the behavior of
> the code and add some comments about what the tunable will change in that
> behavior.
>
> Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
> ---
> Documentation/sysctl/vm.txt | 12 ++++++++++--
> 1 files changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> index 96f0ee8..4d87dc0 100644
> --- a/Documentation/sysctl/vm.txt
> +++ b/Documentation/sysctl/vm.txt
> @@ -574,16 +574,24 @@ of physical RAM. See above.
>
> page-cluster
>
> -page-cluster controls the number of pages which are written to swap in
> -a single attempt. The swap I/O size.
> +page-cluster controls the number of pages up to which consecutive pages (if
> +available) are read in from swap in a single attempt. This is the swap
"If available" would be wrong in next kernel because recently Rik submit following patch,
mm: make swapin readahead skip over holes
http://marc.info/?l=linux-mm&m=132743264912987&w=4
> +counterpart to page cache readahead.
> +The mentioned consecutivity is not in terms of virtual/physical addresses,
> +but consecutive on swap space - that means they were swapped out together.
>
> It is a logarithmic value - setting it to zero means "1 page", setting
> it to 1 means "2 pages", setting it to 2 means "4 pages", etc.
> +Zero disables swap readahead completely.
>
> The default value is three (eight pages at a time). There may be some
> small benefits in tuning this to a different value if your workload is
> swap-intensive.
>
> +Lower values mean lower latencies for initial faults, but at the same time
> +extra faults and I/O delays for following faults if they would have been part of
> +that consecutive pages readahead would have brought in.
> +
> =============================================================
>
> panic_on_oom
Otherwise, looks good to me.
--
Kind regards,
Minchan Kim
* Re: [PATCH 0/2] swap: improve swap I/O rate
2012-05-14 11:58 [PATCH 0/2] swap: improve swap I/O rate ehrhardt
2012-05-14 11:58 ` [PATCH 1/2] swap: allow swap readahead to be merged ehrhardt
2012-05-14 11:58 ` [PATCH 2/2] documentation: update how page-cluster affects swap I/O ehrhardt
@ 2012-05-15 4:59 ` Minchan Kim
2012-05-21 7:51 ` Christian Ehrhardt
2012-05-15 18:24 ` Jens Axboe
3 siblings, 1 reply; 14+ messages in thread
From: Minchan Kim @ 2012-05-15 4:59 UTC (permalink / raw)
To: ehrhardt; +Cc: linux-mm, axboe, Andrew Morton, Hugh Dickins, Rik van Riel
On 05/14/2012 08:58 PM, ehrhardt@linux.vnet.ibm.com wrote:
> From: Ehrhardt Christian <ehrhardt@linux.vnet.ibm.com>
>
> From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
>
> In a memory overcommitment scenario with KVM I ran into a lot of waits for
> swap. While checking the I/O done on the swap disks I found almost all I/Os
> to be done as single-page 4k requests, despite the fact that swap-in is a
> batch of 1<<page-cluster pages via swap readahead and swap-out is a list of
> pages written in shrink_page_list.
>
> [1/2 swap-in improvement]
> The read patch shows improvements of up to 50% in swap throughput, much happier
> guest systems, and even when running with comparable throughput a lot of I/Os
> per second saved, leaving resources in the SAN for other consumers.
>
> [2/2 documentation]
> While doing so I also realized that the documentation for
> /proc/sys/vm/page-cluster no longer matches the code.
>
> [missing patch #3]
> I tried to get a similar patch working for swap-out in shrink_page_list. It
> worked in functional terms, but the additional merging was negligible.
I think we have already done it.
Look at shrink_mem_cgroup_zone, which ends up calling shrink_page_list, so we
have already applied I/O plugging.
> Maybe cond_resched triggers much more often than I expected; I'm open for
> suggestions regarding improving the pageout I/O sizes as well.
We could enhance write-out by batching, like ext4_bio_write_page does.
>
> Kind regards,
> Christian Ehrhardt
>
>
> Christian Ehrhardt (2):
> swap: allow swap readahead to be merged
> documentation: update how page-cluster affects swap I/O
>
> Documentation/sysctl/vm.txt | 12 ++++++++++--
> mm/swap_state.c | 5 +++++
> 2 files changed, 15 insertions(+), 2 deletions(-)
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>
--
Kind regards,
Minchan Kim
* Re: [PATCH 1/2] swap: allow swap readahead to be merged
2012-05-14 11:58 ` [PATCH 1/2] swap: allow swap readahead to be merged ehrhardt
2012-05-15 4:38 ` Minchan Kim
@ 2012-05-15 17:43 ` Rik van Riel
1 sibling, 0 replies; 14+ messages in thread
From: Rik van Riel @ 2012-05-15 17:43 UTC (permalink / raw)
To: ehrhardt; +Cc: linux-mm, axboe
On 05/14/2012 07:58 AM, ehrhardt@linux.vnet.ibm.com wrote:
> From: Christian Ehrhardt<ehrhardt@linux.vnet.ibm.com>
>
> Swap readahead works fine, but the I/O to disk is almost always done in
> page-size requests, despite the fact that readahead submits
> 1<<page-cluster pages at a time.
> On older kernels the old per-device plugging behavior might have captured
> this and merged the requests, but currently it all comes down to many more
> I/Os than required.
>
> On a single device this might not be an issue, but as soon as a server runs
> on shared SAN resources, saving I/Os not only improves swap-in throughput but
> also provides lower resource utilization.
> Signed-off-by: Christian Ehrhardt<ehrhardt@linux.vnet.ibm.com>
Acked-by: Rik van Riel <riel@redhat.com>
* Re: [PATCH 0/2] swap: improve swap I/O rate
2012-05-14 11:58 [PATCH 0/2] swap: improve swap I/O rate ehrhardt
` (2 preceding siblings ...)
2012-05-15 4:59 ` [PATCH 0/2] swap: improve swap I/O rate Minchan Kim
@ 2012-05-15 18:24 ` Jens Axboe
3 siblings, 0 replies; 14+ messages in thread
From: Jens Axboe @ 2012-05-15 18:24 UTC (permalink / raw)
To: ehrhardt; +Cc: linux-mm
On 2012-05-14 13:58, ehrhardt@linux.vnet.ibm.com wrote:
> From: Ehrhardt Christian <ehrhardt@linux.vnet.ibm.com>
>
> From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
>
> In a memory overcommitment scenario with KVM I ran into a lot of waits for
> swap. While checking the I/O done on the swap disks I found almost all I/Os
> to be done as single-page 4k requests, despite the fact that swap-in is a
> batch of 1<<page-cluster pages via swap readahead and swap-out is a list of
> pages written in shrink_page_list.
>
> [1/2 swap-in improvement]
> The read patch shows improvements of up to 50% in swap throughput, much happier
> guest systems, and even when running with comparable throughput a lot of I/Os
> per second saved, leaving resources in the SAN for other consumers.
>
> [2/2 documentation]
> While doing so I also realized that the documentation for
> /proc/sys/vm/page-cluster no longer matches the code.
>
> [missing patch #3]
> I tried to get a similar patch working for swap-out in shrink_page_list. It
> worked in functional terms, but the additional merging was negligible.
> Maybe cond_resched triggers much more often than I expected; I'm open for
> suggestions regarding improving the pageout I/O sizes as well.
>
> Kind regards,
> Christian Ehrhardt
>
>
> Christian Ehrhardt (2):
> swap: allow swap readahead to be merged
> documentation: update how page-cluster affects swap I/O
Looks good to me, you can add my acked-by to both of them.
--
Jens Axboe
* Re: [PATCH 2/2] documentation: update how page-cluster affects swap I/O
2012-05-15 4:48 ` Minchan Kim
@ 2012-05-21 7:24 ` Christian Ehrhardt
0 siblings, 0 replies; 14+ messages in thread
From: Christian Ehrhardt @ 2012-05-21 7:24 UTC (permalink / raw)
To: Minchan Kim; +Cc: linux-mm, axboe, Rik van Riel, Hugh Dickins, Andrew Morton
On 05/15/2012 06:48 AM, Minchan Kim wrote:
> On 05/14/2012 08:58 PM, ehrhardt@linux.vnet.ibm.com wrote:
>
>> From: Christian Ehrhardt<ehrhardt@linux.vnet.ibm.com>
>>
>> Fix of the documentation of /proc/sys/vm/page-cluster to match the behavior of
>> the code and add some comments about what the tunable will change in that
>> behavior.
>>
>> Signed-off-by: Christian Ehrhardt<ehrhardt@linux.vnet.ibm.com>
>> ---
>> Documentation/sysctl/vm.txt | 12 ++++++++++--
>> 1 files changed, 10 insertions(+), 2 deletions(-)
>>
>> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
>> index 96f0ee8..4d87dc0 100644
>> --- a/Documentation/sysctl/vm.txt
>> +++ b/Documentation/sysctl/vm.txt
>> @@ -574,16 +574,24 @@ of physical RAM. See above.
>>
>> page-cluster
>>
>> -page-cluster controls the number of pages which are written to swap in
>> -a single attempt. The swap I/O size.
>> +page-cluster controls the number of pages up to which consecutive pages (if
>> +available) are read in from swap in a single attempt. This is the swap
>
>
> "If available" would be wrong in next kernel because recently Rik submit following patch,
>
> mm: make swapin readahead skip over holes
> http://marc.info/?l=linux-mm&m=132743264912987&w=4
>
>
You're right - it's not severely wrong, but if we are fixing the
documentation we can do it right.
I'll send a 2nd version of the patch series with this adapted and all
the acks I got so far added.
--
Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance
* Re: [PATCH 0/2] swap: improve swap I/O rate
2012-05-15 4:59 ` [PATCH 0/2] swap: improve swap I/O rate Minchan Kim
@ 2012-05-21 7:51 ` Christian Ehrhardt
2012-05-21 8:46 ` Minchan Kim
0 siblings, 1 reply; 14+ messages in thread
From: Christian Ehrhardt @ 2012-05-21 7:51 UTC (permalink / raw)
To: Minchan Kim; +Cc: linux-mm, axboe, Andrew Morton, Hugh Dickins, Rik van Riel
[...]
>> [missing patch #3]
>> I tried to get a similar patch working for swap out in shrink_page_list. And
>> it worked in functional terms, but the additional mergin was negligible.
>
>
> I think we have already done it.
> Look at shrink_mem_cgroup_zone which ends up calling shrink_page_list so we already have applied
> I/O plugging.
>
I saw that code and it is part of the kernel I used to test my patches.
But despite that code and my additional experiments with plug/unplug in
shrink_page_list, the effective I/O size of swap writes stays at almost 4k.
So far I can tell you that the plugs in shrink_page_list and
shrink_mem_cgroup_zone aren't sufficient - at least for my case.
You saw the blocktrace summaries in my first mail; an excerpt of a write
submission stream looks like this:
94,4   10      465     0.023520923   116  A   W 28868648 + 8 <- (94,5) 28868456
94,5   10      466     0.023521173   116  Q   W 28868648 + 8 [kswapd0]
94,5   10      467     0.023522048   116  G   W 28868648 + 8 [kswapd0]
94,5   10      468     0.023522235   116  P   N [kswapd0]
94,5   10      469     0.023759892   116  I   W 28868648 + 8 ( 237844) [kswapd0]
94,5   10      470     0.023760079   116  U   N [kswapd0] 1
94,5   10      471     0.023760360   116  D   W 28868648 + 8 (    468) [kswapd0]
94,4   10      472     0.023891235   116  A   W 28868656 + 8 <- (94,5) 28868464
94,5   10      473     0.023891454   116  Q   W 28868656 + 8 [kswapd0]
94,5   10      474     0.023892110   116  G   W 28868656 + 8 [kswapd0]
94,5   10      475     0.023944610   116  I   W 28868656 + 8 (  52500) [kswapd0]
94,5   10      476     0.023944735   116  U   N [kswapd0] 1
94,5   10      477     0.023944892   116  D   W 28868656 + 8 (    282) [kswapd0]
94,5   16       19     0.024023192 16033  C   W 28868648 + 8 ( 262832) [0]
94,5   24       37     0.024196752 14526  C   W 28868656 + 8 ( 251860) [0]
[...]
But we can split this discussion from my other two patches and I would
be happy to provide my test environment for further tests if there are
new suggestions/patches/...
>> Maybe cond_resched triggers much more often than I expected; I'm open for
>> suggestions regarding improving the pageout I/O sizes as well.
>
>
>> We could enhance write-out by batching, like ext4_bio_write_page does.
>
Do you mean the changes brought by "bd2d0210 ext4: use bio layer instead
of buffer layer in mpage_da_submit_io"?
--
Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance
* Re: [PATCH 0/2] swap: improve swap I/O rate
2012-05-21 7:51 ` Christian Ehrhardt
@ 2012-05-21 8:46 ` Minchan Kim
0 siblings, 0 replies; 14+ messages in thread
From: Minchan Kim @ 2012-05-21 8:46 UTC (permalink / raw)
To: Christian Ehrhardt
Cc: linux-mm, axboe, Andrew Morton, Hugh Dickins, Rik van Riel
On 05/21/2012 04:51 PM, Christian Ehrhardt wrote:
> [...]
>
>>> [missing patch #3]
>>> I tried to get a similar patch working for swap-out in shrink_page_list.
>>> It worked in functional terms, but the additional merging was
>>> negligible.
>>
>>
>> I think we have already done it.
>> Look at shrink_mem_cgroup_zone, which ends up calling shrink_page_list,
>> so we have already applied I/O plugging.
>>
>
> I saw that code and it is part of the kernel I used to test my patches.
> But despite that code and my additional experiments with plug/unplug in
> shrink_page_list, the effective I/O size of swap writes stays at almost 4k.
I meant that your plugging in shrink_page_list is redundant.
>
> So far I can tell you that the plugs in shrink_page_list and
> shrink_mem_cgroup_zone aren't sufficient - at least for my case.
Yeb.
> You saw the blocktrace summaries in my first mail; an excerpt of a write
> submission stream looks like this:
>
> 94,4   10      465     0.023520923   116  A   W 28868648 + 8 <- (94,5) 28868456
> 94,5   10      466     0.023521173   116  Q   W 28868648 + 8 [kswapd0]
> 94,5   10      467     0.023522048   116  G   W 28868648 + 8 [kswapd0]
> 94,5   10      468     0.023522235   116  P   N [kswapd0]
> 94,5   10      469     0.023759892   116  I   W 28868648 + 8 ( 237844) [kswapd0]
> 94,5   10      470     0.023760079   116  U   N [kswapd0] 1
> 94,5   10      471     0.023760360   116  D   W 28868648 + 8 (    468) [kswapd0]
> 94,4   10      472     0.023891235   116  A   W 28868656 + 8 <- (94,5) 28868464
> 94,5   10      473     0.023891454   116  Q   W 28868656 + 8 [kswapd0]
> 94,5   10      474     0.023892110   116  G   W 28868656 + 8 [kswapd0]
> 94,5   10      475     0.023944610   116  I   W 28868656 + 8 (  52500) [kswapd0]
> 94,5   10      476     0.023944735   116  U   N [kswapd0] 1
> 94,5   10      477     0.023944892   116  D   W 28868656 + 8 (    282) [kswapd0]
> 94,5   16       19     0.024023192 16033  C   W 28868648 + 8 ( 262832) [0]
> 94,5   24       37     0.024196752 14526  C   W 28868656 + 8 ( 251860) [0]
> [...]
>
> But we can split this discussion from my other two patches and I would
> be happy to provide my test environment for further tests if there are
> new suggestions/patches/...
>
>>> Maybe cond_resched triggers much more often than I expected; I'm open for
>>> suggestions regarding improving the pageout I/O sizes as well.
>>
>>
>> We could enhance write-out by batching, like ext4_bio_write_page does.
>>
>
> Do you mean the changes brought by "bd2d0210 ext4: use bio layer instead
> of buffer layer in mpage_da_submit_io" ?
Yeb, I think it's helpful for your case, but it's not trivial to implement, IMHO.
>
>
>
--
Kind regards,
Minchan Kim
* [PATCH 2/2] documentation: update how page-cluster affects swap I/O
2012-06-04 8:33 [PATCH 0/2] swap: improve swap I/O rate - V2 ehrhardt
@ 2012-06-04 8:33 ` ehrhardt
0 siblings, 0 replies; 14+ messages in thread
From: ehrhardt @ 2012-06-04 8:33 UTC (permalink / raw)
To: linux-mm, akpm; +Cc: axboe, hughd, minchan, Christian Ehrhardt
From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Fix the documentation of /proc/sys/vm/page-cluster to match the behavior of
the code, and add some comments about how the tunable changes that
behavior.
Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Minchan Kim <minchan@kernel.org>
---
Documentation/sysctl/vm.txt | 12 ++++++++++--
1 files changed, 10 insertions(+), 2 deletions(-)
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 96f0ee8..4d87dc0 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -574,16 +574,24 @@ of physical RAM. See above.
page-cluster
-page-cluster controls the number of pages which are written to swap in
-a single attempt. The swap I/O size.
+page-cluster controls the maximum number of consecutive pages read in from
+swap in a single attempt. This is the swap counterpart to page cache
+readahead.
+Consecutive here means consecutive in swap space, not in virtual or physical
+address terms - that is, the pages were swapped out together.
It is a logarithmic value - setting it to zero means "1 page", setting
it to 1 means "2 pages", setting it to 2 means "4 pages", etc.
+Zero disables swap readahead completely.
The default value is three (eight pages at a time). There may be some
small benefits in tuning this to a different value if your workload is
swap-intensive.
+Lower values mean lower latencies for initial faults, but at the same time
+extra faults and I/O delays for subsequent faults on pages that a larger
+readahead would already have brought in.
+
=============================================================
panic_on_oom
--
1.7.0.4
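As a quick illustration of the logarithmic mapping the patch documents (illustration only, not part of the patch; a 4 KiB page size is assumed):

```python
# page-cluster is logarithmic: pages read per attempt = 2 ** page_cluster
# (assumption for the size figures: 4 KiB pages).
PAGE_KIB = 4

def readahead_pages(page_cluster):
    """Number of consecutive swap pages read in one attempt."""
    return 1 << page_cluster

for pc in range(4):
    print(f"page-cluster={pc}: {readahead_pages(pc)} pages "
          f"({readahead_pages(pc) * PAGE_KIB} KiB)")
# The default of 3 gives 8 pages, i.e. 32 KiB per readahead attempt.
```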
* Re: [PATCH 2/2] documentation: update how page-cluster affects swap I/O
2012-05-21 8:09 ` [PATCH 2/2] documentation: update how page-cluster affects swap I/O ehrhardt
@ 2012-05-21 8:48 ` Minchan Kim
0 siblings, 0 replies; 14+ messages in thread
From: Minchan Kim @ 2012-05-21 8:48 UTC (permalink / raw)
To: ehrhardt; +Cc: linux-mm, axboe
On 05/21/2012 05:09 PM, ehrhardt@linux.vnet.ibm.com wrote:
> From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
>
> Fix the documentation of /proc/sys/vm/page-cluster to match the behavior of
> the code, and add some notes on how the tunable changes that behavior.
>
> Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
> Acked-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Minchan Kim <minchan@kernel.org>
--
Kind regards,
Minchan Kim
* [PATCH 2/2] documentation: update how page-cluster affects swap I/O
2012-05-21 8:09 [PATCH 0/2] swap: improve swap I/O rate - V2 ehrhardt
@ 2012-05-21 8:09 ` ehrhardt
2012-05-21 8:48 ` Minchan Kim
0 siblings, 1 reply; 14+ messages in thread
From: ehrhardt @ 2012-05-21 8:09 UTC (permalink / raw)
To: linux-mm; +Cc: axboe, Christian Ehrhardt
From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Fix the documentation of /proc/sys/vm/page-cluster to match the behavior of
the code, and add some notes on how the tunable changes that behavior.
Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
---
Documentation/sysctl/vm.txt | 12 ++++++++++--
1 files changed, 10 insertions(+), 2 deletions(-)
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 96f0ee8..4d87dc0 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -574,16 +574,24 @@ of physical RAM. See above.
page-cluster
-page-cluster controls the number of pages which are written to swap in
-a single attempt. The swap I/O size.
+page-cluster controls the maximum number of consecutive pages read in from
+swap in a single attempt. This is the swap counterpart to page cache
+readahead.
+Consecutive here means consecutive in swap space, not in virtual or physical
+address terms - that is, the pages were swapped out together.
It is a logarithmic value - setting it to zero means "1 page", setting
it to 1 means "2 pages", setting it to 2 means "4 pages", etc.
+Zero disables swap readahead completely.
The default value is three (eight pages at a time). There may be some
small benefits in tuning this to a different value if your workload is
swap-intensive.
+Lower values mean lower latencies for initial faults, but at the same time
+extra faults and I/O delays for subsequent faults on pages that a larger
+readahead would already have brought in.
+
=============================================================
panic_on_oom
--
1.7.0.4
Thread overview: 14+ messages
2012-05-14 11:58 [PATCH 0/2] swap: improve swap I/O rate ehrhardt
2012-05-14 11:58 ` [PATCH 1/2] swap: allow swap readahead to be merged ehrhardt
2012-05-15 4:38 ` Minchan Kim
2012-05-15 17:43 ` Rik van Riel
2012-05-14 11:58 ` [PATCH 2/2] documentation: update how page-cluster affects swap I/O ehrhardt
2012-05-15 4:48 ` Minchan Kim
2012-05-21 7:24 ` Christian Ehrhardt
2012-05-15 4:59 ` [PATCH 0/2] swap: improve swap I/O rate Minchan Kim
2012-05-21 7:51 ` Christian Ehrhardt
2012-05-21 8:46 ` Minchan Kim
2012-05-15 18:24 ` Jens Axboe
2012-05-21 8:09 [PATCH 0/2] swap: improve swap I/O rate - V2 ehrhardt
2012-05-21 8:09 ` [PATCH 2/2] documentation: update how page-cluster affects swap I/O ehrhardt
2012-05-21 8:48 ` Minchan Kim
2012-06-04 8:33 [PATCH 0/2] swap: improve swap I/O rate - V2 ehrhardt
2012-06-04 8:33 ` [PATCH 2/2] documentation: update how page-cluster affects swap I/O ehrhardt