* [PATCH 0/2] swap: improve swap I/O rate - V2
From: ehrhardt @ 2012-05-21 8:09 UTC
To: linux-mm; +Cc: axboe, Christian Ehrhardt

From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>

* Update in V2 *
- Adapted the documentation patch according to feedback from Minchan Kim
- Added the Acks I got for V1 so far

In a memory overcommitment scenario with KVM I ran into a lot of waits for
swap. While checking the I/O done on the swap disks I found that almost all
I/Os were issued as single-page 4k requests, despite the fact that swap-in
is a batch of 1<<page-cluster pages via swap readahead, and swap-out is a
list of pages written in shrink_page_list.

[1/2 swap-in improvement]
The readahead patch shows improvements of up to 50% swap throughput and much
happier guest systems; even when running at comparable throughput it saves a
lot of I/Os per second, leaving resources in the SAN for other consumers.

[2/2 documentation]
While doing so I also realized that the documentation for
/proc/sys/vm/page-cluster no longer matches the code.

Kind regards,
Christian Ehrhardt

Christian Ehrhardt (2):
  swap: allow swap readahead to be merged
  documentation: update how page-cluster affects swap I/O

 Documentation/sysctl/vm.txt |   12 ++++++++++--
 mm/swap_state.c             |    5 +++++
 2 files changed, 15 insertions(+), 2 deletions(-)
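The per-disk statistics quoted in patch 1/2 were gathered with blktrace. For
anyone who wants to reproduce the measurement, a minimal sketch follows; the
device name and trace duration are assumptions, not the original setup:

    # Trace one of the swap disks while the swap-heavy workload runs.
    blktrace -d /dev/sdb -o swaptrace -w 300

    # Summarize the trace; the totals at the end ("Reads Queued",
    # "Read Merges", "IO unplugs", ...) are the counters quoted in the patch.
    blkparse -i swaptrace -d swaptrace.bin > swaptrace.txt
    tail -n 40 swaptrace.txt

    # Optional: latency and seek statistics from the same binary dump.
    btt -i swaptrace.bin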
* [PATCH 1/2] swap: allow swap readahead to be merged
From: ehrhardt @ 2012-05-21 8:09 UTC
To: linux-mm; +Cc: axboe, Christian Ehrhardt

From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>

Swap readahead works fine, but the I/O to disk is almost always done in
page-size requests, despite the fact that readahead submits 1<<page-cluster
pages at a time.
On older kernels the old per-device plugging behavior might have captured
this and merged the requests, but currently it all comes down to many more
I/Os than required.

On a single device this might not be an issue, but as soon as a server runs
on shared SAN resources, saving I/Os not only improves swap-in throughput
but also lowers resource utilization.

With a load running KVM under heavy memory overcommitment (the hot memory
is 1.5 times the host memory), swapping throughput improves significantly
and the load feels more responsive as well as achieving more throughput.

In a test setup with 16 swap disks, running blktrace on one of those disks
shows the improved merging:
Prior:
 Reads Queued:     560,888,  2,243MiB   Writes Queued:     226,242, 904,968KiB
 Read Dispatches:  544,701,  2,243MiB   Write Dispatches:  159,318, 904,968KiB
 Reads Requeued:   0                    Writes Requeued:   0
 Reads Completed:  544,716,  2,243MiB   Writes Completed:  159,321, 904,980KiB
 Read Merges:      16,187,   64,748KiB  Write Merges:      61,744,  246,976KiB
 IO unplugs:       149,614              Timer unplugs:     2,940

With the patch:
 Reads Queued:     734,315,  2,937MiB   Writes Queued:     300,188, 1,200MiB
 Read Dispatches:  214,972,  2,937MiB   Write Dispatches:  215,176, 1,200MiB
 Reads Requeued:   0                    Writes Requeued:   0
 Reads Completed:  214,971,  2,937MiB   Writes Completed:  215,177, 1,200MiB
 Read Merges:      519,343,  2,077MiB   Write Merges:      73,325,  293,300KiB
 IO unplugs:       337,130              Timer unplugs:     11,184

Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
---
 mm/swap_state.c |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/mm/swap_state.c b/mm/swap_state.c
index 4c5ff7f..c85b559 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -14,6 +14,7 @@
 #include <linux/init.h>
 #include <linux/pagemap.h>
 #include <linux/backing-dev.h>
+#include <linux/blkdev.h>
 #include <linux/pagevec.h>
 #include <linux/migrate.h>
 #include <linux/page_cgroup.h>
@@ -376,6 +377,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	unsigned long offset = swp_offset(entry);
 	unsigned long start_offset, end_offset;
 	unsigned long mask = (1UL << page_cluster) - 1;
+	struct blk_plug plug;
 
 	/* Read a page_cluster sized and aligned cluster around offset. */
 	start_offset = offset & ~mask;
@@ -383,6 +385,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	if (!start_offset)	/* First page is swap header. */
 		start_offset++;
 
+	blk_start_plug(&plug);
 	for (offset = start_offset; offset <= end_offset ; offset++) {
 		/* Ok, do the async read-ahead now */
 		page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
@@ -391,6 +394,8 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 			continue;
 		page_cache_release(page);
 	}
+	blk_finish_plug(&plug);
+
 	lru_add_drain();	/* Push any new pages onto the LRU now */
 	return read_swap_cache_async(entry, gfp_mask, vma, addr);
 }
-- 
1.7.0.4
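A lighter-weight way to check whether the readahead requests are actually
being merged, without running a full blktrace session, is to sample the
per-device merge counters; a sketch, assuming the swap disk is sdb:

    # /proc/diskstats fields per device: major, minor, name, reads completed,
    # reads merged, sectors read, ... - so $5 is the merged-read counter.
    awk '$3 == "sdb" { printf "reads completed: %s  reads merged: %s\n", $4, $5 }' /proc/diskstats

    # Sample once before and once after a swap-in heavy phase and compare the deltas.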
* Re: [PATCH 1/2] swap: allow swap readahead to be merged
From: Minchan Kim @ 2012-05-21 8:51 UTC
To: ehrhardt; +Cc: linux-mm, axboe

On 05/21/2012 05:09 PM, ehrhardt@linux.vnet.ibm.com wrote:
> From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
>
> Swap readahead works fine, but the I/O to disk is almost always done in
> page-size requests, despite the fact that readahead submits 1<<page-cluster
> pages at a time.
> On older kernels the old per-device plugging behavior might have captured
> this and merged the requests, but currently it all comes down to many more
> I/Os than required.
> [...]
>
> Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
> Acked-by: Rik van Riel <riel@redhat.com>
> Acked-by: Jens Axboe <axboe@kernel.dk>

Reviewed-by: Minchan Kim <minchan@kernel.org>

Didn't I add my Reviewed-by on your previous version?

-- 
Kind regards,
Minchan Kim
* Re: [PATCH 1/2] swap: allow swap readahead to be merged
From: Christian Ehrhardt @ 2012-05-21 9:07 UTC
To: Minchan Kim; +Cc: linux-mm, axboe

On 05/21/2012 10:51 AM, Minchan Kim wrote:
> On 05/21/2012 05:09 PM, ehrhardt@linux.vnet.ibm.com wrote:
>
>> From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
>> [...]
>>
>> Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
>> Acked-by: Rik van Riel <riel@redhat.com>
>> Acked-by: Jens Axboe <axboe@kernel.dk>
>
> Reviewed-by: Minchan Kim <minchan@kernel.org>
>
> Didn't I add my Reviewed-by on your previous version?

Sorry, I missed it, since you provided good feedback in all three mails.
I still had your "otherwise looks good to me" on mail #2 in mind and didn't
want to be so bold as to turn that into a Reviewed-by or Acked-by on my own.

-- 
Grüsse / regards,
Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance
* [PATCH 2/2] documentation: update how page-cluster affects swap I/O
From: ehrhardt @ 2012-05-21 8:09 UTC
To: linux-mm; +Cc: axboe, Christian Ehrhardt

From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>

Fix the documentation of /proc/sys/vm/page-cluster to match the behavior of
the code, and add some comments about what the tunable changes in that
behavior.

Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
---
 Documentation/sysctl/vm.txt |   12 ++++++++++--
 1 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 96f0ee8..4d87dc0 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -574,16 +574,24 @@ of physical RAM.  See above.
 
 page-cluster
 
-page-cluster controls the number of pages which are written to swap in
-a single attempt.  The swap I/O size.
+page-cluster controls the number of pages up to which consecutive pages
+are read in from swap in a single attempt. This is the swap counterpart
+to page cache readahead.
+The mentioned consecutivity is not in terms of virtual/physical addresses,
+but consecutive on swap space - that means they were swapped out together.
 
 It is a logarithmic value - setting it to zero means "1 page", setting
 it to 1 means "2 pages", setting it to 2 means "4 pages", etc.
+Zero disables swap readahead completely.
 
 The default value is three (eight pages at a time).  There may be some
 small benefits in tuning this to a different value if your workload is
 swap-intensive.
 
+Lower values mean lower latencies for initial faults, but at the same time
+extra faults and I/O delays for following faults if they would have been
+part of the consecutive pages that readahead would have brought in.
+
 =============================================================
 
 panic_on_oom
-- 
1.7.0.4
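To illustrate the tunable described in the patch above, this is how it can be
inspected and changed from the shell; the non-default values are only
examples, not recommendations:

    cat /proc/sys/vm/page-cluster   # default 3 -> 2^3 = 8 pages (32 KiB with 4 KiB pages)

    # Disable swap readahead completely (every fault reads a single page):
    sysctl -w vm.page-cluster=0

    # Larger window: 2^4 = 16 pages per readahead attempt:
    sysctl -w vm.page-cluster=4

    # Back to the default:
    sysctl -w vm.page-cluster=3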
* Re: [PATCH 2/2] documentation: update how page-cluster affects swap I/O
From: Minchan Kim @ 2012-05-21 8:48 UTC
To: ehrhardt; +Cc: linux-mm, axboe

On 05/21/2012 05:09 PM, ehrhardt@linux.vnet.ibm.com wrote:
> From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
>
> Fix the documentation of /proc/sys/vm/page-cluster to match the behavior of
> the code, and add some comments about what the tunable changes in that
> behavior.
>
> Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
> Acked-by: Jens Axboe <axboe@kernel.dk>

Reviewed-by: Minchan Kim <minchan@kernel.org>

-- 
Kind regards,
Minchan Kim
* [PATCH 0/2] swap: improve swap I/O rate - V2
From: ehrhardt @ 2012-06-04 8:33 UTC
To: linux-mm, akpm; +Cc: axboe, hughd, minchan, Christian Ehrhardt

From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>

* Update in V3 *
- Added another Reviewed-by - should be ready for upstream inclusion now

* Update in V2 *
- Adapted the documentation patch according to feedback from Minchan Kim
- Added the Acks I got for V1 so far

In a memory overcommitment scenario with KVM I ran into a lot of waits for
swap. While checking the I/O done on the swap disks I found that almost all
I/Os were issued as single-page 4k requests, despite the fact that swap-in
is a batch of 1<<page-cluster pages via swap readahead, and swap-out is a
list of pages written in shrink_page_list.

[1/2 swap-in improvement]
The readahead patch shows improvements of up to 50% swap throughput and much
happier guest systems; even when running at comparable throughput it saves a
lot of I/Os per second, leaving resources in the SAN for other consumers.

[2/2 documentation]
While doing so I also realized that the documentation for
/proc/sys/vm/page-cluster no longer matches the code.

Kind regards,
Christian Ehrhardt

Christian Ehrhardt (2):
  swap: allow swap readahead to be merged
  documentation: update how page-cluster affects swap I/O

 Documentation/sysctl/vm.txt |   12 ++++++++++--
 mm/swap_state.c             |    5 +++++
 2 files changed, 15 insertions(+), 2 deletions(-)
* [PATCH 1/2] swap: allow swap readahead to be merged
From: ehrhardt @ 2012-06-04 8:33 UTC
To: linux-mm, akpm; +Cc: axboe, hughd, minchan, Christian Ehrhardt

From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>

Swap readahead works fine, but the I/O to disk is almost always done in
page-size requests, despite the fact that readahead submits 1<<page-cluster
pages at a time.
On older kernels the old per-device plugging behavior might have captured
this and merged the requests, but currently it all comes down to many more
I/Os than required.

On a single device this might not be an issue, but as soon as a server runs
on shared SAN resources, saving I/Os not only improves swap-in throughput
but also lowers resource utilization.

With a load running KVM under heavy memory overcommitment (the hot memory
is 1.5 times the host memory), swapping throughput improves significantly
and the load feels more responsive as well as achieving more throughput.

In a test setup with 16 swap disks, running blktrace on one of those disks
shows the improved merging:
Prior:
 Reads Queued:     560,888,  2,243MiB   Writes Queued:     226,242, 904,968KiB
 Read Dispatches:  544,701,  2,243MiB   Write Dispatches:  159,318, 904,968KiB
 Reads Requeued:   0                    Writes Requeued:   0
 Reads Completed:  544,716,  2,243MiB   Writes Completed:  159,321, 904,980KiB
 Read Merges:      16,187,   64,748KiB  Write Merges:      61,744,  246,976KiB
 IO unplugs:       149,614              Timer unplugs:     2,940

With the patch:
 Reads Queued:     734,315,  2,937MiB   Writes Queued:     300,188, 1,200MiB
 Read Dispatches:  214,972,  2,937MiB   Write Dispatches:  215,176, 1,200MiB
 Reads Requeued:   0                    Writes Requeued:   0
 Reads Completed:  214,971,  2,937MiB   Writes Completed:  215,177, 1,200MiB
 Read Merges:      519,343,  2,077MiB   Write Merges:      73,325,  293,300KiB
 IO unplugs:       337,130              Timer unplugs:     11,184

Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Minchan Kim <minchan@kernel.org>
---
 mm/swap_state.c |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/mm/swap_state.c b/mm/swap_state.c
index 4c5ff7f..c85b559 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -14,6 +14,7 @@
 #include <linux/init.h>
 #include <linux/pagemap.h>
 #include <linux/backing-dev.h>
+#include <linux/blkdev.h>
 #include <linux/pagevec.h>
 #include <linux/migrate.h>
 #include <linux/page_cgroup.h>
@@ -376,6 +377,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	unsigned long offset = swp_offset(entry);
 	unsigned long start_offset, end_offset;
 	unsigned long mask = (1UL << page_cluster) - 1;
+	struct blk_plug plug;
 
 	/* Read a page_cluster sized and aligned cluster around offset. */
 	start_offset = offset & ~mask;
@@ -383,6 +385,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	if (!start_offset)	/* First page is swap header. */
 		start_offset++;
 
+	blk_start_plug(&plug);
 	for (offset = start_offset; offset <= end_offset ; offset++) {
 		/* Ok, do the async read-ahead now */
 		page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
@@ -391,6 +394,8 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 			continue;
 		page_cache_release(page);
 	}
+	blk_finish_plug(&plug);
+
 	lru_add_drain();	/* Push any new pages onto the LRU now */
 	return read_swap_cache_async(entry, gfp_mask, vma, addr);
 }
-- 
1.7.0.4
* Re: [PATCH 1/2] swap: allow swap readahead to be merged
From: Andrew Morton @ 2012-06-05 23:44 UTC
To: ehrhardt; +Cc: linux-mm, axboe, hughd, minchan

On Mon, 4 Jun 2012 10:33:22 +0200 ehrhardt@linux.vnet.ibm.com wrote:

> From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
>
> Swap readahead works fine, but the I/O to disk is almost always done in
> page-size requests, despite the fact that readahead submits 1<<page-cluster
> pages at a time.
> On older kernels the old per-device plugging behavior might have captured
> this and merged the requests, but currently it all comes down to many more
> I/Os than required.

Yes, long ago we (ie: I) decided that swap I/O isn't sufficiently common to
bother doing any fancy high-level aggregation: just toss it at the queue and
use the general BIO merging.

> On a single device this might not be an issue, but as soon as a server runs
> on shared SAN resources, saving I/Os not only improves swap-in throughput
> but also lowers resource utilization.
>
> With a load running KVM under heavy memory overcommitment (the hot memory
> is 1.5 times the host memory), swapping throughput improves significantly
> and the load feels more responsive as well as achieving more throughput.
>
> In a test setup with 16 swap disks, running blktrace on one of those disks
> shows the improved merging:
> Prior:
>  Reads Queued:     560,888,  2,243MiB   Writes Queued:     226,242, 904,968KiB
>  Read Dispatches:  544,701,  2,243MiB   Write Dispatches:  159,318, 904,968KiB
>  Reads Requeued:   0                    Writes Requeued:   0
>  Reads Completed:  544,716,  2,243MiB   Writes Completed:  159,321, 904,980KiB
>  Read Merges:      16,187,   64,748KiB  Write Merges:      61,744,  246,976KiB
>  IO unplugs:       149,614              Timer unplugs:     2,940
>
> With the patch:
>  Reads Queued:     734,315,  2,937MiB   Writes Queued:     300,188, 1,200MiB
>  Read Dispatches:  214,972,  2,937MiB   Write Dispatches:  215,176, 1,200MiB
>  Reads Requeued:   0                    Writes Requeued:   0
>  Reads Completed:  214,971,  2,937MiB   Writes Completed:  215,177, 1,200MiB
>  Read Merges:      519,343,  2,077MiB   Write Merges:      73,325,  293,300KiB
>  IO unplugs:       337,130              Timer unplugs:     11,184

This is rather hard to understand.  How much faster did it get?

> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> [...]
> @@ -383,6 +385,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
>  	if (!start_offset)	/* First page is swap header. */
>  		start_offset++;
>  
> +	blk_start_plug(&plug);
>  	for (offset = start_offset; offset <= end_offset ; offset++) {
>  		/* Ok, do the async read-ahead now */
>  		page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
> @@ -391,6 +394,8 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
>  			continue;
>  		page_cache_release(page);
>  	}
> +	blk_finish_plug(&plug);
> +
>  	lru_add_drain();	/* Push any new pages onto the LRU now */
>  	return read_swap_cache_async(entry, gfp_mask, vma, addr);

AFAICT this affects tmpfs as well, and it would be interesting/useful/diligent
to check for performance improvements or regressions in that area.

And the patch doesn't help swapoff, in try_to_unuse().  Or any other callers
of swap_readpage(), if they exist.

The switch to explicit plugging might have caused swap regressions in other
areas, so perhaps a more extensive patch is needed.  But swapin_readahead()
covers most cases and a more extensive patch will work OK with this one, so I
guess we run with the simple patch for now.
* Re: [PATCH 1/2] swap: allow swap readahead to be merged
From: Christian Ehrhardt @ 2012-06-20 15:58 UTC
To: Andrew Morton; +Cc: linux-mm, axboe, hughd, minchan

On 06/06/2012 01:44 AM, Andrew Morton wrote:
> On Mon, 4 Jun 2012 10:33:22 +0200 ehrhardt@linux.vnet.ibm.com wrote:
>
>> From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
>>
>> Swap readahead works fine, but the I/O to disk is almost always done in
>> page-size requests, despite the fact that readahead submits 1<<page-cluster
>> pages at a time.
>> On older kernels the old per-device plugging behavior might have captured
>> this and merged the requests, but currently it all comes down to many more
>> I/Os than required.
>
> Yes, long ago we (ie: I) decided that swap I/O isn't sufficiently common to
> bother doing any fancy high-level aggregation: just toss it at the queue and
> use the general BIO merging.
>
>> [...]
>>
>> In a test setup with 16 swap disks, running blktrace on one of those disks
>> shows the improved merging:
>> [...]
>
> This is rather hard to understand.  How much faster did it get?

I got ~10% to ~40% more throughput in my cases and at the same time much
lower cpu consumption when broken down per transferred kilobyte (the majority
of that due to saved interrupts and better cache handling).
In a shared SAN others might get an additional benefit as well, because this
now causes less protocol overhead.

>> [...]
>
> AFAICT this affects tmpfs as well, and it would be interesting/useful/diligent
> to check for performance improvements or regressions in that area.

A quick test with fio doing 256k sequential writes showed an improvement of
9.1%, but since I'm not sure how big the noise in this test is, I'd be
cautious with that result. Unfortunately I didn't check cpu consumption - it
might be that with tmpfs that is where a bigger improvement could be seen.
Well, at least it didn't break - so that's a good result as well.

> And the patch doesn't help swapoff, in try_to_unuse().  Or any other callers
> of swap_readpage(), if they exist.
>
> The switch to explicit plugging might have caused swap regressions in other
> areas, so perhaps a more extensive patch is needed.  But swapin_readahead()
> covers most cases and a more extensive patch will work OK with this one, so I
> guess we run with the simple patch for now.

Yes, all the other swap areas might need re-tuning after the plugging changes
as well, but for example swapoff shouldn't be too performance critical, right?
As discussed before, I'd be more interested in getting the swap write-out path
to merge better as well. Eventually - as you said - a later, more complex
patch can follow and take all of these into account.

-- 
Grüsse / regards,
Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance
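For reference, a fio job along the lines of the quick tmpfs check mentioned
above might look like the sketch below; the mount point, file size and runtime
are assumptions rather than the original test parameters:

    mkdir -p /mnt/tmpfs
    mount -t tmpfs -o size=2g none /mnt/tmpfs

    # Sequential 256k writes into tmpfs; under memory pressure the written
    # pages have to be swapped out and later read back in.
    fio --name=tmpfs-seqwrite --directory=/mnt/tmpfs --rw=write --bs=256k \
        --size=1g --numjobs=1 --runtime=60 --time_based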
* [PATCH 0/2] swap: improve swap I/O rate
From: ehrhardt @ 2012-05-14 11:58 UTC
To: linux-mm; +Cc: axboe, Christian Ehrhardt

From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>

In a memory overcommitment scenario with KVM I ran into a lot of waits for
swap. While checking the I/O done on the swap disks I found that almost all
I/Os were issued as single-page 4k requests, despite the fact that swap-in
is a batch of 1<<page-cluster pages via swap readahead, and swap-out is a
list of pages written in shrink_page_list.

[1/2 swap-in improvement]
The readahead patch shows improvements of up to 50% swap throughput and much
happier guest systems; even when running at comparable throughput it saves a
lot of I/Os per second, leaving resources in the SAN for other consumers.

[2/2 documentation]
While doing so I also realized that the documentation for
/proc/sys/vm/page-cluster no longer matches the code.

[missing patch #3]
I tried to get a similar patch working for swap-out in shrink_page_list.
It worked in functional terms, but the additional merging was negligible.
Maybe the cond_resched triggers much more often than I expected; I'm open to
suggestions for improving the pageout I/O sizes as well.

Kind regards,
Christian Ehrhardt

Christian Ehrhardt (2):
  swap: allow swap readahead to be merged
  documentation: update how page-cluster affects swap I/O

 Documentation/sysctl/vm.txt |   12 ++++++++++--
 mm/swap_state.c             |    5 +++++
 2 files changed, 15 insertions(+), 2 deletions(-)
* [PATCH 1/2] swap: allow swap readahead to be merged
From: ehrhardt @ 2012-05-14 11:58 UTC
To: linux-mm; +Cc: axboe, Christian Ehrhardt

From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>

Swap readahead works fine, but the I/O to disk is almost always done in
page-size requests, despite the fact that readahead submits 1<<page-cluster
pages at a time.
On older kernels the old per-device plugging behavior might have captured
this and merged the requests, but currently it all comes down to many more
I/Os than required.

On a single device this might not be an issue, but as soon as a server runs
on shared SAN resources, saving I/Os not only improves swap-in throughput
but also lowers resource utilization.

With a load running KVM under heavy memory overcommitment (the hot memory
is 1.5 times the host memory), swapping throughput improves significantly
and the load feels more responsive as well as achieving more throughput.

In a test setup with 16 swap disks, running blktrace on one of those disks
shows the improved merging:
Prior:
 Reads Queued:     560,888,  2,243MiB   Writes Queued:     226,242, 904,968KiB
 Read Dispatches:  544,701,  2,243MiB   Write Dispatches:  159,318, 904,968KiB
 Reads Requeued:   0                    Writes Requeued:   0
 Reads Completed:  544,716,  2,243MiB   Writes Completed:  159,321, 904,980KiB
 Read Merges:      16,187,   64,748KiB  Write Merges:      61,744,  246,976KiB
 IO unplugs:       149,614              Timer unplugs:     2,940

With the patch:
 Reads Queued:     734,315,  2,937MiB   Writes Queued:     300,188, 1,200MiB
 Read Dispatches:  214,972,  2,937MiB   Write Dispatches:  215,176, 1,200MiB
 Reads Requeued:   0                    Writes Requeued:   0
 Reads Completed:  214,971,  2,937MiB   Writes Completed:  215,177, 1,200MiB
 Read Merges:      519,343,  2,077MiB   Write Merges:      73,325,  293,300KiB
 IO unplugs:       337,130              Timer unplugs:     11,184

Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
---
 mm/swap_state.c |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/mm/swap_state.c b/mm/swap_state.c
index 4c5ff7f..c85b559 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -14,6 +14,7 @@
 #include <linux/init.h>
 #include <linux/pagemap.h>
 #include <linux/backing-dev.h>
+#include <linux/blkdev.h>
 #include <linux/pagevec.h>
 #include <linux/migrate.h>
 #include <linux/page_cgroup.h>
@@ -376,6 +377,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	unsigned long offset = swp_offset(entry);
 	unsigned long start_offset, end_offset;
 	unsigned long mask = (1UL << page_cluster) - 1;
+	struct blk_plug plug;
 
 	/* Read a page_cluster sized and aligned cluster around offset. */
 	start_offset = offset & ~mask;
@@ -383,6 +385,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	if (!start_offset)	/* First page is swap header. */
 		start_offset++;
 
+	blk_start_plug(&plug);
 	for (offset = start_offset; offset <= end_offset ; offset++) {
 		/* Ok, do the async read-ahead now */
 		page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
@@ -391,6 +394,8 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 			continue;
 		page_cache_release(page);
 	}
+	blk_finish_plug(&plug);
+
 	lru_add_drain();	/* Push any new pages onto the LRU now */
 	return read_swap_cache_async(entry, gfp_mask, vma, addr);
 }
-- 
1.7.0.4
* Re: [PATCH 1/2] swap: allow swap readahead to be merged
From: Minchan Kim @ 2012-05-15 4:38 UTC
To: ehrhardt; +Cc: linux-mm, axboe, Hugh Dickins, Rik van Riel

On 05/14/2012 08:58 PM, ehrhardt@linux.vnet.ibm.com wrote:
> From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
>
> Swap readahead works fine, but the I/O to disk is almost always done in
> page-size requests, despite the fact that readahead submits 1<<page-cluster
> pages at a time.
> On older kernels the old per-device plugging behavior might have captured
> this and merged the requests, but currently it all comes down to many more
> I/Os than required.
> [...]
>
> Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>

Reviewed-by: Minchan Kim <minchan@kernel.org>

It does make sense to me.

> [...]

-- 
Kind regards,
Minchan Kim
* Re: [PATCH 1/2] swap: allow swap readahead to be merged
From: Rik van Riel @ 2012-05-15 17:43 UTC
To: ehrhardt; +Cc: linux-mm, axboe

On 05/14/2012 07:58 AM, ehrhardt@linux.vnet.ibm.com wrote:
> From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
>
> Swap readahead works fine, but the I/O to disk is almost always done in
> page-size requests, despite the fact that readahead submits 1<<page-cluster
> pages at a time.
> On older kernels the old per-device plugging behavior might have captured
> this and merged the requests, but currently it all comes down to many more
> I/Os than required.
>
> On a single device this might not be an issue, but as soon as a server runs
> on shared SAN resources, saving I/Os not only improves swap-in throughput
> but also lowers resource utilization.
>
> Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>

Acked-by: Rik van Riel <riel@redhat.com>