linux-mm.kvack.org archive mirror
* [PATCH -mm -v3 0/6] mm, swap: VMA based swap readahead
@ 2017-07-25  1:51 Huang, Ying
  2017-07-25  1:51 ` [PATCH -mm -v3 1/6] mm, swap: Add swap cache statistics sysfs interface Huang, Ying
                   ` (5 more replies)
  0 siblings, 6 replies; 13+ messages in thread
From: Huang, Ying @ 2017-07-25  1:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Huang, Ying, Johannes Weiner,
	Minchan Kim, Rik van Riel, Shaohua Li, Hugh Dickins,
	Fengguang Wu, Tim Chen, Dave Hansen

Swap readahead is an important mechanism for reducing swap-in
latency.  Although purely sequential memory access is not very
common for anonymous memory, spatial locality is still considered
valid.

In the original swap readahead implementation, consecutive blocks
on the swap device are read ahead based on a global estimate of
spatial locality.  But consecutive blocks on the swap device only
reflect the order of page reclaim; they do not necessarily reflect
the access pattern in virtual memory space.  In addition, different
tasks in the system may have different access patterns, which makes
the global spatial locality estimate inaccurate.

In this patchset, when a page fault occurs, the virtual pages near the
fault address are read ahead instead of the swap slots near the
faulting swap slot on the swap device.  This avoids reading ahead
unrelated swap slots.  At the same time, swap readahead is changed to
work per VMA instead of globally, so that the different access
patterns of different VMAs can be distinguished and a different
readahead policy can be applied to each.  The original core readahead
detection and scaling algorithm is reused, because it is an effective
algorithm for detecting spatial locality.
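
As a rough illustration of the window placement described above, the
following userspace C sketch mirrors the three cases handled by
swap_readahead_detect() in patch 4/6: an ascending access pattern, a
descending one, and everything else.  The names ra_range and ra_window
and the single lo/hi clamp are assumptions made for this example; the
kernel code clamps against both the VMA bounds and the PMD that maps
the fault address.

```c
/* Half-open range of virtual page numbers: [start, end). */
struct ra_range {
	unsigned long start;
	unsigned long end;
};

static unsigned long max_ul(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/*
 * Hypothetical model of the readahead window placement.
 *
 * prev_pfn: virtual page number of the previous swap-in fault in the VMA
 * fpfn:     virtual page number of the current fault
 * win:      window size in pages (at least 1)
 * lo, hi:   half-open bounds the window must stay inside
 *
 * Assumes fpfn >= win so that the subtractions cannot wrap.
 */
struct ra_range ra_window(unsigned long prev_pfn, unsigned long fpfn,
			  unsigned long win,
			  unsigned long lo, unsigned long hi)
{
	struct ra_range r;
	unsigned long start, end, left;

	if (fpfn == prev_pfn + 1) {
		/* Ascending access: read the pages after the fault. */
		start = fpfn;
		end = fpfn + win;
	} else if (prev_pfn == fpfn + 1) {
		/* Descending access: read the pages before the fault. */
		start = fpfn - win + 1;
		end = fpfn + 1;
	} else {
		/* No clear direction: center the window on the fault. */
		left = (win - 1) / 2;
		start = fpfn - left;
		end = fpfn + win - left;
	}

	r.start = max_ul(start, lo);
	r.end = min_ul(end, hi);
	return r;
}
```

For example, with a previous fault at page 99, a current fault at page
100 and an 8-page window, the sketch selects pages 100..107, unless the
bounds cut the window short.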

In addition to the swap readahead changes, new sysfs interfaces are
added to show the efficiency of the readahead algorithm and some other
swap statistics.

The new implementation incurs more small random reads.  On SSD, the
improved accuracy of the estimation and of the readahead target should
outweigh the potentially increased overhead, as illustrated by the
test results below.  But on HDD the overhead may outweigh the benefit,
so the original implementation is used by default there.

The tests and results are as follows,

Common test condition
=====================

Test Machine: Xeon E5 v3 (2 sockets, 72 threads, 32G RAM)
Swap device: NVMe disk

Micro-benchmark with combined access pattern
============================================

vm-scalability, sequential swap test case: 4 processes eat 50G of
virtual memory space and repeat sequential memory writes for 300
seconds.  The first round of writes triggers swap-out; the following
rounds trigger sequential swap-in and swap-out.

At the same time, the vm-scalability random swap test case runs in the
background: 8 processes eat 30G of virtual memory space and repeat
random memory writes for 300 seconds.  This triggers random swap-in in
the background.

This is a combined workload with sequential and random memory accesses
at the same time.  The results (for the sequential workload) are as
follows,

			Base		Optimized
			----		---------
throughput		345413 KB/s	414029 KB/s (+19.9%)
latency.average		97.14 us	61.06 us (-37.1%)
latency.50th		2 us		1 us
latency.60th		2 us		1 us
latency.70th		98 us		2 us
latency.80th		160 us		2 us
latency.90th		260 us		217 us
latency.95th		346 us		369 us
latency.99th		1.34 ms		1.09 ms
ra_hit%			52.69%		99.98%

The original swap readahead algorithm is confused by the background
random access workload, so its readahead hit rate is lower.  The
VMA-based readahead algorithm works much better.

Linpack
=======

The test memory size is bigger than RAM to trigger swapping.

			Base		Optimized
			----		---------
elapsed_time		393.49 s	329.88 s (-16.2%)
ra_hit%			86.21%		98.82%

The Linpack score shows no visible change between the base and
optimized kernels.  But the elapsed time is reduced and the readahead
hit rate is improved, so the optimized kernel runs better during the
startup and teardown stages.  The high absolute readahead hit rate
also shows that spatial locality still holds in some practical
workloads.

Changelogs:

v3:

- Rebased on latest -mm tree
- Use percpu_counter for swap readahead statistics per Dave Hansen's comment.

Best Regards,
Huang, Ying

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

* [PATCH -mm -v3 1/6] mm, swap: Add swap cache statistics sysfs interface
  2017-07-25  1:51 [PATCH -mm -v3 0/6] mm, swap: VMA based swap readahead Huang, Ying
@ 2017-07-25  1:51 ` Huang, Ying
  2017-07-25 20:42   ` Andrew Morton
  2017-07-25 21:05   ` Rik van Riel
  2017-07-25  1:51 ` [PATCH -mm -v3 2/6] mm, swap: Add swap readahead hit statistics Huang, Ying
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 13+ messages in thread
From: Huang, Ying @ 2017-07-25  1:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Huang Ying, Johannes Weiner, Minchan Kim,
	Rik van Riel, Shaohua Li, Hugh Dickins, Fengguang Wu, Tim Chen,
	Dave Hansen

From: Huang Ying <ying.huang@intel.com>

Currently the swap cache stats can be obtained only via sysrq, which
is inconvenient in some situations.  So a sysfs interface for the swap
cache stats is added.  The added sysfs directories/files are as
follows,

/sys/kernel/mm/swap
/sys/kernel/mm/swap/cache_find_total
/sys/kernel/mm/swap/cache_find_success
/sys/kernel/mm/swap/cache_add
/sys/kernel/mm/swap/cache_del
/sys/kernel/mm/swap/cache_pages

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
---
 mm/swap_state.c | 78 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 78 insertions(+)

diff --git a/mm/swap_state.c b/mm/swap_state.c
index b68c93014f50..a13bbf504e93 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -561,3 +561,81 @@ void exit_swap_address_space(unsigned int type)
 	synchronize_rcu();
 	kvfree(spaces);
 }
+
+#ifdef CONFIG_SYSFS
+static ssize_t swap_cache_pages_show(
+	struct kobject *kobj, struct kobj_attribute *attr, char *buf)
+{
+	return sprintf(buf, "%lu\n", total_swapcache_pages());
+}
+static struct kobj_attribute swap_cache_pages_attr =
+	__ATTR(cache_pages, 0444, swap_cache_pages_show, NULL);
+
+static ssize_t swap_cache_add_show(
+	struct kobject *kobj, struct kobj_attribute *attr, char *buf)
+{
+	return sprintf(buf, "%lu\n", swap_cache_info.add_total);
+}
+static struct kobj_attribute swap_cache_add_attr =
+	__ATTR(cache_add, 0444, swap_cache_add_show, NULL);
+
+static ssize_t swap_cache_del_show(
+	struct kobject *kobj, struct kobj_attribute *attr, char *buf)
+{
+	return sprintf(buf, "%lu\n", swap_cache_info.del_total);
+}
+static struct kobj_attribute swap_cache_del_attr =
+	__ATTR(cache_del, 0444, swap_cache_del_show, NULL);
+
+static ssize_t swap_cache_find_success_show(
+	struct kobject *kobj, struct kobj_attribute *attr, char *buf)
+{
+	return sprintf(buf, "%lu\n", swap_cache_info.find_success);
+}
+static struct kobj_attribute swap_cache_find_success_attr =
+	__ATTR(cache_find_success, 0444, swap_cache_find_success_show, NULL);
+
+static ssize_t swap_cache_find_total_show(
+	struct kobject *kobj, struct kobj_attribute *attr, char *buf)
+{
+	return sprintf(buf, "%lu\n", swap_cache_info.find_total);
+}
+static struct kobj_attribute swap_cache_find_total_attr =
+	__ATTR(cache_find_total, 0444, swap_cache_find_total_show, NULL);
+
+static struct attribute *swap_attrs[] = {
+	&swap_cache_pages_attr.attr,
+	&swap_cache_add_attr.attr,
+	&swap_cache_del_attr.attr,
+	&swap_cache_find_success_attr.attr,
+	&swap_cache_find_total_attr.attr,
+	NULL,
+};
+
+static struct attribute_group swap_attr_group = {
+	.attrs = swap_attrs,
+};
+
+static int __init swap_init_sysfs(void)
+{
+	int err;
+	struct kobject *swap_kobj;
+
+	swap_kobj = kobject_create_and_add("swap", mm_kobj);
+	if (!swap_kobj) {
+		pr_err("failed to create swap kobject\n");
+		return -ENOMEM;
+	}
+	err = sysfs_create_group(swap_kobj, &swap_attr_group);
+	if (err) {
+		pr_err("failed to register swap group\n");
+		goto delete_obj;
+	}
+	return 0;
+
+delete_obj:
+	kobject_put(swap_kobj);
+	return err;
+}
+subsys_initcall(swap_init_sysfs);
+#endif
-- 
2.13.2

* [PATCH -mm -v3 2/6] mm, swap: Add swap readahead hit statistics
  2017-07-25  1:51 [PATCH -mm -v3 0/6] mm, swap: VMA based swap readahead Huang, Ying
  2017-07-25  1:51 ` [PATCH -mm -v3 1/6] mm, swap: Add swap cache statistics sysfs interface Huang, Ying
@ 2017-07-25  1:51 ` Huang, Ying
  2017-07-25  1:51 ` [PATCH -mm -v3 3/6] mm, swap: Fix swap readahead marking Huang, Ying
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 13+ messages in thread
From: Huang, Ying @ 2017-07-25  1:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Huang Ying, Johannes Weiner, Minchan Kim,
	Rik van Riel, Shaohua Li, Hugh Dickins, Fengguang Wu, Tim Chen,
	Dave Hansen

From: Huang Ying <ying.huang@intel.com>

The statistics for total readahead pages and total readahead hits are
recorded and exported via the following sysfs interface.

/sys/kernel/mm/swap/ra_hits
/sys/kernel/mm/swap/ra_total

With them, the efficiency of swap readahead can be measured, so that
the swap readahead algorithm and its parameters can be tuned
accordingly.
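
As a sketch of how the two counters might be consumed from userspace,
the C helpers below read a decimal counter from a file and turn a
hits/total pair into a percentage.  The function names are made up for
this example, and treating the hit rate as ra_hits / ra_total is an
assumption about how the numbers are meant to be combined.

```c
#include <stdio.h>

/*
 * Readahead hit rate in percent, or -1.0 if no page has been read
 * ahead yet (total is zero or negative).
 */
double swap_ra_hit_percent(long long hits, long long total)
{
	if (total <= 0)
		return -1.0;
	return 100.0 * (double)hits / (double)total;
}

/*
 * Read one decimal counter from a sysfs-style file.
 * Returns 0 on success and -1 if the file cannot be opened or parsed.
 */
int read_counter(const char *path, long long *value)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = (fscanf(f, "%lld", value) == 1);
	fclose(f);
	return ok ? 0 : -1;
}
```

A caller would read /sys/kernel/mm/swap/ra_hits and
/sys/kernel/mm/swap/ra_total with read_counter() and pass both values
to swap_ra_hit_percent().  To measure a single workload, sample the
counters before and after the run and use the differences, since the
counters are cumulative.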

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
---
 mm/swap_state.c | 38 ++++++++++++++++++++++++++++++++++++--
 1 file changed, 36 insertions(+), 2 deletions(-)

diff --git a/mm/swap_state.c b/mm/swap_state.c
index a13bbf504e93..8be7153967ed 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -20,6 +20,7 @@
 #include <linux/vmalloc.h>
 #include <linux/swap_slots.h>
 #include <linux/huge_mm.h>
+#include <linux/percpu_counter.h>
 
 #include <asm/pgtable.h>
 
@@ -74,6 +75,16 @@ unsigned long total_swapcache_pages(void)
 }
 
 static atomic_t swapin_readahead_hits = ATOMIC_INIT(4);
+static struct percpu_counter swapin_readahead_hits_total;
+static struct percpu_counter swapin_readahead_total;
+
+static int __init swap_init(void)
+{
+	percpu_counter_init(&swapin_readahead_hits_total, 0, GFP_KERNEL);
+	percpu_counter_init(&swapin_readahead_total, 0, GFP_KERNEL);
+	return 0;
+}
+subsys_initcall(swap_init);
 
 void show_swap_cache_info(void)
 {
@@ -305,8 +315,10 @@ struct page * lookup_swap_cache(swp_entry_t entry)
 
 	if (page && likely(!PageTransCompound(page))) {
 		INC_CACHE_INFO(find_success);
-		if (TestClearPageReadahead(page))
+		if (TestClearPageReadahead(page)) {
 			atomic_inc(&swapin_readahead_hits);
+			percpu_counter_inc(&swapin_readahead_hits_total);
+		}
 	}
 
 	INC_CACHE_INFO(find_total);
@@ -516,8 +528,11 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 						gfp_mask, vma, addr, false);
 		if (!page)
 			continue;
-		if (offset != entry_offset && likely(!PageTransCompound(page)))
+		if (offset != entry_offset &&
+		    likely(!PageTransCompound(page))) {
 			SetPageReadahead(page);
+			percpu_counter_inc(&swapin_readahead_total);
+		}
 		put_page(page);
 	}
 	blk_finish_plug(&plug);
@@ -603,12 +618,31 @@ static ssize_t swap_cache_find_total_show(
 static struct kobj_attribute swap_cache_find_total_attr =
 	__ATTR(cache_find_total, 0444, swap_cache_find_total_show, NULL);
 
+static ssize_t swap_readahead_hits_show(
+	struct kobject *kobj, struct kobj_attribute *attr, char *buf)
+{
+	return sprintf(buf, "%lld\n",
+		       percpu_counter_sum(&swapin_readahead_hits_total));
+}
+static struct kobj_attribute swap_readahead_hits_attr =
+	__ATTR(ra_hits, 0444, swap_readahead_hits_show, NULL);
+
+static ssize_t swap_readahead_total_show(
+	struct kobject *kobj, struct kobj_attribute *attr, char *buf)
+{
+	return sprintf(buf, "%lld\n", percpu_counter_sum(&swapin_readahead_total));
+}
+static struct kobj_attribute swap_readahead_total_attr =
+	__ATTR(ra_total, 0444, swap_readahead_total_show, NULL);
+
 static struct attribute *swap_attrs[] = {
 	&swap_cache_pages_attr.attr,
 	&swap_cache_add_attr.attr,
 	&swap_cache_del_attr.attr,
 	&swap_cache_find_success_attr.attr,
 	&swap_cache_find_total_attr.attr,
+	&swap_readahead_hits_attr.attr,
+	&swap_readahead_total_attr.attr,
 	NULL,
 };
 
-- 
2.13.2

* [PATCH -mm -v3 3/6] mm, swap: Fix swap readahead marking
  2017-07-25  1:51 [PATCH -mm -v3 0/6] mm, swap: VMA based swap readahead Huang, Ying
  2017-07-25  1:51 ` [PATCH -mm -v3 1/6] mm, swap: Add swap cache statistics sysfs interface Huang, Ying
  2017-07-25  1:51 ` [PATCH -mm -v3 2/6] mm, swap: Add swap readahead hit statistics Huang, Ying
@ 2017-07-25  1:51 ` Huang, Ying
  2017-07-25  1:51 ` [PATCH -mm -v3 4/6] mm, swap: VMA based swap readahead Huang, Ying
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 13+ messages in thread
From: Huang, Ying @ 2017-07-25  1:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Huang Ying, Johannes Weiner, Minchan Kim,
	Rik van Riel, Shaohua Li, Hugh Dickins, Fengguang Wu, Tim Chen,
	Dave Hansen

From: Huang Ying <ying.huang@intel.com>

In the original implementation, pages that already exist in the swap
cache (that is, pages not newly read ahead) may be marked as readahead
pages.  This makes the swap readahead statistics wrong and also
influences the swap readahead algorithm.

This is fixed by marking a page as a readahead page only if it is
newly allocated and read from the disk.
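
The rule can be illustrated with a toy userspace model.  Everything in
the sketch below (toy_page, toy_get_page, toy_readahead, NR_SLOTS) is
hypothetical and only stands in for the swap cache and the
PageReadahead flag; it is not the kernel code.

```c
#include <stdbool.h>
#include <stddef.h>

#define NR_SLOTS 8

struct toy_page {
	bool present;	/* already in the toy swap cache */
	bool readahead;	/* stands in for the PageReadahead flag */
};

/*
 * Return the page for slot idx and report through *allocated whether
 * this call had to bring it into the cache.
 */
static struct toy_page *toy_get_page(struct toy_page *cache, size_t idx,
				     bool *allocated)
{
	*allocated = !cache[idx].present;
	cache[idx].present = true;
	return &cache[idx];
}

/*
 * Read ahead slots start..end (inclusive) around fault_idx.
 * Returns the number of pages newly marked as readahead pages.
 */
unsigned int toy_readahead(struct toy_page *cache, size_t start,
			   size_t end, size_t fault_idx)
{
	unsigned int marked = 0;
	bool allocated;
	size_t idx;

	for (idx = start; idx <= end && idx < NR_SLOTS; idx++) {
		struct toy_page *page = toy_get_page(cache, idx, &allocated);

		/* Pages that were already cached were not read ahead here. */
		if (!allocated)
			continue;
		/* The faulting page itself is a demand read, not readahead. */
		if (idx != fault_idx) {
			page->readahead = true;
			marked++;
		}
	}
	return marked;
}
```

With slot 2 already cached and a fault on slot 1, reading ahead slots
0..3 marks only slots 0 and 3, so a later hit on slot 2 is not counted
as a readahead hit.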

When testing with linpack, the swap readahead hit rate increased from
~66% to ~86% after the fix.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
---
 mm/swap_state.c | 18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/mm/swap_state.c b/mm/swap_state.c
index 8be7153967ed..d4d33c43ed36 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -508,7 +508,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	unsigned long start_offset, end_offset;
 	unsigned long mask;
 	struct blk_plug plug;
-	bool do_poll = true;
+	bool do_poll = true, page_allocated;
 
 	mask = swapin_nr_pages(offset) - 1;
 	if (!mask)
@@ -524,14 +524,18 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	blk_start_plug(&plug);
 	for (offset = start_offset; offset <= end_offset ; offset++) {
 		/* Ok, do the async read-ahead now */
-		page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
-						gfp_mask, vma, addr, false);
+		page = __read_swap_cache_async(
+			swp_entry(swp_type(entry), offset),
+			gfp_mask, vma, addr, &page_allocated);
 		if (!page)
 			continue;
-		if (offset != entry_offset &&
-		    likely(!PageTransCompound(page))) {
-			SetPageReadahead(page);
-			percpu_counter_inc(&swapin_readahead_total);
+		if (page_allocated) {
+			swap_readpage(page, false);
+			if (offset != entry_offset &&
+			    likely(!PageTransCompound(page))) {
+				SetPageReadahead(page);
+				percpu_counter_inc(&swapin_readahead_total);
+			}
 		}
 		put_page(page);
 	}
-- 
2.13.2

* [PATCH -mm -v3 4/6] mm, swap: VMA based swap readahead
  2017-07-25  1:51 [PATCH -mm -v3 0/6] mm, swap: VMA based swap readahead Huang, Ying
                   ` (2 preceding siblings ...)
  2017-07-25  1:51 ` [PATCH -mm -v3 3/6] mm, swap: Fix swap readahead marking Huang, Ying
@ 2017-07-25  1:51 ` Huang, Ying
  2017-07-25  1:51 ` [PATCH -mm -v3 5/6] mm, swap: Add sysfs interface for " Huang, Ying
  2017-07-25  1:51 ` [PATCH -mm -v3 6/6] mm, swap: Don't use VMA based swap readahead if HDD is used as swap Huang, Ying
  5 siblings, 0 replies; 13+ messages in thread
From: Huang, Ying @ 2017-07-25  1:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Huang Ying, Johannes Weiner, Minchan Kim,
	Rik van Riel, Shaohua Li, Hugh Dickins, Fengguang Wu, Tim Chen,
	Dave Hansen

From: Huang Ying <ying.huang@intel.com>

Swap readahead is an important mechanism for reducing swap-in
latency.  Although purely sequential memory access is not very
common for anonymous memory, spatial locality is still considered
valid.

In the original swap readahead implementation, consecutive blocks
on the swap device are read ahead based on a global estimate of
spatial locality.  But consecutive blocks on the swap device only
reflect the order of page reclaim; they do not necessarily reflect
the access pattern in virtual memory.  In addition, different tasks in
the system may have different access patterns, which makes the global
spatial locality estimate inaccurate.

In this patch, when a page fault occurs, the virtual pages near the
fault address are read ahead instead of the swap slots near the
faulting swap slot on the swap device.  This avoids reading ahead
unrelated swap slots.  At the same time, swap readahead is changed to
work per VMA instead of globally, so that the different access
patterns of different VMAs can be distinguished and a different
readahead policy can be applied to each.  The original core readahead
detection and scaling algorithm is reused, because it is an effective
algorithm for detecting spatial locality.
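
For readers who want to experiment with the reused scaling heuristic
outside the kernel, the sketch below restates __swapin_nr_pages() from
the diff in this patch as a standalone C function.  The name
ra_window_pages is chosen for this example; the parameters follow the
kernel helper, with unsigned types used throughout for simplicity.

```c
/*
 * Standalone restatement of the readahead window scaling heuristic.
 *
 * prev_offset: position of the previous fault (swap offset or
 *              virtual page number)
 * offset:      position of the current fault
 * hits:        readahead hits seen since the window was last sized
 * max_pages:   upper bound on the window size
 * prev_win:    size of the previous window
 */
unsigned int ra_window_pages(unsigned long prev_offset, unsigned long offset,
			     unsigned int hits, unsigned int max_pages,
			     unsigned int prev_win)
{
	unsigned int pages = hits + 2;
	unsigned int last_ra;

	if (pages == 2) {
		/*
		 * No hits to judge by: keep a two-page window only if the
		 * fault is adjacent to the previous one.
		 */
		if (offset != prev_offset + 1 && offset != prev_offset - 1)
			pages = 1;
	} else {
		/* Round up to a power of two, starting from 4. */
		unsigned int roundup = 4;

		while (roundup < pages)
			roundup <<= 1;
		pages = roundup;
	}

	if (pages > max_pages)
		pages = max_pages;

	/* Don't shrink the window too fast. */
	last_ra = prev_win / 2;
	if (pages < last_ra)
		pages = last_ra;

	return pages;
}
```

With max_pages set to 8, a non-adjacent fault with no hits and no
previous window yields a window of 1 page, while the same fault after
an 8-page window yields 4 pages, because the window is at most halved
per step.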

The tests and results are as follows,

Common test condition
=====================

Test Machine: Xeon E5 v3 (2 sockets, 72 threads, 32G RAM)
Swap device: NVMe disk

Micro-benchmark with combined access pattern
============================================

vm-scalability, sequential swap test case: 4 processes eat 50G of
virtual memory space and repeat sequential memory writes for 300
seconds.  The first round of writes triggers swap-out; the following
rounds trigger sequential swap-in and swap-out.

At the same time, the vm-scalability random swap test case runs in the
background: 8 processes eat 30G of virtual memory space and repeat
random memory writes for 300 seconds.  This triggers random swap-in in
the background.

This is a combined workload with sequential and random memory accesses
at the same time.  The results (for the sequential workload) are as
follows,

			Base		Optimized
			----		---------
throughput		345413 KB/s	414029 KB/s (+19.9%)
latency.average		97.14 us	61.06 us (-37.1%)
latency.50th		2 us		1 us
latency.60th		2 us		1 us
latency.70th		98 us		2 us
latency.80th		160 us		2 us
latency.90th		260 us		217 us
latency.95th		346 us		369 us
latency.99th		1.34 ms		1.09 ms
ra_hit%			52.69%		99.98%

The original swap readahead algorithm is confused by the background
random access workload, so its readahead hit rate is lower.  The
VMA-based readahead algorithm works much better.

Linpack
=======

The test memory size is bigger than RAM to trigger swapping.

			Base		Optimized
			----		---------
elapsed_time		393.49 s	329.88 s (-16.2%)
ra_hit%			86.21%		98.82%

The Linpack score shows no visible change between the base and
optimized kernels.  But the elapsed time is reduced and the readahead
hit rate is improved, so the optimized kernel runs better during the
startup and teardown stages.  The high absolute readahead hit rate
also shows that spatial locality still holds in some practical
workloads.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
---
 include/linux/mm_types.h |   1 +
 include/linux/swap.h     |  57 ++++++++++++-
 mm/memory.c              |  23 +++--
 mm/shmem.c               |   2 +-
 mm/swap_state.c          | 215 +++++++++++++++++++++++++++++++++++++++++++----
 5 files changed, 273 insertions(+), 25 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index ff151814a02d..b8e623430a17 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -335,6 +335,7 @@ struct vm_area_struct {
 	struct file * vm_file;		/* File we map to (can be NULL). */
 	void * vm_private_data;		/* was vm_pte (shared mem) */
 
+	atomic_long_t swap_readahead_info;
 #ifndef CONFIG_MMU
 	struct vm_region *vm_region;	/* NOMMU mapping region */
 #endif
diff --git a/include/linux/swap.h b/include/linux/swap.h
index d83d28e53e62..9e5cfb64b5e8 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -250,6 +250,25 @@ struct swap_info_struct {
 	struct swap_cluster_list discard_clusters; /* discard clusters list */
 };
 
+#ifdef CONFIG_64BIT
+#define SWAP_RA_ORDER_CEILING	5
+#else
+/* Avoid stack overflow, because we need to save part of page table */
+#define SWAP_RA_ORDER_CEILING	3
+#define SWAP_RA_PTE_CACHE_SIZE	(1 << SWAP_RA_ORDER_CEILING)
+#endif
+
+struct vma_swap_readahead {
+	unsigned short win;
+	unsigned short offset;
+	unsigned short nr_pte;
+#ifdef CONFIG_64BIT
+	pte_t *ptes;
+#else
+	pte_t ptes[SWAP_RA_PTE_CACHE_SIZE];
+#endif
+};
+
 /* linux/mm/workingset.c */
 void *workingset_eviction(struct address_space *mapping, struct page *page);
 bool workingset_refault(void *shadow);
@@ -349,6 +368,7 @@ int generic_swapfile_activate(struct swap_info_struct *, struct file *,
 #define SWAP_ADDRESS_SPACE_SHIFT	14
 #define SWAP_ADDRESS_SPACE_PAGES	(1 << SWAP_ADDRESS_SPACE_SHIFT)
 extern struct address_space *swapper_spaces[];
+extern bool swap_vma_readahead;
 #define swap_address_space(entry)			    \
 	(&swapper_spaces[swp_type(entry)][swp_offset(entry) \
 		>> SWAP_ADDRESS_SPACE_SHIFT])
@@ -361,7 +381,9 @@ extern void __delete_from_swap_cache(struct page *);
 extern void delete_from_swap_cache(struct page *);
 extern void free_page_and_swap_cache(struct page *);
 extern void free_pages_and_swap_cache(struct page **, int);
-extern struct page *lookup_swap_cache(swp_entry_t);
+extern struct page *lookup_swap_cache(swp_entry_t entry,
+				      struct vm_area_struct *vma,
+				      unsigned long addr);
 extern struct page *read_swap_cache_async(swp_entry_t, gfp_t,
 			struct vm_area_struct *vma, unsigned long addr,
 			bool do_poll);
@@ -371,6 +393,17 @@ extern struct page *__read_swap_cache_async(swp_entry_t, gfp_t,
 extern struct page *swapin_readahead(swp_entry_t, gfp_t,
 			struct vm_area_struct *vma, unsigned long addr);
 
+extern struct page *swap_readahead_detect(struct vm_fault *vmf,
+					  struct vma_swap_readahead *swap_ra);
+extern struct page *do_swap_page_readahead(swp_entry_t fentry, gfp_t gfp_mask,
+					   struct vm_fault *vmf,
+					   struct vma_swap_readahead *swap_ra);
+
+static inline bool swap_use_vma_readahead(void)
+{
+	return READ_ONCE(swap_vma_readahead);
+}
+
 /* linux/mm/swapfile.c */
 extern atomic_long_t nr_swap_pages;
 extern long total_swap_pages;
@@ -465,12 +498,32 @@ static inline struct page *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
 	return NULL;
 }
 
+static inline bool swap_use_vma_readahead(void)
+{
+	return false;
+}
+
+static inline struct page *swap_readahead_detect(
+	struct vm_fault *vmf, struct vma_swap_readahead *swap_ra)
+{
+	return NULL;
+}
+
+static inline struct page *do_swap_page_readahead(
+	swp_entry_t fentry, gfp_t gfp_mask,
+	struct vm_fault *vmf, struct vma_swap_readahead *swap_ra)
+{
+	return NULL;
+}
+
 static inline int swap_writepage(struct page *p, struct writeback_control *wbc)
 {
 	return 0;
 }
 
-static inline struct page *lookup_swap_cache(swp_entry_t swp)
+static inline struct page *lookup_swap_cache(swp_entry_t swp,
+					     struct vm_area_struct *vma,
+					     unsigned long addr)
 {
 	return NULL;
 }
diff --git a/mm/memory.c b/mm/memory.c
index 4de8280cc46f..7ffbf617839c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2683,16 +2683,23 @@ noinline int swap_lock_page_or_retry(struct page *page, struct mm_struct *mm,
 int do_swap_page(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
-	struct page *page, *swapcache;
+	struct page *page = NULL, *swapcache;
 	struct mem_cgroup *memcg;
+	struct vma_swap_readahead swap_ra;
 	swp_entry_t entry;
 	pte_t pte;
 	int locked;
 	int exclusive = 0;
 	int ret = 0;
+	bool vma_readahead = swap_use_vma_readahead();
 
-	if (!pte_unmap_same(vma->vm_mm, vmf->pmd, vmf->pte, vmf->orig_pte))
+	if (vma_readahead)
+		page = swap_readahead_detect(vmf, &swap_ra);
+	if (!pte_unmap_same(vma->vm_mm, vmf->pmd, vmf->pte, vmf->orig_pte)) {
+		if (page)
+			put_page(page);
 		goto out;
+	}
 
 	entry = pte_to_swp_entry(vmf->orig_pte);
 	if (unlikely(non_swap_entry(entry))) {
@@ -2708,10 +2715,16 @@ int do_swap_page(struct vm_fault *vmf)
 		goto out;
 	}
 	delayacct_set_flag(DELAYACCT_PF_SWAPIN);
-	page = lookup_swap_cache(entry);
+	if (!page)
+		page = lookup_swap_cache(entry, vma_readahead ? vma : NULL,
+					 vmf->address);
 	if (!page) {
-		page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, vma,
-					vmf->address);
+		if (vma_readahead)
+			page = do_swap_page_readahead(entry,
+				GFP_HIGHUSER_MOVABLE, vmf, &swap_ra);
+		else
+			page = swapin_readahead(entry,
+				GFP_HIGHUSER_MOVABLE, vma, vmf->address);
 		if (!page) {
 			/*
 			 * Back out if somebody else faulted in this pte
diff --git a/mm/shmem.c b/mm/shmem.c
index b0aa6075d164..42c9f039b868 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1640,7 +1640,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
 
 	if (swap.val) {
 		/* Look it up and read it in.. */
-		page = lookup_swap_cache(swap);
+		page = lookup_swap_cache(swap, NULL, 0);
 		if (!page) {
 			/* Or update major stats only when swapin succeeds?? */
 			if (fault_type) {
diff --git a/mm/swap_state.c b/mm/swap_state.c
index d4d33c43ed36..730477d86f15 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -38,6 +38,29 @@ static const struct address_space_operations swap_aops = {
 
 struct address_space *swapper_spaces[MAX_SWAPFILES];
 static unsigned int nr_swapper_spaces[MAX_SWAPFILES];
+bool swap_vma_readahead = true;
+
+#define SWAP_RA_MAX_ORDER_DEFAULT	3
+
+static int swap_ra_max_order = SWAP_RA_MAX_ORDER_DEFAULT;
+
+#define SWAP_RA_WIN_SHIFT	(PAGE_SHIFT / 2)
+#define SWAP_RA_HITS_MASK	((1UL << SWAP_RA_WIN_SHIFT) - 1)
+#define SWAP_RA_HITS_MAX	SWAP_RA_HITS_MASK
+#define SWAP_RA_WIN_MASK	(~PAGE_MASK & ~SWAP_RA_HITS_MASK)
+
+#define SWAP_RA_HITS(v)		((v) & SWAP_RA_HITS_MASK)
+#define SWAP_RA_WIN(v)		(((v) & SWAP_RA_WIN_MASK) >> SWAP_RA_WIN_SHIFT)
+#define SWAP_RA_ADDR(v)		((v) & PAGE_MASK)
+
+#define SWAP_RA_VAL(addr, win, hits)				\
+	(((addr) & PAGE_MASK) |					\
+	 (((win) << SWAP_RA_WIN_SHIFT) & SWAP_RA_WIN_MASK) |	\
+	 ((hits) & SWAP_RA_HITS_MASK))
+
+/* Initial readahead hits is 4 to start up with a small window */
+#define GET_SWAP_RA_VAL(vma)					\
+	(atomic_long_read(&(vma)->swap_readahead_info) ? : 4)
 
 #define INC_CACHE_INFO(x)	do { swap_cache_info.x++; } while (0)
 #define ADD_CACHE_INFO(x, nr)	do { swap_cache_info.x += (nr); } while (0)
@@ -307,21 +330,36 @@ void free_pages_and_swap_cache(struct page **pages, int nr)
  * lock getting page table operations atomic even if we drop the page
  * lock before returning.
  */
-struct page * lookup_swap_cache(swp_entry_t entry)
+struct page *lookup_swap_cache(swp_entry_t entry, struct vm_area_struct *vma,
+			       unsigned long addr)
 {
 	struct page *page;
+	unsigned long ra_info;
+	int win, hits, readahead;
 
 	page = find_get_page(swap_address_space(entry), swp_offset(entry));
 
-	if (page && likely(!PageTransCompound(page))) {
+	INC_CACHE_INFO(find_total);
+	if (page) {
 		INC_CACHE_INFO(find_success);
-		if (TestClearPageReadahead(page)) {
-			atomic_inc(&swapin_readahead_hits);
+		if (unlikely(PageTransCompound(page)))
+			return page;
+		readahead = TestClearPageReadahead(page);
+		if (vma) {
+			ra_info = GET_SWAP_RA_VAL(vma);
+			win = SWAP_RA_WIN(ra_info);
+			hits = SWAP_RA_HITS(ra_info);
+			if (readahead)
+				hits = min_t(int, hits + 1, SWAP_RA_HITS_MAX);
+			atomic_long_set(&vma->swap_readahead_info,
+					SWAP_RA_VAL(addr, win, hits));
+		}
+		if (readahead) {
 			percpu_counter_inc(&swapin_readahead_hits_total);
+			if (!vma)
+				atomic_inc(&swapin_readahead_hits);
 		}
 	}
-
-	INC_CACHE_INFO(find_total);
 	return page;
 }
 
@@ -436,22 +474,20 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 	return retpage;
 }
 
-static unsigned long swapin_nr_pages(unsigned long offset)
+static unsigned int __swapin_nr_pages(unsigned long prev_offset,
+				      unsigned long offset,
+				      int hits,
+				      int max_pages,
+				      int prev_win)
 {
-	static unsigned long prev_offset;
-	unsigned int pages, max_pages, last_ra;
-	static atomic_t last_readahead_pages;
-
-	max_pages = 1 << READ_ONCE(page_cluster);
-	if (max_pages <= 1)
-		return 1;
+	unsigned int pages, last_ra;
 
 	/*
 	 * This heuristic has been found to work well on both sequential and
 	 * random loads, swapping to hard disk or to SSD: please don't ask
 	 * what the "+ 2" means, it just happens to work well, that's all.
 	 */
-	pages = atomic_xchg(&swapin_readahead_hits, 0) + 2;
+	pages = hits + 2;
 	if (pages == 2) {
 		/*
 		 * We can have no readahead hits to judge by: but must not get
@@ -460,7 +496,6 @@ static unsigned long swapin_nr_pages(unsigned long offset)
 		 */
 		if (offset != prev_offset + 1 && offset != prev_offset - 1)
 			pages = 1;
-		prev_offset = offset;
 	} else {
 		unsigned int roundup = 4;
 		while (roundup < pages)
@@ -472,9 +507,28 @@ static unsigned long swapin_nr_pages(unsigned long offset)
 		pages = max_pages;
 
 	/* Don't shrink readahead too fast */
-	last_ra = atomic_read(&last_readahead_pages) / 2;
+	last_ra = prev_win / 2;
 	if (pages < last_ra)
 		pages = last_ra;
+
+	return pages;
+}
+
+static unsigned long swapin_nr_pages(unsigned long offset)
+{
+	static unsigned long prev_offset;
+	unsigned int hits, pages, max_pages;
+	static atomic_t last_readahead_pages;
+
+	max_pages = 1 << READ_ONCE(page_cluster);
+	if (max_pages <= 1)
+		return 1;
+
+	hits = atomic_xchg(&swapin_readahead_hits, 0);
+	pages = __swapin_nr_pages(prev_offset, offset, hits, max_pages,
+				  atomic_read(&last_readahead_pages));
+	if (!hits)
+		prev_offset = offset;
 	atomic_set(&last_readahead_pages, pages);
 
 	return pages;
@@ -581,6 +635,133 @@ void exit_swap_address_space(unsigned int type)
 	kvfree(spaces);
 }
 
+static inline void swap_ra_clamp_pfn(struct vm_area_struct *vma,
+				     unsigned long faddr,
+				     unsigned long lpfn,
+				     unsigned long rpfn,
+				     unsigned long *start,
+				     unsigned long *end)
+{
+	*start = max3(lpfn, PFN_DOWN(vma->vm_start),
+		      PFN_DOWN(faddr & PMD_MASK));
+	*end = min3(rpfn, PFN_DOWN(vma->vm_end),
+		    PFN_DOWN((faddr & PMD_MASK) + PMD_SIZE));
+}
+
+struct page *swap_readahead_detect(struct vm_fault *vmf,
+				   struct vma_swap_readahead *swap_ra)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	unsigned long swap_ra_info;
+	struct page *page;
+	swp_entry_t entry;
+	unsigned long faddr, pfn, fpfn;
+	unsigned long start, end;
+	pte_t *pte;
+	unsigned int max_win, hits, prev_win, win, left;
+#ifndef CONFIG_64BIT
+	pte_t *tpte;
+#endif
+
+	faddr = vmf->address;
+	entry = pte_to_swp_entry(vmf->orig_pte);
+	if ((unlikely(non_swap_entry(entry))))
+		return NULL;
+	page = lookup_swap_cache(entry, vma, faddr);
+	if (page)
+		return page;
+
+	max_win = 1 << READ_ONCE(swap_ra_max_order);
+	if (max_win == 1) {
+		swap_ra->win = 1;
+		return NULL;
+	}
+
+	fpfn = PFN_DOWN(faddr);
+	swap_ra_info = GET_SWAP_RA_VAL(vma);
+	pfn = PFN_DOWN(SWAP_RA_ADDR(swap_ra_info));
+	prev_win = SWAP_RA_WIN(swap_ra_info);
+	hits = SWAP_RA_HITS(swap_ra_info);
+	swap_ra->win = win = __swapin_nr_pages(pfn, fpfn, hits,
+					       max_win, prev_win);
+	atomic_long_set(&vma->swap_readahead_info,
+			SWAP_RA_VAL(faddr, win, 0));
+
+	if (win == 1)
+		return NULL;
+
+	/* Copy the PTEs because the page table may be unmapped */
+	if (fpfn == pfn + 1)
+		swap_ra_clamp_pfn(vma, faddr, fpfn, fpfn + win, &start, &end);
+	else if (pfn == fpfn + 1)
+		swap_ra_clamp_pfn(vma, faddr, fpfn - win + 1, fpfn + 1,
+				  &start, &end);
+	else {
+		left = (win - 1) / 2;
+		swap_ra_clamp_pfn(vma, faddr, fpfn - left, fpfn + win - left,
+				  &start, &end);
+	}
+	swap_ra->nr_pte = end - start;
+	swap_ra->offset = fpfn - start;
+	pte = vmf->pte - swap_ra->offset;
+#ifdef CONFIG_64BIT
+	swap_ra->ptes = pte;
+#else
+	tpte = swap_ra->ptes;
+	for (pfn = start; pfn != end; pfn++)
+		*tpte++ = *pte++;
+#endif
+
+	return NULL;
+}
+
+struct page *do_swap_page_readahead(swp_entry_t fentry, gfp_t gfp_mask,
+				    struct vm_fault *vmf,
+				    struct vma_swap_readahead *swap_ra)
+{
+	struct blk_plug plug;
+	struct vm_area_struct *vma = vmf->vma;
+	struct page *page;
+	pte_t *pte, pentry;
+	swp_entry_t entry;
+	unsigned int i;
+	bool page_allocated;
+
+	if (swap_ra->win == 1)
+		goto skip;
+
+	blk_start_plug(&plug);
+	for (i = 0, pte = swap_ra->ptes; i < swap_ra->nr_pte;
+	     i++, pte++) {
+		pentry = *pte;
+		if (pte_none(pentry))
+			continue;
+		if (pte_present(pentry))
+			continue;
+		entry = pte_to_swp_entry(pentry);
+		if (unlikely(non_swap_entry(entry)))
+			continue;
+		page = __read_swap_cache_async(entry, gfp_mask, vma,
+					       vmf->address, &page_allocated);
+		if (!page)
+			continue;
+		if (page_allocated) {
+			swap_readpage(page, false);
+			if (i != swap_ra->offset &&
+			    likely(!PageTransCompound(page))) {
+				SetPageReadahead(page);
+				percpu_counter_inc(&swapin_readahead_total);
+			}
+		}
+		put_page(page);
+	}
+	blk_finish_plug(&plug);
+	lru_add_drain();
+skip:
+	return read_swap_cache_async(fentry, gfp_mask, vma, vmf->address,
+				     swap_ra->win == 1);
+}
+
 #ifdef CONFIG_SYSFS
 static ssize_t swap_cache_pages_show(
 	struct kobject *kobj, struct kobj_attribute *attr, char *buf)
-- 
2.13.2
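
The window-sizing logic that this patch factors out into
__swapin_nr_pages() can be tried in user space.  The sketch below is a
model, not the kernel code: swapin_window() is a hypothetical name, and
the lines hidden between the diff hunks (the body of the roundup loop
and the clamp to max_pages) are filled in by assumption from the
visible context.

```c
#include <assert.h>

/*
 * User-space model of __swapin_nr_pages().  The roundup loop body and
 * the max_pages clamp are assumptions; the diff does not show them.
 */
static unsigned int swapin_window(unsigned long prev_offset,
				  unsigned long offset,
				  int hits, int max_pages, int prev_win)
{
	unsigned int pages, last_ra;

	assert(max_pages >= 1 && hits >= 0 && prev_win >= 0);

	pages = hits + 2;
	if (pages == 2) {
		/* No hits: keep a 2-page window only for adjacent accesses. */
		if (offset != prev_offset + 1 && offset != prev_offset - 1)
			pages = 1;
	} else {
		/* Some hits: round hits + 2 up to a power of two, minimum 4. */
		unsigned int roundup = 4;

		while (roundup < pages)
			roundup <<= 1;
		pages = roundup;
	}

	if (pages > (unsigned int)max_pages)
		pages = max_pages;

	/* Don't shrink readahead too fast: at most halve it per fault. */
	last_ra = prev_win / 2;
	if (pages < last_ra)
		pages = last_ra;

	return pages;
}
```

With max_pages = 8, a fault with no hits at a non-adjacent offset gives
a window of 1, a single hit gives 4, and a previous window of 8 keeps
the result from dropping below 4.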

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

* [PATCH -mm -v3 5/6] mm, swap: Add sysfs interface for VMA based swap readahead
  2017-07-25  1:51 [PATCH -mm -v3 0/6] mm, swap: VMA based swap readahead Huang, Ying
                   ` (3 preceding siblings ...)
  2017-07-25  1:51 ` [PATCH -mm -v3 4/6] mm, swap: VMA based swap readahead Huang, Ying
@ 2017-07-25  1:51 ` Huang, Ying
  2017-07-25  1:51 ` [PATCH -mm -v3 6/6] mm, swap: Don't use VMA based swap readahead if HDD is used as swap Huang, Ying
  5 siblings, 0 replies; 13+ messages in thread
From: Huang, Ying @ 2017-07-25  1:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Huang Ying, Johannes Weiner, Minchan Kim,
	Rik van Riel, Shaohua Li, Hugh Dickins, Fengguang Wu, Tim Chen,
	Dave Hansen

From: Huang Ying <ying.huang@intel.com>

A sysfs interface to control VMA based swap readahead is added, as
follows:

/sys/kernel/mm/swap/vma_ra_enabled

Enable the VMA based swap readahead algorithm ("true" or "1"), or fall
back to the original global swap readahead algorithm ("false" or "0").

/sys/kernel/mm/swap/vma_ra_max_order

Set the maximum order of the readahead window size for the VMA based
swap readahead algorithm.  The window holds at most 2^order pages;
valid values are 1 to SWAP_RA_ORDER_CEILING.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
---
 mm/swap_state.c | 47 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 47 insertions(+)

diff --git a/mm/swap_state.c b/mm/swap_state.c
index 730477d86f15..46717088bd9f 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -820,6 +820,51 @@ static ssize_t swap_readahead_total_show(
 static struct kobj_attribute swap_readahead_total_attr =
 	__ATTR(ra_total, 0444, swap_readahead_total_show, NULL);
 
+static ssize_t vma_ra_enabled_show(struct kobject *kobj,
+				     struct kobj_attribute *attr, char *buf)
+{
+	return sprintf(buf, "%s\n", swap_vma_readahead ? "true" : "false");
+}
+static ssize_t vma_ra_enabled_store(struct kobject *kobj,
+				      struct kobj_attribute *attr,
+				      const char *buf, size_t count)
+{
+	if (!strncmp(buf, "true", 4) || !strncmp(buf, "1", 1))
+		swap_vma_readahead = true;
+	else if (!strncmp(buf, "false", 5) || !strncmp(buf, "0", 1))
+		swap_vma_readahead = false;
+	else
+		return -EINVAL;
+
+	return count;
+}
+static struct kobj_attribute vma_ra_enabled_attr =
+	__ATTR(vma_ra_enabled, 0644, vma_ra_enabled_show,
+	       vma_ra_enabled_store);
+
+static ssize_t vma_ra_max_order_show(struct kobject *kobj,
+				     struct kobj_attribute *attr, char *buf)
+{
+	return sprintf(buf, "%d\n", swap_ra_max_order);
+}
+static ssize_t vma_ra_max_order_store(struct kobject *kobj,
+				      struct kobj_attribute *attr,
+				      const char *buf, size_t count)
+{
+	int err, v;
+
+	err = kstrtoint(buf, 10, &v);
+	if (err || v > SWAP_RA_ORDER_CEILING || v <= 0)
+		return -EINVAL;
+
+	swap_ra_max_order = v;
+
+	return count;
+}
+static struct kobj_attribute vma_ra_max_order_attr =
+	__ATTR(vma_ra_max_order, 0644, vma_ra_max_order_show,
+	       vma_ra_max_order_store);
+
 static struct attribute *swap_attrs[] = {
 	&swap_cache_pages_attr.attr,
 	&swap_cache_add_attr.attr,
@@ -828,6 +873,8 @@ static struct attribute *swap_attrs[] = {
 	&swap_cache_find_total_attr.attr,
 	&swap_readahead_hits_attr.attr,
 	&swap_readahead_total_attr.attr,
+	&vma_ra_enabled_attr.attr,
+	&vma_ra_max_order_attr.attr,
 	NULL,
 };
 
-- 
2.13.2
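
The input handling of the two store handlers in this patch can be
modelled in user space.  This is a sketch under assumptions: the
model_* names are hypothetical, MODEL_ORDER_CEILING is a placeholder
for SWAP_RA_ORDER_CEILING (whose real value is defined in patch 4/6 and
not shown here), the initial order is made up, and strtol() only
approximates kstrtoint() (for instance, it tolerates leading
whitespace).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define MODEL_ORDER_CEILING 5	/* placeholder, not the kernel's value */

static bool model_vma_readahead = true;
static int model_max_order = 3;	/* hypothetical initial value */

/* Mirrors vma_ra_enabled_store(): the match is on a prefix of buf. */
static int model_enabled_store(const char *buf)
{
	assert(buf != NULL);

	if (!strncmp(buf, "true", 4) || !strncmp(buf, "1", 1))
		model_vma_readahead = true;
	else if (!strncmp(buf, "false", 5) || !strncmp(buf, "0", 1))
		model_vma_readahead = false;
	else
		return -EINVAL;
	return 0;
}

/* Mirrors vma_ra_max_order_store(): accept 1..ceiling, reject the rest. */
static int model_max_order_store(const char *buf)
{
	char *end;
	long v;

	assert(buf != NULL);

	errno = 0;
	v = strtol(buf, &end, 10);
	if (errno || end == buf)
		return -EINVAL;
	if (*end == '\n')	/* allow one trailing newline, as echo adds */
		end++;
	if (*end != '\0')
		return -EINVAL;
	if (v > MODEL_ORDER_CEILING || v <= 0)
		return -EINVAL;

	model_max_order = (int)v;
	return 0;
}
```

Two properties of the patch show up in the model: the enable handler
matches only a prefix, so an input such as "10" also enables the
feature, and the order handler rejects 0, so a one-page window cannot
be selected through this file.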


* [PATCH -mm -v3 6/6] mm, swap: Don't use VMA based swap readahead if HDD is used as swap
  2017-07-25  1:51 [PATCH -mm -v3 0/6] mm, swap: VMA based swap readahead Huang, Ying
                   ` (4 preceding siblings ...)
  2017-07-25  1:51 ` [PATCH -mm -v3 5/6] mm, swap: Add sysfs interface for " Huang, Ying
@ 2017-07-25  1:51 ` Huang, Ying
  2017-07-25 20:50   ` Andrew Morton
  5 siblings, 1 reply; 13+ messages in thread
From: Huang, Ying @ 2017-07-25  1:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Huang Ying, Johannes Weiner, Minchan Kim,
	Rik van Riel, Shaohua Li, Hugh Dickins, Fengguang Wu, Tim Chen,
	Dave Hansen

From: Huang Ying <ying.huang@intel.com>

VMA based swap readahead reads ahead the virtual pages that are
contiguous in the virtual address space, while the original swap
readahead reads ahead the swap slots that are contiguous on the swap
device.  Although VMA based swap readahead picks the swap slots to
read ahead more accurately, it triggers more small random reads, which
may degrade HDD (hard disk) performance heavily and may end up
outweighing the benefit.

To avoid this, if an HDD is used as swap, VMA based swap readahead is
disabled and the original swap readahead is used instead.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
---
 include/linux/swap.h | 11 ++++++-----
 mm/swapfile.c        |  8 +++++++-
 2 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 9e5cfb64b5e8..633bf6c5bac8 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -399,16 +399,17 @@ extern struct page *do_swap_page_readahead(swp_entry_t fentry, gfp_t gfp_mask,
 					   struct vm_fault *vmf,
 					   struct vma_swap_readahead *swap_ra);
 
-static inline bool swap_use_vma_readahead(void)
-{
-	return READ_ONCE(swap_vma_readahead);
-}
-
 /* linux/mm/swapfile.c */
 extern atomic_long_t nr_swap_pages;
 extern long total_swap_pages;
+extern atomic_t nr_rotate_swap;
 extern bool has_usable_swap(void);
 
+static inline bool swap_use_vma_readahead(void)
+{
+	return READ_ONCE(swap_vma_readahead) && !atomic_read(&nr_rotate_swap);
+}
+
 /* Swap 50% full? Release swapcache more aggressively.. */
 static inline bool vm_swap_full(void)
 {
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 6ba4aab2db0b..2685b9951cc1 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -96,6 +96,8 @@ static DECLARE_WAIT_QUEUE_HEAD(proc_poll_wait);
 /* Activity counter to indicate that a swapon or swapoff has occurred */
 static atomic_t proc_poll_event = ATOMIC_INIT(0);
 
+atomic_t nr_rotate_swap = ATOMIC_INIT(0);
+
 static inline unsigned char swap_count(unsigned char ent)
 {
 	return ent & ~SWAP_HAS_CACHE;	/* may include SWAP_HAS_CONT flag */
@@ -2387,6 +2389,9 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	if (p->flags & SWP_CONTINUED)
 		free_swap_count_continuations(p);
 
+	if (!p->bdev || !blk_queue_nonrot(bdev_get_queue(p->bdev)))
+		atomic_dec(&nr_rotate_swap);
+
 	mutex_lock(&swapon_mutex);
 	spin_lock(&swap_lock);
 	spin_lock(&p->lock);
@@ -2963,7 +2968,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 			cluster = per_cpu_ptr(p->percpu_cluster, cpu);
 			cluster_set_null(&cluster->index);
 		}
-	}
+	} else
+		atomic_inc(&nr_rotate_swap);
 
 	error = swap_cgroup_swapon(p->type, maxpages);
 	if (error)
-- 
2.13.2
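
The effect of the global counter can be modelled in user space with C11
atomics.  This is a sketch, not the kernel code: the *_model names are
hypothetical, and "rotational" stands for a swap device that has no
block device or whose queue is not flagged non-rotational.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int nr_rotate_swap_model;		/* rotational swap devices */
static bool vma_readahead_enabled_model = true;	/* the sysfs knob */

static void swapon_model(bool rotational)
{
	if (rotational)
		atomic_fetch_add(&nr_rotate_swap_model, 1);
}

static void swapoff_model(bool rotational)
{
	if (rotational) {
		/* A decrement without a matching increment is a bug. */
		assert(atomic_load(&nr_rotate_swap_model) > 0);
		atomic_fetch_sub(&nr_rotate_swap_model, 1);
	}
}

/* Models swap_use_vma_readahead(): one rotational device disables it. */
static bool use_vma_readahead_model(void)
{
	return vma_readahead_enabled_model &&
	       atomic_load(&nr_rotate_swap_model) == 0;
}
```

Because the counter is global, enabling a single rotational device
switches every swap-in back to the original algorithm, including
swap-ins served from an SSD that is active at the same time.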


* Re: [PATCH -mm -v3 1/6] mm, swap: Add swap cache statistics sysfs interface
  2017-07-25  1:51 ` [PATCH -mm -v3 1/6] mm, swap: Add swap cache statistics sysfs interface Huang, Ying
@ 2017-07-25 20:42   ` Andrew Morton
  2017-07-26  1:29     ` Huang, Ying
  2017-07-25 21:05   ` Rik van Riel
  1 sibling, 1 reply; 13+ messages in thread
From: Andrew Morton @ 2017-07-25 20:42 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, linux-kernel, Johannes Weiner, Minchan Kim,
	Rik van Riel, Shaohua Li, Hugh Dickins, Fengguang Wu, Tim Chen,
	Dave Hansen

On Tue, 25 Jul 2017 09:51:46 +0800 "Huang, Ying" <ying.huang@intel.com> wrote:

> The swap cache stats could be gotten only via sysrq, which isn't
> convenient in some situation.  So the sysfs interface of swap cache
> stats is added for that.  The added sysfs directories/files are as
> follow,
> 
> /sys/kernel/mm/swap
> /sys/kernel/mm/swap/cache_find_total
> /sys/kernel/mm/swap/cache_find_success
> /sys/kernel/mm/swap/cache_add
> /sys/kernel/mm/swap/cache_del
> /sys/kernel/mm/swap/cache_pages

We should document this somewhere.  Documentation/ABI/ is the formal
place for sysfs files, but nobody will think to look there for VM
things, so perhaps place a pointer to the Documentation/ABI/ files
within Documentation/vm somewhere, only there isn't an appropriate
Documentation/vm file ;)

Or just put all these things in debugfs.  These are pretty specialized
things and appear to be developer-only files of short-term interest?


* Re: [PATCH -mm -v3 6/6] mm, swap: Don't use VMA based swap readahead if HDD is used as swap
  2017-07-25  1:51 ` [PATCH -mm -v3 6/6] mm, swap: Don't use VMA based swap readahead if HDD is used as swap Huang, Ying
@ 2017-07-25 20:50   ` Andrew Morton
  2017-07-26  1:17     ` Huang, Ying
  0 siblings, 1 reply; 13+ messages in thread
From: Andrew Morton @ 2017-07-25 20:50 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, linux-kernel, Johannes Weiner, Minchan Kim,
	Rik van Riel, Shaohua Li, Hugh Dickins, Fengguang Wu, Tim Chen,
	Dave Hansen

On Tue, 25 Jul 2017 09:51:51 +0800 "Huang, Ying" <ying.huang@intel.com> wrote:

> From: Huang Ying <ying.huang@intel.com>
> 
> VMA based swap readahead will readahead the virtual pages that is
> continuous in the virtual address space.  While the original swap
> readahead will readahead the swap slots that is continuous in the swap
> device.  Although VMA based swap readahead is more correct for the
> swap slots to be readahead, it will trigger more small random
> readings, which may cause the performance of HDD (hard disk) to
> degrade heavily, and may finally exceed the benefit.
> 
> To avoid the issue, in this patch, if the HDD is used as swap, the VMA
> based swap readahead will be disabled, and the original swap readahead
> will be used instead.
>
> ...
> 
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -399,16 +399,17 @@ extern struct page *do_swap_page_readahead(swp_entry_t fentry, gfp_t gfp_mask,
>  					   struct vm_fault *vmf,
>  					   struct vma_swap_readahead *swap_ra);
>  
> -static inline bool swap_use_vma_readahead(void)
> -{
> -	return READ_ONCE(swap_vma_readahead);
> -}
> -
>  /* linux/mm/swapfile.c */
>  extern atomic_long_t nr_swap_pages;
>  extern long total_swap_pages;
> +extern atomic_t nr_rotate_swap;

This is rather ugly.  If the system is swapping to both an SSD and to a
spinning disk, we'll treat the spinning disk as SSD.

Surely this decision can be made in a per-device fashion?

>  extern bool has_usable_swap(void);
>  
> +static inline bool swap_use_vma_readahead(void)
> +{
> +	return READ_ONCE(swap_vma_readahead) && !atomic_read(&nr_rotate_swap);
> +}
> +
>  /* Swap 50% full? Release swapcache more aggressively.. */
>  static inline bool vm_swap_full(void)
>  {
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 6ba4aab2db0b..2685b9951cc1 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -96,6 +96,8 @@ static DECLARE_WAIT_QUEUE_HEAD(proc_poll_wait);
>  /* Activity counter to indicate that a swapon or swapoff has occurred */
>  static atomic_t proc_poll_event = ATOMIC_INIT(0);
>  
> +atomic_t nr_rotate_swap = ATOMIC_INIT(0);
> +
>  static inline unsigned char swap_count(unsigned char ent)
>  {
>  	return ent & ~SWAP_HAS_CACHE;	/* may include SWAP_HAS_CONT flag */
> @@ -2387,6 +2389,9 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>  	if (p->flags & SWP_CONTINUED)
>  		free_swap_count_continuations(p);
>  
> +	if (!p->bdev || !blk_queue_nonrot(bdev_get_queue(p->bdev)))
> +		atomic_dec(&nr_rotate_swap);

What's that p->bdev test for?  It's not symmetrical with the
sys_swapon() change and one wonders if the counter can get out of sync.


>  	mutex_lock(&swapon_mutex);
>  	spin_lock(&swap_lock);
>  	spin_lock(&p->lock);
> @@ -2963,7 +2968,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>  			cluster = per_cpu_ptr(p->percpu_cluster, cpu);
>  			cluster_set_null(&cluster->index);
>  		}
> -	}
> +	} else
> +		atomic_inc(&nr_rotate_swap);
>  
>  	error = swap_cgroup_swapon(p->type, maxpages);
>  	if (error)
> -- 
> 2.13.2


* Re: [PATCH -mm -v3 1/6] mm, swap: Add swap cache statistics sysfs interface
  2017-07-25  1:51 ` [PATCH -mm -v3 1/6] mm, swap: Add swap cache statistics sysfs interface Huang, Ying
  2017-07-25 20:42   ` Andrew Morton
@ 2017-07-25 21:05   ` Rik van Riel
  2017-07-26  1:30     ` Huang, Ying
  1 sibling, 1 reply; 13+ messages in thread
From: Rik van Riel @ 2017-07-25 21:05 UTC (permalink / raw)
  To: Huang, Ying, Andrew Morton
  Cc: linux-mm, linux-kernel, Johannes Weiner, Minchan Kim, Shaohua Li,
	Hugh Dickins, Fengguang Wu, Tim Chen, Dave Hansen

On Tue, 2017-07-25 at 09:51 +0800, Huang, Ying wrote:
> From: Huang Ying <ying.huang@intel.com>
> 
> The swap cache stats could be gotten only via sysrq, which isn't
> convenient in some situation.  So the sysfs interface of swap cache
> stats is added for that.  The added sysfs directories/files are as
> follow,
> 
> /sys/kernel/mm/swap
> /sys/kernel/mm/swap/cache_find_total
> /sys/kernel/mm/swap/cache_find_success
> /sys/kernel/mm/swap/cache_add
> /sys/kernel/mm/swap/cache_del
> /sys/kernel/mm/swap/cache_pages
> 
What is the advantage of this vs new fields in
/proc/vmstat, which is where most of the VM
statistics seem to live?

-- 
All rights reversed

* Re: [PATCH -mm -v3 6/6] mm, swap: Don't use VMA based swap readahead if HDD is used as swap
  2017-07-25 20:50   ` Andrew Morton
@ 2017-07-26  1:17     ` Huang, Ying
  0 siblings, 0 replies; 13+ messages in thread
From: Huang, Ying @ 2017-07-26  1:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Huang, Ying, linux-mm, linux-kernel, Johannes Weiner,
	Minchan Kim, Rik van Riel, Shaohua Li, Hugh Dickins,
	Fengguang Wu, Tim Chen, Dave Hansen

Hi, Andrew,

Andrew Morton <akpm@linux-foundation.org> writes:

> On Tue, 25 Jul 2017 09:51:51 +0800 "Huang, Ying" <ying.huang@intel.com> wrote:
>
>> From: Huang Ying <ying.huang@intel.com>
>> 
>> VMA based swap readahead will readahead the virtual pages that is
>> continuous in the virtual address space.  While the original swap
>> readahead will readahead the swap slots that is continuous in the swap
>> device.  Although VMA based swap readahead is more correct for the
>> swap slots to be readahead, it will trigger more small random
>> readings, which may cause the performance of HDD (hard disk) to
>> degrade heavily, and may finally exceed the benefit.
>> 
>> To avoid the issue, in this patch, if the HDD is used as swap, the VMA
>> based swap readahead will be disabled, and the original swap readahead
>> will be used instead.
>>
>> ...
>> 
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -399,16 +399,17 @@ extern struct page *do_swap_page_readahead(swp_entry_t fentry, gfp_t gfp_mask,
>>  					   struct vm_fault *vmf,
>>  					   struct vma_swap_readahead *swap_ra);
>>  
>> -static inline bool swap_use_vma_readahead(void)
>> -{
>> -	return READ_ONCE(swap_vma_readahead);
>> -}
>> -
>>  /* linux/mm/swapfile.c */
>>  extern atomic_long_t nr_swap_pages;
>>  extern long total_swap_pages;
>> +extern atomic_t nr_rotate_swap;
>
> This is rather ugly.  If the system is swapping to both an SSD and to a
> spinning disk, we'll treat the spinning disk as SSD.

With this patch, if the system is swapping to both an SSD and a
spinning disk, we treat the SSD as a spinning disk.  That is, we use
the original swap readahead algorithm instead of the newly proposed VMA
based swap readahead algorithm.

> Surely this decision can be made in a per-device fashion?

That is hard to do for the VMA based swap readahead algorithm.  It
checks the PTEs near the fault address; some of them may point to the
SSD and others to the spinning disk, so it is hard to choose which
algorithm to use in that situation.
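
To make "the PTEs near the fault address" concrete, the window
placement done by swap_readahead_detect() and swap_ra_clamp_pfn() in
patch 4/6 can be modelled in user space.  This is a sketch under
assumptions: place_window() and struct ra_window are hypothetical
names, and the shifts assume 4 KiB pages with a 2 MiB PMD.  Every PTE
in the resulting range may hold a swap entry for a different swap
device.

```c
#include <assert.h>

#define MODEL_PAGE_SHIFT 12	/* assumption: 4 KiB pages */
#define MODEL_PMD_SHIFT  21	/* assumption: one PMD entry maps 2 MiB */

struct ra_window {
	unsigned long start;	/* first PFN in the window */
	unsigned long end;	/* one past the last PFN */
};

static unsigned long max_ul(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

static struct ra_window place_window(unsigned long vm_start,
				     unsigned long vm_end,
				     unsigned long faddr,
				     unsigned long prev_pfn,
				     unsigned int win)
{
	unsigned long fpfn = faddr >> MODEL_PAGE_SHIFT;
	unsigned long pmd_lo = (faddr >> MODEL_PMD_SHIFT)
			       << (MODEL_PMD_SHIFT - MODEL_PAGE_SHIFT);
	unsigned long pmd_hi = pmd_lo +
			       (1UL << (MODEL_PMD_SHIFT - MODEL_PAGE_SHIFT));
	unsigned long lpfn, rpfn;
	struct ra_window w;

	/* Keeps the unsigned subtractions below from wrapping. */
	assert(win >= 1 && fpfn >= win);

	if (fpfn == prev_pfn + 1) {		/* ascending access */
		lpfn = fpfn;
		rpfn = fpfn + win;
	} else if (prev_pfn == fpfn + 1) {	/* descending access */
		lpfn = fpfn - win + 1;
		rpfn = fpfn + 1;
	} else {				/* no direction: centre it */
		unsigned int left = (win - 1) / 2;

		lpfn = fpfn - left;
		rpfn = fpfn + win - left;
	}

	/* Clamp to the VMA and to the PMD that covers the fault address. */
	w.start = max_ul(lpfn, max_ul(vm_start >> MODEL_PAGE_SHIFT, pmd_lo));
	w.end = min_ul(rpfn, min_ul(vm_end >> MODEL_PAGE_SHIFT, pmd_hi));
	return w;
}
```

The PMD clamp presumably keeps the PTE pointer arithmetic
(vmf->pte - offset) within a single page-table page.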

So I chose a simple solution: use the original swap readahead algorithm
if any spinning disk is used as swap, and hope that most people will
not use both a spinning disk and an SSD as swap at the same time.

>>  extern bool has_usable_swap(void);
>>  
>> +static inline bool swap_use_vma_readahead(void)
>> +{
>> +	return READ_ONCE(swap_vma_readahead) && !atomic_read(&nr_rotate_swap);
>> +}
>> +
>>  /* Swap 50% full? Release swapcache more aggressively.. */
>>  static inline bool vm_swap_full(void)
>>  {
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index 6ba4aab2db0b..2685b9951cc1 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -96,6 +96,8 @@ static DECLARE_WAIT_QUEUE_HEAD(proc_poll_wait);
>>  /* Activity counter to indicate that a swapon or swapoff has occurred */
>>  static atomic_t proc_poll_event = ATOMIC_INIT(0);
>>  
>> +atomic_t nr_rotate_swap = ATOMIC_INIT(0);
>> +
>>  static inline unsigned char swap_count(unsigned char ent)
>>  {
>>  	return ent & ~SWAP_HAS_CACHE;	/* may include SWAP_HAS_CONT flag */
>> @@ -2387,6 +2389,9 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>>  	if (p->flags & SWP_CONTINUED)
>>  		free_swap_count_continuations(p);
>>  
>> +	if (!p->bdev || !blk_queue_nonrot(bdev_get_queue(p->bdev)))
>> +		atomic_dec(&nr_rotate_swap);
>
> What's that p->bdev test for?  It's not symmetrical with the
> sys_swapon() change and one wonders if the counter can get out of sync.

There is a matching test in sys_swapon():

	if (p->bdev && blk_queue_nonrot(bdev_get_queue(p->bdev))) {
		...
	} else
		atomic_inc(&nr_rotate_swap);

I use the same p->bdev test in swapoff, negated, to keep the counting
symmetrical.  Did I misunderstand the code?
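
The two conditions can be compared mechanically.  The sketch below uses
hypothetical helper names and plain booleans in place of p->bdev and
blk_queue_nonrot(); it checks that the else branch in swapon and the
test in swapoff select the same devices for every input combination.

```c
#include <assert.h>
#include <stdbool.h>

/* swapon: the else branch of "if (bdev && nonrot) { ... } else inc". */
static bool swapon_counts_device(bool has_bdev, bool nonrot)
{
	return !(has_bdev && nonrot);
}

/* swapoff: "if (!bdev || !nonrot) dec". */
static bool swapoff_counts_device(bool has_bdev, bool nonrot)
{
	return !has_bdev || !nonrot;
}

/* Asserts that both tests agree on all four input combinations. */
static void check_symmetry(void)
{
	for (int b = 0; b <= 1; b++)
		for (int n = 0; n <= 1; n++)
			assert(swapon_counts_device(b, n) ==
			       swapoff_counts_device(b, n));
}
```

This only covers the predicate itself.  Whether the counter stays in
sync also depends on things the check does not model, such as swapon
failing after the increment.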

Best Regards,
Huang, Ying

>
>>  	mutex_lock(&swapon_mutex);
>>  	spin_lock(&swap_lock);
>>  	spin_lock(&p->lock);
>> @@ -2963,7 +2968,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>>  			cluster = per_cpu_ptr(p->percpu_cluster, cpu);
>>  			cluster_set_null(&cluster->index);
>>  		}
>> -	}
>> +	} else
>> +		atomic_inc(&nr_rotate_swap);
>>  
>>  	error = swap_cgroup_swapon(p->type, maxpages);
>>  	if (error)
>> -- 
>> 2.13.2


* Re: [PATCH -mm -v3 1/6] mm, swap: Add swap cache statistics sysfs interface
  2017-07-25 20:42   ` Andrew Morton
@ 2017-07-26  1:29     ` Huang, Ying
  0 siblings, 0 replies; 13+ messages in thread
From: Huang, Ying @ 2017-07-26  1:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Huang, Ying, linux-mm, linux-kernel, Johannes Weiner,
	Minchan Kim, Rik van Riel, Shaohua Li, Hugh Dickins,
	Fengguang Wu, Tim Chen, Dave Hansen

Andrew Morton <akpm@linux-foundation.org> writes:

> On Tue, 25 Jul 2017 09:51:46 +0800 "Huang, Ying" <ying.huang@intel.com> wrote:
>
>> The swap cache stats could be gotten only via sysrq, which isn't
>> convenient in some situation.  So the sysfs interface of swap cache
>> stats is added for that.  The added sysfs directories/files are as
>> follow,
>> 
>> /sys/kernel/mm/swap
>> /sys/kernel/mm/swap/cache_find_total
>> /sys/kernel/mm/swap/cache_find_success
>> /sys/kernel/mm/swap/cache_add
>> /sys/kernel/mm/swap/cache_del
>> /sys/kernel/mm/swap/cache_pages
>
> We should document this somewhere.  Documentation/ABI/ is the formal
> place for sysfs files, but nobody will think to look there for VM
> things, so perhaps place a pointer to the Documentation/ABI/ files
> within Documentation/vm somewhere, only there isn't an appropriate
> Documentation/vm file ;)
>
> Or just put all these things in debugfs.  These are pretty specialized
> things and appear to be developer-only files of short-term interest?

Yes.  Debugfs should be a better place for these.  I will change that
in the next version.

I also introduced sysfs interfaces in [2/6] and [5/6]:

/sys/kernel/mm/swap/ra_hits
/sys/kernel/mm/swap/ra_total

/sys/kernel/mm/swap/vma_ra_enabled
/sys/kernel/mm/swap/vma_ra_max_order

I will add ABI documentation for them.

Best Regards,
Huang, Ying


* Re: [PATCH -mm -v3 1/6] mm, swap: Add swap cache statistics sysfs interface
  2017-07-25 21:05   ` Rik van Riel
@ 2017-07-26  1:30     ` Huang, Ying
  0 siblings, 0 replies; 13+ messages in thread
From: Huang, Ying @ 2017-07-26  1:30 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Huang, Ying, Andrew Morton, linux-mm, linux-kernel,
	Johannes Weiner, Minchan Kim, Shaohua Li, Hugh Dickins,
	Fengguang Wu, Tim Chen, Dave Hansen

Hi, Rik,

Rik van Riel <riel@redhat.com> writes:

> On Tue, 2017-07-25 at 09:51 +0800, Huang, Ying wrote:
>> From: Huang Ying <ying.huang@intel.com>
>> 
>> The swap cache stats could be gotten only via sysrq, which isn't
>> convenient in some situation.  So the sysfs interface of swap cache
>> stats is added for that.  The added sysfs directories/files are as
>> follow,
>> 
>> /sys/kernel/mm/swap
>> /sys/kernel/mm/swap/cache_find_total
>> /sys/kernel/mm/swap/cache_find_success
>> /sys/kernel/mm/swap/cache_add
>> /sys/kernel/mm/swap/cache_del
>> /sys/kernel/mm/swap/cache_pages
>> 
> What is the advantage of this vs new fields in
> /proc/vmstat, which is where most of the VM
> statistics seem to live?

As Andrew proposed, I will use debugfs for them, because they are
mostly of interest to developers.

Best Regards,
Huang, Ying


end of thread, other threads:[~2017-07-26  1:30 UTC | newest]

Thread overview: 13+ messages
2017-07-25  1:51 [PATCH -mm -v3 0/6] mm, swap: VMA based swap readahead Huang, Ying
2017-07-25  1:51 ` [PATCH -mm -v3 1/6] mm, swap: Add swap cache statistics sysfs interface Huang, Ying
2017-07-25 20:42   ` Andrew Morton
2017-07-26  1:29     ` Huang, Ying
2017-07-25 21:05   ` Rik van Riel
2017-07-26  1:30     ` Huang, Ying
2017-07-25  1:51 ` [PATCH -mm -v3 2/6] mm, swap: Add swap readahead hit statistics Huang, Ying
2017-07-25  1:51 ` [PATCH -mm -v3 3/6] mm, swap: Fix swap readahead marking Huang, Ying
2017-07-25  1:51 ` [PATCH -mm -v3 4/6] mm, swap: VMA based swap readahead Huang, Ying
2017-07-25  1:51 ` [PATCH -mm -v3 5/6] mm, swap: Add sysfs interface for " Huang, Ying
2017-07-25  1:51 ` [PATCH -mm -v3 6/6] mm, swap: Don't use VMA based swap readahead if HDD is used as swap Huang, Ying
2017-07-25 20:50   ` Andrew Morton
2017-07-26  1:17     ` Huang, Ying
