linux-mm.kvack.org archive mirror
* [PATCH 1/5] zram: set BDI_CAP_STABLE_WRITES once
@ 2017-09-12  2:37 Minchan Kim
  2017-09-12  2:37 ` [PATCH 2/5] bdi: introduce BDI_CAP_SYNCHRONOUS_IO Minchan Kim
                   ` (3 more replies)
  0 siblings, 4 replies; 21+ messages in thread
From: Minchan Kim @ 2017-09-12  2:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, kernel-team, Minchan Kim, Ilya Dryomov,
	Sergey Senozhatsky

[1] fixed the odd behavior of clearing the BDI_CAP_STABLE_WRITES flag
unconditionally whenever revalidate_disk is called, so zram no longer
needs to re-set the flag every time it revalidates the bdev.
Instead, set the flag just once when the zram device is created.

It shouldn't change any behavior.

[1] 19b7ccf8651d, block: get rid of blk_integrity_revalidate()
Cc: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 drivers/block/zram/zram_drv.c | 16 ++++++----------
 1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 2981c27d3aae..ba009675fdc0 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -122,14 +122,6 @@ static inline bool is_partial_io(struct bio_vec *bvec)
 }
 #endif
 
-static void zram_revalidate_disk(struct zram *zram)
-{
-	revalidate_disk(zram->disk);
-	/* revalidate_disk reset the BDI_CAP_STABLE_WRITES so set again */
-	zram->disk->queue->backing_dev_info->capabilities |=
-		BDI_CAP_STABLE_WRITES;
-}
-
 /*
  * Check if request is within bounds and aligned on zram logical blocks.
  */
@@ -1385,7 +1377,8 @@ static ssize_t disksize_store(struct device *dev,
 	zram->comp = comp;
 	zram->disksize = disksize;
 	set_capacity(zram->disk, zram->disksize >> SECTOR_SHIFT);
-	zram_revalidate_disk(zram);
+
+	revalidate_disk(zram->disk);
 	up_write(&zram->init_lock);
 
 	return len;
@@ -1432,7 +1425,7 @@ static ssize_t reset_store(struct device *dev,
 	/* Make sure all the pending I/O are finished */
 	fsync_bdev(bdev);
 	zram_reset_device(zram);
-	zram_revalidate_disk(zram);
+	revalidate_disk(zram->disk);
 	bdput(bdev);
 
 	mutex_lock(&bdev->bd_mutex);
@@ -1551,6 +1544,7 @@ static int zram_add(void)
 	/* zram devices sort of resembles non-rotational disks */
 	queue_flag_set_unlocked(QUEUE_FLAG_NONROT, zram->disk->queue);
 	queue_flag_clear_unlocked(QUEUE_FLAG_ADD_RANDOM, zram->disk->queue);
+
 	/*
 	 * To ensure that we always get PAGE_SIZE aligned
 	 * and n*PAGE_SIZED sized I/O requests.
@@ -1575,6 +1569,8 @@ static int zram_add(void)
 	if (ZRAM_LOGICAL_BLOCK_SIZE == PAGE_SIZE)
 		blk_queue_max_write_zeroes_sectors(zram->disk->queue, UINT_MAX);
 
+	zram->disk->queue->backing_dev_info->capabilities |=
+					BDI_CAP_STABLE_WRITES;
 	add_disk(zram->disk);
 
 	ret = sysfs_create_group(&disk_to_dev(zram->disk)->kobj,
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 2/5] bdi: introduce BDI_CAP_SYNCHRONOUS_IO
  2017-09-12  2:37 [PATCH 1/5] zram: set BDI_CAP_STABLE_WRITES once Minchan Kim
@ 2017-09-12  2:37 ` Minchan Kim
  2017-09-12  2:37 ` [PATCH 3/5] mm:swap: introduce SWP_SYNCHRONOUS_IO Minchan Kim
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 21+ messages in thread
From: Minchan Kim @ 2017-09-12  2:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, kernel-team, Minchan Kim, Ilya Dryomov,
	Sergey Senozhatsky, Christoph Hellwig, Dan Williams,
	Ross Zwisler

Per the discussion in [1], the rw_page function may be removed someday.
If so, we need a way to detect super-fast storage for which a synchronous
IO operation, like the current rw_page path, always wins.

This patch introduces BDI_CAP_SYNCHRONOUS_IO to mark such devices.
With it, we can apply various optimization techniques.

[1] lkml.kernel.org/r/<20170728165604.10455-1-ross.zwisler@linux.intel.com>
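
For illustration only, and not part of this patch, a minimal sketch of
how generic code could consult the new flag via the helper added below;
patch 3/5 of this series adds essentially this check to swapon().  The
function name here is hypothetical:

#include <linux/backing-dev.h>

/* hypothetical caller, shown only to illustrate the intended usage */
static bool swap_backing_is_synchronous(struct inode *swap_inode)
{
	/* true once a driver advertises BDI_CAP_SYNCHRONOUS_IO */
	return bdi_cap_synchronous_io(inode_to_bdi(swap_inode));
}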

Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 drivers/block/brd.c           | 2 ++
 drivers/block/zram/zram_drv.c | 2 +-
 drivers/nvdimm/btt.c          | 3 +++
 drivers/nvdimm/pmem.c         | 2 ++
 include/linux/backing-dev.h   | 8 ++++++++
 5 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index bbd0d186cfc0..1fdb736aa882 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -20,6 +20,7 @@
 #include <linux/radix-tree.h>
 #include <linux/fs.h>
 #include <linux/slab.h>
+#include <linux/backing-dev.h>
 #ifdef CONFIG_BLK_DEV_RAM_DAX
 #include <linux/pfn_t.h>
 #include <linux/dax.h>
@@ -449,6 +450,7 @@ static struct brd_device *brd_alloc(int i)
 	disk->flags		= GENHD_FL_EXT_DEVT;
 	sprintf(disk->disk_name, "ram%d", i);
 	set_capacity(disk, rd_size * 2);
+	disk->queue->backing_dev_info->capabilities |= BDI_CAP_SYNCHRONOUS_IO;
 
 #ifdef CONFIG_BLK_DEV_RAM_DAX
 	queue_flag_set_unlocked(QUEUE_FLAG_DAX, brd->brd_queue);
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index ba009675fdc0..91a72df41ab0 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1570,7 +1570,7 @@ static int zram_add(void)
 		blk_queue_max_write_zeroes_sectors(zram->disk->queue, UINT_MAX);
 
 	zram->disk->queue->backing_dev_info->capabilities |=
-					BDI_CAP_STABLE_WRITES;
+			(BDI_CAP_STABLE_WRITES | BDI_CAP_SYNCHRONOUS_IO);
 	add_disk(zram->disk);
 
 	ret = sysfs_create_group(&disk_to_dev(zram->disk)->kobj,
diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
index d5612bd1cc81..e949e3302af4 100644
--- a/drivers/nvdimm/btt.c
+++ b/drivers/nvdimm/btt.c
@@ -23,6 +23,7 @@
 #include <linux/ndctl.h>
 #include <linux/fs.h>
 #include <linux/nd.h>
+#include <linux/backing-dev.h>
 #include "btt.h"
 #include "nd.h"
 
@@ -1402,6 +1403,8 @@ static int btt_blk_init(struct btt *btt)
 	btt->btt_disk->private_data = btt;
 	btt->btt_disk->queue = btt->btt_queue;
 	btt->btt_disk->flags = GENHD_FL_EXT_DEVT;
+	btt->btt_disk->queue->backing_dev_info->capabilities |=
+			BDI_CAP_SYNCHRONOUS_IO;
 
 	blk_queue_make_request(btt->btt_queue, btt_make_request);
 	blk_queue_logical_block_size(btt->btt_queue, btt->sector_size);
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index e9aa453da50c..5fc5999dca6a 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -31,6 +31,7 @@
 #include <linux/uio.h>
 #include <linux/dax.h>
 #include <linux/nd.h>
+#include <linux/backing-dev.h>
 #include "pmem.h"
 #include "pfn.h"
 #include "nd.h"
@@ -401,6 +402,7 @@ static int pmem_attach_disk(struct device *dev,
 	disk->fops		= &pmem_fops;
 	disk->queue		= q;
 	disk->flags		= GENHD_FL_EXT_DEVT;
+	disk->queue->backing_dev_info->capabilities |= BDI_CAP_SYNCHRONOUS_IO;
 	nvdimm_namespace_disk_name(ndns, disk->disk_name);
 	set_capacity(disk, (pmem->size - pmem->pfn_pad - pmem->data_offset)
 			/ 512);
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 854e1bdd0b2a..cd41617c6594 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -123,6 +123,8 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned int max_ratio);
  * BDI_CAP_STRICTLIMIT:    Keep number of dirty pages below bdi threshold.
  *
  * BDI_CAP_CGROUP_WRITEBACK: Supports cgroup-aware writeback.
+ * BDI_CAP_SYNCHRONOUS_IO: Device is so fast that asynchronous IO would be
+ *			   inefficient.
  */
 #define BDI_CAP_NO_ACCT_DIRTY	0x00000001
 #define BDI_CAP_NO_WRITEBACK	0x00000002
@@ -130,6 +132,7 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned int max_ratio);
 #define BDI_CAP_STABLE_WRITES	0x00000008
 #define BDI_CAP_STRICTLIMIT	0x00000010
 #define BDI_CAP_CGROUP_WRITEBACK 0x00000020
+#define BDI_CAP_SYNCHRONOUS_IO	0x00000040
 
 #define BDI_CAP_NO_ACCT_AND_WRITEBACK \
 	(BDI_CAP_NO_WRITEBACK | BDI_CAP_NO_ACCT_DIRTY | BDI_CAP_NO_ACCT_WB)
@@ -177,6 +180,11 @@ long wait_iff_congested(struct pglist_data *pgdat, int sync, long timeout);
 int pdflush_proc_obsolete(struct ctl_table *table, int write,
 		void __user *buffer, size_t *lenp, loff_t *ppos);
 
+static inline bool bdi_cap_synchronous_io(struct backing_dev_info *bdi)
+{
+	return bdi->capabilities & BDI_CAP_SYNCHRONOUS_IO;
+}
+
 static inline bool bdi_cap_stable_pages_required(struct backing_dev_info *bdi)
 {
 	return bdi->capabilities & BDI_CAP_STABLE_WRITES;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 3/5] mm:swap: introduce SWP_SYNCHRONOUS_IO
  2017-09-12  2:37 [PATCH 1/5] zram: set BDI_CAP_STABLE_WRITES once Minchan Kim
  2017-09-12  2:37 ` [PATCH 2/5] bdi: introduce BDI_CAP_SYNCHRONOUS_IO Minchan Kim
@ 2017-09-12  2:37 ` Minchan Kim
  2017-09-12  4:48   ` Sergey Senozhatsky
  2017-09-12  2:37 ` [PATCH 4/5] mm:swap: respect page_cluster for readahead Minchan Kim
  2017-09-12  2:37 ` [PATCH 5/5] mm:swap: skip swapcache for swapin of synchronous device Minchan Kim
  3 siblings, 1 reply; 21+ messages in thread
From: Minchan Kim @ 2017-09-12  2:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, kernel-team, Minchan Kim, Ilya Dryomov,
	Sergey Senozhatsky

If rw_page based fast storage is used as a swap device, we need to
detect it so that swap IO operations can be enhanced.
This patch is preparation for optimizing the swap-in operation in the
next patch.

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 include/linux/swap.h | 3 ++-
 mm/swapfile.c        | 3 +++
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 8a807292037f..0f54b491e118 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -170,8 +170,9 @@ enum {
 	SWP_AREA_DISCARD = (1 << 8),	/* single-time swap area discards */
 	SWP_PAGE_DISCARD = (1 << 9),	/* freed swap page-cluster discards */
 	SWP_STABLE_WRITES = (1 << 10),	/* no overwrite PG_writeback pages */
+	SWP_SYNCHRONOUS_IO = (1<<11),	/* synchronous IO is efficient */
 					/* add others here before... */
-	SWP_SCANNING	= (1 << 11),	/* refcount in scan_swap_map */
+	SWP_SCANNING	= (1 << 12),	/* refcount in scan_swap_map */
 };
 
 #define SWAP_CLUSTER_MAX 32UL
diff --git a/mm/swapfile.c b/mm/swapfile.c
index bf91dc9e7a79..1305591cde4d 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3168,6 +3168,9 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 	if (bdi_cap_stable_pages_required(inode_to_bdi(inode)))
 		p->flags |= SWP_STABLE_WRITES;
 
+	if (bdi_cap_synchronous_io(inode_to_bdi(inode)))
+		p->flags |= SWP_SYNCHRONOUS_IO;
+
 	if (p->bdev && blk_queue_nonrot(bdev_get_queue(p->bdev))) {
 		int cpu;
 		unsigned long ci, nr_cluster;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 4/5] mm:swap: respect page_cluster for readahead
  2017-09-12  2:37 [PATCH 1/5] zram: set BDI_CAP_STABLE_WRITES once Minchan Kim
  2017-09-12  2:37 ` [PATCH 2/5] bdi: introduce BDI_CAP_SYNCHRONOUS_IO Minchan Kim
  2017-09-12  2:37 ` [PATCH 3/5] mm:swap: introduce SWP_SYNCHRONOUS_IO Minchan Kim
@ 2017-09-12  2:37 ` Minchan Kim
  2017-09-12  5:23   ` Huang, Ying
  2017-09-12  2:37 ` [PATCH 5/5] mm:swap: skip swapcache for swapin of synchronous device Minchan Kim
  3 siblings, 1 reply; 21+ messages in thread
From: Minchan Kim @ 2017-09-12  2:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, kernel-team, Minchan Kim, Ilya Dryomov,
	Sergey Senozhatsky, Huang, Ying

page_cluster 0 means "we don't want readahead", so in that case,
let's skip the readahead detection logic.

Cc: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 include/linux/swap.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 0f54b491e118..739d94397c47 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -427,7 +427,8 @@ extern bool has_usable_swap(void);
 
 static inline bool swap_use_vma_readahead(void)
 {
-	return READ_ONCE(swap_vma_readahead) && !atomic_read(&nr_rotate_swap);
+	return page_cluster > 0 && READ_ONCE(swap_vma_readahead)
+				&& !atomic_read(&nr_rotate_swap);
 }
 
 /* Swap 50% full? Release swapcache more aggressively.. */
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 5/5] mm:swap: skip swapcache for swapin of synchronous device
  2017-09-12  2:37 [PATCH 1/5] zram: set BDI_CAP_STABLE_WRITES once Minchan Kim
                   ` (2 preceding siblings ...)
  2017-09-12  2:37 ` [PATCH 4/5] mm:swap: respect page_cluster for readahead Minchan Kim
@ 2017-09-12  2:37 ` Minchan Kim
  2017-09-12  6:04   ` Sergey Senozhatsky
  2017-09-12 20:22   ` kbuild test robot
  3 siblings, 2 replies; 21+ messages in thread
From: Minchan Kim @ 2017-09-12  2:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, kernel-team, Minchan Kim, Ilya Dryomov,
	Sergey Senozhatsky, Dan Williams, Ross Zwisler, Hugh Dickins

With fast swap storage, platforms want to use swap more aggressively,
and swap-in is crucial to application latency.

The rw_page based synchronous devices like zram, pmem and btt are such
fast storage. When I profile swap-in performance with a zram lz4
decompression test, the software overhead is more than 70%. It may be
even bigger on nvdimm.

This patch aims to reduce swap-in latency by skipping the swapcache
when the swap device is a synchronous device like an rw_page based one.
It improves my swap-in test by 45% (5G sequential swap-in, no readahead,
from 2.41sec to 1.64sec).
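
To make the large diff below easier to follow, the fast path added to
do_swap_page() boils down to the following condensation (illustrative
only; it repeats code from the diff rather than adding anything new):

	struct swap_info_struct *si = swp_swap_info(entry);

	if (si->flags & SWP_SYNCHRONOUS_IO) {
		/* skip swapcache: read synchronously into a fresh page */
		page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vmf->address);
		/* error handling elided; see the review discussion below */
		__SetPageLocked(page);
		__SetPageSwapBacked(page);
		set_page_private(page, entry.val);
		lru_cache_add_anon(page);
		swap_readpage(page, true);
	} else {
		/* normal path: readahead through the swapcache */
		page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
					vma, vmf->address);
		swapcache = page;
	}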

Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 include/linux/swap.h |  1 +
 mm/memory.c          | 48 +++++++++++++++++++++++++++++++++---------------
 mm/page_io.c         |  6 +++---
 mm/swapfile.c        | 11 +++++++----
 4 files changed, 44 insertions(+), 22 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 739d94397c47..fb46a3b55d65 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -462,6 +462,7 @@ extern int page_swapcount(struct page *);
 extern int __swp_swapcount(swp_entry_t entry);
 extern int swp_swapcount(swp_entry_t entry);
 extern struct swap_info_struct *page_swap_info(struct page *);
+extern struct swap_info_struct *swp_swap_info(swp_entry_t entry);
 extern bool reuse_swap_page(struct page *, int *);
 extern int try_to_free_swap(struct page *);
 struct backing_dev_info;
diff --git a/mm/memory.c b/mm/memory.c
index ec4e15494901..130698cd2d98 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2842,7 +2842,7 @@ EXPORT_SYMBOL(unmap_mapping_range);
 int do_swap_page(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
-	struct page *page = NULL, *swapcache;
+	struct page *page = NULL, *swapcache = NULL;
 	struct mem_cgroup *memcg;
 	struct vma_swap_readahead swap_ra;
 	swp_entry_t entry;
@@ -2881,17 +2881,33 @@ int do_swap_page(struct vm_fault *vmf)
 		}
 		goto out;
 	}
+
+
 	delayacct_set_flag(DELAYACCT_PF_SWAPIN);
 	if (!page)
 		page = lookup_swap_cache(entry, vma_readahead ? vma : NULL,
 					 vmf->address);
 	if (!page) {
-		if (vma_readahead)
-			page = do_swap_page_readahead(entry,
-				GFP_HIGHUSER_MOVABLE, vmf, &swap_ra);
-		else
-			page = swapin_readahead(entry,
-				GFP_HIGHUSER_MOVABLE, vma, vmf->address);
+		struct swap_info_struct *si = swp_swap_info(entry);
+
+		if (!(si->flags & SWP_SYNCHRONOUS_IO)) {
+			if (vma_readahead)
+				page = do_swap_page_readahead(entry,
+					GFP_HIGHUSER_MOVABLE, vmf, &swap_ra);
+			else
+				page = swapin_readahead(entry,
+					GFP_HIGHUSER_MOVABLE, vma, vmf->address);
+			swapcache = page;
+		} else {
+			/* skip swapcache */
+			page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vmf->address);
+			__SetPageLocked(page);
+			__SetPageSwapBacked(page);
+			set_page_private(page, entry.val);
+			lru_cache_add_anon(page);
+			swap_readpage(page, true);
+		}
+
 		if (!page) {
 			/*
 			 * Back out if somebody else faulted in this pte
@@ -2920,7 +2936,6 @@ int do_swap_page(struct vm_fault *vmf)
 		goto out_release;
 	}
 
-	swapcache = page;
 	locked = lock_page_or_retry(page, vma->vm_mm, vmf->flags);
 
 	delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
@@ -2935,7 +2950,8 @@ int do_swap_page(struct vm_fault *vmf)
 	 * test below, are not enough to exclude that.  Even if it is still
 	 * swapcache, we need to check that the page's swap has not changed.
 	 */
-	if (unlikely(!PageSwapCache(page) || page_private(page) != entry.val))
+	if (unlikely((!PageSwapCache(page) ||
+			page_private(page) != entry.val)) && swapcache)
 		goto out_page;
 
 	page = ksm_might_need_to_copy(page, vma, vmf->address);
@@ -2988,14 +3004,16 @@ int do_swap_page(struct vm_fault *vmf)
 		pte = pte_mksoft_dirty(pte);
 	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
 	vmf->orig_pte = pte;
-	if (page == swapcache) {
-		do_page_add_anon_rmap(page, vma, vmf->address, exclusive);
-		mem_cgroup_commit_charge(page, memcg, true, false);
-		activate_page(page);
-	} else { /* ksm created a completely new copy */
+
+	/* ksm created a completely new copy */
+	if (unlikely(page != swapcache && swapcache)) {
 		page_add_new_anon_rmap(page, vma, vmf->address, false);
 		mem_cgroup_commit_charge(page, memcg, false, false);
 		lru_cache_add_active_or_unevictable(page, vma);
+	} else {
+		do_page_add_anon_rmap(page, vma, vmf->address, exclusive);
+		mem_cgroup_commit_charge(page, memcg, true, false);
+		activate_page(page);
 	}
 
 	swap_free(entry);
@@ -3003,7 +3021,7 @@ int do_swap_page(struct vm_fault *vmf)
 	    (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
 		try_to_free_swap(page);
 	unlock_page(page);
-	if (page != swapcache) {
+	if (page != swapcache && swapcache) {
 		/*
 		 * Hold the lock to avoid the swap entry to be reused
 		 * until we take the PT lock for the pte_same() check
diff --git a/mm/page_io.c b/mm/page_io.c
index 21502d341a67..d4a98e1f6608 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -346,7 +346,7 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
 	return ret;
 }
 
-int swap_readpage(struct page *page, bool do_poll)
+int swap_readpage(struct page *page, bool synchronous)
 {
 	struct bio *bio;
 	int ret = 0;
@@ -354,7 +354,7 @@ int swap_readpage(struct page *page, bool do_poll)
 	blk_qc_t qc;
 	struct gendisk *disk;
 
-	VM_BUG_ON_PAGE(!PageSwapCache(page), page);
+	VM_BUG_ON_PAGE(!PageSwapCache(page) && !synchronous, page);
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(PageUptodate(page), page);
 	if (frontswap_load(page) == 0) {
@@ -402,7 +402,7 @@ int swap_readpage(struct page *page, bool do_poll)
 	count_vm_event(PSWPIN);
 	bio_get(bio);
 	qc = submit_bio(bio);
-	while (do_poll) {
+	while (synchronous) {
 		set_current_state(TASK_UNINTERRUPTIBLE);
 		if (!READ_ONCE(bio->bi_private))
 			break;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 1305591cde4d..64a3d85226ba 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3454,10 +3454,15 @@ int swapcache_prepare(swp_entry_t entry)
 	return __swap_duplicate(entry, SWAP_HAS_CACHE);
 }
 
+struct swap_info_struct *swp_swap_info(swp_entry_t entry)
+{
+	return swap_info[swp_type(entry)];
+}
+
 struct swap_info_struct *page_swap_info(struct page *page)
 {
-	swp_entry_t swap = { .val = page_private(page) };
-	return swap_info[swp_type(swap)];
+	swp_entry_t entry = { .val = page_private(page) };
+	return swp_swap_info(entry);
 }
 
 /*
@@ -3465,7 +3470,6 @@ struct swap_info_struct *page_swap_info(struct page *page)
  */
 struct address_space *__page_file_mapping(struct page *page)
 {
-	VM_BUG_ON_PAGE(!PageSwapCache(page), page);
 	return page_swap_info(page)->swap_file->f_mapping;
 }
 EXPORT_SYMBOL_GPL(__page_file_mapping);
@@ -3473,7 +3477,6 @@ EXPORT_SYMBOL_GPL(__page_file_mapping);
 pgoff_t __page_file_index(struct page *page)
 {
 	swp_entry_t swap = { .val = page_private(page) };
-	VM_BUG_ON_PAGE(!PageSwapCache(page), page);
 	return swp_offset(swap);
 }
 EXPORT_SYMBOL_GPL(__page_file_index);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/5] mm:swap: introduce SWP_SYNCHRONOUS_IO
  2017-09-12  2:37 ` [PATCH 3/5] mm:swap: introduce SWP_SYNCHRONOUS_IO Minchan Kim
@ 2017-09-12  4:48   ` Sergey Senozhatsky
  0 siblings, 0 replies; 21+ messages in thread
From: Sergey Senozhatsky @ 2017-09-12  4:48 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-mm, linux-kernel, kernel-team, Ilya Dryomov,
	Sergey Senozhatsky

On (09/12/17 11:37), Minchan Kim wrote:
[..]
> If rw-page based fast storage is used for swap devices, we need to
> detect it to enhance swap IO operations.
> This patch is preparation for optimizing of swap-in operation with
> next patch.
> 
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
>  include/linux/swap.h | 3 ++-
>  mm/swapfile.c        | 3 +++
>  2 files changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 8a807292037f..0f54b491e118 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -170,8 +170,9 @@ enum {
>  	SWP_AREA_DISCARD = (1 << 8),	/* single-time swap area discards */
>  	SWP_PAGE_DISCARD = (1 << 9),	/* freed swap page-cluster discards */
>  	SWP_STABLE_WRITES = (1 << 10),	/* no overwrite PG_writeback pages */
> +	SWP_SYNCHRONOUS_IO = (1<<11),	/* synchronous IO is efficient */
a nitpick:                  (1 << 11)

	-ss

>  					/* add others here before... */
> -	SWP_SCANNING	= (1 << 11),	/* refcount in scan_swap_map */
> +	SWP_SCANNING	= (1 << 12),	/* refcount in scan_swap_map */
>  };

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 4/5] mm:swap: respect page_cluster for readahead
  2017-09-12  2:37 ` [PATCH 4/5] mm:swap: respect page_cluster for readahead Minchan Kim
@ 2017-09-12  5:23   ` Huang, Ying
  2017-09-12  6:25     ` Minchan Kim
  0 siblings, 1 reply; 21+ messages in thread
From: Huang, Ying @ 2017-09-12  5:23 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-mm, linux-kernel, kernel-team, Ilya Dryomov,
	Sergey Senozhatsky, Huang, Ying

Minchan Kim <minchan@kernel.org> writes:

> page_cluster 0 means "we don't want readahead" so in the case,
> let's skip the readahead detection logic.
>
> Cc: "Huang, Ying" <ying.huang@intel.com>
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
>  include/linux/swap.h | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 0f54b491e118..739d94397c47 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -427,7 +427,8 @@ extern bool has_usable_swap(void);
>  
>  static inline bool swap_use_vma_readahead(void)
>  {
> -	return READ_ONCE(swap_vma_readahead) && !atomic_read(&nr_rotate_swap);
> +	return page_cluster > 0 && READ_ONCE(swap_vma_readahead)
> +				&& !atomic_read(&nr_rotate_swap);
>  }
>  
>  /* Swap 50% full? Release swapcache more aggressively.. */

Now the readahead window size of the VMA based swap readahead is
controlled by /sys/kernel/mm/swap/vma_ra_max_order, while that of the
original swap readahead is controlled by the sysctl page_cluster.  It is
possible for anonymous memory to use VMA based swap readahead and tmpfs
to use the original swap readahead algorithm at the same time.  So I
think it is necessary to use different control knobs for these two
algorithms.  For example, if we want to disable readahead for tmpfs but
keep it for VMA based readahead, we can set page_cluster to 0 but
/sys/kernel/mm/swap/vma_ra_max_order to non-zero.  With your change,
this would be impossible.

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 5/5] mm:swap: skip swapcache for swapin of synchronous device
  2017-09-12  2:37 ` [PATCH 5/5] mm:swap: skip swapcache for swapin of synchronous device Minchan Kim
@ 2017-09-12  6:04   ` Sergey Senozhatsky
  2017-09-12  6:31     ` Minchan Kim
  2017-09-12 20:22   ` kbuild test robot
  1 sibling, 1 reply; 21+ messages in thread
From: Sergey Senozhatsky @ 2017-09-12  6:04 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-mm, linux-kernel, kernel-team, Ilya Dryomov,
	Sergey Senozhatsky, Dan Williams, Ross Zwisler, Hugh Dickins

On (09/12/17 11:37), Minchan Kim wrote:
> +		} else {
> +			/* skip swapcache */
> +			page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vmf->address);

what if alloc_page_vma() fails?

> +			__SetPageLocked(page);
> +			__SetPageSwapBacked(page);
> +			set_page_private(page, entry.val);
> +			lru_cache_add_anon(page);
> +			swap_readpage(page, true);
> +		}

	-ss

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 4/5] mm:swap: respect page_cluster for readahead
  2017-09-12  5:23   ` Huang, Ying
@ 2017-09-12  6:25     ` Minchan Kim
  2017-09-12  6:44       ` Huang, Ying
  0 siblings, 1 reply; 21+ messages in thread
From: Minchan Kim @ 2017-09-12  6:25 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, linux-mm, linux-kernel, kernel-team, Ilya Dryomov,
	Sergey Senozhatsky

On Tue, Sep 12, 2017 at 01:23:01PM +0800, Huang, Ying wrote:
> Minchan Kim <minchan@kernel.org> writes:
> 
> > page_cluster 0 means "we don't want readahead" so in the case,
> > let's skip the readahead detection logic.
> >
> > Cc: "Huang, Ying" <ying.huang@intel.com>
> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > ---
> >  include/linux/swap.h | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 0f54b491e118..739d94397c47 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -427,7 +427,8 @@ extern bool has_usable_swap(void);
> >  
> >  static inline bool swap_use_vma_readahead(void)
> >  {
> > -	return READ_ONCE(swap_vma_readahead) && !atomic_read(&nr_rotate_swap);
> > +	return page_cluster > 0 && READ_ONCE(swap_vma_readahead)
> > +				&& !atomic_read(&nr_rotate_swap);
> >  }
> >  
> >  /* Swap 50% full? Release swapcache more aggressively.. */
> 
> Now the readahead window size of the VMA based swap readahead is
> controlled by /sys/kernel/mm/swap/vma_ra_max_order, while that of the
> original swap readahead is controlled by sysctl page_cluster.  It is
> possible for anonymous memory to use VMA based swap readahead and tmpfs
> to use original swap readahead algorithm at the same time.  So that, I
> think it is necessary to use different control knob to control these two
> algorithm.  So if we want to disable readahead for tmpfs, but keep it
> for VMA based readahead, we can set 0 to page_cluster but non-zero to
> /sys/kernel/mm/swap/vma_ra_max_order.  With your change, this will be
> impossible.

For a long time, page-cluster has been used to control swap readahead.
For example, zram users have disabled readahead via a page-cluster of 0.
However, with your change, they would see a regression unless they also
disable vma_ra_max_order.

As well, all swap users would need to be aware of vma_ra_max_order in
addition to page-cluster to control swap readahead, but I didn't see any
documentation about that. Actually, I don't like that and want to unify
it with page-cluster.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 5/5] mm:swap: skip swapcache for swapin of synchronous device
  2017-09-12  6:04   ` Sergey Senozhatsky
@ 2017-09-12  6:31     ` Minchan Kim
  0 siblings, 0 replies; 21+ messages in thread
From: Minchan Kim @ 2017-09-12  6:31 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Andrew Morton, linux-mm, linux-kernel, kernel-team, Ilya Dryomov,
	Sergey Senozhatsky, Dan Williams, Ross Zwisler, Hugh Dickins

On Tue, Sep 12, 2017 at 03:04:56PM +0900, Sergey Senozhatsky wrote:
> On (09/12/17 11:37), Minchan Kim wrote:
> > +		} else {
> > +			/* skip swapcache */
> > +			page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vmf->address);
> 
> what if alloc_page_vma() fails?

Several modifications during development ultimately made me drop the NULL check.
Thanks for catching it, as well as for the style fix-up in the other patch, Sergey.
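
For the next revision, something like the following should do it (a
sketch; the existing "if (!page)" bail-out right after this block already
backs out cleanly when the allocation fails):

		} else {
			/* skip swapcache */
			page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
						vmf->address);
			if (page) {
				__SetPageLocked(page);
				__SetPageSwapBacked(page);
				set_page_private(page, entry.val);
				lru_cache_add_anon(page);
				swap_readpage(page, true);
			}
		}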

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 4/5] mm:swap: respect page_cluster for readahead
  2017-09-12  6:25     ` Minchan Kim
@ 2017-09-12  6:44       ` Huang, Ying
  2017-09-12  6:52         ` Minchan Kim
  0 siblings, 1 reply; 21+ messages in thread
From: Huang, Ying @ 2017-09-12  6:44 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Huang, Ying, Andrew Morton, linux-mm, linux-kernel, kernel-team,
	Ilya Dryomov, Sergey Senozhatsky

Minchan Kim <minchan@kernel.org> writes:

> On Tue, Sep 12, 2017 at 01:23:01PM +0800, Huang, Ying wrote:
>> Minchan Kim <minchan@kernel.org> writes:
>> 
>> > page_cluster 0 means "we don't want readahead" so in the case,
>> > let's skip the readahead detection logic.
>> >
>> > Cc: "Huang, Ying" <ying.huang@intel.com>
>> > Signed-off-by: Minchan Kim <minchan@kernel.org>
>> > ---
>> >  include/linux/swap.h | 3 ++-
>> >  1 file changed, 2 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/include/linux/swap.h b/include/linux/swap.h
>> > index 0f54b491e118..739d94397c47 100644
>> > --- a/include/linux/swap.h
>> > +++ b/include/linux/swap.h
>> > @@ -427,7 +427,8 @@ extern bool has_usable_swap(void);
>> >  
>> >  static inline bool swap_use_vma_readahead(void)
>> >  {
>> > -	return READ_ONCE(swap_vma_readahead) && !atomic_read(&nr_rotate_swap);
>> > +	return page_cluster > 0 && READ_ONCE(swap_vma_readahead)
>> > +				&& !atomic_read(&nr_rotate_swap);
>> >  }
>> >  
>> >  /* Swap 50% full? Release swapcache more aggressively.. */
>> 
>> Now the readahead window size of the VMA based swap readahead is
>> controlled by /sys/kernel/mm/swap/vma_ra_max_order, while that of the
>> original swap readahead is controlled by sysctl page_cluster.  It is
>> possible for anonymous memory to use VMA based swap readahead and tmpfs
>> to use original swap readahead algorithm at the same time.  So that, I
>> think it is necessary to use different control knob to control these two
>> algorithm.  So if we want to disable readahead for tmpfs, but keep it
>> for VMA based readahead, we can set 0 to page_cluster but non-zero to
>> /sys/kernel/mm/swap/vma_ra_max_order.  With your change, this will be
>> impossible.
>
> For a long time, page-cluster have been used as controlling swap readahead.
> One of example, zram users have been disabled readahead via 0 page-cluster.
> However, with your change, it would be regressed if it doesn't disable
> vma_ra_max_order.
>
> As well, all of swap users should be aware of vma_ra_max_order as well as
> page-cluster to control swap readahead but I didn't see any document about
> that. Acutaully, I don't like it but want to unify it with page-cluster.

The document is in

Documentation/ABI/testing/sysfs-kernel-mm-swap

The concern with unifying it with page-cluster is as follows.

The original swap readahead on tmpfs may not work well when a combined
workload is running, so we may want to disable or constrain it.  But at
the same time, the VMA based swap readahead may work better.  So I think
it may be necessary to control them separately.

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 4/5] mm:swap: respect page_cluster for readahead
  2017-09-12  6:44       ` Huang, Ying
@ 2017-09-12  6:52         ` Minchan Kim
  2017-09-12  7:29           ` Huang, Ying
  0 siblings, 1 reply; 21+ messages in thread
From: Minchan Kim @ 2017-09-12  6:52 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, linux-mm, linux-kernel, kernel-team, Ilya Dryomov,
	Sergey Senozhatsky

On Tue, Sep 12, 2017 at 02:44:36PM +0800, Huang, Ying wrote:
> Minchan Kim <minchan@kernel.org> writes:
> 
> > On Tue, Sep 12, 2017 at 01:23:01PM +0800, Huang, Ying wrote:
> >> Minchan Kim <minchan@kernel.org> writes:
> >> 
> >> > page_cluster 0 means "we don't want readahead" so in the case,
> >> > let's skip the readahead detection logic.
> >> >
> >> > Cc: "Huang, Ying" <ying.huang@intel.com>
> >> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> >> > ---
> >> >  include/linux/swap.h | 3 ++-
> >> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >> >
> >> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> >> > index 0f54b491e118..739d94397c47 100644
> >> > --- a/include/linux/swap.h
> >> > +++ b/include/linux/swap.h
> >> > @@ -427,7 +427,8 @@ extern bool has_usable_swap(void);
> >> >  
> >> >  static inline bool swap_use_vma_readahead(void)
> >> >  {
> >> > -	return READ_ONCE(swap_vma_readahead) && !atomic_read(&nr_rotate_swap);
> >> > +	return page_cluster > 0 && READ_ONCE(swap_vma_readahead)
> >> > +				&& !atomic_read(&nr_rotate_swap);
> >> >  }
> >> >  
> >> >  /* Swap 50% full? Release swapcache more aggressively.. */
> >> 
> >> Now the readahead window size of the VMA based swap readahead is
> >> controlled by /sys/kernel/mm/swap/vma_ra_max_order, while that of the
> >> original swap readahead is controlled by sysctl page_cluster.  It is
> >> possible for anonymous memory to use VMA based swap readahead and tmpfs
> >> to use original swap readahead algorithm at the same time.  So that, I
> >> think it is necessary to use different control knob to control these two
> >> algorithm.  So if we want to disable readahead for tmpfs, but keep it
> >> for VMA based readahead, we can set 0 to page_cluster but non-zero to
> >> /sys/kernel/mm/swap/vma_ra_max_order.  With your change, this will be
> >> impossible.
> >
> > For a long time, page-cluster have been used as controlling swap readahead.
> > One of example, zram users have been disabled readahead via 0 page-cluster.
> > However, with your change, it would be regressed if it doesn't disable
> > vma_ra_max_order.
> >
> > As well, all of swap users should be aware of vma_ra_max_order as well as
> > page-cluster to control swap readahead but I didn't see any document about
> > that. Acutaully, I don't like it but want to unify it with page-cluster.
> 
> The document is in
> 
> Documentation/ABI/testing/sysfs-kernel-mm-swap
> 
> The concern of unifying it with page-cluster is as following.
> 
> Original swap readahead on tmpfs may not work well because the combined
> workload is running, so we want to disable or constrain it.  But at the
> same time, the VMA based swap readahead may work better.  So I think it
> may be necessary to control them separately.

My concern is that users who have disabled swap readahead via page-cluster
would see a regression. Please take care of them.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 4/5] mm:swap: respect page_cluster for readahead
  2017-09-12  6:52         ` Minchan Kim
@ 2017-09-12  7:29           ` Huang, Ying
  2017-09-12  7:56             ` Minchan Kim
  0 siblings, 1 reply; 21+ messages in thread
From: Huang, Ying @ 2017-09-12  7:29 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Huang, Ying, Andrew Morton, linux-mm, linux-kernel, kernel-team,
	Ilya Dryomov, Sergey Senozhatsky

Minchan Kim <minchan@kernel.org> writes:

> On Tue, Sep 12, 2017 at 02:44:36PM +0800, Huang, Ying wrote:
>> Minchan Kim <minchan@kernel.org> writes:
>> 
>> > On Tue, Sep 12, 2017 at 01:23:01PM +0800, Huang, Ying wrote:
>> >> Minchan Kim <minchan@kernel.org> writes:
>> >> 
>> >> > page_cluster 0 means "we don't want readahead" so in the case,
>> >> > let's skip the readahead detection logic.
>> >> >
>> >> > Cc: "Huang, Ying" <ying.huang@intel.com>
>> >> > Signed-off-by: Minchan Kim <minchan@kernel.org>
>> >> > ---
>> >> >  include/linux/swap.h | 3 ++-
>> >> >  1 file changed, 2 insertions(+), 1 deletion(-)
>> >> >
>> >> > diff --git a/include/linux/swap.h b/include/linux/swap.h
>> >> > index 0f54b491e118..739d94397c47 100644
>> >> > --- a/include/linux/swap.h
>> >> > +++ b/include/linux/swap.h
>> >> > @@ -427,7 +427,8 @@ extern bool has_usable_swap(void);
>> >> >  
>> >> >  static inline bool swap_use_vma_readahead(void)
>> >> >  {
>> >> > -	return READ_ONCE(swap_vma_readahead) && !atomic_read(&nr_rotate_swap);
>> >> > +	return page_cluster > 0 && READ_ONCE(swap_vma_readahead)
>> >> > +				&& !atomic_read(&nr_rotate_swap);
>> >> >  }
>> >> >  
>> >> >  /* Swap 50% full? Release swapcache more aggressively.. */
>> >> 
>> >> Now the readahead window size of the VMA based swap readahead is
>> >> controlled by /sys/kernel/mm/swap/vma_ra_max_order, while that of the
>> >> original swap readahead is controlled by sysctl page_cluster.  It is
>> >> possible for anonymous memory to use VMA based swap readahead and tmpfs
>> >> to use original swap readahead algorithm at the same time.  So that, I
>> >> think it is necessary to use different control knob to control these two
>> >> algorithm.  So if we want to disable readahead for tmpfs, but keep it
>> >> for VMA based readahead, we can set 0 to page_cluster but non-zero to
>> >> /sys/kernel/mm/swap/vma_ra_max_order.  With your change, this will be
>> >> impossible.
>> >
>> > For a long time, page-cluster have been used as controlling swap readahead.
>> > One of example, zram users have been disabled readahead via 0 page-cluster.
>> > However, with your change, it would be regressed if it doesn't disable
>> > vma_ra_max_order.
>> >
>> > As well, all of swap users should be aware of vma_ra_max_order as well as
>> > page-cluster to control swap readahead but I didn't see any document about
>> > that. Acutaully, I don't like it but want to unify it with page-cluster.
>> 
>> The document is in
>> 
>> Documentation/ABI/testing/sysfs-kernel-mm-swap
>> 
>> The concern of unifying it with page-cluster is as following.
>> 
>> Original swap readahead on tmpfs may not work well because the combined
>> workload is running, so we want to disable or constrain it.  But at the
>> same time, the VMA based swap readahead may work better.  So I think it
>> may be necessary to control them separately.
>
> My concern is users have been disabled swap readahead by page-cluster would
> be regressed. Please take care of them.

How about disabling VMA based swap readahead if zram is used as swap?  Like
we have done for hard disks?

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 4/5] mm:swap: respect page_cluster for readahead
  2017-09-12  7:29           ` Huang, Ying
@ 2017-09-12  7:56             ` Minchan Kim
  2017-09-12  8:07               ` Huang, Ying
  0 siblings, 1 reply; 21+ messages in thread
From: Minchan Kim @ 2017-09-12  7:56 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, linux-mm, linux-kernel, kernel-team, Ilya Dryomov,
	Sergey Senozhatsky

On Tue, Sep 12, 2017 at 03:29:45PM +0800, Huang, Ying wrote:
> Minchan Kim <minchan@kernel.org> writes:
> 
> > On Tue, Sep 12, 2017 at 02:44:36PM +0800, Huang, Ying wrote:
> >> Minchan Kim <minchan@kernel.org> writes:
> >> 
> >> > On Tue, Sep 12, 2017 at 01:23:01PM +0800, Huang, Ying wrote:
> >> >> Minchan Kim <minchan@kernel.org> writes:
> >> >> 
> >> >> > page_cluster 0 means "we don't want readahead" so in the case,
> >> >> > let's skip the readahead detection logic.
> >> >> >
> >> >> > Cc: "Huang, Ying" <ying.huang@intel.com>
> >> >> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> >> >> > ---
> >> >> >  include/linux/swap.h | 3 ++-
> >> >> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >> >> >
> >> >> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> >> >> > index 0f54b491e118..739d94397c47 100644
> >> >> > --- a/include/linux/swap.h
> >> >> > +++ b/include/linux/swap.h
> >> >> > @@ -427,7 +427,8 @@ extern bool has_usable_swap(void);
> >> >> >  
> >> >> >  static inline bool swap_use_vma_readahead(void)
> >> >> >  {
> >> >> > -	return READ_ONCE(swap_vma_readahead) && !atomic_read(&nr_rotate_swap);
> >> >> > +	return page_cluster > 0 && READ_ONCE(swap_vma_readahead)
> >> >> > +				&& !atomic_read(&nr_rotate_swap);
> >> >> >  }
> >> >> >  
> >> >> >  /* Swap 50% full? Release swapcache more aggressively.. */
> >> >> 
> >> >> Now the readahead window size of the VMA based swap readahead is
> >> >> controlled by /sys/kernel/mm/swap/vma_ra_max_order, while that of the
> >> >> original swap readahead is controlled by sysctl page_cluster.  It is
> >> >> possible for anonymous memory to use VMA based swap readahead and tmpfs
> >> >> to use original swap readahead algorithm at the same time.  So that, I
> >> >> think it is necessary to use different control knob to control these two
> >> >> algorithm.  So if we want to disable readahead for tmpfs, but keep it
> >> >> for VMA based readahead, we can set 0 to page_cluster but non-zero to
> >> >> /sys/kernel/mm/swap/vma_ra_max_order.  With your change, this will be
> >> >> impossible.
> >> >
> >> > For a long time, page-cluster have been used as controlling swap readahead.
> >> > One of example, zram users have been disabled readahead via 0 page-cluster.
> >> > However, with your change, it would be regressed if it doesn't disable
> >> > vma_ra_max_order.
> >> >
> >> > As well, all of swap users should be aware of vma_ra_max_order as well as
> >> > page-cluster to control swap readahead but I didn't see any document about
> >> > that. Acutaully, I don't like it but want to unify it with page-cluster.
> >> 
> >> The document is in
> >> 
> >> Documentation/ABI/testing/sysfs-kernel-mm-swap
> >> 
> >> The concern of unifying it with page-cluster is as following.
> >> 
> >> Original swap readahead on tmpfs may not work well because the combined
> >> workload is running, so we want to disable or constrain it.  But at the
> >> same time, the VMA based swap readahead may work better.  So I think it
> >> may be necessary to control them separately.
> >
> > My concern is users have been disabled swap readahead by page-cluster would
> > be regressed. Please take care of them.
> 
> How about disable VMA based swap readahead if zram used as swap?  Like
> we have done for hard disk?

It could be done with the SWP_SYNCHRONOUS_IO flag, which indicates
super-fast, no-seek-cost swap devices, once this patchset is merged, so
that the VM automatically disables readahead. It is on my TODO list but
it's orthogonal work.

The problem I raised is "Why shouldn't we obey the user's decision?",
not a zram specific issue.

A user who has used an SSD as a swap device may have decided to disable
swap readahead for some reason (e.g., a small memory system). Anyway, it
has worked via page-cluster for several years, but with VMA based swap
readahead, it doesn't work any more.
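
As for the TODO above, roughly what I have in mind is something along
these lines (a hypothetical sketch, not part of this series; the helper
name is made up):

	/* skip readahead when the swap device is synchronous */
	static bool swap_want_readahead(swp_entry_t entry)
	{
		struct swap_info_struct *si = swp_swap_info(entry);

		return !(si->flags & SWP_SYNCHRONOUS_IO);
	}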

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 4/5] mm:swap: respect page_cluster for readahead
  2017-09-12  7:56             ` Minchan Kim
@ 2017-09-12  8:07               ` Huang, Ying
  2017-09-12  8:22                 ` Minchan Kim
  0 siblings, 1 reply; 21+ messages in thread
From: Huang, Ying @ 2017-09-12  8:07 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Huang, Ying, Andrew Morton, linux-mm, linux-kernel, kernel-team,
	Ilya Dryomov, Sergey Senozhatsky

Minchan Kim <minchan@kernel.org> writes:

> On Tue, Sep 12, 2017 at 03:29:45PM +0800, Huang, Ying wrote:
>> Minchan Kim <minchan@kernel.org> writes:
>> 
>> > On Tue, Sep 12, 2017 at 02:44:36PM +0800, Huang, Ying wrote:
>> >> Minchan Kim <minchan@kernel.org> writes:
>> >> 
>> >> > On Tue, Sep 12, 2017 at 01:23:01PM +0800, Huang, Ying wrote:
>> >> >> Minchan Kim <minchan@kernel.org> writes:
>> >> >> 
>> >> >> > page_cluster 0 means "we don't want readahead" so in the case,
>> >> >> > let's skip the readahead detection logic.
>> >> >> >
>> >> >> > Cc: "Huang, Ying" <ying.huang@intel.com>
>> >> >> > Signed-off-by: Minchan Kim <minchan@kernel.org>
>> >> >> > ---
>> >> >> >  include/linux/swap.h | 3 ++-
>> >> >> >  1 file changed, 2 insertions(+), 1 deletion(-)
>> >> >> >
>> >> >> > diff --git a/include/linux/swap.h b/include/linux/swap.h
>> >> >> > index 0f54b491e118..739d94397c47 100644
>> >> >> > --- a/include/linux/swap.h
>> >> >> > +++ b/include/linux/swap.h
>> >> >> > @@ -427,7 +427,8 @@ extern bool has_usable_swap(void);
>> >> >> >  
>> >> >> >  static inline bool swap_use_vma_readahead(void)
>> >> >> >  {
>> >> >> > -	return READ_ONCE(swap_vma_readahead) && !atomic_read(&nr_rotate_swap);
>> >> >> > +	return page_cluster > 0 && READ_ONCE(swap_vma_readahead)
>> >> >> > +				&& !atomic_read(&nr_rotate_swap);
>> >> >> >  }
>> >> >> >  
>> >> >> >  /* Swap 50% full? Release swapcache more aggressively.. */
>> >> >> 
>> >> >> Now the readahead window size of the VMA based swap readahead is
>> >> >> controlled by /sys/kernel/mm/swap/vma_ra_max_order, while that of the
>> >> >> original swap readahead is controlled by sysctl page_cluster.  It is
>> >> >> possible for anonymous memory to use VMA based swap readahead and tmpfs
>> >> >> to use original swap readahead algorithm at the same time.  So that, I
>> >> >> think it is necessary to use different control knob to control these two
>> >> >> algorithm.  So if we want to disable readahead for tmpfs, but keep it
>> >> >> for VMA based readahead, we can set 0 to page_cluster but non-zero to
>> >> >> /sys/kernel/mm/swap/vma_ra_max_order.  With your change, this will be
>> >> >> impossible.
>> >> >
>> >> > For a long time, page-cluster have been used as controlling swap readahead.
>> >> > One of example, zram users have been disabled readahead via 0 page-cluster.
>> >> > However, with your change, it would be regressed if it doesn't disable
>> >> > vma_ra_max_order.
>> >> >
>> >> > As well, all of swap users should be aware of vma_ra_max_order as well as
>> >> > page-cluster to control swap readahead but I didn't see any document about
>> >> > that. Acutaully, I don't like it but want to unify it with page-cluster.
>> >> 
>> >> The document is in
>> >> 
>> >> Documentation/ABI/testing/sysfs-kernel-mm-swap
>> >> 
>> >> The concern of unifying it with page-cluster is as following.
>> >> 
>> >> Original swap readahead on tmpfs may not work well because the combined
>> >> workload is running, so we want to disable or constrain it.  But at the
>> >> same time, the VMA based swap readahead may work better.  So I think it
>> >> may be necessary to control them separately.
>> >
>> > My concern is users have been disabled swap readahead by page-cluster would
>> > be regressed. Please take care of them.
>> 
>> How about disable VMA based swap readahead if zram used as swap?  Like
>> we have done for hard disk?
>
> It could be with SWP_SYNCHRONOUS_IO flag which indicates super-fast,
> no seek cost swap devices if this patchset is merged so VM automatically
> disables readahead. It is in my TODO but it's orthogonal work.
>
> The problem I raised is "Why shouldn't we obey user's decision?",
> not zram sepcific issue.
>
> A user has used SSD as swap devices decided to disable swap readahead
> by some reason(e.g., small memory system). Anyway, it has worked
> via page-cluster for a several years but with vma-based swap devices,
> it doesn't work any more.

Can they add one more line to their configuration scripts?

echo 0 > /sys/kernel/mm/swap/vma_ra_max_order

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 4/5] mm:swap: respect page_cluster for readahead
  2017-09-12  8:07               ` Huang, Ying
@ 2017-09-12  8:22                 ` Minchan Kim
  2017-09-12  8:32                   ` Huang, Ying
  0 siblings, 1 reply; 21+ messages in thread
From: Minchan Kim @ 2017-09-12  8:22 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, linux-mm, linux-kernel, kernel-team, Ilya Dryomov,
	Sergey Senozhatsky

On Tue, Sep 12, 2017 at 04:07:01PM +0800, Huang, Ying wrote:
< snip >
> >> > My concern is users have been disabled swap readahead by page-cluster would
> >> > be regressed. Please take care of them.
> >> 
> >> How about disable VMA based swap readahead if zram used as swap?  Like
> >> we have done for hard disk?
> >
> > It could be with SWP_SYNCHRONOUS_IO flag which indicates super-fast,
> > no seek cost swap devices if this patchset is merged so VM automatically
> > disables readahead. It is in my TODO but it's orthogonal work.
> >
> > The problem I raised is "Why shouldn't we obey user's decision?",
> > not zram sepcific issue.
> >
> > A user has used SSD as swap devices decided to disable swap readahead
> > by some reason(e.g., small memory system). Anyway, it has worked
> > via page-cluster for a several years but with vma-based swap devices,
> > it doesn't work any more.
> 
> Can they add one more line to their configuration scripts?
> 
> echo 0 > /sys/kernel/mm/swap/vma_ra_max_order

We call that a "regression", don't we?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 4/5] mm:swap: respect page_cluster for readahead
  2017-09-12  8:22                 ` Minchan Kim
@ 2017-09-12  8:32                   ` Huang, Ying
  2017-09-12 23:34                     ` Minchan Kim
  0 siblings, 1 reply; 21+ messages in thread
From: Huang, Ying @ 2017-09-12  8:32 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Huang, Ying, Andrew Morton, linux-mm, linux-kernel, kernel-team,
	Ilya Dryomov, Sergey Senozhatsky

Minchan Kim <minchan@kernel.org> writes:

> On Tue, Sep 12, 2017 at 04:07:01PM +0800, Huang, Ying wrote:
> < snip >
>> >> > My concern is users have been disabled swap readahead by page-cluster would
>> >> > be regressed. Please take care of them.
>> >> 
>> >> How about disable VMA based swap readahead if zram used as swap?  Like
>> >> we have done for hard disk?
>> >
>> > It could be with SWP_SYNCHRONOUS_IO flag which indicates super-fast,
>> > no seek cost swap devices if this patchset is merged so VM automatically
>> > disables readahead. It is in my TODO but it's orthogonal work.
>> >
>> > The problem I raised is "Why shouldn't we obey user's decision?",
>> > not zram sepcific issue.
>> >
>> > A user has used SSD as swap devices decided to disable swap readahead
>> > by some reason(e.g., small memory system). Anyway, it has worked
>> > via page-cluster for a several years but with vma-based swap devices,
>> > it doesn't work any more.
>> 
>> Can they add one more line to their configuration scripts?
>> 
>> echo 0 > /sys/kernel/mm/swap/vma_ra_max_order
>
> We call it as "regression", don't we?

I think this always happens when we switch the default algorithm.  For
example, if we switched the default IO scheduler, then user scripts that
configure the parameters of the old default IO scheduler would fail.

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 5/5] mm:swap: skip swapcache for swapin of synchronous device
  2017-09-12  2:37 ` [PATCH 5/5] mm:swap: skip swapcache for swapin of synchronous device Minchan Kim
  2017-09-12  6:04   ` Sergey Senozhatsky
@ 2017-09-12 20:22   ` kbuild test robot
  2017-09-13  0:16     ` Minchan Kim
  1 sibling, 1 reply; 21+ messages in thread
From: kbuild test robot @ 2017-09-12 20:22 UTC (permalink / raw)
  To: Minchan Kim
  Cc: kbuild-all, Andrew Morton, linux-mm, linux-kernel, kernel-team,
	Ilya Dryomov, Sergey Senozhatsky, Dan Williams, Ross Zwisler,
	Hugh Dickins

[-- Attachment #1: Type: text/plain, Size: 12424 bytes --]

Hi Minchan,

[auto build test ERROR on mmotm/master]
[also build test ERROR on next-20170912]
[cannot apply to linus/master v4.13]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Minchan-Kim/zram-set-BDI_CAP_STABLE_WRITES-once/20170913-025838
base:   git://git.cmpxchg.org/linux-mmotm.git master
config: x86_64-randconfig-x016-201737 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All error/warnings (new ones prefixed by >>):

   mm/memory.c: In function 'do_swap_page':
>> mm/memory.c:2891:33: error: implicit declaration of function 'swp_swap_info' [-Werror=implicit-function-declaration]
      struct swap_info_struct *si = swp_swap_info(entry);
                                    ^~~~~~~~~~~~~
>> mm/memory.c:2891:33: warning: initialization makes pointer from integer without a cast [-Wint-conversion]
>> mm/memory.c:2908:4: error: implicit declaration of function 'swap_readpage' [-Werror=implicit-function-declaration]
       swap_readpage(page, true);
       ^~~~~~~~~~~~~
   In file included from include/uapi/linux/stddef.h:1:0,
                    from include/linux/stddef.h:4,
                    from include/uapi/linux/posix_types.h:4,
                    from include/uapi/linux/types.h:13,
                    from include/linux/types.h:5,
                    from include/linux/smp.h:10,
                    from include/linux/kernel_stat.h:4,
                    from mm/memory.c:41:
   mm/memory.c: At top level:
   include/linux/compiler.h:162:4: warning: '______f' is static but declared in inline function 'memcpy_and_pad' which is not static
       ______f = {     \
       ^
   include/linux/compiler.h:154:23: note: in expansion of macro '__trace_if'
    #define if(cond, ...) __trace_if( (cond , ## __VA_ARGS__) )
                          ^~~~~~~~~~
   include/linux/string.h:451:2: note: in expansion of macro 'if'
     if (dest_len > count) {
     ^~
   include/linux/compiler.h:162:4: warning: '______f' is static but declared in inline function 'memcpy_and_pad' which is not static
       ______f = {     \
       ^
   include/linux/compiler.h:154:23: note: in expansion of macro '__trace_if'
    #define if(cond, ...) __trace_if( (cond , ## __VA_ARGS__) )
                          ^~~~~~~~~~
   include/linux/string.h:449:2: note: in expansion of macro 'if'
     if (dest_size < dest_len)
     ^~
   include/linux/compiler.h:162:4: warning: '______f' is static but declared in inline function 'memcpy_and_pad' which is not static
       ______f = {     \
       ^
   include/linux/compiler.h:154:23: note: in expansion of macro '__trace_if'
    #define if(cond, ...) __trace_if( (cond , ## __VA_ARGS__) )
                          ^~~~~~~~~~
   include/linux/string.h:446:8: note: in expansion of macro 'if'
      else if (src_size < dest_len && src_size < count)
           ^~
   include/linux/compiler.h:162:4: warning: '______f' is static but declared in inline function 'memcpy_and_pad' which is not static
       ______f = {     \
       ^
   include/linux/compiler.h:154:23: note: in expansion of macro '__trace_if'
    #define if(cond, ...) __trace_if( (cond , ## __VA_ARGS__) )
                          ^~~~~~~~~~
   include/linux/string.h:444:3: note: in expansion of macro 'if'
      if (dest_size < dest_len && dest_size < count)
      ^~
   include/linux/compiler.h:162:4: warning: '______f' is static but declared in inline function 'memcpy_and_pad' which is not static
       ______f = {     \
       ^
   include/linux/compiler.h:154:23: note: in expansion of macro '__trace_if'
    #define if(cond, ...) __trace_if( (cond , ## __VA_ARGS__) )
                          ^~~~~~~~~~
   include/linux/string.h:443:2: note: in expansion of macro 'if'
     if (__builtin_constant_p(dest_len) && __builtin_constant_p(count)) {
     ^~
   cc1: some warnings being treated as errors

vim +/swp_swap_info +2891 mm/memory.c

  2833	
  2834	/*
  2835	 * We enter with non-exclusive mmap_sem (to exclude vma changes,
  2836	 * but allow concurrent faults), and pte mapped but not yet locked.
  2837	 * We return with pte unmapped and unlocked.
  2838	 *
  2839	 * We return with the mmap_sem locked or unlocked in the same cases
  2840	 * as does filemap_fault().
  2841	 */
  2842	int do_swap_page(struct vm_fault *vmf)
  2843	{
  2844		struct vm_area_struct *vma = vmf->vma;
  2845		struct page *page = NULL, *swapcache = NULL;
  2846		struct mem_cgroup *memcg;
  2847		struct vma_swap_readahead swap_ra;
  2848		swp_entry_t entry;
  2849		pte_t pte;
  2850		int locked;
  2851		int exclusive = 0;
  2852		int ret = 0;
  2853		bool vma_readahead = swap_use_vma_readahead();
  2854	
  2855		if (vma_readahead)
  2856			page = swap_readahead_detect(vmf, &swap_ra);
  2857		if (!pte_unmap_same(vma->vm_mm, vmf->pmd, vmf->pte, vmf->orig_pte)) {
  2858			if (page)
  2859				put_page(page);
  2860			goto out;
  2861		}
  2862	
  2863		entry = pte_to_swp_entry(vmf->orig_pte);
  2864		if (unlikely(non_swap_entry(entry))) {
  2865			if (is_migration_entry(entry)) {
  2866				migration_entry_wait(vma->vm_mm, vmf->pmd,
  2867						     vmf->address);
  2868			} else if (is_device_private_entry(entry)) {
  2869				/*
  2870				 * For un-addressable device memory we call the pgmap
  2871				 * fault handler callback. The callback must migrate
  2872				 * the page back to some CPU accessible page.
  2873				 */
  2874				ret = device_private_entry_fault(vma, vmf->address, entry,
  2875							 vmf->flags, vmf->pmd);
  2876			} else if (is_hwpoison_entry(entry)) {
  2877				ret = VM_FAULT_HWPOISON;
  2878			} else {
  2879				print_bad_pte(vma, vmf->address, vmf->orig_pte, NULL);
  2880				ret = VM_FAULT_SIGBUS;
  2881			}
  2882			goto out;
  2883		}
  2884	
  2885	
  2886		delayacct_set_flag(DELAYACCT_PF_SWAPIN);
  2887		if (!page)
  2888			page = lookup_swap_cache(entry, vma_readahead ? vma : NULL,
  2889						 vmf->address);
  2890		if (!page) {
> 2891			struct swap_info_struct *si = swp_swap_info(entry);
  2892	
  2893			if (!(si->flags & SWP_SYNCHRONOUS_IO)) {
  2894				if (vma_readahead)
  2895					page = do_swap_page_readahead(entry,
  2896						GFP_HIGHUSER_MOVABLE, vmf, &swap_ra);
  2897				else
  2898					page = swapin_readahead(entry,
  2899						GFP_HIGHUSER_MOVABLE, vma, vmf->address);
  2900				swapcache = page;
  2901			} else {
  2902				/* skip swapcache */
  2903				page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vmf->address);
  2904				__SetPageLocked(page);
  2905				__SetPageSwapBacked(page);
  2906				set_page_private(page, entry.val);
  2907				lru_cache_add_anon(page);
> 2908				swap_readpage(page, true);
  2909			}
  2910	
  2911			if (!page) {
  2912				/*
  2913				 * Back out if somebody else faulted in this pte
  2914				 * while we released the pte lock.
  2915				 */
  2916				vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
  2917						vmf->address, &vmf->ptl);
  2918				if (likely(pte_same(*vmf->pte, vmf->orig_pte)))
  2919					ret = VM_FAULT_OOM;
  2920				delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
  2921				goto unlock;
  2922			}
  2923	
  2924			/* Had to read the page from swap area: Major fault */
  2925			ret = VM_FAULT_MAJOR;
  2926			count_vm_event(PGMAJFAULT);
  2927			count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
  2928		} else if (PageHWPoison(page)) {
  2929			/*
  2930			 * hwpoisoned dirty swapcache pages are kept for killing
  2931			 * owner processes (which may be unknown at hwpoison time)
  2932			 */
  2933			ret = VM_FAULT_HWPOISON;
  2934			delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
  2935			swapcache = page;
  2936			goto out_release;
  2937		}
  2938	
  2939		locked = lock_page_or_retry(page, vma->vm_mm, vmf->flags);
  2940	
  2941		delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
  2942		if (!locked) {
  2943			ret |= VM_FAULT_RETRY;
  2944			goto out_release;
  2945		}
  2946	
  2947		/*
  2948		 * Make sure try_to_free_swap or reuse_swap_page or swapoff did not
  2949		 * release the swapcache from under us.  The page pin, and pte_same
  2950		 * test below, are not enough to exclude that.  Even if it is still
  2951		 * swapcache, we need to check that the page's swap has not changed.
  2952		 */
  2953		if (unlikely((!PageSwapCache(page) ||
  2954				page_private(page) != entry.val)) && swapcache)
  2955			goto out_page;
  2956	
  2957		page = ksm_might_need_to_copy(page, vma, vmf->address);
  2958		if (unlikely(!page)) {
  2959			ret = VM_FAULT_OOM;
  2960			page = swapcache;
  2961			goto out_page;
  2962		}
  2963	
  2964		if (mem_cgroup_try_charge(page, vma->vm_mm, GFP_KERNEL,
  2965					&memcg, false)) {
  2966			ret = VM_FAULT_OOM;
  2967			goto out_page;
  2968		}
  2969	
  2970		/*
  2971		 * Back out if somebody else already faulted in this pte.
  2972		 */
  2973		vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
  2974				&vmf->ptl);
  2975		if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte)))
  2976			goto out_nomap;
  2977	
  2978		if (unlikely(!PageUptodate(page))) {
  2979			ret = VM_FAULT_SIGBUS;
  2980			goto out_nomap;
  2981		}
  2982	
  2983		/*
  2984		 * The page isn't present yet, go ahead with the fault.
  2985		 *
  2986		 * Be careful about the sequence of operations here.
  2987		 * To get its accounting right, reuse_swap_page() must be called
  2988		 * while the page is counted on swap but not yet in mapcount i.e.
  2989		 * before page_add_anon_rmap() and swap_free(); try_to_free_swap()
  2990		 * must be called after the swap_free(), or it will never succeed.
  2991		 */
  2992	
  2993		inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
  2994		dec_mm_counter_fast(vma->vm_mm, MM_SWAPENTS);
  2995		pte = mk_pte(page, vma->vm_page_prot);
  2996		if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page, NULL)) {
  2997			pte = maybe_mkwrite(pte_mkdirty(pte), vma);
  2998			vmf->flags &= ~FAULT_FLAG_WRITE;
  2999			ret |= VM_FAULT_WRITE;
  3000			exclusive = RMAP_EXCLUSIVE;
  3001		}
  3002		flush_icache_page(vma, page);
  3003		if (pte_swp_soft_dirty(vmf->orig_pte))
  3004			pte = pte_mksoft_dirty(pte);
  3005		set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
  3006		vmf->orig_pte = pte;
  3007	
  3008		/* ksm created a completely new copy */
  3009		if (unlikely(page != swapcache && swapcache)) {
  3010			page_add_new_anon_rmap(page, vma, vmf->address, false);
  3011			mem_cgroup_commit_charge(page, memcg, false, false);
  3012			lru_cache_add_active_or_unevictable(page, vma);
  3013		} else {
  3014			do_page_add_anon_rmap(page, vma, vmf->address, exclusive);
  3015			mem_cgroup_commit_charge(page, memcg, true, false);
  3016			activate_page(page);
  3017		}
  3018	
  3019		swap_free(entry);
  3020		if (mem_cgroup_swap_full(page) ||
  3021		    (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
  3022			try_to_free_swap(page);
  3023		unlock_page(page);
  3024		if (page != swapcache && swapcache) {
  3025			/*
  3026			 * Hold the lock to avoid the swap entry to be reused
  3027			 * until we take the PT lock for the pte_same() check
  3028			 * (to avoid false positives from pte_same). For
  3029			 * further safety release the lock after the swap_free
  3030			 * so that the swap count won't change under a
  3031			 * parallel locked swapcache.
  3032			 */
  3033			unlock_page(swapcache);
  3034			put_page(swapcache);
  3035		}
  3036	
  3037		if (vmf->flags & FAULT_FLAG_WRITE) {
  3038			ret |= do_wp_page(vmf);
  3039			if (ret & VM_FAULT_ERROR)
  3040				ret &= VM_FAULT_ERROR;
  3041			goto out;
  3042		}
  3043	
  3044		/* No need to invalidate - it was non-present before */
  3045		update_mmu_cache(vma, vmf->address, vmf->pte);
  3046	unlock:
  3047		pte_unmap_unlock(vmf->pte, vmf->ptl);
  3048	out:
  3049		return ret;
  3050	out_nomap:
  3051		mem_cgroup_cancel_charge(page, memcg, false);
  3052		pte_unmap_unlock(vmf->pte, vmf->ptl);
  3053	out_page:
  3054		unlock_page(page);
  3055	out_release:
  3056		put_page(page);
  3057		if (page != swapcache) {
  3058			unlock_page(swapcache);
  3059			put_page(swapcache);
  3060		}
  3061		return ret;
  3062	}
  3063	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 27043 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 4/5] mm:swap: respect page_cluster for readahead
  2017-09-12  8:32                   ` Huang, Ying
@ 2017-09-12 23:34                     ` Minchan Kim
  2017-09-13  0:55                       ` Huang, Ying
  0 siblings, 1 reply; 21+ messages in thread
From: Minchan Kim @ 2017-09-12 23:34 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, linux-mm, linux-kernel, kernel-team, Ilya Dryomov,
	Sergey Senozhatsky

On Tue, Sep 12, 2017 at 04:32:43PM +0800, Huang, Ying wrote:
> Minchan Kim <minchan@kernel.org> writes:
> 
> > On Tue, Sep 12, 2017 at 04:07:01PM +0800, Huang, Ying wrote:
> > < snip >
> >> >> > My concern is that users who have disabled swap readahead via page-cluster
> >> >> > would be regressed. Please take care of them.
> >> >> 
> >> >> How about disabling VMA-based swap readahead if zram is used as swap?  Like
> >> >> we have done for hard disks?
> >> >
> >> > It could be done with the SWP_SYNCHRONOUS_IO flag, which indicates
> >> > super-fast, no-seek-cost swap devices, once this patchset is merged, so
> >> > the VM automatically disables readahead. It is on my TODO list but is
> >> > orthogonal work.
> >> >
> >> > The problem I raised is "Why shouldn't we obey the user's decision?",
> >> > not a zram-specific issue.
> >> >
> >> > A user who has used an SSD as a swap device decided to disable swap
> >> > readahead for some reason (e.g., a small-memory system). Anyway, it has
> >> > worked via page-cluster for several years, but with vma-based swap
> >> > readahead it doesn't work any more.
> >> 
> >> Can they add one more line to their configuration scripts?
> >> 
> >> echo 0 > /sys/kernel/mm/swap/vma_ra_max_order
> >
> > We call that a "regression", don't we?
> 
> I think this always happens when we switch the default algorithm.  For
> example, if we had switched the default IO scheduler, then user scripts
> that configure the parameters of the old default IO scheduler would fail.

I don't follow what you are saying with that specific example.
If the kernel did something that breaks userspace which has worked well,
it should be fixed. No doubt.

Even if it happened by mistake, that couldn't be an excuse for the new
work to break things, either.

Simple. Fix the regression.

If you insist that "swap users should fix it themselves by modifying their
scripts, or it's not a regression", I don't want to waste my time
persuading you any more. I will ask Andrew to revert your patches.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 5/5] mm:swap: skip swapcache for swapin of synchronous device
  2017-09-12 20:22   ` kbuild test robot
@ 2017-09-13  0:16     ` Minchan Kim
  0 siblings, 0 replies; 21+ messages in thread
From: Minchan Kim @ 2017-09-13  0:16 UTC (permalink / raw)
  To: kbuild test robot
  Cc: kbuild-all, Andrew Morton, linux-mm, linux-kernel, kernel-team,
	Ilya Dryomov, Sergey Senozhatsky, Dan Williams, Ross Zwisler,
	Hugh Dickins

On Wed, Sep 13, 2017 at 04:22:59AM +0800, kbuild test robot wrote:
> Hi Minchan,
> 
> [auto build test ERROR on mmotm/master]
> [also build test ERROR on next-20170912]
> [cannot apply to linus/master v4.13]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
> 
> url:    https://github.com/0day-ci/linux/commits/Minchan-Kim/zram-set-BDI_CAP_STABLE_WRITES-once/20170913-025838
> base:   git://git.cmpxchg.org/linux-mmotm.git master
> config: x86_64-randconfig-x016-201737 (attached as .config)
> compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
> reproduce:
>         # save the attached .config to linux build tree
>         make ARCH=x86_64 
> 
> All error/warnings (new ones prefixed by >>):

I will fix !CONFIG_SWAP.
Thanks, 0-day!
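
For reference, one way the !CONFIG_SWAP failure above could be handled is
to give the two missing symbols empty stubs when swap is compiled out.
This is only a sketch of the idea, not necessarily the fix that was
actually sent; it assumes the stubs live in a swap header where
struct swap_info_struct and swp_entry_t are visible:

/*
 * Sketch only: hypothetical !CONFIG_SWAP stubs.  Without swap there are
 * no swap PTEs, so the SWP_SYNCHRONOUS_IO branch of do_swap_page() is
 * unreachable and no-op stubs are enough to keep the build happy.
 */
#ifndef CONFIG_SWAP
static inline struct swap_info_struct *swp_swap_info(swp_entry_t entry)
{
	return NULL;
}

static inline int swap_readpage(struct page *page, bool synchronous)
{
	return 0;
}
#endif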


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 4/5] mm:swap: respect page_cluster for readahead
  2017-09-12 23:34                     ` Minchan Kim
@ 2017-09-13  0:55                       ` Huang, Ying
  0 siblings, 0 replies; 21+ messages in thread
From: Huang, Ying @ 2017-09-13  0:55 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Huang, Ying, Andrew Morton, linux-mm, linux-kernel, kernel-team,
	Ilya Dryomov, Sergey Senozhatsky

Minchan Kim <minchan@kernel.org> writes:

> On Tue, Sep 12, 2017 at 04:32:43PM +0800, Huang, Ying wrote:
>> Minchan Kim <minchan@kernel.org> writes:
>> 
>> > On Tue, Sep 12, 2017 at 04:07:01PM +0800, Huang, Ying wrote:
>> > < snip >
>> >> >> > My concern is that users who have disabled swap readahead via page-cluster
>> >> >> > would be regressed. Please take care of them.
>> >> >> 
>> >> >> How about disabling VMA-based swap readahead if zram is used as swap?  Like
>> >> >> we have done for hard disks?
>> >> >
>> >> > It could be done with the SWP_SYNCHRONOUS_IO flag, which indicates
>> >> > super-fast, no-seek-cost swap devices, once this patchset is merged, so
>> >> > the VM automatically disables readahead. It is on my TODO list but is
>> >> > orthogonal work.
>> >> >
>> >> > The problem I raised is "Why shouldn't we obey the user's decision?",
>> >> > not a zram-specific issue.
>> >> >
>> >> > A user who has used an SSD as a swap device decided to disable swap
>> >> > readahead for some reason (e.g., a small-memory system). Anyway, it has
>> >> > worked via page-cluster for several years, but with vma-based swap
>> >> > readahead it doesn't work any more.
>> >> 
>> >> Can they add one more line to their configuration scripts?
>> >> 
>> >> echo 0 > /sys/kernel/mm/swap/vma_ra_max_order
>> >
>> > We call that a "regression", don't we?
>> 
>> I think this always happens when we switch the default algorithm.  For
>> example, if we had switched the default IO scheduler, then user scripts
>> that configure the parameters of the old default IO scheduler would fail.
>
> I don't follow what you are saying with that specific example.
> If the kernel did something that breaks userspace which has worked well,
> it should be fixed. No doubt.
>
> Even if it happened by mistake, that couldn't be an excuse for the new
> work to break things, either.
>
> Simple. Fix the regression.
>
> If you insist that "swap users should fix it themselves by modifying their
> scripts, or it's not a regression", I don't want to waste my time
> persuading you any more. I will ask Andrew to revert your patches.

There is definitely no functionality regression.  It may cause some
performance regression for some users, which could be resolved by changing
some scripts.  Please don't mix the two.

Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2017-09-13  0:55 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-09-12  2:37 [PATCH 1/5] zram: set BDI_CAP_STABLE_WRITES once Minchan Kim
2017-09-12  2:37 ` [PATCH 2/5] bdi: introduce BDI_CAP_SYNCHRONOUS_IO Minchan Kim
2017-09-12  2:37 ` [PATCH 3/5] mm:swap: introduce SWP_SYNCHRONOUS_IO Minchan Kim
2017-09-12  4:48   ` Sergey Senozhatsky
2017-09-12  2:37 ` [PATCH 4/5] mm:swap: respect page_cluster for readahead Minchan Kim
2017-09-12  5:23   ` Huang, Ying
2017-09-12  6:25     ` Minchan Kim
2017-09-12  6:44       ` Huang, Ying
2017-09-12  6:52         ` Minchan Kim
2017-09-12  7:29           ` Huang, Ying
2017-09-12  7:56             ` Minchan Kim
2017-09-12  8:07               ` Huang, Ying
2017-09-12  8:22                 ` Minchan Kim
2017-09-12  8:32                   ` Huang, Ying
2017-09-12 23:34                     ` Minchan Kim
2017-09-13  0:55                       ` Huang, Ying
2017-09-12  2:37 ` [PATCH 5/5] mm:swap: skip swapcache for swapin of synchronous device Minchan Kim
2017-09-12  6:04   ` Sergey Senozhatsky
2017-09-12  6:31     ` Minchan Kim
2017-09-12 20:22   ` kbuild test robot
2017-09-13  0:16     ` Minchan Kim

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).