* [PATCH 1/5] zram: set BDI_CAP_STABLE_WRITES once
@ 2017-09-12 2:37 Minchan Kim
2017-09-12 2:37 ` [PATCH 2/5] bdi: introduce BDI_CAP_SYNCHRONOUS_IO Minchan Kim
` (3 more replies)
0 siblings, 4 replies; 21+ messages in thread
From: Minchan Kim @ 2017-09-12 2:37 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, linux-kernel, kernel-team, Minchan Kim, Ilya Dryomov,
Sergey Senozhatsky
Commit [1] fixed the odd behavior of resetting the BDI_CAP_STABLE_WRITES
flag unconditionally whenever revalidate_disk is called, so zram no longer
needs to re-set the flag every time the bdev is revalidated.
Instead, set the flag just once, when the zram device is created.
This shouldn't change any behavior.
[1] 19b7ccf8651d, block: get rid of blk_integrity_revalidate()
Cc: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
drivers/block/zram/zram_drv.c | 16 ++++++----------
1 file changed, 6 insertions(+), 10 deletions(-)
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 2981c27d3aae..ba009675fdc0 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -122,14 +122,6 @@ static inline bool is_partial_io(struct bio_vec *bvec)
}
#endif
-static void zram_revalidate_disk(struct zram *zram)
-{
- revalidate_disk(zram->disk);
- /* revalidate_disk reset the BDI_CAP_STABLE_WRITES so set again */
- zram->disk->queue->backing_dev_info->capabilities |=
- BDI_CAP_STABLE_WRITES;
-}
-
/*
* Check if request is within bounds and aligned on zram logical blocks.
*/
@@ -1385,7 +1377,8 @@ static ssize_t disksize_store(struct device *dev,
zram->comp = comp;
zram->disksize = disksize;
set_capacity(zram->disk, zram->disksize >> SECTOR_SHIFT);
- zram_revalidate_disk(zram);
+
+ revalidate_disk(zram->disk);
up_write(&zram->init_lock);
return len;
@@ -1432,7 +1425,7 @@ static ssize_t reset_store(struct device *dev,
/* Make sure all the pending I/O are finished */
fsync_bdev(bdev);
zram_reset_device(zram);
- zram_revalidate_disk(zram);
+ revalidate_disk(zram->disk);
bdput(bdev);
mutex_lock(&bdev->bd_mutex);
@@ -1551,6 +1544,7 @@ static int zram_add(void)
/* zram devices sort of resembles non-rotational disks */
queue_flag_set_unlocked(QUEUE_FLAG_NONROT, zram->disk->queue);
queue_flag_clear_unlocked(QUEUE_FLAG_ADD_RANDOM, zram->disk->queue);
+
/*
* To ensure that we always get PAGE_SIZE aligned
* and n*PAGE_SIZED sized I/O requests.
@@ -1575,6 +1569,8 @@ static int zram_add(void)
if (ZRAM_LOGICAL_BLOCK_SIZE == PAGE_SIZE)
blk_queue_max_write_zeroes_sectors(zram->disk->queue, UINT_MAX);
+ zram->disk->queue->backing_dev_info->capabilities |=
+ BDI_CAP_STABLE_WRITES;
add_disk(zram->disk);
ret = sysfs_create_group(&disk_to_dev(zram->disk)->kobj,
--
2.7.4
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
* [PATCH 2/5] bdi: introduce BDI_CAP_SYNCHRONOUS_IO
2017-09-12 2:37 [PATCH 1/5] zram: set BDI_CAP_STABLE_WRITES once Minchan Kim
@ 2017-09-12 2:37 ` Minchan Kim
2017-09-12 2:37 ` [PATCH 3/5] mm:swap: introduce SWP_SYNCHRONOUS_IO Minchan Kim
` (2 subsequent siblings)
3 siblings, 0 replies; 21+ messages in thread
From: Minchan Kim @ 2017-09-12 2:37 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, linux-kernel, kernel-team, Minchan Kim, Ilya Dryomov,
Sergey Senozhatsky, Christoph Hellwig, Dan Williams,
Ross Zwisler
Per the discussion in [1], the rw_page function will be removed someday.
If so, we need a way to detect super-fast storage on which synchronous IO,
as the current rw_page does, always wins.
This patch introduces BDI_CAP_SYNCHRONOUS_IO to indicate such devices.
With it, we can apply various optimization techniques.
[1] lkml.kernel.org/r/<20170728165604.10455-1-ross.zwisler@linux.intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
drivers/block/brd.c | 2 ++
drivers/block/zram/zram_drv.c | 2 +-
drivers/nvdimm/btt.c | 3 +++
drivers/nvdimm/pmem.c | 2 ++
include/linux/backing-dev.h | 8 ++++++++
5 files changed, 16 insertions(+), 1 deletion(-)
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index bbd0d186cfc0..1fdb736aa882 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -20,6 +20,7 @@
#include <linux/radix-tree.h>
#include <linux/fs.h>
#include <linux/slab.h>
+#include <linux/backing-dev.h>
#ifdef CONFIG_BLK_DEV_RAM_DAX
#include <linux/pfn_t.h>
#include <linux/dax.h>
@@ -449,6 +450,7 @@ static struct brd_device *brd_alloc(int i)
disk->flags = GENHD_FL_EXT_DEVT;
sprintf(disk->disk_name, "ram%d", i);
set_capacity(disk, rd_size * 2);
+ disk->queue->backing_dev_info->capabilities |= BDI_CAP_SYNCHRONOUS_IO;
#ifdef CONFIG_BLK_DEV_RAM_DAX
queue_flag_set_unlocked(QUEUE_FLAG_DAX, brd->brd_queue);
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index ba009675fdc0..91a72df41ab0 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1570,7 +1570,7 @@ static int zram_add(void)
blk_queue_max_write_zeroes_sectors(zram->disk->queue, UINT_MAX);
zram->disk->queue->backing_dev_info->capabilities |=
- BDI_CAP_STABLE_WRITES;
+ (BDI_CAP_STABLE_WRITES | BDI_CAP_SYNCHRONOUS_IO);
add_disk(zram->disk);
ret = sysfs_create_group(&disk_to_dev(zram->disk)->kobj,
diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
index d5612bd1cc81..e949e3302af4 100644
--- a/drivers/nvdimm/btt.c
+++ b/drivers/nvdimm/btt.c
@@ -23,6 +23,7 @@
#include <linux/ndctl.h>
#include <linux/fs.h>
#include <linux/nd.h>
+#include <linux/backing-dev.h>
#include "btt.h"
#include "nd.h"
@@ -1402,6 +1403,8 @@ static int btt_blk_init(struct btt *btt)
btt->btt_disk->private_data = btt;
btt->btt_disk->queue = btt->btt_queue;
btt->btt_disk->flags = GENHD_FL_EXT_DEVT;
+ btt->btt_disk->queue->backing_dev_info->capabilities |=
+ BDI_CAP_SYNCHRONOUS_IO;
blk_queue_make_request(btt->btt_queue, btt_make_request);
blk_queue_logical_block_size(btt->btt_queue, btt->sector_size);
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index e9aa453da50c..5fc5999dca6a 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -31,6 +31,7 @@
#include <linux/uio.h>
#include <linux/dax.h>
#include <linux/nd.h>
+#include <linux/backing-dev.h>
#include "pmem.h"
#include "pfn.h"
#include "nd.h"
@@ -401,6 +402,7 @@ static int pmem_attach_disk(struct device *dev,
disk->fops = &pmem_fops;
disk->queue = q;
disk->flags = GENHD_FL_EXT_DEVT;
+ disk->queue->backing_dev_info->capabilities |= BDI_CAP_SYNCHRONOUS_IO;
nvdimm_namespace_disk_name(ndns, disk->disk_name);
set_capacity(disk, (pmem->size - pmem->pfn_pad - pmem->data_offset)
/ 512);
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 854e1bdd0b2a..cd41617c6594 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -123,6 +123,8 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned int max_ratio);
* BDI_CAP_STRICTLIMIT: Keep number of dirty pages below bdi threshold.
*
* BDI_CAP_CGROUP_WRITEBACK: Supports cgroup-aware writeback.
+ * BDI_CAP_SYNCHRONOUS_IO: Device is so fast that asynchronous IO would be
+ * inefficient.
*/
#define BDI_CAP_NO_ACCT_DIRTY 0x00000001
#define BDI_CAP_NO_WRITEBACK 0x00000002
@@ -130,6 +132,7 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned int max_ratio);
#define BDI_CAP_STABLE_WRITES 0x00000008
#define BDI_CAP_STRICTLIMIT 0x00000010
#define BDI_CAP_CGROUP_WRITEBACK 0x00000020
+#define BDI_CAP_SYNCHRONOUS_IO 0x00000040
#define BDI_CAP_NO_ACCT_AND_WRITEBACK \
(BDI_CAP_NO_WRITEBACK | BDI_CAP_NO_ACCT_DIRTY | BDI_CAP_NO_ACCT_WB)
@@ -177,6 +180,11 @@ long wait_iff_congested(struct pglist_data *pgdat, int sync, long timeout);
int pdflush_proc_obsolete(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp, loff_t *ppos);
+static inline bool bdi_cap_synchronous_io(struct backing_dev_info *bdi)
+{
+ return bdi->capabilities & BDI_CAP_SYNCHRONOUS_IO;
+}
+
static inline bool bdi_cap_stable_pages_required(struct backing_dev_info *bdi)
{
return bdi->capabilities & BDI_CAP_STABLE_WRITES;
--
2.7.4
* [PATCH 3/5] mm:swap: introduce SWP_SYNCHRONOUS_IO
2017-09-12 2:37 [PATCH 1/5] zram: set BDI_CAP_STABLE_WRITES once Minchan Kim
2017-09-12 2:37 ` [PATCH 2/5] bdi: introduce BDI_CAP_SYNCHRONOUS_IO Minchan Kim
@ 2017-09-12 2:37 ` Minchan Kim
2017-09-12 4:48 ` Sergey Senozhatsky
2017-09-12 2:37 ` [PATCH 4/5] mm:swap: respect page_cluster for readahead Minchan Kim
2017-09-12 2:37 ` [PATCH 5/5] mm:swap: skip swapcache for swapin of synchronous device Minchan Kim
3 siblings, 1 reply; 21+ messages in thread
From: Minchan Kim @ 2017-09-12 2:37 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, linux-kernel, kernel-team, Minchan Kim, Ilya Dryomov,
Sergey Senozhatsky
If rw_page based fast storage is used as a swap device, we need to
detect it so we can enhance swap IO operations.
This patch is preparation for optimizing the swap-in operation in the
next patch.
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
include/linux/swap.h | 3 ++-
mm/swapfile.c | 3 +++
2 files changed, 5 insertions(+), 1 deletion(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 8a807292037f..0f54b491e118 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -170,8 +170,9 @@ enum {
SWP_AREA_DISCARD = (1 << 8), /* single-time swap area discards */
SWP_PAGE_DISCARD = (1 << 9), /* freed swap page-cluster discards */
SWP_STABLE_WRITES = (1 << 10), /* no overwrite PG_writeback pages */
+ SWP_SYNCHRONOUS_IO = (1<<11), /* synchronous IO is efficient */
/* add others here before... */
- SWP_SCANNING = (1 << 11), /* refcount in scan_swap_map */
+ SWP_SCANNING = (1 << 12), /* refcount in scan_swap_map */
};
#define SWAP_CLUSTER_MAX 32UL
diff --git a/mm/swapfile.c b/mm/swapfile.c
index bf91dc9e7a79..1305591cde4d 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3168,6 +3168,9 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
if (bdi_cap_stable_pages_required(inode_to_bdi(inode)))
p->flags |= SWP_STABLE_WRITES;
+ if (bdi_cap_synchronous_io(inode_to_bdi(inode)))
+ p->flags |= SWP_SYNCHRONOUS_IO;
+
if (p->bdev && blk_queue_nonrot(bdev_get_queue(p->bdev))) {
int cpu;
unsigned long ci, nr_cluster;
--
2.7.4
* [PATCH 4/5] mm:swap: respect page_cluster for readahead
2017-09-12 2:37 [PATCH 1/5] zram: set BDI_CAP_STABLE_WRITES once Minchan Kim
2017-09-12 2:37 ` [PATCH 2/5] bdi: introduce BDI_CAP_SYNCHRONOUS_IO Minchan Kim
2017-09-12 2:37 ` [PATCH 3/5] mm:swap: introduce SWP_SYNCHRONOUS_IO Minchan Kim
@ 2017-09-12 2:37 ` Minchan Kim
2017-09-12 5:23 ` Huang, Ying
2017-09-12 2:37 ` [PATCH 5/5] mm:swap: skip swapcache for swapin of synchronous device Minchan Kim
3 siblings, 1 reply; 21+ messages in thread
From: Minchan Kim @ 2017-09-12 2:37 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, linux-kernel, kernel-team, Minchan Kim, Ilya Dryomov,
Sergey Senozhatsky, Huang, Ying
A page_cluster of 0 means "we don't want readahead", so in that case,
skip the readahead detection logic.
Cc: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
include/linux/swap.h | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 0f54b491e118..739d94397c47 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -427,7 +427,8 @@ extern bool has_usable_swap(void);
static inline bool swap_use_vma_readahead(void)
{
- return READ_ONCE(swap_vma_readahead) && !atomic_read(&nr_rotate_swap);
+ return page_cluster > 0 && READ_ONCE(swap_vma_readahead)
+ && !atomic_read(&nr_rotate_swap);
}
/* Swap 50% full? Release swapcache more aggressively.. */
--
2.7.4
* [PATCH 5/5] mm:swap: skip swapcache for swapin of synchronous device
2017-09-12 2:37 [PATCH 1/5] zram: set BDI_CAP_STABLE_WRITES once Minchan Kim
` (2 preceding siblings ...)
2017-09-12 2:37 ` [PATCH 4/5] mm:swap: respect page_cluster for readahead Minchan Kim
@ 2017-09-12 2:37 ` Minchan Kim
2017-09-12 6:04 ` Sergey Senozhatsky
2017-09-12 20:22 ` kbuild test robot
3 siblings, 2 replies; 21+ messages in thread
From: Minchan Kim @ 2017-09-12 2:37 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, linux-kernel, kernel-team, Minchan Kim, Ilya Dryomov,
Sergey Senozhatsky, Dan Williams, Ross Zwisler, Hugh Dickins
With fast swap storage, platforms want to use swap more aggressively,
and swap-in is crucial to application latency.
rw_page based synchronous devices like zram, pmem, and btt are such
fast storage. When I profile swap-in performance with a zram lz4
decompression test, software overhead is more than 70%. It may be even
bigger on nvdimm.
This patch aims to reduce swap-in latency by skipping the swapcache
when the swap device is a synchronous device like an rw_page based one.
It improves my swap-in test (5G sequential swap-in, no readahead) by
45%, from 2.41 sec to 1.64 sec.
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
include/linux/swap.h | 1 +
mm/memory.c | 48 +++++++++++++++++++++++++++++++++---------------
mm/page_io.c | 6 +++---
mm/swapfile.c | 11 +++++++----
4 files changed, 44 insertions(+), 22 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 739d94397c47..fb46a3b55d65 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -462,6 +462,7 @@ extern int page_swapcount(struct page *);
extern int __swp_swapcount(swp_entry_t entry);
extern int swp_swapcount(swp_entry_t entry);
extern struct swap_info_struct *page_swap_info(struct page *);
+extern struct swap_info_struct *swp_swap_info(swp_entry_t entry);
extern bool reuse_swap_page(struct page *, int *);
extern int try_to_free_swap(struct page *);
struct backing_dev_info;
diff --git a/mm/memory.c b/mm/memory.c
index ec4e15494901..130698cd2d98 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2842,7 +2842,7 @@ EXPORT_SYMBOL(unmap_mapping_range);
int do_swap_page(struct vm_fault *vmf)
{
struct vm_area_struct *vma = vmf->vma;
- struct page *page = NULL, *swapcache;
+ struct page *page = NULL, *swapcache = NULL;
struct mem_cgroup *memcg;
struct vma_swap_readahead swap_ra;
swp_entry_t entry;
@@ -2881,17 +2881,33 @@ int do_swap_page(struct vm_fault *vmf)
}
goto out;
}
+
+
delayacct_set_flag(DELAYACCT_PF_SWAPIN);
if (!page)
page = lookup_swap_cache(entry, vma_readahead ? vma : NULL,
vmf->address);
if (!page) {
- if (vma_readahead)
- page = do_swap_page_readahead(entry,
- GFP_HIGHUSER_MOVABLE, vmf, &swap_ra);
- else
- page = swapin_readahead(entry,
- GFP_HIGHUSER_MOVABLE, vma, vmf->address);
+ struct swap_info_struct *si = swp_swap_info(entry);
+
+ if (!(si->flags & SWP_SYNCHRONOUS_IO)) {
+ if (vma_readahead)
+ page = do_swap_page_readahead(entry,
+ GFP_HIGHUSER_MOVABLE, vmf, &swap_ra);
+ else
+ page = swapin_readahead(entry,
+ GFP_HIGHUSER_MOVABLE, vma, vmf->address);
+ swapcache = page;
+ } else {
+ /* skip swapcache */
+ page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vmf->address);
+ __SetPageLocked(page);
+ __SetPageSwapBacked(page);
+ set_page_private(page, entry.val);
+ lru_cache_add_anon(page);
+ swap_readpage(page, true);
+ }
+
if (!page) {
/*
* Back out if somebody else faulted in this pte
@@ -2920,7 +2936,6 @@ int do_swap_page(struct vm_fault *vmf)
goto out_release;
}
- swapcache = page;
locked = lock_page_or_retry(page, vma->vm_mm, vmf->flags);
delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
@@ -2935,7 +2950,8 @@ int do_swap_page(struct vm_fault *vmf)
* test below, are not enough to exclude that. Even if it is still
* swapcache, we need to check that the page's swap has not changed.
*/
- if (unlikely(!PageSwapCache(page) || page_private(page) != entry.val))
+ if (unlikely((!PageSwapCache(page) ||
+ page_private(page) != entry.val)) && swapcache)
goto out_page;
page = ksm_might_need_to_copy(page, vma, vmf->address);
@@ -2988,14 +3004,16 @@ int do_swap_page(struct vm_fault *vmf)
pte = pte_mksoft_dirty(pte);
set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
vmf->orig_pte = pte;
- if (page == swapcache) {
- do_page_add_anon_rmap(page, vma, vmf->address, exclusive);
- mem_cgroup_commit_charge(page, memcg, true, false);
- activate_page(page);
- } else { /* ksm created a completely new copy */
+
+ /* ksm created a completely new copy */
+ if (unlikely(page != swapcache && swapcache)) {
page_add_new_anon_rmap(page, vma, vmf->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
lru_cache_add_active_or_unevictable(page, vma);
+ } else {
+ do_page_add_anon_rmap(page, vma, vmf->address, exclusive);
+ mem_cgroup_commit_charge(page, memcg, true, false);
+ activate_page(page);
}
swap_free(entry);
@@ -3003,7 +3021,7 @@ int do_swap_page(struct vm_fault *vmf)
(vma->vm_flags & VM_LOCKED) || PageMlocked(page))
try_to_free_swap(page);
unlock_page(page);
- if (page != swapcache) {
+ if (page != swapcache && swapcache) {
/*
* Hold the lock to avoid the swap entry to be reused
* until we take the PT lock for the pte_same() check
diff --git a/mm/page_io.c b/mm/page_io.c
index 21502d341a67..d4a98e1f6608 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -346,7 +346,7 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
return ret;
}
-int swap_readpage(struct page *page, bool do_poll)
+int swap_readpage(struct page *page, bool synchronous)
{
struct bio *bio;
int ret = 0;
@@ -354,7 +354,7 @@ int swap_readpage(struct page *page, bool do_poll)
blk_qc_t qc;
struct gendisk *disk;
- VM_BUG_ON_PAGE(!PageSwapCache(page), page);
+ VM_BUG_ON_PAGE(!PageSwapCache(page) && !synchronous, page);
VM_BUG_ON_PAGE(!PageLocked(page), page);
VM_BUG_ON_PAGE(PageUptodate(page), page);
if (frontswap_load(page) == 0) {
@@ -402,7 +402,7 @@ int swap_readpage(struct page *page, bool do_poll)
count_vm_event(PSWPIN);
bio_get(bio);
qc = submit_bio(bio);
- while (do_poll) {
+ while (synchronous) {
set_current_state(TASK_UNINTERRUPTIBLE);
if (!READ_ONCE(bio->bi_private))
break;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 1305591cde4d..64a3d85226ba 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3454,10 +3454,15 @@ int swapcache_prepare(swp_entry_t entry)
return __swap_duplicate(entry, SWAP_HAS_CACHE);
}
+struct swap_info_struct *swp_swap_info(swp_entry_t entry)
+{
+ return swap_info[swp_type(entry)];
+}
+
struct swap_info_struct *page_swap_info(struct page *page)
{
- swp_entry_t swap = { .val = page_private(page) };
- return swap_info[swp_type(swap)];
+ swp_entry_t entry = { .val = page_private(page) };
+ return swp_swap_info(entry);
}
/*
@@ -3465,7 +3470,6 @@ struct swap_info_struct *page_swap_info(struct page *page)
*/
struct address_space *__page_file_mapping(struct page *page)
{
- VM_BUG_ON_PAGE(!PageSwapCache(page), page);
return page_swap_info(page)->swap_file->f_mapping;
}
EXPORT_SYMBOL_GPL(__page_file_mapping);
@@ -3473,7 +3477,6 @@ EXPORT_SYMBOL_GPL(__page_file_mapping);
pgoff_t __page_file_index(struct page *page)
{
swp_entry_t swap = { .val = page_private(page) };
- VM_BUG_ON_PAGE(!PageSwapCache(page), page);
return swp_offset(swap);
}
EXPORT_SYMBOL_GPL(__page_file_index);
--
2.7.4
* Re: [PATCH 3/5] mm:swap: introduce SWP_SYNCHRONOUS_IO
2017-09-12 2:37 ` [PATCH 3/5] mm:swap: introduce SWP_SYNCHRONOUS_IO Minchan Kim
@ 2017-09-12 4:48 ` Sergey Senozhatsky
0 siblings, 0 replies; 21+ messages in thread
From: Sergey Senozhatsky @ 2017-09-12 4:48 UTC (permalink / raw)
To: Minchan Kim
Cc: Andrew Morton, linux-mm, linux-kernel, kernel-team, Ilya Dryomov,
Sergey Senozhatsky
On (09/12/17 11:37), Minchan Kim wrote:
[..]
> If rw-page based fast storage is used for swap devices, we need to
> detect it to enhance swap IO operations.
> This patch is preparation for optimizing of swap-in operation with
> next patch.
>
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
> include/linux/swap.h | 3 ++-
> mm/swapfile.c | 3 +++
> 2 files changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 8a807292037f..0f54b491e118 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -170,8 +170,9 @@ enum {
> SWP_AREA_DISCARD = (1 << 8), /* single-time swap area discards */
> SWP_PAGE_DISCARD = (1 << 9), /* freed swap page-cluster discards */
> SWP_STABLE_WRITES = (1 << 10), /* no overwrite PG_writeback pages */
> + SWP_SYNCHRONOUS_IO = (1<<11), /* synchronous IO is efficient */
a nitpick: (1 << 11)
-ss
> /* add others here before... */
> - SWP_SCANNING = (1 << 11), /* refcount in scan_swap_map */
> + SWP_SCANNING = (1 << 12), /* refcount in scan_swap_map */
> };
* Re: [PATCH 4/5] mm:swap: respect page_cluster for readahead
2017-09-12 2:37 ` [PATCH 4/5] mm:swap: respect page_cluster for readahead Minchan Kim
@ 2017-09-12 5:23 ` Huang, Ying
2017-09-12 6:25 ` Minchan Kim
0 siblings, 1 reply; 21+ messages in thread
From: Huang, Ying @ 2017-09-12 5:23 UTC (permalink / raw)
To: Minchan Kim
Cc: Andrew Morton, linux-mm, linux-kernel, kernel-team, Ilya Dryomov,
Sergey Senozhatsky, Huang, Ying
Minchan Kim <minchan@kernel.org> writes:
> page_cluster 0 means "we don't want readahead" so in the case,
> let's skip the readahead detection logic.
>
> Cc: "Huang, Ying" <ying.huang@intel.com>
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
> include/linux/swap.h | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 0f54b491e118..739d94397c47 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -427,7 +427,8 @@ extern bool has_usable_swap(void);
>
> static inline bool swap_use_vma_readahead(void)
> {
> - return READ_ONCE(swap_vma_readahead) && !atomic_read(&nr_rotate_swap);
> + return page_cluster > 0 && READ_ONCE(swap_vma_readahead)
> + && !atomic_read(&nr_rotate_swap);
> }
>
> /* Swap 50% full? Release swapcache more aggressively.. */
Now the readahead window size of the VMA based swap readahead is
controlled by /sys/kernel/mm/swap/vma_ra_max_order, while that of the
original swap readahead is controlled by the sysctl page_cluster. It is
possible for anonymous memory to use VMA based swap readahead and for
tmpfs to use the original swap readahead algorithm at the same time.
Therefore, I think it is necessary to use different control knobs for
these two algorithms. For example, if we want to disable readahead for
tmpfs but keep it for VMA based readahead, we can set page_cluster to 0
and /sys/kernel/mm/swap/vma_ra_max_order to non-zero. With your change,
this would be impossible.
Best Regards,
Huang, Ying
* Re: [PATCH 5/5] mm:swap: skip swapcache for swapin of synchronous device
2017-09-12 2:37 ` [PATCH 5/5] mm:swap: skip swapcache for swapin of synchronous device Minchan Kim
@ 2017-09-12 6:04 ` Sergey Senozhatsky
2017-09-12 6:31 ` Minchan Kim
2017-09-12 20:22 ` kbuild test robot
1 sibling, 1 reply; 21+ messages in thread
From: Sergey Senozhatsky @ 2017-09-12 6:04 UTC (permalink / raw)
To: Minchan Kim
Cc: Andrew Morton, linux-mm, linux-kernel, kernel-team, Ilya Dryomov,
Sergey Senozhatsky, Dan Williams, Ross Zwisler, Hugh Dickins
On (09/12/17 11:37), Minchan Kim wrote:
> + } else {
> + /* skip swapcache */
> + page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vmf->address);
what if alloc_page_vma() fails?
> + __SetPageLocked(page);
> + __SetPageSwapBacked(page);
> + set_page_private(page, entry.val);
> + lru_cache_add_anon(page);
> + swap_readpage(page, true);
> + }
-ss
* Re: [PATCH 4/5] mm:swap: respect page_cluster for readahead
2017-09-12 5:23 ` Huang, Ying
@ 2017-09-12 6:25 ` Minchan Kim
2017-09-12 6:44 ` Huang, Ying
0 siblings, 1 reply; 21+ messages in thread
From: Minchan Kim @ 2017-09-12 6:25 UTC (permalink / raw)
To: Huang, Ying
Cc: Andrew Morton, linux-mm, linux-kernel, kernel-team, Ilya Dryomov,
Sergey Senozhatsky
On Tue, Sep 12, 2017 at 01:23:01PM +0800, Huang, Ying wrote:
> Minchan Kim <minchan@kernel.org> writes:
>
> > page_cluster 0 means "we don't want readahead" so in the case,
> > let's skip the readahead detection logic.
> >
> > Cc: "Huang, Ying" <ying.huang@intel.com>
> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > ---
> > include/linux/swap.h | 3 ++-
> > 1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 0f54b491e118..739d94397c47 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -427,7 +427,8 @@ extern bool has_usable_swap(void);
> >
> > static inline bool swap_use_vma_readahead(void)
> > {
> > - return READ_ONCE(swap_vma_readahead) && !atomic_read(&nr_rotate_swap);
> > + return page_cluster > 0 && READ_ONCE(swap_vma_readahead)
> > + && !atomic_read(&nr_rotate_swap);
> > }
> >
> > /* Swap 50% full? Release swapcache more aggressively.. */
>
> Now the readahead window size of the VMA based swap readahead is
> controlled by /sys/kernel/mm/swap/vma_ra_max_order, while that of the
> original swap readahead is controlled by sysctl page_cluster. It is
> possible for anonymous memory to use VMA based swap readahead and tmpfs
> to use original swap readahead algorithm at the same time. So that, I
> think it is necessary to use different control knob to control these two
> algorithm. So if we want to disable readahead for tmpfs, but keep it
> for VMA based readahead, we can set 0 to page_cluster but non-zero to
> /sys/kernel/mm/swap/vma_ra_max_order. With your change, this will be
> impossible.
For a long time, page-cluster has been used to control swap readahead.
As one example, zram users have disabled readahead by setting
page-cluster to 0. However, with your change, they would regress unless
they also disable vma_ra_max_order.
Moreover, all swap users would have to be aware of vma_ra_max_order as
well as page-cluster to control swap readahead, but I didn't see any
documentation about that. Actually, I don't like the extra knob and want
to unify it with page-cluster.
* Re: [PATCH 5/5] mm:swap: skip swapcache for swapin of synchronous device
2017-09-12 6:04 ` Sergey Senozhatsky
@ 2017-09-12 6:31 ` Minchan Kim
0 siblings, 0 replies; 21+ messages in thread
From: Minchan Kim @ 2017-09-12 6:31 UTC (permalink / raw)
To: Sergey Senozhatsky
Cc: Andrew Morton, linux-mm, linux-kernel, kernel-team, Ilya Dryomov,
Sergey Senozhatsky, Dan Williams, Ross Zwisler, Hugh Dickins
On Tue, Sep 12, 2017 at 03:04:56PM +0900, Sergey Senozhatsky wrote:
> On (09/12/17 11:37), Minchan Kim wrote:
> > + } else {
> > + /* skip swapcache */
> > + page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vmf->address);
>
> what if alloc_page_vma() fails?
Several modifications during development led me to drop the NULL check
by mistake.
Thanks for catching it, as well as for the style fix-up in the other
patch, Sergey.
* Re: [PATCH 4/5] mm:swap: respect page_cluster for readahead
2017-09-12 6:25 ` Minchan Kim
@ 2017-09-12 6:44 ` Huang, Ying
2017-09-12 6:52 ` Minchan Kim
0 siblings, 1 reply; 21+ messages in thread
From: Huang, Ying @ 2017-09-12 6:44 UTC (permalink / raw)
To: Minchan Kim
Cc: Huang, Ying, Andrew Morton, linux-mm, linux-kernel, kernel-team,
Ilya Dryomov, Sergey Senozhatsky
Minchan Kim <minchan@kernel.org> writes:
> On Tue, Sep 12, 2017 at 01:23:01PM +0800, Huang, Ying wrote:
>> Minchan Kim <minchan@kernel.org> writes:
>>
>> > page_cluster 0 means "we don't want readahead" so in the case,
>> > let's skip the readahead detection logic.
>> >
>> > Cc: "Huang, Ying" <ying.huang@intel.com>
>> > Signed-off-by: Minchan Kim <minchan@kernel.org>
>> > ---
>> > include/linux/swap.h | 3 ++-
>> > 1 file changed, 2 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/include/linux/swap.h b/include/linux/swap.h
>> > index 0f54b491e118..739d94397c47 100644
>> > --- a/include/linux/swap.h
>> > +++ b/include/linux/swap.h
>> > @@ -427,7 +427,8 @@ extern bool has_usable_swap(void);
>> >
>> > static inline bool swap_use_vma_readahead(void)
>> > {
>> > - return READ_ONCE(swap_vma_readahead) && !atomic_read(&nr_rotate_swap);
>> > + return page_cluster > 0 && READ_ONCE(swap_vma_readahead)
>> > + && !atomic_read(&nr_rotate_swap);
>> > }
>> >
>> > /* Swap 50% full? Release swapcache more aggressively.. */
>>
>> Now the readahead window size of the VMA based swap readahead is
>> controlled by /sys/kernel/mm/swap/vma_ra_max_order, while that of the
>> original swap readahead is controlled by sysctl page_cluster. It is
>> possible for anonymous memory to use VMA based swap readahead and tmpfs
>> to use original swap readahead algorithm at the same time. So that, I
>> think it is necessary to use different control knob to control these two
>> algorithm. So if we want to disable readahead for tmpfs, but keep it
>> for VMA based readahead, we can set 0 to page_cluster but non-zero to
>> /sys/kernel/mm/swap/vma_ra_max_order. With your change, this will be
>> impossible.
>
> For a long time, page-cluster have been used as controlling swap readahead.
> One of example, zram users have been disabled readahead via 0 page-cluster.
> However, with your change, it would be regressed if it doesn't disable
> vma_ra_max_order.
>
> As well, all of swap users should be aware of vma_ra_max_order as well as
> page-cluster to control swap readahead but I didn't see any document about
> that. Acutaully, I don't like it but want to unify it with page-cluster.
The documentation is in
Documentation/ABI/testing/sysfs-kernel-mm-swap
The concern with unifying it with page-cluster is as follows. The
original swap readahead on tmpfs may not work well when a combined
workload is running, so we want to disable or constrain it. But at the
same time, the VMA based swap readahead may work better. So I think it
may be necessary to control them separately.
Best Regards,
Huang, Ying
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 4/5] mm:swap: respect page_cluster for readahead
2017-09-12 6:44 ` Huang, Ying
@ 2017-09-12 6:52 ` Minchan Kim
2017-09-12 7:29 ` Huang, Ying
0 siblings, 1 reply; 21+ messages in thread
From: Minchan Kim @ 2017-09-12 6:52 UTC (permalink / raw)
To: Huang, Ying
Cc: Andrew Morton, linux-mm, linux-kernel, kernel-team, Ilya Dryomov,
Sergey Senozhatsky
On Tue, Sep 12, 2017 at 02:44:36PM +0800, Huang, Ying wrote:
> Minchan Kim <minchan@kernel.org> writes:
>
> > On Tue, Sep 12, 2017 at 01:23:01PM +0800, Huang, Ying wrote:
> >> Minchan Kim <minchan@kernel.org> writes:
> >>
> >> > page_cluster 0 means "we don't want readahead" so in the case,
> >> > let's skip the readahead detection logic.
> >> >
> >> > Cc: "Huang, Ying" <ying.huang@intel.com>
> >> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> >> > ---
> >> > include/linux/swap.h | 3 ++-
> >> > 1 file changed, 2 insertions(+), 1 deletion(-)
> >> >
> >> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> >> > index 0f54b491e118..739d94397c47 100644
> >> > --- a/include/linux/swap.h
> >> > +++ b/include/linux/swap.h
> >> > @@ -427,7 +427,8 @@ extern bool has_usable_swap(void);
> >> >
> >> > static inline bool swap_use_vma_readahead(void)
> >> > {
> >> > - return READ_ONCE(swap_vma_readahead) && !atomic_read(&nr_rotate_swap);
> >> > + return page_cluster > 0 && READ_ONCE(swap_vma_readahead)
> >> > + && !atomic_read(&nr_rotate_swap);
> >> > }
> >> >
> >> > /* Swap 50% full? Release swapcache more aggressively.. */
> >>
> >> Now the readahead window size of the VMA based swap readahead is
> >> controlled by /sys/kernel/mm/swap/vma_ra_max_order, while that of the
> >> original swap readahead is controlled by sysctl page_cluster. It is
> >> possible for anonymous memory to use VMA based swap readahead and tmpfs
> >> to use original swap readahead algorithm at the same time. So that, I
> >> think it is necessary to use different control knob to control these two
> >> algorithm. So if we want to disable readahead for tmpfs, but keep it
> >> for VMA based readahead, we can set 0 to page_cluster but non-zero to
> >> /sys/kernel/mm/swap/vma_ra_max_order. With your change, this will be
> >> impossible.
> >
> > For a long time, page-cluster have been used as controlling swap readahead.
> > One of example, zram users have been disabled readahead via 0 page-cluster.
> > However, with your change, it would be regressed if it doesn't disable
> > vma_ra_max_order.
> >
> > As well, all of swap users should be aware of vma_ra_max_order as well as
> > page-cluster to control swap readahead but I didn't see any document about
> > that. Acutaully, I don't like it but want to unify it with page-cluster.
>
> The document is in
>
> Documentation/ABI/testing/sysfs-kernel-mm-swap
>
> The concern of unifying it with page-cluster is as following.
>
> Original swap readahead on tmpfs may not work well because the combined
> workload is running, so we want to disable or constrain it. But at the
> same time, the VMA based swap readahead may work better. So I think it
> may be necessary to control them separately.
My concern is that users who have disabled swap readahead via page-cluster
would see a regression. Please take care of them.
* Re: [PATCH 4/5] mm:swap: respect page_cluster for readahead
2017-09-12 6:52 ` Minchan Kim
@ 2017-09-12 7:29 ` Huang, Ying
2017-09-12 7:56 ` Minchan Kim
0 siblings, 1 reply; 21+ messages in thread
From: Huang, Ying @ 2017-09-12 7:29 UTC (permalink / raw)
To: Minchan Kim
Cc: Huang, Ying, Andrew Morton, linux-mm, linux-kernel, kernel-team,
Ilya Dryomov, Sergey Senozhatsky
Minchan Kim <minchan@kernel.org> writes:
> On Tue, Sep 12, 2017 at 02:44:36PM +0800, Huang, Ying wrote:
>> Minchan Kim <minchan@kernel.org> writes:
>>
>> > On Tue, Sep 12, 2017 at 01:23:01PM +0800, Huang, Ying wrote:
>> >> Minchan Kim <minchan@kernel.org> writes:
>> >>
>> >> > page_cluster 0 means "we don't want readahead" so in the case,
>> >> > let's skip the readahead detection logic.
>> >> >
>> >> > Cc: "Huang, Ying" <ying.huang@intel.com>
>> >> > Signed-off-by: Minchan Kim <minchan@kernel.org>
>> >> > ---
>> >> > include/linux/swap.h | 3 ++-
>> >> > 1 file changed, 2 insertions(+), 1 deletion(-)
>> >> >
>> >> > diff --git a/include/linux/swap.h b/include/linux/swap.h
>> >> > index 0f54b491e118..739d94397c47 100644
>> >> > --- a/include/linux/swap.h
>> >> > +++ b/include/linux/swap.h
>> >> > @@ -427,7 +427,8 @@ extern bool has_usable_swap(void);
>> >> >
>> >> > static inline bool swap_use_vma_readahead(void)
>> >> > {
>> >> > - return READ_ONCE(swap_vma_readahead) && !atomic_read(&nr_rotate_swap);
>> >> > + return page_cluster > 0 && READ_ONCE(swap_vma_readahead)
>> >> > + && !atomic_read(&nr_rotate_swap);
>> >> > }
>> >> >
>> >> > /* Swap 50% full? Release swapcache more aggressively.. */
>> >>
>> >> Now the readahead window size of the VMA based swap readahead is
>> >> controlled by /sys/kernel/mm/swap/vma_ra_max_order, while that of the
>> >> original swap readahead is controlled by sysctl page_cluster. It is
>> >> possible for anonymous memory to use VMA based swap readahead and tmpfs
>> >> to use original swap readahead algorithm at the same time. So that, I
>> >> think it is necessary to use different control knob to control these two
>> >> algorithm. So if we want to disable readahead for tmpfs, but keep it
>> >> for VMA based readahead, we can set 0 to page_cluster but non-zero to
>> >> /sys/kernel/mm/swap/vma_ra_max_order. With your change, this will be
>> >> impossible.
>> >
>> > For a long time, page-cluster have been used as controlling swap readahead.
>> > One of example, zram users have been disabled readahead via 0 page-cluster.
>> > However, with your change, it would be regressed if it doesn't disable
>> > vma_ra_max_order.
>> >
>> > As well, all of swap users should be aware of vma_ra_max_order as well as
>> > page-cluster to control swap readahead but I didn't see any document about
>> > that. Acutaully, I don't like it but want to unify it with page-cluster.
>>
>> The document is in
>>
>> Documentation/ABI/testing/sysfs-kernel-mm-swap
>>
>> The concern of unifying it with page-cluster is as following.
>>
>> Original swap readahead on tmpfs may not work well because the combined
>> workload is running, so we want to disable or constrain it. But at the
>> same time, the VMA based swap readahead may work better. So I think it
>> may be necessary to control them separately.
>
> My concern is users have been disabled swap readahead by page-cluster would
> be regressed. Please take care of them.
How about disabling VMA based swap readahead if zram is used as swap, like
we have done for hard disks?
Best Regards,
Huang, Ying
* Re: [PATCH 4/5] mm:swap: respect page_cluster for readahead
2017-09-12 7:29 ` Huang, Ying
@ 2017-09-12 7:56 ` Minchan Kim
2017-09-12 8:07 ` Huang, Ying
0 siblings, 1 reply; 21+ messages in thread
From: Minchan Kim @ 2017-09-12 7:56 UTC (permalink / raw)
To: Huang, Ying
Cc: Andrew Morton, linux-mm, linux-kernel, kernel-team, Ilya Dryomov,
Sergey Senozhatsky
On Tue, Sep 12, 2017 at 03:29:45PM +0800, Huang, Ying wrote:
> Minchan Kim <minchan@kernel.org> writes:
>
> > On Tue, Sep 12, 2017 at 02:44:36PM +0800, Huang, Ying wrote:
> >> Minchan Kim <minchan@kernel.org> writes:
> >>
> >> > On Tue, Sep 12, 2017 at 01:23:01PM +0800, Huang, Ying wrote:
> >> >> Minchan Kim <minchan@kernel.org> writes:
> >> >>
> >> >> > page_cluster 0 means "we don't want readahead" so in the case,
> >> >> > let's skip the readahead detection logic.
> >> >> >
> >> >> > Cc: "Huang, Ying" <ying.huang@intel.com>
> >> >> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> >> >> > ---
> >> >> > include/linux/swap.h | 3 ++-
> >> >> > 1 file changed, 2 insertions(+), 1 deletion(-)
> >> >> >
> >> >> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> >> >> > index 0f54b491e118..739d94397c47 100644
> >> >> > --- a/include/linux/swap.h
> >> >> > +++ b/include/linux/swap.h
> >> >> > @@ -427,7 +427,8 @@ extern bool has_usable_swap(void);
> >> >> >
> >> >> > static inline bool swap_use_vma_readahead(void)
> >> >> > {
> >> >> > - return READ_ONCE(swap_vma_readahead) && !atomic_read(&nr_rotate_swap);
> >> >> > + return page_cluster > 0 && READ_ONCE(swap_vma_readahead)
> >> >> > + && !atomic_read(&nr_rotate_swap);
> >> >> > }
> >> >> >
> >> >> > /* Swap 50% full? Release swapcache more aggressively.. */
> >> >>
> >> >> Now the readahead window size of the VMA based swap readahead is
> >> >> controlled by /sys/kernel/mm/swap/vma_ra_max_order, while that of the
> >> >> original swap readahead is controlled by sysctl page_cluster. It is
> >> >> possible for anonymous memory to use VMA based swap readahead and tmpfs
> >> >> to use original swap readahead algorithm at the same time. So that, I
> >> >> think it is necessary to use different control knob to control these two
> >> >> algorithm. So if we want to disable readahead for tmpfs, but keep it
> >> >> for VMA based readahead, we can set 0 to page_cluster but non-zero to
> >> >> /sys/kernel/mm/swap/vma_ra_max_order. With your change, this will be
> >> >> impossible.
> >> >
> >> > For a long time, page-cluster have been used as controlling swap readahead.
> >> > One of example, zram users have been disabled readahead via 0 page-cluster.
> >> > However, with your change, it would be regressed if it doesn't disable
> >> > vma_ra_max_order.
> >> >
> >> > As well, all of swap users should be aware of vma_ra_max_order as well as
> >> > page-cluster to control swap readahead but I didn't see any document about
> >> > that. Acutaully, I don't like it but want to unify it with page-cluster.
> >>
> >> The document is in
> >>
> >> Documentation/ABI/testing/sysfs-kernel-mm-swap
> >>
> >> The concern of unifying it with page-cluster is as following.
> >>
> >> Original swap readahead on tmpfs may not work well because the combined
> >> workload is running, so we want to disable or constrain it. But at the
> >> same time, the VMA based swap readahead may work better. So I think it
> >> may be necessary to control them separately.
> >
> > My concern is users have been disabled swap readahead by page-cluster would
> > be regressed. Please take care of them.
>
> How about disable VMA based swap readahead if zram used as swap? Like
> we have done for hard disk?
It could be done with the SWP_SYNCHRONOUS_IO flag, which indicates super-fast,
no-seek-cost swap devices, once this patchset is merged, so that the VM
automatically disables readahead. It is on my TODO list but it's orthogonal work.
The problem I raised is "Why shouldn't we obey the user's decision?",
not a zram-specific issue.
A user who used an SSD as a swap device may have decided to disable swap
readahead for some reason (e.g., a small memory system). It has worked
via page-cluster for several years, but with VMA based swap readahead
it doesn't work any more.
* Re: [PATCH 4/5] mm:swap: respect page_cluster for readahead
2017-09-12 7:56 ` Minchan Kim
@ 2017-09-12 8:07 ` Huang, Ying
2017-09-12 8:22 ` Minchan Kim
0 siblings, 1 reply; 21+ messages in thread
From: Huang, Ying @ 2017-09-12 8:07 UTC (permalink / raw)
To: Minchan Kim
Cc: Huang, Ying, Andrew Morton, linux-mm, linux-kernel, kernel-team,
Ilya Dryomov, Sergey Senozhatsky
Minchan Kim <minchan@kernel.org> writes:
> On Tue, Sep 12, 2017 at 03:29:45PM +0800, Huang, Ying wrote:
>> Minchan Kim <minchan@kernel.org> writes:
>>
>> > On Tue, Sep 12, 2017 at 02:44:36PM +0800, Huang, Ying wrote:
>> >> Minchan Kim <minchan@kernel.org> writes:
>> >>
>> >> > On Tue, Sep 12, 2017 at 01:23:01PM +0800, Huang, Ying wrote:
>> >> >> Minchan Kim <minchan@kernel.org> writes:
>> >> >>
>> >> >> > page_cluster 0 means "we don't want readahead" so in the case,
>> >> >> > let's skip the readahead detection logic.
>> >> >> >
>> >> >> > Cc: "Huang, Ying" <ying.huang@intel.com>
>> >> >> > Signed-off-by: Minchan Kim <minchan@kernel.org>
>> >> >> > ---
>> >> >> > include/linux/swap.h | 3 ++-
>> >> >> > 1 file changed, 2 insertions(+), 1 deletion(-)
>> >> >> >
>> >> >> > diff --git a/include/linux/swap.h b/include/linux/swap.h
>> >> >> > index 0f54b491e118..739d94397c47 100644
>> >> >> > --- a/include/linux/swap.h
>> >> >> > +++ b/include/linux/swap.h
>> >> >> > @@ -427,7 +427,8 @@ extern bool has_usable_swap(void);
>> >> >> >
>> >> >> > static inline bool swap_use_vma_readahead(void)
>> >> >> > {
>> >> >> > - return READ_ONCE(swap_vma_readahead) && !atomic_read(&nr_rotate_swap);
>> >> >> > + return page_cluster > 0 && READ_ONCE(swap_vma_readahead)
>> >> >> > + && !atomic_read(&nr_rotate_swap);
>> >> >> > }
>> >> >> >
>> >> >> > /* Swap 50% full? Release swapcache more aggressively.. */
>> >> >>
>> >> >> Now the readahead window size of the VMA based swap readahead is
>> >> >> controlled by /sys/kernel/mm/swap/vma_ra_max_order, while that of the
>> >> >> original swap readahead is controlled by sysctl page_cluster. It is
>> >> >> possible for anonymous memory to use VMA based swap readahead and tmpfs
>> >> >> to use original swap readahead algorithm at the same time. So that, I
>> >> >> think it is necessary to use different control knob to control these two
>> >> >> algorithm. So if we want to disable readahead for tmpfs, but keep it
>> >> >> for VMA based readahead, we can set 0 to page_cluster but non-zero to
>> >> >> /sys/kernel/mm/swap/vma_ra_max_order. With your change, this will be
>> >> >> impossible.
>> >> >
>> >> > For a long time, page-cluster have been used as controlling swap readahead.
>> >> > One of example, zram users have been disabled readahead via 0 page-cluster.
>> >> > However, with your change, it would be regressed if it doesn't disable
>> >> > vma_ra_max_order.
>> >> >
>> >> > As well, all of swap users should be aware of vma_ra_max_order as well as
>> >> > page-cluster to control swap readahead but I didn't see any document about
>> >> > that. Acutaully, I don't like it but want to unify it with page-cluster.
>> >>
>> >> The document is in
>> >>
>> >> Documentation/ABI/testing/sysfs-kernel-mm-swap
>> >>
>> >> The concern of unifying it with page-cluster is as following.
>> >>
>> >> Original swap readahead on tmpfs may not work well because the combined
>> >> workload is running, so we want to disable or constrain it. But at the
>> >> same time, the VMA based swap readahead may work better. So I think it
>> >> may be necessary to control them separately.
>> >
>> > My concern is users have been disabled swap readahead by page-cluster would
>> > be regressed. Please take care of them.
>>
>> How about disable VMA based swap readahead if zram used as swap? Like
>> we have done for hard disk?
>
> It could be with SWP_SYNCHRONOUS_IO flag which indicates super-fast,
> no seek cost swap devices if this patchset is merged so VM automatically
> disables readahead. It is in my TODO but it's orthogonal work.
>
> The problem I raised is "Why shouldn't we obey user's decision?",
> not zram sepcific issue.
>
> A user has used SSD as swap devices decided to disable swap readahead
> by some reason(e.g., small memory system). Anyway, it has worked
> via page-cluster for a several years but with vma-based swap devices,
> it doesn't work any more.
Can they add one more line to their configuration scripts?
echo 0 > /sys/kernel/mm/swap/vma_ra_max_order
Best Regards,
Huang, Ying
* Re: [PATCH 4/5] mm:swap: respect page_cluster for readahead
2017-09-12 8:07 ` Huang, Ying
@ 2017-09-12 8:22 ` Minchan Kim
2017-09-12 8:32 ` Huang, Ying
0 siblings, 1 reply; 21+ messages in thread
From: Minchan Kim @ 2017-09-12 8:22 UTC (permalink / raw)
To: Huang, Ying
Cc: Andrew Morton, linux-mm, linux-kernel, kernel-team, Ilya Dryomov,
Sergey Senozhatsky
On Tue, Sep 12, 2017 at 04:07:01PM +0800, Huang, Ying wrote:
< snip >
> >> > My concern is users have been disabled swap readahead by page-cluster would
> >> > be regressed. Please take care of them.
> >>
> >> How about disable VMA based swap readahead if zram used as swap? Like
> >> we have done for hard disk?
> >
> > It could be with SWP_SYNCHRONOUS_IO flag which indicates super-fast,
> > no seek cost swap devices if this patchset is merged so VM automatically
> > disables readahead. It is in my TODO but it's orthogonal work.
> >
> > The problem I raised is "Why shouldn't we obey user's decision?",
> > not zram sepcific issue.
> >
> > A user has used SSD as swap devices decided to disable swap readahead
> > by some reason(e.g., small memory system). Anyway, it has worked
> > via page-cluster for a several years but with vma-based swap devices,
> > it doesn't work any more.
>
> Can they add one more line to their configuration scripts?
>
> echo 0 > /sys/kernel/mm/swap/vma_ra_max_order
We call that a "regression", don't we?
* Re: [PATCH 4/5] mm:swap: respect page_cluster for readahead
2017-09-12 8:22 ` Minchan Kim
@ 2017-09-12 8:32 ` Huang, Ying
2017-09-12 23:34 ` Minchan Kim
0 siblings, 1 reply; 21+ messages in thread
From: Huang, Ying @ 2017-09-12 8:32 UTC (permalink / raw)
To: Minchan Kim
Cc: Huang, Ying, Andrew Morton, linux-mm, linux-kernel, kernel-team,
Ilya Dryomov, Sergey Senozhatsky
Minchan Kim <minchan@kernel.org> writes:
> On Tue, Sep 12, 2017 at 04:07:01PM +0800, Huang, Ying wrote:
> < snip >
>> >> > My concern is users have been disabled swap readahead by page-cluster would
>> >> > be regressed. Please take care of them.
>> >>
>> >> How about disable VMA based swap readahead if zram used as swap? Like
>> >> we have done for hard disk?
>> >
>> > It could be with SWP_SYNCHRONOUS_IO flag which indicates super-fast,
>> > no seek cost swap devices if this patchset is merged so VM automatically
>> > disables readahead. It is in my TODO but it's orthogonal work.
>> >
>> > The problem I raised is "Why shouldn't we obey user's decision?",
>> > not zram sepcific issue.
>> >
>> > A user has used SSD as swap devices decided to disable swap readahead
>> > by some reason(e.g., small memory system). Anyway, it has worked
>> > via page-cluster for a several years but with vma-based swap devices,
>> > it doesn't work any more.
>>
>> Can they add one more line to their configuration scripts?
>>
>> echo 0 > /sys/kernel/mm/swap/vma_ra_max_order
>
> We call it as "regression", don't we?
I think this always happens when we switch the default algorithm. For
example, if we had switched the default IO scheduler, then user scripts
that configure the parameters of the old default IO scheduler would fail.
Best Regards,
Huang, Ying
* Re: [PATCH 5/5] mm:swap: skip swapcache for swapin of synchronous device
2017-09-12 2:37 ` [PATCH 5/5] mm:swap: skip swapcache for swapin of synchronous device Minchan Kim
2017-09-12 6:04 ` Sergey Senozhatsky
@ 2017-09-12 20:22 ` kbuild test robot
2017-09-13 0:16 ` Minchan Kim
1 sibling, 1 reply; 21+ messages in thread
From: kbuild test robot @ 2017-09-12 20:22 UTC (permalink / raw)
To: Minchan Kim
Cc: kbuild-all, Andrew Morton, linux-mm, linux-kernel, kernel-team,
Ilya Dryomov, Sergey Senozhatsky, Dan Williams, Ross Zwisler,
Hugh Dickins
[-- Attachment #1: Type: text/plain, Size: 12424 bytes --]
Hi Minchan,
[auto build test ERROR on mmotm/master]
[also build test ERROR on next-20170912]
[cannot apply to linus/master v4.13]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/Minchan-Kim/zram-set-BDI_CAP_STABLE_WRITES-once/20170913-025838
base: git://git.cmpxchg.org/linux-mmotm.git master
config: x86_64-randconfig-x016-201737 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64
All error/warnings (new ones prefixed by >>):
mm/memory.c: In function 'do_swap_page':
>> mm/memory.c:2891:33: error: implicit declaration of function 'swp_swap_info' [-Werror=implicit-function-declaration]
struct swap_info_struct *si = swp_swap_info(entry);
^~~~~~~~~~~~~
>> mm/memory.c:2891:33: warning: initialization makes pointer from integer without a cast [-Wint-conversion]
>> mm/memory.c:2908:4: error: implicit declaration of function 'swap_readpage' [-Werror=implicit-function-declaration]
swap_readpage(page, true);
^~~~~~~~~~~~~
In file included from include/uapi/linux/stddef.h:1:0,
from include/linux/stddef.h:4,
from include/uapi/linux/posix_types.h:4,
from include/uapi/linux/types.h:13,
from include/linux/types.h:5,
from include/linux/smp.h:10,
from include/linux/kernel_stat.h:4,
from mm/memory.c:41:
mm/memory.c: At top level:
include/linux/compiler.h:162:4: warning: '______f' is static but declared in inline function 'memcpy_and_pad' which is not static
______f = { \
^
include/linux/compiler.h:154:23: note: in expansion of macro '__trace_if'
#define if(cond, ...) __trace_if( (cond , ## __VA_ARGS__) )
^~~~~~~~~~
include/linux/string.h:451:2: note: in expansion of macro 'if'
if (dest_len > count) {
^~
include/linux/compiler.h:162:4: warning: '______f' is static but declared in inline function 'memcpy_and_pad' which is not static
______f = { \
^
include/linux/compiler.h:154:23: note: in expansion of macro '__trace_if'
#define if(cond, ...) __trace_if( (cond , ## __VA_ARGS__) )
^~~~~~~~~~
include/linux/string.h:449:2: note: in expansion of macro 'if'
if (dest_size < dest_len)
^~
include/linux/compiler.h:162:4: warning: '______f' is static but declared in inline function 'memcpy_and_pad' which is not static
______f = { \
^
include/linux/compiler.h:154:23: note: in expansion of macro '__trace_if'
#define if(cond, ...) __trace_if( (cond , ## __VA_ARGS__) )
^~~~~~~~~~
include/linux/string.h:446:8: note: in expansion of macro 'if'
else if (src_size < dest_len && src_size < count)
^~
include/linux/compiler.h:162:4: warning: '______f' is static but declared in inline function 'memcpy_and_pad' which is not static
______f = { \
^
include/linux/compiler.h:154:23: note: in expansion of macro '__trace_if'
#define if(cond, ...) __trace_if( (cond , ## __VA_ARGS__) )
^~~~~~~~~~
include/linux/string.h:444:3: note: in expansion of macro 'if'
if (dest_size < dest_len && dest_size < count)
^~
include/linux/compiler.h:162:4: warning: '______f' is static but declared in inline function 'memcpy_and_pad' which is not static
______f = { \
^
include/linux/compiler.h:154:23: note: in expansion of macro '__trace_if'
#define if(cond, ...) __trace_if( (cond , ## __VA_ARGS__) )
^~~~~~~~~~
include/linux/string.h:443:2: note: in expansion of macro 'if'
if (__builtin_constant_p(dest_len) && __builtin_constant_p(count)) {
^~
cc1: some warnings being treated as errors
vim +/swp_swap_info +2891 mm/memory.c
2833
2834 /*
2835 * We enter with non-exclusive mmap_sem (to exclude vma changes,
2836 * but allow concurrent faults), and pte mapped but not yet locked.
2837 * We return with pte unmapped and unlocked.
2838 *
2839 * We return with the mmap_sem locked or unlocked in the same cases
2840 * as does filemap_fault().
2841 */
2842 int do_swap_page(struct vm_fault *vmf)
2843 {
2844 struct vm_area_struct *vma = vmf->vma;
2845 struct page *page = NULL, *swapcache = NULL;
2846 struct mem_cgroup *memcg;
2847 struct vma_swap_readahead swap_ra;
2848 swp_entry_t entry;
2849 pte_t pte;
2850 int locked;
2851 int exclusive = 0;
2852 int ret = 0;
2853 bool vma_readahead = swap_use_vma_readahead();
2854
2855 if (vma_readahead)
2856 page = swap_readahead_detect(vmf, &swap_ra);
2857 if (!pte_unmap_same(vma->vm_mm, vmf->pmd, vmf->pte, vmf->orig_pte)) {
2858 if (page)
2859 put_page(page);
2860 goto out;
2861 }
2862
2863 entry = pte_to_swp_entry(vmf->orig_pte);
2864 if (unlikely(non_swap_entry(entry))) {
2865 if (is_migration_entry(entry)) {
2866 migration_entry_wait(vma->vm_mm, vmf->pmd,
2867 vmf->address);
2868 } else if (is_device_private_entry(entry)) {
2869 /*
2870 * For un-addressable device memory we call the pgmap
2871 * fault handler callback. The callback must migrate
2872 * the page back to some CPU accessible page.
2873 */
2874 ret = device_private_entry_fault(vma, vmf->address, entry,
2875 vmf->flags, vmf->pmd);
2876 } else if (is_hwpoison_entry(entry)) {
2877 ret = VM_FAULT_HWPOISON;
2878 } else {
2879 print_bad_pte(vma, vmf->address, vmf->orig_pte, NULL);
2880 ret = VM_FAULT_SIGBUS;
2881 }
2882 goto out;
2883 }
2884
2885
2886 delayacct_set_flag(DELAYACCT_PF_SWAPIN);
2887 if (!page)
2888 page = lookup_swap_cache(entry, vma_readahead ? vma : NULL,
2889 vmf->address);
2890 if (!page) {
> 2891 struct swap_info_struct *si = swp_swap_info(entry);
2892
2893 if (!(si->flags & SWP_SYNCHRONOUS_IO)) {
2894 if (vma_readahead)
2895 page = do_swap_page_readahead(entry,
2896 GFP_HIGHUSER_MOVABLE, vmf, &swap_ra);
2897 else
2898 page = swapin_readahead(entry,
2899 GFP_HIGHUSER_MOVABLE, vma, vmf->address);
2900 swapcache = page;
2901 } else {
2902 /* skip swapcache */
2903 page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vmf->address);
2904 __SetPageLocked(page);
2905 __SetPageSwapBacked(page);
2906 set_page_private(page, entry.val);
2907 lru_cache_add_anon(page);
> 2908 swap_readpage(page, true);
2909 }
2910
2911 if (!page) {
2912 /*
2913 * Back out if somebody else faulted in this pte
2914 * while we released the pte lock.
2915 */
2916 vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
2917 vmf->address, &vmf->ptl);
2918 if (likely(pte_same(*vmf->pte, vmf->orig_pte)))
2919 ret = VM_FAULT_OOM;
2920 delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
2921 goto unlock;
2922 }
2923
2924 /* Had to read the page from swap area: Major fault */
2925 ret = VM_FAULT_MAJOR;
2926 count_vm_event(PGMAJFAULT);
2927 count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
2928 } else if (PageHWPoison(page)) {
2929 /*
2930 * hwpoisoned dirty swapcache pages are kept for killing
2931 * owner processes (which may be unknown at hwpoison time)
2932 */
2933 ret = VM_FAULT_HWPOISON;
2934 delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
2935 swapcache = page;
2936 goto out_release;
2937 }
2938
2939 locked = lock_page_or_retry(page, vma->vm_mm, vmf->flags);
2940
2941 delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
2942 if (!locked) {
2943 ret |= VM_FAULT_RETRY;
2944 goto out_release;
2945 }
2946
2947 /*
2948 * Make sure try_to_free_swap or reuse_swap_page or swapoff did not
2949 * release the swapcache from under us. The page pin, and pte_same
2950 * test below, are not enough to exclude that. Even if it is still
2951 * swapcache, we need to check that the page's swap has not changed.
2952 */
2953 if (unlikely((!PageSwapCache(page) ||
2954 page_private(page) != entry.val)) && swapcache)
2955 goto out_page;
2956
2957 page = ksm_might_need_to_copy(page, vma, vmf->address);
2958 if (unlikely(!page)) {
2959 ret = VM_FAULT_OOM;
2960 page = swapcache;
2961 goto out_page;
2962 }
2963
2964 if (mem_cgroup_try_charge(page, vma->vm_mm, GFP_KERNEL,
2965 &memcg, false)) {
2966 ret = VM_FAULT_OOM;
2967 goto out_page;
2968 }
2969
2970 /*
2971 * Back out if somebody else already faulted in this pte.
2972 */
2973 vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
2974 &vmf->ptl);
2975 if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte)))
2976 goto out_nomap;
2977
2978 if (unlikely(!PageUptodate(page))) {
2979 ret = VM_FAULT_SIGBUS;
2980 goto out_nomap;
2981 }
2982
2983 /*
2984 * The page isn't present yet, go ahead with the fault.
2985 *
2986 * Be careful about the sequence of operations here.
2987 * To get its accounting right, reuse_swap_page() must be called
2988 * while the page is counted on swap but not yet in mapcount i.e.
2989 * before page_add_anon_rmap() and swap_free(); try_to_free_swap()
2990 * must be called after the swap_free(), or it will never succeed.
2991 */
2992
2993 inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
2994 dec_mm_counter_fast(vma->vm_mm, MM_SWAPENTS);
2995 pte = mk_pte(page, vma->vm_page_prot);
2996 if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page, NULL)) {
2997 pte = maybe_mkwrite(pte_mkdirty(pte), vma);
2998 vmf->flags &= ~FAULT_FLAG_WRITE;
2999 ret |= VM_FAULT_WRITE;
3000 exclusive = RMAP_EXCLUSIVE;
3001 }
3002 flush_icache_page(vma, page);
3003 if (pte_swp_soft_dirty(vmf->orig_pte))
3004 pte = pte_mksoft_dirty(pte);
3005 set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
3006 vmf->orig_pte = pte;
3007
3008 /* ksm created a completely new copy */
3009 if (unlikely(page != swapcache && swapcache)) {
3010 page_add_new_anon_rmap(page, vma, vmf->address, false);
3011 mem_cgroup_commit_charge(page, memcg, false, false);
3012 lru_cache_add_active_or_unevictable(page, vma);
3013 } else {
3014 do_page_add_anon_rmap(page, vma, vmf->address, exclusive);
3015 mem_cgroup_commit_charge(page, memcg, true, false);
3016 activate_page(page);
3017 }
3018
3019 swap_free(entry);
3020 if (mem_cgroup_swap_full(page) ||
3021 (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
3022 try_to_free_swap(page);
3023 unlock_page(page);
3024 if (page != swapcache && swapcache) {
3025 /*
3026 * Hold the lock to avoid the swap entry to be reused
3027 * until we take the PT lock for the pte_same() check
3028 * (to avoid false positives from pte_same). For
3029 * further safety release the lock after the swap_free
3030 * so that the swap count won't change under a
3031 * parallel locked swapcache.
3032 */
3033 unlock_page(swapcache);
3034 put_page(swapcache);
3035 }
3036
3037 if (vmf->flags & FAULT_FLAG_WRITE) {
3038 ret |= do_wp_page(vmf);
3039 if (ret & VM_FAULT_ERROR)
3040 ret &= VM_FAULT_ERROR;
3041 goto out;
3042 }
3043
3044 /* No need to invalidate - it was non-present before */
3045 update_mmu_cache(vma, vmf->address, vmf->pte);
3046 unlock:
3047 pte_unmap_unlock(vmf->pte, vmf->ptl);
3048 out:
3049 return ret;
3050 out_nomap:
3051 mem_cgroup_cancel_charge(page, memcg, false);
3052 pte_unmap_unlock(vmf->pte, vmf->ptl);
3053 out_page:
3054 unlock_page(page);
3055 out_release:
3056 put_page(page);
3057 if (page != swapcache) {
3058 unlock_page(swapcache);
3059 put_page(swapcache);
3060 }
3061 return ret;
3062 }
3063
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 27043 bytes --]
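As context for the `page != swapcache && swapcache` checks in the fault path quoted above: when the swapcache is bypassed for a synchronous device, `swapcache` stays NULL and the freshly allocated page is mapped directly, so the error paths must not touch it. A rough Python model of that decision follows; function and parameter names (`should_skip_swapcache`, `device_synchronous`) are illustrative assumptions based on this thread, not the kernel's actual code.

```python
# Illustrative model of the swapin decision introduced by PATCH 5/5
# ("skip swapcache for swapin of synchronous device"). Names and
# structure are assumptions for exposition, not kernel code.

def should_skip_swapcache(device_synchronous, swap_count):
    """Skip the swapcache only when the backing device is synchronous
    (e.g. zram, flagged SWP_SYNCHRONOUS_IO) and no other mapping shares
    the swap entry (swap_count == 1)."""
    return device_synchronous and swap_count == 1

def swapin(device_synchronous, swap_count):
    if should_skip_swapcache(device_synchronous, swap_count):
        # Fast path: allocate a fresh page and read into it directly.
        # There is no swapcache page to unlock/put on the error paths,
        # hence the "&& swapcache" guards in the fault handler above.
        return "direct"
    # Slow path: go through the swapcache as before.
    return "swapcache"

print(swapin(device_synchronous=True, swap_count=1))   # direct
print(swapin(device_synchronous=True, swap_count=2))   # swapcache
print(swapin(device_synchronous=False, swap_count=1))  # swapcache
```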
* Re: [PATCH 4/5] mm:swap: respect page_cluster for readahead
2017-09-12 8:32 ` Huang, Ying
@ 2017-09-12 23:34 ` Minchan Kim
2017-09-13 0:55 ` Huang, Ying
0 siblings, 1 reply; 21+ messages in thread
From: Minchan Kim @ 2017-09-12 23:34 UTC (permalink / raw)
To: Huang, Ying
Cc: Andrew Morton, linux-mm, linux-kernel, kernel-team, Ilya Dryomov,
Sergey Senozhatsky
On Tue, Sep 12, 2017 at 04:32:43PM +0800, Huang, Ying wrote:
> Minchan Kim <minchan@kernel.org> writes:
>
> > On Tue, Sep 12, 2017 at 04:07:01PM +0800, Huang, Ying wrote:
> > < snip >
> >> >> > My concern is that users who have disabled swap readahead via page-cluster
> >> >> > would be regressed. Please take care of them.
> >> >>
> >> >> How about disabling VMA-based swap readahead if zram is used as swap, like
> >> >> we have done for hard disks?
> >> >
> >> > It could be done with the SWP_SYNCHRONOUS_IO flag, which indicates super-fast,
> >> > no-seek-cost swap devices, once this patchset is merged, so the VM can
> >> > automatically disable readahead. It is on my TODO list, but it's orthogonal work.
> >> >
> >> > The problem I raised is "Why shouldn't we obey the user's decision?",
> >> > not a zram-specific issue.
> >> >
> >> > A user who has used an SSD as a swap device may have decided to disable swap
> >> > readahead for some reason (e.g., a small-memory system). It has worked
> >> > via page-cluster for several years, but with VMA-based swap readahead,
> >> > it doesn't work any more.
> >>
> >> Can they add one more line to their configuration scripts?
> >>
> >> echo 0 > /sys/kernel/mm/swap/vma_ra_max_order
> >
> > We call that a "regression", don't we?
>
> I think this always happens when we switch a default algorithm. For
> example, if we had switched the default IO scheduler, then user scripts
> that configure the parameters of the old default IO scheduler would fail.
I don't follow what you are saying with that specific example.
If the kernel did something that breaks userspace which has worked well,
it should be fixed. No doubt.
Even if it happened by mistake, that couldn't be an excuse to break
something new, either.
Simple: fix the regression.
If you insist that "swap users should fix it themselves by modifying
their scripts, or it's not a regression", I don't want to waste my time
persuading you any more. I will ask Andrew to revert your patches.
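For reference, the two knobs being argued about: traditional cluster readahead reads up to 2^page-cluster pages per swapin, so `echo 0 > /proc/sys/vm/page-cluster` has long meant "no readahead", while VMA-based readahead is capped separately by `vma_ra_max_order`. A rough sketch of the window sizes follows; treating the two knobs as fully independent limits is an assumption for illustration, not the kernel's exact sizing logic.

```python
# Rough model of the two swap-readahead knobs discussed in this thread.
# The kernel's actual window sizing is more involved; this only shows
# why zeroing page-cluster alone no longer disables readahead once
# VMA-based readahead is the default.

def cluster_ra_window(page_cluster):
    """Pages read per swapin under cluster readahead: 2^page-cluster."""
    return 1 << page_cluster

def vma_ra_window(vma_ra_max_order):
    """Upper bound on the VMA-based readahead window: 2^order pages."""
    return 1 << vma_ra_max_order

# The old way to disable readahead: the window collapses to the single
# faulting page.
print(cluster_ra_window(0))  # 1
# The default page-cluster of 3 reads up to 8 pages.
print(cluster_ra_window(3))  # 8
# With VMA-based readahead as the default, page-cluster=0 no longer
# helps; the new knob must be zeroed as well.
print(vma_ra_window(4))      # 16
```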
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
* Re: [PATCH 5/5] mm:swap: skip swapcache for swapin of synchronous device
2017-09-12 20:22 ` kbuild test robot
@ 2017-09-13 0:16 ` Minchan Kim
0 siblings, 0 replies; 21+ messages in thread
From: Minchan Kim @ 2017-09-13 0:16 UTC (permalink / raw)
To: kbuild test robot
Cc: kbuild-all, Andrew Morton, linux-mm, linux-kernel, kernel-team,
Ilya Dryomov, Sergey Senozhatsky, Dan Williams, Ross Zwisler,
Hugh Dickins
On Wed, Sep 13, 2017 at 04:22:59AM +0800, kbuild test robot wrote:
> Hi Minchan,
>
> [auto build test ERROR on mmotm/master]
> [also build test ERROR on next-20170912]
> [cannot apply to linus/master v4.13]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
>
> url: https://github.com/0day-ci/linux/commits/Minchan-Kim/zram-set-BDI_CAP_STABLE_WRITES-once/20170913-025838
> base: git://git.cmpxchg.org/linux-mmotm.git master
> config: x86_64-randconfig-x016-201737 (attached as .config)
> compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
> reproduce:
> # save the attached .config to linux build tree
> make ARCH=x86_64
>
> All error/warnings (new ones prefixed by >>):
I will fix !CONFIG_SWAP.
Thanks, 0-day!
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 4/5] mm:swap: respect page_cluster for readahead
2017-09-12 23:34 ` Minchan Kim
@ 2017-09-13 0:55 ` Huang, Ying
0 siblings, 0 replies; 21+ messages in thread
From: Huang, Ying @ 2017-09-13 0:55 UTC (permalink / raw)
To: Minchan Kim
Cc: Huang, Ying, Andrew Morton, linux-mm, linux-kernel, kernel-team,
Ilya Dryomov, Sergey Senozhatsky
Minchan Kim <minchan@kernel.org> writes:
> On Tue, Sep 12, 2017 at 04:32:43PM +0800, Huang, Ying wrote:
>> Minchan Kim <minchan@kernel.org> writes:
>>
>> > On Tue, Sep 12, 2017 at 04:07:01PM +0800, Huang, Ying wrote:
>> > < snip >
>> >> >> > My concern is that users who have disabled swap readahead via page-cluster
>> >> >> > would be regressed. Please take care of them.
>> >> >>
>> >> >> How about disabling VMA-based swap readahead if zram is used as swap, like
>> >> >> we have done for hard disks?
>> >> >
>> >> > It could be done with the SWP_SYNCHRONOUS_IO flag, which indicates super-fast,
>> >> > no-seek-cost swap devices, once this patchset is merged, so the VM can
>> >> > automatically disable readahead. It is on my TODO list, but it's orthogonal work.
>> >> >
>> >> > The problem I raised is "Why shouldn't we obey the user's decision?",
>> >> > not a zram-specific issue.
>> >> >
>> >> > A user who has used an SSD as a swap device may have decided to disable swap
>> >> > readahead for some reason (e.g., a small-memory system). It has worked
>> >> > via page-cluster for several years, but with VMA-based swap readahead,
>> >> > it doesn't work any more.
>> >>
>> >> Can they add one more line to their configuration scripts?
>> >>
>> >> echo 0 > /sys/kernel/mm/swap/vma_ra_max_order
>> >
>> > We call that a "regression", don't we?
>>
>> I think this always happens when we switch a default algorithm. For
>> example, if we had switched the default IO scheduler, then user scripts
>> that configure the parameters of the old default IO scheduler would fail.
>
> I don't follow what you are saying with that specific example.
> If the kernel did something that breaks userspace which has worked well,
> it should be fixed. No doubt.
>
> Even if it happened by mistake, that couldn't be an excuse to break
> something new, either.
>
> Simple: fix the regression.
>
> If you insist that "swap users should fix it themselves by modifying
> their scripts, or it's not a regression", I don't want to waste my time
> persuading you any more. I will ask Andrew to revert your patches.
There is definitely no functionality regression. It may cause some
performance regression for some users, which could be resolved by
changing some scripts. Please don't mix the two.
Best Regards,
Huang, Ying
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 21+ messages in thread
Thread overview: 21+ messages
2017-09-12 2:37 [PATCH 1/5] zram: set BDI_CAP_STABLE_WRITES once Minchan Kim
2017-09-12 2:37 ` [PATCH 2/5] bdi: introduce BDI_CAP_SYNCHRONOUS_IO Minchan Kim
2017-09-12 2:37 ` [PATCH 3/5] mm:swap: introduce SWP_SYNCHRONOUS_IO Minchan Kim
2017-09-12 4:48 ` Sergey Senozhatsky
2017-09-12 2:37 ` [PATCH 4/5] mm:swap: respect page_cluster for readahead Minchan Kim
2017-09-12 5:23 ` Huang, Ying
2017-09-12 6:25 ` Minchan Kim
2017-09-12 6:44 ` Huang, Ying
2017-09-12 6:52 ` Minchan Kim
2017-09-12 7:29 ` Huang, Ying
2017-09-12 7:56 ` Minchan Kim
2017-09-12 8:07 ` Huang, Ying
2017-09-12 8:22 ` Minchan Kim
2017-09-12 8:32 ` Huang, Ying
2017-09-12 23:34 ` Minchan Kim
2017-09-13 0:55 ` Huang, Ying
2017-09-12 2:37 ` [PATCH 5/5] mm:swap: skip swapcache for swapin of synchronous device Minchan Kim
2017-09-12 6:04 ` Sergey Senozhatsky
2017-09-12 6:31 ` Minchan Kim
2017-09-12 20:22 ` kbuild test robot
2017-09-13 0:16 ` Minchan Kim