linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/6] zram idle page writeback
@ 2018-11-16  7:20 Minchan Kim
  2018-11-16  7:20 ` [PATCH 1/6] zram: fix lockdep warning of free block handling Minchan Kim
                   ` (5 more replies)
  0 siblings, 6 replies; 19+ messages in thread
From: Minchan Kim @ 2018-11-16  7:20 UTC (permalink / raw)
  To: Andrew Morton; +Cc: LKML, Sergey Senozhatsky, Minchan Kim

Inherently, a swap device holds many idle pages that are rarely touched
after they are allocated. That is no problem when a real storage device
backs the swap, but for zram-swap those pages are simply wasted memory.

This patchset adds an idle page writeback feature to zram.

* Admin can define what an idle page is: "no access since X time ago"
* Admin can define when zram should write idle pages back
* Admin can define when zram should stop writeback to prevent wearout

Details are in each patch's description.
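
For illustration, a typical admin flow with these knobs might look like
the following (a sketch; the knob semantics are defined by the patches
below, and the numbers are arbitrary):

	# mark all currently allocated zram slots as idle
	echo 1 > /sys/block/zram0/idle
	# ... later: write the still-idle pages to the backing device
	echo 2 > /sys/block/zram0/writeback
	# stop writeback after 100000 pages in total, to limit wearout
	echo 100000 > /sys/block/zram0/writeback_limit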

Minchan Kim (6):
  zram: fix lockdep warning of free block handling
  zram: refactoring flags and writeback stuff
  zram: introduce ZRAM_IDLE flag
  zram: support idle page writeback
  zram: add bd_stat statistics
  zram: writeback throttle

 Documentation/ABI/testing/sysfs-block-zram |  32 ++
 Documentation/blockdev/zram.txt            |  42 +-
 drivers/block/zram/Kconfig                 |   5 +-
 drivers/block/zram/zram_drv.c              | 435 +++++++++++++++++----
 drivers/block/zram/zram_drv.h              |  18 +-
 5 files changed, 438 insertions(+), 94 deletions(-)

-- 
2.19.1.1215.g8438c0b245-goog



* [PATCH 1/6] zram: fix lockdep warning of free block handling
  2018-11-16  7:20 [PATCH 0/6] zram idle page writeback Minchan Kim
@ 2018-11-16  7:20 ` Minchan Kim
  2018-11-16  7:20 ` [PATCH 2/6] zram: refactoring flags and writeback stuff Minchan Kim
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 19+ messages in thread
From: Minchan Kim @ 2018-11-16  7:20 UTC (permalink / raw)
  To: Andrew Morton; +Cc: LKML, Sergey Senozhatsky, Minchan Kim, stable

[  254.519728] ================================
[  254.520311] WARNING: inconsistent lock state
[  254.520898] 4.19.0+ #390 Not tainted
[  254.521387] --------------------------------
[  254.521732] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
[  254.521732] zram_verify/2095 [HC0[0]:SC1[1]:HE1:SE0] takes:
[  254.521732] 00000000b1828693 (&(&zram->bitmap_lock)->rlock){+.?.}, at: put_entry_bdev+0x1e/0x50
[  254.521732] {SOFTIRQ-ON-W} state was registered at:
[  254.521732]   _raw_spin_lock+0x2c/0x40
[  254.521732]   zram_make_request+0x755/0xdc9
[  254.521732]   generic_make_request+0x373/0x6a0
[  254.521732]   submit_bio+0x6c/0x140
[  254.521732]   __swap_writepage+0x3a8/0x480
[  254.521732]   shrink_page_list+0x1102/0x1a60
[  254.521732]   shrink_inactive_list+0x21b/0x3f0
[  254.521732]   shrink_node_memcg.constprop.99+0x4f8/0x7e0
[  254.521732]   shrink_node+0x7d/0x2f0
[  254.521732]   do_try_to_free_pages+0xe0/0x300
[  254.521732]   try_to_free_pages+0x116/0x2b0
[  254.521732]   __alloc_pages_slowpath+0x3f4/0xf80
[  254.521732]   __alloc_pages_nodemask+0x2a2/0x2f0
[  254.521732]   __handle_mm_fault+0x42e/0xb50
[  254.521732]   handle_mm_fault+0x55/0xb0
[  254.521732]   __do_page_fault+0x235/0x4b0
[  254.521732]   page_fault+0x1e/0x30
[  254.521732] irq event stamp: 228412
[  254.521732] hardirqs last  enabled at (228412): [<ffffffff98245846>] __slab_free+0x3e6/0x600
[  254.521732] hardirqs last disabled at (228411): [<ffffffff98245625>] __slab_free+0x1c5/0x600
[  254.521732] softirqs last  enabled at (228396): [<ffffffff98e0031e>] __do_softirq+0x31e/0x427
[  254.521732] softirqs last disabled at (228403): [<ffffffff98072051>] irq_exit+0xd1/0xe0
[  254.521732]
[  254.521732] other info that might help us debug this:
[  254.521732]  Possible unsafe locking scenario:
[  254.521732]
[  254.521732]        CPU0
[  254.521732]        ----
[  254.521732]   lock(&(&zram->bitmap_lock)->rlock);
[  254.521732]   <Interrupt>
[  254.521732]     lock(&(&zram->bitmap_lock)->rlock);
[  254.521732]
[  254.521732]  *** DEADLOCK ***
[  254.521732]
[  254.521732] no locks held by zram_verify/2095.
[  254.521732]
[  254.521732] stack backtrace:
[  254.521732] CPU: 5 PID: 2095 Comm: zram_verify Not tainted 4.19.0+ #390
[  254.521732] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
[  254.521732] Call Trace:
[  254.521732]  <IRQ>
[  254.521732]  dump_stack+0x67/0x9b
[  254.521732]  print_usage_bug+0x1bd/0x1d3
[  254.521732]  mark_lock+0x4aa/0x540
[  254.521732]  ? check_usage_backwards+0x160/0x160
[  254.521732]  __lock_acquire+0x51d/0x1300
[  254.521732]  ? free_debug_processing+0x24e/0x400
[  254.521732]  ? bio_endio+0x6d/0x1a0
[  254.521732]  ? lockdep_hardirqs_on+0x9b/0x180
[  254.521732]  ? lock_acquire+0x90/0x180
[  254.521732]  lock_acquire+0x90/0x180
[  254.521732]  ? put_entry_bdev+0x1e/0x50
[  254.521732]  _raw_spin_lock+0x2c/0x40
[  254.521732]  ? put_entry_bdev+0x1e/0x50
[  254.521732]  put_entry_bdev+0x1e/0x50
[  254.521732]  zram_free_page+0xf6/0x110
[  254.521732]  zram_slot_free_notify+0x42/0xa0
[  254.521732]  end_swap_bio_read+0x5b/0x170
[  254.521732]  blk_update_request+0x8f/0x340
[  254.521732]  scsi_end_request+0x2c/0x1e0
[  254.521732]  scsi_io_completion+0x98/0x650
[  254.521732]  blk_done_softirq+0x9e/0xd0
[  254.521732]  __do_softirq+0xcc/0x427
[  254.521732]  irq_exit+0xd1/0xe0
[  254.521732]  do_IRQ+0x93/0x120
[  254.521732]  common_interrupt+0xf/0xf
[  254.521732]  </IRQ>

With the writeback feature, zram_slot_free_notify can be called in
softirq context by end_swap_bio_read. However, bitmap_lock is not
aware of that, so lockdep complains, as the report above shows.

The problem is not only bitmap_lock but also zram_slot_lock, so the
straightforward solution would be to disable IRQs around zram_slot_lock,
which would cover every use of bitmap_lock, too.
However, although the IRQ-disabled window is short in most places where
zram_slot_lock is held, one place (ie, decompression) cannot hold an
irqlock cheaply because its cost depends on the compression algorithm,
so that is not an option.

The approach in this patch is "best effort": it does not guarantee
that an orphan zpage is freed. If zram_slot_lock contention happens,
the kernel cannot free the zpage until the block is recycled. However,
such contention between zram_slot_free_notify and other holders of
zram_slot_lock should be very rare in practice.
To see how often it happens, this patch adds a new debug stat,
"miss_free".

It also takes the bitmap lock with IRQs disabled in get/put_entry_bdev
to prevent the deadlock lockdep reported. The reason I used IRQ
disabling rather than bottom-half disabling is that swap_slot_free_notify
can be called with IRQs disabled, which would break local_bh_enable's
rules. The irqlock is taken only for written-back zram slot entries,
so it should not be a frequently contended lock.

Cc: stable@vger.kernel.org # 4.14+
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 drivers/block/zram/zram_drv.c | 56 +++++++++++++++++++++++++----------
 drivers/block/zram/zram_drv.h |  1 +
 2 files changed, 42 insertions(+), 15 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 4879595200e1..472027eaed60 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -53,6 +53,11 @@ static size_t huge_class_size;
 
 static void zram_free_page(struct zram *zram, size_t index);
 
+static int zram_slot_trylock(struct zram *zram, u32 index)
+{
+	return bit_spin_trylock(ZRAM_LOCK, &zram->table[index].value);
+}
+
 static void zram_slot_lock(struct zram *zram, u32 index)
 {
 	bit_spin_lock(ZRAM_LOCK, &zram->table[index].value);
@@ -443,29 +448,45 @@ static ssize_t backing_dev_store(struct device *dev,
 
 static unsigned long get_entry_bdev(struct zram *zram)
 {
-	unsigned long entry;
+	unsigned long blk_idx;
+	unsigned long ret = 0;
 
-	spin_lock(&zram->bitmap_lock);
 	/* skip 0 bit to confuse zram.handle = 0 */
-	entry = find_next_zero_bit(zram->bitmap, zram->nr_pages, 1);
-	if (entry == zram->nr_pages) {
-		spin_unlock(&zram->bitmap_lock);
-		return 0;
+	blk_idx = find_next_zero_bit(zram->bitmap, zram->nr_pages, 1);
+	if (blk_idx == zram->nr_pages)
+		goto retry;
+
+	spin_lock_irq(&zram->bitmap_lock);
+	if (test_bit(blk_idx, zram->bitmap)) {
+		spin_unlock_irq(&zram->bitmap_lock);
+		goto retry;
 	}
 
-	set_bit(entry, zram->bitmap);
-	spin_unlock(&zram->bitmap_lock);
+	set_bit(blk_idx, zram->bitmap);
+	ret = blk_idx;
+	goto out;
+retry:
+	spin_lock_irq(&zram->bitmap_lock);
+	blk_idx = find_next_zero_bit(zram->bitmap, zram->nr_pages, 1);
+	if (blk_idx == zram->nr_pages)
+		goto out;
+
+	set_bit(blk_idx, zram->bitmap);
+	ret = blk_idx;
+out:
+	spin_unlock_irq(&zram->bitmap_lock);
 
-	return entry;
+	return ret;
 }
 
 static void put_entry_bdev(struct zram *zram, unsigned long entry)
 {
 	int was_set;
+	unsigned long flags;
 
-	spin_lock(&zram->bitmap_lock);
+	spin_lock_irqsave(&zram->bitmap_lock, flags);
 	was_set = test_and_clear_bit(entry, zram->bitmap);
-	spin_unlock(&zram->bitmap_lock);
+	spin_unlock_irqrestore(&zram->bitmap_lock, flags);
 	WARN_ON_ONCE(!was_set);
 }
 
@@ -886,9 +907,10 @@ static ssize_t debug_stat_show(struct device *dev,
 
 	down_read(&zram->init_lock);
 	ret = scnprintf(buf, PAGE_SIZE,
-			"version: %d\n%8llu\n",
+			"version: %d\n%8llu %8llu\n",
 			version,
-			(u64)atomic64_read(&zram->stats.writestall));
+			(u64)atomic64_read(&zram->stats.writestall),
+			(u64)atomic64_read(&zram->stats.miss_free));
 	up_read(&zram->init_lock);
 
 	return ret;
@@ -1400,10 +1422,14 @@ static void zram_slot_free_notify(struct block_device *bdev,
 
 	zram = bdev->bd_disk->private_data;
 
-	zram_slot_lock(zram, index);
+	atomic64_inc(&zram->stats.notify_free);
+	if (!zram_slot_trylock(zram, index)) {
+		atomic64_inc(&zram->stats.miss_free);
+		return;
+	}
+
 	zram_free_page(zram, index);
 	zram_slot_unlock(zram, index);
-	atomic64_inc(&zram->stats.notify_free);
 }
 
 static int zram_rw_page(struct block_device *bdev, sector_t sector,
diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
index 72c8584b6dff..89da501292ff 100644
--- a/drivers/block/zram/zram_drv.h
+++ b/drivers/block/zram/zram_drv.h
@@ -79,6 +79,7 @@ struct zram_stats {
 	atomic64_t pages_stored;	/* no. of pages currently stored */
 	atomic_long_t max_used_pages;	/* no. of maximum pages stored */
 	atomic64_t writestall;		/* no. of write slow paths */
+	atomic64_t miss_free;		/* no. of missed free */
 };
 
 struct zram {
-- 
2.19.1.1215.g8438c0b245-goog



* [PATCH 2/6] zram: refactoring flags and writeback stuff
  2018-11-16  7:20 [PATCH 0/6] zram idle page writeback Minchan Kim
  2018-11-16  7:20 ` [PATCH 1/6] zram: fix lockdep warning of free block handling Minchan Kim
@ 2018-11-16  7:20 ` Minchan Kim
  2018-11-16  7:20 ` [PATCH 3/6] zram: introduce ZRAM_IDLE flag Minchan Kim
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 19+ messages in thread
From: Minchan Kim @ 2018-11-16  7:20 UTC (permalink / raw)
  To: Andrew Morton; +Cc: LKML, Sergey Senozhatsky, Minchan Kim

This patch renames some variables and restructures some code for
better readability in the writeback and zram_free_page paths.

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 drivers/block/zram/zram_drv.c | 105 +++++++++++++---------------------
 drivers/block/zram/zram_drv.h |   8 +--
 2 files changed, 44 insertions(+), 69 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 472027eaed60..bc59db2b1036 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -55,17 +55,17 @@ static void zram_free_page(struct zram *zram, size_t index);
 
 static int zram_slot_trylock(struct zram *zram, u32 index)
 {
-	return bit_spin_trylock(ZRAM_LOCK, &zram->table[index].value);
+	return bit_spin_trylock(ZRAM_LOCK, &zram->table[index].flags);
 }
 
 static void zram_slot_lock(struct zram *zram, u32 index)
 {
-	bit_spin_lock(ZRAM_LOCK, &zram->table[index].value);
+	bit_spin_lock(ZRAM_LOCK, &zram->table[index].flags);
 }
 
 static void zram_slot_unlock(struct zram *zram, u32 index)
 {
-	bit_spin_unlock(ZRAM_LOCK, &zram->table[index].value);
+	bit_spin_unlock(ZRAM_LOCK, &zram->table[index].flags);
 }
 
 static inline bool init_done(struct zram *zram)
@@ -76,7 +76,7 @@ static inline bool init_done(struct zram *zram)
 static inline bool zram_allocated(struct zram *zram, u32 index)
 {
 
-	return (zram->table[index].value >> (ZRAM_FLAG_SHIFT + 1)) ||
+	return (zram->table[index].flags >> (ZRAM_FLAG_SHIFT + 1)) ||
 					zram->table[index].handle;
 }
 
@@ -99,19 +99,19 @@ static void zram_set_handle(struct zram *zram, u32 index, unsigned long handle)
 static bool zram_test_flag(struct zram *zram, u32 index,
 			enum zram_pageflags flag)
 {
-	return zram->table[index].value & BIT(flag);
+	return zram->table[index].flags & BIT(flag);
 }
 
 static void zram_set_flag(struct zram *zram, u32 index,
 			enum zram_pageflags flag)
 {
-	zram->table[index].value |= BIT(flag);
+	zram->table[index].flags |= BIT(flag);
 }
 
 static void zram_clear_flag(struct zram *zram, u32 index,
 			enum zram_pageflags flag)
 {
-	zram->table[index].value &= ~BIT(flag);
+	zram->table[index].flags &= ~BIT(flag);
 }
 
 static inline void zram_set_element(struct zram *zram, u32 index,
@@ -127,15 +127,15 @@ static unsigned long zram_get_element(struct zram *zram, u32 index)
 
 static size_t zram_get_obj_size(struct zram *zram, u32 index)
 {
-	return zram->table[index].value & (BIT(ZRAM_FLAG_SHIFT) - 1);
+	return zram->table[index].flags & (BIT(ZRAM_FLAG_SHIFT) - 1);
 }
 
 static void zram_set_obj_size(struct zram *zram,
 					u32 index, size_t size)
 {
-	unsigned long flags = zram->table[index].value >> ZRAM_FLAG_SHIFT;
+	unsigned long flags = zram->table[index].flags >> ZRAM_FLAG_SHIFT;
 
-	zram->table[index].value = (flags << ZRAM_FLAG_SHIFT) | size;
+	zram->table[index].flags = (flags << ZRAM_FLAG_SHIFT) | size;
 }
 
 #if PAGE_SIZE != 4096
@@ -282,16 +282,11 @@ static ssize_t mem_used_max_store(struct device *dev,
 }
 
 #ifdef CONFIG_ZRAM_WRITEBACK
-static bool zram_wb_enabled(struct zram *zram)
-{
-	return zram->backing_dev;
-}
-
 static void reset_bdev(struct zram *zram)
 {
 	struct block_device *bdev;
 
-	if (!zram_wb_enabled(zram))
+	if (!zram->backing_dev)
 		return;
 
 	bdev = zram->bdev;
@@ -318,7 +313,7 @@ static ssize_t backing_dev_show(struct device *dev,
 	ssize_t ret;
 
 	down_read(&zram->init_lock);
-	if (!zram_wb_enabled(zram)) {
+	if (!zram->backing_dev) {
 		memcpy(buf, "none\n", 5);
 		up_read(&zram->init_lock);
 		return 5;
@@ -446,7 +441,7 @@ static ssize_t backing_dev_store(struct device *dev,
 	return err;
 }
 
-static unsigned long get_entry_bdev(struct zram *zram)
+static unsigned long alloc_block_bdev(struct zram *zram)
 {
 	unsigned long blk_idx;
 	unsigned long ret = 0;
@@ -479,13 +474,13 @@ static unsigned long get_entry_bdev(struct zram *zram)
 	return ret;
 }
 
-static void put_entry_bdev(struct zram *zram, unsigned long entry)
+static void free_block_bdev(struct zram *zram, unsigned long blk_idx)
 {
 	int was_set;
 	unsigned long flags;
 
 	spin_lock_irqsave(&zram->bitmap_lock, flags);
-	was_set = test_and_clear_bit(entry, zram->bitmap);
+	was_set = test_and_clear_bit(blk_idx, zram->bitmap);
 	spin_unlock_irqrestore(&zram->bitmap_lock, flags);
 	WARN_ON_ONCE(!was_set);
 }
@@ -599,7 +594,7 @@ static int write_to_bdev(struct zram *zram, struct bio_vec *bvec,
 	if (!bio)
 		return -ENOMEM;
 
-	entry = get_entry_bdev(zram);
+	entry = alloc_block_bdev(zram);
 	if (!entry) {
 		bio_put(bio);
 		return -ENOSPC;
@@ -610,7 +605,7 @@ static int write_to_bdev(struct zram *zram, struct bio_vec *bvec,
 	if (!bio_add_page(bio, bvec->bv_page, bvec->bv_len,
 					bvec->bv_offset)) {
 		bio_put(bio);
-		put_entry_bdev(zram, entry);
+		free_block_bdev(zram, entry);
 		return -EIO;
 	}
 
@@ -628,18 +623,7 @@ static int write_to_bdev(struct zram *zram, struct bio_vec *bvec,
 	return 0;
 }
 
-static void zram_wb_clear(struct zram *zram, u32 index)
-{
-	unsigned long entry;
-
-	zram_clear_flag(zram, index, ZRAM_WB);
-	entry = zram_get_element(zram, index);
-	zram_set_element(zram, index, 0);
-	put_entry_bdev(zram, entry);
-}
-
 #else
-static bool zram_wb_enabled(struct zram *zram) { return false; }
 static inline void reset_bdev(struct zram *zram) {};
 static int write_to_bdev(struct zram *zram, struct bio_vec *bvec,
 					u32 index, struct bio *parent,
@@ -654,7 +638,8 @@ static int read_from_bdev(struct zram *zram, struct bio_vec *bvec,
 {
 	return -EIO;
 }
-static void zram_wb_clear(struct zram *zram, u32 index) {}
+
+static void free_block_bdev(struct zram *zram, unsigned long blk_idx) {};
 #endif
 
 #ifdef CONFIG_ZRAM_MEMORY_TRACKING
@@ -676,11 +661,6 @@ static void zram_accessed(struct zram *zram, u32 index)
 	zram->table[index].ac_time = ktime_get_boottime();
 }
 
-static void zram_reset_access(struct zram *zram, u32 index)
-{
-	zram->table[index].ac_time = 0;
-}
-
 static ssize_t read_block_state(struct file *file, char __user *buf,
 				size_t count, loff_t *ppos)
 {
@@ -761,7 +741,6 @@ static void zram_debugfs_unregister(struct zram *zram)
 static void zram_debugfs_create(void) {};
 static void zram_debugfs_destroy(void) {};
 static void zram_accessed(struct zram *zram, u32 index) {};
-static void zram_reset_access(struct zram *zram, u32 index) {};
 static void zram_debugfs_register(struct zram *zram) {};
 static void zram_debugfs_unregister(struct zram *zram) {};
 #endif
@@ -962,17 +941,18 @@ static void zram_free_page(struct zram *zram, size_t index)
 {
 	unsigned long handle;
 
-	zram_reset_access(zram, index);
-
+#ifdef CONFIG_ZRAM_MEMORY_TRACKING
+	zram->table[index].ac_time = 0;
+#endif
 	if (zram_test_flag(zram, index, ZRAM_HUGE)) {
 		zram_clear_flag(zram, index, ZRAM_HUGE);
 		atomic64_dec(&zram->stats.huge_pages);
 	}
 
-	if (zram_wb_enabled(zram) && zram_test_flag(zram, index, ZRAM_WB)) {
-		zram_wb_clear(zram, index);
-		atomic64_dec(&zram->stats.pages_stored);
-		return;
+	if (zram_test_flag(zram, index, ZRAM_WB)) {
+		zram_clear_flag(zram, index, ZRAM_WB);
+		free_block_bdev(zram, zram_get_element(zram, index));
+		goto out;
 	}
 
 	/*
@@ -981,10 +961,8 @@ static void zram_free_page(struct zram *zram, size_t index)
 	 */
 	if (zram_test_flag(zram, index, ZRAM_SAME)) {
 		zram_clear_flag(zram, index, ZRAM_SAME);
-		zram_set_element(zram, index, 0);
 		atomic64_dec(&zram->stats.same_pages);
-		atomic64_dec(&zram->stats.pages_stored);
-		return;
+		goto out;
 	}
 
 	handle = zram_get_handle(zram, index);
@@ -995,10 +973,11 @@ static void zram_free_page(struct zram *zram, size_t index)
 
 	atomic64_sub(zram_get_obj_size(zram, index),
 			&zram->stats.compr_data_size);
+out:
 	atomic64_dec(&zram->stats.pages_stored);
-
 	zram_set_handle(zram, index, 0);
 	zram_set_obj_size(zram, index, 0);
+	WARN_ON_ONCE(zram->table[index].flags & ~(1UL << ZRAM_LOCK));
 }
 
 static int __zram_bvec_read(struct zram *zram, struct page *page, u32 index,
@@ -1009,24 +988,20 @@ static int __zram_bvec_read(struct zram *zram, struct page *page, u32 index,
 	unsigned int size;
 	void *src, *dst;
 
-	if (zram_wb_enabled(zram)) {
-		zram_slot_lock(zram, index);
-		if (zram_test_flag(zram, index, ZRAM_WB)) {
-			struct bio_vec bvec;
-
-			zram_slot_unlock(zram, index);
+	zram_slot_lock(zram, index);
+	if (zram_test_flag(zram, index, ZRAM_WB)) {
+		struct bio_vec bvec;
 
-			bvec.bv_page = page;
-			bvec.bv_len = PAGE_SIZE;
-			bvec.bv_offset = 0;
-			return read_from_bdev(zram, &bvec,
-					zram_get_element(zram, index),
-					bio, partial_io);
-		}
 		zram_slot_unlock(zram, index);
+
+		bvec.bv_page = page;
+		bvec.bv_len = PAGE_SIZE;
+		bvec.bv_offset = 0;
+		return read_from_bdev(zram, &bvec,
+				zram_get_element(zram, index),
+				bio, partial_io);
 	}
 
-	zram_slot_lock(zram, index);
 	handle = zram_get_handle(zram, index);
 	if (!handle || zram_test_flag(zram, index, ZRAM_SAME)) {
 		unsigned long value;
@@ -1138,7 +1113,7 @@ static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec,
 
 	if (unlikely(comp_len >= huge_class_size)) {
 		comp_len = PAGE_SIZE;
-		if (zram_wb_enabled(zram) && allow_wb) {
+		if (zram->backing_dev && allow_wb) {
 			zcomp_stream_put(zram->comp);
 			ret = write_to_bdev(zram, bvec, index, bio, &element);
 			if (!ret) {
diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
index 89da501292ff..d75bf190f262 100644
--- a/drivers/block/zram/zram_drv.h
+++ b/drivers/block/zram/zram_drv.h
@@ -30,7 +30,7 @@
 
 
 /*
- * The lower ZRAM_FLAG_SHIFT bits of table.value is for
+ * The lower ZRAM_FLAG_SHIFT bits of table.flags is for
  * object size (excluding header), the higher bits is for
  * zram_pageflags.
  *
@@ -41,7 +41,7 @@
  */
 #define ZRAM_FLAG_SHIFT 24
 
-/* Flags for zram pages (table[page_no].value) */
+/* Flags for zram pages (table[page_no].flags) */
 enum zram_pageflags {
 	/* zram slot is locked */
 	ZRAM_LOCK = ZRAM_FLAG_SHIFT,
@@ -60,7 +60,7 @@ struct zram_table_entry {
 		unsigned long handle;
 		unsigned long element;
 	};
-	unsigned long value;
+	unsigned long flags;
 #ifdef CONFIG_ZRAM_MEMORY_TRACKING
 	ktime_t ac_time;
 #endif
@@ -105,8 +105,8 @@ struct zram {
 	 * zram is claimed so open request will be failed
 	 */
 	bool claim; /* Protected by bdev->bd_mutex */
-#ifdef CONFIG_ZRAM_WRITEBACK
 	struct file *backing_dev;
+#ifdef CONFIG_ZRAM_WRITEBACK
 	struct block_device *bdev;
 	unsigned int old_block_size;
 	unsigned long *bitmap;
-- 
2.19.1.1215.g8438c0b245-goog



* [PATCH 3/6] zram: introduce ZRAM_IDLE flag
  2018-11-16  7:20 [PATCH 0/6] zram idle page writeback Minchan Kim
  2018-11-16  7:20 ` [PATCH 1/6] zram: fix lockdep warning of free block handling Minchan Kim
  2018-11-16  7:20 ` [PATCH 2/6] zram: refactoring flags and writeback stuff Minchan Kim
@ 2018-11-16  7:20 ` Minchan Kim
  2018-11-20  2:46   ` Sergey Senozhatsky
  2018-11-16  7:20 ` [PATCH 4/6] zram: support idle page writeback Minchan Kim
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 19+ messages in thread
From: Minchan Kim @ 2018-11-16  7:20 UTC (permalink / raw)
  To: Andrew Morton; +Cc: LKML, Sergey Senozhatsky, Minchan Kim

To support idle page writeback with upcoming patches, this patch
introduces a new ZRAM_IDLE flag.

Userspace can mark zram slots as "idle" via
	"echo 1 > /sys/block/zramX/idle"
which marks every allocated zram slot as ZRAM_IDLE.
The marking can be seen via /sys/kernel/debug/zram/zramX/block_state:

          300    75.033841 ...i
          301    63.806904 s..i
          302    63.806919 ..hi

Once there is IO for a slot, the mark disappears:

          300    75.033841 ....
          301    63.806904 s..i
          302    63.806919 ..hi

In the example above, the 300th block received IO, so its idle mark
disappeared. With this feature, users can see how many idle pages zram
holds; such pages are a waste of memory.
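
A minimal way to exercise this (a sketch, assuming a zram0 device and
debugfs mounted under /sys/kernel/debug):

	# mark every allocated slot idle
	echo 1 > /sys/block/zram0/idle
	# list the slots still carrying the idle mark ('i' is the
	# last flag column in block_state)
	grep 'i$' /sys/kernel/debug/zram/zram0/block_state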

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 Documentation/ABI/testing/sysfs-block-zram |  8 ++++
 Documentation/blockdev/zram.txt            | 10 +++--
 drivers/block/zram/zram_drv.c              | 44 ++++++++++++++++++++--
 drivers/block/zram/zram_drv.h              |  1 +
 4 files changed, 56 insertions(+), 7 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-block-zram b/Documentation/ABI/testing/sysfs-block-zram
index c1513c756af1..04c9a5980bc7 100644
--- a/Documentation/ABI/testing/sysfs-block-zram
+++ b/Documentation/ABI/testing/sysfs-block-zram
@@ -98,3 +98,11 @@ Contact:	Minchan Kim <minchan@kernel.org>
 		The backing_dev file is read-write and set up backing
 		device for zram to write incompressible pages.
 		For using, user should enable CONFIG_ZRAM_WRITEBACK.
+
+What:		/sys/block/zram<id>/idle
+Date:		November 2018
+Contact:	Minchan Kim <minchan@kernel.org>
+Description:
+		The idle file is write-only and marks zram slots as idle.
+		If the system has debugfs mounted, the user can see which
+		slots are idle via /sys/kernel/debug/zram/zram<id>/block_state
diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt
index 3c1b5ab54bc0..f3bcd716d8a9 100644
--- a/Documentation/blockdev/zram.txt
+++ b/Documentation/blockdev/zram.txt
@@ -169,6 +169,7 @@ comp_algorithm    RW    show and change the compression algorithm
 compact           WO    trigger memory compaction
 debug_stat        RO    this file is used for zram debugging purposes
 backing_dev	  RW	set up backend storage for zram to write out
+idle		  WO	mark allocated slot as idle
 
 
 User space is advised to use the following files to read the device statistics.
@@ -251,16 +252,17 @@ pages of the process with*pagemap.
 If you enable the feature, you could see block state via
 /sys/kernel/debug/zram/zram0/block_state". The output is as follows,
 
-	  300    75.033841 .wh
-	  301    63.806904 s..
-	  302    63.806919 ..h
+	  300    75.033841 .wh.
+	  301    63.806904 s...
+	  302    63.806919 ..hi
 
 First column is zram's block index.
 Second column is access time since the system was booted
 Third column is state of the block.
 (s: same page
 w: written page to backing store
-h: huge page)
+h: huge page
+i: idle page)
 
 First line of above example says 300th block is accessed at 75.033841sec
 and the block's state is huge so it is written back to the backing
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index bc59db2b1036..f956179076ce 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -281,6 +281,34 @@ static ssize_t mem_used_max_store(struct device *dev,
 	return len;
 }
 
+static ssize_t idle_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t len)
+{
+	struct zram *zram = dev_to_zram(dev);
+	unsigned long nr_pages = zram->disksize >> PAGE_SHIFT;
+	int index;
+
+	down_read(&zram->init_lock);
+	if (!init_done(zram)) {
+		up_read(&zram->init_lock);
+		return -EINVAL;
+	}
+
+	for (index = 0; index < nr_pages; index++) {
+		zram_slot_lock(zram, index);
+		if (!zram_allocated(zram, index))
+			goto next;
+
+		zram_set_flag(zram, index, ZRAM_IDLE);
+next:
+		zram_slot_unlock(zram, index);
+	}
+
+	up_read(&zram->init_lock);
+
+	return len;
+}
+
 #ifdef CONFIG_ZRAM_WRITEBACK
 static void reset_bdev(struct zram *zram)
 {
@@ -658,6 +686,7 @@ static void zram_debugfs_destroy(void)
 
 static void zram_accessed(struct zram *zram, u32 index)
 {
+	zram_clear_flag(zram, index, ZRAM_IDLE);
 	zram->table[index].ac_time = ktime_get_boottime();
 }
 
@@ -690,12 +719,13 @@ static ssize_t read_block_state(struct file *file, char __user *buf,
 
 		ts = ktime_to_timespec64(zram->table[index].ac_time);
 		copied = snprintf(kbuf + written, count,
-			"%12zd %12lld.%06lu %c%c%c\n",
+			"%12zd %12lld.%06lu %c%c%c%c\n",
 			index, (s64)ts.tv_sec,
 			ts.tv_nsec / NSEC_PER_USEC,
 			zram_test_flag(zram, index, ZRAM_SAME) ? 's' : '.',
 			zram_test_flag(zram, index, ZRAM_WB) ? 'w' : '.',
-			zram_test_flag(zram, index, ZRAM_HUGE) ? 'h' : '.');
+			zram_test_flag(zram, index, ZRAM_HUGE) ? 'h' : '.',
+			zram_test_flag(zram, index, ZRAM_IDLE) ? 'i' : '.');
 
 		if (count < copied) {
 			zram_slot_unlock(zram, index);
@@ -740,7 +770,10 @@ static void zram_debugfs_unregister(struct zram *zram)
 #else
 static void zram_debugfs_create(void) {};
 static void zram_debugfs_destroy(void) {};
-static void zram_accessed(struct zram *zram, u32 index) {};
+static void zram_accessed(struct zram *zram, u32 index)
+{
+	zram_clear_flag(zram, index, ZRAM_IDLE);
+};
 static void zram_debugfs_register(struct zram *zram) {};
 static void zram_debugfs_unregister(struct zram *zram) {};
 #endif
@@ -944,6 +977,9 @@ static void zram_free_page(struct zram *zram, size_t index)
 #ifdef CONFIG_ZRAM_MEMORY_TRACKING
 	zram->table[index].ac_time = 0;
 #endif
+	if (zram_test_flag(zram, index, ZRAM_IDLE))
+		zram_clear_flag(zram, index, ZRAM_IDLE);
+
 	if (zram_test_flag(zram, index, ZRAM_HUGE)) {
 		zram_clear_flag(zram, index, ZRAM_HUGE);
 		atomic64_dec(&zram->stats.huge_pages);
@@ -1609,6 +1645,7 @@ static DEVICE_ATTR_RO(initstate);
 static DEVICE_ATTR_WO(reset);
 static DEVICE_ATTR_WO(mem_limit);
 static DEVICE_ATTR_WO(mem_used_max);
+static DEVICE_ATTR_WO(idle);
 static DEVICE_ATTR_RW(max_comp_streams);
 static DEVICE_ATTR_RW(comp_algorithm);
 #ifdef CONFIG_ZRAM_WRITEBACK
@@ -1622,6 +1659,7 @@ static struct attribute *zram_disk_attrs[] = {
 	&dev_attr_compact.attr,
 	&dev_attr_mem_limit.attr,
 	&dev_attr_mem_used_max.attr,
+	&dev_attr_idle.attr,
 	&dev_attr_max_comp_streams.attr,
 	&dev_attr_comp_algorithm.attr,
 #ifdef CONFIG_ZRAM_WRITEBACK
diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
index d75bf190f262..214fa4bb46b9 100644
--- a/drivers/block/zram/zram_drv.h
+++ b/drivers/block/zram/zram_drv.h
@@ -48,6 +48,7 @@ enum zram_pageflags {
 	ZRAM_SAME,	/* Page consists the same element */
 	ZRAM_WB,	/* page is stored on backing_device */
 	ZRAM_HUGE,	/* Incompressible page */
+	ZRAM_IDLE,	/* not accessed page since last idle marking */
 
 	__NR_ZRAM_PAGEFLAGS,
 };
-- 
2.19.1.1215.g8438c0b245-goog



* [PATCH 4/6] zram: support idle page writeback
  2018-11-16  7:20 [PATCH 0/6] zram idle page writeback Minchan Kim
                   ` (2 preceding siblings ...)
  2018-11-16  7:20 ` [PATCH 3/6] zram: introduce ZRAM_IDLE flag Minchan Kim
@ 2018-11-16  7:20 ` Minchan Kim
  2018-11-21  4:55   ` Sergey Senozhatsky
  2018-11-16  7:20 ` [PATCH 5/6] zram: add bd_stat statistics Minchan Kim
  2018-11-16  7:20 ` [PATCH 6/6] zram: writeback throttle Minchan Kim
  5 siblings, 1 reply; 19+ messages in thread
From: Minchan Kim @ 2018-11-16  7:20 UTC (permalink / raw)
  To: Andrew Morton; +Cc: LKML, Sergey Senozhatsky, Minchan Kim

This patch adds a new feature, "zram idle page writeback".
In the zram-swap use case, zram usually holds idle swap pages that came
from many processes. It's pointless to keep them in memory (ie, zram).

To solve the problem, this feature writes idle pages back to the
backing device; the goal is to save more memory space on embedded
systems.

The normal sequence to use the feature is as follows:

while (1) {
        # mark allocated zram slots as idle
        echo 1 > /sys/block/zram0/idle
        sleep several hours
        # slots untouched since the marking are still marked IDLE
        echo 3 > /sys/block/zram0/writeback
        # the IDLE-marked slots have been written to the backing
        # device and their memory freed
}

echo 'val' > /sys/block/zramX/writeback

val is a combination of bits:

0th bit: huge page writeback
1st bit: idle page writeback

Thus,
1 -> huge page writeback
2 -> idle page writeback
3 -> write back both
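
For example, to write back only the IDLE-marked slots (a sketch,
assuming zram0 with a backing device already set up):

	echo 1 > /sys/block/zram0/idle
	# ... let the system run so hot pages shed the idle mark ...
	echo 2 > /sys/block/zram0/writeback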

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 Documentation/ABI/testing/sysfs-block-zram |   7 +
 Documentation/blockdev/zram.txt            |  19 +++
 drivers/block/zram/Kconfig                 |   5 +-
 drivers/block/zram/zram_drv.c              | 166 +++++++++++++++++++--
 drivers/block/zram/zram_drv.h              |   1 +
 5 files changed, 187 insertions(+), 11 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-block-zram b/Documentation/ABI/testing/sysfs-block-zram
index 04c9a5980bc7..d1f80b077885 100644
--- a/Documentation/ABI/testing/sysfs-block-zram
+++ b/Documentation/ABI/testing/sysfs-block-zram
@@ -106,3 +106,10 @@ Contact:	Minchan Kim <minchan@kernel.org>
 		The idle file is write-only and marks zram slots as idle.
 		If the system has debugfs mounted, the user can see which
 		slots are idle via /sys/kernel/debug/zram/zram<id>/block_state
+
+What:		/sys/block/zram<id>/writeback
+Date:		November 2018
+Contact:	Minchan Kim <minchan@kernel.org>
+Description:
+		The writeback file is write-only and triggers idle and/or
+		huge page writeback to the backing device.
diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt
index f3bcd716d8a9..60b585dab6e0 100644
--- a/Documentation/blockdev/zram.txt
+++ b/Documentation/blockdev/zram.txt
@@ -244,6 +244,25 @@ to backing storage rather than keeping it in memory.
 User should set up backing device via /sys/block/zramX/backing_dev
 before disksize setting.
 
+Users can write idle pages back to the backing device. To use the
+feature, first mark the currently allocated zram slots as idle.
+Afterward, slots that have not been accessed since then still carry
+the idle mark. Then, if the user does,
+	"echo val > /sys/block/zramX/writeback"
+
+  val is a combination of bits:
+
+  0th bit: huge page writeback
+  1st bit: idle page writeback
+
+  Thus,
+  1 -> huge page writeback
+  2 -> idle page writeback
+  3 -> write back both
+
+zram will write the idle/huge pages back to the backing device and free
+the memory the pages occupied, saving memory.
+
 = memory tracking
 
 With CONFIG_ZRAM_MEMORY_TRACKING, user can know information of the
diff --git a/drivers/block/zram/Kconfig b/drivers/block/zram/Kconfig
index fcd055457364..1ffc64770643 100644
--- a/drivers/block/zram/Kconfig
+++ b/drivers/block/zram/Kconfig
@@ -15,7 +15,7 @@ config ZRAM
 	  See Documentation/blockdev/zram.txt for more information.
 
 config ZRAM_WRITEBACK
-       bool "Write back incompressible page to backing device"
+       bool "Write back incompressible or idle page to backing device"
        depends on ZRAM
        help
 	 With incompressible page, there is no memory saving to keep it
@@ -23,6 +23,9 @@ config ZRAM_WRITEBACK
 	 For this feature, admin should set up backing device via
 	 /sys/block/zramX/backing_dev.
 
+	 With /sys/block/zramX/{idle,writeback}, applications can ask for
+	 idle pages to be written back to the backing device to save memory.
+
 	 See Documentation/blockdev/zram.txt for more information.
 
 config ZRAM_MEMORY_TRACKING
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index f956179076ce..b7b5c9e5f0cd 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -52,6 +52,9 @@ static unsigned int num_devices = 1;
 static size_t huge_class_size;
 
 static void zram_free_page(struct zram *zram, size_t index);
+static int zram_bvec_read(struct zram *zram, struct bio_vec *bvec,
+				u32 index, int offset, struct bio *bio);
+
 
 static int zram_slot_trylock(struct zram *zram, u32 index)
 {
@@ -73,13 +76,6 @@ static inline bool init_done(struct zram *zram)
 	return zram->disksize;
 }
 
-static inline bool zram_allocated(struct zram *zram, u32 index)
-{
-
-	return (zram->table[index].flags >> (ZRAM_FLAG_SHIFT + 1)) ||
-					zram->table[index].handle;
-}
-
 static inline struct zram *dev_to_zram(struct device *dev)
 {
 	return (struct zram *)dev_to_disk(dev)->private_data;
@@ -138,6 +134,13 @@ static void zram_set_obj_size(struct zram *zram,
 	zram->table[index].flags = (flags << ZRAM_FLAG_SHIFT) | size;
 }
 
+static inline bool zram_allocated(struct zram *zram, u32 index)
+{
+	return zram_get_obj_size(zram, index) ||
+			zram_test_flag(zram, index, ZRAM_SAME) ||
+			zram_test_flag(zram, index, ZRAM_WB);
+}
+
 #if PAGE_SIZE != 4096
 static inline bool is_partial_io(struct bio_vec *bvec)
 {
@@ -295,10 +298,14 @@ static ssize_t idle_store(struct device *dev,
 	}
 
 	for (index = 0; index < nr_pages; index++) {
+		/*
+		 * Do not mark ZRAM_UNDER_WB slot as ZRAM_IDLE to close race.
+		 * See the comment in writeback_store.
+		 */
 		zram_slot_lock(zram, index);
-		if (!zram_allocated(zram, index))
+		if (!zram_allocated(zram, index) ||
+				zram_test_flag(zram, index, ZRAM_UNDER_WB))
 			goto next;
-
 		zram_set_flag(zram, index, ZRAM_IDLE);
 next:
 		zram_slot_unlock(zram, index);
@@ -553,6 +560,142 @@ static int read_from_bdev_async(struct zram *zram, struct bio_vec *bvec,
 	return 1;
 }
 
+static ssize_t writeback_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t len)
+{
+	struct zram *zram = dev_to_zram(dev);
+	unsigned long nr_pages = zram->disksize >> PAGE_SHIFT;
+	int index;
+	struct bio bio;
+	struct bio_vec bio_vec;
+	struct page *page;
+	ssize_t ret;
+	unsigned long mode;
+	unsigned long blk_idx = 0;
+
+#define HUGE_WRITEBACK 0x1
+#define IDLE_WRITEBACK 0x2
+
+	ret = kstrtoul(buf, 10, &mode);
+	if (ret)
+		return ret;
+
+	if (!(mode & (HUGE_WRITEBACK|IDLE_WRITEBACK)))
+		return -EINVAL;
+
+	down_read(&zram->init_lock);
+	if (!init_done(zram)) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (!zram->backing_dev) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	page = alloc_page(GFP_KERNEL);
+	if (!page) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	for (index = 0; index < nr_pages; index++) {
+		struct bio_vec bvec;
+
+		bvec.bv_page = page;
+		bvec.bv_len = PAGE_SIZE;
+		bvec.bv_offset = 0;
+
+		if (!blk_idx) {
+			blk_idx = alloc_block_bdev(zram);
+			if (!blk_idx) {
+				ret = -ENOSPC;
+				break;
+			}
+		}
+
+		zram_slot_lock(zram, index);
+		if (!zram_allocated(zram, index))
+			goto next;
+
+		if (zram_test_flag(zram, index, ZRAM_WB) ||
+				zram_test_flag(zram, index, ZRAM_SAME) ||
+				zram_test_flag(zram, index, ZRAM_UNDER_WB))
+			goto next;
+
+		if (!(mode & IDLE_WRITEBACK &&
+			  zram_test_flag(zram, index, ZRAM_IDLE)) &&
+		    !(mode & HUGE_WRITEBACK &&
+			  zram_test_flag(zram, index, ZRAM_HUGE)))
+			goto next;
+		/*
+		 * Clearing ZRAM_UNDER_WB is the caller's duty.
+		 * IOW, zram_free_page never clears it.
+		 */
+		zram_set_flag(zram, index, ZRAM_UNDER_WB);
+		zram_slot_unlock(zram, index);
+		if (zram_bvec_read(zram, &bvec, index, 0, NULL)) {
+			zram_slot_lock(zram, index);
+			zram_clear_flag(zram, index, ZRAM_UNDER_WB);
+			zram_slot_unlock(zram, index);
+			continue;
+		}
+
+		bio_init(&bio, &bio_vec, 1);
+		bio_set_dev(&bio, zram->bdev);
+		bio.bi_iter.bi_sector = blk_idx * (PAGE_SIZE >> 9);
+		bio.bi_opf = REQ_OP_WRITE | REQ_SYNC;
+
+		bio_add_page(&bio, bvec.bv_page, bvec.bv_len,
+				bvec.bv_offset);
+		/*
+		 * XXX: A single page IO would be inefficient for write
+		 * but it would be not bad as starter.
+		 */
+		ret = submit_bio_wait(&bio);
+		if (ret) {
+			zram_slot_lock(zram, index);
+			zram_clear_flag(zram, index, ZRAM_UNDER_WB);
+			zram_slot_unlock(zram, index);
+			continue;
+		}
+
+		/*
+		 * We released zram_slot_lock, so we need to check whether the
+		 * slot was changed. If the slot was freed, we can catch it
+		 * easily via zram_allocated.
+		 * A subtle case is when the slot is freed/reallocated/marked
+		 * ZRAM_IDLE again. To close that race, idle_store doesn't
+		 * mark a slot ZRAM_IDLE once it finds it ZRAM_UNDER_WB.
+		 * Thus, we can close the race by checking the ZRAM_IDLE bit.
+		 */
+		zram_slot_lock(zram, index);
+		if (!zram_allocated(zram, index) ||
+			  !zram_test_flag(zram, index, ZRAM_IDLE)) {
+			zram_clear_flag(zram, index, ZRAM_UNDER_WB);
+			goto next;
+		}
+
+		zram_free_page(zram, index);
+		zram_clear_flag(zram, index, ZRAM_UNDER_WB);
+		zram_set_flag(zram, index, ZRAM_WB);
+		zram_set_element(zram, index, blk_idx);
+		blk_idx = 0;
+		atomic64_inc(&zram->stats.pages_stored);
+next:
+		zram_slot_unlock(zram, index);
+	}
+	if (blk_idx)
+		free_block_bdev(zram, blk_idx);
+	ret = len;
+	__free_page(page);
+out:
+	up_read(&zram->init_lock);
+
+	return ret;
+}
+
 struct zram_work {
 	struct work_struct work;
 	struct zram *zram;
@@ -1013,7 +1156,8 @@ static void zram_free_page(struct zram *zram, size_t index)
 	atomic64_dec(&zram->stats.pages_stored);
 	zram_set_handle(zram, index, 0);
 	zram_set_obj_size(zram, index, 0);
-	WARN_ON_ONCE(zram->table[index].flags & ~(1UL << ZRAM_LOCK));
+	WARN_ON_ONCE(zram->table[index].flags &
+		~(1UL << ZRAM_LOCK | 1UL << ZRAM_UNDER_WB));
 }
 
 static int __zram_bvec_read(struct zram *zram, struct page *page, u32 index,
@@ -1650,6 +1794,7 @@ static DEVICE_ATTR_RW(max_comp_streams);
 static DEVICE_ATTR_RW(comp_algorithm);
 #ifdef CONFIG_ZRAM_WRITEBACK
 static DEVICE_ATTR_RW(backing_dev);
+static DEVICE_ATTR_WO(writeback);
 #endif
 
 static struct attribute *zram_disk_attrs[] = {
@@ -1664,6 +1809,7 @@ static struct attribute *zram_disk_attrs[] = {
 	&dev_attr_comp_algorithm.attr,
 #ifdef CONFIG_ZRAM_WRITEBACK
 	&dev_attr_backing_dev.attr,
+	&dev_attr_writeback.attr,
 #endif
 	&dev_attr_io_stat.attr,
 	&dev_attr_mm_stat.attr,
diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
index 214fa4bb46b9..07695fe70e17 100644
--- a/drivers/block/zram/zram_drv.h
+++ b/drivers/block/zram/zram_drv.h
@@ -47,6 +47,7 @@ enum zram_pageflags {
 	ZRAM_LOCK = ZRAM_FLAG_SHIFT,
 	ZRAM_SAME,	/* Page consists the same element */
 	ZRAM_WB,	/* page is stored on backing_device */
+	ZRAM_UNDER_WB,	/* page is under writeback */
 	ZRAM_HUGE,	/* Incompressible page */
 	ZRAM_IDLE,	/* not accessed page since last idle marking */
 
-- 
2.19.1.1215.g8438c0b245-goog



* [PATCH 5/6] zram: add bd_stat statistics
  2018-11-16  7:20 [PATCH 0/6] zram idle page writeback Minchan Kim
                   ` (3 preceding siblings ...)
  2018-11-16  7:20 ` [PATCH 4/6] zram: support idle page writeback Minchan Kim
@ 2018-11-16  7:20 ` Minchan Kim
  2018-11-16  7:20 ` [PATCH 6/6] zram: writeback throttle Minchan Kim
  5 siblings, 0 replies; 19+ messages in thread
From: Minchan Kim @ 2018-11-16  7:20 UTC (permalink / raw)
  To: Andrew Morton; +Cc: LKML, Sergey Senozhatsky, Minchan Kim

bd_stat represents what has happened in the backing device. Currently,
it supports bd_count, bd_reads and bd_writes, which are helpful for
understanding flash wearout and memory savings.
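
For example (a sketch; the columns are bd_count, bd_reads and
bd_writes, all in pages, and the numbers are illustrative):

	$ cat /sys/block/zram0/bd_stat
	     100       24      138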

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 Documentation/ABI/testing/sysfs-block-zram |  8 ++++++
 Documentation/blockdev/zram.txt            | 11 ++++++++
 drivers/block/zram/zram_drv.c              | 31 ++++++++++++++++++++++
 drivers/block/zram/zram_drv.h              |  5 ++++
 4 files changed, 55 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-block-zram b/Documentation/ABI/testing/sysfs-block-zram
index d1f80b077885..a4daca7e5043 100644
--- a/Documentation/ABI/testing/sysfs-block-zram
+++ b/Documentation/ABI/testing/sysfs-block-zram
@@ -113,3 +113,11 @@ Contact:	Minchan Kim <minchan@kernel.org>
 Description:
 		The writeback file is write-only and triggers idle and/or
 		huge page writeback to the backing device.
+
+What:		/sys/block/zram<id>/bd_stat
+Date:		November 2018
+Contact:	Minchan Kim <minchan@kernel.org>
+Description:
+		The bd_stat file is read-only and represents the backing
+		device's statistics (bd_count, bd_reads, bd_writes) in a
+		format similar to the block layer statistics file format.
diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt
index 60b585dab6e0..1f4907307a0d 100644
--- a/Documentation/blockdev/zram.txt
+++ b/Documentation/blockdev/zram.txt
@@ -221,6 +221,17 @@ The stat file represents device's mm statistics. It consists of a single
  pages_compacted  the number of pages freed during compaction
  huge_pages	  the number of incompressible pages
 
+File /sys/block/zram<id>/bd_stat
+
+The bd_stat file represents the backing device's statistics. It consists of
+a single line of text with the following stats separated by whitespace:
+ bd_count	the number of pages currently stored in the backing device
+		Unit: pages
+ bd_reads	the number of reads from backing device
+		Unit: pages
+ bd_writes	the number of writes to backing device
+		Unit: pages
+
 9) Deactivate:
 	swapoff /dev/zram0
 	umount /dev/zram1
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index b7b5c9e5f0cd..17d566d9a321 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -505,6 +505,8 @@ static unsigned long alloc_block_bdev(struct zram *zram)
 	ret = blk_idx;
 out:
 	spin_unlock_irq(&zram->bitmap_lock);
+	if (ret != 0)
+		atomic64_inc(&zram->stats.bd_count);
 
 	return ret;
 }
@@ -518,6 +520,7 @@ static void free_block_bdev(struct zram *zram, unsigned long blk_idx)
 	was_set = test_and_clear_bit(blk_idx, zram->bitmap);
 	spin_unlock_irqrestore(&zram->bitmap_lock, flags);
 	WARN_ON_ONCE(!was_set);
+	atomic64_dec(&zram->stats.bd_count);
 }
 
 static void zram_page_end_io(struct bio *bio)
@@ -661,6 +664,7 @@ static ssize_t writeback_store(struct device *dev,
 			continue;
 		}
 
+		atomic64_inc(&zram->stats.bd_writes);
 		/*
 		 * We released zram_slot_lock so need to check if the slot was
 		 * changed. If there is freeing for the slot, we can catch it
@@ -748,6 +752,7 @@ static int read_from_bdev_sync(struct zram *zram, struct bio_vec *bvec,
 static int read_from_bdev(struct zram *zram, struct bio_vec *bvec,
 			unsigned long entry, struct bio *parent, bool sync)
 {
+	atomic64_inc(&zram->stats.bd_reads);
 	if (sync)
 		return read_from_bdev_sync(zram, bvec, entry, parent);
 	else
@@ -790,6 +795,7 @@ static int write_to_bdev(struct zram *zram, struct bio_vec *bvec,
 
 	submit_bio(bio);
 	*pentry = entry;
+	atomic64_inc(&zram->stats.bd_writes);
 
 	return 0;
 }
@@ -1053,6 +1059,25 @@ static ssize_t mm_stat_show(struct device *dev,
 	return ret;
 }
 
+#ifdef CONFIG_ZRAM_WRITEBACK
+static ssize_t bd_stat_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct zram *zram = dev_to_zram(dev);
+	ssize_t ret;
+
+	down_read(&zram->init_lock);
+	ret = scnprintf(buf, PAGE_SIZE,
+			"%8llu %8llu %8llu\n",
+			(u64)atomic64_read(&zram->stats.bd_count),
+			(u64)atomic64_read(&zram->stats.bd_reads),
+			(u64)atomic64_read(&zram->stats.bd_writes));
+	up_read(&zram->init_lock);
+
+	return ret;
+}
+#endif
+
 static ssize_t debug_stat_show(struct device *dev,
 		struct device_attribute *attr, char *buf)
 {
@@ -1073,6 +1098,9 @@ static ssize_t debug_stat_show(struct device *dev,
 
 static DEVICE_ATTR_RO(io_stat);
 static DEVICE_ATTR_RO(mm_stat);
+#ifdef CONFIG_ZRAM_WRITEBACK
+static DEVICE_ATTR_RO(bd_stat);
+#endif
 static DEVICE_ATTR_RO(debug_stat);
 
 static void zram_meta_free(struct zram *zram, u64 disksize)
@@ -1813,6 +1841,9 @@ static struct attribute *zram_disk_attrs[] = {
 #endif
 	&dev_attr_io_stat.attr,
 	&dev_attr_mm_stat.attr,
+#ifdef CONFIG_ZRAM_WRITEBACK
+	&dev_attr_bd_stat.attr,
+#endif
 	&dev_attr_debug_stat.attr,
 	NULL,
 };
diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
index 07695fe70e17..487ff283fb31 100644
--- a/drivers/block/zram/zram_drv.h
+++ b/drivers/block/zram/zram_drv.h
@@ -82,6 +82,11 @@ struct zram_stats {
 	atomic_long_t max_used_pages;	/* no. of maximum pages stored */
 	atomic64_t writestall;		/* no. of write slow paths */
 	atomic64_t miss_free;		/* no. of missed free */
+#ifdef	CONFIG_ZRAM_WRITEBACK
+	atomic64_t bd_count;		/* no. of pages in backing device */
+	atomic64_t bd_reads;		/* no. of read from backing device */
+	atomic64_t bd_writes;		/* no. of write from backing device */
+#endif
 };
 
 struct zram {
-- 
2.19.1.1215.g8438c0b245-goog



* [PATCH 6/6] zram: writeback throttle
  2018-11-16  7:20 [PATCH 0/6] zram idle page writeback Minchan Kim
                   ` (4 preceding siblings ...)
  2018-11-16  7:20 ` [PATCH 5/6] zram: add bd_stat statistics Minchan Kim
@ 2018-11-16  7:20 ` Minchan Kim
  5 siblings, 0 replies; 19+ messages in thread
From: Minchan Kim @ 2018-11-16  7:20 UTC (permalink / raw)
  To: Andrew Morton; +Cc: LKML, Sergey Senozhatsky, Minchan Kim

On small-memory systems there is a lot of write IO, so if we use a
flash device as swap, there can be serious flash wearout. To overcome
the problem, system developers need a write-limitation strategy that
guarantees flash health for the entire product life.

This patch creates a new knob "writeback_limit" on zram. With it, once
the writeback IO count (bd_writes in /sys/block/zramX/bd_stat) exceeds
the limit, zram stops further writeback until the admin resets the
limit.
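
For example (a sketch; the value is compared against the cumulative
bd_writes count, in pages):

	# stop further writeback once 102400 pages (400MB with 4K pages)
	# have been written to the backing device in total
	echo 102400 > /sys/block/zram0/writeback_limit
	# lift the limit again
	echo 0 > /sys/block/zram0/writeback_limit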

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 Documentation/ABI/testing/sysfs-block-zram |  9 ++++
 Documentation/blockdev/zram.txt            |  2 +
 drivers/block/zram/zram_drv.c              | 55 ++++++++++++++++++++--
 drivers/block/zram/zram_drv.h              |  2 +
 4 files changed, 65 insertions(+), 3 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-block-zram b/Documentation/ABI/testing/sysfs-block-zram
index a4daca7e5043..210f2cdac752 100644
--- a/Documentation/ABI/testing/sysfs-block-zram
+++ b/Documentation/ABI/testing/sysfs-block-zram
@@ -121,3 +121,12 @@ Contact:	Minchan Kim <minchan@kernel.org>
 		The bd_stat file is read-only and represents the backing
 		device's statistics (bd_count, bd_reads, bd_writes) in a
 		format similar to the block layer statistics file format.
+
+What:		/sys/block/zram<id>/writeback_limit
+Date:		November 2018
+Contact:	Minchan Kim <minchan@kernel.org>
+Description:
+		The writeback_limit file is read-write and specifies the maximum
+		amount of writeback ZRAM can do. The limit can be changed at
+		runtime; "0" disables the limit.
+		The initial state is no limit.
diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt
index 1f4907307a0d..39ee416bf552 100644
--- a/Documentation/blockdev/zram.txt
+++ b/Documentation/blockdev/zram.txt
@@ -164,6 +164,8 @@ reset             WO    trigger device reset
 mem_used_max      WO    reset the `mem_used_max' counter (see later)
 mem_limit         WO    specifies the maximum amount of memory ZRAM can use
                         to store the compressed data
+writeback_limit	  RW	specifies the maximum amount of write IO zram can
+			write out to backing device
 max_comp_streams  RW    the number of possible concurrent compress operations
 comp_algorithm    RW    show and change the compression algorithm
 compact           WO    trigger memory compaction
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 17d566d9a321..b263febaed10 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -317,6 +317,40 @@ static ssize_t idle_store(struct device *dev,
 }
 
 #ifdef CONFIG_ZRAM_WRITEBACK
+
+static ssize_t writeback_limit_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t len)
+{
+	struct zram *zram = dev_to_zram(dev);
+	u64 val;
+	ssize_t ret = -EINVAL;
+
+	if (kstrtoull(buf, 10, &val))
+		return ret;
+
+	down_read(&zram->init_lock);
+	atomic64_set(&zram->stats.bd_wb_limit, val);
+	if (val == 0 || val > atomic64_read(&zram->stats.bd_writes))
+		zram->stop_writeback = false;
+	up_read(&zram->init_lock);
+	ret = len;
+
+	return ret;
+}
+
+static ssize_t writeback_limit_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	u64 val;
+	struct zram *zram = dev_to_zram(dev);
+
+	down_read(&zram->init_lock);
+	val = atomic64_read(&zram->stats.bd_wb_limit);
+	up_read(&zram->init_lock);
+
+	return scnprintf(buf, PAGE_SIZE, "%llu\n", val);
+}
+
 static void reset_bdev(struct zram *zram)
 {
 	struct block_device *bdev;
@@ -575,6 +609,7 @@ static ssize_t writeback_store(struct device *dev,
 	ssize_t ret;
 	unsigned long mode;
 	unsigned long blk_idx = 0;
+	u64 wb_count, wb_limit;
 
 #define HUGE_WRITEBACK 0x1
 #define IDLE_WRITEBACK 0x2
@@ -610,6 +645,11 @@ static ssize_t writeback_store(struct device *dev,
 		bvec.bv_len = PAGE_SIZE;
 		bvec.bv_offset = 0;
 
+		if (zram->stop_writeback) {
+			ret = -EIO;
+			break;
+		}
+
 		if (!blk_idx) {
 			blk_idx = alloc_block_bdev(zram);
 			if (!blk_idx) {
@@ -664,7 +704,7 @@ static ssize_t writeback_store(struct device *dev,
 			continue;
 		}
 
-		atomic64_inc(&zram->stats.bd_writes);
+		wb_count = atomic64_inc_return(&zram->stats.bd_writes);
 		/*
 		 * We released zram_slot_lock so need to check if the slot was
 		 * changed. If there is freeing for the slot, we can catch it
@@ -687,6 +727,9 @@ static ssize_t writeback_store(struct device *dev,
 		zram_set_element(zram, index, blk_idx);
 		blk_idx = 0;
 		atomic64_inc(&zram->stats.pages_stored);
+		wb_limit = atomic64_read(&zram->stats.bd_wb_limit);
+		if (wb_limit != 0 && wb_count >= wb_limit)
+			zram->stop_writeback = true;
 next:
 		zram_slot_unlock(zram, index);
 	}
@@ -765,6 +808,7 @@ static int write_to_bdev(struct zram *zram, struct bio_vec *bvec,
 {
 	struct bio *bio;
 	unsigned long entry;
+	u64 wb_count, wb_limit;
 
 	bio = bio_alloc(GFP_ATOMIC, 1);
 	if (!bio)
@@ -795,7 +839,10 @@ static int write_to_bdev(struct zram *zram, struct bio_vec *bvec,
 
 	submit_bio(bio);
 	*pentry = entry;
-	atomic64_inc(&zram->stats.bd_writes);
+	wb_count = atomic64_inc_return(&zram->stats.bd_writes);
+	wb_limit = atomic64_read(&zram->stats.bd_wb_limit);
+	if (wb_limit != 0 && wb_count >= wb_limit)
+		zram->stop_writeback = true;
 
 	return 0;
 }
@@ -1319,7 +1366,7 @@ static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec,
 		return ret;
 	}
 
-	if (unlikely(comp_len >= huge_class_size)) {
+	if (unlikely(comp_len >= huge_class_size) && !zram->stop_writeback) {
 		comp_len = PAGE_SIZE;
 		if (zram->backing_dev && allow_wb) {
 			zcomp_stream_put(zram->comp);
@@ -1823,6 +1870,7 @@ static DEVICE_ATTR_RW(comp_algorithm);
 #ifdef CONFIG_ZRAM_WRITEBACK
 static DEVICE_ATTR_RW(backing_dev);
 static DEVICE_ATTR_WO(writeback);
+static DEVICE_ATTR_RW(writeback_limit);
 #endif
 
 static struct attribute *zram_disk_attrs[] = {
@@ -1838,6 +1886,7 @@ static struct attribute *zram_disk_attrs[] = {
 #ifdef CONFIG_ZRAM_WRITEBACK
 	&dev_attr_backing_dev.attr,
 	&dev_attr_writeback.attr,
+	&dev_attr_writeback_limit.attr,
 #endif
 	&dev_attr_io_stat.attr,
 	&dev_attr_mm_stat.attr,
diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
index 487ff283fb31..41cbd0f7aea8 100644
--- a/drivers/block/zram/zram_drv.h
+++ b/drivers/block/zram/zram_drv.h
@@ -86,6 +86,7 @@ struct zram_stats {
 	atomic64_t bd_count;		/* no. of pages in backing device */
 	atomic64_t bd_reads;		/* no. of read from backing device */
 	atomic64_t bd_writes;		/* no. of write from backing device */
+	atomic64_t bd_wb_limit;		/* writeback limit of backing device */
 #endif
 };
 
@@ -113,6 +114,7 @@ struct zram {
 	 */
 	bool claim; /* Protected by bdev->bd_mutex */
 	struct file *backing_dev;
+	bool stop_writeback;
 #ifdef CONFIG_ZRAM_WRITEBACK
 	struct block_device *bdev;
 	unsigned int old_block_size;
-- 
2.19.1.1215.g8438c0b245-goog



* Re: [PATCH 3/6] zram: introduce ZRAM_IDLE flag
  2018-11-16  7:20 ` [PATCH 3/6] zram: introduce ZRAM_IDLE flag Minchan Kim
@ 2018-11-20  2:46   ` Sergey Senozhatsky
  2018-11-22  5:11     ` Minchan Kim
  0 siblings, 1 reply; 19+ messages in thread
From: Sergey Senozhatsky @ 2018-11-20  2:46 UTC (permalink / raw)
  To: Minchan Kim; +Cc: Andrew Morton, LKML, Sergey Senozhatsky

Hello,

On (11/16/18 16:20), Minchan Kim wrote:
[..]
> +static ssize_t idle_store(struct device *dev,
> +		struct device_attribute *attr, const char *buf, size_t len)
> +{
> +	struct zram *zram = dev_to_zram(dev);
> +	unsigned long nr_pages = zram->disksize >> PAGE_SHIFT;
> +	int index;
> +
> +	down_read(&zram->init_lock);
> +	if (!init_done(zram)) {
> +		up_read(&zram->init_lock);
> +		return -EINVAL;
> +	}
> +
> +	for (index = 0; index < nr_pages; index++) {
> +		zram_slot_lock(zram, index);
> +		if (!zram_allocated(zram, index))
> +			goto next;
> +
> +		zram_set_flag(zram, index, ZRAM_IDLE);
> +next:
> +		zram_slot_unlock(zram, index);
> +	}
> +
> +	up_read(&zram->init_lock);
> +
> +	return len;
> +}

This is one way of doing it.

The other one could, probably, be a bit more friendly to the cache
lines and CPU cycles. Basically, have a static timestamp variable
which would keep the timestamp of the last idle_store().

static ktime_t idle_snapshot_ts;

static ssize_t idle_store(struct device *dev,
			  struct device_attribute *attr,
			  const char *buf, size_t len)
{
	/* remember when the last idle snapshot was taken */
	idle_snapshot_ts = ktime_get_boottime();
	return len;
}

And then in read_block_state() compare handle access time and
idle_snapshot_ts (if it's not 0). If the page was not modified/accessed
since the last idle_snapshot_ts (handle access time <= idle_snapshot_ts),
then it's idle, otherwise (handle access time > idle_snapshot_ts) it's
not idle.

Would this do the trick?

	-ss


* Re: [PATCH 4/6] zram: support idle page writeback
  2018-11-16  7:20 ` [PATCH 4/6] zram: support idle page writeback Minchan Kim
@ 2018-11-21  4:55   ` Sergey Senozhatsky
  2018-11-21 13:34     ` Minchan Kim
  0 siblings, 1 reply; 19+ messages in thread
From: Sergey Senozhatsky @ 2018-11-21  4:55 UTC (permalink / raw)
  To: Minchan Kim; +Cc: Andrew Morton, LKML, Sergey Senozhatsky

On (11/16/18 16:20), Minchan Kim wrote:
> +		zram_set_flag(zram, index, ZRAM_UNDER_WB);
> +		zram_slot_unlock(zram, index);
> +		if (zram_bvec_read(zram, &bvec, index, 0, NULL)) {
> +			zram_slot_lock(zram, index);
> +			zram_clear_flag(zram, index, ZRAM_UNDER_WB);
> +			zram_slot_unlock(zram, index);
> +			continue;
> +		}
> +
> +		bio_init(&bio, &bio_vec, 1);
> +		bio_set_dev(&bio, zram->bdev);
> +		bio.bi_iter.bi_sector = blk_idx * (PAGE_SIZE >> 9);
> +		bio.bi_opf = REQ_OP_WRITE | REQ_SYNC;
> +
> +		bio_add_page(&bio, bvec.bv_page, bvec.bv_len,
> +				bvec.bv_offset);
> +		/*
> +		 * XXX: A single page IO would be inefficient for write
> +		 * but it would be not bad as starter.
> +		 */
> +		ret = submit_bio_wait(&bio);
> +		if (ret) {
> +			zram_slot_lock(zram, index);
> +			zram_clear_flag(zram, index, ZRAM_UNDER_WB);
> +			zram_slot_unlock(zram, index);
> +			continue;
> +		}

Just a thought,

I wonder if it would make sense (and if it would be possible) to write
back idle _compressed_ objects. Right now we decompress, say, a perfectly
fine 400-byte compressed object into a PAGE_SIZE-d object and then push
it to the WB device. In this particular case that means x10 bigger IO
pressure on the flash. If we could write/read compressed objects then we
would write and read 400 bytes, instead of PAGE_SIZE.
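
(Worked out for a 4K page: 4096 / 400 ~= 10.24, hence the x10.)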

	-ss


* Re: [PATCH 4/6] zram: support idle page writeback
  2018-11-21  4:55   ` Sergey Senozhatsky
@ 2018-11-21 13:34     ` Minchan Kim
  2018-11-22  2:14       ` Sergey Senozhatsky
  0 siblings, 1 reply; 19+ messages in thread
From: Minchan Kim @ 2018-11-21 13:34 UTC (permalink / raw)
  To: Sergey Senozhatsky; +Cc: Andrew Morton, LKML

On Wed, Nov 21, 2018 at 01:55:51PM +0900, Sergey Senozhatsky wrote:
> On (11/16/18 16:20), Minchan Kim wrote:
> > +		zram_set_flag(zram, index, ZRAM_UNDER_WB);
> > +		zram_slot_unlock(zram, index);
> > +		if (zram_bvec_read(zram, &bvec, index, 0, NULL)) {
> > +			zram_slot_lock(zram, index);
> > +			zram_clear_flag(zram, index, ZRAM_UNDER_WB);
> > +			zram_slot_unlock(zram, index);
> > +			continue;
> > +		}
> > +
> > +		bio_init(&bio, &bio_vec, 1);
> > +		bio_set_dev(&bio, zram->bdev);
> > +		bio.bi_iter.bi_sector = blk_idx * (PAGE_SIZE >> 9);
> > +		bio.bi_opf = REQ_OP_WRITE | REQ_SYNC;
> > +
> > +		bio_add_page(&bio, bvec.bv_page, bvec.bv_len,
> > +				bvec.bv_offset);
> > +		/*
> > +		 * XXX: A single page IO would be inefficient for write
> > +		 * but it would be not bad as starter.
> > +		 */
> > +		ret = submit_bio_wait(&bio);
> > +		if (ret) {
> > +			zram_slot_lock(zram, index);
> > +			zram_clear_flag(zram, index, ZRAM_UNDER_WB);
> > +			zram_slot_unlock(zram, index);
> > +			continue;
> > +		}
> 
> Just a thought,
> 
> I wonder if it would make sense (and if it would be possible) to write
> back idle _compressed_ objects. Right now we decompress, say, a perfectly
> fine 400-byte compressed object into a PAGE_SIZE-d object and then push
> it to the WB device. In this particular case that means x10 bigger IO
> pressure on the flash. If we could write/read compressed objects then we
> would write and read 400 bytes, instead of PAGE_SIZE.

Although it has pros and cons, that's my final goal, though it would
add a lot of complicated stuff. At some point, we should have the feature.
However, I want to go with the simple one first, which is very valuable, too.


* Re: [PATCH 4/6] zram: support idle page writeback
  2018-11-21 13:34     ` Minchan Kim
@ 2018-11-22  2:14       ` Sergey Senozhatsky
  2018-11-22  5:04         ` Minchan Kim
  0 siblings, 1 reply; 19+ messages in thread
From: Sergey Senozhatsky @ 2018-11-22  2:14 UTC (permalink / raw)
  To: Minchan Kim; +Cc: Sergey Senozhatsky, Andrew Morton, LKML

On (11/21/18 05:34), Minchan Kim wrote:
> > 
> > Just a thought,
> > 
> > I wonder if it would make sense (and if it would be possible) to write
> > back idle _compressed_ objects. Right now we decompress, say, a perfectly
> > fine 400-byte compressed object into a PAGE_SIZE-d object and then push
> > it to the WB device. In this particular case that means x10 bigger IO
> > pressure on the flash. If we could write/read compressed objects then we
> > would write and read 400 bytes, instead of PAGE_SIZE.
> 
> Although it has pros and cons, that's my final goal, though it would
> add a lot of complicated stuff. At some point, we should have the feature.

So you plan to switch to "compressed objects" writeback?

> However, I want to go with the simple one first, which is very valuable, too.

Flash wearout is a serious problem; maybe less of a problem on smart
phones, but a much bigger one on TVs and other embedded devices that have
lifespans of 5+ years. With "writeback idle compressed" we can remove
the existing "writeback incompressible pages" and write back only
"idle, compressed" pages.

The existing incompressible writeback is way too aggressive and,
additionally, too simple. It writes back pages which can be
swapped in immediately, which basically means that we do pointless
PAGE_SIZE writes to a device which doesn't really like pointless
writes.

It's a whole different story with idle, compressible pages writeback.

	-ss


* Re: [PATCH 4/6] zram: support idle page writeback
  2018-11-22  2:14       ` Sergey Senozhatsky
@ 2018-11-22  5:04         ` Minchan Kim
  2018-11-22  5:40           ` Sergey Senozhatsky
  0 siblings, 1 reply; 19+ messages in thread
From: Minchan Kim @ 2018-11-22  5:04 UTC (permalink / raw)
  To: Sergey Senozhatsky; +Cc: Andrew Morton, LKML

On Thu, Nov 22, 2018 at 11:14:43AM +0900, Sergey Senozhatsky wrote:
> On (11/21/18 05:34), Minchan Kim wrote:
> > > 
> > > Just a thought,
> > > 
> > > I wonder if it would make sense (and if it would be possible) to write
> > > back idle _compressed_ objects. Right now we decompress, say, a perfectly
> > > fine 400-byte compressed object into a PAGE_SIZE-d object and then push
> > > it to the WB device. In this particular case that means x10 bigger IO
> > > pressure on the flash. If we could write/read compressed objects then we
> > > would write and read 400 bytes, instead of PAGE_SIZE.
> > 
> > Although it has pros and cons, that's my final goal, though it would
> > add a lot of complicated stuff. At some point, we should have the feature.
> 
> So you plan to switch to "compressed objects" writeback?

No switch. I want both, finally. There are pros and cons.
Compressed writeback would be good for the wearout of the flash device
(that's why I want to have it), but it implies several reads of a block
(since one block holds several zpages) and decompression latency, as
well as complicated block management logic. That's all unnecessary if
the backing device or the system doesn't have a wearout concern.
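
For example, a packed block would need metadata along these lines
(a hypothetical layout only, nothing from this series):

/* hypothetical on-disk format for a block that packs several zpages */
struct zram_wb_block {
	__le16 nr_objs;			/* number of zpages packed here */
	struct {
		__le32 slot;		/* zram slot the zpage belongs to */
		__le16 offset;		/* position inside the block */
		__le16 size;		/* compressed length */
	} obj[];
	/*
	 * Compressed data follows the header. Reading one zpage means
	 * reading the whole block and decompressing only that object;
	 * freeing one zpage leaves a hole, so blocks need refcounting
	 * and compaction - the complicated management logic above.
	 */
};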

> 
> > However, I want to go with the simple one first, which is very valuable, too.
> 
> Flash wearout is a serious problem; maybe less of a problem on smart
> phones, but a much bigger one on TVs and other embedded devices that have
> lifespans of 5+ years. With "writeback idle compressed" we can remove

Yub, it's serious. That's why my patchset has a writeback limit and
stats, as well as idle marking, to help with the system design.

> the existing "writeback incompressible pages" and writeback only
> "idle, compressed" pages.
> 
> The existing incompressible writeback is way too aggressive and,

I do not agree. It depends on the system design.
Think of idle page writeback. Once a day, you write out idle pages.
At the initial stage, you could write every idle page to the storage;
it could be several hundred MB. And the next day? Only a few MB of
writes, because all the idle pages were stored yesterday.
Moreover, we have a write_limit. You can estimate how much the per-day
writes can damage your system. Your daemon can check once a day and
decide whether to write further or not.
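
Such a daemon could be as dumb as this sketch (the "idle" knob is from
patch 3; the "writeback" trigger name here is only an assumption, see
the actual patches for the real interface):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

static void poke(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);
	ssize_t ret;

	if (fd < 0)
		return;
	ret = write(fd, val, strlen(val));
	(void)ret;	/* best effort in this sketch */
	close(fd);
}

int main(void)
{
	for (;;) {
		/* mark every allocated page idle */
		poke("/sys/block/zram0/idle", "1");
		/* anything still idle in a day was untouched... */
		sleep(86400);
		/* ...so write it out to the backing device */
		poke("/sys/block/zram0/writeback", "idle");
	}
	return 0;
}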

> additionally, too simple. It writes back pages which can be
> swapped in immediately, which basically means that we do pointless
> PAGE_SIZE writes to a device which doesn't really like pointless
> writes.

This patchset aims for *IDLE page* writeback, and you can define
what an IDLE page is by yourself. It doesn't do pointless writeback.

> 
> It's a whole different story with idle, compressible pages writeback.

I don't understand your point.

> 
> 	-ss


* Re: [PATCH 3/6] zram: introduce ZRAM_IDLE flag
  2018-11-20  2:46   ` Sergey Senozhatsky
@ 2018-11-22  5:11     ` Minchan Kim
  2018-11-22  5:45       ` Sergey Senozhatsky
  0 siblings, 1 reply; 19+ messages in thread
From: Minchan Kim @ 2018-11-22  5:11 UTC (permalink / raw)
  To: Sergey Senozhatsky; +Cc: Andrew Morton, LKML

On Tue, Nov 20, 2018 at 11:46:59AM +0900, Sergey Senozhatsky wrote:
> Hello,
> 
> On (11/16/18 16:20), Minchan Kim wrote:
> [..]
> > +static ssize_t idle_store(struct device *dev,
> > +		struct device_attribute *attr, const char *buf, size_t len)
> > +{
> > +	struct zram *zram = dev_to_zram(dev);
> > +	unsigned long nr_pages = zram->disksize >> PAGE_SHIFT;
> > +	int index;
> > +
> > +	down_read(&zram->init_lock);
> > +	if (!init_done(zram)) {
> > +		up_read(&zram->init_lock);
> > +		return -EINVAL;
> > +	}
> > +
> > +	for (index = 0; index < nr_pages; index++) {
> > +		zram_slot_lock(zram, index);
> > +		if (!zram_allocated(zram, index))
> > +			goto next;
> > +
> > +		zram_set_flag(zram, index, ZRAM_IDLE);
> > +next:
> > +		zram_slot_unlock(zram, index);
> > +	}
> > +
> > +	up_read(&zram->init_lock);
> > +
> > +	return len;
> > +}
> 
> This is one way of doing it.
> 
> The other one could, probably, be a bit friendlier to the cache
> lines and CPU cycles. Basically, have a static timestamp variable
> which would keep the timestamp of the last idle_store().
> 
> static ktime_t idle_snapshot_ts;
> 
> static ssize_t idle_store(struct device *dev,
> 			  struct device_attribute *attr,
> 			  const char *buf, size_t len)
> {
> 	idle_snapshot_ts = ktime_get_boottime();
> 	return len;
> }
> 
> And then in read_block_state() compare the handle access time with
> idle_snapshot_ts (if it's not 0). If the page was not modified/accessed
> since the last idle_snapshot_ts (handle access time <= idle_snapshot_ts),
> then it's idle; otherwise (handle access time > idle_snapshot_ts) it's
> not idle.
> 
> Would this do the trick?

It was an option when I first imagined this idea, but the problem raised
by the product division was the memory waste of ac_time for every zram
table entry.



* Re: [PATCH 4/6] zram: support idle page writeback
  2018-11-22  5:04         ` Minchan Kim
@ 2018-11-22  5:40           ` Sergey Senozhatsky
  2018-11-22  6:15             ` Minchan Kim
  0 siblings, 1 reply; 19+ messages in thread
From: Sergey Senozhatsky @ 2018-11-22  5:40 UTC (permalink / raw)
  To: Minchan Kim; +Cc: Sergey Senozhatsky, Andrew Morton, LKML

On (11/22/18 14:04), Minchan Kim wrote:
> 
> > additionally, too simple. It writes back pages which can be
> > swapped in immediately, which basically means that we do pointless
> > PAGE_SIZE writes to a device which doesn't really like pointless
> > writes.
> 
> This patchset aims for *IDLE page* writeback, and you can define
> what an IDLE page is by yourself. It doesn't do pointless writeback.
> > 
> > It's a whole different story with idle, compressible pages writeback.
> 
> I don't understand your point.

Seems you misunderstood me. I'm not saying that IDLE writeback is bad.
On the contrary, I think IDLE writeback is x100 better than the writeback
we currently have.

The "pointless writeback" comment was about the existing writeback,
when we WB pages which we couldn't compress. We can have a relatively
huge percentage of incompressible pages, and not all of them will end
up being IDLE:
 - we swap out a page
 - can't compress it
 - write back PAGE_SIZE
 - swap it in two seconds later

// as example //

IDLE page writeback (this patch set) looks to me like a really
significant improvement. Especially if we can write back compressed
objects and do, e.g., 300-byte writes instead of PAGE_SIZE writes.

	-ss


* Re: [PATCH 3/6] zram: introduce ZRAM_IDLE flag
  2018-11-22  5:11     ` Minchan Kim
@ 2018-11-22  5:45       ` Sergey Senozhatsky
  0 siblings, 0 replies; 19+ messages in thread
From: Sergey Senozhatsky @ 2018-11-22  5:45 UTC (permalink / raw)
  To: Minchan Kim; +Cc: Sergey Senozhatsky, Andrew Morton, LKML

On (11/22/18 14:11), Minchan Kim wrote:
> 
> It was an option when I first imagined this idea, but the problem raised
> by the product division was the memory waste of ac_time for every zram
> table entry.

OK, I see.

	-ss


* Re: [PATCH 4/6] zram: support idle page writeback
  2018-11-22  5:40           ` Sergey Senozhatsky
@ 2018-11-22  6:15             ` Minchan Kim
  2018-11-22  6:31               ` Minchan Kim
  0 siblings, 1 reply; 19+ messages in thread
From: Minchan Kim @ 2018-11-22  6:15 UTC (permalink / raw)
  To: Sergey Senozhatsky; +Cc: Andrew Morton, LKML

On Thu, Nov 22, 2018 at 02:40:40PM +0900, Sergey Senozhatsky wrote:
> On (11/22/18 14:04), Minchan Kim wrote:
> > 
> > > additionally, too simple. It writes back pages which can be
> > > swapped in immediately, which basically means that we do pointless
> > > PAGE_SIZE writes to a device which doesn't really like pointless
> > > writes.
> > 
> > This patchset aims for *IDLE page* writeback, and you can define
> > what an IDLE page is by yourself. It doesn't do pointless writeback.
> > > 
> > > It's a whole different story with idle, compressible pages writeback.
> > 
> > I don't understand your point.
> 
> Seems you misunderstood me. I'm not saying that IDLE writeback is bad.
> On the contrary, I think IDLE writeback is x100 better than the writeback
> we currently have.
> 
> The "pointless writeback" comment was about the existing writeback,
> when we WB pages which we couldn't compress. We can have a relatively
> huge percentage of incompressible pages, and not all of them will end
> up being IDLE:
>  - we swap out a page
>  - can't compress it
>  - write back PAGE_SIZE
>  - swap it in two seconds later

I got what you mean now. Let's call it "incompressible page writeback"
to prevent confusion.

"Incompressible page writeback" would be an orthogonal feature. The goal is
"let's save memory at the cost of *latency*". If the page is swapped in
soon, it's unfortunate. However, the design expects that once a page is
swapped out, it is not part of the working set, so soonish swap-ins should,
theoretically, be rare compared to the rest of the working set.
If they are too frequent, it means the system was heavily overcommitted.


* Re: [PATCH 4/6] zram: support idle page writeback
  2018-11-22  6:15             ` Minchan Kim
@ 2018-11-22  6:31               ` Minchan Kim
  2018-11-22  6:59                 ` Sergey Senozhatsky
  0 siblings, 1 reply; 19+ messages in thread
From: Minchan Kim @ 2018-11-22  6:31 UTC (permalink / raw)
  To: Sergey Senozhatsky; +Cc: Andrew Morton, LKML

On Thu, Nov 22, 2018 at 03:15:42PM +0900, Minchan Kim wrote:
> On Thu, Nov 22, 2018 at 02:40:40PM +0900, Sergey Senozhatsky wrote:
> > On (11/22/18 14:04), Minchan Kim wrote:
> > > 
> > > > additionally, too simple. It writes back pages which can be
> > > > swapped in immediately, which basically means that we do pointless
> > > > PAGE_SIZE writes to a device which doesn't really like pointless
> > > > writes.
> > > 
> > > This patchset aims for *IDLE page* writeback, and you can define
> > > what an IDLE page is by yourself. It doesn't do pointless writeback.
> > > > 
> > > > It's a whole different story with idle, compressible pages writeback.
> > > 
> > > I don't understand your point.
> > 
> > Seems you misunderstood me. I'm not saying that IDLE writeback is bad.
> > On the contrary, I think IDLE writeback is x100 better than the writeback
> > we currently have.
> > 
> > The "pointless writeback" comment was about the existing writeback,
> > when we WB pages which we couldn't compress. We can have a relatively
> > huge percentage of incompressible pages, and not all of them will end
> > up being IDLE:
> >  - we swap out a page
> >  - can't compress it
> >  - write back PAGE_SIZE
> >  - swap it in two seconds later
> 
> I got what you mean now. Let's call it "incompressible page writeback"
> to prevent confusion.
> 
> "Incompressible page writeback" would be an orthogonal feature. The goal is
> "let's save memory at the cost of *latency*". If the page is swapped in
> soon, it's unfortunate. However, the design expects that once a page is
> swapped out, it is not part of the working set, so soonish swap-ins should,
> theoretically, be rare compared to the rest of the working set.
> If they are too frequent, it means the system was heavily overcommitted.

Having said that, I agree it's not a good idea to enable incompressible
page writeback together with idle page writeback. If you don't oppose, I
want to add a new knob to "enable incompressible page writeback", so that
by default, even though we enable CONFIG_ZRAM_WRITEBACK, incompressible
page writeback is off until we enable the knob.
It would cause some regression if someone has used the feature, but
I guess we are not too late.

What do you think?



* Re: [PATCH 4/6] zram: support idle page writeback
  2018-11-22  6:31               ` Minchan Kim
@ 2018-11-22  6:59                 ` Sergey Senozhatsky
  2018-11-23  6:23                   ` Minchan Kim
  0 siblings, 1 reply; 19+ messages in thread
From: Sergey Senozhatsky @ 2018-11-22  6:59 UTC (permalink / raw)
  To: Minchan Kim; +Cc: Sergey Senozhatsky, Andrew Morton, LKML

On (11/22/18 15:31), Minchan Kim wrote:
> > 
> > I got what you mean now. Let's call it "incompressible page writeback"
> > to prevent confusion.
> > 
> > "Incompressible page writeback" would be an orthogonal feature. The goal is
> > "let's save memory at the cost of *latency*". If the page is swapped in
> > soon, it's unfortunate. However, the design expects that once a page is
> > swapped out, it is not part of the working set, so soonish swap-ins should,
> > theoretically, be rare compared to the rest of the working set.
> > If they are too frequent, it means the system was heavily overcommitted.
> 
> Having said that, I agree it's not a good idea to enable incompressible
> page writeback together with idle page writeback. If you don't oppose, I
> want to add a new knob to "enable incompressible page writeback", so that
> by default, even though we enable CONFIG_ZRAM_WRITEBACK, incompressible
> page writeback is off until we enable the knob.
> It would cause some regression if someone has used the feature, but
> I guess we are not too late.
> 
> What do you think?

Yes, totally works for me!


"IDLE writeback" is superior to "incompressible writeback".

"incompressible writeback" is completely unpredictable and
uncontrollable; it depens on data patterns and compression algorithms.
While "IDLE writeback" is predictable.

I even suspect that, *ideally*, we can remove "incompressible
writeback". "IDLE pages" are a superset which also includes
"incompressible" pages. So, technically, we can still do
"incompressible writeback" from the "IDLE writeback" path; but a much
more reasonable one, based on a page idling period.
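
Roughly, in the writeback loop (a sketch: huge_wb is a made-up knob and
writeback_slot() a made-up helper; ZRAM_HUGE is what zram already uses
to mark incompressible pages):

	if (zram_test_flag(zram, index, ZRAM_IDLE) ||
	    (huge_wb && zram_test_flag(zram, index, ZRAM_HUGE)))
		writeback_slot(zram, index);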

I understand that you want to keep "direct incompressible writeback"
around. But ZRAM is especially popular on devices which do suffer from
flash wearout, so I can see the "incompressible writeback" path becoming
dead code, long term.

	-ss


* Re: [PATCH 4/6] zram: support idle page writeback
  2018-11-22  6:59                 ` Sergey Senozhatsky
@ 2018-11-23  6:23                   ` Minchan Kim
  0 siblings, 0 replies; 19+ messages in thread
From: Minchan Kim @ 2018-11-23  6:23 UTC (permalink / raw)
  To: Sergey Senozhatsky; +Cc: Andrew Morton, LKML

On Thu, Nov 22, 2018 at 03:59:26PM +0900, Sergey Senozhatsky wrote:
> On (11/22/18 15:31), Minchan Kim wrote:
> > > 
> > > I got what you mean now. Let's call it "incompressible page writeback"
> > > to prevent confusion.
> > > 
> > > "Incompressible page writeback" would be an orthogonal feature. The goal is
> > > "let's save memory at the cost of *latency*". If the page is swapped in
> > > soon, it's unfortunate. However, the design expects that once a page is
> > > swapped out, it is not part of the working set, so soonish swap-ins should,
> > > theoretically, be rare compared to the rest of the working set.
> > > If they are too frequent, it means the system was heavily overcommitted.
> > 
> > Having said that, I agree it's not a good idea to enable incompressible
> > page writeback together with idle page writeback. If you don't oppose, I
> > want to add a new knob to "enable incompressible page writeback", so that
> > by default, even though we enable CONFIG_ZRAM_WRITEBACK, incompressible
> > page writeback is off until we enable the knob.
> > It would cause some regression if someone has used the feature, but
> > I guess we are not too late.
> > 
> > What do you think?
> 
> Yes, totally works for me!
> 
> 
> "IDLE writeback" is superior to "incompressible writeback".
> 
> "incompressible writeback" is completely unpredictable and
> uncontrollable; it depens on data patterns and compression algorithms.
> While "IDLE writeback" is predictable.
> 
> I even suspect, that, *ideally*, we can remove "incompressible
> writeback". "IDLE pages" is a super set which also includes
> "incompressible" pages. So, technically, we still can do
> "incompressible writeback" from "IDLE writeback" path; but a much
> more reasonable one, based on a page idling period.
> 
> I understand that you want to keep "direct incompressible writeback"
> around. ZRAM is especially popular on devices which do suffer from
> flash wearout, so I can see "incompressible writeback" path becoming
> a dead code, long term.

Okay, both options cause a regression if someone is using the feature.
Then let's try to remove it. It would be cleaner with the new idle
writeback.

Thanks!


