* [PATCH v1 0/5] stop anon reclaim when zram is full
@ 2014-09-22  0:03 Minchan Kim
  2014-09-22  0:03 ` [PATCH v1 1/5] zram: generalize swap_slot_free_notify Minchan Kim
                   ` (5 more replies)
  0 siblings, 6 replies; 30+ messages in thread
From: Minchan Kim @ 2014-09-22  0:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Hugh Dickins, Shaohua Li,
	Jerome Marchand, Sergey Senozhatsky, Dan Streetman, Nitin Gupta,
	Luigi Semenzato, juno.choi, Minchan Kim

For zram-swap, there is a size gap between the virtual disksize
and the physical memory actually available to zram, so the VM
can keep trying to reclaim anonymous pages even though zram is
already full. That makes the system almost hang (ie, become
unresponsive) easily in my kernel build test (ie, 1G DRAM, 12 CPUs,
4G zram swap, 50M zram limit). The VM should have killed someone.

This patchset adds a new hint, SWAP_FULL, so the VM can ask zram
about its fullness, and if it finds that zram is full, the VM stops
reclaiming anonymous pages until zram-swap gets new free space.

With this patchset, I see an OOM kill when zram-swap is full
instead of a hang with no response.
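
For a rough sense of the gap (illustrative numbers only, assuming
4KiB pages): a 4G virtual disksize gives the VM about 4G / 4K =
~1,048,576 apparently free swap slots, while the 50M mem_limit lets
zram store only on the order of 50M / 4K = ~12,800 pages worth of
compressed data (somewhat more thanks to compression), so the VM
keeps finding "free" slots and swapping out long after zram has
stopped accepting pages.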

Minchan Kim (5):
  zram: generalize swap_slot_free_notify
  mm: add full variable in swap_info_struct
  mm: VM can be aware of zram fullness
  zram: add swap full hint
  zram: add fullness knob to control swap full

 Documentation/ABI/testing/sysfs-block-zram |  10 +++
 Documentation/filesystems/Locking          |   4 +-
 drivers/block/zram/zram_drv.c              | 114 +++++++++++++++++++++++++++--
 drivers/block/zram/zram_drv.h              |   2 +
 include/linux/blkdev.h                     |   8 +-
 include/linux/swap.h                       |   1 +
 mm/page_io.c                               |   6 +-
 mm/swapfile.c                              |  77 ++++++++++++++-----
 8 files changed, 189 insertions(+), 33 deletions(-)

-- 
2.0.0



* [PATCH v1 1/5] zram: generalize swap_slot_free_notify
  2014-09-22  0:03 [PATCH v1 0/5] stop anon reclaim when zram is full Minchan Kim
@ 2014-09-22  0:03 ` Minchan Kim
  2014-09-22 20:41   ` Andrew Morton
  2014-09-22  0:03 ` [PATCH v1 2/5] mm: add full variable in swap_info_struct Minchan Kim
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 30+ messages in thread
From: Minchan Kim @ 2014-09-22  0:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Hugh Dickins, Shaohua Li,
	Jerome Marchand, Sergey Senozhatsky, Dan Streetman, Nitin Gupta,
	Luigi Semenzato, juno.choi, Minchan Kim

Currently, swap_slot_free_notify is used by zram to free the
duplicated copy of a page, for memory efficiency, when it knows
there is no reference left to the swap slot.

This patch generalizes it so that it can be used for other swap
hints to communicate with the VM.

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 Documentation/filesystems/Locking |  4 ++--
 drivers/block/zram/zram_drv.c     | 18 ++++++++++++++++--
 include/linux/blkdev.h            |  7 +++++--
 mm/page_io.c                      |  6 +++---
 mm/swapfile.c                     |  6 +++---
 5 files changed, 29 insertions(+), 12 deletions(-)

diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index 94d93b1f8b53..c262bfbeafa9 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -405,7 +405,7 @@ prototypes:
 	void (*unlock_native_capacity) (struct gendisk *);
 	int (*revalidate_disk) (struct gendisk *);
 	int (*getgeo)(struct block_device *, struct hd_geometry *);
-	void (*swap_slot_free_notify) (struct block_device *, unsigned long);
+	int (*swap_hint) (struct block_device *, unsigned int, void *);
 
 locking rules:
 			bd_mutex
@@ -418,7 +418,7 @@ media_changed:		no
 unlock_native_capacity:	no
 revalidate_disk:	no
 getgeo:			no
-swap_slot_free_notify:	no	(see below)
+swap_hint:		no	(see below)
 
 media_changed, unlock_native_capacity and revalidate_disk are called only from
 check_disk_change().
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index d78b245bae06..22a37764c409 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -926,7 +926,8 @@ error:
 	bio_io_error(bio);
 }
 
-static void zram_slot_free_notify(struct block_device *bdev,
+/* this callback is with swap_lock and sometimes page table lock held */
+static int zram_slot_free_notify(struct block_device *bdev,
 				unsigned long index)
 {
 	struct zram *zram;
@@ -939,10 +940,23 @@ static void zram_slot_free_notify(struct block_device *bdev,
 	zram_free_page(zram, index);
 	bit_spin_unlock(ZRAM_ACCESS, &meta->table[index].value);
 	atomic64_inc(&zram->stats.notify_free);
+
+	return 0;
+}
+
+static int zram_swap_hint(struct block_device *bdev,
+				unsigned int hint, void *arg)
+{
+	int ret = -EINVAL;
+
+	if (hint == SWAP_FREE)
+		ret = zram_slot_free_notify(bdev, (unsigned long)arg);
+
+	return ret;
 }
 
 static const struct block_device_operations zram_devops = {
-	.swap_slot_free_notify = zram_slot_free_notify,
+	.swap_hint = zram_swap_hint,
 	.owner = THIS_MODULE
 };
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index e267bf0db559..c7220409456c 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1609,6 +1609,10 @@ static inline bool blk_integrity_is_initialized(struct gendisk *g)
 
 #endif /* CONFIG_BLK_DEV_INTEGRITY */
 
+enum swap_blk_hint {
+	SWAP_FREE,
+};
+
 struct block_device_operations {
 	int (*open) (struct block_device *, fmode_t);
 	void (*release) (struct gendisk *, fmode_t);
@@ -1624,8 +1628,7 @@ struct block_device_operations {
 	void (*unlock_native_capacity) (struct gendisk *);
 	int (*revalidate_disk) (struct gendisk *);
 	int (*getgeo)(struct block_device *, struct hd_geometry *);
-	/* this callback is with swap_lock and sometimes page table lock held */
-	void (*swap_slot_free_notify) (struct block_device *, unsigned long);
+	int (*swap_hint)(struct block_device *, unsigned int, void *);
 	struct module *owner;
 };
 
diff --git a/mm/page_io.c b/mm/page_io.c
index 955db8b0d497..c6cc19655e97 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -114,7 +114,7 @@ void end_swap_bio_read(struct bio *bio, int err)
 			 * we again wish to reclaim it.
 			 */
 			struct gendisk *disk = sis->bdev->bd_disk;
-			if (disk->fops->swap_slot_free_notify) {
+			if (disk->fops->swap_hint) {
 				swp_entry_t entry;
 				unsigned long offset;
 
@@ -122,8 +122,8 @@ void end_swap_bio_read(struct bio *bio, int err)
 				offset = swp_offset(entry);
 
 				SetPageDirty(page);
-				disk->fops->swap_slot_free_notify(sis->bdev,
-						offset);
+				disk->fops->swap_hint(sis->bdev,
+						SWAP_FREE, (void *)offset);
 			}
 		}
 	}
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 8798b2e0ac59..c07f7f4912e9 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -816,9 +816,9 @@ static unsigned char swap_entry_free(struct swap_info_struct *p,
 		frontswap_invalidate_page(p->type, offset);
 		if (p->flags & SWP_BLKDEV) {
 			struct gendisk *disk = p->bdev->bd_disk;
-			if (disk->fops->swap_slot_free_notify)
-				disk->fops->swap_slot_free_notify(p->bdev,
-								  offset);
+			if (disk->fops->swap_hint)
+				disk->fops->swap_hint(p->bdev,
+						SWAP_FREE, (void *)offset);
 		}
 	}
 
-- 
2.0.0



* [PATCH v1 2/5] mm: add full variable in swap_info_struct
  2014-09-22  0:03 [PATCH v1 0/5] stop anon reclaim when zram is full Minchan Kim
  2014-09-22  0:03 ` [PATCH v1 1/5] zram: generalize swap_slot_free_notify Minchan Kim
@ 2014-09-22  0:03 ` Minchan Kim
  2014-09-22 20:45   ` Andrew Morton
  2014-09-24  2:53   ` Dan Streetman
  2014-09-22  0:03 ` [PATCH v1 3/5] mm: VM can be aware of zram fullness Minchan Kim
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 30+ messages in thread
From: Minchan Kim @ 2014-09-22  0:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Hugh Dickins, Shaohua Li,
	Jerome Marchand, Sergey Senozhatsky, Dan Streetman, Nitin Gupta,
	Luigi Semenzato, juno.choi, Minchan Kim

Now, swap leans on !p->highest_bit to indicate that a swap device
is full. That works well for normal swap because every slot on the
swap device is used up when the swap is full, but in the case of
zram, swap still sees many empty slots although the backing device
(ie, zram) is full because zram's limit has been hit, so it could
cause trouble when swap uses highest_bit to select a new slot via
free_cluster.

This patch introduces a full variable in swap_info_struct
to solve the problem.

Suggested-by: Dan Streetman <ddstreet@ieee.org>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 include/linux/swap.h |  1 +
 mm/swapfile.c        | 33 +++++++++++++++++++--------------
 2 files changed, 20 insertions(+), 14 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index ea4f926e6b9b..a3c11c051495 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -224,6 +224,7 @@ struct swap_info_struct {
 	struct swap_cluster_info free_cluster_tail; /* free cluster list tail */
 	unsigned int lowest_bit;	/* index of first free in swap_map */
 	unsigned int highest_bit;	/* index of last free in swap_map */
+	bool	full;			/* whether swap is full or not */
 	unsigned int pages;		/* total of usable pages of swap */
 	unsigned int inuse_pages;	/* number of those currently in use */
 	unsigned int cluster_next;	/* likely index for next allocation */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index c07f7f4912e9..209112cf8b83 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -558,7 +558,7 @@ checks:
 	}
 	if (!(si->flags & SWP_WRITEOK))
 		goto no_page;
-	if (!si->highest_bit)
+	if (si->full)
 		goto no_page;
 	if (offset > si->highest_bit)
 		scan_base = offset = si->lowest_bit;
@@ -589,6 +589,7 @@ checks:
 		spin_lock(&swap_avail_lock);
 		plist_del(&si->avail_list, &swap_avail_head);
 		spin_unlock(&swap_avail_lock);
+		si->full = true;
 	}
 	si->swap_map[offset] = usage;
 	inc_cluster_info_page(si, si->cluster_info, offset);
@@ -653,14 +654,14 @@ start_over:
 		plist_requeue(&si->avail_list, &swap_avail_head);
 		spin_unlock(&swap_avail_lock);
 		spin_lock(&si->lock);
-		if (!si->highest_bit || !(si->flags & SWP_WRITEOK)) {
+		if (si->full || !(si->flags & SWP_WRITEOK)) {
 			spin_lock(&swap_avail_lock);
 			if (plist_node_empty(&si->avail_list)) {
 				spin_unlock(&si->lock);
 				goto nextsi;
 			}
-			WARN(!si->highest_bit,
-			     "swap_info %d in list but !highest_bit\n",
+			WARN(si->full,
+			     "swap_info %d in list but swap is full\n",
 			     si->type);
 			WARN(!(si->flags & SWP_WRITEOK),
 			     "swap_info %d in list but !SWP_WRITEOK\n",
@@ -796,21 +797,25 @@ static unsigned char swap_entry_free(struct swap_info_struct *p,
 
 	/* free if no reference */
 	if (!usage) {
+		bool was_full;
+
 		dec_cluster_info_page(p, p->cluster_info, offset);
 		if (offset < p->lowest_bit)
 			p->lowest_bit = offset;
-		if (offset > p->highest_bit) {
-			bool was_full = !p->highest_bit;
+		if (offset > p->highest_bit)
 			p->highest_bit = offset;
-			if (was_full && (p->flags & SWP_WRITEOK)) {
-				spin_lock(&swap_avail_lock);
-				WARN_ON(!plist_node_empty(&p->avail_list));
-				if (plist_node_empty(&p->avail_list))
-					plist_add(&p->avail_list,
-						  &swap_avail_head);
-				spin_unlock(&swap_avail_lock);
-			}
+		was_full = p->full;
+
+		if (was_full && (p->flags & SWP_WRITEOK)) {
+			spin_lock(&swap_avail_lock);
+			WARN_ON(!plist_node_empty(&p->avail_list));
+			if (plist_node_empty(&p->avail_list))
+				plist_add(&p->avail_list,
+					  &swap_avail_head);
+			spin_unlock(&swap_avail_lock);
+			p->full = false;
 		}
+
 		atomic_long_inc(&nr_swap_pages);
 		p->inuse_pages--;
 		frontswap_invalidate_page(p->type, offset);
-- 
2.0.0



* [PATCH v1 3/5] mm: VM can be aware of zram fullness
  2014-09-22  0:03 [PATCH v1 0/5] stop anon reclaim when zram is full Minchan Kim
  2014-09-22  0:03 ` [PATCH v1 1/5] zram: generalize swap_slot_free_notify Minchan Kim
  2014-09-22  0:03 ` [PATCH v1 2/5] mm: add full variable in swap_info_struct Minchan Kim
@ 2014-09-22  0:03 ` Minchan Kim
  2014-09-24 14:12   ` Dan Streetman
  2014-09-22  0:03 ` [PATCH v1 4/5] zram: add swap full hint Minchan Kim
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 30+ messages in thread
From: Minchan Kim @ 2014-09-22  0:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Hugh Dickins, Shaohua Li,
	Jerome Marchand, Sergey Senozhatsky, Dan Streetman, Nitin Gupta,
	Luigi Semenzato, juno.choi, Minchan Kim

The VM uses nr_swap_pages to throttle the amount of swap when it
reclaims anonymous pages, because nr_swap_pages means the freeable
space of the swap disk.

However, it's a problem for zram because zram can limit its memory
usage with a knob (ie, mem_limit), so that a swap out can fail even
though the VM sees lots of free space on the zram disk, since there
is no more free space left in zram due to the limit. If that happens,
the VM should notice it and stop reclaiming until zram can obtain
more free space, but we have no way to communicate between the VM
and zram.

This patch adds a new hint, SWAP_FULL, so that zram can tell the VM
"I'm full" from now on. Then the VM cannot reclaim anonymous pages
any more. When the VM notices the swap is full, it removes the
swap_info_struct from swap_avail_head and subtracts the remaining
freeable space from nr_swap_pages, so the VM considers the swap full
until it frees a swap slot and increases nr_swap_pages again.

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 include/linux/blkdev.h |  1 +
 mm/swapfile.c          | 44 ++++++++++++++++++++++++++++++++++++++------
 2 files changed, 39 insertions(+), 6 deletions(-)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index c7220409456c..39f074e0acd7 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1611,6 +1611,7 @@ static inline bool blk_integrity_is_initialized(struct gendisk *g)
 
 enum swap_blk_hint {
 	SWAP_FREE,
+	SWAP_FULL,
 };
 
 struct block_device_operations {
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 209112cf8b83..71e3df0431b6 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -493,6 +493,29 @@ static unsigned long scan_swap_map(struct swap_info_struct *si,
 	int latency_ration = LATENCY_LIMIT;
 
 	/*
+	 * If zram is full, we don't need to scan and want to stop swap.
+	 * To do that, we remove si from swap_avail_head and decrease
+	 * nr_swap_pages to prevent further anonymous reclaim, so that
+	 * the VM can restart swap out once zram has free space again.
+	 * Look at swap_entry_free.
+	 */
+	if (si->flags & SWP_BLKDEV) {
+		struct gendisk *disk = si->bdev->bd_disk;
+
+		if (disk->fops->swap_hint && disk->fops->swap_hint(
+				si->bdev, SWAP_FULL, NULL)) {
+			spin_lock(&swap_avail_lock);
+			WARN_ON(plist_node_empty(&si->avail_list));
+			plist_del(&si->avail_list, &swap_avail_head);
+			spin_unlock(&swap_avail_lock);
+			atomic_long_sub(si->pages - si->inuse_pages,
+						&nr_swap_pages);
+			si->full = true;
+			return 0;
+		}
+	}
+
+	/*
 	 * We try to cluster swap pages by allocating them sequentially
 	 * in swap.  Once we've allocated SWAPFILE_CLUSTER pages this
 	 * way, however, we resort to first-free allocation, starting
@@ -798,6 +821,14 @@ static unsigned char swap_entry_free(struct swap_info_struct *p,
 	/* free if no reference */
 	if (!usage) {
 		bool was_full;
+		struct gendisk *virt_swap = NULL;
+
+		/* Check virtual swap */
+		if (p->flags & SWP_BLKDEV) {
+			virt_swap = p->bdev->bd_disk;
+			if (!virt_swap->fops->swap_hint)
+				virt_swap = NULL;
+		}
 
 		dec_cluster_info_page(p, p->cluster_info, offset);
 		if (offset < p->lowest_bit)
@@ -814,17 +845,18 @@ static unsigned char swap_entry_free(struct swap_info_struct *p,
 					  &swap_avail_head);
 			spin_unlock(&swap_avail_lock);
 			p->full = false;
+			if (virt_swap)
+				atomic_long_add(p->pages -
+						p->inuse_pages,
+						&nr_swap_pages);
 		}
 
 		atomic_long_inc(&nr_swap_pages);
 		p->inuse_pages--;
 		frontswap_invalidate_page(p->type, offset);
-		if (p->flags & SWP_BLKDEV) {
-			struct gendisk *disk = p->bdev->bd_disk;
-			if (disk->fops->swap_hint)
-				disk->fops->swap_hint(p->bdev,
-						SWAP_FREE, (void *)offset);
-		}
+		if (virt_swap)
+			virt_swap->fops->swap_hint(p->bdev,
+					SWAP_FREE, (void *)offset);
 	}
 
 	return usage;
-- 
2.0.0



* [PATCH v1 4/5] zram: add swap full hint
  2014-09-22  0:03 [PATCH v1 0/5] stop anon reclaim when zram is full Minchan Kim
                   ` (2 preceding siblings ...)
  2014-09-22  0:03 ` [PATCH v1 3/5] mm: VM can be aware of zram fullness Minchan Kim
@ 2014-09-22  0:03 ` Minchan Kim
  2014-09-22 21:11   ` Andrew Morton
  2014-09-24 14:01   ` Dan Streetman
  2014-09-22  0:03 ` [PATCH v1 5/5] zram: add fullness knob to control swap full Minchan Kim
  2014-12-02  3:04 ` [PATCH v1 0/5] stop anon reclaim when zram is full Minchan Kim
  5 siblings, 2 replies; 30+ messages in thread
From: Minchan Kim @ 2014-09-22  0:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Hugh Dickins, Shaohua Li,
	Jerome Marchand, Sergey Senozhatsky, Dan Streetman, Nitin Gupta,
	Luigi Semenzato, juno.choi, Minchan Kim

This patch implements a SWAP_FULL handler in zram so that the VM
can know whether zram is full or not and use that to stop anonymous
page reclaim.

Fullness is judged as follows:

fullness = (100 * used space / total space)

It means that the higher fullness is, the later we reach zram-full.
The default fullness is 80, which biases toward more memory
consumption rather than an early OOM kill.

The logic above only applies once zram's used space goes over the
limit, but zram also pretends to be full once 32 consecutive
allocation failures happen. That is a safeguard to prevent a system
hang caused by fragmentation uncertainty.

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 drivers/block/zram/zram_drv.c | 60 ++++++++++++++++++++++++++++++++++++++++---
 drivers/block/zram/zram_drv.h |  1 +
 2 files changed, 57 insertions(+), 4 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 22a37764c409..649cad9d0b1c 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -43,6 +43,20 @@ static const char *default_compressor = "lzo";
 /* Module params (documentation at end) */
 static unsigned int num_devices = 1;
 
+/*
+ * If (100 * used_pages / total_pages) >= ZRAM_FULLNESS_PERCENT,
+ * we regard it as zram-full. It means that the higher
+ * ZRAM_FULLNESS_PERCENT is, the later we reach zram-full.
+ */
+#define ZRAM_FULLNESS_PERCENT 80
+
+/*
+ * If zram fails to allocate memory consecutively up to this,
+ * we regard it as zram-full. It's a safeguard to prevent too
+ * many swap write failures due to fragmentation uncertainty.
+ */
+#define ALLOC_FAIL_MAX	32
+
 #define ZRAM_ATTR_RO(name)						\
 static ssize_t zram_attr_##name##_show(struct device *d,		\
 				struct device_attribute *attr, char *b)	\
@@ -148,6 +162,7 @@ static ssize_t mem_limit_store(struct device *dev,
 
 	down_write(&zram->init_lock);
 	zram->limit_pages = PAGE_ALIGN(limit) >> PAGE_SHIFT;
+	atomic_set(&zram->alloc_fail, 0);
 	up_write(&zram->init_lock);
 
 	return len;
@@ -410,6 +425,7 @@ static void zram_free_page(struct zram *zram, size_t index)
 	atomic64_sub(zram_get_obj_size(meta, index),
 			&zram->stats.compr_data_size);
 	atomic64_dec(&zram->stats.pages_stored);
+	atomic_set(&zram->alloc_fail, 0);
 
 	meta->table[index].handle = 0;
 	zram_set_obj_size(meta, index, 0);
@@ -597,10 +613,15 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
 	}
 
 	alloced_pages = zs_get_total_pages(meta->mem_pool);
-	if (zram->limit_pages && alloced_pages > zram->limit_pages) {
-		zs_free(meta->mem_pool, handle);
-		ret = -ENOMEM;
-		goto out;
+	if (zram->limit_pages) {
+		if (alloced_pages > zram->limit_pages) {
+			zs_free(meta->mem_pool, handle);
+			atomic_inc(&zram->alloc_fail);
+			ret = -ENOMEM;
+			goto out;
+		} else {
+			atomic_set(&zram->alloc_fail, 0);
+		}
 	}
 
 	update_used_max(zram, alloced_pages);
@@ -711,6 +732,7 @@ static void zram_reset_device(struct zram *zram, bool reset_capacity)
 	down_write(&zram->init_lock);
 
 	zram->limit_pages = 0;
+	atomic_set(&zram->alloc_fail, 0);
 
 	if (!init_done(zram)) {
 		up_write(&zram->init_lock);
@@ -944,6 +966,34 @@ static int zram_slot_free_notify(struct block_device *bdev,
 	return 0;
 }
 
+static int zram_full(struct block_device *bdev, void *arg)
+{
+	struct zram *zram;
+	struct zram_meta *meta;
+	unsigned long total_pages, compr_pages;
+
+	zram = bdev->bd_disk->private_data;
+	if (!zram->limit_pages)
+		return 0;
+
+	meta = zram->meta;
+	total_pages = zs_get_total_pages(meta->mem_pool);
+
+	if (total_pages >= zram->limit_pages) {
+
+		compr_pages = atomic64_read(&zram->stats.compr_data_size)
+					>> PAGE_SHIFT;
+		if ((100 * compr_pages / total_pages)
+			>= ZRAM_FULLNESS_PERCENT)
+			return 1;
+	}
+
+	if (atomic_read(&zram->alloc_fail) > ALLOC_FAIL_MAX)
+		return 1;
+
+	return 0;
+}
+
 static int zram_swap_hint(struct block_device *bdev,
 				unsigned int hint, void *arg)
 {
@@ -951,6 +1001,8 @@ static int zram_swap_hint(struct block_device *bdev,
 
 	if (hint == SWAP_FREE)
 		ret = zram_slot_free_notify(bdev, (unsigned long)arg);
+	else if (hint == SWAP_FULL)
+		ret = zram_full(bdev, arg);
 
 	return ret;
 }
diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
index c6ee271317f5..fcf3176a9f15 100644
--- a/drivers/block/zram/zram_drv.h
+++ b/drivers/block/zram/zram_drv.h
@@ -113,6 +113,7 @@ struct zram {
 	u64 disksize;	/* bytes */
 	int max_comp_streams;
 	struct zram_stats stats;
+	atomic_t alloc_fail;
 	/*
 	 * the number of pages zram can consume for storing compressed data
 	 */
-- 
2.0.0



* [PATCH v1 5/5] zram: add fullness knob to control swap full
  2014-09-22  0:03 [PATCH v1 0/5] stop anon reclaim when zram is full Minchan Kim
                   ` (3 preceding siblings ...)
  2014-09-22  0:03 ` [PATCH v1 4/5] zram: add swap full hint Minchan Kim
@ 2014-09-22  0:03 ` Minchan Kim
  2014-09-22 21:17   ` Andrew Morton
  2014-12-02  3:04 ` [PATCH v1 0/5] stop anon reclaim when zram is full Minchan Kim
  5 siblings, 1 reply; 30+ messages in thread
From: Minchan Kim @ 2014-09-22  0:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Hugh Dickins, Shaohua Li,
	Jerome Marchand, Sergey Senozhatsky, Dan Streetman, Nitin Gupta,
	Luigi Semenzato, juno.choi, Minchan Kim

Some zram usecases could want a lower fullness than the default 80
to avoid unnecessary swapout-and-fail-and-recover overhead.

A typical example is multiple swap devices with a high-priority
zram-swap and a low-priority HDD-swap, where there could still be
enough free swap space although one of the swap devices (ie, zram)
is full. In that case it would be better to fail over to the
HDD-swap rather than keep failing swap writes to zram.

This patch exports fullness to the user so the user can control it
via the knob.

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 Documentation/ABI/testing/sysfs-block-zram | 10 ++++++++
 drivers/block/zram/zram_drv.c              | 38 +++++++++++++++++++++++++++++-
 drivers/block/zram/zram_drv.h              |  1 +
 3 files changed, 48 insertions(+), 1 deletion(-)

diff --git a/Documentation/ABI/testing/sysfs-block-zram b/Documentation/ABI/testing/sysfs-block-zram
index b13dc993291f..817738d14061 100644
--- a/Documentation/ABI/testing/sysfs-block-zram
+++ b/Documentation/ABI/testing/sysfs-block-zram
@@ -138,3 +138,13 @@ Description:
 		amount of memory ZRAM can use to store the compressed data.  The
 		limit could be changed in run time and "0" means disable the
 		limit.  No limit is the initial state.  Unit: bytes
+
+What:		/sys/block/zram<id>/fullness
+Date:		August 2014
+Contact:	Minchan Kim <minchan@kernel.org>
+Description:
+		The fullness file is read/write and specifies how easily
+		zram enters the full state; if you set it to a lower value,
+		zram reaches the full state sooner than with a higher value.
+		Currently the initial value is 80% but it can be changed.
+		Unit: Percentage
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 649cad9d0b1c..ec3656f6891d 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -136,6 +136,37 @@ static ssize_t max_comp_streams_show(struct device *dev,
 	return scnprintf(buf, PAGE_SIZE, "%d\n", val);
 }
 
+static ssize_t fullness_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	int val;
+	struct zram *zram = dev_to_zram(dev);
+
+	down_read(&zram->init_lock);
+	val = zram->fullness;
+	up_read(&zram->init_lock);
+
+	return scnprintf(buf, PAGE_SIZE, "%d\n", val);
+}
+
+static ssize_t fullness_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t len)
+{
+	int err;
+	unsigned long val;
+	struct zram *zram = dev_to_zram(dev);
+
+	err = kstrtoul(buf, 10, &val);
+	if (err || val > 100)
+		return -EINVAL;
+
+	down_write(&zram->init_lock);
+	zram->fullness = val;
+	up_write(&zram->init_lock);
+
+	return len;
+}
+
 static ssize_t mem_limit_show(struct device *dev,
 		struct device_attribute *attr, char *buf)
 {
@@ -733,6 +764,7 @@ static void zram_reset_device(struct zram *zram, bool reset_capacity)
 
 	zram->limit_pages = 0;
 	atomic_set(&zram->alloc_fail, 0);
+	zram->fullness = ZRAM_FULLNESS_PERCENT;
 
 	if (!init_done(zram)) {
 		up_write(&zram->init_lock);
@@ -984,7 +1016,7 @@ static int zram_full(struct block_device *bdev, void *arg)
 		compr_pages = atomic64_read(&zram->stats.compr_data_size)
 					>> PAGE_SHIFT;
 		if ((100 * compr_pages / total_pages)
-			>= ZRAM_FULLNESS_PERCENT)
+			>= zram->fullness)
 			return 1;
 	}
 
@@ -1020,6 +1052,8 @@ static DEVICE_ATTR(orig_data_size, S_IRUGO, orig_data_size_show, NULL);
 static DEVICE_ATTR(mem_used_total, S_IRUGO, mem_used_total_show, NULL);
 static DEVICE_ATTR(mem_limit, S_IRUGO | S_IWUSR, mem_limit_show,
 		mem_limit_store);
+static DEVICE_ATTR(fullness, S_IRUGO | S_IWUSR, fullness_show,
+		fullness_store);
 static DEVICE_ATTR(mem_used_max, S_IRUGO | S_IWUSR, mem_used_max_show,
 		mem_used_max_store);
 static DEVICE_ATTR(max_comp_streams, S_IRUGO | S_IWUSR,
@@ -1051,6 +1085,7 @@ static struct attribute *zram_disk_attrs[] = {
 	&dev_attr_compr_data_size.attr,
 	&dev_attr_mem_used_total.attr,
 	&dev_attr_mem_limit.attr,
+	&dev_attr_fullness.attr,
 	&dev_attr_mem_used_max.attr,
 	&dev_attr_max_comp_streams.attr,
 	&dev_attr_comp_algorithm.attr,
@@ -1132,6 +1167,7 @@ static int create_device(struct zram *zram, int device_id)
 	strlcpy(zram->compressor, default_compressor, sizeof(zram->compressor));
 	zram->meta = NULL;
 	zram->max_comp_streams = 1;
+	zram->fullness = ZRAM_FULLNESS_PERCENT;
 	return 0;
 
 out_free_disk:
diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
index fcf3176a9f15..6a9f383d0d78 100644
--- a/drivers/block/zram/zram_drv.h
+++ b/drivers/block/zram/zram_drv.h
@@ -119,6 +119,7 @@ struct zram {
 	 */
 	unsigned long limit_pages;
 
+	int fullness;
 	char compressor[10];
 };
 #endif
-- 
2.0.0



* Re: [PATCH v1 1/5] zram: generalize swap_slot_free_notify
  2014-09-22  0:03 ` [PATCH v1 1/5] zram: generalize swap_slot_free_notify Minchan Kim
@ 2014-09-22 20:41   ` Andrew Morton
  2014-09-23  4:45     ` Minchan Kim
  0 siblings, 1 reply; 30+ messages in thread
From: Andrew Morton @ 2014-09-22 20:41 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-kernel, linux-mm, Hugh Dickins, Shaohua Li,
	Jerome Marchand, Sergey Senozhatsky, Dan Streetman, Nitin Gupta,
	Luigi Semenzato, juno.choi

On Mon, 22 Sep 2014 09:03:07 +0900 Minchan Kim <minchan@kernel.org> wrote:

> Currently, swap_slot_free_notify is used for zram to free
> duplicated copy page for memory efficiency when it knows
> there is no reference to the swap slot.
> 
> This patch generalizes it to be able to use for other
> swap hint to communicate with VM.
> 

I really think we need to do a better job of documenting the code.

> index 94d93b1f8b53..c262bfbeafa9 100644
> --- a/Documentation/filesystems/Locking
> +++ b/Documentation/filesystems/Locking
> @@ -405,7 +405,7 @@ prototypes:
>  	void (*unlock_native_capacity) (struct gendisk *);
>  	int (*revalidate_disk) (struct gendisk *);
>  	int (*getgeo)(struct block_device *, struct hd_geometry *);
> -	void (*swap_slot_free_notify) (struct block_device *, unsigned long);
> +	int (*swap_hint) (struct block_device *, unsigned int, void *);
>  
>  locking rules:
>  			bd_mutex
> @@ -418,7 +418,7 @@ media_changed:		no
>  unlock_native_capacity:	no
>  revalidate_disk:	no
>  getgeo:			no
> -swap_slot_free_notify:	no	(see below)
> +swap_hint:		no	(see below)

This didn't tell anyone anything much.

> index d78b245bae06..22a37764c409 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -926,7 +926,8 @@ error:
>  	bio_io_error(bio);
>  }
>  
> -static void zram_slot_free_notify(struct block_device *bdev,
> +/* this callback is with swap_lock and sometimes page table lock held */

OK, that was useful.

It's called "page_table_lock".

Also *which* page_table_lock?  current->mm?

> +static int zram_slot_free_notify(struct block_device *bdev,
>  				unsigned long index)
>  {
>  	struct zram *zram;
>
> ...
>
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1609,6 +1609,10 @@ static inline bool blk_integrity_is_initialized(struct gendisk *g)
>  
>  #endif /* CONFIG_BLK_DEV_INTEGRITY */
>  
> +enum swap_blk_hint {
> +	SWAP_FREE,
> +};

This would be a great place to document SWAP_FREE.

>  struct block_device_operations {
>  	int (*open) (struct block_device *, fmode_t);
>  	void (*release) (struct gendisk *, fmode_t);
> @@ -1624,8 +1628,7 @@ struct block_device_operations {
>  	void (*unlock_native_capacity) (struct gendisk *);
>  	int (*revalidate_disk) (struct gendisk *);
>  	int (*getgeo)(struct block_device *, struct hd_geometry *);
> -	/* this callback is with swap_lock and sometimes page table lock held */
> -	void (*swap_slot_free_notify) (struct block_device *, unsigned long);
> +	int (*swap_hint)(struct block_device *, unsigned int, void *);

And this would be a suitable place to document ->swap_hint().

- Hint from who to who?  Is it the caller providing the callee a hint
  or is the caller asking the callee for a hint?

- What is the meaning of the return value?

- What are the meaning of the arguments?

Please don't omit the argument names like this.  They are useful!  How
is a reader to know what that "unsigned int" and "void *" actually
*do*?

The second arg-which-doesn't-have-a-name should have had type
swap_blk_hint, yes?

swap_blk_hint should be called swap_block_hint.  I assume that's what
"blk" means.  Why does the name have "block" in there anyway?  It has
something to do with disk blocks?  How is anyone supposed to work that
out?

->swap_hint was converted to return an `int', but all the callers
simply ignore the return value.



* Re: [PATCH v1 2/5] mm: add full variable in swap_info_struct
  2014-09-22  0:03 ` [PATCH v1 2/5] mm: add full variable in swap_info_struct Minchan Kim
@ 2014-09-22 20:45   ` Andrew Morton
  2014-09-23  4:45     ` Minchan Kim
  2014-09-24  2:53   ` Dan Streetman
  1 sibling, 1 reply; 30+ messages in thread
From: Andrew Morton @ 2014-09-22 20:45 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-kernel, linux-mm, Hugh Dickins, Shaohua Li,
	Jerome Marchand, Sergey Senozhatsky, Dan Streetman, Nitin Gupta,
	Luigi Semenzato, juno.choi

On Mon, 22 Sep 2014 09:03:08 +0900 Minchan Kim <minchan@kernel.org> wrote:

> Now, swap leans on !p->highest_bit to indicate a swap is full.
> It works well for normal swap because every slot on swap device
> is used up when the swap is full but in case of zram, swap sees
> still many empty slot although backed device(ie, zram) is full
> since zram's limit is over so that it could make trouble when
> swap use highest_bit to select new slot via free_cluster.
> 
> This patch introduces a full variable in swap_info_struct
> to solve the problem.
> 
> ...
>
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -224,6 +224,7 @@ struct swap_info_struct {
>  	struct swap_cluster_info free_cluster_tail; /* free cluster list tail */
>  	unsigned int lowest_bit;	/* index of first free in swap_map */
>  	unsigned int highest_bit;	/* index of last free in swap_map */
> +	bool	full;			/* whether swap is full or not */

This is protected by swap_info_struct.lock, I worked out.

There's a large comment at swap_info_struct.lock which could be updated.




* Re: [PATCH v1 4/5] zram: add swap full hint
  2014-09-22  0:03 ` [PATCH v1 4/5] zram: add swap full hint Minchan Kim
@ 2014-09-22 21:11   ` Andrew Morton
  2014-09-23  4:56     ` Minchan Kim
  2014-09-24 14:01   ` Dan Streetman
  1 sibling, 1 reply; 30+ messages in thread
From: Andrew Morton @ 2014-09-22 21:11 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-kernel, linux-mm, Hugh Dickins, Shaohua Li,
	Jerome Marchand, Sergey Senozhatsky, Dan Streetman, Nitin Gupta,
	Luigi Semenzato, juno.choi

On Mon, 22 Sep 2014 09:03:10 +0900 Minchan Kim <minchan@kernel.org> wrote:

> This patch implement SWAP_FULL handler in zram so that VM can
> know whether zram is full or not and use it to stop anonymous
> page reclaim.
> 
> How to judge fullness is below,
> 
> fullness = (100 * used space / total space)
> 
> It means the higher fullness is, the slower we reach zram full.
> Now, default of fullness is 80 so that it biased more memory
> consumption rather than early OOM kill.

It's unclear to me why this is being done.  What's wrong with "use it
until it's full then stop", which is what I assume the current code
does?  Why add this stuff?  What goes wrong with the current code and
how does this fix it?

ie: better explanation and justification in the changelogs, please.

> Above logic works only when used space of zram hit over the limit
> but zram also pretend to be full once 32 consecutive allocation
> fail happens. It's safe guard to prevent system hang caused by
> fragment uncertainty.

So allocation requests are of variable size, yes?  If so, the above
statement should read "32 consecutive allocation attempts for regions
of size 2 or more slots".  Because a failure of a single-slot
allocation attempt is an immediate failure.

The 32-in-a-row thing sounds like a hack.  Why can't we do this
deterministically?  If one request for four slots fails then the next
one will as well, so why bother retrying?

> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -43,6 +43,20 @@ static const char *default_compressor = "lzo";
>  /* Module params (documentation at end) */
>  static unsigned int num_devices = 1;
>  
> +/*
> + * If (100 * used_pages / total_pages) >= ZRAM_FULLNESS_PERCENT),
> + * we regard it as zram-full. It means that the higher
> + * ZRAM_FULLNESS_PERCENT is, the slower we reach zram full.
> + */

I just don't understand this patch :( To me, the above implies that the
user who sets 80% has elected to never use 20% of the zram capacity. 
Why on earth would anyone do that?  This changelog doesn't tell me.

> +#define ZRAM_FULLNESS_PERCENT 80

We've had problems in the past where 1% is just too large an increment
for large systems.

> @@ -597,10 +613,15 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
>  	}
>  
>  	alloced_pages = zs_get_total_pages(meta->mem_pool);
> -	if (zram->limit_pages && alloced_pages > zram->limit_pages) {
> -		zs_free(meta->mem_pool, handle);
> -		ret = -ENOMEM;
> -		goto out;
> +	if (zram->limit_pages) {
> +		if (alloced_pages > zram->limit_pages) {

This is all a bit racy, isn't it?  pool->pages_allocated and
zram->limit_pages could be changing under our feet.

> +			zs_free(meta->mem_pool, handle);
> +			atomic_inc(&zram->alloc_fail);
> +			ret = -ENOMEM;
> +			goto out;
> +		} else {
> +			atomic_set(&zram->alloc_fail, 0);
> +		}
 	}
 
 	update_used_max(zram, alloced_pages);

> @@ -711,6 +732,7 @@ static void zram_reset_device(struct zram *zram, bool reset_capacity)
>  	down_write(&zram->init_lock);
>  
>  	zram->limit_pages = 0;
> +	atomic_set(&zram->alloc_fail, 0);
>  
>  	if (!init_done(zram)) {
>  		up_write(&zram->init_lock);
> @@ -944,6 +966,34 @@ static int zram_slot_free_notify(struct block_device *bdev,
>  	return 0;
>  }
>  
> +static int zram_full(struct block_device *bdev, void *arg)

This could return a bool.  That implies that zram_swap_hint should
return bool too, but as we haven't been told what the zram_swap_hint
return value does, I'm a bit stumped.

And why include the unusefully-named "void *arg"?  It doesn't get used here.

> +{
> +	struct zram *zram;
> +	struct zram_meta *meta;
> +	unsigned long total_pages, compr_pages;
> +
> +	zram = bdev->bd_disk->private_data;
> +	if (!zram->limit_pages)
> +		return 0;
> +
> +	meta = zram->meta;
> +	total_pages = zs_get_total_pages(meta->mem_pool);
> +
> +	if (total_pages >= zram->limit_pages) {
> +
> +		compr_pages = atomic64_read(&zram->stats.compr_data_size)
> +					>> PAGE_SHIFT;
> +		if ((100 * compr_pages / total_pages)
> +			>= ZRAM_FULLNESS_PERCENT)
> +			return 1;
> +	}
> +
> +	if (atomic_read(&zram->alloc_fail) > ALLOC_FAIL_MAX)
> +		return 1;
> +
> +	return 0;
> +}
> +
>  static int zram_swap_hint(struct block_device *bdev,
>  				unsigned int hint, void *arg)
>  {



* Re: [PATCH v1 5/5] zram: add fullness knob to control swap full
  2014-09-22  0:03 ` [PATCH v1 5/5] zram: add fullness knob to control swap full Minchan Kim
@ 2014-09-22 21:17   ` Andrew Morton
  2014-09-23  4:57     ` Minchan Kim
  0 siblings, 1 reply; 30+ messages in thread
From: Andrew Morton @ 2014-09-22 21:17 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-kernel, linux-mm, Hugh Dickins, Shaohua Li,
	Jerome Marchand, Sergey Senozhatsky, Dan Streetman, Nitin Gupta,
	Luigi Semenzato, juno.choi

On Mon, 22 Sep 2014 09:03:11 +0900 Minchan Kim <minchan@kernel.org> wrote:

> Some zram usecase could want lower fullness than default 80 to
> avoid unnecessary swapout-and-fail-recover overhead.
> 
> A typical example is that multiple swap with high priority
> zram-swap and low priority HDD-swap so it could still enough
> free swap space although one of swap devices is full(ie, zram).
> It would be better to fail over to HDD-swap rather than failing
> swap write to zram in this case.
> 
> This patch exports fullness to user so user can control it
> via the knob.

Adding new userspace interfaces requires a pretty strong justification
and it's unclear to me that this is being met.  In fact the whole
patchset reads like "we have some problem, don't know how to fix it so
let's add a userspace knob to make it someone else's problem".

> index b13dc993291f..817738d14061 100644
> --- a/Documentation/ABI/testing/sysfs-block-zram
> +++ b/Documentation/ABI/testing/sysfs-block-zram
> @@ -138,3 +138,13 @@ Description:
>  		amount of memory ZRAM can use to store the compressed data.  The
>  		limit could be changed in run time and "0" means disable the
>  		limit.  No limit is the initial state.  Unit: bytes
> +
> +What:		/sys/block/zram<id>/fullness
> +Date:		August 2014
> +Contact:	Minchan Kim <minchan@kernel.org>
> +Description:
> +		The fullness file is read/write and specifies how easily
> +		zram become full state so if you set it to lower value,
> +		zram can reach full state easily compared to higher value.
> +		Currently, initial value is 80% but it could be changed.
> +		Unit: Percentage

And I don't think that there is sufficient information here for a user
to be able to work out what to do with this tunable.

> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -136,6 +136,37 @@ static ssize_t max_comp_streams_show(struct device *dev,
>  	return scnprintf(buf, PAGE_SIZE, "%d\n", val);
>  }
>  
> +static ssize_t fullness_show(struct device *dev,
> +		struct device_attribute *attr, char *buf)
> +{
> +	int val;
> +	struct zram *zram = dev_to_zram(dev);
> +
> +	down_read(&zram->init_lock);
> +	val = zram->fullness;
> +	up_read(&zram->init_lock);

Did we really need to take a lock to display a value which became
out-of-date as soon as we released that lock?

> +	return scnprintf(buf, PAGE_SIZE, "%d\n", val);
> +}
> +
> +static ssize_t fullness_store(struct device *dev,
> +		struct device_attribute *attr, const char *buf, size_t len)
> +{
> +	int err;
> +	unsigned long val;
> +	struct zram *zram = dev_to_zram(dev);
> +
> +	err = kstrtoul(buf, 10, &val);
> +	if (err || val > 100)
> +		return -EINVAL;

This overwrites the kstrtoul() return value.

> +
> +	down_write(&zram->init_lock);
> +	zram->fullness = val;
> +	up_write(&zram->init_lock);
> +
> +	return len;
> +}
> +



* Re: [PATCH v1 1/5] zram: generalize swap_slot_free_notify
  2014-09-22 20:41   ` Andrew Morton
@ 2014-09-23  4:45     ` Minchan Kim
  0 siblings, 0 replies; 30+ messages in thread
From: Minchan Kim @ 2014-09-23  4:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Hugh Dickins, Shaohua Li,
	Jerome Marchand, Sergey Senozhatsky, Dan Streetman, Nitin Gupta,
	Luigi Semenzato, juno.choi

Hi Andrew,

On Mon, Sep 22, 2014 at 01:41:09PM -0700, Andrew Morton wrote:
> On Mon, 22 Sep 2014 09:03:07 +0900 Minchan Kim <minchan@kernel.org> wrote:
> 
> > Currently, swap_slot_free_notify is used for zram to free
> > duplicated copy page for memory efficiency when it knows
> > there is no reference to the swap slot.
> > 
> > This patch generalizes it to be able to use for other
> > swap hint to communicate with VM.
> > 
> 
> I really think we need to do a better job of documenting the code.
> 
> > index 94d93b1f8b53..c262bfbeafa9 100644
> > --- a/Documentation/filesystems/Locking
> > +++ b/Documentation/filesystems/Locking
> > @@ -405,7 +405,7 @@ prototypes:
> >  	void (*unlock_native_capacity) (struct gendisk *);
> >  	int (*revalidate_disk) (struct gendisk *);
> >  	int (*getgeo)(struct block_device *, struct hd_geometry *);
> > -	void (*swap_slot_free_notify) (struct block_device *, unsigned long);
> > +	int (*swap_hint) (struct block_device *, unsigned int, void *);
> >  
> >  locking rules:
> >  			bd_mutex
> > @@ -418,7 +418,7 @@ media_changed:		no
> >  unlock_native_capacity:	no
> >  revalidate_disk:	no
> >  getgeo:			no
> > -swap_slot_free_notify:	no	(see below)
> > +swap_hint:		no	(see below)
> 
> This didn't tell anyone anythnig much.

Yeb. :(

> 
> > index d78b245bae06..22a37764c409 100644
> > --- a/drivers/block/zram/zram_drv.c
> > +++ b/drivers/block/zram/zram_drv.c
> > @@ -926,7 +926,8 @@ error:
> >  	bio_io_error(bio);
> >  }
> >  
> > -static void zram_slot_free_notify(struct block_device *bdev,
> > +/* this callback is with swap_lock and sometimes page table lock held */
> 
> OK, that was useful.
> 
> It's called "page_table_lock".
> 
> Also *which* page_table_lock?  current->mm?

It depends on ALLOC_SPLIT_PTLOCKS, so it could be page->ptl, too.
So it would be better to call it *ptlock*?
Since it's a ptlock, it isn't tied to a particular mm_struct.
What we should make sure of is just the ptlock that belongs to the
page table pointing to this swap page.

So, I want this.

diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index c262bfbeafa9..19d2726e34f4 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -423,8 +423,8 @@ swap_hint:		no	(see below)
 media_changed, unlock_native_capacity and revalidate_disk are called only from
 check_disk_change().
 
-swap_slot_free_notify is called with swap_lock and sometimes the page lock
-held.
+swap_hint is called with swap_info_struct->lock and sometimes the ptlock
+of the page table that points to the swap page.
 
 
 --------------------------- file_operations -------------------------------

> 
> > +static int zram_slot_free_notify(struct block_device *bdev,
> >  				unsigned long index)
> >  {
> >  	struct zram *zram;
> >
> > ...
> >
> > --- a/include/linux/blkdev.h
> > +++ b/include/linux/blkdev.h
> > @@ -1609,6 +1609,10 @@ static inline bool blk_integrity_is_initialized(struct gendisk *g)
> >  
> >  #endif /* CONFIG_BLK_DEV_INTEGRITY */
> >  
> > +enum swap_blk_hint {
> > +	SWAP_FREE,
> > +};
> 
> This would be a great place to document SWAP_FREE.

Yes,

> 
> >  struct block_device_operations {
> >  	int (*open) (struct block_device *, fmode_t);
> >  	void (*release) (struct gendisk *, fmode_t);
> > @@ -1624,8 +1628,7 @@ struct block_device_operations {
> >  	void (*unlock_native_capacity) (struct gendisk *);
> >  	int (*revalidate_disk) (struct gendisk *);
> >  	int (*getgeo)(struct block_device *, struct hd_geometry *);
> > -	/* this callback is with swap_lock and sometimes page table lock held */
> > -	void (*swap_slot_free_notify) (struct block_device *, unsigned long);
> > +	int (*swap_hint)(struct block_device *, unsigned int, void *);
> 
> And this would be a suitable place to document ->swap_hint().

Since we might add more hints in the future and it could get verbose,
IMO it would be better to describe them in enum swap_hint. :)

> 
> - Hint from who to who?  Is it the caller providing the callee a hint
>   or is the caller asking the callee for a hint?
> 
> - What is the meaning of the return value?
> 
> - What are the meaning of the arguments?

Okay.

> 
> Please don't omit the argument names like this.  They are useful!  How
> is a reader to know what that "unsigned int" and "void *" actually
> *do*?

Yes.

> 
> The second arg-which-doesn't-have-a-name should have had type
> swap_blk_hint, yes?

Yes.

> 
> swap_blk_hint should be called swap_block_hint.  I assume that's what
> "blk" means.  Why does the name have "block" in there anyway?  It has
> something to do with disk blocks?  How is anyone supposed to work that
> out?

Yeb, I think we don't need "block" in the name. I will remove it.

> 
> ->swap_hint was converted to return an `int', but all the callers
> simply ignore the return value.

You're right. No caller uses it in this patch, but this patch
generalizes the callback and in a later patch SWAP_FULL will use it,
so I want to keep the return type as int. One more thought: SWAP_FULL
could use the void * as an output param so we could remove the return
value in the end. I will consider it, too.
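
To make that concrete, a sketch of how the documented hook could end
up looking with the renames discussed above (the wording and the
kernel-doc text here are only illustrative, not final; the member
would still live in struct block_device_operations):

enum swap_hint {
	/*
	 * VM -> driver: the swap slot at @arg (an offset cast to
	 * void *) has no users left, so the driver may free the
	 * backing data.
	 */
	SWAP_FREE,
	/*
	 * VM -> driver: "should I treat this device as full?"
	 * The driver reports fullness via the return value (or via
	 * @arg, if we switch to an output parameter).
	 */
	SWAP_FULL,
};

/*
 * swap_hint - exchange a hint about a swap device between the VM
 * and the backing driver.
 * @bdev: swap block device the hint applies to
 * @hint: one of enum swap_hint above
 * @arg:  hint-specific argument (slot offset for SWAP_FREE)
 *
 * Called with swap_info_struct->lock held and, for SWAP_FREE,
 * sometimes with the ptlock of the page table that maps the swapped
 * page.  The meaning of the return value depends on the hint;
 * -EINVAL for an unknown hint.
 */
int (*swap_hint)(struct block_device *bdev, enum swap_hint hint,
		 void *arg);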


-- 
Kind regards,
Minchan Kim


* Re: [PATCH v1 2/5] mm: add full variable in swap_info_struct
  2014-09-22 20:45   ` Andrew Morton
@ 2014-09-23  4:45     ` Minchan Kim
  0 siblings, 0 replies; 30+ messages in thread
From: Minchan Kim @ 2014-09-23  4:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Hugh Dickins, Shaohua Li,
	Jerome Marchand, Sergey Senozhatsky, Dan Streetman, Nitin Gupta,
	Luigi Semenzato, juno.choi

On Mon, Sep 22, 2014 at 01:45:22PM -0700, Andrew Morton wrote:
> On Mon, 22 Sep 2014 09:03:08 +0900 Minchan Kim <minchan@kernel.org> wrote:
> 
> > Now, swap leans on !p->highest_bit to indicate a swap is full.
> > It works well for normal swap because every slot on swap device
> > is used up when the swap is full but in case of zram, swap sees
> > still many empty slot although backed device(ie, zram) is full
> > since zram's limit is over so that it could make trouble when
> > swap use highest_bit to select new slot via free_cluster.
> > 
> > This patch introduces full varaiable in swap_info_struct
> > to solve the problem.
> > 
> > ...
> >
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -224,6 +224,7 @@ struct swap_info_struct {
> >  	struct swap_cluster_info free_cluster_tail; /* free cluster list tail */
> >  	unsigned int lowest_bit;	/* index of first free in swap_map */
> >  	unsigned int highest_bit;	/* index of last free in swap_map */
> > +	bool	full;			/* whether swap is full or not */
> 
> This is protected by swap_info_struct.lock, I worked out.
> 
> There's a large comment at swap_info_struct.lock which could be updated.

Sure.

-- 
Kind regards,
Minchan Kim


* Re: [PATCH v1 4/5] zram: add swap full hint
  2014-09-22 21:11   ` Andrew Morton
@ 2014-09-23  4:56     ` Minchan Kim
  2014-09-23 21:17       ` Andrew Morton
  0 siblings, 1 reply; 30+ messages in thread
From: Minchan Kim @ 2014-09-23  4:56 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Hugh Dickins, Shaohua Li,
	Jerome Marchand, Sergey Senozhatsky, Dan Streetman, Nitin Gupta,
	Luigi Semenzato, juno.choi

On Mon, Sep 22, 2014 at 02:11:18PM -0700, Andrew Morton wrote:
> On Mon, 22 Sep 2014 09:03:10 +0900 Minchan Kim <minchan@kernel.org> wrote:
> 
> > This patch implement SWAP_FULL handler in zram so that VM can
> > know whether zram is full or not and use it to stop anonymous
> > page reclaim.
> > 
> > How to judge fullness is below,
> > 
> > fullness = (100 * used space / total space)
> > 
> > It means the higher fullness is, the slower we reach zram full.
> > Now, default of fullness is 80 so that it biased more momory
> > consumption rather than early OOM kill.
> 
> It's unclear to me why this is being done.  What's wrong with "use it
> until it's full then stop", which is what I assume the current code
> does?  Why add this stuff?  What goes wrong with the current code and
> how does this fix it?
> 
> ie: better explanation and justification in the changelogs, please.

My bad. I should have written about the zram allocator's
fragmentation problem.

zsmalloc has various size classes, so it has a fragmentation problem.
For example: a page swaps out -> compresses to 32 bytes -> there is an
empty slot in zsmalloc's 32-byte size class -> the write succeeds.

Another page swaps out -> compresses to 256 bytes -> there is no empty
slot in zsmalloc's 256-byte size class -> zsmalloc should allocate a
new zspage, but that would go over the limit, so the write fails.

The problem is that the swap layer cannot know the compressed size of
a page in advance, so it cannot predict whether a swap write will
succeed, while it can still get an empty swap slot easily since zram's
virtual disksize is fairly large.

Given zsmalloc's fragmentation, it would be an *early OOM* if zram
said *full* as soon as it reached the page limit, because it could
still have empty slots in various size classes. IOW, that would ignore
the fragmentation problem, so this patch suggests two conditions to
solve it.

	if (total_pages >= zram->limit_pages) {

		compr_pages = atomic64_read(&zram->stats.compr_data_size)
					>> PAGE_SHIFT;
		if ((100 * compr_pages / total_pages)
			>= zram->fullness)
			return 1;
	}

First of all, the pages zram consumes should reach the *limit*, and
only then do we consider fullness. If used space is over 80%, we
regard it as full in this implementation, because I want to favor
more memory usage and avoid an early OOM kill, considering zram's
popular usecase in embedded systems.
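
As a worked example with made-up numbers: with mem_limit = 50M (ie,
limit_pages = 12800 with 4K pages), suppose zs_get_total_pages()
returns 12900 (>= the limit) and compr_data_size corresponds to 10500
pages.  Then 100 * 10500 / 12900 = ~81 >= 80, so zram reports full.
If the pool were more fragmented, say only 9000 pages of compressed
data in those 12900 pool pages, that is ~69 < 80 and zram keeps
accepting writes until either the ratio rises or 32 consecutive
allocations fail.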

> 
> > Above logic works only when used space of zram hit over the limit
> > but zram also pretend to be full once 32 consecutive allocation
> > fail happens. It's safe guard to prevent system hang caused by
> > fragment uncertainty.
> 
> So allocation requests are of variable size, yes?  If so, the above
> statement should read "32 consecutive allocation attempts for regions
> of size 2 or more slots".  Because a failure of a single-slot
> allocation attempt is an immediate failure.
> 
> The 32-in-a-row thing sounds like a hack.  Why can't we do this
> deterministically?  If one request for four slots fails then the next
> one will as well, so why bother retrying?

The problem is that the swap layer cannot predict the final
compressed size in advance without actually compressing. If the page
compresses to a size for which zsmalloc has an empty slot in the size
class, the write succeeds.

> 
> > --- a/drivers/block/zram/zram_drv.c
> > +++ b/drivers/block/zram/zram_drv.c
> > @@ -43,6 +43,20 @@ static const char *default_compressor = "lzo";
> >  /* Module params (documentation at end) */
> >  static unsigned int num_devices = 1;
> >  
> > +/*
> > + * If (100 * used_pages / total_pages) >= ZRAM_FULLNESS_PERCENT),
> > + * we regards it as zram-full. It means that the higher
> > + * ZRAM_FULLNESS_PERCENT is, the slower we reach zram full.
> > + */
> 
> I just don't understand this patch :( To me, the above implies that the
> user who sets 80% has elected to never use 20% of the zram capacity. 
> Why on earth would anyone do that?  This changelog doesn't tell me.

I hope the words above make it clear.

> 
> > +#define ZRAM_FULLNESS_PERCENT 80
> 
> We've had problems in the past where 1% is just too large an increment
> for large systems.

So, do you want fullness_bytes like dirty_bytes?

> 
> > @@ -597,10 +613,15 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
> >  	}
> >  
> >  	alloced_pages = zs_get_total_pages(meta->mem_pool);
> > -	if (zram->limit_pages && alloced_pages > zram->limit_pages) {
> > -		zs_free(meta->mem_pool, handle);
> > -		ret = -ENOMEM;
> > -		goto out;
> > +	if (zram->limit_pages) {
> > +		if (alloced_pages > zram->limit_pages) {
> 
> This is all a bit racy, isn't it?  pool->pages_allocated and
> zram->limit_pages could be changing under our feet.

limit_pages cannot change thanks to init_lock, but
pool->pages_allocated can; the result of the race is not critical,
though.

1. A swap write fails, so the swap layer can just make the page dirty
   again, so it's no problem.
Or
2. An alloc_fail race, so zram could be declared full if the number
   of consecutive alloc failures is higher than 32 and we are already
   over the limit. I think it is rare and, if it happens, it's not a
   big problem, IMO.

> 
> > +			zs_free(meta->mem_pool, handle);
> > +			atomic_inc(&zram->alloc_fail);
> > +			ret = -ENOMEM;
> > +			goto out;
> > +		} else {
> > +			atomic_set(&zram->alloc_fail, 0);
> > +		}
>  	}
>  
>  	update_used_max(zram, alloced_pages);
> 
> > @@ -711,6 +732,7 @@ static void zram_reset_device(struct zram *zram, bool reset_capacity)
> >  	down_write(&zram->init_lock);
> >  
> >  	zram->limit_pages = 0;
> > +	atomic_set(&zram->alloc_fail, 0);
> >  
> >  	if (!init_done(zram)) {
> >  		up_write(&zram->init_lock);
> > @@ -944,6 +966,34 @@ static int zram_slot_free_notify(struct block_device *bdev,
> >  	return 0;
> >  }
> >  
> > +static int zram_full(struct block_device *bdev, void *arg)
> 
> This could return a bool.  That implies that zram_swap_hint should
> return bool too, but as we haven't been told what the zram_swap_hint
> return value does, I'm a bit stumped.

Hmm, currently SWAP_FREE doesn't use the return value and SWAP_FULL
uses it as a bool, so in the end we could change it to bool, but I
want to keep it as int for the future. At the least, we might use it
to propagate errors later. Instead, I will use *arg to return the
result rather than the return value. But I don't feel strongly about
it, so if you want to remove the return value, I will do it. For
clarification, please tell me again if you want that.
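
Just to illustrate the *arg-as-output idea, zram_full() could look
roughly like this (a sketch only, reusing the logic already posted in
4/5 and 5/5, with the caller passing a pointer to a bool):

static int zram_full(struct block_device *bdev, void *arg)
{
	struct zram *zram = bdev->bd_disk->private_data;
	struct zram_meta *meta = zram->meta;
	unsigned long total_pages, compr_pages;
	bool *full = arg;	/* caller passes &full and reads it back */

	*full = false;
	if (!zram->limit_pages)
		return 0;

	total_pages = zs_get_total_pages(meta->mem_pool);
	if (total_pages >= zram->limit_pages) {
		compr_pages = atomic64_read(&zram->stats.compr_data_size)
					>> PAGE_SHIFT;
		/* pool is both at the limit and well packed */
		if ((100 * compr_pages / total_pages) >= zram->fullness)
			*full = true;
	}

	/* too many consecutive allocation failures also mean "full" */
	if (atomic_read(&zram->alloc_fail) > ALLOC_FAIL_MAX)
		*full = true;

	return 0;
}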

> 
> And why include the unusefully-named "void *arg"?  It doesn't get used here.
> 
> > +{
> > +	struct zram *zram;
> > +	struct zram_meta *meta;
> > +	unsigned long total_pages, compr_pages;
> > +
> > +	zram = bdev->bd_disk->private_data;
> > +	if (!zram->limit_pages)
> > +		return 0;
> > +
> > +	meta = zram->meta;
> > +	total_pages = zs_get_total_pages(meta->mem_pool);
> > +
> > +	if (total_pages >= zram->limit_pages) {
> > +
> > +		compr_pages = atomic64_read(&zram->stats.compr_data_size)
> > +					>> PAGE_SHIFT;
> > +		if ((100 * compr_pages / total_pages)
> > +			>= ZRAM_FULLNESS_PERCENT)
> > +			return 1;
> > +	}
> > +
> > +	if (atomic_read(&zram->alloc_fail) > ALLOC_FAIL_MAX)
> > +		return 1;
> > +
> > +	return 0;
> > +}
> > +
> >  static int zram_swap_hint(struct block_device *bdev,
> >  				unsigned int hint, void *arg)
> >  {

-- 
Kind regards,
Minchan Kim


* Re: [PATCH v1 5/5] zram: add fullness knob to control swap full
  2014-09-22 21:17   ` Andrew Morton
@ 2014-09-23  4:57     ` Minchan Kim
  0 siblings, 0 replies; 30+ messages in thread
From: Minchan Kim @ 2014-09-23  4:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Hugh Dickins, Shaohua Li,
	Jerome Marchand, Sergey Senozhatsky, Dan Streetman, Nitin Gupta,
	Luigi Semenzato, juno.choi

On Mon, Sep 22, 2014 at 02:17:33PM -0700, Andrew Morton wrote:
> On Mon, 22 Sep 2014 09:03:11 +0900 Minchan Kim <minchan@kernel.org> wrote:
> 
> > Some zram use cases may want a lower fullness than the default 80 to
> > avoid unnecessary swapout-and-fail-recover overhead.
> > 
> > A typical example is multiple swap devices with a high-priority
> > zram-swap and a low-priority HDD-swap: there can still be enough
> > free swap space even though one of the swap devices (ie, zram) is
> > full.  It would be better to fail over to the HDD-swap rather than
> > keep failing swap writes to zram in this case.
> > 
> > This patch exports fullness to userspace so the user can control it
> > via the knob.
> 
> Adding new userspace interfaces requires a pretty strong justification
> and it's unclear to me that this is being met.  In fact the whole
> patchset reads like "we have some problem, don't know how to fix it so
> let's add a userspace knob to make it someone else's problem".

I explained the rationale in my reply to 4/5, but if it's not enough or
wrong, please tell me.

> 
> > index b13dc993291f..817738d14061 100644
> > --- a/Documentation/ABI/testing/sysfs-block-zram
> > +++ b/Documentation/ABI/testing/sysfs-block-zram
> > @@ -138,3 +138,13 @@ Description:
> >  		amount of memory ZRAM can use to store the compressed data.  The
> >  		limit could be changed in run time and "0" means disable the
> >  		limit.  No limit is the initial state.  Unit: bytes
> > +
> > +What:		/sys/block/zram<id>/fullness
> > +Date:		August 2014
> > +Contact:	Minchan Kim <minchan@kernel.org>
> > +Description:
> > +		The fullness file is read/write and specifies how easily
> > +		zram become full state so if you set it to lower value,
> > +		zram can reach full state easily compared to higher value.
> > +		Curretnly, initial value is 80% but it could be changed.
> > +		Unit: Percentage
> 
> And I don't think that there is sufficient information here for a user
> to be able to work out what to do with this tunable.

I will add more detail.

> 
> > --- a/drivers/block/zram/zram_drv.c
> > +++ b/drivers/block/zram/zram_drv.c
> > @@ -136,6 +136,37 @@ static ssize_t max_comp_streams_show(struct device *dev,
> >  	return scnprintf(buf, PAGE_SIZE, "%d\n", val);
> >  }
> >  
> > +static ssize_t fullness_show(struct device *dev,
> > +		struct device_attribute *attr, char *buf)
> > +{
> > +	int val;
> > +	struct zram *zram = dev_to_zram(dev);
> > +
> > +	down_read(&zram->init_lock);
> > +	val = zram->fullness;
> > +	up_read(&zram->init_lock);
> 
> Did we really need to take a lock to display a value which became
> out-of-date as soon as we released that lock?
> 
> > +	return scnprintf(buf, PAGE_SIZE, "%d\n", val);
> > +}
> > +
> > +static ssize_t fullness_store(struct device *dev,
> > +		struct device_attribute *attr, const char *buf, size_t len)
> > +{
> > +	int err;
> > +	unsigned long val;
> > +	struct zram *zram = dev_to_zram(dev);
> > +
> > +	err = kstrtoul(buf, 10, &val);
> > +	if (err || val > 100)
> > +		return -EINVAL;
> 
> This overwrites the kstrtoul() return value.

Will fix.
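
Probably just something like this (a sketch of the fix, not the actual
follow-up patch):

	err = kstrtoul(buf, 10, &val);
	if (err)
		return err;		/* keep kstrtoul()'s own error code */
	if (val > 100)
		return -EINVAL;

so the kstrtoul() return value is no longer overwritten by -EINVAL.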

Thanks for the review, Andrew.
-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v1 4/5] zram: add swap full hint
  2014-09-23  4:56     ` Minchan Kim
@ 2014-09-23 21:17       ` Andrew Morton
  2014-09-24  7:57         ` Minchan Kim
  2014-09-24 15:10         ` Jerome Marchand
  0 siblings, 2 replies; 30+ messages in thread
From: Andrew Morton @ 2014-09-23 21:17 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-kernel, linux-mm, Hugh Dickins, Shaohua Li,
	Jerome Marchand, Sergey Senozhatsky, Dan Streetman, Nitin Gupta,
	Luigi Semenzato, juno.choi

On Tue, 23 Sep 2014 13:56:02 +0900 Minchan Kim <minchan@kernel.org> wrote:

> > 
> > > +#define ZRAM_FULLNESS_PERCENT 80
> > 
> > We've had problems in the past where 1% is just too large an increment
> > for large systems.
> 
> So, do you want fullness_bytes like dirty_bytes?

Firstly I'd like you to think about whether we're ever likely to have
similar granularity problems with this tunable.  If not then forget
about it.

If yes then we should do something.  I don't like the "bytes" thing
much because it requires that the operator know the pool size
beforehand, and any time that changes, the "bytes" needs changing too.
Ratios are nice but percent is too coarse.  Maybe the kernel should
start using "ppm" for ratios, parts per million.  hrm.

> > > @@ -711,6 +732,7 @@ static void zram_reset_device(struct zram *zram, bool reset_capacity)
> > >  	down_write(&zram->init_lock);
> > >  
> > >  	zram->limit_pages = 0;
> > > +	atomic_set(&zram->alloc_fail, 0);
> > >  
> > >  	if (!init_done(zram)) {
> > >  		up_write(&zram->init_lock);
> > > @@ -944,6 +966,34 @@ static int zram_slot_free_notify(struct block_device *bdev,
> > >  	return 0;
> > >  }
> > >  
> > > +static int zram_full(struct block_device *bdev, void *arg)
> > 
> > This could return a bool.  That implies that zram_swap_hint should
> > return bool too, but as we haven't been told what the zram_swap_hint
> > return value does, I'm a bit stumped.
> 
> Hmm, currently, SWAP_FREE doesn't use return and SWAP_FULL uses return
> as bool so in the end, we can change it as bool but I want to remain it
> as int for the future. At least, we might use it as propagating error
> in future. Instead, I will use *arg to return the result instead of
> return val. But I'm not strong so if you want to remove return val,
> I will do it. For clarifictaion, please tell me again if you want.

I'm easy, as long as it makes sense, is understandable by people other
than he-who-wrote-it and doesn't use argument names such as "arg".



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v1 2/5] mm: add full variable in swap_info_struct
  2014-09-22  0:03 ` [PATCH v1 2/5] mm: add full variable in swap_info_struct Minchan Kim
  2014-09-22 20:45   ` Andrew Morton
@ 2014-09-24  2:53   ` Dan Streetman
  2014-09-24  7:57     ` Minchan Kim
  1 sibling, 1 reply; 30+ messages in thread
From: Dan Streetman @ 2014-09-24  2:53 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, Linux-MM, Hugh Dickins, Shaohua Li,
	Jerome Marchand, Sergey Senozhatsky, Nitin Gupta,
	Luigi Semenzato, juno.choi

On Sun, Sep 21, 2014 at 8:03 PM, Minchan Kim <minchan@kernel.org> wrote:
> Now, swap leans on !p->highest_bit to indicate a swap is full.
> That works well for normal swap because every slot on the swap device
> is used up when the swap is full, but in the case of zram, swap still
> sees many empty slots even though the backing device (ie, zram) is
> full because zram's limit has been hit, which can cause trouble when
> swap uses highest_bit to select a new slot via free_cluster.
>
> This patch introduces a full variable in swap_info_struct
> to solve the problem.
>
> Suggested-by: Dan Streetman <ddstreet@ieee.org>
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
>  include/linux/swap.h |  1 +
>  mm/swapfile.c        | 33 +++++++++++++++++++--------------
>  2 files changed, 20 insertions(+), 14 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index ea4f926e6b9b..a3c11c051495 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -224,6 +224,7 @@ struct swap_info_struct {
>         struct swap_cluster_info free_cluster_tail; /* free cluster list tail */
>         unsigned int lowest_bit;        /* index of first free in swap_map */
>         unsigned int highest_bit;       /* index of last free in swap_map */
> +       bool    full;                   /* whether swap is full or not */
>         unsigned int pages;             /* total of usable pages of swap */
>         unsigned int inuse_pages;       /* number of those currently in use */
>         unsigned int cluster_next;      /* likely index for next allocation */
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index c07f7f4912e9..209112cf8b83 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -558,7 +558,7 @@ checks:
>         }
>         if (!(si->flags & SWP_WRITEOK))
>                 goto no_page;
> -       if (!si->highest_bit)
> +       if (si->full)
>                 goto no_page;
>         if (offset > si->highest_bit)
>                 scan_base = offset = si->lowest_bit;
> @@ -589,6 +589,7 @@ checks:
>                 spin_lock(&swap_avail_lock);
>                 plist_del(&si->avail_list, &swap_avail_head);
>                 spin_unlock(&swap_avail_lock);
> +               si->full = true;
>         }
>         si->swap_map[offset] = usage;
>         inc_cluster_info_page(si, si->cluster_info, offset);
> @@ -653,14 +654,14 @@ start_over:
>                 plist_requeue(&si->avail_list, &swap_avail_head);
>                 spin_unlock(&swap_avail_lock);
>                 spin_lock(&si->lock);
> -               if (!si->highest_bit || !(si->flags & SWP_WRITEOK)) {
> +               if (si->full || !(si->flags & SWP_WRITEOK)) {
>                         spin_lock(&swap_avail_lock);
>                         if (plist_node_empty(&si->avail_list)) {
>                                 spin_unlock(&si->lock);
>                                 goto nextsi;
>                         }
> -                       WARN(!si->highest_bit,
> -                            "swap_info %d in list but !highest_bit\n",
> +                       WARN(si->full,
> +                            "swap_info %d in list but swap is full\n",
>                              si->type);
>                         WARN(!(si->flags & SWP_WRITEOK),
>                              "swap_info %d in list but !SWP_WRITEOK\n",
> @@ -796,21 +797,25 @@ static unsigned char swap_entry_free(struct swap_info_struct *p,
>
>         /* free if no reference */
>         if (!usage) {
> +               bool was_full;
> +
>                 dec_cluster_info_page(p, p->cluster_info, offset);
>                 if (offset < p->lowest_bit)
>                         p->lowest_bit = offset;
> -               if (offset > p->highest_bit) {
> -                       bool was_full = !p->highest_bit;
> +               if (offset > p->highest_bit)
>                         p->highest_bit = offset;
> -                       if (was_full && (p->flags & SWP_WRITEOK)) {
> -                               spin_lock(&swap_avail_lock);
> -                               WARN_ON(!plist_node_empty(&p->avail_list));
> -                               if (plist_node_empty(&p->avail_list))
> -                                       plist_add(&p->avail_list,
> -                                                 &swap_avail_head);
> -                               spin_unlock(&swap_avail_lock);
> -                       }
> +               was_full = p->full;
> +
> +               if (was_full && (p->flags & SWP_WRITEOK)) {

was_full was only needed because highest_bit was reset to offset right
before checking for fullness, so now that ->full is used instead of
!highest_bit, was_full isn't needed anymore, you can just check
p->full.
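
i.e. the was_full local could go away entirely and the block becomes
something like (sketch, untested):

	if (p->full && (p->flags & SWP_WRITEOK)) {
		spin_lock(&swap_avail_lock);
		WARN_ON(!plist_node_empty(&p->avail_list));
		if (plist_node_empty(&p->avail_list))
			plist_add(&p->avail_list, &swap_avail_head);
		spin_unlock(&swap_avail_lock);
		p->full = false;
	}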


> +                       spin_lock(&swap_avail_lock);
> +                       WARN_ON(!plist_node_empty(&p->avail_list));
> +                       if (plist_node_empty(&p->avail_list))
> +                               plist_add(&p->avail_list,
> +                                         &swap_avail_head);
> +                       spin_unlock(&swap_avail_lock);
> +                       p->full = false;
>                 }
> +
>                 atomic_long_inc(&nr_swap_pages);
>                 p->inuse_pages--;
>                 frontswap_invalidate_page(p->type, offset);
> --
> 2.0.0
>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v1 4/5] zram: add swap full hint
  2014-09-23 21:17       ` Andrew Morton
@ 2014-09-24  7:57         ` Minchan Kim
  2014-09-24 15:10         ` Jerome Marchand
  1 sibling, 0 replies; 30+ messages in thread
From: Minchan Kim @ 2014-09-24  7:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Hugh Dickins, Shaohua Li,
	Jerome Marchand, Sergey Senozhatsky, Dan Streetman, Nitin Gupta,
	Luigi Semenzato, juno.choi

On Tue, Sep 23, 2014 at 02:17:55PM -0700, Andrew Morton wrote:
> On Tue, 23 Sep 2014 13:56:02 +0900 Minchan Kim <minchan@kernel.org> wrote:
> 
> > > 
> > > > +#define ZRAM_FULLNESS_PERCENT 80
> > > 
> > > We've had problems in the past where 1% is just too large an increment
> > > for large systems.
> > 
> > So, do you want fullness_bytes like dirty_bytes?
> 
> Firstly I'd like you to think about whether we're ever likely to have
> similar granularity problems with this tunable.  If not then forget
> about it.

When I think about the use case for zram-swap, it is typically small
memory systems, but I'm not sure any more because these days mobile
phone DRAM sizes tend to be big (ex, 3G) and they still want to use
zram for swap due to NAND wear-leveling. Considering that trend, they
might set zram-swap to about 500M in the future. In that case, 1% is 5M,
and given zram's compression ratio (ie, max 5:1), that could mean 25M,
which is never small for an application.
So, IMO, we need a more fine-grained knob.
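
To spell out the same numbers (just restating the example above):

	500M * 1%            =  5M of compressed data per knob step
	5M * 5 (comp ratio)  = ~25M of uncompressed anonymous pages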

> 
> If yes then we should do something.  I don't like the "bytes" thing
> much because it requires that the operator know the pool size
> beforehand, and any time that changes, the "bytes" needs hanging too. 
> Ratios are nice but percent is too coarse.  Maybe kernel should start
> using "ppm" for ratios, parts per million.  hrm.

Okay, I will consider it more in the next spin.

> 
> > > > @@ -711,6 +732,7 @@ static void zram_reset_device(struct zram *zram, bool reset_capacity)
> > > >  	down_write(&zram->init_lock);
> > > >  
> > > >  	zram->limit_pages = 0;
> > > > +	atomic_set(&zram->alloc_fail, 0);
> > > >  
> > > >  	if (!init_done(zram)) {
> > > >  		up_write(&zram->init_lock);
> > > > @@ -944,6 +966,34 @@ static int zram_slot_free_notify(struct block_device *bdev,
> > > >  	return 0;
> > > >  }
> > > >  
> > > > +static int zram_full(struct block_device *bdev, void *arg)
> > > 
> > > This could return a bool.  That implies that zram_swap_hint should
> > > return bool too, but as we haven't been told what the zram_swap_hint
> > > return value does, I'm a bit stumped.
> > 
> > Hmm, currently, SWAP_FREE doesn't use return and SWAP_FULL uses return
> > as bool so in the end, we can change it as bool but I want to remain it
> > as int for the future. At least, we might use it as propagating error
> > in future. Instead, I will use *arg to return the result instead of
> > return val. But I'm not strong so if you want to remove return val,
> > I will do it. For clarifictaion, please tell me again if you want.
> 
> I'm easy, as long as it makes sense, is understandable by people other
> than he-who-wrote-it and doesn't use argument names such as "arg".

Yeb.

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v1 2/5] mm: add full variable in swap_info_struct
  2014-09-24  2:53   ` Dan Streetman
@ 2014-09-24  7:57     ` Minchan Kim
  0 siblings, 0 replies; 30+ messages in thread
From: Minchan Kim @ 2014-09-24  7:57 UTC (permalink / raw)
  To: Dan Streetman
  Cc: Andrew Morton, linux-kernel, Linux-MM, Hugh Dickins, Shaohua Li,
	Jerome Marchand, Sergey Senozhatsky, Nitin Gupta,
	Luigi Semenzato, juno.choi

On Tue, Sep 23, 2014 at 10:53:05PM -0400, Dan Streetman wrote:
> On Sun, Sep 21, 2014 at 8:03 PM, Minchan Kim <minchan@kernel.org> wrote:
> > Now, swap leans on !p->highest_bit to indicate a swap is full.
> > It works well for normal swap because every slot on swap device
> > is used up when the swap is full but in case of zram, swap sees
> > still many empty slot although backed device(ie, zram) is full
> > since zram's limit is over so that it could make trouble when
> > swap use highest_bit to select new slot via free_cluster.
> >
> > This patch introduces full varaiable in swap_info_struct
> > to solve the problem.
> >
> > Suggested-by: Dan Streetman <ddstreet@ieee.org>
> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > ---
> >  include/linux/swap.h |  1 +
> >  mm/swapfile.c        | 33 +++++++++++++++++++--------------
> >  2 files changed, 20 insertions(+), 14 deletions(-)
> >
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index ea4f926e6b9b..a3c11c051495 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -224,6 +224,7 @@ struct swap_info_struct {
> >         struct swap_cluster_info free_cluster_tail; /* free cluster list tail */
> >         unsigned int lowest_bit;        /* index of first free in swap_map */
> >         unsigned int highest_bit;       /* index of last free in swap_map */
> > +       bool    full;                   /* whether swap is full or not */
> >         unsigned int pages;             /* total of usable pages of swap */
> >         unsigned int inuse_pages;       /* number of those currently in use */
> >         unsigned int cluster_next;      /* likely index for next allocation */
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index c07f7f4912e9..209112cf8b83 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -558,7 +558,7 @@ checks:
> >         }
> >         if (!(si->flags & SWP_WRITEOK))
> >                 goto no_page;
> > -       if (!si->highest_bit)
> > +       if (si->full)
> >                 goto no_page;
> >         if (offset > si->highest_bit)
> >                 scan_base = offset = si->lowest_bit;
> > @@ -589,6 +589,7 @@ checks:
> >                 spin_lock(&swap_avail_lock);
> >                 plist_del(&si->avail_list, &swap_avail_head);
> >                 spin_unlock(&swap_avail_lock);
> > +               si->full = true;
> >         }
> >         si->swap_map[offset] = usage;
> >         inc_cluster_info_page(si, si->cluster_info, offset);
> > @@ -653,14 +654,14 @@ start_over:
> >                 plist_requeue(&si->avail_list, &swap_avail_head);
> >                 spin_unlock(&swap_avail_lock);
> >                 spin_lock(&si->lock);
> > -               if (!si->highest_bit || !(si->flags & SWP_WRITEOK)) {
> > +               if (si->full || !(si->flags & SWP_WRITEOK)) {
> >                         spin_lock(&swap_avail_lock);
> >                         if (plist_node_empty(&si->avail_list)) {
> >                                 spin_unlock(&si->lock);
> >                                 goto nextsi;
> >                         }
> > -                       WARN(!si->highest_bit,
> > -                            "swap_info %d in list but !highest_bit\n",
> > +                       WARN(si->full,
> > +                            "swap_info %d in list but swap is full\n",
> >                              si->type);
> >                         WARN(!(si->flags & SWP_WRITEOK),
> >                              "swap_info %d in list but !SWP_WRITEOK\n",
> > @@ -796,21 +797,25 @@ static unsigned char swap_entry_free(struct swap_info_struct *p,
> >
> >         /* free if no reference */
> >         if (!usage) {
> > +               bool was_full;
> > +
> >                 dec_cluster_info_page(p, p->cluster_info, offset);
> >                 if (offset < p->lowest_bit)
> >                         p->lowest_bit = offset;
> > -               if (offset > p->highest_bit) {
> > -                       bool was_full = !p->highest_bit;
> > +               if (offset > p->highest_bit)
> >                         p->highest_bit = offset;
> > -                       if (was_full && (p->flags & SWP_WRITEOK)) {
> > -                               spin_lock(&swap_avail_lock);
> > -                               WARN_ON(!plist_node_empty(&p->avail_list));
> > -                               if (plist_node_empty(&p->avail_list))
> > -                                       plist_add(&p->avail_list,
> > -                                                 &swap_avail_head);
> > -                               spin_unlock(&swap_avail_lock);
> > -                       }
> > +               was_full = p->full;
> > +
> > +               if (was_full && (p->flags & SWP_WRITEOK)) {
> 
> was_full was only needed because highest_bit was reset to offset right
> before checking for fullness, so now that ->full is used instead of
> !highest_bit, was_full isn't needed anymore, you can just check
> p->full.

Okay.

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v1 4/5] zram: add swap full hint
  2014-09-22  0:03 ` [PATCH v1 4/5] zram: add swap full hint Minchan Kim
  2014-09-22 21:11   ` Andrew Morton
@ 2014-09-24 14:01   ` Dan Streetman
  2014-09-25  1:02     ` Minchan Kim
  1 sibling, 1 reply; 30+ messages in thread
From: Dan Streetman @ 2014-09-24 14:01 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, Linux-MM, Hugh Dickins, Shaohua Li,
	Jerome Marchand, Sergey Senozhatsky, Nitin Gupta,
	Luigi Semenzato, juno.choi

On Sun, Sep 21, 2014 at 8:03 PM, Minchan Kim <minchan@kernel.org> wrote:
> This patch implements a SWAP_FULL handler in zram so that the VM can
> know whether zram is full or not and use it to stop anonymous
> page reclaim.
>
> Fullness is judged as follows:
>
> fullness = (100 * used space / total space)
>
> It means the higher fullness is, the slower we reach zram full.
> Now, the default fullness is 80, biasing toward more memory
> consumption rather than an early OOM kill.
>
> The above logic only works when the used space of zram goes over the
> limit, but zram also pretends to be full once 32 consecutive
> allocation failures happen. It's a safeguard to prevent a system hang
> caused by fragmentation uncertainty.
>
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
>  drivers/block/zram/zram_drv.c | 60 ++++++++++++++++++++++++++++++++++++++++---
>  drivers/block/zram/zram_drv.h |  1 +
>  2 files changed, 57 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index 22a37764c409..649cad9d0b1c 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -43,6 +43,20 @@ static const char *default_compressor = "lzo";
>  /* Module params (documentation at end) */
>  static unsigned int num_devices = 1;
>
> +/*
> + * If (100 * used_pages / total_pages) >= ZRAM_FULLNESS_PERCENT),
> + * we regards it as zram-full. It means that the higher
> + * ZRAM_FULLNESS_PERCENT is, the slower we reach zram full.
> + */
> +#define ZRAM_FULLNESS_PERCENT 80

As Andrew said, this (or the user-configurable fullness param from the
next patch) should have more detail about exactly why it's needed and
what it does.  The details of how zram considers itself "full" should
be clear, which probably includes explaining zsmalloc fragmentation.
It should also be clear that this param only matters when limit_pages
is set, and that it is only checked once zsmalloc's total size has
reached that limit.

Also, since the next patch changes it to be used only as a default,
shouldn't it be DEFAULT_ZRAM_FULLNESS_PERCENT or similar?

> +
> +/*
> + * If zram fails to allocate memory consecutively up to this,
> + * we regard it as zram-full. It's safe guard to prevent too
> + * many swap write fail due to lack of fragmentation uncertainty.
> + */
> +#define ALLOC_FAIL_MAX 32
> +
>  #define ZRAM_ATTR_RO(name)                                             \
>  static ssize_t zram_attr_##name##_show(struct device *d,               \
>                                 struct device_attribute *attr, char *b) \
> @@ -148,6 +162,7 @@ static ssize_t mem_limit_store(struct device *dev,
>
>         down_write(&zram->init_lock);
>         zram->limit_pages = PAGE_ALIGN(limit) >> PAGE_SHIFT;
> +       atomic_set(&zram->alloc_fail, 0);
>         up_write(&zram->init_lock);
>
>         return len;
> @@ -410,6 +425,7 @@ static void zram_free_page(struct zram *zram, size_t index)
>         atomic64_sub(zram_get_obj_size(meta, index),
>                         &zram->stats.compr_data_size);
>         atomic64_dec(&zram->stats.pages_stored);
> +       atomic_set(&zram->alloc_fail, 0);
>
>         meta->table[index].handle = 0;
>         zram_set_obj_size(meta, index, 0);
> @@ -597,10 +613,15 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
>         }
>
>         alloced_pages = zs_get_total_pages(meta->mem_pool);
> -       if (zram->limit_pages && alloced_pages > zram->limit_pages) {
> -               zs_free(meta->mem_pool, handle);
> -               ret = -ENOMEM;
> -               goto out;
> +       if (zram->limit_pages) {
> +               if (alloced_pages > zram->limit_pages) {
> +                       zs_free(meta->mem_pool, handle);
> +                       atomic_inc(&zram->alloc_fail);
> +                       ret = -ENOMEM;
> +                       goto out;
> +               } else {
> +                       atomic_set(&zram->alloc_fail, 0);
> +               }

So, with zram_full() checking for alloced_pages >= limit_pages, this
will need to be changed; as it is now, it prevents that condition from
ever being true.

Instead I believe this check has to be moved to before zs_malloc(), so
that alloced_pages > limit_pages can actually become true.
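
Something like this is what I'm thinking of (only a sketch of the
reordering, untested, and simplified from the real zram_bvec_write()):

	alloced_pages = zs_get_total_pages(meta->mem_pool);
	if (zram->limit_pages && alloced_pages > zram->limit_pages) {
		/* a previous allocation already pushed the pool past the
		 * limit; fail before allocating so the pool stays at or
		 * above limit_pages and zram_full() can observe it */
		atomic_inc(&zram->alloc_fail);
		ret = -ENOMEM;
		goto out;
	}
	atomic_set(&zram->alloc_fail, 0);

	handle = zs_malloc(meta->mem_pool, clen);
	if (!handle) {
		ret = -ENOMEM;
		goto out;
	}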


>         }
>
>         update_used_max(zram, alloced_pages);
> @@ -711,6 +732,7 @@ static void zram_reset_device(struct zram *zram, bool reset_capacity)
>         down_write(&zram->init_lock);
>
>         zram->limit_pages = 0;
> +       atomic_set(&zram->alloc_fail, 0);
>
>         if (!init_done(zram)) {
>                 up_write(&zram->init_lock);
> @@ -944,6 +966,34 @@ static int zram_slot_free_notify(struct block_device *bdev,
>         return 0;
>  }
>
> +static int zram_full(struct block_device *bdev, void *arg)
> +{
> +       struct zram *zram;
> +       struct zram_meta *meta;
> +       unsigned long total_pages, compr_pages;
> +
> +       zram = bdev->bd_disk->private_data;
> +       if (!zram->limit_pages)
> +               return 0;
> +
> +       meta = zram->meta;
> +       total_pages = zs_get_total_pages(meta->mem_pool);
> +
> +       if (total_pages >= zram->limit_pages) {
> +
> +               compr_pages = atomic64_read(&zram->stats.compr_data_size)
> +                                       >> PAGE_SHIFT;
> +               if ((100 * compr_pages / total_pages)
> +                       >= ZRAM_FULLNESS_PERCENT)
> +                       return 1;
> +       }
> +
> +       if (atomic_read(&zram->alloc_fail) > ALLOC_FAIL_MAX)
> +               return 1;
> +
> +       return 0;
> +}
> +
>  static int zram_swap_hint(struct block_device *bdev,
>                                 unsigned int hint, void *arg)
>  {
> @@ -951,6 +1001,8 @@ static int zram_swap_hint(struct block_device *bdev,
>
>         if (hint == SWAP_FREE)
>                 ret = zram_slot_free_notify(bdev, (unsigned long)arg);
> +       else if (hint == SWAP_FULL)
> +               ret = zram_full(bdev, arg);
>
>         return ret;
>  }
> diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
> index c6ee271317f5..fcf3176a9f15 100644
> --- a/drivers/block/zram/zram_drv.h
> +++ b/drivers/block/zram/zram_drv.h
> @@ -113,6 +113,7 @@ struct zram {
>         u64 disksize;   /* bytes */
>         int max_comp_streams;
>         struct zram_stats stats;
> +       atomic_t alloc_fail;
>         /*
>          * the number of pages zram can consume for storing compressed data
>          */
> --
> 2.0.0
>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v1 3/5] mm: VM can be aware of zram fullness
  2014-09-22  0:03 ` [PATCH v1 3/5] mm: VM can be aware of zram fullness Minchan Kim
@ 2014-09-24 14:12   ` Dan Streetman
  2014-09-25  1:06     ` Minchan Kim
  0 siblings, 1 reply; 30+ messages in thread
From: Dan Streetman @ 2014-09-24 14:12 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, Linux-MM, Hugh Dickins, Shaohua Li,
	Jerome Marchand, Sergey Senozhatsky, Nitin Gupta,
	Luigi Semenzato, juno.choi

On Sun, Sep 21, 2014 at 8:03 PM, Minchan Kim <minchan@kernel.org> wrote:
> The VM uses nr_swap_pages to throttle the amount of swap when it
> reclaims anonymous pages, because nr_swap_pages means the freeable
> space of the swap disk.
>
> However, that's a problem for zram because zram can limit its memory
> usage via a knob (ie, mem_limit), so swap out can fail even though the
> VM sees lots of free space on the zram disk while there is no more
> free space in zram due to the limit. If that happens, the VM should
> notice it and stop reclaiming until zram can obtain more free space,
> but we don't have a way to communicate between the VM and zram.
>
> This patch adds a new hint, SWAP_FULL, so that zram can tell the VM
> "I'm full" from now on. Then the VM does not reclaim anonymous pages
> any more. When the VM notices swap is full, it can remove the
> swap_info_struct from swap_avail_head and subtract the remaining
> freeable space from nr_swap_pages, so that the VM considers swap full
> until it frees a swap entry and increases nr_swap_pages again.
>
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
>  include/linux/blkdev.h |  1 +
>  mm/swapfile.c          | 44 ++++++++++++++++++++++++++++++++++++++------
>  2 files changed, 39 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index c7220409456c..39f074e0acd7 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1611,6 +1611,7 @@ static inline bool blk_integrity_is_initialized(struct gendisk *g)
>
>  enum swap_blk_hint {
>         SWAP_FREE,
> +       SWAP_FULL,
>  };
>
>  struct block_device_operations {
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 209112cf8b83..71e3df0431b6 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -493,6 +493,29 @@ static unsigned long scan_swap_map(struct swap_info_struct *si,
>         int latency_ration = LATENCY_LIMIT;
>
>         /*
> +        * If zram is full, we don't need to scan and want to stop swap.
> +        * For it, we removes si from swap_avail_head and decreases
> +        * nr_swap_pages to prevent further anonymous reclaim so that
> +        * VM can restart swap out if zram has a free space.
> +        * Look at swap_entry_free.
> +        */
> +       if (si->flags & SWP_BLKDEV) {
> +               struct gendisk *disk = si->bdev->bd_disk;
> +
> +               if (disk->fops->swap_hint && disk->fops->swap_hint(
> +                               si->bdev, SWAP_FULL, NULL)) {
> +                       spin_lock(&swap_avail_lock);
> +                       WARN_ON(plist_node_empty(&si->avail_list));
> +                       plist_del(&si->avail_list, &swap_avail_head);
> +                       spin_unlock(&swap_avail_lock);
> +                       atomic_long_sub(si->pages - si->inuse_pages,
> +                                               &nr_swap_pages);
> +                       si->full = true;
> +                       return 0;
> +               }
> +       }
> +
> +       /*
>          * We try to cluster swap pages by allocating them sequentially
>          * in swap.  Once we've allocated SWAPFILE_CLUSTER pages this
>          * way, however, we resort to first-free allocation, starting
> @@ -798,6 +821,14 @@ static unsigned char swap_entry_free(struct swap_info_struct *p,
>         /* free if no reference */
>         if (!usage) {
>                 bool was_full;
> +               struct gendisk *virt_swap = NULL;
> +
> +               /* Check virtual swap */
> +               if (p->flags & SWP_BLKDEV) {
> +                       virt_swap = p->bdev->bd_disk;
> +                       if (!virt_swap->fops->swap_hint)

not a big deal, but can't you just combine these two if's to simplify this?

> +                               virt_swap = NULL;
> +               }
>
>                 dec_cluster_info_page(p, p->cluster_info, offset);
>                 if (offset < p->lowest_bit)
> @@ -814,17 +845,18 @@ static unsigned char swap_entry_free(struct swap_info_struct *p,
>                                           &swap_avail_head);
>                         spin_unlock(&swap_avail_lock);
>                         p->full = false;
> +                       if (virt_swap)
> +                               atomic_long_add(p->pages -
> +                                               p->inuse_pages,
> +                                               &nr_swap_pages);

a comment here might be good, to clarify it relies on the check at the
top of scan_swap_map previously subtracting the same number of pages.
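
e.g. something along these lines (the wording is only a suggestion):

	/*
	 * scan_swap_map() subtracted the remaining free space
	 * (pages - inuse_pages) from nr_swap_pages when it marked
	 * this device full; add the same amount back now that the
	 * device has free space again.
	 */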


>                 }
>
>                 atomic_long_inc(&nr_swap_pages);
>                 p->inuse_pages--;
>                 frontswap_invalidate_page(p->type, offset);
> -               if (p->flags & SWP_BLKDEV) {
> -                       struct gendisk *disk = p->bdev->bd_disk;
> -                       if (disk->fops->swap_hint)
> -                               disk->fops->swap_hint(p->bdev,
> -                                               SWAP_FREE, (void *)offset);
> -               }
> +               if (virt_swap)
> +                       virt_swap->fops->swap_hint(p->bdev,
> +                                       SWAP_FREE, (void *)offset);
>         }
>
>         return usage;
> --
> 2.0.0
>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v1 4/5] zram: add swap full hint
  2014-09-23 21:17       ` Andrew Morton
  2014-09-24  7:57         ` Minchan Kim
@ 2014-09-24 15:10         ` Jerome Marchand
  2014-09-25  1:07           ` Minchan Kim
  1 sibling, 1 reply; 30+ messages in thread
From: Jerome Marchand @ 2014-09-24 15:10 UTC (permalink / raw)
  To: Andrew Morton, Minchan Kim
  Cc: linux-kernel, linux-mm, Hugh Dickins, Shaohua Li,
	Sergey Senozhatsky, Dan Streetman, Nitin Gupta, Luigi Semenzato,
	juno.choi


On 09/23/2014 11:17 PM, Andrew Morton wrote:
> On Tue, 23 Sep 2014 13:56:02 +0900 Minchan Kim <minchan@kernel.org> wrote:
> 
>>>
>>>> +#define ZRAM_FULLNESS_PERCENT 80
>>>
>>> We've had problems in the past where 1% is just too large an increment
>>> for large systems.
>>
>> So, do you want fullness_bytes like dirty_bytes?
> 
> Firstly I'd like you to think about whether we're ever likely to have
> similar granularity problems with this tunable.  If not then forget
> about it.
> 
> If yes then we should do something.  I don't like the "bytes" thing
> much because it requires that the operator know the pool size
> beforehand, and any time that changes, the "bytes" needs hanging too. 
> Ratios are nice but percent is too coarse.  Maybe kernel should start
> using "ppm" for ratios, parts per million.  hrm.

Another possibility is to use decimal fractions. AFAIK, lustre fs
already uses them for its procfs entries.

> 
>>>> @@ -711,6 +732,7 @@ static void zram_reset_device(struct zram *zram, bool reset_capacity)
>>>>  	down_write(&zram->init_lock);
>>>>  
>>>>  	zram->limit_pages = 0;
>>>> +	atomic_set(&zram->alloc_fail, 0);
>>>>  
>>>>  	if (!init_done(zram)) {
>>>>  		up_write(&zram->init_lock);
>>>> @@ -944,6 +966,34 @@ static int zram_slot_free_notify(struct block_device *bdev,
>>>>  	return 0;
>>>>  }
>>>>  
>>>> +static int zram_full(struct block_device *bdev, void *arg)
>>>
>>> This could return a bool.  That implies that zram_swap_hint should
>>> return bool too, but as we haven't been told what the zram_swap_hint
>>> return value does, I'm a bit stumped.
>>
>> Hmm, currently, SWAP_FREE doesn't use return and SWAP_FULL uses return
>> as bool so in the end, we can change it as bool but I want to remain it
>> as int for the future. At least, we might use it as propagating error
>> in future. Instead, I will use *arg to return the result instead of
>> return val. But I'm not strong so if you want to remove return val,
>> I will do it. For clarifictaion, please tell me again if you want.
> 
> I'm easy, as long as it makes sense, is understandable by people other
> than he-who-wrote-it and doesn't use argument names such as "arg".
> 
> 




^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v1 4/5] zram: add swap full hint
  2014-09-24 14:01   ` Dan Streetman
@ 2014-09-25  1:02     ` Minchan Kim
  2014-09-25 15:52       ` Dan Streetman
  0 siblings, 1 reply; 30+ messages in thread
From: Minchan Kim @ 2014-09-25  1:02 UTC (permalink / raw)
  To: Dan Streetman
  Cc: Andrew Morton, linux-kernel, Linux-MM, Hugh Dickins, Shaohua Li,
	Jerome Marchand, Sergey Senozhatsky, Nitin Gupta,
	Luigi Semenzato, juno.choi

On Wed, Sep 24, 2014 at 10:01:03AM -0400, Dan Streetman wrote:
> On Sun, Sep 21, 2014 at 8:03 PM, Minchan Kim <minchan@kernel.org> wrote:
> > This patch implement SWAP_FULL handler in zram so that VM can
> > know whether zram is full or not and use it to stop anonymous
> > page reclaim.
> >
> > How to judge fullness is below,
> >
> > fullness = (100 * used space / total space)
> >
> > It means the higher fullness is, the slower we reach zram full.
> > Now, default of fullness is 80 so that it biased more momory
> > consumption rather than early OOM kill.
> >
> > Above logic works only when used space of zram hit over the limit
> > but zram also pretend to be full once 32 consecutive allocation
> > fail happens. It's safe guard to prevent system hang caused by
> > fragment uncertainty.
> >
> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > ---
> >  drivers/block/zram/zram_drv.c | 60 ++++++++++++++++++++++++++++++++++++++++---
> >  drivers/block/zram/zram_drv.h |  1 +
> >  2 files changed, 57 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> > index 22a37764c409..649cad9d0b1c 100644
> > --- a/drivers/block/zram/zram_drv.c
> > +++ b/drivers/block/zram/zram_drv.c
> > @@ -43,6 +43,20 @@ static const char *default_compressor = "lzo";
> >  /* Module params (documentation at end) */
> >  static unsigned int num_devices = 1;
> >
> > +/*
> > + * If (100 * used_pages / total_pages) >= ZRAM_FULLNESS_PERCENT),
> > + * we regards it as zram-full. It means that the higher
> > + * ZRAM_FULLNESS_PERCENT is, the slower we reach zram full.
> > + */
> > +#define ZRAM_FULLNESS_PERCENT 80
> 
> As Andrew said, this (or the user-configurable fullness param from the
> next patch) should have more detail about exactly why it's needed and
> what it does.  The details of how zram considers itself "full" should
> be clear, which probably includes explaining zsmalloc fragmentation.
> It should be also clear this param only matters when limit_pages is
> set, and this param is only checked when zsmalloc's total size has
> reached that limit.

Sure, how about this?

		The fullness file is read/write and specifies how easily
		zram reaches the full state. Normally we would consider
		"full" to mean that all of the memory is consumed, but it
		is not that simple with zram: by zsmalloc's internal
		design, a write can fail once the number of *pages*
		consumed by zram reaches the mem_limit and zsmalloc has no
		empty slot of the compressed object's size left in the
		fragmented space, even though it still has empty slots for
		other sizes.

		We regard zram as full once the number of consumed *pages*
		reaches the mem_limit and the memory consumed so far is
		higher than the value derived from this knob. So if you
		set the value high, you can squeeze more pages into the
		fragmented space and avoid an early OOM, at the cost of
		more write-fail warnings, overhead for the VM to recover
		from failed writes, and reclaim latency. If you set the
		value low, you may see an OOM kill even though there is
		still memory space in zram, but you avoid the shortcomings
		mentioned above.

		This knob is only valid if you set mem_limit.
		Currently the initial value is 80%, but that could change.

I haven't decided yet how to move away from percent.
The decimal fraction Jerome mentioned makes sense to me, so please
ignore the percent part above.
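
To make it concrete with made-up numbers under the current percent
semantics: with mem_limit = 100M and fullness = 80, zram only reports
"full" to the VM once the zsmalloc pool has grown to the 100M limit
*and* compressed data accounts for at least 80M of it; the remaining
~20M is fragmented space that further writes may or may not still fit
into. A higher value squeezes harder (more write-fail/recover overhead
before OOM), a lower value reports full and fails over earlier.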

> 
> Also, since the next patch changes it to be used only as a default,
> shouldn't it be DEFAULT_ZRAM_FULLNESS_PERCENT or similar?

Okay, I will do it in 5/5.

> 
> > +
> > +/*
> > + * If zram fails to allocate memory consecutively up to this,
> > + * we regard it as zram-full. It's safe guard to prevent too
> > + * many swap write fail due to lack of fragmentation uncertainty.
> > + */
> > +#define ALLOC_FAIL_MAX 32
> > +
> >  #define ZRAM_ATTR_RO(name)                                             \
> >  static ssize_t zram_attr_##name##_show(struct device *d,               \
> >                                 struct device_attribute *attr, char *b) \
> > @@ -148,6 +162,7 @@ static ssize_t mem_limit_store(struct device *dev,
> >
> >         down_write(&zram->init_lock);
> >         zram->limit_pages = PAGE_ALIGN(limit) >> PAGE_SHIFT;
> > +       atomic_set(&zram->alloc_fail, 0);
> >         up_write(&zram->init_lock);
> >
> >         return len;
> > @@ -410,6 +425,7 @@ static void zram_free_page(struct zram *zram, size_t index)
> >         atomic64_sub(zram_get_obj_size(meta, index),
> >                         &zram->stats.compr_data_size);
> >         atomic64_dec(&zram->stats.pages_stored);
> > +       atomic_set(&zram->alloc_fail, 0);
> >
> >         meta->table[index].handle = 0;
> >         zram_set_obj_size(meta, index, 0);
> > @@ -597,10 +613,15 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
> >         }
> >
> >         alloced_pages = zs_get_total_pages(meta->mem_pool);
> > -       if (zram->limit_pages && alloced_pages > zram->limit_pages) {
> > -               zs_free(meta->mem_pool, handle);
> > -               ret = -ENOMEM;
> > -               goto out;
> > +       if (zram->limit_pages) {
> > +               if (alloced_pages > zram->limit_pages) {
> > +                       zs_free(meta->mem_pool, handle);
> > +                       atomic_inc(&zram->alloc_fail);
> > +                       ret = -ENOMEM;
> > +                       goto out;
> > +               } else {
> > +                       atomic_set(&zram->alloc_fail, 0);
> > +               }
> 
> So, with zram_full() checking for alloced_pages >= limit_pages, this
> will need to be changed; the way it is now it prevents that from ever
> being true.
> 
> Instead I believe this check has to be moved to before zs_malloc(), so
> that alloced_pages > limit_pages is true.

I don't get why you said "it prevents that from ever being true".
Right now, zram can use memory up to limit_pages (ie, used memory ==
zram->limit_pages) and trying to get more fails, and zram_full() checks
total_pages >= zram->limit_pages, so what is the problem?
If I'm missing your point, could you explain more?

> 
> 
> >         }
> >
> >         update_used_max(zram, alloced_pages);
> > @@ -711,6 +732,7 @@ static void zram_reset_device(struct zram *zram, bool reset_capacity)
> >         down_write(&zram->init_lock);
> >
> >         zram->limit_pages = 0;
> > +       atomic_set(&zram->alloc_fail, 0);
> >
> >         if (!init_done(zram)) {
> >                 up_write(&zram->init_lock);
> > @@ -944,6 +966,34 @@ static int zram_slot_free_notify(struct block_device *bdev,
> >         return 0;
> >  }
> >
> > +static int zram_full(struct block_device *bdev, void *arg)
> > +{
> > +       struct zram *zram;
> > +       struct zram_meta *meta;
> > +       unsigned long total_pages, compr_pages;
> > +
> > +       zram = bdev->bd_disk->private_data;
> > +       if (!zram->limit_pages)
> > +               return 0;
> > +
> > +       meta = zram->meta;
> > +       total_pages = zs_get_total_pages(meta->mem_pool);
> > +
> > +       if (total_pages >= zram->limit_pages) {
> > +
> > +               compr_pages = atomic64_read(&zram->stats.compr_data_size)
> > +                                       >> PAGE_SHIFT;
> > +               if ((100 * compr_pages / total_pages)
> > +                       >= ZRAM_FULLNESS_PERCENT)
> > +                       return 1;
> > +       }
> > +
> > +       if (atomic_read(&zram->alloc_fail) > ALLOC_FAIL_MAX)
> > +               return 1;
> > +
> > +       return 0;
> > +}
> > +
> >  static int zram_swap_hint(struct block_device *bdev,
> >                                 unsigned int hint, void *arg)
> >  {
> > @@ -951,6 +1001,8 @@ static int zram_swap_hint(struct block_device *bdev,
> >
> >         if (hint == SWAP_FREE)
> >                 ret = zram_slot_free_notify(bdev, (unsigned long)arg);
> > +       else if (hint == SWAP_FULL)
> > +               ret = zram_full(bdev, arg);
> >
> >         return ret;
> >  }
> > diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
> > index c6ee271317f5..fcf3176a9f15 100644
> > --- a/drivers/block/zram/zram_drv.h
> > +++ b/drivers/block/zram/zram_drv.h
> > @@ -113,6 +113,7 @@ struct zram {
> >         u64 disksize;   /* bytes */
> >         int max_comp_streams;
> >         struct zram_stats stats;
> > +       atomic_t alloc_fail;
> >         /*
> >          * the number of pages zram can consume for storing compressed data
> >          */
> > --
> > 2.0.0
> >
> 

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v1 3/5] mm: VM can be aware of zram fullness
  2014-09-24 14:12   ` Dan Streetman
@ 2014-09-25  1:06     ` Minchan Kim
  2014-09-25  1:31       ` Dan Streetman
  0 siblings, 1 reply; 30+ messages in thread
From: Minchan Kim @ 2014-09-25  1:06 UTC (permalink / raw)
  To: Dan Streetman
  Cc: Andrew Morton, linux-kernel, Linux-MM, Hugh Dickins, Shaohua Li,
	Jerome Marchand, Sergey Senozhatsky, Nitin Gupta,
	Luigi Semenzato, juno.choi

On Wed, Sep 24, 2014 at 10:12:18AM -0400, Dan Streetman wrote:
> On Sun, Sep 21, 2014 at 8:03 PM, Minchan Kim <minchan@kernel.org> wrote:
> > VM uses nr_swap_pages to throttle amount of swap when it reclaims
> > anonymous pages because the nr_swap_pages means freeable space
> > of swap disk.
> >
> > However, it's a problem for zram because zram can limit memory
> > usage by knob(ie, mem_limit) so that swap out can fail although
> > VM can see lots of free space from zram disk but no more free
> > space in zram by the limit. If it happens, VM should notice it
> > and stop reclaimaing until zram can obtain more free space but
> > we don't have a way to communicate between VM and zram.
> >
> > This patch adds new hint SWAP_FULL so that zram can say to VM
> > "I'm full" from now on. Then VM cannot reclaim annoymous page
> > any more. If VM notice swap is full, it can remove swap_info_struct
> > from swap_avail_head and substract remained freeable space from
> > nr_swap_pages so that VM can think swap is full until VM frees a
> > swap and increase nr_swap_pages again.
> >
> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > ---
> >  include/linux/blkdev.h |  1 +
> >  mm/swapfile.c          | 44 ++++++++++++++++++++++++++++++++++++++------
> >  2 files changed, 39 insertions(+), 6 deletions(-)
> >
> > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> > index c7220409456c..39f074e0acd7 100644
> > --- a/include/linux/blkdev.h
> > +++ b/include/linux/blkdev.h
> > @@ -1611,6 +1611,7 @@ static inline bool blk_integrity_is_initialized(struct gendisk *g)
> >
> >  enum swap_blk_hint {
> >         SWAP_FREE,
> > +       SWAP_FULL,
> >  };
> >
> >  struct block_device_operations {
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 209112cf8b83..71e3df0431b6 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -493,6 +493,29 @@ static unsigned long scan_swap_map(struct swap_info_struct *si,
> >         int latency_ration = LATENCY_LIMIT;
> >
> >         /*
> > +        * If zram is full, we don't need to scan and want to stop swap.
> > +        * For it, we removes si from swap_avail_head and decreases
> > +        * nr_swap_pages to prevent further anonymous reclaim so that
> > +        * VM can restart swap out if zram has a free space.
> > +        * Look at swap_entry_free.
> > +        */
> > +       if (si->flags & SWP_BLKDEV) {
> > +               struct gendisk *disk = si->bdev->bd_disk;
> > +
> > +               if (disk->fops->swap_hint && disk->fops->swap_hint(
> > +                               si->bdev, SWAP_FULL, NULL)) {
> > +                       spin_lock(&swap_avail_lock);
> > +                       WARN_ON(plist_node_empty(&si->avail_list));
> > +                       plist_del(&si->avail_list, &swap_avail_head);
> > +                       spin_unlock(&swap_avail_lock);
> > +                       atomic_long_sub(si->pages - si->inuse_pages,
> > +                                               &nr_swap_pages);
> > +                       si->full = true;
> > +                       return 0;
> > +               }
> > +       }
> > +
> > +       /*
> >          * We try to cluster swap pages by allocating them sequentially
> >          * in swap.  Once we've allocated SWAPFILE_CLUSTER pages this
> >          * way, however, we resort to first-free allocation, starting
> > @@ -798,6 +821,14 @@ static unsigned char swap_entry_free(struct swap_info_struct *p,
> >         /* free if no reference */
> >         if (!usage) {
> >                 bool was_full;
> > +               struct gendisk *virt_swap = NULL;
> > +
> > +               /* Check virtual swap */
> > +               if (p->flags & SWP_BLKDEV) {
> > +                       virt_swap = p->bdev->bd_disk;
> > +                       if (!virt_swap->fops->swap_hint)
> 
> not a big deal, but can't you just combine these two if's to simplify this?

Do you want this?

if (p->flags & SWP_BLKDEV && p->bdev->bd_disk->fops->swap_hint)
        virt_swap = p->bdev->bd_disk;


> 
> > +                               virt_swap = NULL;
> > +               }
> >
> >                 dec_cluster_info_page(p, p->cluster_info, offset);
> >                 if (offset < p->lowest_bit)
> > @@ -814,17 +845,18 @@ static unsigned char swap_entry_free(struct swap_info_struct *p,
> >                                           &swap_avail_head);
> >                         spin_unlock(&swap_avail_lock);
> >                         p->full = false;
> > +                       if (virt_swap)
> > +                               atomic_long_add(p->pages -
> > +                                               p->inuse_pages,
> > +                                               &nr_swap_pages);
> 
> a comment here might be good, to clarify it relies on the check at the
> top of scan_swap_map previously subtracting the same number of pages.

Okay.

> 
> 
> >                 }
> >
> >                 atomic_long_inc(&nr_swap_pages);
> >                 p->inuse_pages--;
> >                 frontswap_invalidate_page(p->type, offset);
> > -               if (p->flags & SWP_BLKDEV) {
> > -                       struct gendisk *disk = p->bdev->bd_disk;
> > -                       if (disk->fops->swap_hint)
> > -                               disk->fops->swap_hint(p->bdev,
> > -                                               SWAP_FREE, (void *)offset);
> > -               }
> > +               if (virt_swap)
> > +                       virt_swap->fops->swap_hint(p->bdev,
> > +                                       SWAP_FREE, (void *)offset);
> >         }
> >
> >         return usage;
> > --
> > 2.0.0
> >
> 

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v1 4/5] zram: add swap full hint
  2014-09-24 15:10         ` Jerome Marchand
@ 2014-09-25  1:07           ` Minchan Kim
  0 siblings, 0 replies; 30+ messages in thread
From: Minchan Kim @ 2014-09-25  1:07 UTC (permalink / raw)
  To: Jerome Marchand
  Cc: Andrew Morton, linux-kernel, linux-mm, Hugh Dickins, Shaohua Li,
	Sergey Senozhatsky, Dan Streetman, Nitin Gupta, Luigi Semenzato,
	juno.choi

On Wed, Sep 24, 2014 at 05:10:22PM +0200, Jerome Marchand wrote:
> On 09/23/2014 11:17 PM, Andrew Morton wrote:
> > On Tue, 23 Sep 2014 13:56:02 +0900 Minchan Kim <minchan@kernel.org> wrote:
> > 
> >>>
> >>>> +#define ZRAM_FULLNESS_PERCENT 80
> >>>
> >>> We've had problems in the past where 1% is just too large an increment
> >>> for large systems.
> >>
> >> So, do you want fullness_bytes like dirty_bytes?
> > 
> > Firstly I'd like you to think about whether we're ever likely to have
> > similar granularity problems with this tunable.  If not then forget
> > about it.
> > 
> > If yes then we should do something.  I don't like the "bytes" thing
> > much because it requires that the operator know the pool size
> > beforehand, and any time that changes, the "bytes" needs hanging too. 
> > Ratios are nice but percent is too coarse.  Maybe kernel should start
> > using "ppm" for ratios, parts per million.  hrm.
> 
> An other possibility is to use decimal fractions. AFAIK, lustre fs uses
> them already for its procfs entries.

Looks good to me. If no one has a better idea or an objection,
I want to take this approach.
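
What I have in mind is roughly this (a hypothetical sketch only;
fullness_parse() does not exist in the current patchset, and the value
would be stored as per-mille internally, e.g. "80.5" -> 805):

	/* hypothetical helper, not in the patchset */
	static int fullness_parse(const char *buf, unsigned long *permille)
	{
		unsigned long whole = 0, tenth = 0;

		/* accept "80" or "80.5" */
		if (sscanf(buf, "%lu.%1lu", &whole, &tenth) < 1)
			return -EINVAL;
		if (whole > 100 || (whole == 100 && tenth))
			return -EINVAL;
		*permille = whole * 10 + tenth;
		return 0;
	}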

Thanks for the hint!

> 
> > 
> >>>> @@ -711,6 +732,7 @@ static void zram_reset_device(struct zram *zram, bool reset_capacity)
> >>>>  	down_write(&zram->init_lock);
> >>>>  
> >>>>  	zram->limit_pages = 0;
> >>>> +	atomic_set(&zram->alloc_fail, 0);
> >>>>  
> >>>>  	if (!init_done(zram)) {
> >>>>  		up_write(&zram->init_lock);
> >>>> @@ -944,6 +966,34 @@ static int zram_slot_free_notify(struct block_device *bdev,
> >>>>  	return 0;
> >>>>  }
> >>>>  
> >>>> +static int zram_full(struct block_device *bdev, void *arg)
> >>>
> >>> This could return a bool.  That implies that zram_swap_hint should
> >>> return bool too, but as we haven't been told what the zram_swap_hint
> >>> return value does, I'm a bit stumped.
> >>
> >> Hmm, currently, SWAP_FREE doesn't use return and SWAP_FULL uses return
> >> as bool so in the end, we can change it as bool but I want to remain it
> >> as int for the future. At least, we might use it as propagating error
> >> in future. Instead, I will use *arg to return the result instead of
> >> return val. But I'm not strong so if you want to remove return val,
> >> I will do it. For clarifictaion, please tell me again if you want.
> > 
> > I'm easy, as long as it makes sense, is understandable by people other
> > than he-who-wrote-it and doesn't use argument names such as "arg".
> > 
> > 
> 
> 



-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v1 3/5] mm: VM can be aware of zram fullness
  2014-09-25  1:06     ` Minchan Kim
@ 2014-09-25  1:31       ` Dan Streetman
  0 siblings, 0 replies; 30+ messages in thread
From: Dan Streetman @ 2014-09-25  1:31 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, Linux-MM, Hugh Dickins, Shaohua Li,
	Jerome Marchand, Sergey Senozhatsky, Nitin Gupta,
	Luigi Semenzato, juno.choi

On Wed, Sep 24, 2014 at 9:06 PM, Minchan Kim <minchan@kernel.org> wrote:
> On Wed, Sep 24, 2014 at 10:12:18AM -0400, Dan Streetman wrote:
>> On Sun, Sep 21, 2014 at 8:03 PM, Minchan Kim <minchan@kernel.org> wrote:
>> > VM uses nr_swap_pages to throttle amount of swap when it reclaims
>> > anonymous pages because the nr_swap_pages means freeable space
>> > of swap disk.
>> >
>> > However, it's a problem for zram because zram can limit memory
>> > usage by knob(ie, mem_limit) so that swap out can fail although
>> > VM can see lots of free space from zram disk but no more free
>> > space in zram by the limit. If it happens, VM should notice it
>> > and stop reclaimaing until zram can obtain more free space but
>> > we don't have a way to communicate between VM and zram.
>> >
>> > This patch adds new hint SWAP_FULL so that zram can say to VM
>> > "I'm full" from now on. Then VM cannot reclaim annoymous page
>> > any more. If VM notice swap is full, it can remove swap_info_struct
>> > from swap_avail_head and substract remained freeable space from
>> > nr_swap_pages so that VM can think swap is full until VM frees a
>> > swap and increase nr_swap_pages again.
>> >
>> > Signed-off-by: Minchan Kim <minchan@kernel.org>
>> > ---
>> >  include/linux/blkdev.h |  1 +
>> >  mm/swapfile.c          | 44 ++++++++++++++++++++++++++++++++++++++------
>> >  2 files changed, 39 insertions(+), 6 deletions(-)
>> >
>> > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>> > index c7220409456c..39f074e0acd7 100644
>> > --- a/include/linux/blkdev.h
>> > +++ b/include/linux/blkdev.h
>> > @@ -1611,6 +1611,7 @@ static inline bool blk_integrity_is_initialized(struct gendisk *g)
>> >
>> >  enum swap_blk_hint {
>> >         SWAP_FREE,
>> > +       SWAP_FULL,
>> >  };
>> >
>> >  struct block_device_operations {
>> > diff --git a/mm/swapfile.c b/mm/swapfile.c
>> > index 209112cf8b83..71e3df0431b6 100644
>> > --- a/mm/swapfile.c
>> > +++ b/mm/swapfile.c
>> > @@ -493,6 +493,29 @@ static unsigned long scan_swap_map(struct swap_info_struct *si,
>> >         int latency_ration = LATENCY_LIMIT;
>> >
>> >         /*
>> > +        * If zram is full, we don't need to scan and want to stop swap.
>> > +        * For it, we removes si from swap_avail_head and decreases
>> > +        * nr_swap_pages to prevent further anonymous reclaim so that
>> > +        * VM can restart swap out if zram has a free space.
>> > +        * Look at swap_entry_free.
>> > +        */
>> > +       if (si->flags & SWP_BLKDEV) {
>> > +               struct gendisk *disk = si->bdev->bd_disk;
>> > +
>> > +               if (disk->fops->swap_hint && disk->fops->swap_hint(
>> > +                               si->bdev, SWAP_FULL, NULL)) {
>> > +                       spin_lock(&swap_avail_lock);
>> > +                       WARN_ON(plist_node_empty(&si->avail_list));
>> > +                       plist_del(&si->avail_list, &swap_avail_head);
>> > +                       spin_unlock(&swap_avail_lock);
>> > +                       atomic_long_sub(si->pages - si->inuse_pages,
>> > +                                               &nr_swap_pages);
>> > +                       si->full = true;
>> > +                       return 0;
>> > +               }
>> > +       }
>> > +
>> > +       /*
>> >          * We try to cluster swap pages by allocating them sequentially
>> >          * in swap.  Once we've allocated SWAPFILE_CLUSTER pages this
>> >          * way, however, we resort to first-free allocation, starting
>> > @@ -798,6 +821,14 @@ static unsigned char swap_entry_free(struct swap_info_struct *p,
>> >         /* free if no reference */
>> >         if (!usage) {
>> >                 bool was_full;
>> > +               struct gendisk *virt_swap = NULL;
>> > +
>> > +               /* Check virtual swap */
>> > +               if (p->flags & SWP_BLKDEV) {
>> > +                       virt_swap = p->bdev->bd_disk;
>> > +                       if (!virt_swap->fops->swap_hint)
>>
>> not a big deal, but can't you just combine these two if's to simplify this?
>
> Do you want this?
>
> if (p->flags & SWP_BLKDEV && p->bdev->bd_disk->fops->swap_hint)
>         virt_swap = p->bdev->bd_disk;

Ah sorry, I wasn't paying attention there; you're right, accessing
virt_swap after you set it is probably better.


>
>
>
>>
>> > +                               virt_swap = NULL;
>> > +               }
>> >
>> >                 dec_cluster_info_page(p, p->cluster_info, offset);
>> >                 if (offset < p->lowest_bit)
>> > @@ -814,17 +845,18 @@ static unsigned char swap_entry_free(struct swap_info_struct *p,
>> >                                           &swap_avail_head);
>> >                         spin_unlock(&swap_avail_lock);
>> >                         p->full = false;
>> > +                       if (virt_swap)
>> > +                               atomic_long_add(p->pages -
>> > +                                               p->inuse_pages,
>> > +                                               &nr_swap_pages);
>>
>> a comment here might be good, to clarify it relies on the check at the
>> top of scan_swap_map previously subtracting the same number of pages.
>
> Okay.
>
>>
>>
>> >                 }
>> >
>> >                 atomic_long_inc(&nr_swap_pages);
>> >                 p->inuse_pages--;
>> >                 frontswap_invalidate_page(p->type, offset);
>> > -               if (p->flags & SWP_BLKDEV) {
>> > -                       struct gendisk *disk = p->bdev->bd_disk;
>> > -                       if (disk->fops->swap_hint)
>> > -                               disk->fops->swap_hint(p->bdev,
>> > -                                               SWAP_FREE, (void *)offset);
>> > -               }
>> > +               if (virt_swap)
>> > +                       virt_swap->fops->swap_hint(p->bdev,
>> > +                                       SWAP_FREE, (void *)offset);
>> >         }
>> >
>> >         return usage;
>> > --
>> > 2.0.0
>> >
>>
>
> --
> Kind regards,
> Minchan Kim

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v1 4/5] zram: add swap full hint
  2014-09-25  1:02     ` Minchan Kim
@ 2014-09-25 15:52       ` Dan Streetman
  2014-10-06 23:36         ` Minchan Kim
  0 siblings, 1 reply; 30+ messages in thread
From: Dan Streetman @ 2014-09-25 15:52 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, Linux-MM, Hugh Dickins, Shaohua Li,
	Jerome Marchand, Sergey Senozhatsky, Nitin Gupta,
	Luigi Semenzato, juno.choi

On Wed, Sep 24, 2014 at 9:02 PM, Minchan Kim <minchan@kernel.org> wrote:
> On Wed, Sep 24, 2014 at 10:01:03AM -0400, Dan Streetman wrote:
>> On Sun, Sep 21, 2014 at 8:03 PM, Minchan Kim <minchan@kernel.org> wrote:
>> > This patch implement SWAP_FULL handler in zram so that VM can
>> > know whether zram is full or not and use it to stop anonymous
>> > page reclaim.
>> >
>> > How to judge fullness is below,
>> >
>> > fullness = (100 * used space / total space)
>> >
>> > It means the higher fullness is, the slower we reach zram full.
>> > Now, default of fullness is 80 so that it biased more momory
>> > consumption rather than early OOM kill.
>> >
>> > Above logic works only when used space of zram hit over the limit
>> > but zram also pretend to be full once 32 consecutive allocation
>> > fail happens. It's safe guard to prevent system hang caused by
>> > fragment uncertainty.
>> >
>> > Signed-off-by: Minchan Kim <minchan@kernel.org>
>> > ---
>> >  drivers/block/zram/zram_drv.c | 60 ++++++++++++++++++++++++++++++++++++++++---
>> >  drivers/block/zram/zram_drv.h |  1 +
>> >  2 files changed, 57 insertions(+), 4 deletions(-)
>> >
>> > diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
>> > index 22a37764c409..649cad9d0b1c 100644
>> > --- a/drivers/block/zram/zram_drv.c
>> > +++ b/drivers/block/zram/zram_drv.c
>> > @@ -43,6 +43,20 @@ static const char *default_compressor = "lzo";
>> >  /* Module params (documentation at end) */
>> >  static unsigned int num_devices = 1;
>> >
>> > +/*
>> > + * If (100 * used_pages / total_pages) >= ZRAM_FULLNESS_PERCENT),
>> > + * we regards it as zram-full. It means that the higher
>> > + * ZRAM_FULLNESS_PERCENT is, the slower we reach zram full.
>> > + */
>> > +#define ZRAM_FULLNESS_PERCENT 80
>>
>> As Andrew said, this (or the user-configurable fullness param from the
>> next patch) should have more detail about exactly why it's needed and
>> what it does.  The details of how zram considers itself "full" should
>> be clear, which probably includes explaining zsmalloc fragmentation.
>> It should be also clear this param only matters when limit_pages is
>> set, and this param is only checked when zsmalloc's total size has
>> reached that limit.
>
> Sure, How about this?
>
>                 The fullness file is read/write and specifies how easily
>                 zram become full state. Normally, we can think "full"
>                 once all of memory is consumed but it's not simple with
>                 zram because zsmalloc has some issue by internal design
>                 so that write could fail once consumed *page* by zram
>                 reaches the mem_limit and zsmalloc cannot have a empty
>                 slot for the compressed object's size on fragmenet space
>                 although it has more empty slots for other sizes.

I understand that, but it might be confusing or unclear to anyone
who's not familiar with how zsmalloc works.

Maybe it could be explained by referencing the existing
compr_data_size and mem_used_total?  In addition to some or all of the
above, you could add something like:

This controls when zram decides that it is "full".  It is a percent
value, checked against compr_data_size / mem_used_total.  When
mem_used_total is equal to mem_limit, the fullness is checked and if
the compr_data_size / mem_used_total percentage is higher than this
specified fullness value, zram is considered "full".
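
(To illustrate with numbers, not from the patches: with mem_limit set to
100M, nothing is checked until mem_used_total reaches 100M; at that
point, if compr_data_size is, say, 85M, the ratio is 85%, which is above
the default of 80, so zram reports itself as "full".)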


>
>                 We regard zram as full once consumed *page* reaches the
>                 mem_limit and consumed memory until now is higher the value
>                 resulted from the knob. So, if you set the value high,
>                 you can squeeze more pages into fragment space so you could
>                 avoid early OOM while you could see more write-fail warning,
>                 overhead to fail-write recovering by VM and reclaim latency.
>                 If you set the value low, you can see OOM kill easily
>                 even though there are memory space in zram but you could
>                 avoid shortcomings mentioned above.

You should also clarify that this is currently only used by
swap-on-zram, and that this value prevents swap from writing to zram
once it is "full".  This setting has no effect when using zram for a
mounted filesystem.

>
>                 This knobs is valid ony if you set mem_limit.
>                 Currently, initial value is 80% but it could be changed.
>
> I didn't decide how to change it from percent.
> Decimal fraction Jerome mentioned does make sense to me so please ignore
> percent part in above.
>
>>
>> Also, since the next patch changes it to be used only as a default,
>> shouldn't it be DEFAULT_ZRAM_FULLNESS_PERCENT or similar?
>
> Okay, I will do it in 5/5.
>
>>
>> > +
>> > +/*
>> > + * If zram fails to allocate memory consecutively up to this,
>> > + * we regard it as zram-full. It's safe guard to prevent too
>> > + * many swap write fail due to lack of fragmentation uncertainty.
>> > + */
>> > +#define ALLOC_FAIL_MAX 32
>> > +
>> >  #define ZRAM_ATTR_RO(name)                                             \
>> >  static ssize_t zram_attr_##name##_show(struct device *d,               \
>> >                                 struct device_attribute *attr, char *b) \
>> > @@ -148,6 +162,7 @@ static ssize_t mem_limit_store(struct device *dev,
>> >
>> >         down_write(&zram->init_lock);
>> >         zram->limit_pages = PAGE_ALIGN(limit) >> PAGE_SHIFT;
>> > +       atomic_set(&zram->alloc_fail, 0);
>> >         up_write(&zram->init_lock);
>> >
>> >         return len;
>> > @@ -410,6 +425,7 @@ static void zram_free_page(struct zram *zram, size_t index)
>> >         atomic64_sub(zram_get_obj_size(meta, index),
>> >                         &zram->stats.compr_data_size);
>> >         atomic64_dec(&zram->stats.pages_stored);
>> > +       atomic_set(&zram->alloc_fail, 0);
>> >
>> >         meta->table[index].handle = 0;
>> >         zram_set_obj_size(meta, index, 0);
>> > @@ -597,10 +613,15 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
>> >         }
>> >
>> >         alloced_pages = zs_get_total_pages(meta->mem_pool);
>> > -       if (zram->limit_pages && alloced_pages > zram->limit_pages) {
>> > -               zs_free(meta->mem_pool, handle);
>> > -               ret = -ENOMEM;
>> > -               goto out;
>> > +       if (zram->limit_pages) {
>> > +               if (alloced_pages > zram->limit_pages) {
>> > +                       zs_free(meta->mem_pool, handle);
>> > +                       atomic_inc(&zram->alloc_fail);
>> > +                       ret = -ENOMEM;
>> > +                       goto out;
>> > +               } else {
>> > +                       atomic_set(&zram->alloc_fail, 0);
>> > +               }
>>
>> So, with zram_full() checking for alloced_pages >= limit_pages, this
>> will need to be changed; the way it is now it prevents that from ever
>> being true.
>>
>> Instead I believe this check has to be moved to before zs_malloc(), so
>> that alloced_pages > limit_pages is true.
>
> I don't get it why you said "it prevents that from ever being true".
> Now, zram can use up until limit_pages (ie, used memory == zram->limit_pages)
> and trying to get more is failed. so zram_full checks it as
> toal_pages >= zram->limit_pages so what is problem?
> If I miss your point, could you explain more?

OK, that's true, it's possible for alloc_pages == limit_pages. But
since zsmalloc grows by a full zspage, which can be anywhere between 1
and 4 pages in size, there's only a (very roughly) 25% chance that a
given alloc will make alloc_pages exactly equal to limit_pages; it's
more likely that an alloc will push alloc_pages past limit_pages.
After some number of write failures that 25% (-ish) probability will
eventually be met and alloc_pages == limit_pages will happen, but
there's a rather high chance of some write failures first.

To summarize or restate that: for users who don't care about some
write failures, and/or users with no swap device other than zram, it
probably does not matter.  However, such users will probably rely on
the 32-write-failure limit, and not the fullness limit.  For users
where zram is merely the primary swap device and there is a backup
swap device behind it, they will probably want zram to fail over to
the backup fairly quickly, with as few write failures as possible
(preferably none, I would think).  And this situation makes that
highly unlikely - since there's only about a 25% chance of
alloc_pages == limit_pages with no previous write failures, it's
almost certain that there will be write failures before zram is
decided to be "full", even if "fullness" is set to 0.

With that said, you're right that it will eventually work, and those
few write failures while trying to get to alloc_pages == limit_pages
would probably not be noticable.  However, do remember that zram won't
stay full forever, so if it is only the primary swap device, it's
likely it will move between "full" and "not full" quite a lot, and
those few write failures may start adding up.

I suppose testing would show if those few write failures are
significant.  Also, if nobody ever uses zram with a backup (real disk)
secondary swap device, then it likely doesn't matter.
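
(As a concrete illustration of the granularity: with limit_pages = 1000
and the pool currently at 998 pages, a 1-page zspage keeps it below the
limit and a 2-page zspage lands exactly on it, but a 3- or 4-page zspage
would push the pool over the limit, so the write fails, the zspage is
freed, and the total_pages >= limit_pages check remains false at 998.)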

>
>>
>>
>> >         }
>> >
>> >         update_used_max(zram, alloced_pages);
>> > @@ -711,6 +732,7 @@ static void zram_reset_device(struct zram *zram, bool reset_capacity)
>> >         down_write(&zram->init_lock);
>> >
>> >         zram->limit_pages = 0;
>> > +       atomic_set(&zram->alloc_fail, 0);
>> >
>> >         if (!init_done(zram)) {
>> >                 up_write(&zram->init_lock);
>> > @@ -944,6 +966,34 @@ static int zram_slot_free_notify(struct block_device *bdev,
>> >         return 0;
>> >  }
>> >
>> > +static int zram_full(struct block_device *bdev, void *arg)
>> > +{
>> > +       struct zram *zram;
>> > +       struct zram_meta *meta;
>> > +       unsigned long total_pages, compr_pages;
>> > +
>> > +       zram = bdev->bd_disk->private_data;
>> > +       if (!zram->limit_pages)
>> > +               return 0;
>> > +
>> > +       meta = zram->meta;
>> > +       total_pages = zs_get_total_pages(meta->mem_pool);
>> > +
>> > +       if (total_pages >= zram->limit_pages) {
>> > +
>> > +               compr_pages = atomic64_read(&zram->stats.compr_data_size)
>> > +                                       >> PAGE_SHIFT;
>> > +               if ((100 * compr_pages / total_pages)
>> > +                       >= ZRAM_FULLNESS_PERCENT)
>> > +                       return 1;
>> > +       }
>> > +
>> > +       if (atomic_read(&zram->alloc_fail) > ALLOC_FAIL_MAX)
>> > +               return 1;
>> > +
>> > +       return 0;
>> > +}
>> > +
>> >  static int zram_swap_hint(struct block_device *bdev,
>> >                                 unsigned int hint, void *arg)
>> >  {
>> > @@ -951,6 +1001,8 @@ static int zram_swap_hint(struct block_device *bdev,
>> >
>> >         if (hint == SWAP_FREE)
>> >                 ret = zram_slot_free_notify(bdev, (unsigned long)arg);
>> > +       else if (hint == SWAP_FULL)
>> > +               ret = zram_full(bdev, arg);
>> >
>> >         return ret;
>> >  }
>> > diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
>> > index c6ee271317f5..fcf3176a9f15 100644
>> > --- a/drivers/block/zram/zram_drv.h
>> > +++ b/drivers/block/zram/zram_drv.h
>> > @@ -113,6 +113,7 @@ struct zram {
>> >         u64 disksize;   /* bytes */
>> >         int max_comp_streams;
>> >         struct zram_stats stats;
>> > +       atomic_t alloc_fail;
>> >         /*
>> >          * the number of pages zram can consume for storing compressed data
>> >          */
>> > --
>> > 2.0.0
>> >
>>
>
> --
> Kind regards,
> Minchan Kim

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v1 4/5] zram: add swap full hint
  2014-09-25 15:52       ` Dan Streetman
@ 2014-10-06 23:36         ` Minchan Kim
  2014-10-06 23:46           ` Minchan Kim
  0 siblings, 1 reply; 30+ messages in thread
From: Minchan Kim @ 2014-10-06 23:36 UTC (permalink / raw)
  To: Dan Streetman
  Cc: Andrew Morton, linux-kernel, Linux-MM, Hugh Dickins, Shaohua Li,
	Jerome Marchand, Sergey Senozhatsky, Nitin Gupta,
	Luigi Semenzato, juno.choi

Hello Dan,

Sorry for the delay. I had internal work that needed to be handled
urgently. I hope you haven't lost interest because of my slow response.

On Thu, Sep 25, 2014 at 11:52:22AM -0400, Dan Streetman wrote:
> On Wed, Sep 24, 2014 at 9:02 PM, Minchan Kim <minchan@kernel.org> wrote:
> > On Wed, Sep 24, 2014 at 10:01:03AM -0400, Dan Streetman wrote:
> >> On Sun, Sep 21, 2014 at 8:03 PM, Minchan Kim <minchan@kernel.org> wrote:
> >> > This patch implement SWAP_FULL handler in zram so that VM can
> >> > know whether zram is full or not and use it to stop anonymous
> >> > page reclaim.
> >> >
> >> > How to judge fullness is below,
> >> >
> >> > fullness = (100 * used space / total space)
> >> >
> >> > It means the higher fullness is, the slower we reach zram full.
> >> > Now, default of fullness is 80 so that it biased more momory
> >> > consumption rather than early OOM kill.
> >> >
> >> > Above logic works only when used space of zram hit over the limit
> >> > but zram also pretend to be full once 32 consecutive allocation
> >> > fail happens. It's safe guard to prevent system hang caused by
> >> > fragment uncertainty.
> >> >
> >> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> >> > ---
> >> >  drivers/block/zram/zram_drv.c | 60 ++++++++++++++++++++++++++++++++++++++++---
> >> >  drivers/block/zram/zram_drv.h |  1 +
> >> >  2 files changed, 57 insertions(+), 4 deletions(-)
> >> >
> >> > diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> >> > index 22a37764c409..649cad9d0b1c 100644
> >> > --- a/drivers/block/zram/zram_drv.c
> >> > +++ b/drivers/block/zram/zram_drv.c
> >> > @@ -43,6 +43,20 @@ static const char *default_compressor = "lzo";
> >> >  /* Module params (documentation at end) */
> >> >  static unsigned int num_devices = 1;
> >> >
> >> > +/*
> >> > + * If (100 * used_pages / total_pages) >= ZRAM_FULLNESS_PERCENT),
> >> > + * we regards it as zram-full. It means that the higher
> >> > + * ZRAM_FULLNESS_PERCENT is, the slower we reach zram full.
> >> > + */
> >> > +#define ZRAM_FULLNESS_PERCENT 80
> >>
> >> As Andrew said, this (or the user-configurable fullness param from the
> >> next patch) should have more detail about exactly why it's needed and
> >> what it does.  The details of how zram considers itself "full" should
> >> be clear, which probably includes explaining zsmalloc fragmentation.
> >> It should be also clear this param only matters when limit_pages is
> >> set, and this param is only checked when zsmalloc's total size has
> >> reached that limit.
> >
> > Sure, How about this?
> >
> >                 The fullness file is read/write and specifies how easily
> >                 zram become full state. Normally, we can think "full"
> >                 once all of memory is consumed but it's not simple with
> >                 zram because zsmalloc has some issue by internal design
> >                 so that write could fail once consumed *page* by zram
> >                 reaches the mem_limit and zsmalloc cannot have a empty
> >                 slot for the compressed object's size on fragmenet space
> >                 although it has more empty slots for other sizes.
> 
> I understand that, but it might be confusing or unclear to anyone
> who's not familiar with how zsmalloc works.
> 
> Maybe it could be explained by referencing the existing
> compr_data_size and mem_used_total?  In addition to some or all of the
> above, you could add something like:
> 
> This controls when zram decides that it is "full".  It is a percent
> value, checked against compr_data_size / mem_used_total.  When
> mem_used_total is equal to mem_limit, the fullness is checked and if
> the compr_data_size / mem_used_total percentage is higher than this
> specified fullness value, zram is considered "full".

Better than my verbose version.

> 
> 
> >
> >                 We regard zram as full once consumed *page* reaches the
> >                 mem_limit and consumed memory until now is higher the value
> >                 resulted from the knob. So, if you set the value high,
> >                 you can squeeze more pages into fragment space so you could
> >                 avoid early OOM while you could see more write-fail warning,
> >                 overhead to fail-write recovering by VM and reclaim latency.
> >                 If you set the value low, you can see OOM kill easily
> >                 even though there are memory space in zram but you could
> >                 avoid shortcomings mentioned above.
> 
> You should clarify also that this is currently only used by
> swap-on-zram, and this value prevents swap from writing to zram once
> it is "full".  This setting has no effect when using zram for a
> mounted filesystem.

Sure.

> 
> >
> >                 This knobs is valid ony if you set mem_limit.
> >                 Currently, initial value is 80% but it could be changed.
> >
> > I didn't decide how to change it from percent.
> > Decimal fraction Jerome mentioned does make sense to me so please ignore
> > percent part in above.
> >
> >>
> >> Also, since the next patch changes it to be used only as a default,
> >> shouldn't it be DEFAULT_ZRAM_FULLNESS_PERCENT or similar?
> >
> > Okay, I will do it in 5/5.
> >
> >>
> >> > +
> >> > +/*
> >> > + * If zram fails to allocate memory consecutively up to this,
> >> > + * we regard it as zram-full. It's safe guard to prevent too
> >> > + * many swap write fail due to lack of fragmentation uncertainty.
> >> > + */
> >> > +#define ALLOC_FAIL_MAX 32
> >> > +
> >> >  #define ZRAM_ATTR_RO(name)                                             \
> >> >  static ssize_t zram_attr_##name##_show(struct device *d,               \
> >> >                                 struct device_attribute *attr, char *b) \
> >> > @@ -148,6 +162,7 @@ static ssize_t mem_limit_store(struct device *dev,
> >> >
> >> >         down_write(&zram->init_lock);
> >> >         zram->limit_pages = PAGE_ALIGN(limit) >> PAGE_SHIFT;
> >> > +       atomic_set(&zram->alloc_fail, 0);
> >> >         up_write(&zram->init_lock);
> >> >
> >> >         return len;
> >> > @@ -410,6 +425,7 @@ static void zram_free_page(struct zram *zram, size_t index)
> >> >         atomic64_sub(zram_get_obj_size(meta, index),
> >> >                         &zram->stats.compr_data_size);
> >> >         atomic64_dec(&zram->stats.pages_stored);
> >> > +       atomic_set(&zram->alloc_fail, 0);
> >> >
> >> >         meta->table[index].handle = 0;
> >> >         zram_set_obj_size(meta, index, 0);
> >> > @@ -597,10 +613,15 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
> >> >         }
> >> >
> >> >         alloced_pages = zs_get_total_pages(meta->mem_pool);
> >> > -       if (zram->limit_pages && alloced_pages > zram->limit_pages) {
> >> > -               zs_free(meta->mem_pool, handle);
> >> > -               ret = -ENOMEM;
> >> > -               goto out;
> >> > +       if (zram->limit_pages) {
> >> > +               if (alloced_pages > zram->limit_pages) {
> >> > +                       zs_free(meta->mem_pool, handle);
> >> > +                       atomic_inc(&zram->alloc_fail);
> >> > +                       ret = -ENOMEM;
> >> > +                       goto out;
> >> > +               } else {
> >> > +                       atomic_set(&zram->alloc_fail, 0);
> >> > +               }
> >>
> >> So, with zram_full() checking for alloced_pages >= limit_pages, this
> >> will need to be changed; the way it is now it prevents that from ever
> >> being true.
> >>
> >> Instead I believe this check has to be moved to before zs_malloc(), so
> >> that alloced_pages > limit_pages is true.
> >
> > I don't get it why you said "it prevents that from ever being true".
> > Now, zram can use up until limit_pages (ie, used memory == zram->limit_pages)
> > and trying to get more is failed. so zram_full checks it as
> > toal_pages >= zram->limit_pages so what is problem?
> > If I miss your point, could you explain more?
> 
> ok, that's true, it's possible for alloc_pages == limit_pages, but
> since zsmalloc will increase its size by a full zspage, and those can
> be anywhere between 1 and 4 pages in size, it's only a (very roughly)
> 25% chance that an alloc will cause alloc_pages == limit_pages, it's
> more likely that an alloc will cause alloc_pages > limit_pages.  Now,
> after some number of write failures, that 25% (-ish) probability will
> be met, and alloc_pages == limit_pages will happen, but there's a
> rather high chance that there will be some number of write failures
> first.
> 
> To summarize or restate that, I guess what I'm saying is that for
> users who don't care about some write failures and/or users with no
> other swap devices except zram, it probably does not matter.  However
> for them, they probably will rely on the 32 write failure limit, and
> not the fullness limit.  For users where zram is only the primary swap
> device, and there is a backup swap device, they probably will want
> zram to fail over to the backup fairly quickly, with as few write
> failures as possible (preferably, none, I would think).  And this
> situation makes that highly unlikely - since there's only about a 25%
> chance of alloc_pages == limit_pages with no previous write failures,
> it's almost a certainty that there will be write failures before zram
> is decided to be "full", even if "fullness" is set to 0.
> 
> With that said, you're right that it will eventually work, and those
> few write failures while trying to get to alloc_pages == limit_pages
> would probably not be noticable.  However, do remember that zram won't
> stay full forever, so if it is only the primary swap device, it's
> likely it will move between "full" and "not full" quite a lot, and
> those few write failures may start adding up.

Fair enough.

But write failures are still possible even if we correct this,
because there are several ways zram can fail to allocate an order-0
page. One of them is CMA: I got several reports of zRAM failing to
allocate a movable page due to the lack of migration support, even
though userspace was otherwise running fine. I have a plan to fix it
with the zsmalloc migration work, but there are other ways an order-0
allocation can fail, so I don't think we can prevent write failures
completely unless we have reserved memory for zram.

Having said that, I agree it would be better to reduce such failures
with a small code change, so I will make zram_full check as follows:

/*
 * XXX: zsmalloc_maxpages check should be removed when zsmalloc
 * implement using of fragmented spaces in last page of zspage.
 */
if (total_pages >= zram->limit_pages - zsmalloc_maxpages()) {
        ...
}

It will cost 3 pages at maximum, but that shouldn't be a big deal.

> 
> I suppose testing would show if those few write failures are
> significant.  Also, if nobody ever uses zram with a backup (real disk)
> secondary swap device, then it likely doesn't matter.

I think it's a really reasonable scenario.

> 
> >
> >>
> >>
> >> >         }
> >> >
> >> >         update_used_max(zram, alloced_pages);
> >> > @@ -711,6 +732,7 @@ static void zram_reset_device(struct zram *zram, bool reset_capacity)
> >> >         down_write(&zram->init_lock);
> >> >
> >> >         zram->limit_pages = 0;
> >> > +       atomic_set(&zram->alloc_fail, 0);
> >> >
> >> >         if (!init_done(zram)) {
> >> >                 up_write(&zram->init_lock);
> >> > @@ -944,6 +966,34 @@ static int zram_slot_free_notify(struct block_device *bdev,
> >> >         return 0;
> >> >  }
> >> >
> >> > +static int zram_full(struct block_device *bdev, void *arg)
> >> > +{
> >> > +       struct zram *zram;
> >> > +       struct zram_meta *meta;
> >> > +       unsigned long total_pages, compr_pages;
> >> > +
> >> > +       zram = bdev->bd_disk->private_data;
> >> > +       if (!zram->limit_pages)
> >> > +               return 0;
> >> > +
> >> > +       meta = zram->meta;
> >> > +       total_pages = zs_get_total_pages(meta->mem_pool);
> >> > +
> >> > +       if (total_pages >= zram->limit_pages) {
> >> > +
> >> > +               compr_pages = atomic64_read(&zram->stats.compr_data_size)
> >> > +                                       >> PAGE_SHIFT;
> >> > +               if ((100 * compr_pages / total_pages)
> >> > +                       >= ZRAM_FULLNESS_PERCENT)
> >> > +                       return 1;
> >> > +       }
> >> > +
> >> > +       if (atomic_read(&zram->alloc_fail) > ALLOC_FAIL_MAX)
> >> > +               return 1;
> >> > +
> >> > +       return 0;
> >> > +}
> >> > +
> >> >  static int zram_swap_hint(struct block_device *bdev,
> >> >                                 unsigned int hint, void *arg)
> >> >  {
> >> > @@ -951,6 +1001,8 @@ static int zram_swap_hint(struct block_device *bdev,
> >> >
> >> >         if (hint == SWAP_FREE)
> >> >                 ret = zram_slot_free_notify(bdev, (unsigned long)arg);
> >> > +       else if (hint == SWAP_FULL)
> >> > +               ret = zram_full(bdev, arg);
> >> >
> >> >         return ret;
> >> >  }
> >> > diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
> >> > index c6ee271317f5..fcf3176a9f15 100644
> >> > --- a/drivers/block/zram/zram_drv.h
> >> > +++ b/drivers/block/zram/zram_drv.h
> >> > @@ -113,6 +113,7 @@ struct zram {
> >> >         u64 disksize;   /* bytes */
> >> >         int max_comp_streams;
> >> >         struct zram_stats stats;
> >> > +       atomic_t alloc_fail;
> >> >         /*
> >> >          * the number of pages zram can consume for storing compressed data
> >> >          */
> >> > --
> >> > 2.0.0
> >> >
> >>
> >> --
> >> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> >> the body to majordomo@kvack.org.  For more info on Linux MM,
> >> see: http://www.linux-mm.org/ .
> >> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> >
> > --
> > Kind regards,
> > Minchan Kim
> 

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v1 4/5] zram: add swap full hint
  2014-10-06 23:36         ` Minchan Kim
@ 2014-10-06 23:46           ` Minchan Kim
  2014-10-08 18:29             ` Dan Streetman
  0 siblings, 1 reply; 30+ messages in thread
From: Minchan Kim @ 2014-10-06 23:46 UTC (permalink / raw)
  To: Dan Streetman
  Cc: Andrew Morton, linux-kernel, Linux-MM, Hugh Dickins, Shaohua Li,
	Jerome Marchand, Sergey Senozhatsky, Nitin Gupta,
	Luigi Semenzato, juno.choi

On Tue, Oct 07, 2014 at 08:36:08AM +0900, Minchan Kim wrote:
> Hello Dan,
> 
> Sorry for the delay. I had internal works which should be handled
> urgent. I hope you don't lose your interest due to my bad response
> latency.
> 
> On Thu, Sep 25, 2014 at 11:52:22AM -0400, Dan Streetman wrote:
> > On Wed, Sep 24, 2014 at 9:02 PM, Minchan Kim <minchan@kernel.org> wrote:
> > > On Wed, Sep 24, 2014 at 10:01:03AM -0400, Dan Streetman wrote:
> > >> On Sun, Sep 21, 2014 at 8:03 PM, Minchan Kim <minchan@kernel.org> wrote:
> > >> > This patch implement SWAP_FULL handler in zram so that VM can
> > >> > know whether zram is full or not and use it to stop anonymous
> > >> > page reclaim.
> > >> >
> > >> > How to judge fullness is below,
> > >> >
> > >> > fullness = (100 * used space / total space)
> > >> >
> > >> > It means the higher fullness is, the slower we reach zram full.
> > >> > Now, default of fullness is 80 so that it biased more momory
> > >> > consumption rather than early OOM kill.
> > >> >
> > >> > Above logic works only when used space of zram hit over the limit
> > >> > but zram also pretend to be full once 32 consecutive allocation
> > >> > fail happens. It's safe guard to prevent system hang caused by
> > >> > fragment uncertainty.
> > >> >
> > >> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > >> > ---
> > >> >  drivers/block/zram/zram_drv.c | 60 ++++++++++++++++++++++++++++++++++++++++---
> > >> >  drivers/block/zram/zram_drv.h |  1 +
> > >> >  2 files changed, 57 insertions(+), 4 deletions(-)
> > >> >
> > >> > diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> > >> > index 22a37764c409..649cad9d0b1c 100644
> > >> > --- a/drivers/block/zram/zram_drv.c
> > >> > +++ b/drivers/block/zram/zram_drv.c
> > >> > @@ -43,6 +43,20 @@ static const char *default_compressor = "lzo";
> > >> >  /* Module params (documentation at end) */
> > >> >  static unsigned int num_devices = 1;
> > >> >
> > >> > +/*
> > >> > + * If (100 * used_pages / total_pages) >= ZRAM_FULLNESS_PERCENT),
> > >> > + * we regards it as zram-full. It means that the higher
> > >> > + * ZRAM_FULLNESS_PERCENT is, the slower we reach zram full.
> > >> > + */
> > >> > +#define ZRAM_FULLNESS_PERCENT 80
> > >>
> > >> As Andrew said, this (or the user-configurable fullness param from the
> > >> next patch) should have more detail about exactly why it's needed and
> > >> what it does.  The details of how zram considers itself "full" should
> > >> be clear, which probably includes explaining zsmalloc fragmentation.
> > >> It should be also clear this param only matters when limit_pages is
> > >> set, and this param is only checked when zsmalloc's total size has
> > >> reached that limit.
> > >
> > > Sure, How about this?
> > >
> > >                 The fullness file is read/write and specifies how easily
> > >                 zram become full state. Normally, we can think "full"
> > >                 once all of memory is consumed but it's not simple with
> > >                 zram because zsmalloc has some issue by internal design
> > >                 so that write could fail once consumed *page* by zram
> > >                 reaches the mem_limit and zsmalloc cannot have a empty
> > >                 slot for the compressed object's size on fragmenet space
> > >                 although it has more empty slots for other sizes.
> > 
> > I understand that, but it might be confusing or unclear to anyone
> > who's not familiar with how zsmalloc works.
> > 
> > Maybe it could be explained by referencing the existing
> > compr_data_size and mem_used_total?  In addition to some or all of the
> > above, you could add something like:
> > 
> > This controls when zram decides that it is "full".  It is a percent
> > value, checked against compr_data_size / mem_used_total.  When
> > mem_used_total is equal to mem_limit, the fullness is checked and if
> > the compr_data_size / mem_used_total percentage is higher than this
> > specified fullness value, zram is considered "full".
> 
> Better than my verbose version.
> 
> > 
> > 
> > >
> > >                 We regard zram as full once consumed *page* reaches the
> > >                 mem_limit and consumed memory until now is higher the value
> > >                 resulted from the knob. So, if you set the value high,
> > >                 you can squeeze more pages into fragment space so you could
> > >                 avoid early OOM while you could see more write-fail warning,
> > >                 overhead to fail-write recovering by VM and reclaim latency.
> > >                 If you set the value low, you can see OOM kill easily
> > >                 even though there are memory space in zram but you could
> > >                 avoid shortcomings mentioned above.
> > 
> > You should clarify also that this is currently only used by
> > swap-on-zram, and this value prevents swap from writing to zram once
> > it is "full".  This setting has no effect when using zram for a
> > mounted filesystem.
> 
> Sure.
> 
> > 
> > >
> > >                 This knobs is valid ony if you set mem_limit.
> > >                 Currently, initial value is 80% but it could be changed.
> > >
> > > I didn't decide how to change it from percent.
> > > Decimal fraction Jerome mentioned does make sense to me so please ignore
> > > percent part in above.
> > >
> > >>
> > >> Also, since the next patch changes it to be used only as a default,
> > >> shouldn't it be DEFAULT_ZRAM_FULLNESS_PERCENT or similar?
> > >
> > > Okay, I will do it in 5/5.
> > >
> > >>
> > >> > +
> > >> > +/*
> > >> > + * If zram fails to allocate memory consecutively up to this,
> > >> > + * we regard it as zram-full. It's safe guard to prevent too
> > >> > + * many swap write fail due to lack of fragmentation uncertainty.
> > >> > + */
> > >> > +#define ALLOC_FAIL_MAX 32
> > >> > +
> > >> >  #define ZRAM_ATTR_RO(name)                                             \
> > >> >  static ssize_t zram_attr_##name##_show(struct device *d,               \
> > >> >                                 struct device_attribute *attr, char *b) \
> > >> > @@ -148,6 +162,7 @@ static ssize_t mem_limit_store(struct device *dev,
> > >> >
> > >> >         down_write(&zram->init_lock);
> > >> >         zram->limit_pages = PAGE_ALIGN(limit) >> PAGE_SHIFT;
> > >> > +       atomic_set(&zram->alloc_fail, 0);
> > >> >         up_write(&zram->init_lock);
> > >> >
> > >> >         return len;
> > >> > @@ -410,6 +425,7 @@ static void zram_free_page(struct zram *zram, size_t index)
> > >> >         atomic64_sub(zram_get_obj_size(meta, index),
> > >> >                         &zram->stats.compr_data_size);
> > >> >         atomic64_dec(&zram->stats.pages_stored);
> > >> > +       atomic_set(&zram->alloc_fail, 0);
> > >> >
> > >> >         meta->table[index].handle = 0;
> > >> >         zram_set_obj_size(meta, index, 0);
> > >> > @@ -597,10 +613,15 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
> > >> >         }
> > >> >
> > >> >         alloced_pages = zs_get_total_pages(meta->mem_pool);
> > >> > -       if (zram->limit_pages && alloced_pages > zram->limit_pages) {
> > >> > -               zs_free(meta->mem_pool, handle);
> > >> > -               ret = -ENOMEM;
> > >> > -               goto out;
> > >> > +       if (zram->limit_pages) {
> > >> > +               if (alloced_pages > zram->limit_pages) {
> > >> > +                       zs_free(meta->mem_pool, handle);
> > >> > +                       atomic_inc(&zram->alloc_fail);
> > >> > +                       ret = -ENOMEM;
> > >> > +                       goto out;
> > >> > +               } else {
> > >> > +                       atomic_set(&zram->alloc_fail, 0);
> > >> > +               }
> > >>
> > >> So, with zram_full() checking for alloced_pages >= limit_pages, this
> > >> will need to be changed; the way it is now it prevents that from ever
> > >> being true.
> > >>
> > >> Instead I believe this check has to be moved to before zs_malloc(), so
> > >> that alloced_pages > limit_pages is true.
> > >
> > > I don't get it why you said "it prevents that from ever being true".
> > > Now, zram can use up until limit_pages (ie, used memory == zram->limit_pages)
> > > and trying to get more is failed. so zram_full checks it as
> > > toal_pages >= zram->limit_pages so what is problem?
> > > If I miss your point, could you explain more?
> > 
> > ok, that's true, it's possible for alloc_pages == limit_pages, but
> > since zsmalloc will increase its size by a full zspage, and those can
> > be anywhere between 1 and 4 pages in size, it's only a (very roughly)
> > 25% chance that an alloc will cause alloc_pages == limit_pages, it's
> > more likely that an alloc will cause alloc_pages > limit_pages.  Now,
> > after some number of write failures, that 25% (-ish) probability will
> > be met, and alloc_pages == limit_pages will happen, but there's a
> > rather high chance that there will be some number of write failures
> > first.
> > 
> > To summarize or restate that, I guess what I'm saying is that for
> > users who don't care about some write failures and/or users with no
> > other swap devices except zram, it probably does not matter.  However
> > for them, they probably will rely on the 32 write failure limit, and
> > not the fullness limit.  For users where zram is only the primary swap
> > device, and there is a backup swap device, they probably will want
> > zram to fail over to the backup fairly quickly, with as few write
> > failures as possible (preferably, none, I would think).  And this
> > situation makes that highly unlikely - since there's only about a 25%
> > chance of alloc_pages == limit_pages with no previous write failures,
> > it's almost a certainty that there will be write failures before zram
> > is decided to be "full", even if "fullness" is set to 0.
> > 
> > With that said, you're right that it will eventually work, and those
> > few write failures while trying to get to alloc_pages == limit_pages
> > would probably not be noticable.  However, do remember that zram won't
> > stay full forever, so if it is only the primary swap device, it's
> > likely it will move between "full" and "not full" quite a lot, and
> > those few write failures may start adding up.
> 
> Fair enough.
> 
> But it is possible to see write-failure even though we correct
> it because there is potential chance for zram to fail to allocate
> order-0 page by a few reason which one of them is CMA I got several
> reports because zRAM cannot allocate a movable page due to lack of
> migration while usersapce goes with it well. I have a plan to fix it
> with zsmalloc migration work but there are another chances to make
> fail order-0 page by serval ways so I don't think we cannot prevent
> write-failure completely unless we have reserved memory for zram.
> 
> Having said that, I agree it would be better to reduce such fails
> with small code piece so I will check zram_full as follows,
> 
> /*
>  * XXX: zsmalloc_maxpages check should be removed when zsmalloc
>  * implement using of fragmented spaces in last page of zspage.
>  */
> if (total_pages >= zram->limit_pages - zsmalloc_maxpages()) {
>         ...
> }
> 

How about this?
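
(Assuming ZS_MAX_PAGES_PER_ZSPAGE is still 4, the new condition lets the
fullness check start firing once the pool is within 3 pages of
limit_pages, instead of requiring the pool size to land exactly on the
limit.)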

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 19da34aaf4f5..f03a94d7aa17 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -978,7 +978,7 @@ static void zram_full(struct block_device *bdev, bool *full)
 	meta = zram->meta;
 	total_pages = zs_get_total_pages(meta->mem_pool);
 
-	if (total_pages >= zram->limit_pages) {
+	if (total_pages > zram->limit_pages - zs_get_maxpages_per_zspage()) {
 
 		compr_pages = atomic64_read(&zram->stats.compr_data_size)
 					>> PAGE_SHIFT;
diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h
index 05c214760977..73eb87bc5a4e 100644
--- a/include/linux/zsmalloc.h
+++ b/include/linux/zsmalloc.h
@@ -48,4 +48,5 @@ void zs_unmap_object(struct zs_pool *pool, unsigned long handle);
 
 unsigned long zs_get_total_pages(struct zs_pool *pool);
 
+int zs_get_maxpages_per_zspage(void);
 #endif
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 839a48c3ca27..6b6653455573 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -316,6 +316,12 @@ static struct zpool_driver zs_zpool_driver = {
 MODULE_ALIAS("zpool-zsmalloc");
 #endif /* CONFIG_ZPOOL */
 
+int zs_get_maxpages_per_zspage(void)
+{
+	return ZS_MAX_PAGES_PER_ZSPAGE;
+}
+EXPORT_SYMBOL_GPL(zs_get_maxpages_per_zspage);
+
 /* per-cpu VM mapping areas for zspage accesses that cross page boundaries */
 static DEFINE_PER_CPU(struct mapping_area, zs_map_area);
 
-- 
Kind regards,
Minchan Kim

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH v1 4/5] zram: add swap full hint
  2014-10-06 23:46           ` Minchan Kim
@ 2014-10-08 18:29             ` Dan Streetman
  0 siblings, 0 replies; 30+ messages in thread
From: Dan Streetman @ 2014-10-08 18:29 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, Linux-MM, Hugh Dickins, Shaohua Li,
	Jerome Marchand, Sergey Senozhatsky, Nitin Gupta,
	Luigi Semenzato, juno.choi

On Mon, Oct 6, 2014 at 7:46 PM, Minchan Kim <minchan@kernel.org> wrote:
> On Tue, Oct 07, 2014 at 08:36:08AM +0900, Minchan Kim wrote:
>> Hello Dan,
>>
>> Sorry for the delay. I had internal works which should be handled
>> urgent. I hope you don't lose your interest due to my bad response
>> latency.
>>
>> On Thu, Sep 25, 2014 at 11:52:22AM -0400, Dan Streetman wrote:
>> > On Wed, Sep 24, 2014 at 9:02 PM, Minchan Kim <minchan@kernel.org> wrote:
>> > > On Wed, Sep 24, 2014 at 10:01:03AM -0400, Dan Streetman wrote:
>> > >> On Sun, Sep 21, 2014 at 8:03 PM, Minchan Kim <minchan@kernel.org> wrote:
>> > >> > This patch implement SWAP_FULL handler in zram so that VM can
>> > >> > know whether zram is full or not and use it to stop anonymous
>> > >> > page reclaim.
>> > >> >
>> > >> > How to judge fullness is below,
>> > >> >
>> > >> > fullness = (100 * used space / total space)
>> > >> >
>> > >> > It means the higher fullness is, the slower we reach zram full.
>> > >> > Now, default of fullness is 80 so that it biased more momory
>> > >> > consumption rather than early OOM kill.
>> > >> >
>> > >> > Above logic works only when used space of zram hit over the limit
>> > >> > but zram also pretend to be full once 32 consecutive allocation
>> > >> > fail happens. It's safe guard to prevent system hang caused by
>> > >> > fragment uncertainty.
>> > >> >
>> > >> > Signed-off-by: Minchan Kim <minchan@kernel.org>
>> > >> > ---
>> > >> >  drivers/block/zram/zram_drv.c | 60 ++++++++++++++++++++++++++++++++++++++++---
>> > >> >  drivers/block/zram/zram_drv.h |  1 +
>> > >> >  2 files changed, 57 insertions(+), 4 deletions(-)
>> > >> >
>> > >> > diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
>> > >> > index 22a37764c409..649cad9d0b1c 100644
>> > >> > --- a/drivers/block/zram/zram_drv.c
>> > >> > +++ b/drivers/block/zram/zram_drv.c
>> > >> > @@ -43,6 +43,20 @@ static const char *default_compressor = "lzo";
>> > >> >  /* Module params (documentation at end) */
>> > >> >  static unsigned int num_devices = 1;
>> > >> >
>> > >> > +/*
>> > >> > + * If (100 * used_pages / total_pages) >= ZRAM_FULLNESS_PERCENT),
>> > >> > + * we regards it as zram-full. It means that the higher
>> > >> > + * ZRAM_FULLNESS_PERCENT is, the slower we reach zram full.
>> > >> > + */
>> > >> > +#define ZRAM_FULLNESS_PERCENT 80
>> > >>
>> > >> As Andrew said, this (or the user-configurable fullness param from the
>> > >> next patch) should have more detail about exactly why it's needed and
>> > >> what it does.  The details of how zram considers itself "full" should
>> > >> be clear, which probably includes explaining zsmalloc fragmentation.
>> > >> It should be also clear this param only matters when limit_pages is
>> > >> set, and this param is only checked when zsmalloc's total size has
>> > >> reached that limit.
>> > >
>> > > Sure, How about this?
>> > >
>> > >                 The fullness file is read/write and specifies how easily
>> > >                 zram become full state. Normally, we can think "full"
>> > >                 once all of memory is consumed but it's not simple with
>> > >                 zram because zsmalloc has some issue by internal design
>> > >                 so that write could fail once consumed *page* by zram
>> > >                 reaches the mem_limit and zsmalloc cannot have a empty
>> > >                 slot for the compressed object's size on fragmenet space
>> > >                 although it has more empty slots for other sizes.
>> >
>> > I understand that, but it might be confusing or unclear to anyone
>> > who's not familiar with how zsmalloc works.
>> >
>> > Maybe it could be explained by referencing the existing
>> > compr_data_size and mem_used_total?  In addition to some or all of the
>> > above, you could add something like:
>> >
>> > This controls when zram decides that it is "full".  It is a percent
>> > value, checked against compr_data_size / mem_used_total.  When
>> > mem_used_total is equal to mem_limit, the fullness is checked and if
>> > the compr_data_size / mem_used_total percentage is higher than this
>> > specified fullness value, zram is considered "full".
>>
>> Better than my verbose version.
>>
>> >
>> >
>> > >
>> > >                 We regard zram as full once consumed *page* reaches the
>> > >                 mem_limit and consumed memory until now is higher the value
>> > >                 resulted from the knob. So, if you set the value high,
>> > >                 you can squeeze more pages into fragment space so you could
>> > >                 avoid early OOM while you could see more write-fail warning,
>> > >                 overhead to fail-write recovering by VM and reclaim latency.
>> > >                 If you set the value low, you can see OOM kill easily
>> > >                 even though there are memory space in zram but you could
>> > >                 avoid shortcomings mentioned above.
>> >
>> > You should clarify also that this is currently only used by
>> > swap-on-zram, and this value prevents swap from writing to zram once
>> > it is "full".  This setting has no effect when using zram for a
>> > mounted filesystem.
>>
>> Sure.
>>
>> >
>> > >
>> > >                 This knobs is valid ony if you set mem_limit.
>> > >                 Currently, initial value is 80% but it could be changed.
>> > >
>> > > I didn't decide how to change it from percent.
>> > > Decimal fraction Jerome mentioned does make sense to me so please ignore
>> > > percent part in above.
>> > >
>> > >>
>> > >> Also, since the next patch changes it to be used only as a default,
>> > >> shouldn't it be DEFAULT_ZRAM_FULLNESS_PERCENT or similar?
>> > >
>> > > Okay, I will do it in 5/5.
>> > >
>> > >>
>> > >> > +
>> > >> > +/*
>> > >> > + * If zram fails to allocate memory consecutively up to this,
>> > >> > + * we regard it as zram-full. It's safe guard to prevent too
>> > >> > + * many swap write fail due to lack of fragmentation uncertainty.
>> > >> > + */
>> > >> > +#define ALLOC_FAIL_MAX 32
>> > >> > +
>> > >> >  #define ZRAM_ATTR_RO(name)                                             \
>> > >> >  static ssize_t zram_attr_##name##_show(struct device *d,               \
>> > >> >                                 struct device_attribute *attr, char *b) \
>> > >> > @@ -148,6 +162,7 @@ static ssize_t mem_limit_store(struct device *dev,
>> > >> >
>> > >> >         down_write(&zram->init_lock);
>> > >> >         zram->limit_pages = PAGE_ALIGN(limit) >> PAGE_SHIFT;
>> > >> > +       atomic_set(&zram->alloc_fail, 0);
>> > >> >         up_write(&zram->init_lock);
>> > >> >
>> > >> >         return len;
>> > >> > @@ -410,6 +425,7 @@ static void zram_free_page(struct zram *zram, size_t index)
>> > >> >         atomic64_sub(zram_get_obj_size(meta, index),
>> > >> >                         &zram->stats.compr_data_size);
>> > >> >         atomic64_dec(&zram->stats.pages_stored);
>> > >> > +       atomic_set(&zram->alloc_fail, 0);
>> > >> >
>> > >> >         meta->table[index].handle = 0;
>> > >> >         zram_set_obj_size(meta, index, 0);
>> > >> > @@ -597,10 +613,15 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
>> > >> >         }
>> > >> >
>> > >> >         alloced_pages = zs_get_total_pages(meta->mem_pool);
>> > >> > -       if (zram->limit_pages && alloced_pages > zram->limit_pages) {
>> > >> > -               zs_free(meta->mem_pool, handle);
>> > >> > -               ret = -ENOMEM;
>> > >> > -               goto out;
>> > >> > +       if (zram->limit_pages) {
>> > >> > +               if (alloced_pages > zram->limit_pages) {
>> > >> > +                       zs_free(meta->mem_pool, handle);
>> > >> > +                       atomic_inc(&zram->alloc_fail);
>> > >> > +                       ret = -ENOMEM;
>> > >> > +                       goto out;
>> > >> > +               } else {
>> > >> > +                       atomic_set(&zram->alloc_fail, 0);
>> > >> > +               }
>> > >>
>> > >> So, with zram_full() checking for alloced_pages >= limit_pages, this
>> > >> will need to be changed; the way it is now it prevents that from ever
>> > >> being true.
>> > >>
>> > >> Instead I believe this check has to be moved to before zs_malloc(), so
>> > >> that alloced_pages > limit_pages is true.
>> > >
>> > > I don't get it why you said "it prevents that from ever being true".
>> > > Now, zram can use up until limit_pages (ie, used memory == zram->limit_pages)
>> > > and trying to get more is failed. so zram_full checks it as
>> > > toal_pages >= zram->limit_pages so what is problem?
>> > > If I miss your point, could you explain more?
>> >
>> > ok, that's true, it's possible for alloc_pages == limit_pages, but
>> > since zsmalloc will increase its size by a full zspage, and those can
>> > be anywhere between 1 and 4 pages in size, it's only a (very roughly)
>> > 25% chance that an alloc will cause alloc_pages == limit_pages, it's
>> > more likely that an alloc will cause alloc_pages > limit_pages.  Now,
>> > after some number of write failures, that 25% (-ish) probability will
>> > be met, and alloc_pages == limit_pages will happen, but there's a
>> > rather high chance that there will be some number of write failures
>> > first.
>> >
>> > To summarize or restate that, I guess what I'm saying is that for
>> > users who don't care about some write failures and/or users with no
>> > other swap devices except zram, it probably does not matter.  However
>> > for them, they probably will rely on the 32 write failure limit, and
>> > not the fullness limit.  For users where zram is only the primary swap
>> > device, and there is a backup swap device, they probably will want
>> > zram to fail over to the backup fairly quickly, with as few write
>> > failures as possible (preferably, none, I would think).  And this
>> > situation makes that highly unlikely - since there's only about a 25%
>> > chance of alloc_pages == limit_pages with no previous write failures,
>> > it's almost a certainty that there will be write failures before zram
>> > is decided to be "full", even if "fullness" is set to 0.
>> >
>> > With that said, you're right that it will eventually work, and those
>> > few write failures while trying to get to alloc_pages == limit_pages
>> > would probably not be noticable.  However, do remember that zram won't
>> > stay full forever, so if it is only the primary swap device, it's
>> > likely it will move between "full" and "not full" quite a lot, and
>> > those few write failures may start adding up.
>>
>> Fair enough.
>>
>> But it is still possible to see a write failure even after we correct
>> this, because there is a chance for zram to fail an order-0 page
>> allocation for a few reasons, one of which is CMA: I got several
>> reports where zram could not allocate a movable page due to lack of
>> migration even though userspace was otherwise running fine. I have a
>> plan to fix that with the zsmalloc migration work, but there are other
>> ways an order-0 allocation can fail, so I don't think we can prevent
>> write failures completely unless we have reserved memory for zram.
>>
>> Having said that, I agree it would be better to reduce such failures
>> with a small piece of code, so I will make zram_full check as follows:
>>
>> /*
>>  * XXX: the zsmalloc_maxpages() term should be removed once zsmalloc
>>  * can use the fragmented space in the last page of a zspage.
>>  */
>> if (total_pages >= zram->limit_pages - zsmalloc_maxpages()) {
>>         ...
>> }
>>
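To make the effect of the two conditions concrete, here is a small
user-space toy model (a sketch only: it pretends every swapped page needs
a brand-new zspage and that zspage sizes are uniformly 1..4 pages, which
is not how zsmalloc really behaves, and the 12800-page limit just mirrors
the cover letter's 50M limit in 4K pages). It counts how many write
failures each "full" condition allows before zram reports itself full:

#include <stdio.h>
#include <stdlib.h>

#define LIMIT_PAGES		12800	/* 50M limit with 4K pages */
#define MAX_PAGES_PER_ZSPAGE	4
#define TRIALS			10000

/* Returns how many allocations are refused before "full" is reported. */
static int failures_until_full(int subtract_max_zspage)
{
	long full_mark = LIMIT_PAGES;
	long total = 0;
	int failures = 0;

	if (subtract_max_zspage)
		full_mark -= MAX_PAGES_PER_ZSPAGE;

	for (;;) {
		int grow = 1 + rand() % MAX_PAGES_PER_ZSPAGE; /* new zspage: 1..4 pages */

		if (total >= full_mark)
			return failures;	/* "full" hint now says stop */
		if (total + grow > LIMIT_PAGES) {
			failures++;		/* would exceed the limit: write failure */
			continue;
		}
		total += grow;
	}
}

int main(void)
{
	long old_cond = 0, new_cond = 0;
	int i;

	srand(1);
	for (i = 0; i < TRIALS; i++)
		old_cond += failures_until_full(0);
	for (i = 0; i < TRIALS; i++)
		new_cond += failures_until_full(1);

	printf("avg write failures before full: old check %.2f, new check %.2f\n",
	       (double)old_cond / TRIALS, (double)new_cond / TRIALS);
	return 0;
}

With the original condition the pool has to land exactly on limit_pages,
so some refused writes typically come first; with the relaxed condition
zram reports itself full before any allocation has to be refused.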
>
> How about this?

yep, that looks good to me.

>
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index 19da34aaf4f5..f03a94d7aa17 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -978,7 +978,7 @@ static void zram_full(struct block_device *bdev, bool *full)
>         meta = zram->meta;
>         total_pages = zs_get_total_pages(meta->mem_pool);
>
> -       if (total_pages >= zram->limit_pages) {
> +       if (total_pages > zram->limit_pages - zs_get_maxpages_per_zspage()) {
>
>                 compr_pages = atomic64_read(&zram->stats.compr_data_size)
>                                         >> PAGE_SHIFT;
> diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h
> index 05c214760977..73eb87bc5a4e 100644
> --- a/include/linux/zsmalloc.h
> +++ b/include/linux/zsmalloc.h
> @@ -48,4 +48,5 @@ void zs_unmap_object(struct zs_pool *pool, unsigned long handle);
>
>  unsigned long zs_get_total_pages(struct zs_pool *pool);
>
> +int zs_get_maxpages_per_zspage(void);
>  #endif
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index 839a48c3ca27..6b6653455573 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -316,6 +316,12 @@ static struct zpool_driver zs_zpool_driver = {
>  MODULE_ALIAS("zpool-zsmalloc");
>  #endif /* CONFIG_ZPOOL */
>
> +int zs_get_maxpages_per_zspage(void)
> +{
> +       return ZS_MAX_PAGES_PER_ZSPAGE;
> +}
> +EXPORT_SYMBOL_GPL(zs_get_maxpages_per_zspage);
> +
>  /* per-cpu VM mapping areas for zspage accesses that cross page boundaries */
>  static DEFINE_PER_CPU(struct mapping_area, zs_map_area);
>
> --
> Kind regards,
> Minchan Kim
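As a quick sanity check of the condition in the diff above, a tiny
user-space sketch of what it evaluates to near the limit (12800 pages is
just the cover letter's 50M limit expressed in 4K pages, and 4 stands in
for the current ZS_MAX_PAGES_PER_ZSPAGE; both values are illustrative):

#include <stdio.h>

int main(void)
{
	const long limit_pages = 12800;		/* 50M / 4K pages */
	const int max_pages_per_zspage = 4;	/* ZS_MAX_PAGES_PER_ZSPAGE */
	long total_pages;

	/* same comparison as the zram_full() hunk above */
	for (total_pages = 12795; total_pages <= 12801; total_pages++)
		printf("total_pages=%ld -> full=%d\n", total_pages,
		       total_pages > limit_pages - max_pages_per_zspage);
	return 0;
}

It reports full from 12797 pages onward, i.e. as soon as the pool is
within one maximal zspage of the limit, so an allocation that would need
a new zspage no longer has to fail before the device is considered full.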

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v1 0/5] stop anon reclaim when zram is full
  2014-09-22  0:03 [PATCH v1 0/5] stop anon reclaim when zram is full Minchan Kim
                   ` (4 preceding siblings ...)
  2014-09-22  0:03 ` [PATCH v1 5/5] zram: add fullness knob to control swap full Minchan Kim
@ 2014-12-02  3:04 ` Minchan Kim
  5 siblings, 0 replies; 30+ messages in thread
From: Minchan Kim @ 2014-12-02  3:04 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Hugh Dickins, Shaohua Li,
	Jerome Marchand, Sergey Senozhatsky, Dan Streetman, Nitin Gupta,
	Luigi Semenzato, juno.choi

Hello all,

On Mon, Sep 22, 2014 at 09:03:06AM +0900, Minchan Kim wrote:
> For zram-swap, there is size gap between virtual disksize
> and available physical memory size for zram so that VM
> can try to reclaim anonymous pages even though zram is full.
> It makes system alomost hang(ie, unresponsible) easily in
> my kernel build test(ie, 1G DRAM, CPU 12, 4G zram swap,
> 50M zram limit). VM should have killed someone.
> 
> This patch adds new hint SWAP_FULL so VM can ask fullness
> to zram and if it founds zram is full, VM doesn't reclaim
> anonymous pages until zram-swap gets new free space.
> 
> With this patch, I see OOM when zram-swap is full instead of
> hang with no response.
> 
> Minchan Kim (5):
>   zram: generalize swap_slot_free_notify
>   mm: add full variable in swap_info_struct
>   mm: VM can be aware of zram fullness
>   zram: add swap full hint
>   zram: add fullness knob to control swap full

I'm sorry for the long delay on this patchset although you guys gave me
great feedback.

The reason I was hesitant about this patchset is that I want to avoid the
weird fullness knob, which was introduced because of zsmalloc's internal
limit. So, before this feature, I hope we get zsmalloc compaction first.
Then, I hope to remove the fullness knob entirely.

Thanks.

> 
>  Documentation/ABI/testing/sysfs-block-zram |  10 +++
>  Documentation/filesystems/Locking          |   4 +-
>  drivers/block/zram/zram_drv.c              | 114 +++++++++++++++++++++++++++--
>  drivers/block/zram/zram_drv.h              |   2 +
>  include/linux/blkdev.h                     |   8 +-
>  include/linux/swap.h                       |   1 +
>  mm/page_io.c                               |   6 +-
>  mm/swapfile.c                              |  77 ++++++++++++++-----
>  8 files changed, 189 insertions(+), 33 deletions(-)
> 
> -- 
> 2.0.0
> 

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2014-12-02  3:03 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-09-22  0:03 [PATCH v1 0/5] stop anon reclaim when zram is full Minchan Kim
2014-09-22  0:03 ` [PATCH v1 1/5] zram: generalize swap_slot_free_notify Minchan Kim
2014-09-22 20:41   ` Andrew Morton
2014-09-23  4:45     ` Minchan Kim
2014-09-22  0:03 ` [PATCH v1 2/5] mm: add full variable in swap_info_struct Minchan Kim
2014-09-22 20:45   ` Andrew Morton
2014-09-23  4:45     ` Minchan Kim
2014-09-24  2:53   ` Dan Streetman
2014-09-24  7:57     ` Minchan Kim
2014-09-22  0:03 ` [PATCH v1 3/5] mm: VM can be aware of zram fullness Minchan Kim
2014-09-24 14:12   ` Dan Streetman
2014-09-25  1:06     ` Minchan Kim
2014-09-25  1:31       ` Dan Streetman
2014-09-22  0:03 ` [PATCH v1 4/5] zram: add swap full hint Minchan Kim
2014-09-22 21:11   ` Andrew Morton
2014-09-23  4:56     ` Minchan Kim
2014-09-23 21:17       ` Andrew Morton
2014-09-24  7:57         ` Minchan Kim
2014-09-24 15:10         ` Jerome Marchand
2014-09-25  1:07           ` Minchan Kim
2014-09-24 14:01   ` Dan Streetman
2014-09-25  1:02     ` Minchan Kim
2014-09-25 15:52       ` Dan Streetman
2014-10-06 23:36         ` Minchan Kim
2014-10-06 23:46           ` Minchan Kim
2014-10-08 18:29             ` Dan Streetman
2014-09-22  0:03 ` [PATCH v1 5/5] zram: add fullness knob to control swap full Minchan Kim
2014-09-22 21:17   ` Andrew Morton
2014-09-23  4:57     ` Minchan Kim
2014-12-02  3:04 ` [PATCH v1 0/5] stop anon reclaim when zram is full Minchan Kim

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).