* [RFC 0/3] make vm aware of zram-swap
@ 2014-09-04  1:39 ` Minchan Kim
  0 siblings, 0 replies; 40+ messages in thread
From: Minchan Kim @ 2014-09-04  1:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Hugh Dickins, Shaohua Li,
	Jerome Marchand, Sergey Senozhatsky, Dan Streetman, Nitin Gupta,
	Luigi Semenzato, Minchan Kim

The VM uses nr_swap_pages as one piece of information when it reclaims
anonymous pages: nr_swap_pages tells it how much freeable space is left
in swap, so the VM can throttle swap-out once it finds there is no more
space in swap.

For zram-swap, however, there is a gap between the virtual disksize and
the physical memory actually available to store compressed pages, so
nr_swap_pages is not the right parameter for throttling swap.

This causes endless anonymous reclaim (i.e., swap-out) even when there
is no free space left in zram-swap, which makes the system unresponsive.

This series adds a new hint, SWAP_GET_FREE, through which zram can
report how much freeable space it has to the VM. Using it, the VM can
tell when zram is full and subtract the remaining, now-unusable space
from nr_swap_pages. In other words, from then on the VM sees that zram
has no more space, so it stops anonymous reclaim until swap_entry_free()
frees a page and increases nr_swap_pages again.

With this series, the user will see an OOM kill when zram-swap is full
instead of an unresponsive hang.
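
The intended flow on the VM side is roughly the following sketch, using
the names introduced in patch 2 ('si' is the swap_info_struct of the
zram swap device); see scan_swap_map() there for the real code:

	long free;
	struct gendisk *disk = si->bdev->bd_disk;

	if ((si->flags & SWP_BLKDEV) && disk->fops->swap_hint &&
	    !disk->fops->swap_hint(si->bdev, SWAP_GET_FREE, &free) &&
	    free <= 0) {
		/* zram is full: stop counting its unused space as free */
		atomic_long_sub(si->pages - si->inuse_pages,
				&nr_swap_pages);
	}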

Minchan Kim (3):
  zram: generalize swap_slot_free_notify
  mm: add swap_get_free hint for zram
  zram: add swap_get_free hint

 Documentation/filesystems/Locking |  7 ++----
 drivers/block/zram/zram_drv.c     | 36 +++++++++++++++++++++++++--
 include/linux/blkdev.h            |  8 ++++--
 mm/page_io.c                      |  7 +++---
 mm/swapfile.c                     | 52 +++++++++++++++++++++++++++++++++++----
 5 files changed, 93 insertions(+), 17 deletions(-)

-- 
2.0.0


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [RFC 1/3] zram: generalize swap_slot_free_notify
  2014-09-04  1:39 ` Minchan Kim
@ 2014-09-04  1:39   ` Minchan Kim
  -1 siblings, 0 replies; 40+ messages in thread
From: Minchan Kim @ 2014-09-04  1:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Hugh Dickins, Shaohua Li,
	Jerome Marchand, Sergey Senozhatsky, Dan Streetman, Nitin Gupta,
	Luigi Semenzato, Minchan Kim

Currently, zram uses swap_slot_free_notify to free its duplicated copy
of a page, for memory efficiency, once it knows there is no reference
left to the swap slot.

Let's extend it so it can be used for other purposes: this patch
generalizes the callback into a swap_hint operation so that zram can
get more hints from the VM.

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 Documentation/filesystems/Locking |  7 ++-----
 drivers/block/zram/zram_drv.c     | 18 ++++++++++++++++--
 include/linux/blkdev.h            |  7 +++++--
 mm/page_io.c                      |  7 ++++---
 mm/swapfile.c                     |  7 ++++---
 5 files changed, 31 insertions(+), 15 deletions(-)

diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index f1997e9da61f..e7133874fd75 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -405,7 +405,7 @@ prototypes:
 	void (*unlock_native_capacity) (struct gendisk *);
 	int (*revalidate_disk) (struct gendisk *);
 	int (*getgeo)(struct block_device *, struct hd_geometry *);
-	void (*swap_slot_free_notify) (struct block_device *, unsigned long);
+	int (*swap_hint) (struct block_device *, unsigned int, void *);
 
 locking rules:
 			bd_mutex
@@ -418,14 +418,11 @@ media_changed:		no
 unlock_native_capacity:	no
 revalidate_disk:	no
 getgeo:			no
-swap_slot_free_notify:	no	(see below)
+swap_hint:		no
 
 media_changed, unlock_native_capacity and revalidate_disk are called only from
 check_disk_change().
 
-swap_slot_free_notify is called with swap_lock and sometimes the page lock
-held.
-
 
 --------------------------- file_operations -------------------------------
 prototypes:
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index be88d750b112..88661d62e46a 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -933,7 +933,8 @@ error:
 	bio_io_error(bio);
 }
 
-static void zram_slot_free_notify(struct block_device *bdev,
+/* this callback is with swap_lock and sometimes page table lock held */
+static int zram_slot_free_notify(struct block_device *bdev,
 				unsigned long index)
 {
 	struct zram *zram;
@@ -946,10 +947,23 @@ static void zram_slot_free_notify(struct block_device *bdev,
 	zram_free_page(zram, index);
 	bit_spin_unlock(ZRAM_ACCESS, &meta->table[index].value);
 	atomic64_inc(&zram->stats.notify_free);
+
+	return 0;
+}
+
+static int zram_swap_hint(struct block_device *bdev,
+				unsigned int hint, void *arg)
+{
+	int ret = -EINVAL;
+
+	if (hint == SWAP_SLOT_FREE)
+		ret = zram_slot_free_notify(bdev, (unsigned long)arg);
+
+	return ret;
 }
 
 static const struct block_device_operations zram_devops = {
-	.swap_slot_free_notify = zram_slot_free_notify,
+	.swap_hint = zram_swap_hint,
 	.owner = THIS_MODULE
 };
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 518b46555b80..17437b2c18e4 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1609,6 +1609,10 @@ static inline bool blk_integrity_is_initialized(struct gendisk *g)
 
 #endif /* CONFIG_BLK_DEV_INTEGRITY */
 
+enum swap_blk_hint {
+	SWAP_SLOT_FREE,
+};
+
 struct block_device_operations {
 	int (*open) (struct block_device *, fmode_t);
 	void (*release) (struct gendisk *, fmode_t);
@@ -1624,8 +1628,7 @@ struct block_device_operations {
 	void (*unlock_native_capacity) (struct gendisk *);
 	int (*revalidate_disk) (struct gendisk *);
 	int (*getgeo)(struct block_device *, struct hd_geometry *);
-	/* this callback is with swap_lock and sometimes page table lock held */
-	void (*swap_slot_free_notify) (struct block_device *, unsigned long);
+	int (*swap_hint)(struct block_device *, unsigned int, void *);
 	struct module *owner;
 };
 
diff --git a/mm/page_io.c b/mm/page_io.c
index 955db8b0d497..88a13d74c621 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -114,7 +114,7 @@ void end_swap_bio_read(struct bio *bio, int err)
 			 * we again wish to reclaim it.
 			 */
 			struct gendisk *disk = sis->bdev->bd_disk;
-			if (disk->fops->swap_slot_free_notify) {
+			if (disk->fops->swap_hint) {
 				swp_entry_t entry;
 				unsigned long offset;
 
@@ -122,8 +122,9 @@ void end_swap_bio_read(struct bio *bio, int err)
 				offset = swp_offset(entry);
 
 				SetPageDirty(page);
-				disk->fops->swap_slot_free_notify(sis->bdev,
-						offset);
+				disk->fops->swap_hint(sis->bdev,
+						SWAP_SLOT_FREE,
+						(void *)offset);
 			}
 		}
 	}
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 8798b2e0ac59..4bff521e649a 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -816,9 +816,10 @@ static unsigned char swap_entry_free(struct swap_info_struct *p,
 		frontswap_invalidate_page(p->type, offset);
 		if (p->flags & SWP_BLKDEV) {
 			struct gendisk *disk = p->bdev->bd_disk;
-			if (disk->fops->swap_slot_free_notify)
-				disk->fops->swap_slot_free_notify(p->bdev,
-								  offset);
+			if (disk->fops->swap_hint)
+				disk->fops->swap_hint(p->bdev,
+						SWAP_SLOT_FREE,
+						(void *)offset);
 		}
 	}
 
-- 
2.0.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [RFC 2/3] mm: add swap_get_free hint for zram
  2014-09-04  1:39 ` Minchan Kim
@ 2014-09-04  1:39   ` Minchan Kim
  -1 siblings, 0 replies; 40+ messages in thread
From: Minchan Kim @ 2014-09-04  1:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Hugh Dickins, Shaohua Li,
	Jerome Marchand, Sergey Senozhatsky, Dan Streetman, Nitin Gupta,
	Luigi Semenzato, Minchan Kim

The VM uses nr_swap_pages as one piece of information when it does
anonymous reclaim, so that it can throttle the amount of swap-out.

Normally nr_swap_pages equals the freeable space of the swap disk, but
for zram it doesn't match, because zram can limit its memory usage via
a knob (i.e., mem_limit). So although the VM sees lots of free space on
the zram disk, zram can intentionally fail writes once the allocated
space exceeds the limit. When that happens, the VM should notice it and
stop reclaiming until zram can obtain more free space, but there is no
good way to do that at the moment.

This patch adds a new hint, SWAP_GET_FREE, through which zram can
report how much freeable space it has. Using it, this patch adds
__swap_full(), which returns true if zram is full, and subtracts the
remaining, now-unusable space of the zram-swap from nr_swap_pages.
In other words, the VM sees that zram has no more swap space, so it
stops anonymous reclaim until swap_entry_free() frees a page and
increases nr_swap_pages again.

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 include/linux/blkdev.h |  1 +
 mm/swapfile.c          | 45 +++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 44 insertions(+), 2 deletions(-)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 17437b2c18e4..c1199806e0f1 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1611,6 +1611,7 @@ static inline bool blk_integrity_is_initialized(struct gendisk *g)
 
 enum swap_blk_hint {
 	SWAP_SLOT_FREE,
+	SWAP_GET_FREE,
 };
 
 struct block_device_operations {
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 4bff521e649a..72737e6dd5e5 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -484,6 +484,22 @@ new_cluster:
 	*scan_base = tmp;
 }
 
+static bool __swap_full(struct swap_info_struct *si)
+{
+	if (si->flags & SWP_BLKDEV) {
+		long free;
+		struct gendisk *disk = si->bdev->bd_disk;
+
+		if (disk->fops->swap_hint)
+			if (!disk->fops->swap_hint(si->bdev,
+						SWAP_GET_FREE,
+						&free))
+				return free <= 0;
+	}
+
+	return si->inuse_pages == si->pages;
+}
+
 static unsigned long scan_swap_map(struct swap_info_struct *si,
 				   unsigned char usage)
 {
@@ -583,11 +599,21 @@ checks:
 	if (offset == si->highest_bit)
 		si->highest_bit--;
 	si->inuse_pages++;
-	if (si->inuse_pages == si->pages) {
+	if (__swap_full(si)) {
+		struct gendisk *disk = si->bdev->bd_disk;
+
 		si->lowest_bit = si->max;
 		si->highest_bit = 0;
 		spin_lock(&swap_avail_lock);
 		plist_del(&si->avail_list, &swap_avail_head);
+		/*
+		 * If zram is full, it decreases nr_swap_pages
+		 * for stopping anonymous page reclaim until
+		 * zram has free space. Look at swap_entry_free
+		 */
+		if (disk->fops->swap_hint)
+			atomic_long_sub(si->pages - si->inuse_pages,
+				&nr_swap_pages);
 		spin_unlock(&swap_avail_lock);
 	}
 	si->swap_map[offset] = usage;
@@ -796,6 +822,7 @@ static unsigned char swap_entry_free(struct swap_info_struct *p,
 
 	/* free if no reference */
 	if (!usage) {
+		struct gendisk *disk = p->bdev->bd_disk;
 		dec_cluster_info_page(p, p->cluster_info, offset);
 		if (offset < p->lowest_bit)
 			p->lowest_bit = offset;
@@ -808,6 +835,21 @@ static unsigned char swap_entry_free(struct swap_info_struct *p,
 				if (plist_node_empty(&p->avail_list))
 					plist_add(&p->avail_list,
 						  &swap_avail_head);
+				if ((p->flags & SWP_BLKDEV) &&
+					disk->fops->swap_hint) {
+					atomic_long_add(p->pages -
+							p->inuse_pages,
+							&nr_swap_pages);
+					/*
+					 * reset [highest|lowest]_bit to avoid
+					 * scan_swap_map infinite looping if
+					 * cached free cluster's index by
+					 * scan_swap_map_try_ssd_cluster is
+					 * above p->highest_bit.
+					 */
+					p->highest_bit = p->max - 1;
+					p->lowest_bit = 1;
+				}
 				spin_unlock(&swap_avail_lock);
 			}
 		}
@@ -815,7 +857,6 @@ static unsigned char swap_entry_free(struct swap_info_struct *p,
 		p->inuse_pages--;
 		frontswap_invalidate_page(p->type, offset);
 		if (p->flags & SWP_BLKDEV) {
-			struct gendisk *disk = p->bdev->bd_disk;
 			if (disk->fops->swap_hint)
 				disk->fops->swap_hint(p->bdev,
 						SWAP_SLOT_FREE,
-- 
2.0.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [RFC 3/3] zram: add swap_get_free hint
  2014-09-04  1:39 ` Minchan Kim
@ 2014-09-04  1:39   ` Minchan Kim
  -1 siblings, 0 replies; 40+ messages in thread
From: Minchan Kim @ 2014-09-04  1:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Hugh Dickins, Shaohua Li,
	Jerome Marchand, Sergey Senozhatsky, Dan Streetman, Nitin Gupta,
	Luigi Semenzato, Minchan Kim

This patch implements a SWAP_GET_FREE handler in zram so that the VM
can know how much freeable space zram has.
The VM can use it to stop anonymous reclaim once zram is full.

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 drivers/block/zram/zram_drv.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 88661d62e46a..8e22b20aa2db 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -951,6 +951,22 @@ static int zram_slot_free_notify(struct block_device *bdev,
 	return 0;
 }
 
+static int zram_get_free_pages(struct block_device *bdev, long *free)
+{
+	struct zram *zram;
+	struct zram_meta *meta;
+
+	zram = bdev->bd_disk->private_data;
+	meta = zram->meta;
+
+	if (!zram->limit_pages)
+		return 1;
+
+	*free = zram->limit_pages - zs_get_total_pages(meta->mem_pool);
+
+	return 0;
+}
+
 static int zram_swap_hint(struct block_device *bdev,
 				unsigned int hint, void *arg)
 {
@@ -958,6 +974,8 @@ static int zram_swap_hint(struct block_device *bdev,
 
 	if (hint == SWAP_SLOT_FREE)
 		ret = zram_slot_free_notify(bdev, (unsigned long)arg);
+	else if (hint == SWAP_GET_FREE)
+		ret = zram_get_free_pages(bdev, arg);
 
 	return ret;
 }
-- 
2.0.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [RFC 3/3] zram: add swap_get_free hint
  2014-09-04  1:39   ` Minchan Kim
@ 2014-09-04  6:26     ` Heesub Shin
  -1 siblings, 0 replies; 40+ messages in thread
From: Heesub Shin @ 2014-09-04  6:26 UTC (permalink / raw)
  To: Minchan Kim, Andrew Morton
  Cc: linux-kernel, linux-mm, Hugh Dickins, Shaohua Li,
	Jerome Marchand, Sergey Senozhatsky, Dan Streetman, Nitin Gupta,
	Luigi Semenzato

Hello Minchan,

First of all, I agree with the overall purpose of your patch set.

On 09/04/2014 10:39 AM, Minchan Kim wrote:
> This patch implement SWAP_GET_FREE handler in zram so that VM can
> know how many zram has freeable space.
> VM can use it to stop anonymous reclaiming once zram is full.
>
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
>   drivers/block/zram/zram_drv.c | 18 ++++++++++++++++++
>   1 file changed, 18 insertions(+)
>
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index 88661d62e46a..8e22b20aa2db 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -951,6 +951,22 @@ static int zram_slot_free_notify(struct block_device *bdev,
>   	return 0;
>   }
>
> +static int zram_get_free_pages(struct block_device *bdev, long *free)
> +{
> +	struct zram *zram;
> +	struct zram_meta *meta;
> +
> +	zram = bdev->bd_disk->private_data;
> +	meta = zram->meta;
> +
> +	if (!zram->limit_pages)
> +		return 1;
> +
> +	*free = zram->limit_pages - zs_get_total_pages(meta->mem_pool);

Even if 'free' is zero here, there may still be free space available to
store more compressed pages in the zs_pool. I mean the calculation
above is not quite accurate and wastes memory, but I have no better
idea for now.
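
To make that concrete with made-up numbers: suppose zram->limit_pages
is 1000 and zs_get_total_pages() already reports 1000 pages, so the
hint returns free == 0. If many of those zspages are only partially
filled, zsmalloc can still store further compressed objects in them
without allocating any new pages, yet the VM would already treat the
device as full.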

heesub

> +
> +	return 0;
> +}
> +
>   static int zram_swap_hint(struct block_device *bdev,
>   				unsigned int hint, void *arg)
>   {
> @@ -958,6 +974,8 @@ static int zram_swap_hint(struct block_device *bdev,
>
>   	if (hint == SWAP_SLOT_FREE)
>   		ret = zram_slot_free_notify(bdev, (unsigned long)arg);
> +	else if (hint == SWAP_GET_FREE)
> +		ret = zram_get_free_pages(bdev, arg);
>
>   	return ret;
>   }
>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC 3/3] zram: add swap_get_free hint
  2014-09-04  6:26     ` Heesub Shin
@ 2014-09-04 23:59       ` Minchan Kim
  -1 siblings, 0 replies; 40+ messages in thread
From: Minchan Kim @ 2014-09-04 23:59 UTC (permalink / raw)
  To: Heesub Shin
  Cc: Andrew Morton, linux-kernel, linux-mm, Hugh Dickins, Shaohua Li,
	Jerome Marchand, Sergey Senozhatsky, Dan Streetman, Nitin Gupta,
	Luigi Semenzato

Hi Heesub,

On Thu, Sep 04, 2014 at 03:26:14PM +0900, Heesub Shin wrote:
> Hello Minchan,
> 
> First of all, I agree with the overall purpose of your patch set.

Thank you.

> 
> On 09/04/2014 10:39 AM, Minchan Kim wrote:
> >This patch implement SWAP_GET_FREE handler in zram so that VM can
> >know how many zram has freeable space.
> >VM can use it to stop anonymous reclaiming once zram is full.
> >
> >Signed-off-by: Minchan Kim <minchan@kernel.org>
> >---
> >  drivers/block/zram/zram_drv.c | 18 ++++++++++++++++++
> >  1 file changed, 18 insertions(+)
> >
> >diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> >index 88661d62e46a..8e22b20aa2db 100644
> >--- a/drivers/block/zram/zram_drv.c
> >+++ b/drivers/block/zram/zram_drv.c
> >@@ -951,6 +951,22 @@ static int zram_slot_free_notify(struct block_device *bdev,
> >  	return 0;
> >  }
> >
> >+static int zram_get_free_pages(struct block_device *bdev, long *free)
> >+{
> >+	struct zram *zram;
> >+	struct zram_meta *meta;
> >+
> >+	zram = bdev->bd_disk->private_data;
> >+	meta = zram->meta;
> >+
> >+	if (!zram->limit_pages)
> >+		return 1;
> >+
> >+	*free = zram->limit_pages - zs_get_total_pages(meta->mem_pool);
> 
> Even if 'free' is zero here, there may be free spaces available to
> store more compressed pages into the zs_pool. I mean calculation
> above is not quite accurate and wastes memory, but have no better
> idea for now.

Yes, good point.

Actually, I thought about that, but in this patchset I wanted to go
with a conservative approach that acts as a safeguard against a system
hang, which is worse than an early OOM kill.

The whole point of this patchset is to add a facility to the VM so that
the VM can collaborate with zram via the interface to avoid the worst
case (i.e., a system hang). The throttling logic could be enhanced by
several approaches in the future, but I agree my logic was too simple
and conservative.

We could improve it with [anti|de]fragmentation in the future, but at
the moment the simple heuristic below is not too bad as a first
step. :)


---
 drivers/block/zram/zram_drv.c | 15 ++++++++++-----
 drivers/block/zram/zram_drv.h |  1 +
 2 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 8e22b20aa2db..af9dfe6a7d2b 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -410,6 +410,7 @@ static bool zram_free_page(struct zram *zram, size_t index)
 	atomic64_sub(zram_get_obj_size(meta, index),
 			&zram->stats.compr_data_size);
 	atomic64_dec(&zram->stats.pages_stored);
+	atomic_set(&zram->alloc_fail, 0);
 
 	meta->table[index].handle = 0;
 	zram_set_obj_size(meta, index, 0);
@@ -600,10 +601,12 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
 	alloced_pages = zs_get_total_pages(meta->mem_pool);
 	if (zram->limit_pages && alloced_pages > zram->limit_pages) {
 		zs_free(meta->mem_pool, handle);
+		atomic_inc(&zram->alloc_fail);
 		ret = -ENOMEM;
 		goto out;
 	}
 
+	atomic_set(&zram->alloc_fail, 0);
 	update_used_max(zram, alloced_pages);
 
 	cmem = zs_map_object(meta->mem_pool, handle, ZS_MM_WO);
@@ -951,6 +954,7 @@ static int zram_slot_free_notify(struct block_device *bdev,
 	return 0;
 }
 
+#define FULL_THRESH_HOLD 32
 static int zram_get_free_pages(struct block_device *bdev, long *free)
 {
 	struct zram *zram;
@@ -959,12 +963,13 @@ static int zram_get_free_pages(struct block_device *bdev, long *free)
 	zram = bdev->bd_disk->private_data;
 	meta = zram->meta;
 
-	if (!zram->limit_pages)
-		return 1;
-
-	*free = zram->limit_pages - zs_get_total_pages(meta->mem_pool);
+	if (zram->limit_pages &&
+		(atomic_read(&zram->alloc_fail) > FULL_THRESH_HOLD)) {
+		*free = 0;
+		return 0;
+	}
 
-	return 0;
+	return 1;
 }
 
 static int zram_swap_hint(struct block_device *bdev,
diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
index 779d03fa4360..182a2544751b 100644
--- a/drivers/block/zram/zram_drv.h
+++ b/drivers/block/zram/zram_drv.h
@@ -115,6 +115,7 @@ struct zram {
 	u64 disksize;	/* bytes */
 	int max_comp_streams;
 	struct zram_stats stats;
+	atomic_t alloc_fail;
 	/*
 	 * the number of pages zram can consume for storing compressed data
 	 */
-- 
2.0.0

> 
> heesub
> 
> >+
> >+	return 0;
> >+}
> >+
> >  static int zram_swap_hint(struct block_device *bdev,
> >  				unsigned int hint, void *arg)
> >  {
> >@@ -958,6 +974,8 @@ static int zram_swap_hint(struct block_device *bdev,
> >
> >  	if (hint == SWAP_SLOT_FREE)
> >  		ret = zram_slot_free_notify(bdev, (unsigned long)arg);
> >+	else if (hint == SWAP_GET_FREE)
> >+		ret = zram_get_free_pages(bdev, arg);
> >
> >  	return ret;
> >  }
> >
> 

-- 
Kind regards,
Minchan Kim

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [RFC 2/3] mm: add swap_get_free hint for zram
  2014-09-04  1:39   ` Minchan Kim
@ 2014-09-13 19:01     ` Dan Streetman
  -1 siblings, 0 replies; 40+ messages in thread
From: Dan Streetman @ 2014-09-13 19:01 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, Linux-MM, Hugh Dickins, Shaohua Li,
	Jerome Marchand, Sergey Senozhatsky, Nitin Gupta,
	Luigi Semenzato

On Wed, Sep 3, 2014 at 9:39 PM, Minchan Kim <minchan@kernel.org> wrote:
> VM uses nr_swap_pages as one of information when it does
> anonymous reclaim so that VM is able to throttle amount of swap.
>
> Normally, the nr_swap_pages is equal to freeable space of swap disk
> but for zram, it doesn't match because zram can limit memory usage
> by knob(ie, mem_limit) so although VM can see lots of free space
> from zram disk, zram can make fail intentionally once the allocated
> space is over to limit. If it happens, VM should notice it and
> stop reclaimaing until zram can obtain more free space but there
> is a good way to do at the moment.
>
> This patch adds new hint SWAP_GET_FREE which zram can return how
> many of freeable space it has. With using that, this patch adds
> __swap_full which returns true if the zram is full and substract
> remained freeable space of the zram-swap from nr_swap_pages.
> IOW, VM sees there is no more swap space of zram so that it stops
> anonymous reclaiming until swap_entry_free free a page and increase
> nr_swap_pages again.
>
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
>  include/linux/blkdev.h |  1 +
>  mm/swapfile.c          | 45 +++++++++++++++++++++++++++++++++++++++++++--
>  2 files changed, 44 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 17437b2c18e4..c1199806e0f1 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1611,6 +1611,7 @@ static inline bool blk_integrity_is_initialized(struct gendisk *g)
>
>  enum swap_blk_hint {
>         SWAP_SLOT_FREE,
> +       SWAP_GET_FREE,
>  };
>
>  struct block_device_operations {
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 4bff521e649a..72737e6dd5e5 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -484,6 +484,22 @@ new_cluster:
>         *scan_base = tmp;
>  }
>
> +static bool __swap_full(struct swap_info_struct *si)
> +{
> +       if (si->flags & SWP_BLKDEV) {
> +               long free;
> +               struct gendisk *disk = si->bdev->bd_disk;
> +
> +               if (disk->fops->swap_hint)
> +                       if (!disk->fops->swap_hint(si->bdev,
> +                                               SWAP_GET_FREE,
> +                                               &free))
> +                               return free <= 0;
> +       }
> +
> +       return si->inuse_pages == si->pages;
> +}
> +
>  static unsigned long scan_swap_map(struct swap_info_struct *si,
>                                    unsigned char usage)
>  {
> @@ -583,11 +599,21 @@ checks:
>         if (offset == si->highest_bit)
>                 si->highest_bit--;
>         si->inuse_pages++;
> -       if (si->inuse_pages == si->pages) {
> +       if (__swap_full(si)) {

This check is done after an available offset has already been
selected.  So if the variable-size blkdev is full at this point, this
is incorrect, as swap will try to store a page at the currently
selected offset.

> +               struct gendisk *disk = si->bdev->bd_disk;
> +
>                 si->lowest_bit = si->max;
>                 si->highest_bit = 0;
>                 spin_lock(&swap_avail_lock);
>                 plist_del(&si->avail_list, &swap_avail_head);
> +               /*
> +                * If zram is full, it decreases nr_swap_pages
> +                * for stopping anonymous page reclaim until
> +                * zram has free space. Look at swap_entry_free
> +                */
> +               if (disk->fops->swap_hint)

Simply checking for the existence of swap_hint isn't enough to know
we're using zram...

> +                       atomic_long_sub(si->pages - si->inuse_pages,
> +                               &nr_swap_pages);
>                 spin_unlock(&swap_avail_lock);
>         }
>         si->swap_map[offset] = usage;
> @@ -796,6 +822,7 @@ static unsigned char swap_entry_free(struct swap_info_struct *p,
>
>         /* free if no reference */
>         if (!usage) {
> +               struct gendisk *disk = p->bdev->bd_disk;
>                 dec_cluster_info_page(p, p->cluster_info, offset);
>                 if (offset < p->lowest_bit)
>                         p->lowest_bit = offset;
> @@ -808,6 +835,21 @@ static unsigned char swap_entry_free(struct swap_info_struct *p,
>                                 if (plist_node_empty(&p->avail_list))
>                                         plist_add(&p->avail_list,
>                                                   &swap_avail_head);
> +                               if ((p->flags & SWP_BLKDEV) &&
> +                                       disk->fops->swap_hint) {

freeing an entry from a full variable-size blkdev doesn't mean it's
not still full.  In this case with zsmalloc, freeing one handle
doesn't actually free any memory unless it was the only handle left in
its containing zspage, and therefore it's possible that it is still
full at this point.

> +                                       atomic_long_add(p->pages -
> +                                                       p->inuse_pages,
> +                                                       &nr_swap_pages);
> +                                       /*
> +                                        * reset [highest|lowest]_bit to avoid
> +                                        * scan_swap_map infinite looping if
> +                                        * cached free cluster's index by
> +                                        * scan_swap_map_try_ssd_cluster is
> +                                        * above p->highest_bit.
> +                                        */
> +                                       p->highest_bit = p->max - 1;
> +                                       p->lowest_bit = 1;

lowest_bit and highest_bit are likely to remain at those extremes for
a long time, until 1 or max-1 is freed and re-allocated.


By adding variable-size blkdev support to swap, I don't think
highest_bit can be re-used as a "full" flag anymore.

Instead, I suggest that you add a "full" flag to struct
swap_info_struct.  Then put a swap_hint GET_FREE check at the top of
scan_swap_map(), and if full simply turn "full" on, remove the
swap_info_struct from the avail list, reduce nr_swap_pages
appropriately, and return failure.  Don't mess with lowest_bit or
highest_bit at all.
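
Roughly, the check at the top of scan_swap_map() could look like the
sketch below (is_variable_size_blkdev() and
is_variable_size_blkdev_full() are the same hypothetical helpers used
in the swap_entry_free() sketch that follows, and 'full' is the new
flag in swap_info_struct):

    /* at the top of scan_swap_map(), before any offset is chosen */
    if (is_variable_size_blkdev(si) && is_variable_size_blkdev_full(si)) {
        si->full = true;
        spin_lock(&swap_avail_lock);
        plist_del(&si->avail_list, &swap_avail_head);
        /* hide the remaining, unusable space from the VM */
        atomic_long_sub(si->pages - si->inuse_pages, &nr_swap_pages);
        spin_unlock(&swap_avail_lock);
        return 0;    /* no slot available */
    }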

Then in swap_entry_free(), do something like:

    dec_cluster_info_page(p, p->cluster_info, offset);
    if (offset < p->lowest_bit)
      p->lowest_bit = offset;
-   if (offset > p->highest_bit) {
-     bool was_full = !p->highest_bit;
+   if (offset > p->highest_bit)
      p->highest_bit = offset;
-     if (was_full && (p->flags & SWP_WRITEOK)) {
+   if (p->full && p->flags & SWP_WRITEOK) {
+     bool is_var_size_blkdev = is_variable_size_blkdev(p);
+     bool blkdev_full = is_variable_size_blkdev_full(p);
+
+     if (!is_var_size_blkdev || !blkdev_full) {
+       if (is_var_size_blkdev)
+         atomic_long_add(p->pages - p->inuse_pages, &nr_swap_pages);
+       p->full = false;
        spin_lock(&swap_avail_lock);
        WARN_ON(!plist_node_empty(&p->avail_list));
        if (plist_node_empty(&p->avail_list))
          plist_add(&p->avail_list,
             &swap_avail_head);
        spin_unlock(&swap_avail_lock);
+     } else if (blkdev_full) {
+       /* still full, so this page isn't actually
+        * available yet to use; once non-full,
+        * pages-inuse_pages will be the correct
+        * number to add (above) since below will
+        * inuse_pages--
+        */
+       atomic_long_dec(&nr_swap_pages);
      }
    }
    atomic_long_inc(&nr_swap_pages);



> @@ -815,7 +857,6 @@ static unsigned char swap_entry_free(struct swap_info_struct *p,
>                 p->inuse_pages--;
>                 frontswap_invalidate_page(p->type, offset);
>                 if (p->flags & SWP_BLKDEV) {
> -                       struct gendisk *disk = p->bdev->bd_disk;
>                         if (disk->fops->swap_hint)
>                                 disk->fops->swap_hint(p->bdev,
>                                                 SWAP_SLOT_FREE,
> --
> 2.0.0
>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC 2/3] mm: add swap_get_free hint for zram
@ 2014-09-13 19:01     ` Dan Streetman
  0 siblings, 0 replies; 40+ messages in thread
From: Dan Streetman @ 2014-09-13 19:01 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, Linux-MM, Hugh Dickins, Shaohua Li,
	Jerome Marchand, Sergey Senozhatsky, Nitin Gupta,
	Luigi Semenzato

On Wed, Sep 3, 2014 at 9:39 PM, Minchan Kim <minchan@kernel.org> wrote:
> VM uses nr_swap_pages as one of information when it does
> anonymous reclaim so that VM is able to throttle amount of swap.
>
> Normally, the nr_swap_pages is equal to freeable space of swap disk
> but for zram, it doesn't match because zram can limit memory usage
> by knob(ie, mem_limit) so although VM can see lots of free space
> from zram disk, zram can make fail intentionally once the allocated
> space is over to limit. If it happens, VM should notice it and
> stop reclaimaing until zram can obtain more free space but there
> is a good way to do at the moment.
>
> This patch adds new hint SWAP_GET_FREE which zram can return how
> many of freeable space it has. With using that, this patch adds
> __swap_full which returns true if the zram is full and substract
> remained freeable space of the zram-swap from nr_swap_pages.
> IOW, VM sees there is no more swap space of zram so that it stops
> anonymous reclaiming until swap_entry_free free a page and increase
> nr_swap_pages again.
>
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
>  include/linux/blkdev.h |  1 +
>  mm/swapfile.c          | 45 +++++++++++++++++++++++++++++++++++++++++++--
>  2 files changed, 44 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 17437b2c18e4..c1199806e0f1 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1611,6 +1611,7 @@ static inline bool blk_integrity_is_initialized(struct gendisk *g)
>
>  enum swap_blk_hint {
>         SWAP_SLOT_FREE,
> +       SWAP_GET_FREE,
>  };
>
>  struct block_device_operations {
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 4bff521e649a..72737e6dd5e5 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -484,6 +484,22 @@ new_cluster:
>         *scan_base = tmp;
>  }
>
> +static bool __swap_full(struct swap_info_struct *si)
> +{
> +       if (si->flags & SWP_BLKDEV) {
> +               long free;
> +               struct gendisk *disk = si->bdev->bd_disk;
> +
> +               if (disk->fops->swap_hint)
> +                       if (!disk->fops->swap_hint(si->bdev,
> +                                               SWAP_GET_FREE,
> +                                               &free))
> +                               return free <= 0;
> +       }
> +
> +       return si->inuse_pages == si->pages;
> +}
> +
>  static unsigned long scan_swap_map(struct swap_info_struct *si,
>                                    unsigned char usage)
>  {
> @@ -583,11 +599,21 @@ checks:
>         if (offset == si->highest_bit)
>                 si->highest_bit--;
>         si->inuse_pages++;
> -       if (si->inuse_pages == si->pages) {
> +       if (__swap_full(si)) {

This check is done after an available offset has already been
selected.  So if the variable-size blkdev is full at this point, then
this is incorrect, as swap will try to store a page at the current
selected offset.

> +               struct gendisk *disk = si->bdev->bd_disk;
> +
>                 si->lowest_bit = si->max;
>                 si->highest_bit = 0;
>                 spin_lock(&swap_avail_lock);
>                 plist_del(&si->avail_list, &swap_avail_head);
> +               /*
> +                * If zram is full, it decreases nr_swap_pages
> +                * for stopping anonymous page reclaim until
> +                * zram has free space. Look at swap_entry_free
> +                */
> +               if (disk->fops->swap_hint)

Simply checking for the existence of swap_hint isn't enough to know
we're using zram...

> +                       atomic_long_sub(si->pages - si->inuse_pages,
> +                               &nr_swap_pages);
>                 spin_unlock(&swap_avail_lock);
>         }
>         si->swap_map[offset] = usage;
> @@ -796,6 +822,7 @@ static unsigned char swap_entry_free(struct swap_info_struct *p,
>
>         /* free if no reference */
>         if (!usage) {
> +               struct gendisk *disk = p->bdev->bd_disk;
>                 dec_cluster_info_page(p, p->cluster_info, offset);
>                 if (offset < p->lowest_bit)
>                         p->lowest_bit = offset;
> @@ -808,6 +835,21 @@ static unsigned char swap_entry_free(struct swap_info_struct *p,
>                                 if (plist_node_empty(&p->avail_list))
>                                         plist_add(&p->avail_list,
>                                                   &swap_avail_head);
> +                               if ((p->flags & SWP_BLKDEV) &&
> +                                       disk->fops->swap_hint) {

freeing an entry from a full variable-size blkdev doesn't mean it's
not still full.  In this case with zsmalloc, freeing one handle
doesn't actually free any memory unless it was the only handle left in
its containing zspage, and therefore it's possible that it is still
full at this point.

> +                                       atomic_long_add(p->pages -
> +                                                       p->inuse_pages,
> +                                                       &nr_swap_pages);
> +                                       /*
> +                                        * reset [highest|lowest]_bit to avoid
> +                                        * scan_swap_map infinite looping if
> +                                        * cached free cluster's index by
> +                                        * scan_swap_map_try_ssd_cluster is
> +                                        * above p->highest_bit.
> +                                        */
> +                                       p->highest_bit = p->max - 1;
> +                                       p->lowest_bit = 1;

lowest_bit and highest_bit are likely to remain at those extremes for
a long time, until 1 or max-1 is freed and re-allocated.


By adding variable-size blkdev support to swap, I don't think
highest_bit can be re-used as a "full" flag anymore.

Instead, I suggest that you add a "full" flag to struct
swap_info_struct.  Then put a swap_hint GET_FREE check at the top of
scan_swap_map(), and if full simply turn "full" on, remove the
swap_info_struct from the avail list, reduce nr_swap_pages
appropriately, and return failure.  Don't mess with lowest_bit or
highest_bit at all.

Then in swap_entry_free(), do something like:

    dec_cluster_info_page(p, p->cluster_info, offset);
    if (offset < p->lowest_bit)
      p->lowest_bit = offset;
-   if (offset > p->highest_bit) {
-     bool was_full = !p->highest_bit;
+   if (offset > p->highest_bit)
      p->highest_bit = offset;
-     if (was_full && (p->flags & SWP_WRITEOK)) {
+   if (p->full && p->flags & SWP_WRITEOK) {
+     bool is_var_size_blkdev = is_variable_size_blkdev(p);
+     bool blkdev_full = is_variable_size_blkdev_full(p);
+
+     if (!is_var_size_blkdev || !blkdev_full) {
+       if (is_var_size_blkdev)
+         atomic_long_add(p->pages - p->inuse_pages, &nr_swap_pages);
+       p->full = false;
        spin_lock(&swap_avail_lock);
        WARN_ON(!plist_node_empty(&p->avail_list));
        if (plist_node_empty(&p->avail_list))
          plist_add(&p->avail_list,
             &swap_avail_head);
        spin_unlock(&swap_avail_lock);
+     } else if (blkdev_full) {
+       /* still full, so this page isn't actually
+        * available yet to use; once non-full,
+        * pages-inuse_pages will be the correct
+        * number to add (above) since below will
+        * inuse_pages--
+        */
+       atomic_long_dec(&nr_swap_pages);
      }
    }
    atomic_long_inc(&nr_swap_pages);
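
For the scan_swap_map() side, a rough sketch of what I mean (with "full" being
the suggested new field, and is_variable_size_blkdev()/is_variable_size_blkdev_full()
still being hypothetical helpers wrapping the swap_hint call) would be
something like:

    static unsigned long scan_swap_map(struct swap_info_struct *si,
                                       unsigned char usage)
    {
            /* bail out before any offset has been selected */
            if (is_variable_size_blkdev(si) &&
                is_variable_size_blkdev_full(si)) {
                    si->full = true;
                    spin_lock(&swap_avail_lock);
                    plist_del(&si->avail_list, &swap_avail_head);
                    spin_unlock(&swap_avail_lock);
                    /* hide the slots that can't actually be used */
                    atomic_long_sub(si->pages - si->inuse_pages,
                                    &nr_swap_pages);
                    return 0;       /* no slot allocated */
            }
            ...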



> @@ -815,7 +857,6 @@ static unsigned char swap_entry_free(struct swap_info_struct *p,
>                 p->inuse_pages--;
>                 frontswap_invalidate_page(p->type, offset);
>                 if (p->flags & SWP_BLKDEV) {
> -                       struct gendisk *disk = p->bdev->bd_disk;
>                         if (disk->fops->swap_hint)
>                                 disk->fops->swap_hint(p->bdev,
>                                                 SWAP_SLOT_FREE,
> --
> 2.0.0
>


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC 3/3] zram: add swap_get_free hint
  2014-09-04 23:59       ` Minchan Kim
@ 2014-09-13 19:39         ` Dan Streetman
  -1 siblings, 0 replies; 40+ messages in thread
From: Dan Streetman @ 2014-09-13 19:39 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Heesub Shin, Andrew Morton, linux-kernel, Linux-MM, Hugh Dickins,
	Shaohua Li, Jerome Marchand, Sergey Senozhatsky, Nitin Gupta,
	Luigi Semenzato

On Thu, Sep 4, 2014 at 7:59 PM, Minchan Kim <minchan@kernel.org> wrote:
> Hi Heesub,
>
> On Thu, Sep 04, 2014 at 03:26:14PM +0900, Heesub Shin wrote:
>> Hello Minchan,
>>
>> First of all, I agree with the overall purpose of your patch set.
>
> Thank you.
>
>>
>> On 09/04/2014 10:39 AM, Minchan Kim wrote:
>> >This patch implement SWAP_GET_FREE handler in zram so that VM can
>> >know how many zram has freeable space.
>> >VM can use it to stop anonymous reclaiming once zram is full.
>> >
>> >Signed-off-by: Minchan Kim <minchan@kernel.org>
>> >---
>> >  drivers/block/zram/zram_drv.c | 18 ++++++++++++++++++
>> >  1 file changed, 18 insertions(+)
>> >
>> >diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
>> >index 88661d62e46a..8e22b20aa2db 100644
>> >--- a/drivers/block/zram/zram_drv.c
>> >+++ b/drivers/block/zram/zram_drv.c
>> >@@ -951,6 +951,22 @@ static int zram_slot_free_notify(struct block_device *bdev,
>> >     return 0;
>> >  }
>> >
>> >+static int zram_get_free_pages(struct block_device *bdev, long *free)
>> >+{
>> >+    struct zram *zram;
>> >+    struct zram_meta *meta;
>> >+
>> >+    zram = bdev->bd_disk->private_data;
>> >+    meta = zram->meta;
>> >+
>> >+    if (!zram->limit_pages)
>> >+            return 1;
>> >+
>> >+    *free = zram->limit_pages - zs_get_total_pages(meta->mem_pool);
>>
>> Even if 'free' is zero here, there may be free spaces available to
>> store more compressed pages into the zs_pool. I mean calculation
>> above is not quite accurate and wastes memory, but have no better
>> idea for now.
>
> Yeb, good point.
>
> Actually, I thought about that but in this patchset, I wanted to
> go with conservative approach which is a safe guard to prevent
> system hang which is terrible than early OOM kill.
>
> Whole point of this patchset is to add a facility to VM and VM
> collaborates with zram via the interface to avoid worst case
> (ie, system hang) and logic to throttle could be enhanced by
> several approaches in future but I agree my logic was too simple
> and conservative.
>
> We could improve it with [anti|de]fragmentation in future but
> at the moment, below simple heuristic is not too bad for first
> step. :)
>
>
> ---
>  drivers/block/zram/zram_drv.c | 15 ++++++++++-----
>  drivers/block/zram/zram_drv.h |  1 +
>  2 files changed, 11 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index 8e22b20aa2db..af9dfe6a7d2b 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -410,6 +410,7 @@ static bool zram_free_page(struct zram *zram, size_t index)
>         atomic64_sub(zram_get_obj_size(meta, index),
>                         &zram->stats.compr_data_size);
>         atomic64_dec(&zram->stats.pages_stored);
> +       atomic_set(&zram->alloc_fail, 0);
>
>         meta->table[index].handle = 0;
>         zram_set_obj_size(meta, index, 0);
> @@ -600,10 +601,12 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
>         alloced_pages = zs_get_total_pages(meta->mem_pool);
>         if (zram->limit_pages && alloced_pages > zram->limit_pages) {
>                 zs_free(meta->mem_pool, handle);
> +               atomic_inc(&zram->alloc_fail);
>                 ret = -ENOMEM;
>                 goto out;
>         }

This isn't going to work well at all with swap.  There will be,
minimum, 32 failures to write a swap page before GET_FREE finally
indicates it's full, and even then a single free during those 32
failures will restart the counter, so it could be dozens or hundreds
(or more) swap write failures before the zram device is marked as
full.  And then, a single zram free will move it back to non-full and
start the write failures over again.

I think it would be better to just check for actual fullness (i.e.
alloced_pages > limit_pages) at the start of write, and fail if so.
That will allow a single write to succeed when it crosses into
fullness, and then if GET_FREE is changed to a simple IS_FULL and uses
the same check (alloced_pages > limit_pages), then swap shouldn't see
any write failures (or very few), and zram will stay full until enough
pages are freed that it really does move under limit_pages.
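
A rough sketch of that up-front check (zram_is_full() is just a name I'm
making up here, it's not in the posted patches):

    static bool zram_is_full(struct zram *zram)
    {
            /* same test the existing limit check uses, done before any work */
            return zram->limit_pages &&
                    zs_get_total_pages(zram->meta->mem_pool) > zram->limit_pages;
    }

and then at the top of zram_bvec_write(), before compressing or allocating
anything:

    if (zram_is_full(zram))
            return -ENOMEM;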



>
> +       atomic_set(&zram->alloc_fail, 0);
>         update_used_max(zram, alloced_pages);
>
>         cmem = zs_map_object(meta->mem_pool, handle, ZS_MM_WO);
> @@ -951,6 +954,7 @@ static int zram_slot_free_notify(struct block_device *bdev,
>         return 0;
>  }
>
> +#define FULL_THRESH_HOLD 32
>  static int zram_get_free_pages(struct block_device *bdev, long *free)
>  {
>         struct zram *zram;
> @@ -959,12 +963,13 @@ static int zram_get_free_pages(struct block_device *bdev, long *free)
>         zram = bdev->bd_disk->private_data;
>         meta = zram->meta;
>
> -       if (!zram->limit_pages)
> -               return 1;
> -
> -       *free = zram->limit_pages - zs_get_total_pages(meta->mem_pool);
> +       if (zram->limit_pages &&
> +               (atomic_read(&zram->alloc_fail) > FULL_THRESH_HOLD)) {
> +               *free = 0;
> +               return 0;
> +       }
>
> -       return 0;
> +       return 1;

There's no way that zram can even provide an accurate number of free
pages, since it can't know how compressible future stored pages will
be.  It would be better to simply change this swap_hint from GET_FREE
to IS_FULL, and return either true or false.
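
For example, reusing the zram_is_full() helper sketched above (SWAP_IS_FULL
being a hypothetical hint name, wired up the same way as SWAP_SLOT_FREE and
SWAP_GET_FREE):

    static int zram_is_full_hint(struct block_device *bdev, bool *full)
    {
            struct zram *zram = bdev->bd_disk->private_data;

            *full = zram_is_full(zram);
            return 0;
    }

    /* and in zram_swap_hint(): */
    else if (hint == SWAP_IS_FULL)
            ret = zram_is_full_hint(bdev, arg);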


>  }
>
>  static int zram_swap_hint(struct block_device *bdev,
> diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
> index 779d03fa4360..182a2544751b 100644
> --- a/drivers/block/zram/zram_drv.h
> +++ b/drivers/block/zram/zram_drv.h
> @@ -115,6 +115,7 @@ struct zram {
>         u64 disksize;   /* bytes */
>         int max_comp_streams;
>         struct zram_stats stats;
> +       atomic_t alloc_fail;
>         /*
>          * the number of pages zram can consume for storing compressed data
>          */
> --
> 2.0.0
>
>>
>> heesub
>>
>> >+
>> >+    return 0;
>> >+}
>> >+
>> >  static int zram_swap_hint(struct block_device *bdev,
>> >                             unsigned int hint, void *arg)
>> >  {
>> >@@ -958,6 +974,8 @@ static int zram_swap_hint(struct block_device *bdev,
>> >
>> >     if (hint == SWAP_SLOT_FREE)
>> >             ret = zram_slot_free_notify(bdev, (unsigned long)arg);
>> >+    else if (hint == SWAP_GET_FREE)
>> >+            ret = zram_get_free_pages(bdev, arg);
>> >
>> >     return ret;
>> >  }
>> >
>>
>
> --
> Kind regards,
> Minchan Kim

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC 2/3] mm: add swap_get_free hint for zram
  2014-09-13 19:01     ` Dan Streetman
@ 2014-09-15  0:30       ` Minchan Kim
  -1 siblings, 0 replies; 40+ messages in thread
From: Minchan Kim @ 2014-09-15  0:30 UTC (permalink / raw)
  To: Dan Streetman
  Cc: Andrew Morton, linux-kernel, Linux-MM, Hugh Dickins, Shaohua Li,
	Jerome Marchand, Sergey Senozhatsky, Nitin Gupta,
	Luigi Semenzato

On Sat, Sep 13, 2014 at 03:01:47PM -0400, Dan Streetman wrote:
> On Wed, Sep 3, 2014 at 9:39 PM, Minchan Kim <minchan@kernel.org> wrote:
> > VM uses nr_swap_pages as one of information when it does
> > anonymous reclaim so that VM is able to throttle amount of swap.
> >
> > Normally, the nr_swap_pages is equal to freeable space of swap disk
> > but for zram, it doesn't match because zram can limit memory usage
> > by knob(ie, mem_limit) so although VM can see lots of free space
> > from zram disk, zram can make fail intentionally once the allocated
> > space is over to limit. If it happens, VM should notice it and
> > stop reclaimaing until zram can obtain more free space but there
> > is a good way to do at the moment.
> >
> > This patch adds new hint SWAP_GET_FREE which zram can return how
> > many of freeable space it has. With using that, this patch adds
> > __swap_full which returns true if the zram is full and substract
> > remained freeable space of the zram-swap from nr_swap_pages.
> > IOW, VM sees there is no more swap space of zram so that it stops
> > anonymous reclaiming until swap_entry_free free a page and increase
> > nr_swap_pages again.
> >
> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > ---
> >  include/linux/blkdev.h |  1 +
> >  mm/swapfile.c          | 45 +++++++++++++++++++++++++++++++++++++++++++--
> >  2 files changed, 44 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> > index 17437b2c18e4..c1199806e0f1 100644
> > --- a/include/linux/blkdev.h
> > +++ b/include/linux/blkdev.h
> > @@ -1611,6 +1611,7 @@ static inline bool blk_integrity_is_initialized(struct gendisk *g)
> >
> >  enum swap_blk_hint {
> >         SWAP_SLOT_FREE,
> > +       SWAP_GET_FREE,
> >  };
> >
> >  struct block_device_operations {
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 4bff521e649a..72737e6dd5e5 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -484,6 +484,22 @@ new_cluster:
> >         *scan_base = tmp;
> >  }
> >
> > +static bool __swap_full(struct swap_info_struct *si)
> > +{
> > +       if (si->flags & SWP_BLKDEV) {
> > +               long free;
> > +               struct gendisk *disk = si->bdev->bd_disk;
> > +
> > +               if (disk->fops->swap_hint)
> > +                       if (!disk->fops->swap_hint(si->bdev,
> > +                                               SWAP_GET_FREE,
> > +                                               &free))
> > +                               return free <= 0;
> > +       }
> > +
> > +       return si->inuse_pages == si->pages;
> > +}
> > +
> >  static unsigned long scan_swap_map(struct swap_info_struct *si,
> >                                    unsigned char usage)
> >  {
> > @@ -583,11 +599,21 @@ checks:
> >         if (offset == si->highest_bit)
> >                 si->highest_bit--;
> >         si->inuse_pages++;
> > -       if (si->inuse_pages == si->pages) {
> > +       if (__swap_full(si)) {
> 
> This check is done after an available offset has already been
> selected.  So if the variable-size blkdev is full at this point, then
> this is incorrect, as swap will try to store a page at the current
> selected offset.

So the result is just a failed write, and then what happens?
The page becomes dirty again and stays in memory, so there is no harm.
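
If I remember mm/page_io.c correctly, the write error path does roughly the
following (paraphrasing from memory, not quoting the exact code), so the data
is never lost:

        /* end_swap_bio_write(), on I/O error */
        SetPageError(page);
        set_page_dirty(page);   /* keep the anon page around for a later retry */
        end_page_writeback(page);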

> 
> > +               struct gendisk *disk = si->bdev->bd_disk;
> > +
> >                 si->lowest_bit = si->max;
> >                 si->highest_bit = 0;
> >                 spin_lock(&swap_avail_lock);
> >                 plist_del(&si->avail_list, &swap_avail_head);
> > +               /*
> > +                * If zram is full, it decreases nr_swap_pages
> > +                * for stopping anonymous page reclaim until
> > +                * zram has free space. Look at swap_entry_free
> > +                */
> > +               if (disk->fops->swap_hint)
> 
> Simply checking for the existence of swap_hint isn't enough to know
> we're using zram...

Yes, but actually the hint has been used only by zram for several years.
If another user comes along in the future, we can add more checks if we
really need them at that time.
Do you have another idea?

> 
> > +                       atomic_long_sub(si->pages - si->inuse_pages,
> > +                               &nr_swap_pages);
> >                 spin_unlock(&swap_avail_lock);
> >         }
> >         si->swap_map[offset] = usage;
> > @@ -796,6 +822,7 @@ static unsigned char swap_entry_free(struct swap_info_struct *p,
> >
> >         /* free if no reference */
> >         if (!usage) {
> > +               struct gendisk *disk = p->bdev->bd_disk;
> >                 dec_cluster_info_page(p, p->cluster_info, offset);
> >                 if (offset < p->lowest_bit)
> >                         p->lowest_bit = offset;
> > @@ -808,6 +835,21 @@ static unsigned char swap_entry_free(struct swap_info_struct *p,
> >                                 if (plist_node_empty(&p->avail_list))
> >                                         plist_add(&p->avail_list,
> >                                                   &swap_avail_head);
> > +                               if ((p->flags & SWP_BLKDEV) &&
> > +                                       disk->fops->swap_hint) {
> 
> freeing an entry from a full variable-size blkdev doesn't mean it's
> not still full.  In this case with zsmalloc, freeing one handle
> doesn't actually free any memory unless it was the only handle left in
> its containing zspage, and therefore it's possible that it is still
> full at this point.

There is no need to free a whole zspage in zsmalloc.
If we free a page in a zspage, it means there is free space in that zspage,
so the user can be given a chance to write out a new page.

> 
> > +                                       atomic_long_add(p->pages -
> > +                                                       p->inuse_pages,
> > +                                                       &nr_swap_pages);
> > +                                       /*
> > +                                        * reset [highest|lowest]_bit to avoid
> > +                                        * scan_swap_map infinite looping if
> > +                                        * cached free cluster's index by
> > +                                        * scan_swap_map_try_ssd_cluster is
> > +                                        * above p->highest_bit.
> > +                                        */
> > +                                       p->highest_bit = p->max - 1;
> > +                                       p->lowest_bit = 1;
> 
> lowest_bit and highest_bit are likely to remain at those extremes for
> a long time, until 1 or max-1 is freed and re-allocated.
> 
> 
> By adding variable-size blkdev support to swap, I don't think
> highest_bit can be re-used as a "full" flag anymore.
> 
> Instead, I suggest that you add a "full" flag to struct
> swap_info_struct.  Then put a swap_hint GET_FREE check at the top of
> scan_swap_map(), and if full simply turn "full" on, remove the
> swap_info_struct from the avail list, reduce nr_swap_pages
> appropriately, and return failure.  Don't mess with lowest_bit or
> highest_bit at all.

Could you explain what logic in your suggestion prevents the problem
I mentioned (ie, scan_swap_map infinite looping)?

> 
> Then in swap_entry_free(), do something like:
> 
>     dec_cluster_info_page(p, p->cluster_info, offset);
>     if (offset < p->lowest_bit)
>       p->lowest_bit = offset;
> -   if (offset > p->highest_bit) {
> -     bool was_full = !p->highest_bit;
> +   if (offset > p->highest_bit)
>       p->highest_bit = offset;
> -     if (was_full && (p->flags & SWP_WRITEOK)) {
> +   if (p->full && p->flags & SWP_WRITEOK) {
> +     bool is_var_size_blkdev = is_variable_size_blkdev(p);
> +     bool blkdev_full = is_variable_size_blkdev_full(p);
> +
> +     if (!is_var_size_blkdev || !blkdev_full) {
> +       if (is_var_size_blkdev)
> +         atomic_long_add(p->pages - p->inuse_pages, &nr_swap_pages);
> +       p->full = false;
>         spin_lock(&swap_avail_lock);
>         WARN_ON(!plist_node_empty(&p->avail_list));
>         if (plist_node_empty(&p->avail_list))
>           plist_add(&p->avail_list,
>              &swap_avail_head);
>         spin_unlock(&swap_avail_lock);
> +     } else if (blkdev_full) {
> +       /* still full, so this page isn't actually
> +        * available yet to use; once non-full,
> +        * pages-inuse_pages will be the correct
> +        * number to add (above) since below will
> +        * inuse_pages--
> +        */
> +       atomic_long_dec(&nr_swap_pages);
>       }
>     }
>     atomic_long_inc(&nr_swap_pages);
> 
> 
> 
> > @@ -815,7 +857,6 @@ static unsigned char swap_entry_free(struct swap_info_struct *p,
> >                 p->inuse_pages--;
> >                 frontswap_invalidate_page(p->type, offset);
> >                 if (p->flags & SWP_BLKDEV) {
> > -                       struct gendisk *disk = p->bdev->bd_disk;
> >                         if (disk->fops->swap_hint)
> >                                 disk->fops->swap_hint(p->bdev,
> >                                                 SWAP_SLOT_FREE,
> > --
> > 2.0.0
> >
> 

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC 3/3] zram: add swap_get_free hint
  2014-09-13 19:39         ` Dan Streetman
@ 2014-09-15  0:57           ` Minchan Kim
  -1 siblings, 0 replies; 40+ messages in thread
From: Minchan Kim @ 2014-09-15  0:57 UTC (permalink / raw)
  To: Dan Streetman
  Cc: Heesub Shin, Andrew Morton, linux-kernel, Linux-MM, Hugh Dickins,
	Shaohua Li, Jerome Marchand, Sergey Senozhatsky, Nitin Gupta,
	Luigi Semenzato

On Sat, Sep 13, 2014 at 03:39:13PM -0400, Dan Streetman wrote:
> On Thu, Sep 4, 2014 at 7:59 PM, Minchan Kim <minchan@kernel.org> wrote:
> > Hi Heesub,
> >
> > On Thu, Sep 04, 2014 at 03:26:14PM +0900, Heesub Shin wrote:
> >> Hello Minchan,
> >>
> >> First of all, I agree with the overall purpose of your patch set.
> >
> > Thank you.
> >
> >>
> >> On 09/04/2014 10:39 AM, Minchan Kim wrote:
> >> >This patch implement SWAP_GET_FREE handler in zram so that VM can
> >> >know how many zram has freeable space.
> >> >VM can use it to stop anonymous reclaiming once zram is full.
> >> >
> >> >Signed-off-by: Minchan Kim <minchan@kernel.org>
> >> >---
> >> >  drivers/block/zram/zram_drv.c | 18 ++++++++++++++++++
> >> >  1 file changed, 18 insertions(+)
> >> >
> >> >diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> >> >index 88661d62e46a..8e22b20aa2db 100644
> >> >--- a/drivers/block/zram/zram_drv.c
> >> >+++ b/drivers/block/zram/zram_drv.c
> >> >@@ -951,6 +951,22 @@ static int zram_slot_free_notify(struct block_device *bdev,
> >> >     return 0;
> >> >  }
> >> >
> >> >+static int zram_get_free_pages(struct block_device *bdev, long *free)
> >> >+{
> >> >+    struct zram *zram;
> >> >+    struct zram_meta *meta;
> >> >+
> >> >+    zram = bdev->bd_disk->private_data;
> >> >+    meta = zram->meta;
> >> >+
> >> >+    if (!zram->limit_pages)
> >> >+            return 1;
> >> >+
> >> >+    *free = zram->limit_pages - zs_get_total_pages(meta->mem_pool);
> >>
> >> Even if 'free' is zero here, there may be free spaces available to
> >> store more compressed pages into the zs_pool. I mean calculation
> >> above is not quite accurate and wastes memory, but have no better
> >> idea for now.
> >
> > Yeb, good point.
> >
> > Actually, I thought about that but in this patchset, I wanted to
> > go with conservative approach which is a safe guard to prevent
> > system hang which is terrible than early OOM kill.
> >
> > Whole point of this patchset is to add a facility to VM and VM
> > collaborates with zram via the interface to avoid worst case
> > (ie, system hang) and logic to throttle could be enhanced by
> > several approaches in future but I agree my logic was too simple
> > and conservative.
> >
> > We could improve it with [anti|de]fragmentation in future but
> > at the moment, below simple heuristic is not too bad for first
> > step. :)
> >
> >
> > ---
> >  drivers/block/zram/zram_drv.c | 15 ++++++++++-----
> >  drivers/block/zram/zram_drv.h |  1 +
> >  2 files changed, 11 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> > index 8e22b20aa2db..af9dfe6a7d2b 100644
> > --- a/drivers/block/zram/zram_drv.c
> > +++ b/drivers/block/zram/zram_drv.c
> > @@ -410,6 +410,7 @@ static bool zram_free_page(struct zram *zram, size_t index)
> >         atomic64_sub(zram_get_obj_size(meta, index),
> >                         &zram->stats.compr_data_size);
> >         atomic64_dec(&zram->stats.pages_stored);
> > +       atomic_set(&zram->alloc_fail, 0);
> >
> >         meta->table[index].handle = 0;
> >         zram_set_obj_size(meta, index, 0);
> > @@ -600,10 +601,12 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
> >         alloced_pages = zs_get_total_pages(meta->mem_pool);
> >         if (zram->limit_pages && alloced_pages > zram->limit_pages) {
> >                 zs_free(meta->mem_pool, handle);
> > +               atomic_inc(&zram->alloc_fail);
> >                 ret = -ENOMEM;
> >                 goto out;
> >         }
> 
> This isn't going to work well at all with swap.  There will be,
> minimum, 32 failures to write a swap page before GET_FREE finally
> indicates it's full, and even then a single free during those 32
> failures will restart the counter, so it could be dozens or hundreds
> (or more) swap write failures before the zram device is marked as
> full.  And then, a single zram free will move it back to non-full and
> start the write failures over again.
> 
> I think it would be better to just check for actual fullness (i.e.
> alloced_pages > limit_pages) at the start of write, and fail if so.
> That will allow a single write to succeed when it crosses into
> fullness, and then if GET_FREE is changed to a simple IS_FULL and uses
> the same check (alloced_pages > limit_pages), then swap shouldn't see
> any write failures (or very few), and zram will stay full until enough
> pages are freed that it really does move under limit_pages.

alloced_pages > limit_pages doesn't mean zram is full, so with your
approach it could kick OOM earlier, which is not what we want,
because our product uses zram to delay app killing by the low memory killer.

> 
> 
> 
> >
> > +       atomic_set(&zram->alloc_fail, 0);
> >         update_used_max(zram, alloced_pages);
> >
> >         cmem = zs_map_object(meta->mem_pool, handle, ZS_MM_WO);
> > @@ -951,6 +954,7 @@ static int zram_slot_free_notify(struct block_device *bdev,
> >         return 0;
> >  }
> >
> > +#define FULL_THRESH_HOLD 32
> >  static int zram_get_free_pages(struct block_device *bdev, long *free)
> >  {
> >         struct zram *zram;
> > @@ -959,12 +963,13 @@ static int zram_get_free_pages(struct block_device *bdev, long *free)
> >         zram = bdev->bd_disk->private_data;
> >         meta = zram->meta;
> >
> > -       if (!zram->limit_pages)
> > -               return 1;
> > -
> > -       *free = zram->limit_pages - zs_get_total_pages(meta->mem_pool);
> > +       if (zram->limit_pages &&
> > +               (atomic_read(&zram->alloc_fail) > FULL_THRESH_HOLD)) {
> > +               *free = 0;
> > +               return 0;
> > +       }
> >
> > -       return 0;
> > +       return 1;
> 
> There's no way that zram can even provide an accurate number of free
> pages, since it can't know how compressible future stored pages will
> be.  It would be better to simply change this swap_hint from GET_FREE
> to IS_FULL, and return either true or false.

My plan is that we can give an approximation based on
orig_data_size/compr_data_size, with some tweaking for zero pages, and vmscan
can use the hint via get_nr_swap_pages to throttle the file/anon balance, but
I want to go step by step so I didn't include that hint here.
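
For illustration only (this is not in the patches), such an estimate could
look roughly like:

        /*
         * hypothetical: scale the remaining pool pages by the observed
         * compression ratio; zero-filled pages would need extra handling.
         */
        static long zram_estimate_free(struct zram *zram)
        {
                u64 orig = atomic64_read(&zram->stats.pages_stored) << PAGE_SHIFT;
                u64 compr = atomic64_read(&zram->stats.compr_data_size);
                long left = zram->limit_pages -
                                zs_get_total_pages(zram->meta->mem_pool);

                if (left <= 0 || !compr)
                        return 0;
                return div64_u64((u64)left * orig, compr);
        }
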
If you are strongly against that at this stage, I can change it and
try it later with supporting numbers.
Please say so if you want.

Thanks for the review!


> 
> 
> >  }
> >
> >  static int zram_swap_hint(struct block_device *bdev,
> > diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
> > index 779d03fa4360..182a2544751b 100644
> > --- a/drivers/block/zram/zram_drv.h
> > +++ b/drivers/block/zram/zram_drv.h
> > @@ -115,6 +115,7 @@ struct zram {
> >         u64 disksize;   /* bytes */
> >         int max_comp_streams;
> >         struct zram_stats stats;
> > +       atomic_t alloc_fail;
> >         /*
> >          * the number of pages zram can consume for storing compressed data
> >          */
> > --
> > 2.0.0
> >
> >>
> >> heesub
> >>
> >> >+
> >> >+    return 0;
> >> >+}
> >> >+
> >> >  static int zram_swap_hint(struct block_device *bdev,
> >> >                             unsigned int hint, void *arg)
> >> >  {
> >> >@@ -958,6 +974,8 @@ static int zram_swap_hint(struct block_device *bdev,
> >> >
> >> >     if (hint == SWAP_SLOT_FREE)
> >> >             ret = zram_slot_free_notify(bdev, (unsigned long)arg);
> >> >+    else if (hint == SWAP_GET_FREE)
> >> >+            ret = zram_get_free_pages(bdev, arg);
> >> >
> >> >     return ret;
> >> >  }
> >> >
> >>
> >
> > --
> > Kind regards,
> > Minchan Kim
> 

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC 2/3] mm: add swap_get_free hint for zram
  2014-09-15  0:30       ` Minchan Kim
@ 2014-09-15 14:53         ` Dan Streetman
  -1 siblings, 0 replies; 40+ messages in thread
From: Dan Streetman @ 2014-09-15 14:53 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, Linux-MM, Hugh Dickins, Shaohua Li,
	Jerome Marchand, Sergey Senozhatsky, Nitin Gupta,
	Luigi Semenzato

On Sun, Sep 14, 2014 at 8:30 PM, Minchan Kim <minchan@kernel.org> wrote:
> On Sat, Sep 13, 2014 at 03:01:47PM -0400, Dan Streetman wrote:
>> On Wed, Sep 3, 2014 at 9:39 PM, Minchan Kim <minchan@kernel.org> wrote:
>> > VM uses nr_swap_pages as one of information when it does
>> > anonymous reclaim so that VM is able to throttle amount of swap.
>> >
>> > Normally, the nr_swap_pages is equal to freeable space of swap disk
>> > but for zram, it doesn't match because zram can limit memory usage
>> > by knob(ie, mem_limit) so although VM can see lots of free space
>> > from zram disk, zram can make fail intentionally once the allocated
>> > space is over to limit. If it happens, VM should notice it and
>> > stop reclaimaing until zram can obtain more free space but there
>> > is a good way to do at the moment.
>> >
>> > This patch adds new hint SWAP_GET_FREE which zram can return how
>> > many of freeable space it has. With using that, this patch adds
>> > __swap_full which returns true if the zram is full and substract
>> > remained freeable space of the zram-swap from nr_swap_pages.
>> > IOW, VM sees there is no more swap space of zram so that it stops
>> > anonymous reclaiming until swap_entry_free free a page and increase
>> > nr_swap_pages again.
>> >
>> > Signed-off-by: Minchan Kim <minchan@kernel.org>
>> > ---
>> >  include/linux/blkdev.h |  1 +
>> >  mm/swapfile.c          | 45 +++++++++++++++++++++++++++++++++++++++++++--
>> >  2 files changed, 44 insertions(+), 2 deletions(-)
>> >
>> > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>> > index 17437b2c18e4..c1199806e0f1 100644
>> > --- a/include/linux/blkdev.h
>> > +++ b/include/linux/blkdev.h
>> > @@ -1611,6 +1611,7 @@ static inline bool blk_integrity_is_initialized(struct gendisk *g)
>> >
>> >  enum swap_blk_hint {
>> >         SWAP_SLOT_FREE,
>> > +       SWAP_GET_FREE,
>> >  };
>> >
>> >  struct block_device_operations {
>> > diff --git a/mm/swapfile.c b/mm/swapfile.c
>> > index 4bff521e649a..72737e6dd5e5 100644
>> > --- a/mm/swapfile.c
>> > +++ b/mm/swapfile.c
>> > @@ -484,6 +484,22 @@ new_cluster:
>> >         *scan_base = tmp;
>> >  }
>> >
>> > +static bool __swap_full(struct swap_info_struct *si)
>> > +{
>> > +       if (si->flags & SWP_BLKDEV) {
>> > +               long free;
>> > +               struct gendisk *disk = si->bdev->bd_disk;
>> > +
>> > +               if (disk->fops->swap_hint)
>> > +                       if (!disk->fops->swap_hint(si->bdev,
>> > +                                               SWAP_GET_FREE,
>> > +                                               &free))
>> > +                               return free <= 0;
>> > +       }
>> > +
>> > +       return si->inuse_pages == si->pages;
>> > +}
>> > +
>> >  static unsigned long scan_swap_map(struct swap_info_struct *si,
>> >                                    unsigned char usage)
>> >  {
>> > @@ -583,11 +599,21 @@ checks:
>> >         if (offset == si->highest_bit)
>> >                 si->highest_bit--;
>> >         si->inuse_pages++;
>> > -       if (si->inuse_pages == si->pages) {
>> > +       if (__swap_full(si)) {
>>
>> This check is done after an available offset has already been
>> selected.  So if the variable-size blkdev is full at this point, then
>> this is incorrect, as swap will try to store a page at the current
>> selected offset.
>
> So the result is just fail of a write then what happens?
> Page become redirty and keep it in memory so there is no harm.

Happening once, it's not a big deal.  But it's not as good as not
happening at all.

>
>>
>> > +               struct gendisk *disk = si->bdev->bd_disk;
>> > +
>> >                 si->lowest_bit = si->max;
>> >                 si->highest_bit = 0;
>> >                 spin_lock(&swap_avail_lock);
>> >                 plist_del(&si->avail_list, &swap_avail_head);
>> > +               /*
>> > +                * If zram is full, it decreases nr_swap_pages
>> > +                * for stopping anonymous page reclaim until
>> > +                * zram has free space. Look at swap_entry_free
>> > +                */
>> > +               if (disk->fops->swap_hint)
>>
>> Simply checking for the existence of swap_hint isn't enough to know
>> we're using zram...
>
> Yes but acutally the hint have been used for only zram for several years.
> If other user is coming in future, we would add more checks if we really
> need it at that time.
> Do you have another idea?

Well, if this hint == zram, then just rename it zram - especially if
it's now going to be explicitly used to mean it == zram.  But I don't
think that is necessary.

>
>>
>> > +                       atomic_long_sub(si->pages - si->inuse_pages,
>> > +                               &nr_swap_pages);
>> >                 spin_unlock(&swap_avail_lock);
>> >         }
>> >         si->swap_map[offset] = usage;
>> > @@ -796,6 +822,7 @@ static unsigned char swap_entry_free(struct swap_info_struct *p,
>> >
>> >         /* free if no reference */
>> >         if (!usage) {
>> > +               struct gendisk *disk = p->bdev->bd_disk;
>> >                 dec_cluster_info_page(p, p->cluster_info, offset);
>> >                 if (offset < p->lowest_bit)
>> >                         p->lowest_bit = offset;
>> > @@ -808,6 +835,21 @@ static unsigned char swap_entry_free(struct swap_info_struct *p,
>> >                                 if (plist_node_empty(&p->avail_list))
>> >                                         plist_add(&p->avail_list,
>> >                                                   &swap_avail_head);
>> > +                               if ((p->flags & SWP_BLKDEV) &&
>> > +                                       disk->fops->swap_hint) {
>>
>> freeing an entry from a full variable-size blkdev doesn't mean it's
>> not still full.  In this case with zsmalloc, freeing one handle
>> doesn't actually free any memory unless it was the only handle left in
>> its containing zspage, and therefore it's possible that it is still
>> full at this point.
>
> No need to free a zspage in zsmalloc.
> If we free a page in zspage, it means we have free space in zspage
> so user can give a chance to user for writing out new page.

That's not actually true: since zsmalloc has 255 different class
sizes, freeing one page means the next page to be compressed has only
a 1/255 chance of falling into the same size class as the just-freed
page (assuming random page compressibility).

>
>>
>> > +                                       atomic_long_add(p->pages -
>> > +                                                       p->inuse_pages,
>> > +                                                       &nr_swap_pages);
>> > +                                       /*
>> > +                                        * reset [highest|lowest]_bit to avoid
>> > +                                        * scan_swap_map infinite looping if
>> > +                                        * cached free cluster's index by
>> > +                                        * scan_swap_map_try_ssd_cluster is
>> > +                                        * above p->highest_bit.
>> > +                                        */
>> > +                                       p->highest_bit = p->max - 1;
>> > +                                       p->lowest_bit = 1;
>>
>> lowest_bit and highest_bit are likely to remain at those extremes for
>> a long time, until 1 or max-1 is freed and re-allocated.
>>
>>
>> By adding variable-size blkdev support to swap, I don't think
>> highest_bit can be re-used as a "full" flag anymore.
>>
>> Instead, I suggest that you add a "full" flag to struct
>> swap_info_struct.  Then put a swap_hint GET_FREE check at the top of
>> scan_swap_map(), and if full simply turn "full" on, remove the
>> swap_info_struct from the avail list, reduce nr_swap_pages
>> appropriately, and return failure.  Don't mess with lowest_bit or
>> highest_bit at all.
>
> Could you explain what logic in your suggestion prevent the problem
> I mentioned(ie, scan_swap_map infinite looping)?

scan_swap_map would immediately exit since the GET_FREE (or IS_FULL)
check is done at its start.  And it wouldn't be called again with that
swap_info_struct until non-full since it is removed from the
avail_list.
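
For concreteness, a sketch of how that up-front check could look
(SWAP_IS_FULL, the si->full field and the helper itself are
hypothetical, and the bookkeeping is simplified):

/* called first thing in scan_swap_map(); a "true" result means the
 * caller just returns 0 (failure) without picking any offset */
static bool swap_device_full(struct swap_info_struct *si)
{
	bool full = false;

	if ((si->flags & SWP_BLKDEV) && si->bdev->bd_disk->fops->swap_hint)
		si->bdev->bd_disk->fops->swap_hint(si->bdev, SWAP_IS_FULL,
						   &full);
	if (full) {
		si->full = true;
		spin_lock(&swap_avail_lock);
		if (!plist_node_empty(&si->avail_list))
			plist_del(&si->avail_list, &swap_avail_head);
		spin_unlock(&swap_avail_lock);
		atomic_long_sub(si->pages - si->inuse_pages, &nr_swap_pages);
	}
	return full;
}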

>
>>
>> Then in swap_entry_free(), do something like:
>>
>>     dec_cluster_info_page(p, p->cluster_info, offset);
>>     if (offset < p->lowest_bit)
>>       p->lowest_bit = offset;
>> -   if (offset > p->highest_bit) {
>> -     bool was_full = !p->highest_bit;
>> +   if (offset > p->highest_bit)
>>       p->highest_bit = offset;
>> -     if (was_full && (p->flags & SWP_WRITEOK)) {
>> +   if (p->full && p->flags & SWP_WRITEOK) {
>> +     bool is_var_size_blkdev = is_variable_size_blkdev(p);
>> +     bool blkdev_full = is_variable_size_blkdev_full(p);
>> +
>> +     if (!is_var_size_blkdev || !blkdev_full) {
>> +       if (is_var_size_blkdev)
>> +         atomic_long_add(p->pages - p->inuse_pages, &nr_swap_pages);
>> +       p->full = false;
>>         spin_lock(&swap_avail_lock);
>>         WARN_ON(!plist_node_empty(&p->avail_list));
>>         if (plist_node_empty(&p->avail_list))
>>           plist_add(&p->avail_list,
>>              &swap_avail_head);
>>         spin_unlock(&swap_avail_lock);
>> +     } else if (blkdev_full) {
>> +       /* still full, so this page isn't actually
>> +        * available yet to use; once non-full,
>> +        * pages-inuse_pages will be the correct
>> +        * number to add (above) since below will
>> +        * inuse_pages--
>> +        */
>> +       atomic_long_dec(&nr_swap_pages);
>>       }
>>     }
>>     atomic_long_inc(&nr_swap_pages);
>>
>>
>>
>> > @@ -815,7 +857,6 @@ static unsigned char swap_entry_free(struct swap_info_struct *p,
>> >                 p->inuse_pages--;
>> >                 frontswap_invalidate_page(p->type, offset);
>> >                 if (p->flags & SWP_BLKDEV) {
>> > -                       struct gendisk *disk = p->bdev->bd_disk;
>> >                         if (disk->fops->swap_hint)
>> >                                 disk->fops->swap_hint(p->bdev,
>> >                                                 SWAP_SLOT_FREE,
>> > --
>> > 2.0.0
>> >
>>
>
> --
> Kind regards,
> Minchan Kim

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC 3/3] zram: add swap_get_free hint
  2014-09-15  0:57           ` Minchan Kim
@ 2014-09-15 16:00             ` Dan Streetman
  -1 siblings, 0 replies; 40+ messages in thread
From: Dan Streetman @ 2014-09-15 16:00 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Heesub Shin, Andrew Morton, linux-kernel, Linux-MM, Hugh Dickins,
	Shaohua Li, Jerome Marchand, Sergey Senozhatsky, Nitin Gupta,
	Luigi Semenzato

On Sun, Sep 14, 2014 at 8:57 PM, Minchan Kim <minchan@kernel.org> wrote:
> On Sat, Sep 13, 2014 at 03:39:13PM -0400, Dan Streetman wrote:
>> On Thu, Sep 4, 2014 at 7:59 PM, Minchan Kim <minchan@kernel.org> wrote:
>> > Hi Heesub,
>> >
>> > On Thu, Sep 04, 2014 at 03:26:14PM +0900, Heesub Shin wrote:
>> >> Hello Minchan,
>> >>
>> >> First of all, I agree with the overall purpose of your patch set.
>> >
>> > Thank you.
>> >
>> >>
>> >> On 09/04/2014 10:39 AM, Minchan Kim wrote:
>> >> >This patch implement SWAP_GET_FREE handler in zram so that VM can
>> >> >know how many zram has freeable space.
>> >> >VM can use it to stop anonymous reclaiming once zram is full.
>> >> >
>> >> >Signed-off-by: Minchan Kim <minchan@kernel.org>
>> >> >---
>> >> >  drivers/block/zram/zram_drv.c | 18 ++++++++++++++++++
>> >> >  1 file changed, 18 insertions(+)
>> >> >
>> >> >diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
>> >> >index 88661d62e46a..8e22b20aa2db 100644
>> >> >--- a/drivers/block/zram/zram_drv.c
>> >> >+++ b/drivers/block/zram/zram_drv.c
>> >> >@@ -951,6 +951,22 @@ static int zram_slot_free_notify(struct block_device *bdev,
>> >> >     return 0;
>> >> >  }
>> >> >
>> >> >+static int zram_get_free_pages(struct block_device *bdev, long *free)
>> >> >+{
>> >> >+    struct zram *zram;
>> >> >+    struct zram_meta *meta;
>> >> >+
>> >> >+    zram = bdev->bd_disk->private_data;
>> >> >+    meta = zram->meta;
>> >> >+
>> >> >+    if (!zram->limit_pages)
>> >> >+            return 1;
>> >> >+
>> >> >+    *free = zram->limit_pages - zs_get_total_pages(meta->mem_pool);
>> >>
>> >> Even if 'free' is zero here, there may be free spaces available to
>> >> store more compressed pages into the zs_pool. I mean calculation
>> >> above is not quite accurate and wastes memory, but have no better
>> >> idea for now.
>> >
>> > Yeb, good point.
>> >
>> > Actually, I thought about that but in this patchset, I wanted to
>> > go with conservative approach which is a safe guard to prevent
>> > system hang which is terrible than early OOM kill.
>> >
>> > Whole point of this patchset is to add a facility to VM and VM
>> > collaborates with zram via the interface to avoid worst case
>> > (ie, system hang) and logic to throttle could be enhanced by
>> > several approaches in future but I agree my logic was too simple
>> > and conservative.
>> >
>> > We could improve it with [anti|de]fragmentation in future but
>> > at the moment, below simple heuristic is not too bad for first
>> > step. :)
>> >
>> >
>> > ---
>> >  drivers/block/zram/zram_drv.c | 15 ++++++++++-----
>> >  drivers/block/zram/zram_drv.h |  1 +
>> >  2 files changed, 11 insertions(+), 5 deletions(-)
>> >
>> > diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
>> > index 8e22b20aa2db..af9dfe6a7d2b 100644
>> > --- a/drivers/block/zram/zram_drv.c
>> > +++ b/drivers/block/zram/zram_drv.c
>> > @@ -410,6 +410,7 @@ static bool zram_free_page(struct zram *zram, size_t index)
>> >         atomic64_sub(zram_get_obj_size(meta, index),
>> >                         &zram->stats.compr_data_size);
>> >         atomic64_dec(&zram->stats.pages_stored);
>> > +       atomic_set(&zram->alloc_fail, 0);
>> >
>> >         meta->table[index].handle = 0;
>> >         zram_set_obj_size(meta, index, 0);
>> > @@ -600,10 +601,12 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
>> >         alloced_pages = zs_get_total_pages(meta->mem_pool);
>> >         if (zram->limit_pages && alloced_pages > zram->limit_pages) {
>> >                 zs_free(meta->mem_pool, handle);
>> > +               atomic_inc(&zram->alloc_fail);
>> >                 ret = -ENOMEM;
>> >                 goto out;
>> >         }
>>
>> This isn't going to work well at all with swap.  There will be,
>> minimum, 32 failures to write a swap page before GET_FREE finally
>> indicates it's full, and even then a single free during those 32
>> failures will restart the counter, so it could be dozens or hundreds
>> (or more) swap write failures before the zram device is marked as
>> full.  And then, a single zram free will move it back to non-full and
>> start the write failures over again.
>>
>> I think it would be better to just check for actual fullness (i.e.
>> alloced_pages > limit_pages) at the start of write, and fail if so.
>> That will allow a single write to succeed when it crosses into
>> fullness, and the if GET_FREE is changed to a simple IS_FULL and uses
>> the same check (alloced_pages > limit_pages), then swap shouldn't see
>> any write failures (or very few), and zram will stay full until enough
>> pages are freed that it really does move under limit_pages.
>
> The alloced_pages > limit_pages doesn't mean zram is full so with your
> approach, it could kick OOM earlier which is not what we want.
> Because our product uses zram to delay app killing by low memory killer.

With zram, the meaning of "full" isn't as obvious as it is for other
fixed-size storage devices.  Normally "full" means "no more room to
store anything", while "not full" means "there is room to store
anything, up to the remaining free size".  With zram, the zsmalloc
pool size might be over the specified limit, yet there is still room
to store *some* things - but not *anything*: only compressed pages
that happen to fit inside a class with at least one zspage that isn't
full.

Clearly, we shouldn't wait to declare zram "full" until zsmalloc is
100% full in all its classes.

What about waiting until there have been N write failures, like this
patch does?  That doesn't seem very fair to the writer, since each
write failure causes extra work (first in selecting what to write,
and then in recovering from the failed write).  However, it will
probably squeeze some writes into the empty spaces in
already-allocated zspages.

And declaring zram "full" immediately once the zsmalloc pool size
grows past the specified limit?  Since zsmalloc's classes almost
certainly contain some fragmentation, that wastes all the empty
spaces that could still store more compressed pages.  But this is the
point past which you can no longer guarantee that every write will be
able to store a compressed page - any zsmalloc class without a
partially empty zspage would have to grow zsmalloc's size, and
therefore the write fails.
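
For reference, the "fail at the limit" variant discussed here boils
down to a check like this sketch (hypothetical helper, not a posted
patch), called at the top of the write path and/or from an IS_FULL
hint:

/* "full" == the zsmalloc pool has already reached the mem_limit */
static bool zram_over_limit(struct zram *zram)
{
	return zram->limit_pages &&
	       zs_get_total_pages(zram->meta->mem_pool) >= zram->limit_pages;
}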

Neither definition of "full" is optimal.  Since in this case we're
talking about swap, I think forcing swap write failures - which with
direct reclaim could (I believe) stall everything while the failures
continue - should be avoided as much as possible.  Even when marking
zram full is delayed by N write failures in order to squeeze as much
storage out of zsmalloc as possible, once it does eventually fill, the
system will OOM anyway if zram is the only swap device.  And if zram
isn't the only swap device, but just the first (highest priority),
then delaying things with unneeded write failures is certainly no
better than just filling up so swap can move on to the next swap
device.  The only case where delaying the "full" marking with write
failures helps is if the system stops needing more memory right at
that point and then starts needing less.  That seems like a very
unlikely coincidence, but maybe some testing would help determine how
badly the write failures affect system performance/responsiveness and
how long they delay the OOM.

Since there may be different use cases that desire different things,
maybe there should be a zram runtime (or buildtime) config to choose
exactly how it decides it's full?  Either full after N write failures,
or full when alloced>limit?  That would allow the user to either defer
getting full as long as possible (at the possible cost of system
unresponsiveness during those write failures), or to just move
immediately to zram being full as soon as it can't guarantee that each
write will succeed.
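
Purely as an illustration of that idea - the attribute, enum and
field below are all made up, not something anyone has posted:

enum zram_full_policy {
	ZRAM_FULL_ON_FAIL_THRESH,	/* full after N failed writes */
	ZRAM_FULL_ON_LIMIT,		/* full once alloced > limit */
};

static ssize_t full_policy_store(struct device *dev,
				 struct device_attribute *attr,
				 const char *buf, size_t len)
{
	struct zram *zram = dev_to_zram(dev);

	if (sysfs_streq(buf, "fail_thresh"))
		zram->full_policy = ZRAM_FULL_ON_FAIL_THRESH;
	else if (sysfs_streq(buf, "limit"))
		zram->full_policy = ZRAM_FULL_ON_LIMIT;
	else
		return -EINVAL;

	return len;
}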



>
>>
>>
>>
>> >
>> > +       atomic_set(&zram->alloc_fail, 0);
>> >         update_used_max(zram, alloced_pages);
>> >
>> >         cmem = zs_map_object(meta->mem_pool, handle, ZS_MM_WO);
>> > @@ -951,6 +954,7 @@ static int zram_slot_free_notify(struct block_device *bdev,
>> >         return 0;
>> >  }
>> >
>> > +#define FULL_THRESH_HOLD 32
>> >  static int zram_get_free_pages(struct block_device *bdev, long *free)
>> >  {
>> >         struct zram *zram;
>> > @@ -959,12 +963,13 @@ static int zram_get_free_pages(struct block_device *bdev, long *free)
>> >         zram = bdev->bd_disk->private_data;
>> >         meta = zram->meta;
>> >
>> > -       if (!zram->limit_pages)
>> > -               return 1;
>> > -
>> > -       *free = zram->limit_pages - zs_get_total_pages(meta->mem_pool);
>> > +       if (zram->limit_pages &&
>> > +               (atomic_read(&zram->alloc_fail) > FULL_THRESH_HOLD)) {
>> > +               *free = 0;
>> > +               return 0;
>> > +       }
>> >
>> > -       return 0;
>> > +       return 1;
>>
>> There's no way that zram can even provide a accurate number of free
>> pages, since it can't know how compressible future stored pages will
>> be.  It would be better to simply change this swap_hint from GET_FREE
>> to IS_FULL, and return either true or false.
>
> My plan is that we can give an approximation based on
> orig_data_size/compr_data_size with tweaking zero page and vmscan can use
> the hint from get_nr_swap_pages to throttle file/anon balance but I want to do
> step by step so I didn't include the hint.
> If you are strong against with that in this stage, I can change it and
> try it later with the number.
> Please, say again if you want.

Since, as you said, zram is the only user of swap_hint, changing it
later shouldn't be a big deal.  And you could have both IS_FULL and
GET_FREE; since the check in scan_swap_map() is really only checking
for IS_FULL, if you later update vmscan to adjust its file/anon
balance based on GET_FREE, that can be added then with no trouble,
right?
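
i.e. the swap_blk_hint enum could simply grow an extra entry when that
happens - a sketch only, since SWAP_IS_FULL isn't in the posted series:

enum swap_blk_hint {
	SWAP_SLOT_FREE,
	SWAP_IS_FULL,	/* arg: bool *,  "can I stop trying to write?"  */
	SWAP_GET_FREE,	/* arg: long *,  estimate for file/anon balance */
};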


>
> Thanks for the review!
>
>
>>
>>
>> >  }
>> >
>> >  static int zram_swap_hint(struct block_device *bdev,
>> > diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
>> > index 779d03fa4360..182a2544751b 100644
>> > --- a/drivers/block/zram/zram_drv.h
>> > +++ b/drivers/block/zram/zram_drv.h
>> > @@ -115,6 +115,7 @@ struct zram {
>> >         u64 disksize;   /* bytes */
>> >         int max_comp_streams;
>> >         struct zram_stats stats;
>> > +       atomic_t alloc_fail;
>> >         /*
>> >          * the number of pages zram can consume for storing compressed data
>> >          */
>> > --
>> > 2.0.0
>> >
>> >>
>> >> heesub
>> >>
>> >> >+
>> >> >+    return 0;
>> >> >+}
>> >> >+
>> >> >  static int zram_swap_hint(struct block_device *bdev,
>> >> >                             unsigned int hint, void *arg)
>> >> >  {
>> >> >@@ -958,6 +974,8 @@ static int zram_swap_hint(struct block_device *bdev,
>> >> >
>> >> >     if (hint == SWAP_SLOT_FREE)
>> >> >             ret = zram_slot_free_notify(bdev, (unsigned long)arg);
>> >> >+    else if (hint == SWAP_GET_FREE)
>> >> >+            ret = zram_get_free_pages(bdev, arg);
>> >> >
>> >> >     return ret;
>> >> >  }
>> >> >
>> >>
>> >
>> > --
>> > Kind regards,
>> > Minchan Kim
>>
>
> --
> Kind regards,
> Minchan Kim

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC 2/3] mm: add swap_get_free hint for zram
  2014-09-15 14:53         ` Dan Streetman
@ 2014-09-16  0:33           ` Minchan Kim
  -1 siblings, 0 replies; 40+ messages in thread
From: Minchan Kim @ 2014-09-16  0:33 UTC (permalink / raw)
  To: Dan Streetman
  Cc: Andrew Morton, linux-kernel, Linux-MM, Hugh Dickins, Shaohua Li,
	Jerome Marchand, Sergey Senozhatsky, Nitin Gupta,
	Luigi Semenzato

On Mon, Sep 15, 2014 at 10:53:01AM -0400, Dan Streetman wrote:
> On Sun, Sep 14, 2014 at 8:30 PM, Minchan Kim <minchan@kernel.org> wrote:
> > On Sat, Sep 13, 2014 at 03:01:47PM -0400, Dan Streetman wrote:
> >> On Wed, Sep 3, 2014 at 9:39 PM, Minchan Kim <minchan@kernel.org> wrote:
> >> > VM uses nr_swap_pages as one of information when it does
> >> > anonymous reclaim so that VM is able to throttle amount of swap.
> >> >
> >> > Normally, the nr_swap_pages is equal to freeable space of swap disk
> >> > but for zram, it doesn't match because zram can limit memory usage
> >> > by knob(ie, mem_limit) so although VM can see lots of free space
> >> > from zram disk, zram can make fail intentionally once the allocated
> >> > space is over to limit. If it happens, VM should notice it and
> >> > stop reclaimaing until zram can obtain more free space but there
> >> > is a good way to do at the moment.
> >> >
> >> > This patch adds new hint SWAP_GET_FREE which zram can return how
> >> > many of freeable space it has. With using that, this patch adds
> >> > __swap_full which returns true if the zram is full and substract
> >> > remained freeable space of the zram-swap from nr_swap_pages.
> >> > IOW, VM sees there is no more swap space of zram so that it stops
> >> > anonymous reclaiming until swap_entry_free free a page and increase
> >> > nr_swap_pages again.
> >> >
> >> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> >> > ---
> >> >  include/linux/blkdev.h |  1 +
> >> >  mm/swapfile.c          | 45 +++++++++++++++++++++++++++++++++++++++++++--
> >> >  2 files changed, 44 insertions(+), 2 deletions(-)
> >> >
> >> > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> >> > index 17437b2c18e4..c1199806e0f1 100644
> >> > --- a/include/linux/blkdev.h
> >> > +++ b/include/linux/blkdev.h
> >> > @@ -1611,6 +1611,7 @@ static inline bool blk_integrity_is_initialized(struct gendisk *g)
> >> >
> >> >  enum swap_blk_hint {
> >> >         SWAP_SLOT_FREE,
> >> > +       SWAP_GET_FREE,
> >> >  };
> >> >
> >> >  struct block_device_operations {
> >> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> >> > index 4bff521e649a..72737e6dd5e5 100644
> >> > --- a/mm/swapfile.c
> >> > +++ b/mm/swapfile.c
> >> > @@ -484,6 +484,22 @@ new_cluster:
> >> >         *scan_base = tmp;
> >> >  }
> >> >
> >> > +static bool __swap_full(struct swap_info_struct *si)
> >> > +{
> >> > +       if (si->flags & SWP_BLKDEV) {
> >> > +               long free;
> >> > +               struct gendisk *disk = si->bdev->bd_disk;
> >> > +
> >> > +               if (disk->fops->swap_hint)
> >> > +                       if (!disk->fops->swap_hint(si->bdev,
> >> > +                                               SWAP_GET_FREE,
> >> > +                                               &free))
> >> > +                               return free <= 0;
> >> > +       }
> >> > +
> >> > +       return si->inuse_pages == si->pages;
> >> > +}
> >> > +
> >> >  static unsigned long scan_swap_map(struct swap_info_struct *si,
> >> >                                    unsigned char usage)
> >> >  {
> >> > @@ -583,11 +599,21 @@ checks:
> >> >         if (offset == si->highest_bit)
> >> >                 si->highest_bit--;
> >> >         si->inuse_pages++;
> >> > -       if (si->inuse_pages == si->pages) {
> >> > +       if (__swap_full(si)) {
> >>
> >> This check is done after an available offset has already been
> >> selected.  So if the variable-size blkdev is full at this point, then
> >> this is incorrect, as swap will try to store a page at the current
> >> selected offset.
> >
> > So the result is just fail of a write then what happens?
> > Page become redirty and keep it in memory so there is no harm.
> 
> Happening once, it's not a big deal.  But it's not as good as not
> happening at all.

With your suggestion, we would have to check for fullness whenever we
need a new swap slot. To me, that's a bigger concern than an
occasional write failure.

> 
> >
> >>
> >> > +               struct gendisk *disk = si->bdev->bd_disk;
> >> > +
> >> >                 si->lowest_bit = si->max;
> >> >                 si->highest_bit = 0;
> >> >                 spin_lock(&swap_avail_lock);
> >> >                 plist_del(&si->avail_list, &swap_avail_head);
> >> > +               /*
> >> > +                * If zram is full, it decreases nr_swap_pages
> >> > +                * for stopping anonymous page reclaim until
> >> > +                * zram has free space. Look at swap_entry_free
> >> > +                */
> >> > +               if (disk->fops->swap_hint)
> >>
> >> Simply checking for the existence of swap_hint isn't enough to know
> >> we're using zram...
> >
> > Yes but acutally the hint have been used for only zram for several years.
> > If other user is coming in future, we would add more checks if we really
> > need it at that time.
> > Do you have another idea?
> 
> Well if this hint == zram just rename it zram.  Especially if it's now
> going to be explicitly used to mean it == zram.  But I don't think
> that is necessary.

I'd like to clarify your comment: so, are you okay with leaving this unchanged?

> 
> >
> >>
> >> > +                       atomic_long_sub(si->pages - si->inuse_pages,
> >> > +                               &nr_swap_pages);
> >> >                 spin_unlock(&swap_avail_lock);
> >> >         }
> >> >         si->swap_map[offset] = usage;
> >> > @@ -796,6 +822,7 @@ static unsigned char swap_entry_free(struct swap_info_struct *p,
> >> >
> >> >         /* free if no reference */
> >> >         if (!usage) {
> >> > +               struct gendisk *disk = p->bdev->bd_disk;
> >> >                 dec_cluster_info_page(p, p->cluster_info, offset);
> >> >                 if (offset < p->lowest_bit)
> >> >                         p->lowest_bit = offset;
> >> > @@ -808,6 +835,21 @@ static unsigned char swap_entry_free(struct swap_info_struct *p,
> >> >                                 if (plist_node_empty(&p->avail_list))
> >> >                                         plist_add(&p->avail_list,
> >> >                                                   &swap_avail_head);
> >> > +                               if ((p->flags & SWP_BLKDEV) &&
> >> > +                                       disk->fops->swap_hint) {
> >>
> >> freeing an entry from a full variable-size blkdev doesn't mean it's
> >> not still full.  In this case with zsmalloc, freeing one handle
> >> doesn't actually free any memory unless it was the only handle left in
> >> its containing zspage, and therefore it's possible that it is still
> >> full at this point.
> >
> > No need to free a zspage in zsmalloc.
> > If we free a page in zspage, it means we have free space in zspage
> > so user can give a chance to user for writing out new page.
> 
> That's not actually true, since zsmalloc has 255 different class
> sizes, freeing one page means the next page to be compressed has a
> 1/255 chance that it will be the same size as the just-freed page
> (assuming random page compressability).

I said "a chance", so if there is a possibility, I'd like to try it.
Please don't tie your thinking to zsmalloc's internals. This is a
facility for swap and zram to communicate, not a zram-allocator
interface.
IOW, we could potentially change zram's allocator (as we have already
done historically), and that imaginary allocator, or an enhanced
zsmalloc, could have a technique to handle this.

> 
> >
> >>
> >> > +                                       atomic_long_add(p->pages -
> >> > +                                                       p->inuse_pages,
> >> > +                                                       &nr_swap_pages);
> >> > +                                       /*
> >> > +                                        * reset [highest|lowest]_bit to avoid
> >> > +                                        * scan_swap_map infinite looping if
> >> > +                                        * cached free cluster's index by
> >> > +                                        * scan_swap_map_try_ssd_cluster is
> >> > +                                        * above p->highest_bit.
> >> > +                                        */
> >> > +                                       p->highest_bit = p->max - 1;
> >> > +                                       p->lowest_bit = 1;
> >>
> >> lowest_bit and highest_bit are likely to remain at those extremes for
> >> a long time, until 1 or max-1 is freed and re-allocated.
> >>
> >>
> >> By adding variable-size blkdev support to swap, I don't think
> >> highest_bit can be re-used as a "full" flag anymore.
> >>
> >> Instead, I suggest that you add a "full" flag to struct
> >> swap_info_struct.  Then put a swap_hint GET_FREE check at the top of
> >> scan_swap_map(), and if full simply turn "full" on, remove the
> >> swap_info_struct from the avail list, reduce nr_swap_pages
> >> appropriately, and return failure.  Don't mess with lowest_bit or
> >> highest_bit at all.
> >
> > Could you explain what logic in your suggestion prevent the problem
> > I mentioned(ie, scan_swap_map infinite looping)?
> 
> scan_swap_map would immediately exit since the GET_FREE (or IS_FULL)
> check is done at its start.  And it wouldn't be called again with that
> swap_info_struct until non-full since it is removed from the
> avail_list.

Sorry for not being clear; that's not what I meant.
Please consider the situation where swap is no longer full thanks to
swap_entry_free. A new scan_swap_map call can select a slot index that
is higher than p->highest_bit because we have a cached free cluster,
so scan_swap_map resets to p->lowest_bit, scans again, and finally
picks the slot index just freed by swap_entry_free and runs its checks
again. Then it can hit a conflict in scan_swap_map_ssd_cluster_conflict,
so scan_swap_map_try_ssd_cluster resets offset and scan_base to
free_cluster_head, but unfortunately that offset is higher than
p->highest_bit, so it is reset to p->lowest_bit again. It loops
forever :(

I'd like to solve this problem without adding many hooks in the swap
layer or any overhead for the !zram case.
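
(For illustration only: one way to keep the !zram path overhead-free
would be to latch the capability once at swapon time - SWP_HINT is a
made-up flag here, not part of the series.)

	/* at swapon time, once p->bdev is known */
	if ((p->flags & SWP_BLKDEV) && p->bdev->bd_disk->fops->swap_hint)
		p->flags |= SWP_HINT;	/* made-up flag: "device has hints" */

	/* hot paths (scan_swap_map, swap_entry_free) then only test the
	 * cached flag instead of dereferencing bd_disk->fops each time */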

> 
> >
> >>
> >> Then in swap_entry_free(), do something like:
> >>
> >>     dec_cluster_info_page(p, p->cluster_info, offset);
> >>     if (offset < p->lowest_bit)
> >>       p->lowest_bit = offset;
> >> -   if (offset > p->highest_bit) {
> >> -     bool was_full = !p->highest_bit;
> >> +   if (offset > p->highest_bit)
> >>       p->highest_bit = offset;
> >> -     if (was_full && (p->flags & SWP_WRITEOK)) {
> >> +   if (p->full && p->flags & SWP_WRITEOK) {
> >> +     bool is_var_size_blkdev = is_variable_size_blkdev(p);
> >> +     bool blkdev_full = is_variable_size_blkdev_full(p);
> >> +
> >> +     if (!is_var_size_blkdev || !blkdev_full) {
> >> +       if (is_var_size_blkdev)
> >> +         atomic_long_add(p->pages - p->inuse_pages, &nr_swap_pages);
> >> +       p->full = false;
> >>         spin_lock(&swap_avail_lock);
> >>         WARN_ON(!plist_node_empty(&p->avail_list));
> >>         if (plist_node_empty(&p->avail_list))
> >>           plist_add(&p->avail_list,
> >>              &swap_avail_head);
> >>         spin_unlock(&swap_avail_lock);
> >> +     } else if (blkdev_full) {
> >> +       /* still full, so this page isn't actually
> >> +        * available yet to use; once non-full,
> >> +        * pages-inuse_pages will be the correct
> >> +        * number to add (above) since below will
> >> +        * inuse_pages--
> >> +        */
> >> +       atomic_long_dec(&nr_swap_pages);
> >>       }
> >>     }
> >>     atomic_long_inc(&nr_swap_pages);
> >>
> >>
> >>
> >> > @@ -815,7 +857,6 @@ static unsigned char swap_entry_free(struct swap_info_struct *p,
> >> >                 p->inuse_pages--;
> >> >                 frontswap_invalidate_page(p->type, offset);
> >> >                 if (p->flags & SWP_BLKDEV) {
> >> > -                       struct gendisk *disk = p->bdev->bd_disk;
> >> >                         if (disk->fops->swap_hint)
> >> >                                 disk->fops->swap_hint(p->bdev,
> >> >                                                 SWAP_SLOT_FREE,
> >> > --
> >> > 2.0.0
> >> >
> >>
> >> --
> >> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> >> the body to majordomo@kvack.org.  For more info on Linux MM,
> >> see: http://www.linux-mm.org/ .
> >> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> >
> > --
> > Kind regards,
> > Minchan Kim
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC 3/3] zram: add swap_get_free hint
  2014-09-15 16:00             ` Dan Streetman
@ 2014-09-16  1:21               ` Minchan Kim
  -1 siblings, 0 replies; 40+ messages in thread
From: Minchan Kim @ 2014-09-16  1:21 UTC (permalink / raw)
  To: Dan Streetman
  Cc: Heesub Shin, Andrew Morton, linux-kernel, Linux-MM, Hugh Dickins,
	Shaohua Li, Jerome Marchand, Sergey Senozhatsky, Nitin Gupta,
	Luigi Semenzato

On Mon, Sep 15, 2014 at 12:00:33PM -0400, Dan Streetman wrote:
> On Sun, Sep 14, 2014 at 8:57 PM, Minchan Kim <minchan@kernel.org> wrote:
> > On Sat, Sep 13, 2014 at 03:39:13PM -0400, Dan Streetman wrote:
> >> On Thu, Sep 4, 2014 at 7:59 PM, Minchan Kim <minchan@kernel.org> wrote:
> >> > Hi Heesub,
> >> >
> >> > On Thu, Sep 04, 2014 at 03:26:14PM +0900, Heesub Shin wrote:
> >> >> Hello Minchan,
> >> >>
> >> >> First of all, I agree with the overall purpose of your patch set.
> >> >
> >> > Thank you.
> >> >
> >> >>
> >> >> On 09/04/2014 10:39 AM, Minchan Kim wrote:
> >> >> >This patch implement SWAP_GET_FREE handler in zram so that VM can
> >> >> >know how many zram has freeable space.
> >> >> >VM can use it to stop anonymous reclaiming once zram is full.
> >> >> >
> >> >> >Signed-off-by: Minchan Kim <minchan@kernel.org>
> >> >> >---
> >> >> >  drivers/block/zram/zram_drv.c | 18 ++++++++++++++++++
> >> >> >  1 file changed, 18 insertions(+)
> >> >> >
> >> >> >diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> >> >> >index 88661d62e46a..8e22b20aa2db 100644
> >> >> >--- a/drivers/block/zram/zram_drv.c
> >> >> >+++ b/drivers/block/zram/zram_drv.c
> >> >> >@@ -951,6 +951,22 @@ static int zram_slot_free_notify(struct block_device *bdev,
> >> >> >     return 0;
> >> >> >  }
> >> >> >
> >> >> >+static int zram_get_free_pages(struct block_device *bdev, long *free)
> >> >> >+{
> >> >> >+    struct zram *zram;
> >> >> >+    struct zram_meta *meta;
> >> >> >+
> >> >> >+    zram = bdev->bd_disk->private_data;
> >> >> >+    meta = zram->meta;
> >> >> >+
> >> >> >+    if (!zram->limit_pages)
> >> >> >+            return 1;
> >> >> >+
> >> >> >+    *free = zram->limit_pages - zs_get_total_pages(meta->mem_pool);
> >> >>
> >> >> Even if 'free' is zero here, there may be free spaces available to
> >> >> store more compressed pages into the zs_pool. I mean calculation
> >> >> above is not quite accurate and wastes memory, but have no better
> >> >> idea for now.
> >> >
> >> > Yeb, good point.
> >> >
> >> > Actually, I thought about that but in this patchset, I wanted to
> >> > go with conservative approach which is a safe guard to prevent
> >> > system hang which is terrible than early OOM kill.
> >> >
> >> > Whole point of this patchset is to add a facility to VM and VM
> >> > collaborates with zram via the interface to avoid worst case
> >> > (ie, system hang) and logic to throttle could be enhanced by
> >> > several approaches in future but I agree my logic was too simple
> >> > and conservative.
> >> >
> >> > We could improve it with [anti|de]fragmentation in future but
> >> > at the moment, below simple heuristic is not too bad for first
> >> > step. :)
> >> >
> >> >
> >> > ---
> >> >  drivers/block/zram/zram_drv.c | 15 ++++++++++-----
> >> >  drivers/block/zram/zram_drv.h |  1 +
> >> >  2 files changed, 11 insertions(+), 5 deletions(-)
> >> >
> >> > diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> >> > index 8e22b20aa2db..af9dfe6a7d2b 100644
> >> > --- a/drivers/block/zram/zram_drv.c
> >> > +++ b/drivers/block/zram/zram_drv.c
> >> > @@ -410,6 +410,7 @@ static bool zram_free_page(struct zram *zram, size_t index)
> >> >         atomic64_sub(zram_get_obj_size(meta, index),
> >> >                         &zram->stats.compr_data_size);
> >> >         atomic64_dec(&zram->stats.pages_stored);
> >> > +       atomic_set(&zram->alloc_fail, 0);
> >> >
> >> >         meta->table[index].handle = 0;
> >> >         zram_set_obj_size(meta, index, 0);
> >> > @@ -600,10 +601,12 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
> >> >         alloced_pages = zs_get_total_pages(meta->mem_pool);
> >> >         if (zram->limit_pages && alloced_pages > zram->limit_pages) {
> >> >                 zs_free(meta->mem_pool, handle);
> >> > +               atomic_inc(&zram->alloc_fail);
> >> >                 ret = -ENOMEM;
> >> >                 goto out;
> >> >         }
> >>
> >> This isn't going to work well at all with swap.  There will be,
> >> minimum, 32 failures to write a swap page before GET_FREE finally
> >> indicates it's full, and even then a single free during those 32
> >> failures will restart the counter, so it could be dozens or hundreds
> >> (or more) swap write failures before the zram device is marked as
> >> full.  And then, a single zram free will move it back to non-full and
> >> start the write failures over again.
> >>
> >> I think it would be better to just check for actual fullness (i.e.
> >> alloced_pages > limit_pages) at the start of write, and fail if so.
> >> That will allow a single write to succeed when it crosses into
> >> fullness, and the if GET_FREE is changed to a simple IS_FULL and uses
> >> the same check (alloced_pages > limit_pages), then swap shouldn't see
> >> any write failures (or very few), and zram will stay full until enough
> >> pages are freed that it really does move under limit_pages.
> >
> > The alloced_pages > limit_pages doesn't mean zram is full so with your
> > approach, it could kick OOM earlier which is not what we want.
> > Because our product uses zram to delay app killing by low memory killer.
> 
> With zram, the meaning of "full" isn't as obvious as other fixed-size
> storage devices.  Obviously, "full" usually means "no more room to
> store anything", while "not full" means "there is room to store
> anything, up to the remaining free size".  With zram, its zsmalloc
> pool size might be over the specified limit, but there will still be
> room to store *some* things - but not *anything*.  Only compressed
> pages that happen to fit inside a class with at least one zspage that
> isn't full.
> 
> Clearly, we shouldn't wait to declare zram "full" only once zsmalloc
> is 100% full in all its classes.
> 
> What about waiting until there is N number of write failures, like
> this patch?  That doesn't seem very fair to the writer, since each
> write failure will cause them to do extra work (first, in selecting
> what to write, and then in recovering from the failed write).
> However, it will probably squeeze some writes into some of those empty
> spaces in already-allocated zspages.
> 
> And declaring zram "full" immediately once the zsmalloc pool size
> increases past the specified limit?  Since zsmalloc's classes almost
> certainly contain some fragmentation, that will waste all the empty
> spaces that could still store more compressed pages.  But, this is the
> limit at which you cannot guarantee all writes to be able to store a
> compressed page - any zsmalloc classes without a partially empty
> zspage will have to increase zsmalloc's size, therefore failing the
> write.
> 
> Neither definition of "full" is optimal.  Since in this case we're
> talking about swap, I think forcing swap write failures to happen,
> which with direct reclaim could (I believe) stop everything while the
> write failures continue, should be avoided as much as possible.  Even
> when zram fullness is delayed by N write failures, to try to squeeze
> out as much storage from zsmalloc as possible, when it does eventually
> fill if zram is the only swap device the system will OOM anyway.  And
> if zram isn't the only swap device, but just the first (highest
> priority), then delaying things with unneeded write failures is
> certainly not better than just filling up so swap can move on to the
> next swap device.  The only case where write failures delaying marking
> zram as full will help is if the system stopped right at this point,
> and then started decreasing how much memory was needed.  That seems
> like a very unlikely coincidence, but maybe some testing would help
> determine how bad the write failures affect system
> performance/responsiveness and how long they delay OOM.

Please keep in mind that swap is already a really slow operation, but
we want to use it to avoid OOM if possible, so I can't buy your
early-kill suggestion. If a user feels it is really slow for his
product, it means his admin failed; he should increase zram's limit
dynamically or statically (zram already supports both ways).

The thing I'd like to solve in this patchset is to avoid a system hang
where the admin cannot do anything, not even ctrl+c, which is something
that should be guaranteed at the OS level.

> 
> Since there may be different use cases that desire different things,
> maybe there should be a zram runtime (or buildtime) config to choose
> exactly how it decides it's full?  Either full after N write failures,
> or full when alloced>limit?  That would allow the user to either defer
> getting full as long as possible (at the possible cost of system
> unresponsiveness during those write failures), or to just move
> immediately to zram being full as soon as it can't guarantee that each
> write will succeed.

Hmm, I thought about that and was going to post it when I send v1.
My idea was this:

static int zram_get_free_pages(struct block_device *bdev, long *free)
{
        struct zram *zram = bdev->bd_disk->private_data;
        /* compressed bytes stored, expressed in pages */
        unsigned long compr_pages =
                atomic64_read(&zram->stats.compr_data_size) >> PAGE_SHIFT;
        unsigned long total = zs_get_total_pages(zram->meta->mem_pool);

        /*
         * Report "full" only after repeated allocation failures and
         * only when the pool is already densely packed, so the failures
         * are not just internal fragmentation.
         */
        if (zram->limit_pages &&
            atomic_read(&zram->alloc_fail) > FULL_THRESH_HOLD &&
            100 * compr_pages / total > FRAG_THRESH_HOLD) {
                *free = 0;
                return 0;
        }
        ..
}

Maybe we could export FRAG_THRESH_HOLD as a tunable.
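
If we export it, a per-device sysfs knob could look roughly like this
(illustration only: the attribute name and the zram->frag_thresh field
are made up here):

static ssize_t frag_thresh_store(struct device *dev,
                struct device_attribute *attr, const char *buf, size_t len)
{
        struct zram *zram = dev_to_zram(dev);
        unsigned int val;
        int ret;

        /* percentage of the pool that must hold compressed data before
         * repeated allocation failures are reported as "full" */
        ret = kstrtouint(buf, 10, &val);
        if (ret)
                return ret;
        if (val > 100)
                return -EINVAL;

        down_write(&zram->init_lock);
        zram->frag_thresh = val;
        up_write(&zram->init_lock);

        return len;
}

plus a matching _show() and a device attribute wired up the same way as
mem_limit.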

> 
> 
> 
> >
> >>
> >>
> >>
> >> >
> >> > +       atomic_set(&zram->alloc_fail, 0);
> >> >         update_used_max(zram, alloced_pages);
> >> >
> >> >         cmem = zs_map_object(meta->mem_pool, handle, ZS_MM_WO);
> >> > @@ -951,6 +954,7 @@ static int zram_slot_free_notify(struct block_device *bdev,
> >> >         return 0;
> >> >  }
> >> >
> >> > +#define FULL_THRESH_HOLD 32
> >> >  static int zram_get_free_pages(struct block_device *bdev, long *free)
> >> >  {
> >> >         struct zram *zram;
> >> > @@ -959,12 +963,13 @@ static int zram_get_free_pages(struct block_device *bdev, long *free)
> >> >         zram = bdev->bd_disk->private_data;
> >> >         meta = zram->meta;
> >> >
> >> > -       if (!zram->limit_pages)
> >> > -               return 1;
> >> > -
> >> > -       *free = zram->limit_pages - zs_get_total_pages(meta->mem_pool);
> >> > +       if (zram->limit_pages &&
> >> > +               (atomic_read(&zram->alloc_fail) > FULL_THRESH_HOLD)) {
> >> > +               *free = 0;
> >> > +               return 0;
> >> > +       }
> >> >
> >> > -       return 0;
> >> > +       return 1;
> >>
> >> There's no way that zram can even provide a accurate number of free
> >> pages, since it can't know how compressible future stored pages will
> >> be.  It would be better to simply change this swap_hint from GET_FREE
> >> to IS_FULL, and return either true or false.
> >
> > My plan is that we can give an approximation based on
> > orig_data_size/compr_data_size with tweaking zero page and vmscan can use
> > the hint from get_nr_swap_pages to throttle file/anon balance but I want to do
> > step by step so I didn't include the hint.
> > If you are strong against with that in this stage, I can change it and
> > try it later with the number.
> > Please, say again if you want.
> 
> since as you said zram is the only user of swap_hint, changing it
> later shouldn't be a big deal.  And you could have both, IS_FULL and
> GET_FREE; since the check in scan_swap_map() really only is checking
> for IS_FULL, if you update vmscan later to adjust its file/anon
> balance based on GET_FREE, that can be added then with no trouble,
> right?

Yep, no problem.

> 
> 
> >
> > Thanks for the review!
> >
> >
> >>
> >>
> >> >  }
> >> >
> >> >  static int zram_swap_hint(struct block_device *bdev,
> >> > diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
> >> > index 779d03fa4360..182a2544751b 100644
> >> > --- a/drivers/block/zram/zram_drv.h
> >> > +++ b/drivers/block/zram/zram_drv.h
> >> > @@ -115,6 +115,7 @@ struct zram {
> >> >         u64 disksize;   /* bytes */
> >> >         int max_comp_streams;
> >> >         struct zram_stats stats;
> >> > +       atomic_t alloc_fail;
> >> >         /*
> >> >          * the number of pages zram can consume for storing compressed data
> >> >          */
> >> > --
> >> > 2.0.0
> >> >
> >> >>
> >> >> heesub
> >> >>
> >> >> >+
> >> >> >+    return 0;
> >> >> >+}
> >> >> >+
> >> >> >  static int zram_swap_hint(struct block_device *bdev,
> >> >> >                             unsigned int hint, void *arg)
> >> >> >  {
> >> >> >@@ -958,6 +974,8 @@ static int zram_swap_hint(struct block_device *bdev,
> >> >> >
> >> >> >     if (hint == SWAP_SLOT_FREE)
> >> >> >             ret = zram_slot_free_notify(bdev, (unsigned long)arg);
> >> >> >+    else if (hint == SWAP_GET_FREE)
> >> >> >+            ret = zram_get_free_pages(bdev, arg);
> >> >> >
> >> >> >     return ret;
> >> >> >  }
> >> >> >
> >> >>
> >> >> --
> >> >> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> >> >> the body to majordomo@kvack.org.  For more info on Linux MM,
> >> >> see: http://www.linux-mm.org/ .
> >> >> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> >> >
> >> > --
> >> > Kind regards,
> >> > Minchan Kim
> >>
> >> --
> >> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> >> the body to majordomo@kvack.org.  For more info on Linux MM,
> >> see: http://www.linux-mm.org/ .
> >> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> >
> > --
> > Kind regards,
> > Minchan Kim
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC 2/3] mm: add swap_get_free hint for zram
  2014-09-16  0:33           ` Minchan Kim
@ 2014-09-16 15:09             ` Dan Streetman
  -1 siblings, 0 replies; 40+ messages in thread
From: Dan Streetman @ 2014-09-16 15:09 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, Linux-MM, Hugh Dickins, Shaohua Li,
	Jerome Marchand, Sergey Senozhatsky, Nitin Gupta,
	Luigi Semenzato

On Mon, Sep 15, 2014 at 8:33 PM, Minchan Kim <minchan@kernel.org> wrote:
> On Mon, Sep 15, 2014 at 10:53:01AM -0400, Dan Streetman wrote:
>> On Sun, Sep 14, 2014 at 8:30 PM, Minchan Kim <minchan@kernel.org> wrote:
>> > On Sat, Sep 13, 2014 at 03:01:47PM -0400, Dan Streetman wrote:
>> >> On Wed, Sep 3, 2014 at 9:39 PM, Minchan Kim <minchan@kernel.org> wrote:
>> >> > VM uses nr_swap_pages as one of information when it does
>> >> > anonymous reclaim so that VM is able to throttle amount of swap.
>> >> >
>> >> > Normally, the nr_swap_pages is equal to freeable space of swap disk
>> >> > but for zram, it doesn't match because zram can limit memory usage
>> >> > by knob(ie, mem_limit) so although VM can see lots of free space
>> >> > from zram disk, zram can make fail intentionally once the allocated
>> >> > space is over to limit. If it happens, VM should notice it and
>> >> > stop reclaimaing until zram can obtain more free space but there
>> >> > is a good way to do at the moment.
>> >> >
>> >> > This patch adds new hint SWAP_GET_FREE which zram can return how
>> >> > many of freeable space it has. With using that, this patch adds
>> >> > __swap_full which returns true if the zram is full and substract
>> >> > remained freeable space of the zram-swap from nr_swap_pages.
>> >> > IOW, VM sees there is no more swap space of zram so that it stops
>> >> > anonymous reclaiming until swap_entry_free free a page and increase
>> >> > nr_swap_pages again.
>> >> >
>> >> > Signed-off-by: Minchan Kim <minchan@kernel.org>
>> >> > ---
>> >> >  include/linux/blkdev.h |  1 +
>> >> >  mm/swapfile.c          | 45 +++++++++++++++++++++++++++++++++++++++++++--
>> >> >  2 files changed, 44 insertions(+), 2 deletions(-)
>> >> >
>> >> > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>> >> > index 17437b2c18e4..c1199806e0f1 100644
>> >> > --- a/include/linux/blkdev.h
>> >> > +++ b/include/linux/blkdev.h
>> >> > @@ -1611,6 +1611,7 @@ static inline bool blk_integrity_is_initialized(struct gendisk *g)
>> >> >
>> >> >  enum swap_blk_hint {
>> >> >         SWAP_SLOT_FREE,
>> >> > +       SWAP_GET_FREE,
>> >> >  };
>> >> >
>> >> >  struct block_device_operations {
>> >> > diff --git a/mm/swapfile.c b/mm/swapfile.c
>> >> > index 4bff521e649a..72737e6dd5e5 100644
>> >> > --- a/mm/swapfile.c
>> >> > +++ b/mm/swapfile.c
>> >> > @@ -484,6 +484,22 @@ new_cluster:
>> >> >         *scan_base = tmp;
>> >> >  }
>> >> >
>> >> > +static bool __swap_full(struct swap_info_struct *si)
>> >> > +{
>> >> > +       if (si->flags & SWP_BLKDEV) {
>> >> > +               long free;
>> >> > +               struct gendisk *disk = si->bdev->bd_disk;
>> >> > +
>> >> > +               if (disk->fops->swap_hint)
>> >> > +                       if (!disk->fops->swap_hint(si->bdev,
>> >> > +                                               SWAP_GET_FREE,
>> >> > +                                               &free))
>> >> > +                               return free <= 0;
>> >> > +       }
>> >> > +
>> >> > +       return si->inuse_pages == si->pages;
>> >> > +}
>> >> > +
>> >> >  static unsigned long scan_swap_map(struct swap_info_struct *si,
>> >> >                                    unsigned char usage)
>> >> >  {
>> >> > @@ -583,11 +599,21 @@ checks:
>> >> >         if (offset == si->highest_bit)
>> >> >                 si->highest_bit--;
>> >> >         si->inuse_pages++;
>> >> > -       if (si->inuse_pages == si->pages) {
>> >> > +       if (__swap_full(si)) {
>> >>
>> >> This check is done after an available offset has already been
>> >> selected.  So if the variable-size blkdev is full at this point, then
>> >> this is incorrect, as swap will try to store a page at the current
>> >> selected offset.
>> >
>> > So the result is just fail of a write then what happens?
>> > Page become redirty and keep it in memory so there is no harm.
>>
>> Happening once, it's not a big deal.  But it's not as good as not
>> happening at all.
>
> With your suggestion, we should check full whevever we need new
> swap slot. To me, it's more concern than just a write fail.

Well, normal device fullness (i.e. inuse_pages == pages) is also
checked for each new swap slot. I don't see how you would get around
checking for each new swap slot.

>
>>
>> >
>> >>
>> >> > +               struct gendisk *disk = si->bdev->bd_disk;
>> >> > +
>> >> >                 si->lowest_bit = si->max;
>> >> >                 si->highest_bit = 0;
>> >> >                 spin_lock(&swap_avail_lock);
>> >> >                 plist_del(&si->avail_list, &swap_avail_head);
>> >> > +               /*
>> >> > +                * If zram is full, it decreases nr_swap_pages
>> >> > +                * for stopping anonymous page reclaim until
>> >> > +                * zram has free space. Look at swap_entry_free
>> >> > +                */
>> >> > +               if (disk->fops->swap_hint)
>> >>
>> >> Simply checking for the existence of swap_hint isn't enough to know
>> >> we're using zram...
>> >
>> > Yes but acutally the hint have been used for only zram for several years.
>> > If other user is coming in future, we would add more checks if we really
>> > need it at that time.
>> > Do you have another idea?
>>
>> Well if this hint == zram just rename it zram.  Especially if it's now
>> going to be explicitly used to mean it == zram.  But I don't think
>> that is necessary.
>
> I'd like to clarify your comment. So, are you okay without any change?

No, what I meant was that I don't think the code at this location is
necessary... I think you can put a single blkdev swap_hint fullness
check at the start of scan_swap_map() and remove this.
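
Something along these lines (an untested sketch: swap_dev_full() is a
made-up helper name and si->full is the new flag I suggested, so treat
it as an illustration of the idea, not a finished patch):

static bool swap_dev_full(struct swap_info_struct *si)
{
        long free;
        struct gendisk *disk = si->bdev->bd_disk;

        if ((si->flags & SWP_BLKDEV) && disk->fops->swap_hint &&
            !disk->fops->swap_hint(si->bdev, SWAP_GET_FREE, &free))
                return free <= 0;

        return false;
}

static unsigned long scan_swap_map(struct swap_info_struct *si,
                                   unsigned char usage)
{
        if (swap_dev_full(si)) {
                si->full = true;
                spin_lock(&swap_avail_lock);
                plist_del(&si->avail_list, &swap_avail_head);
                spin_unlock(&swap_avail_lock);
                atomic_long_sub(si->pages - si->inuse_pages,
                                &nr_swap_pages);
                return 0;       /* 0 means "no slot" to the callers */
        }

        /* ... the existing scan logic, untouched ... */
}

swap_entry_free() would then clear si->full and put the device back on
the avail list, as in the diff I pasted above.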


>
>>
>> >
>> >>
>> >> > +                       atomic_long_sub(si->pages - si->inuse_pages,
>> >> > +                               &nr_swap_pages);
>> >> >                 spin_unlock(&swap_avail_lock);
>> >> >         }
>> >> >         si->swap_map[offset] = usage;
>> >> > @@ -796,6 +822,7 @@ static unsigned char swap_entry_free(struct swap_info_struct *p,
>> >> >
>> >> >         /* free if no reference */
>> >> >         if (!usage) {
>> >> > +               struct gendisk *disk = p->bdev->bd_disk;
>> >> >                 dec_cluster_info_page(p, p->cluster_info, offset);
>> >> >                 if (offset < p->lowest_bit)
>> >> >                         p->lowest_bit = offset;
>> >> > @@ -808,6 +835,21 @@ static unsigned char swap_entry_free(struct swap_info_struct *p,
>> >> >                                 if (plist_node_empty(&p->avail_list))
>> >> >                                         plist_add(&p->avail_list,
>> >> >                                                   &swap_avail_head);
>> >> > +                               if ((p->flags & SWP_BLKDEV) &&
>> >> > +                                       disk->fops->swap_hint) {
>> >>
>> >> freeing an entry from a full variable-size blkdev doesn't mean it's
>> >> not still full.  In this case with zsmalloc, freeing one handle
>> >> doesn't actually free any memory unless it was the only handle left in
>> >> its containing zspage, and therefore it's possible that it is still
>> >> full at this point.
>> >
>> > No need to free a zspage in zsmalloc.
>> > If we free a page in zspage, it means we have free space in zspage
>> > so user can give a chance to user for writing out new page.
>>
>> That's not actually true, since zsmalloc has 255 different class
>> sizes, freeing one page means the next page to be compressed has a
>> 1/255 chance that it will be the same size as the just-freed page
>> (assuming random page compressability).
>
> I said "a chance" so if we have a possiblity, I'd like to try it.
> Pz, don't tie your thought into zsmalloc's internal. It's facility
> to communitcate with swap/zram, not zram allocator.
> IOW, We could change allocator of zram potentially
> (ex, historically, we have already done) and the (imaginary allocator/
> or enhanced zsmalloc) could have a technique to handle it.

I'm only thinking of what currently will happen.  A lot of this
depends on exactly how zram's IS_FULL (or GET_FREE) is defined.  If
it's a black/white boundary then this can certainly assume freeing one
entry should put the swap_info_struct back onto the avail_list, but if
it's a not-so-clear full/not-full boundary, then there almost
certainly will be write failures.

It seems like avoiding swap write failures is important, especially
when under heavy memory pressure, but maybe some testing would show
it's not really a big deal.

>
>>
>> >
>> >>
>> >> > +                                       atomic_long_add(p->pages -
>> >> > +                                                       p->inuse_pages,
>> >> > +                                                       &nr_swap_pages);
>> >> > +                                       /*
>> >> > +                                        * reset [highest|lowest]_bit to avoid
>> >> > +                                        * scan_swap_map infinite looping if
>> >> > +                                        * cached free cluster's index by
>> >> > +                                        * scan_swap_map_try_ssd_cluster is
>> >> > +                                        * above p->highest_bit.
>> >> > +                                        */
>> >> > +                                       p->highest_bit = p->max - 1;
>> >> > +                                       p->lowest_bit = 1;
>> >>
>> >> lowest_bit and highest_bit are likely to remain at those extremes for
>> >> a long time, until 1 or max-1 is freed and re-allocated.
>> >>
>> >>
>> >> By adding variable-size blkdev support to swap, I don't think
>> >> highest_bit can be re-used as a "full" flag anymore.
>> >>
>> >> Instead, I suggest that you add a "full" flag to struct
>> >> swap_info_struct.  Then put a swap_hint GET_FREE check at the top of
>> >> scan_swap_map(), and if full simply turn "full" on, remove the
>> >> swap_info_struct from the avail list, reduce nr_swap_pages
>> >> appropriately, and return failure.  Don't mess with lowest_bit or
>> >> highest_bit at all.
>> >
>> > Could you explain what logic in your suggestion prevent the problem
>> > I mentioned(ie, scan_swap_map infinite looping)?
>>
>> scan_swap_map would immediately exit since the GET_FREE (or IS_FULL)
>> check is done at its start.  And it wouldn't be called again with that
>> swap_info_struct until non-full since it is removed from the
>> avail_list.
>
> Sorry for being not clear. I don't mean it.
> Please consider the situation where swap is not full any more
> by swap_entry_free. Newly scan_swap_map can select the slot index which
> is higher than p->highest_bit because we have cached free_cluster so
> scan_swap_map will reset it with p->lowest_bit and scan again and finally
> pick the slot index just freed by swap_entry_free and checks again.
> Then, it could be conflict by scan_swap_map_ssd_cluster_conflict so
> scan_swap_map_try_ssd_cluster will reset offset, scan_base to free_cluster_head
> but unfortunately, offset is higher than p->highest_bit so again it is reset
> to p->lowest_bit. It loops forever :(

Sorry, I don't see what you're talking about... If you don't touch
lowest_bit or highest_bit at all when marking zram as full, and also
don't touch them when marking it as not-full, then there should never
be any problem with either.

The only reason that lowest_bit/highest_bit are reset when
inuse_pages==pages is because when that condition is true, there
really are no more offsets available.  So highest_bit really is 0, and
lowest_bit really is max.  Then, when a single offset is made
available in swap_entry_free, both lowest_bit and highest_bit are set
to it, because that really is both the lowest and highest bit.
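
For reference, this is roughly the existing behaviour I mean, trimmed
down from scan_swap_map() and swap_entry_free():

        /* scan_swap_map(): the device really has no offsets left */
        if (si->inuse_pages == si->pages) {
                si->lowest_bit = si->max;
                si->highest_bit = 0;
                ...
        }

        /* swap_entry_free(): one offset becomes usable again */
        if (offset < p->lowest_bit)
                p->lowest_bit = offset;
        if (offset > p->highest_bit)
                p->highest_bit = offset;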

In the case of a variable size blkdev, when it's full there actually
still are more offsets that can be used, so neither lowest nor highest
bit should be modified.  Just leave them where they are, and when the
swap_info_struct is placed back onto the avail_list for scanning,
things will keep working correctly.

Am I missing something?

>
> I'd like to solve this problem without many hooking in swap layer and
> any overhead for !zram case.

The only overhead for !zram is the (failing) check for swap_hint,
which you can't avoid.

>
>>
>> >
>> >>
>> >> Then in swap_entry_free(), do something like:
>> >>
>> >>     dec_cluster_info_page(p, p->cluster_info, offset);
>> >>     if (offset < p->lowest_bit)
>> >>       p->lowest_bit = offset;
>> >> -   if (offset > p->highest_bit) {
>> >> -     bool was_full = !p->highest_bit;
>> >> +   if (offset > p->highest_bit)
>> >>       p->highest_bit = offset;
>> >> -     if (was_full && (p->flags & SWP_WRITEOK)) {
>> >> +   if (p->full && p->flags & SWP_WRITEOK) {
>> >> +     bool is_var_size_blkdev = is_variable_size_blkdev(p);
>> >> +     bool blkdev_full = is_variable_size_blkdev_full(p);
>> >> +
>> >> +     if (!is_var_size_blkdev || !blkdev_full) {
>> >> +       if (is_var_size_blkdev)
>> >> +         atomic_long_add(p->pages - p->inuse_pages, &nr_swap_pages);
>> >> +       p->full = false;
>> >>         spin_lock(&swap_avail_lock);
>> >>         WARN_ON(!plist_node_empty(&p->avail_list));
>> >>         if (plist_node_empty(&p->avail_list))
>> >>           plist_add(&p->avail_list,
>> >>              &swap_avail_head);
>> >>         spin_unlock(&swap_avail_lock);
>> >> +     } else if (blkdev_full) {
>> >> +       /* still full, so this page isn't actually
>> >> +        * available yet to use; once non-full,
>> >> +        * pages-inuse_pages will be the correct
>> >> +        * number to add (above) since below will
>> >> +        * inuse_pages--
>> >> +        */
>> >> +       atomic_long_dec(&nr_swap_pages);
>> >>       }
>> >>     }
>> >>     atomic_long_inc(&nr_swap_pages);
>> >>
>> >>
>> >>
>> >> > @@ -815,7 +857,6 @@ static unsigned char swap_entry_free(struct swap_info_struct *p,
>> >> >                 p->inuse_pages--;
>> >> >                 frontswap_invalidate_page(p->type, offset);
>> >> >                 if (p->flags & SWP_BLKDEV) {
>> >> > -                       struct gendisk *disk = p->bdev->bd_disk;
>> >> >                         if (disk->fops->swap_hint)
>> >> >                                 disk->fops->swap_hint(p->bdev,
>> >> >                                                 SWAP_SLOT_FREE,
>> >> > --
>> >> > 2.0.0
>> >> >
>> >>
>> >
>> > --
>> > Kind regards,
>> > Minchan Kim
>>
>
> --
> Kind regards,
> Minchan Kim

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC 3/3] zram: add swap_get_free hint
  2014-09-16  1:21               ` Minchan Kim
@ 2014-09-16 15:58                 ` Dan Streetman
  -1 siblings, 0 replies; 40+ messages in thread
From: Dan Streetman @ 2014-09-16 15:58 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Heesub Shin, Andrew Morton, linux-kernel, Linux-MM, Hugh Dickins,
	Shaohua Li, Jerome Marchand, Sergey Senozhatsky, Nitin Gupta,
	Luigi Semenzato

On Mon, Sep 15, 2014 at 9:21 PM, Minchan Kim <minchan@kernel.org> wrote:
> On Mon, Sep 15, 2014 at 12:00:33PM -0400, Dan Streetman wrote:
>> On Sun, Sep 14, 2014 at 8:57 PM, Minchan Kim <minchan@kernel.org> wrote:
>> > On Sat, Sep 13, 2014 at 03:39:13PM -0400, Dan Streetman wrote:
>> >> On Thu, Sep 4, 2014 at 7:59 PM, Minchan Kim <minchan@kernel.org> wrote:
>> >> > Hi Heesub,
>> >> >
>> >> > On Thu, Sep 04, 2014 at 03:26:14PM +0900, Heesub Shin wrote:
>> >> >> Hello Minchan,
>> >> >>
>> >> >> First of all, I agree with the overall purpose of your patch set.
>> >> >
>> >> > Thank you.
>> >> >
>> >> >>
>> >> >> On 09/04/2014 10:39 AM, Minchan Kim wrote:
>> >> >> >This patch implement SWAP_GET_FREE handler in zram so that VM can
>> >> >> >know how many zram has freeable space.
>> >> >> >VM can use it to stop anonymous reclaiming once zram is full.
>> >> >> >
>> >> >> >Signed-off-by: Minchan Kim <minchan@kernel.org>
>> >> >> >---
>> >> >> >  drivers/block/zram/zram_drv.c | 18 ++++++++++++++++++
>> >> >> >  1 file changed, 18 insertions(+)
>> >> >> >
>> >> >> >diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
>> >> >> >index 88661d62e46a..8e22b20aa2db 100644
>> >> >> >--- a/drivers/block/zram/zram_drv.c
>> >> >> >+++ b/drivers/block/zram/zram_drv.c
>> >> >> >@@ -951,6 +951,22 @@ static int zram_slot_free_notify(struct block_device *bdev,
>> >> >> >     return 0;
>> >> >> >  }
>> >> >> >
>> >> >> >+static int zram_get_free_pages(struct block_device *bdev, long *free)
>> >> >> >+{
>> >> >> >+    struct zram *zram;
>> >> >> >+    struct zram_meta *meta;
>> >> >> >+
>> >> >> >+    zram = bdev->bd_disk->private_data;
>> >> >> >+    meta = zram->meta;
>> >> >> >+
>> >> >> >+    if (!zram->limit_pages)
>> >> >> >+            return 1;
>> >> >> >+
>> >> >> >+    *free = zram->limit_pages - zs_get_total_pages(meta->mem_pool);
>> >> >>
>> >> >> Even if 'free' is zero here, there may be free spaces available to
>> >> >> store more compressed pages into the zs_pool. I mean calculation
>> >> >> above is not quite accurate and wastes memory, but have no better
>> >> >> idea for now.
>> >> >
>> >> > Yeb, good point.
>> >> >
>> >> > Actually, I thought about that but in this patchset, I wanted to
>> >> > go with conservative approach which is a safe guard to prevent
>> >> > system hang which is terrible than early OOM kill.
>> >> >
>> >> > Whole point of this patchset is to add a facility to VM and VM
>> >> > collaborates with zram via the interface to avoid worst case
>> >> > (ie, system hang) and logic to throttle could be enhanced by
>> >> > several approaches in future but I agree my logic was too simple
>> >> > and conservative.
>> >> >
>> >> > We could improve it with [anti|de]fragmentation in future but
>> >> > at the moment, below simple heuristic is not too bad for first
>> >> > step. :)
>> >> >
>> >> >
>> >> > ---
>> >> >  drivers/block/zram/zram_drv.c | 15 ++++++++++-----
>> >> >  drivers/block/zram/zram_drv.h |  1 +
>> >> >  2 files changed, 11 insertions(+), 5 deletions(-)
>> >> >
>> >> > diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
>> >> > index 8e22b20aa2db..af9dfe6a7d2b 100644
>> >> > --- a/drivers/block/zram/zram_drv.c
>> >> > +++ b/drivers/block/zram/zram_drv.c
>> >> > @@ -410,6 +410,7 @@ static bool zram_free_page(struct zram *zram, size_t index)
>> >> >         atomic64_sub(zram_get_obj_size(meta, index),
>> >> >                         &zram->stats.compr_data_size);
>> >> >         atomic64_dec(&zram->stats.pages_stored);
>> >> > +       atomic_set(&zram->alloc_fail, 0);
>> >> >
>> >> >         meta->table[index].handle = 0;
>> >> >         zram_set_obj_size(meta, index, 0);
>> >> > @@ -600,10 +601,12 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
>> >> >         alloced_pages = zs_get_total_pages(meta->mem_pool);
>> >> >         if (zram->limit_pages && alloced_pages > zram->limit_pages) {
>> >> >                 zs_free(meta->mem_pool, handle);
>> >> > +               atomic_inc(&zram->alloc_fail);
>> >> >                 ret = -ENOMEM;
>> >> >                 goto out;
>> >> >         }
>> >>
>> >> This isn't going to work well at all with swap.  There will be,
>> >> minimum, 32 failures to write a swap page before GET_FREE finally
>> >> indicates it's full, and even then a single free during those 32
>> >> failures will restart the counter, so it could be dozens or hundreds
>> >> (or more) swap write failures before the zram device is marked as
>> >> full.  And then, a single zram free will move it back to non-full and
>> >> start the write failures over again.
>> >>
>> >> I think it would be better to just check for actual fullness (i.e.
>> >> alloced_pages > limit_pages) at the start of write, and fail if so.
>> >> That will allow a single write to succeed when it crosses into
>> >> fullness, and the if GET_FREE is changed to a simple IS_FULL and uses
>> >> the same check (alloced_pages > limit_pages), then swap shouldn't see
>> >> any write failures (or very few), and zram will stay full until enough
>> >> pages are freed that it really does move under limit_pages.
>> >
>> > The alloced_pages > limit_pages doesn't mean zram is full so with your
>> > approach, it could kick OOM earlier which is not what we want.
>> > Because our product uses zram to delay app killing by low memory killer.
>>
>> With zram, the meaning of "full" isn't as obvious as other fixed-size
>> storage devices.  Obviously, "full" usually means "no more room to
>> store anything", while "not full" means "there is room to store
>> anything, up to the remaining free size".  With zram, its zsmalloc
>> pool size might be over the specified limit, but there will still be
>> room to store *some* things - but not *anything*.  Only compressed
>> pages that happen to fit inside a class with at least one zspage that
>> isn't full.
>>
>> Clearly, we shouldn't wait to declare zram "full" only once zsmalloc
>> is 100% full in all its classes.
>>
>> What about waiting until there is N number of write failures, like
>> this patch?  That doesn't seem very fair to the writer, since each
>> write failure will cause them to do extra work (first, in selecting
>> what to write, and then in recovering from the failed write).
>> However, it will probably squeeze some writes into some of those empty
>> spaces in already-allocated zspages.
>>
>> And declaring zram "full" immediately once the zsmalloc pool size
>> increases past the specified limit?  Since zsmalloc's classes almost
>> certainly contain some fragmentation, that will waste all the empty
>> spaces that could still store more compressed pages.  But, this is the
>> limit at which you cannot guarantee all writes to be able to store a
>> compressed page - any zsmalloc classes without a partially empty
>> zspage will have to increase zsmalloc's size, therefore failing the
>> write.
>>
>> Neither definition of "full" is optimal.  Since in this case we're
>> talking about swap, I think forcing swap write failures to happen,
>> which with direct reclaim could (I believe) stop everything while the
>> write failures continue, should be avoided as much as possible.  Even
>> when zram fullness is delayed by N write failures, to try to squeeze
>> out as much storage from zsmalloc as possible, when it does eventually
>> fill if zram is the only swap device the system will OOM anyway.  And
>> if zram isn't the only swap device, but just the first (highest
>> priority), then delaying things with unneeded write failures is
>> certainly not better than just filling up so swap can move on to the
>> next swap device.  The only case where write failures delaying marking
>> zram as full will help is if the system stopped right at this point,
>> and then started decreasing how much memory was needed.  That seems
>> like a very unlikely coincidence, but maybe some testing would help
>> determine how bad the write failures affect system
>> performance/responsiveness and how long they delay OOM.
>
> Please, keep in mind that swap is alreay really slow operation but
> we want to use it to avoid OOM if possible so I can't buy your early
> kill suggestion.

I disagree; OOM should be invoked once the system can't proceed with
reclaiming memory.  IMHO, repeated swap write failures will leave the
system unable to reclaim memory.

> If a user feel it's really slow for his product,
> it means his admin was fail. He should increase the limit of zram
> dynamically or statically(zram already support that ways).
>
> The thing I'd like to solve in this patchset is to avoid system hang
> where admin cannot do anyting, even ctrl+c, which is thing should
> support in OS level.

What's better - failing a lot of swap writes, or marking the swap
device as full?  As I said, if zram is the *only* swap device in the
system, maybe that makes sense (although it's still questionable).  If
zram is only the first swap device, and there's a backup swap device
(presumably one that just writes to disk), then it will be *much* better
to simply fail over to that, instead of (repeatedly) failing a lot of
swap writes.

Especially once direct reclaim is reached, failing swap writes is
probably going to make the system unresponsive.  Personally I think
moving to OOM (or the next swap device) is better.

If write failures are the direction you go, then IMHO there should *at
least* be a zram parameter to allow the admin to choose to immediately
fail or continue with write failures.
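
A sysfs knob would be enough for that.  Purely as a sketch - the
stop_using_when_full attribute and the matching bool in struct zram are
made-up names, not something zram has today:

static ssize_t stop_using_when_full_show(struct device *dev,
		struct device_attribute *attr, char *buf)
{
	struct zram *zram = dev_to_zram(dev);

	return scnprintf(buf, PAGE_SIZE, "%d\n", zram->stop_using_when_full);
}

static ssize_t stop_using_when_full_store(struct device *dev,
		struct device_attribute *attr, const char *buf, size_t len)
{
	struct zram *zram = dev_to_zram(dev);
	unsigned long val;

	if (kstrtoul(buf, 10, &val))
		return -EINVAL;

	zram->stop_using_when_full = !!val;
	return len;
}

static DEVICE_ATTR(stop_using_when_full, S_IRUGO | S_IWUSR,
		stop_using_when_full_show, stop_using_when_full_store);

Then "echo 1 > /sys/block/zram0/stop_using_when_full" would pick the
fail-immediately behaviour, and 0 would keep the current keep-trying
behaviour.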


>
>>
>> Since there may be different use cases that desire different things,
>> maybe there should be a zram runtime (or buildtime) config to choose
>> exactly how it decides it's full?  Either full after N write failures,
>> or full when alloced>limit?  That would allow the user to either defer
>> getting full as long as possible (at the possible cost of system
>> unresponsiveness during those write failures), or to just move
>> immediately to zram being full as soon as it can't guarantee that each
>> write will succeed.
>
> Hmm, I thought it and was going to post it when I send v1.
> My idea was this.

What I actually meant was more like this, where ->stop_using_when_full
is a user-configurable param:

bool zram_is_full(...)
{
  if (zram->stop_using_when_full) {
    /* for this, allow 1 write to succeed past limit_pages */
    return zs_get_total_pages(zram) > zram->limit_pages;
  } else {
    return zram->alloc_fail > ALLOC_FAIL_THRESHOLD;
  }
}
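
Filling that in a little (still only a sketch - ALLOC_FAIL_THRESHOLD and
the stop_using_when_full knob are made-up names, while limit_pages,
alloc_fail and zs_get_total_pages() come from the patches quoted above;
note zs_get_total_pages() takes the zsmalloc pool, not the zram device):

static bool zram_is_full(struct zram *zram)
{
	struct zram_meta *meta = zram->meta;

	/* no mem_limit configured: never report full */
	if (!zram->limit_pages)
		return false;

	if (zram->stop_using_when_full)
		/* strict: full as soon as the pool goes past the limit */
		return zs_get_total_pages(meta->mem_pool) > zram->limit_pages;

	/* lenient: only full once enough writes have already bounced */
	return atomic_read(&zram->alloc_fail) > ALLOC_FAIL_THRESHOLD;
}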

>
> int zram_get_free_pages(...)
> {
>         if (zram->limit_pages &&
>                 zram->alloc_fail > FULL_THRESH_HOLD &&
>                 (100 * compr_data_size >> PAGE_SHIFT /
>                         zs_get_total_pages(zram)) > FRAG_THRESH_HOLD) {

Well... I think this implementation has both downsides: it forces write
failures to happen, but it also doesn't guarantee being full after
FULL_THRESH_HOLD write failures.  If the fragmentation level never
reaches FRAG_THRESH_HOLD, it'll fail writes forever.  I can't think of
any way that using the amount of fragmentation will work, because you
can't guarantee it will be reached.  The incoming pages to compress
may all fall into classes that are already full.

With zsmalloc compaction, it would be possible to know that a certain
fragmentation threshold could be reached, but without it that's not a
promise zsmalloc can keep.  And we definitely don't want to fail swap
writes forever.


>                         *free = 0;
>                         return 0;
>         }
>         ..
> }
>
> Maybe we could export FRAG_THRESHOLD.
>
>>
>>
>>
>> >
>> >>
>> >>
>> >>
>> >> >
>> >> > +       atomic_set(&zram->alloc_fail, 0);
>> >> >         update_used_max(zram, alloced_pages);
>> >> >
>> >> >         cmem = zs_map_object(meta->mem_pool, handle, ZS_MM_WO);
>> >> > @@ -951,6 +954,7 @@ static int zram_slot_free_notify(struct block_device *bdev,
>> >> >         return 0;
>> >> >  }
>> >> >
>> >> > +#define FULL_THRESH_HOLD 32
>> >> >  static int zram_get_free_pages(struct block_device *bdev, long *free)
>> >> >  {
>> >> >         struct zram *zram;
>> >> > @@ -959,12 +963,13 @@ static int zram_get_free_pages(struct block_device *bdev, long *free)
>> >> >         zram = bdev->bd_disk->private_data;
>> >> >         meta = zram->meta;
>> >> >
>> >> > -       if (!zram->limit_pages)
>> >> > -               return 1;
>> >> > -
>> >> > -       *free = zram->limit_pages - zs_get_total_pages(meta->mem_pool);
>> >> > +       if (zram->limit_pages &&
>> >> > +               (atomic_read(&zram->alloc_fail) > FULL_THRESH_HOLD)) {
>> >> > +               *free = 0;
>> >> > +               return 0;
>> >> > +       }
>> >> >
>> >> > -       return 0;
>> >> > +       return 1;
>> >>
>> >> There's no way that zram can even provide a accurate number of free
>> >> pages, since it can't know how compressible future stored pages will
>> >> be.  It would be better to simply change this swap_hint from GET_FREE
>> >> to IS_FULL, and return either true or false.
>> >
>> > My plan is that we can give an approximation based on
>> > orig_data_size/compr_data_size with tweaking zero page and vmscan can use
>> > the hint from get_nr_swap_pages to throttle file/anon balance but I want to do
>> > step by step so I didn't include the hint.
>> > If you are strong against with that in this stage, I can change it and
>> > try it later with the number.
>> > Please, say again if you want.
>>
>> since as you said zram is the only user of swap_hint, changing it
>> later shouldn't be a big deal.  And you could have both, IS_FULL and
>> GET_FREE; since the check in scan_swap_map() really only is checking
>> for IS_FULL, if you update vmscan later to adjust its file/anon
>> balance based on GET_FREE, that can be added then with no trouble,
>> right?
>
> Yeb, No problem.
>
>>
>>
>> >
>> > Thanks for the review!
>> >
>> >
>> >>
>> >>
>> >> >  }
>> >> >
>> >> >  static int zram_swap_hint(struct block_device *bdev,
>> >> > diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
>> >> > index 779d03fa4360..182a2544751b 100644
>> >> > --- a/drivers/block/zram/zram_drv.h
>> >> > +++ b/drivers/block/zram/zram_drv.h
>> >> > @@ -115,6 +115,7 @@ struct zram {
>> >> >         u64 disksize;   /* bytes */
>> >> >         int max_comp_streams;
>> >> >         struct zram_stats stats;
>> >> > +       atomic_t alloc_fail;
>> >> >         /*
>> >> >          * the number of pages zram can consume for storing compressed data
>> >> >          */
>> >> > --
>> >> > 2.0.0
>> >> >
>> >> >>
>> >> >> heesub
>> >> >>
>> >> >> >+
>> >> >> >+    return 0;
>> >> >> >+}
>> >> >> >+
>> >> >> >  static int zram_swap_hint(struct block_device *bdev,
>> >> >> >                             unsigned int hint, void *arg)
>> >> >> >  {
>> >> >> >@@ -958,6 +974,8 @@ static int zram_swap_hint(struct block_device *bdev,
>> >> >> >
>> >> >> >     if (hint == SWAP_SLOT_FREE)
>> >> >> >             ret = zram_slot_free_notify(bdev, (unsigned long)arg);
>> >> >> >+    else if (hint == SWAP_GET_FREE)
>> >> >> >+            ret = zram_get_free_pages(bdev, arg);
>> >> >> >
>> >> >> >     return ret;
>> >> >> >  }
>> >> >> >
>> >> >>
>> >> >
>> >> > --
>> >> > Kind regards,
>> >> > Minchan Kim
>> >>
>> >
>> > --
>> > Kind regards,
>> > Minchan Kim
>>
>
> --
> Kind regards,
> Minchan Kim

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC 2/3] mm: add swap_get_free hint for zram
  2014-09-16 15:09             ` Dan Streetman
@ 2014-09-17  7:14               ` Minchan Kim
  -1 siblings, 0 replies; 40+ messages in thread
From: Minchan Kim @ 2014-09-17  7:14 UTC (permalink / raw)
  To: Dan Streetman
  Cc: Andrew Morton, linux-kernel, Linux-MM, Hugh Dickins, Shaohua Li,
	Jerome Marchand, Sergey Senozhatsky, Nitin Gupta,
	Luigi Semenzato

On Tue, Sep 16, 2014 at 11:09:43AM -0400, Dan Streetman wrote:
> On Mon, Sep 15, 2014 at 8:33 PM, Minchan Kim <minchan@kernel.org> wrote:
> > On Mon, Sep 15, 2014 at 10:53:01AM -0400, Dan Streetman wrote:
> >> On Sun, Sep 14, 2014 at 8:30 PM, Minchan Kim <minchan@kernel.org> wrote:
> >> > On Sat, Sep 13, 2014 at 03:01:47PM -0400, Dan Streetman wrote:
> >> >> On Wed, Sep 3, 2014 at 9:39 PM, Minchan Kim <minchan@kernel.org> wrote:
> >> >> > VM uses nr_swap_pages as one of information when it does
> >> >> > anonymous reclaim so that VM is able to throttle amount of swap.
> >> >> >
> >> >> > Normally, the nr_swap_pages is equal to freeable space of swap disk
> >> >> > but for zram, it doesn't match because zram can limit memory usage
> >> >> > by knob(ie, mem_limit) so although VM can see lots of free space
> >> >> > from zram disk, zram can make fail intentionally once the allocated
> >> >> > space is over to limit. If it happens, VM should notice it and
> >> >> > stop reclaimaing until zram can obtain more free space but there
> >> >> > is a good way to do at the moment.
> >> >> >
> >> >> > This patch adds new hint SWAP_GET_FREE which zram can return how
> >> >> > many of freeable space it has. With using that, this patch adds
> >> >> > __swap_full which returns true if the zram is full and substract
> >> >> > remained freeable space of the zram-swap from nr_swap_pages.
> >> >> > IOW, VM sees there is no more swap space of zram so that it stops
> >> >> > anonymous reclaiming until swap_entry_free free a page and increase
> >> >> > nr_swap_pages again.
> >> >> >
> >> >> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> >> >> > ---
> >> >> >  include/linux/blkdev.h |  1 +
> >> >> >  mm/swapfile.c          | 45 +++++++++++++++++++++++++++++++++++++++++++--
> >> >> >  2 files changed, 44 insertions(+), 2 deletions(-)
> >> >> >
> >> >> > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> >> >> > index 17437b2c18e4..c1199806e0f1 100644
> >> >> > --- a/include/linux/blkdev.h
> >> >> > +++ b/include/linux/blkdev.h
> >> >> > @@ -1611,6 +1611,7 @@ static inline bool blk_integrity_is_initialized(struct gendisk *g)
> >> >> >
> >> >> >  enum swap_blk_hint {
> >> >> >         SWAP_SLOT_FREE,
> >> >> > +       SWAP_GET_FREE,
> >> >> >  };
> >> >> >
> >> >> >  struct block_device_operations {
> >> >> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> >> >> > index 4bff521e649a..72737e6dd5e5 100644
> >> >> > --- a/mm/swapfile.c
> >> >> > +++ b/mm/swapfile.c
> >> >> > @@ -484,6 +484,22 @@ new_cluster:
> >> >> >         *scan_base = tmp;
> >> >> >  }
> >> >> >
> >> >> > +static bool __swap_full(struct swap_info_struct *si)
> >> >> > +{
> >> >> > +       if (si->flags & SWP_BLKDEV) {
> >> >> > +               long free;
> >> >> > +               struct gendisk *disk = si->bdev->bd_disk;
> >> >> > +
> >> >> > +               if (disk->fops->swap_hint)
> >> >> > +                       if (!disk->fops->swap_hint(si->bdev,
> >> >> > +                                               SWAP_GET_FREE,
> >> >> > +                                               &free))
> >> >> > +                               return free <= 0;
> >> >> > +       }
> >> >> > +
> >> >> > +       return si->inuse_pages == si->pages;
> >> >> > +}
> >> >> > +
> >> >> >  static unsigned long scan_swap_map(struct swap_info_struct *si,
> >> >> >                                    unsigned char usage)
> >> >> >  {
> >> >> > @@ -583,11 +599,21 @@ checks:
> >> >> >         if (offset == si->highest_bit)
> >> >> >                 si->highest_bit--;
> >> >> >         si->inuse_pages++;
> >> >> > -       if (si->inuse_pages == si->pages) {
> >> >> > +       if (__swap_full(si)) {
> >> >>
> >> >> This check is done after an available offset has already been
> >> >> selected.  So if the variable-size blkdev is full at this point, then
> >> >> this is incorrect, as swap will try to store a page at the current
> >> >> selected offset.
> >> >
> >> > So the result is just fail of a write then what happens?
> >> > Page become redirty and keep it in memory so there is no harm.
> >>
> >> Happening once, it's not a big deal.  But it's not as good as not
> >> happening at all.
> >
> > With your suggestion, we should check full whevever we need new
> > swap slot. To me, it's more concern than just a write fail.
> 
> well normal device fullness, i.e. inuse_pages == pages, is checked for
> each new swap slot also.  I don't see how you would get around
> checking for each new swap slot.

You're right!

> 
> >
> >>
> >> >
> >> >>
> >> >> > +               struct gendisk *disk = si->bdev->bd_disk;
> >> >> > +
> >> >> >                 si->lowest_bit = si->max;
> >> >> >                 si->highest_bit = 0;
> >> >> >                 spin_lock(&swap_avail_lock);
> >> >> >                 plist_del(&si->avail_list, &swap_avail_head);
> >> >> > +               /*
> >> >> > +                * If zram is full, it decreases nr_swap_pages
> >> >> > +                * for stopping anonymous page reclaim until
> >> >> > +                * zram has free space. Look at swap_entry_free
> >> >> > +                */
> >> >> > +               if (disk->fops->swap_hint)
> >> >>
> >> >> Simply checking for the existence of swap_hint isn't enough to know
> >> >> we're using zram...
> >> >
> >> > Yes but acutally the hint have been used for only zram for several years.
> >> > If other user is coming in future, we would add more checks if we really
> >> > need it at that time.
> >> > Do you have another idea?
> >>
> >> Well if this hint == zram just rename it zram.  Especially if it's now
> >> going to be explicitly used to mean it == zram.  But I don't think
> >> that is necessary.
> >
> > I'd like to clarify your comment. So, are you okay without any change?
> 
> no what i meant was i don't think the code at this location is
> necessary...i think you can put a single blkdev swap_hint fullness
> check at the start of scan_swap_map() and remove this.
> 
> 
> >
> >>
> >> >
> >> >>
> >> >> > +                       atomic_long_sub(si->pages - si->inuse_pages,
> >> >> > +                               &nr_swap_pages);
> >> >> >                 spin_unlock(&swap_avail_lock);
> >> >> >         }
> >> >> >         si->swap_map[offset] = usage;
> >> >> > @@ -796,6 +822,7 @@ static unsigned char swap_entry_free(struct swap_info_struct *p,
> >> >> >
> >> >> >         /* free if no reference */
> >> >> >         if (!usage) {
> >> >> > +               struct gendisk *disk = p->bdev->bd_disk;
> >> >> >                 dec_cluster_info_page(p, p->cluster_info, offset);
> >> >> >                 if (offset < p->lowest_bit)
> >> >> >                         p->lowest_bit = offset;
> >> >> > @@ -808,6 +835,21 @@ static unsigned char swap_entry_free(struct swap_info_struct *p,
> >> >> >                                 if (plist_node_empty(&p->avail_list))
> >> >> >                                         plist_add(&p->avail_list,
> >> >> >                                                   &swap_avail_head);
> >> >> > +                               if ((p->flags & SWP_BLKDEV) &&
> >> >> > +                                       disk->fops->swap_hint) {
> >> >>
> >> >> freeing an entry from a full variable-size blkdev doesn't mean it's
> >> >> not still full.  In this case with zsmalloc, freeing one handle
> >> >> doesn't actually free any memory unless it was the only handle left in
> >> >> its containing zspage, and therefore it's possible that it is still
> >> >> full at this point.
> >> >
> >> > No need to free a zspage in zsmalloc.
> >> > If we free a page in zspage, it means we have free space in zspage
> >> > so user can give a chance to user for writing out new page.
> >>
> >> That's not actually true, since zsmalloc has 255 different class
> >> sizes, freeing one page means the next page to be compressed has a
> >> 1/255 chance that it will be the same size as the just-freed page
> >> (assuming random page compressability).
> >
> > I said "a chance" so if we have a possiblity, I'd like to try it.
> > Pz, don't tie your thought into zsmalloc's internal. It's facility
> > to communitcate with swap/zram, not zram allocator.
> > IOW, We could change allocator of zram potentially
> > (ex, historically, we have already done) and the (imaginary allocator/
> > or enhanced zsmalloc) could have a technique to handle it.
> 
> I'm only thinking of what currently will happen.  A lot of this
> depends on exactly how zram's IS_FULL (or GET_FREE) is defined.  If
> it's a black/white boundary then this can certainly assume freeing one
> entry should put the swap_info_struct back onto the avail_list, but if
> it's a not-so-clear full/not-full boundary, then there almost
> certainly will be write failures.
> 
> It seems like avoiding swap write failures is important, especially
> when under heavy memory pressure, but maybe some testing would show
> it's not really a big deal.

Hmm, I can't see why a swap write failure is such a big deal.
I agree it's unusual, since it doesn't happen with real storage, but
what is the actual problem when it does happen? If swap is full, the
system is already slow, so if your concern is the performance cost of
the unnecessary I/O, I don't think it's a critical problem.
My goal is to keep the system responsive: repeated write failures
eventually kick the OOM killer instead of letting the system hang.
In my simple test (a kernel build), this approach meets that goal;
without it, the system becomes unresponsive very easily.

> 
> >
> >>
> >> >
> >> >>
> >> >> > +                                       atomic_long_add(p->pages -
> >> >> > +                                                       p->inuse_pages,
> >> >> > +                                                       &nr_swap_pages);
> >> >> > +                                       /*
> >> >> > +                                        * reset [highest|lowest]_bit to avoid
> >> >> > +                                        * scan_swap_map infinite looping if
> >> >> > +                                        * cached free cluster's index by
> >> >> > +                                        * scan_swap_map_try_ssd_cluster is
> >> >> > +                                        * above p->highest_bit.
> >> >> > +                                        */
> >> >> > +                                       p->highest_bit = p->max - 1;
> >> >> > +                                       p->lowest_bit = 1;
> >> >>
> >> >> lowest_bit and highest_bit are likely to remain at those extremes for
> >> >> a long time, until 1 or max-1 is freed and re-allocated.
> >> >>
> >> >>
> >> >> By adding variable-size blkdev support to swap, I don't think
> >> >> highest_bit can be re-used as a "full" flag anymore.
> >> >>
> >> >> Instead, I suggest that you add a "full" flag to struct
> >> >> swap_info_struct.  Then put a swap_hint GET_FREE check at the top of
> >> >> scan_swap_map(), and if full simply turn "full" on, remove the
> >> >> swap_info_struct from the avail list, reduce nr_swap_pages
> >> >> appropriately, and return failure.  Don't mess with lowest_bit or
> >> >> highest_bit at all.
> >> >
> >> > Could you explain what logic in your suggestion prevent the problem
> >> > I mentioned(ie, scan_swap_map infinite looping)?
> >>
> >> scan_swap_map would immediately exit since the GET_FREE (or IS_FULL)
> >> check is done at its start.  And it wouldn't be called again with that
> >> swap_info_struct until non-full since it is removed from the
> >> avail_list.
> >
> > Sorry for being not clear. I don't mean it.
> > Please consider the situation where swap is not full any more
> > by swap_entry_free. Newly scan_swap_map can select the slot index which
> > is higher than p->highest_bit because we have cached free_cluster so
> > scan_swap_map will reset it with p->lowest_bit and scan again and finally
> > pick the slot index just freed by swap_entry_free and checks again.
> > Then, it could be conflict by scan_swap_map_ssd_cluster_conflict so
> > scan_swap_map_try_ssd_cluster will reset offset, scan_base to free_cluster_head
> > but unfortunately, offset is higher than p->highest_bit so again it is reset
> > to p->lowest_bit. It loops forever :(
> 
> sorry, i don't see what you're talking about...if you don't touch the
> lowest_bit or highest_bit at all when marking zram as full, and also
> don't touch them when marking zram as not-full, then there should
> never be any problem with either.

Aha, I see your point now, and that does feel simpler.
Okay, I will try it.

Thanks for the review!
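
(For the record, a rough sketch of how I read that direction -- one
fullness check at the top of scan_swap_map() plus a new "full" flag in
swap_info_struct, leaving lowest_bit/highest_bit untouched. Untested,
and the swap_is_full() helper name and the si->full field are only
illustrative:)

static bool swap_is_full(struct swap_info_struct *si)
{
        if (si->flags & SWP_BLKDEV) {
                struct gendisk *disk = si->bdev->bd_disk;
                long free;

                if (disk->fops->swap_hint &&
                    !disk->fops->swap_hint(si->bdev, SWAP_GET_FREE, &free))
                        return free <= 0;
        }

        return si->inuse_pages == si->pages;
}

        /* at the very top of scan_swap_map() */
        if (swap_is_full(si)) {
                si->full = true;        /* proposed new field */
                spin_lock(&swap_avail_lock);
                plist_del(&si->avail_list, &swap_avail_head);
                spin_unlock(&swap_avail_lock);
                atomic_long_sub(si->pages - si->inuse_pages, &nr_swap_pages);
                return 0;               /* no slot; don't touch *_bit */
        }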

> 
> The only reason that lowest_bit/highest_bit are reset when
> inuse_pages==pages is because when that condition is true, there
> really are no more offsets available.  So highest_bit really is 0, and
> lowest_bit really is max.  Then, when a single offset is made
> available in swap_entry_free, both lowest_bit and highest_bit are set
> to it, because that really is both the lowest and highest bit.
> 
> In the case of a variable size blkdev, when it's full there actually
> still are more offsets that can be used, so neither lowest nor highest
> bit should be modified.  Just leave them where they are, and when the
> swap_info_struct is placed back onto the avail_list for scanning,
> things will keep working correctly.
> 
> am i missing something?

Nope, your suggestion is better than mine.

> 
> >
> > I'd like to solve this problem without many hooking in swap layer and
> > any overhead for !zram case.
> 
> the only overhead for !zram is the (failing) check for swap_hint,
> which you can't avoid.
> 
> >
> >>
> >> >
> >> >>
> >> >> Then in swap_entry_free(), do something like:
> >> >>
> >> >>     dec_cluster_info_page(p, p->cluster_info, offset);
> >> >>     if (offset < p->lowest_bit)
> >> >>       p->lowest_bit = offset;
> >> >> -   if (offset > p->highest_bit) {
> >> >> -     bool was_full = !p->highest_bit;
> >> >> +   if (offset > p->highest_bit)
> >> >>       p->highest_bit = offset;
> >> >> -     if (was_full && (p->flags & SWP_WRITEOK)) {
> >> >> +   if (p->full && p->flags & SWP_WRITEOK) {
> >> >> +     bool is_var_size_blkdev = is_variable_size_blkdev(p);
> >> >> +     bool blkdev_full = is_variable_size_blkdev_full(p);
> >> >> +
> >> >> +     if (!is_var_size_blkdev || !blkdev_full) {
> >> >> +       if (is_var_size_blkdev)
> >> >> +         atomic_long_add(p->pages - p->inuse_pages, &nr_swap_pages);
> >> >> +       p->full = false;
> >> >>         spin_lock(&swap_avail_lock);
> >> >>         WARN_ON(!plist_node_empty(&p->avail_list));
> >> >>         if (plist_node_empty(&p->avail_list))
> >> >>           plist_add(&p->avail_list,
> >> >>              &swap_avail_head);
> >> >>         spin_unlock(&swap_avail_lock);
> >> >> +     } else if (blkdev_full) {
> >> >> +       /* still full, so this page isn't actually
> >> >> +        * available yet to use; once non-full,
> >> >> +        * pages-inuse_pages will be the correct
> >> >> +        * number to add (above) since below will
> >> >> +        * inuse_pages--
> >> >> +        */
> >> >> +       atomic_long_dec(&nr_swap_pages);
> >> >>       }
> >> >>     }
> >> >>     atomic_long_inc(&nr_swap_pages);
> >> >>
> >> >>
> >> >>
> >> >> > @@ -815,7 +857,6 @@ static unsigned char swap_entry_free(struct swap_info_struct *p,
> >> >> >                 p->inuse_pages--;
> >> >> >                 frontswap_invalidate_page(p->type, offset);
> >> >> >                 if (p->flags & SWP_BLKDEV) {
> >> >> > -                       struct gendisk *disk = p->bdev->bd_disk;
> >> >> >                         if (disk->fops->swap_hint)
> >> >> >                                 disk->fops->swap_hint(p->bdev,
> >> >> >                                                 SWAP_SLOT_FREE,
> >> >> > --
> >> >> > 2.0.0
> >> >> >
> >> >>
> >> >
> >> > --
> >> > Kind regards,
> >> > Minchan Kim
> >>
> >
> > --
> > Kind regards,
> > Minchan Kim
> 

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC 3/3] zram: add swap_get_free hint
  2014-09-16 15:58                 ` Dan Streetman
@ 2014-09-17  7:44                   ` Minchan Kim
  -1 siblings, 0 replies; 40+ messages in thread
From: Minchan Kim @ 2014-09-17  7:44 UTC (permalink / raw)
  To: Dan Streetman
  Cc: Heesub Shin, Andrew Morton, linux-kernel, Linux-MM, Hugh Dickins,
	Shaohua Li, Jerome Marchand, Sergey Senozhatsky, Nitin Gupta,
	Luigi Semenzato

On Tue, Sep 16, 2014 at 11:58:32AM -0400, Dan Streetman wrote:
> On Mon, Sep 15, 2014 at 9:21 PM, Minchan Kim <minchan@kernel.org> wrote:
> > On Mon, Sep 15, 2014 at 12:00:33PM -0400, Dan Streetman wrote:
> >> On Sun, Sep 14, 2014 at 8:57 PM, Minchan Kim <minchan@kernel.org> wrote:
> >> > On Sat, Sep 13, 2014 at 03:39:13PM -0400, Dan Streetman wrote:
> >> >> On Thu, Sep 4, 2014 at 7:59 PM, Minchan Kim <minchan@kernel.org> wrote:
> >> >> > Hi Heesub,
> >> >> >
> >> >> > On Thu, Sep 04, 2014 at 03:26:14PM +0900, Heesub Shin wrote:
> >> >> >> Hello Minchan,
> >> >> >>
> >> >> >> First of all, I agree with the overall purpose of your patch set.
> >> >> >
> >> >> > Thank you.
> >> >> >
> >> >> >>
> >> >> >> On 09/04/2014 10:39 AM, Minchan Kim wrote:
> >> >> >> >This patch implement SWAP_GET_FREE handler in zram so that VM can
> >> >> >> >know how many zram has freeable space.
> >> >> >> >VM can use it to stop anonymous reclaiming once zram is full.
> >> >> >> >
> >> >> >> >Signed-off-by: Minchan Kim <minchan@kernel.org>
> >> >> >> >---
> >> >> >> >  drivers/block/zram/zram_drv.c | 18 ++++++++++++++++++
> >> >> >> >  1 file changed, 18 insertions(+)
> >> >> >> >
> >> >> >> >diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> >> >> >> >index 88661d62e46a..8e22b20aa2db 100644
> >> >> >> >--- a/drivers/block/zram/zram_drv.c
> >> >> >> >+++ b/drivers/block/zram/zram_drv.c
> >> >> >> >@@ -951,6 +951,22 @@ static int zram_slot_free_notify(struct block_device *bdev,
> >> >> >> >     return 0;
> >> >> >> >  }
> >> >> >> >
> >> >> >> >+static int zram_get_free_pages(struct block_device *bdev, long *free)
> >> >> >> >+{
> >> >> >> >+    struct zram *zram;
> >> >> >> >+    struct zram_meta *meta;
> >> >> >> >+
> >> >> >> >+    zram = bdev->bd_disk->private_data;
> >> >> >> >+    meta = zram->meta;
> >> >> >> >+
> >> >> >> >+    if (!zram->limit_pages)
> >> >> >> >+            return 1;
> >> >> >> >+
> >> >> >> >+    *free = zram->limit_pages - zs_get_total_pages(meta->mem_pool);
> >> >> >>
> >> >> >> Even if 'free' is zero here, there may be free spaces available to
> >> >> >> store more compressed pages into the zs_pool. I mean calculation
> >> >> >> above is not quite accurate and wastes memory, but have no better
> >> >> >> idea for now.
> >> >> >
> >> >> > Yeb, good point.
> >> >> >
> >> >> > Actually, I thought about that but in this patchset, I wanted to
> >> >> > go with conservative approach which is a safe guard to prevent
> >> >> > system hang which is terrible than early OOM kill.
> >> >> >
> >> >> > Whole point of this patchset is to add a facility to VM and VM
> >> >> > collaborates with zram via the interface to avoid worst case
> >> >> > (ie, system hang) and logic to throttle could be enhanced by
> >> >> > several approaches in future but I agree my logic was too simple
> >> >> > and conservative.
> >> >> >
> >> >> > We could improve it with [anti|de]fragmentation in future but
> >> >> > at the moment, below simple heuristic is not too bad for first
> >> >> > step. :)
> >> >> >
> >> >> >
> >> >> > ---
> >> >> >  drivers/block/zram/zram_drv.c | 15 ++++++++++-----
> >> >> >  drivers/block/zram/zram_drv.h |  1 +
> >> >> >  2 files changed, 11 insertions(+), 5 deletions(-)
> >> >> >
> >> >> > diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> >> >> > index 8e22b20aa2db..af9dfe6a7d2b 100644
> >> >> > --- a/drivers/block/zram/zram_drv.c
> >> >> > +++ b/drivers/block/zram/zram_drv.c
> >> >> > @@ -410,6 +410,7 @@ static bool zram_free_page(struct zram *zram, size_t index)
> >> >> >         atomic64_sub(zram_get_obj_size(meta, index),
> >> >> >                         &zram->stats.compr_data_size);
> >> >> >         atomic64_dec(&zram->stats.pages_stored);
> >> >> > +       atomic_set(&zram->alloc_fail, 0);
> >> >> >
> >> >> >         meta->table[index].handle = 0;
> >> >> >         zram_set_obj_size(meta, index, 0);
> >> >> > @@ -600,10 +601,12 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
> >> >> >         alloced_pages = zs_get_total_pages(meta->mem_pool);
> >> >> >         if (zram->limit_pages && alloced_pages > zram->limit_pages) {
> >> >> >                 zs_free(meta->mem_pool, handle);
> >> >> > +               atomic_inc(&zram->alloc_fail);
> >> >> >                 ret = -ENOMEM;
> >> >> >                 goto out;
> >> >> >         }
> >> >>
> >> >> This isn't going to work well at all with swap.  There will be,
> >> >> minimum, 32 failures to write a swap page before GET_FREE finally
> >> >> indicates it's full, and even then a single free during those 32
> >> >> failures will restart the counter, so it could be dozens or hundreds
> >> >> (or more) swap write failures before the zram device is marked as
> >> >> full.  And then, a single zram free will move it back to non-full and
> >> >> start the write failures over again.
> >> >>
> >> >> I think it would be better to just check for actual fullness (i.e.
> >> >> alloced_pages > limit_pages) at the start of write, and fail if so.
> >> >> That will allow a single write to succeed when it crosses into
> >> >> fullness, and the if GET_FREE is changed to a simple IS_FULL and uses
> >> >> the same check (alloced_pages > limit_pages), then swap shouldn't see
> >> >> any write failures (or very few), and zram will stay full until enough
> >> >> pages are freed that it really does move under limit_pages.
> >> >
> >> > The alloced_pages > limit_pages doesn't mean zram is full so with your
> >> > approach, it could kick OOM earlier which is not what we want.
> >> > Because our product uses zram to delay app killing by low memory killer.
> >>
> >> With zram, the meaning of "full" isn't as obvious as other fixed-size
> >> storage devices.  Obviously, "full" usually means "no more room to
> >> store anything", while "not full" means "there is room to store
> >> anything, up to the remaining free size".  With zram, its zsmalloc
> >> pool size might be over the specified limit, but there will still be
> >> room to store *some* things - but not *anything*.  Only compressed
> >> pages that happen to fit inside a class with at least one zspage that
> >> isn't full.
> >>
> >> Clearly, we shouldn't wait to declare zram "full" only once zsmalloc
> >> is 100% full in all its classes.
> >>
> >> What about waiting until there is N number of write failures, like
> >> this patch?  That doesn't seem very fair to the writer, since each
> >> write failure will cause them to do extra work (first, in selecting
> >> what to write, and then in recovering from the failed write).
> >> However, it will probably squeeze some writes into some of those empty
> >> spaces in already-allocated zspages.
> >>
> >> And declaring zram "full" immediately once the zsmalloc pool size
> >> increases past the specified limit?  Since zsmalloc's classes almost
> >> certainly contain some fragmentation, that will waste all the empty
> >> spaces that could still store more compressed pages.  But, this is the
> >> limit at which you cannot guarantee all writes to be able to store a
> >> compressed page - any zsmalloc classes without a partially empty
> >> zspage will have to increase zsmalloc's size, therefore failing the
> >> write.
> >>
> >> Neither definition of "full" is optimal.  Since in this case we're
> >> talking about swap, I think forcing swap write failures to happen,
> >> which with direct reclaim could (I believe) stop everything while the
> >> write failures continue, should be avoided as much as possible.  Even
> >> when zram fullness is delayed by N write failures, to try to squeeze
> >> out as much storage from zsmalloc as possible, when it does eventually
> >> fill if zram is the only swap device the system will OOM anyway.  And
> >> if zram isn't the only swap device, but just the first (highest
> >> priority), then delaying things with unneeded write failures is
> >> certainly not better than just filling up so swap can move on to the
> >> next swap device.  The only case where write failures delaying marking
> >> zram as full will help is if the system stopped right at this point,
> >> and then started decreasing how much memory was needed.  That seems
> >> like a very unlikely coincidence, but maybe some testing would help
> >> determine how bad the write failures affect system
> >> performance/responsiveness and how long they delay OOM.
> >
> > Please, keep in mind that swap is alreay really slow operation but
> > we want to use it to avoid OOM if possible so I can't buy your early
> > kill suggestion.
> 
> I disagree, OOM should be invoked once the system can't proceed with
> reclaiming memory.  IMHO, repeated swap write failures will cause the
> system to be unable to reclaim memory.

That's what I want: go to OOM once repeated swap write failures happen.
The difference between us is how aggressively we should kick OOM.
Your proposal is so aggressive that it can trigger OOM too early, which
makes swap inefficient. That's what I'd like to avoid.

> 
> > If a user feel it's really slow for his product,
> > it means his admin was fail. He should increase the limit of zram
> > dynamically or statically(zram already support that ways).
> >
> > The thing I'd like to solve in this patchset is to avoid system hang
> > where admin cannot do anyting, even ctrl+c, which is thing should
> > support in OS level.
> 
> what's better - failing a lot of swap writes, or marking the swap
> device as full?  As I said if zram is the *only* swap device in the
> system, maybe that makes sense (although it's still questionable).  If
> zram is only the first swap device, and there's a backup swap device
> (presumably that just writes to disk), then it will be *much* better
> to simply fail over to that, instead of (repeatedly) failing a lot of
> swap writes.

Actually, I haven't heard of such a use case until now, but I can't ignore
it because it's a perfectly doable configuration, so I agree we need some knob.

> 
> Especially once direct reclaim is reached, failing swap writes is
> probably going to make the system unresponsive.  Personally I think
> moving to OOM (or the next swap device) is better.

As I said, that's what I want! But your suggestion is too aggressive:
the system may still have resources that can be freed easily (e.g., page
cache, purgeable memory, or unimportant processes that could be killed).

> 
> If write failures are the direction you go, then IMHO there should *at
> least* be a zram parameter to allow the admin to choose to immediately
> fail or continue with write failures.

Agree.
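
(For illustration only, one hypothetical shape for such a knob, modelled
on the existing mem_limit attribute; the attribute name and the
zram->stop_when_full field are invented here:)

static ssize_t stop_when_full_store(struct device *dev,
                struct device_attribute *attr, const char *buf, size_t len)
{
        struct zram *zram = dev_to_zram(dev);
        unsigned long val;

        if (kstrtoul(buf, 10, &val))
                return -EINVAL;

        /*
         * 0: keep accepting (and possibly failing) writes once over the limit
         * 1: report full to the VM as soon as mem_limit is exceeded
         */
        zram->stop_when_full = !!val;

        return len;
}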

> 
> 
> >
> >>
> >> Since there may be different use cases that desire different things,
> >> maybe there should be a zram runtime (or buildtime) config to choose
> >> exactly how it decides it's full?  Either full after N write failures,
> >> or full when alloced>limit?  That would allow the user to either defer
> >> getting full as long as possible (at the possible cost of system
> >> unresponsiveness during those write failures), or to just move
> >> immediately to zram being full as soon as it can't guarantee that each
> >> write will succeed.
> >
> > Hmm, I thought it and was going to post it when I send v1.
> > My idea was this.
> 
> what i actually meant was more like this, where ->stop_using_when_full
> is a user-configurable param:
> 
> bool zram_is_full(...)
> {
>   if (zram->stop_using_when_full) {
>     /* for this, allow 1 write to succeed past limit_pages */
>     return zs_get_total_pages(zram) > zram->limit_pages;
>   } else {
>     return zram->alloc_fail > ALLOC_FAIL_THRESHOLD;
>   }
> }

To me, that's too simple; it leaves no way to tune how zram fullness is
decided. How about this one?

bool zram_is_full(struct zram *zram)
{
        unsigned long total_pages, compr_pages;

        if (!zram->limit_pages)
                return false;

        total_pages = zs_get_total_pages(zram->meta->mem_pool);
        /* how much of zsmalloc's memory is actually compressed data */
        compr_pages = atomic64_read(&zram->stats.compr_data_size) >> PAGE_SHIFT;
        if (total_pages >= zram->limit_pages &&
                100 * compr_pages / total_pages > FRAG_THRESH_HOLD)
                return true;

        if (atomic_read(&zram->alloc_fail) > FULL_THRESH_HOLD)
                return true;

        return false;
}

So if someone wants to avoid write failures and can live with an earlier
OOM, they can set FRAG_THRESH_HOLD to 0.
Any thoughts?
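
(A hypothetical example of how the two thresholds would interact: with
mem_limit set to 100M, i.e. limit_pages = 25600, and FRAG_THRESH_HOLD = 75,
once zsmalloc holds 25600 pages the device reports full only if compressed
data fills more than 75% of those pages; below that it keeps returning
not-full and relies on the alloc_fail counter crossing FULL_THRESH_HOLD
instead.)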

> 
> >
> > int zram_get_free_pages(...)
> > {
> >         if (zram->limit_pages &&
> >                 zram->alloc_fail > FULL_THRESH_HOLD &&
> >                 (100 * compr_data_size >> PAGE_SHIFT /
> >                         zs_get_total_pages(zram)) > FRAG_THRESH_HOLD) {
> 
> well...i think this implementation has both downsides; it forces write
> failures to happen, but also it doesn't guarantee being full after
> FULL_THRESHOLD write failures.  If the fragmentation level never
> reaches FRAG_THRESHOLD, it'll fail writes forever.  I can't think of
> any way that using the amount of fragmentation will work, because you
> can't guarantee it will be reached.  The incoming pages to compress
> may all fall into classes that are already full.
> 
> with zsmalloc compaction, it would be possible to know that a certain
> fragmentation threshold could be reached, but without it that's not a
> promise zsmalloc can keep.  And we definitely don't want to fail swap
> writes forever.
> 
> 
> >                         *free = 0;
> >                         return 0;
> >         }
> >         ..
> > }
> >
> > Maybe we could export FRAG_THRESHOLD.
> >
> >>
> >>
> >>
> >> >
> >> >>
> >> >>
> >> >>
> >> >> >
> >> >> > +       atomic_set(&zram->alloc_fail, 0);
> >> >> >         update_used_max(zram, alloced_pages);
> >> >> >
> >> >> >         cmem = zs_map_object(meta->mem_pool, handle, ZS_MM_WO);
> >> >> > @@ -951,6 +954,7 @@ static int zram_slot_free_notify(struct block_device *bdev,
> >> >> >         return 0;
> >> >> >  }
> >> >> >
> >> >> > +#define FULL_THRESH_HOLD 32
> >> >> >  static int zram_get_free_pages(struct block_device *bdev, long *free)
> >> >> >  {
> >> >> >         struct zram *zram;
> >> >> > @@ -959,12 +963,13 @@ static int zram_get_free_pages(struct block_device *bdev, long *free)
> >> >> >         zram = bdev->bd_disk->private_data;
> >> >> >         meta = zram->meta;
> >> >> >
> >> >> > -       if (!zram->limit_pages)
> >> >> > -               return 1;
> >> >> > -
> >> >> > -       *free = zram->limit_pages - zs_get_total_pages(meta->mem_pool);
> >> >> > +       if (zram->limit_pages &&
> >> >> > +               (atomic_read(&zram->alloc_fail) > FULL_THRESH_HOLD)) {
> >> >> > +               *free = 0;
> >> >> > +               return 0;
> >> >> > +       }
> >> >> >
> >> >> > -       return 0;
> >> >> > +       return 1;
> >> >>
> >> >> There's no way that zram can even provide a accurate number of free
> >> >> pages, since it can't know how compressible future stored pages will
> >> >> be.  It would be better to simply change this swap_hint from GET_FREE
> >> >> to IS_FULL, and return either true or false.
> >> >
> >> > My plan is that we can give an approximation based on
> >> > orig_data_size/compr_data_size with tweaking zero page and vmscan can use
> >> > the hint from get_nr_swap_pages to throttle file/anon balance but I want to do
> >> > step by step so I didn't include the hint.
> >> > If you are strong against with that in this stage, I can change it and
> >> > try it later with the number.
> >> > Please, say again if you want.
> >>
> >> since as you said zram is the only user of swap_hint, changing it
> >> later shouldn't be a big deal.  And you could have both, IS_FULL and
> >> GET_FREE; since the check in scan_swap_map() really only is checking
> >> for IS_FULL, if you update vmscan later to adjust its file/anon
> >> balance based on GET_FREE, that can be added then with no trouble,
> >> right?
> >
> > Yeb, No problem.
> >
> >>
> >>
> >> >
> >> > Thanks for the review!
> >> >
> >> >
> >> >>
> >> >>
> >> >> >  }
> >> >> >
> >> >> >  static int zram_swap_hint(struct block_device *bdev,
> >> >> > diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
> >> >> > index 779d03fa4360..182a2544751b 100644
> >> >> > --- a/drivers/block/zram/zram_drv.h
> >> >> > +++ b/drivers/block/zram/zram_drv.h
> >> >> > @@ -115,6 +115,7 @@ struct zram {
> >> >> >         u64 disksize;   /* bytes */
> >> >> >         int max_comp_streams;
> >> >> >         struct zram_stats stats;
> >> >> > +       atomic_t alloc_fail;
> >> >> >         /*
> >> >> >          * the number of pages zram can consume for storing compressed data
> >> >> >          */
> >> >> > --
> >> >> > 2.0.0
> >> >> >
> >> >> >>
> >> >> >> heesub
> >> >> >>
> >> >> >> >+
> >> >> >> >+    return 0;
> >> >> >> >+}
> >> >> >> >+
> >> >> >> >  static int zram_swap_hint(struct block_device *bdev,
> >> >> >> >                             unsigned int hint, void *arg)
> >> >> >> >  {
> >> >> >> >@@ -958,6 +974,8 @@ static int zram_swap_hint(struct block_device *bdev,
> >> >> >> >
> >> >> >> >     if (hint == SWAP_SLOT_FREE)
> >> >> >> >             ret = zram_slot_free_notify(bdev, (unsigned long)arg);
> >> >> >> >+    else if (hint == SWAP_GET_FREE)
> >> >> >> >+            ret = zram_get_free_pages(bdev, arg);
> >> >> >> >
> >> >> >> >     return ret;
> >> >> >> >  }
> >> >> >> >
> >> >> >>
> >> >> >
> >> >> > --
> >> >> > Kind regards,
> >> >> > Minchan Kim
> >> >>
> >> >
> >> > --
> >> > Kind regards,
> >> > Minchan Kim
> >>
> >
> > --
> > Kind regards,
> > Minchan Kim
> 

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC 3/3] zram: add swap_get_free hint
@ 2014-09-17  7:44                   ` Minchan Kim
  0 siblings, 0 replies; 40+ messages in thread
From: Minchan Kim @ 2014-09-17  7:44 UTC (permalink / raw)
  To: Dan Streetman
  Cc: Heesub Shin, Andrew Morton, linux-kernel, Linux-MM, Hugh Dickins,
	Shaohua Li, Jerome Marchand, Sergey Senozhatsky, Nitin Gupta,
	Luigi Semenzato

On Tue, Sep 16, 2014 at 11:58:32AM -0400, Dan Streetman wrote:
> On Mon, Sep 15, 2014 at 9:21 PM, Minchan Kim <minchan@kernel.org> wrote:
> > On Mon, Sep 15, 2014 at 12:00:33PM -0400, Dan Streetman wrote:
> >> On Sun, Sep 14, 2014 at 8:57 PM, Minchan Kim <minchan@kernel.org> wrote:
> >> > On Sat, Sep 13, 2014 at 03:39:13PM -0400, Dan Streetman wrote:
> >> >> On Thu, Sep 4, 2014 at 7:59 PM, Minchan Kim <minchan@kernel.org> wrote:
> >> >> > Hi Heesub,
> >> >> >
> >> >> > On Thu, Sep 04, 2014 at 03:26:14PM +0900, Heesub Shin wrote:
> >> >> >> Hello Minchan,
> >> >> >>
> >> >> >> First of all, I agree with the overall purpose of your patch set.
> >> >> >
> >> >> > Thank you.
> >> >> >
> >> >> >>
> >> >> >> On 09/04/2014 10:39 AM, Minchan Kim wrote:
> >> >> >> >This patch implement SWAP_GET_FREE handler in zram so that VM can
> >> >> >> >know how many zram has freeable space.
> >> >> >> >VM can use it to stop anonymous reclaiming once zram is full.
> >> >> >> >
> >> >> >> >Signed-off-by: Minchan Kim <minchan@kernel.org>
> >> >> >> >---
> >> >> >> >  drivers/block/zram/zram_drv.c | 18 ++++++++++++++++++
> >> >> >> >  1 file changed, 18 insertions(+)
> >> >> >> >
> >> >> >> >diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> >> >> >> >index 88661d62e46a..8e22b20aa2db 100644
> >> >> >> >--- a/drivers/block/zram/zram_drv.c
> >> >> >> >+++ b/drivers/block/zram/zram_drv.c
> >> >> >> >@@ -951,6 +951,22 @@ static int zram_slot_free_notify(struct block_device *bdev,
> >> >> >> >     return 0;
> >> >> >> >  }
> >> >> >> >
> >> >> >> >+static int zram_get_free_pages(struct block_device *bdev, long *free)
> >> >> >> >+{
> >> >> >> >+    struct zram *zram;
> >> >> >> >+    struct zram_meta *meta;
> >> >> >> >+
> >> >> >> >+    zram = bdev->bd_disk->private_data;
> >> >> >> >+    meta = zram->meta;
> >> >> >> >+
> >> >> >> >+    if (!zram->limit_pages)
> >> >> >> >+            return 1;
> >> >> >> >+
> >> >> >> >+    *free = zram->limit_pages - zs_get_total_pages(meta->mem_pool);
> >> >> >>
> >> >> >> Even if 'free' is zero here, there may be free spaces available to
> >> >> >> store more compressed pages into the zs_pool. I mean calculation
> >> >> >> above is not quite accurate and wastes memory, but have no better
> >> >> >> idea for now.
> >> >> >
> >> >> > Yeb, good point.
> >> >> >
> >> >> > Actually, I thought about that but in this patchset, I wanted to
> >> >> > go with conservative approach which is a safe guard to prevent
> >> >> > system hang which is terrible than early OOM kill.
> >> >> >
> >> >> > Whole point of this patchset is to add a facility to VM and VM
> >> >> > collaborates with zram via the interface to avoid worst case
> >> >> > (ie, system hang) and logic to throttle could be enhanced by
> >> >> > several approaches in future but I agree my logic was too simple
> >> >> > and conservative.
> >> >> >
> >> >> > We could improve it with [anti|de]fragmentation in future but
> >> >> > at the moment, below simple heuristic is not too bad for first
> >> >> > step. :)
> >> >> >
> >> >> >
> >> >> > ---
> >> >> >  drivers/block/zram/zram_drv.c | 15 ++++++++++-----
> >> >> >  drivers/block/zram/zram_drv.h |  1 +
> >> >> >  2 files changed, 11 insertions(+), 5 deletions(-)
> >> >> >
> >> >> > diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> >> >> > index 8e22b20aa2db..af9dfe6a7d2b 100644
> >> >> > --- a/drivers/block/zram/zram_drv.c
> >> >> > +++ b/drivers/block/zram/zram_drv.c
> >> >> > @@ -410,6 +410,7 @@ static bool zram_free_page(struct zram *zram, size_t index)
> >> >> >         atomic64_sub(zram_get_obj_size(meta, index),
> >> >> >                         &zram->stats.compr_data_size);
> >> >> >         atomic64_dec(&zram->stats.pages_stored);
> >> >> > +       atomic_set(&zram->alloc_fail, 0);
> >> >> >
> >> >> >         meta->table[index].handle = 0;
> >> >> >         zram_set_obj_size(meta, index, 0);
> >> >> > @@ -600,10 +601,12 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
> >> >> >         alloced_pages = zs_get_total_pages(meta->mem_pool);
> >> >> >         if (zram->limit_pages && alloced_pages > zram->limit_pages) {
> >> >> >                 zs_free(meta->mem_pool, handle);
> >> >> > +               atomic_inc(&zram->alloc_fail);
> >> >> >                 ret = -ENOMEM;
> >> >> >                 goto out;
> >> >> >         }
> >> >>
> >> >> This isn't going to work well at all with swap.  There will be,
> >> >> minimum, 32 failures to write a swap page before GET_FREE finally
> >> >> indicates it's full, and even then a single free during those 32
> >> >> failures will restart the counter, so it could be dozens or hundreds
> >> >> (or more) swap write failures before the zram device is marked as
> >> >> full.  And then, a single zram free will move it back to non-full and
> >> >> start the write failures over again.
> >> >>
> >> >> I think it would be better to just check for actual fullness (i.e.
> >> >> alloced_pages > limit_pages) at the start of write, and fail if so.
> >> >> That will allow a single write to succeed when it crosses into
> >> >> fullness, and the if GET_FREE is changed to a simple IS_FULL and uses
> >> >> the same check (alloced_pages > limit_pages), then swap shouldn't see
> >> >> any write failures (or very few), and zram will stay full until enough
> >> >> pages are freed that it really does move under limit_pages.
> >> >
> >> > The alloced_pages > limit_pages doesn't mean zram is full so with your
> >> > approach, it could kick OOM earlier which is not what we want.
> >> > Because our product uses zram to delay app killing by low memory killer.
> >>
> >> With zram, the meaning of "full" isn't as obvious as other fixed-size
> >> storage devices.  Obviously, "full" usually means "no more room to
> >> store anything", while "not full" means "there is room to store
> >> anything, up to the remaining free size".  With zram, its zsmalloc
> >> pool size might be over the specified limit, but there will still be
> >> room to store *some* things - but not *anything*.  Only compressed
> >> pages that happen to fit inside a class with at least one zspage that
> >> isn't full.
> >>
> >> Clearly, we shouldn't wait to declare zram "full" only once zsmalloc
> >> is 100% full in all its classes.
> >>
> >> What about waiting until there is N number of write failures, like
> >> this patch?  That doesn't seem very fair to the writer, since each
> >> write failure will cause them to do extra work (first, in selecting
> >> what to write, and then in recovering from the failed write).
> >> However, it will probably squeeze some writes into some of those empty
> >> spaces in already-allocated zspages.
> >>
> >> And declaring zram "full" immediately once the zsmalloc pool size
> >> increases past the specified limit?  Since zsmalloc's classes almost
> >> certainly contain some fragmentation, that will waste all the empty
> >> spaces that could still store more compressed pages.  But, this is the
> >> limit at which you cannot guarantee all writes to be able to store a
> >> compressed page - any zsmalloc classes without a partially empty
> >> zspage will have to increase zsmalloc's size, therefore failing the
> >> write.
> >>
> >> Neither definition of "full" is optimal.  Since in this case we're
> >> talking about swap, I think forcing swap write failures to happen,
> >> which with direct reclaim could (I believe) stop everything while the
> >> write failures continue, should be avoided as much as possible.  Even
> >> when zram fullness is delayed by N write failures, to try to squeeze
> >> out as much storage from zsmalloc as possible, when it does eventually
> >> fill if zram is the only swap device the system will OOM anyway.  And
> >> if zram isn't the only swap device, but just the first (highest
> >> priority), then delaying things with unneeded write failures is
> >> certainly not better than just filling up so swap can move on to the
> >> next swap device.  The only case where write failures delaying marking
> >> zram as full will help is if the system stopped right at this point,
> >> and then started decreasing how much memory was needed.  That seems
> >> like a very unlikely coincidence, but maybe some testing would help
> >> determine how bad the write failures affect system
> >> performance/responsiveness and how long they delay OOM.
> >
> > Please, keep in mind that swap is alreay really slow operation but
> > we want to use it to avoid OOM if possible so I can't buy your early
> > kill suggestion.
> 
> I disagree, OOM should be invoked once the system can't proceed with
> reclaiming memory.  IMHO, repeated swap write failures will cause the
> system to be unable to reclaim memory.

That's what I want: go to OOM once repeated swap write failures happen.
The difference between us is how aggressively we should kick OOM.
Your proposal is so aggressive that it can trigger OOM too early, which
makes swap inefficient. That's what I'd like to avoid.

> 
> > If a user feel it's really slow for his product,
> > it means his admin was fail. He should increase the limit of zram
> > dynamically or statically(zram already support that ways).
> >
> > The thing I'd like to solve in this patchset is to avoid system hang
> > where admin cannot do anyting, even ctrl+c, which is thing should
> > support in OS level.
> 
> what's better - failing a lot of swap writes, or marking the swap
> device as full?  As I said if zram is the *only* swap device in the
> system, maybe that makes sense (although it's still questionable).  If
> zram is only the first swap device, and there's a backup swap device
> (presumably that just writes to disk), then it will be *much* better
> to simply fail over to that, instead of (repeatedly) failing a lot of
> swap writes.

Actually, I haven't heard of such a use case so far, but I can't ignore it
because it's a perfectly doable configuration, so I agree we need some knob.

> 
> Especially once direct reclaim is reached, failing swap writes is
> probably going to make the system unresponsive.  Personally I think
> moving to OOM (or the next swap device) is better.

As I said, that's what I want! But your suggestion was too aggressive.
The system can still have resources which can be freed easily (e.g., page cache,
purgeable memory, or unimportant processes that could be killed).

> 
> If write failures are the direction you go, then IMHO there should *at
> least* be a zram parameter to allow the admin to choose to immediately
> fail or continue with write failures.

Agree.

> 
> 
> >
> >>
> >> Since there may be different use cases that desire different things,
> >> maybe there should be a zram runtime (or buildtime) config to choose
> >> exactly how it decides it's full?  Either full after N write failures,
> >> or full when alloced>limit?  That would allow the user to either defer
> >> getting full as long as possible (at the possible cost of system
> >> unresponsiveness during those write failures), or to just move
> >> immediately to zram being full as soon as it can't guarantee that each
> >> write will succeed.
> >
> > Hmm, I thought it and was going to post it when I send v1.
> > My idea was this.
> 
> what i actually meant was more like this, where ->stop_using_when_full
> is a user-configurable param:
> 
> bool zram_is_full(...)
> {
>   if (zram->stop_using_when_full) {
>     /* for this, allow 1 write to succeed past limit_pages */
>     return zs_get_total_pages(zram) > zram->limit_pages;
>   } else {
>     return zram->alloc_fail > ALLOC_FAIL_THRESHOLD;
>   }
> }

To me, that's too simple, so there is no way to tune zram fullness.
How about this one?

bool zram_is_full(...)
{
        unsigned long total_pages;
        if (!zram->limit_pages)
                return false;

        total_pages = zs_get_total_pages(zram);
        if (total_pages >= zram->limit_pages &&
                (100 * (compr_data_size >> PAGE_SHIFT) / total_pages) > FRAG_THRESH_HOLD)
                return true;

        if (zram->alloc_fail > FULL_THRESH_HOLD)
                return true;

        return false;
}

So if someone wants to avoid write failures and can live with an earlier OOM,
he can set FRAG_THRESH_HOLD to 0.
Any thoughts?
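
Just to make the shape of the heuristic concrete, here is a tiny user-space
model of it. This is only a sketch: the struct, the helper name and the
threshold values below are made up for illustration, not the real zram code.

#include <stdbool.h>
#include <stdio.h>

#define FULL_THRESHOLD 32	/* consecutive failed writes before giving up */
#define FRAG_THRESHOLD 80	/* % of the pool that must hold real data     */

/* stand-in for the zram/zsmalloc state the real check would read */
struct zram_model {
	unsigned long limit_pages;	/* mem_limit in pages, 0 == no limit */
	unsigned long total_pages;	/* zs_get_total_pages()              */
	unsigned long compr_pages;	/* compr_data_size >> PAGE_SHIFT     */
	int alloc_fail;			/* failed writes since the last free */
};

static bool zram_model_is_full(const struct zram_model *z)
{
	if (!z->limit_pages)
		return false;

	/* over the limit and well packed: more write failures are
	 * unlikely to squeeze anything else into the pool */
	if (z->total_pages >= z->limit_pages &&
	    100 * z->compr_pages / z->total_pages > FRAG_THRESHOLD)
		return true;

	/* we already burned enough write failures */
	if (z->alloc_fail > FULL_THRESHOLD)
		return true;

	return false;
}

int main(void)
{
	struct zram_model fragmented = { 1000, 1000, 500,  0 };
	struct zram_model packed     = { 1000, 1000, 900,  0 };
	struct zram_model worn_out   = { 1000, 1000, 500, 40 };

	printf("fragmented pool: full=%d\n", zram_model_is_full(&fragmented));
	printf("packed pool:     full=%d\n", zram_model_is_full(&packed));
	printf("many failures:   full=%d\n", zram_model_is_full(&worn_out));
	return 0;
}

With FRAG_THRESHOLD set to 0, the first condition collapses to "full as soon
as the pool reaches the limit", which matches the no-write-failure behaviour
mentioned above.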

> 
> >
> > int zram_get_free_pages(...)
> > {
> >         if (zram->limit_pages &&
> >                 zram->alloc_fail > FULL_THRESH_HOLD &&
> >                 (100 * compr_data_size >> PAGE_SHIFT /
> >                         zs_get_total_pages(zram)) > FRAG_THRESH_HOLD) {
> 
> well...i think this implementation has both downsides; it forces write
> failures to happen, but also it doesn't guarantee being full after
> FULL_THRESHOLD write failures.  If the fragmentation level never
> reaches FRAG_THRESHOLD, it'll fail writes forever.  I can't think of
> any way that using the amount of fragmentation will work, because you
> can't guarantee it will be reached.  The incoming pages to compress
> may all fall into classes that are already full.
> 
> with zsmalloc compaction, it would be possible to know that a certain
> fragmentation threshold could be reached, but without it that's not a
> promise zsmalloc can keep.  And we definitely don't want to fail swap
> writes forever.
> 
> 
> >                         *free = 0;
> >                         return 0;
> >         }
> >         ..
> > }
> >
> > Maybe we could export FRAG_THRESHOLD.
> >
> >>
> >>
> >>
> >> >
> >> >>
> >> >>
> >> >>
> >> >> >
> >> >> > +       atomic_set(&zram->alloc_fail, 0);
> >> >> >         update_used_max(zram, alloced_pages);
> >> >> >
> >> >> >         cmem = zs_map_object(meta->mem_pool, handle, ZS_MM_WO);
> >> >> > @@ -951,6 +954,7 @@ static int zram_slot_free_notify(struct block_device *bdev,
> >> >> >         return 0;
> >> >> >  }
> >> >> >
> >> >> > +#define FULL_THRESH_HOLD 32
> >> >> >  static int zram_get_free_pages(struct block_device *bdev, long *free)
> >> >> >  {
> >> >> >         struct zram *zram;
> >> >> > @@ -959,12 +963,13 @@ static int zram_get_free_pages(struct block_device *bdev, long *free)
> >> >> >         zram = bdev->bd_disk->private_data;
> >> >> >         meta = zram->meta;
> >> >> >
> >> >> > -       if (!zram->limit_pages)
> >> >> > -               return 1;
> >> >> > -
> >> >> > -       *free = zram->limit_pages - zs_get_total_pages(meta->mem_pool);
> >> >> > +       if (zram->limit_pages &&
> >> >> > +               (atomic_read(&zram->alloc_fail) > FULL_THRESH_HOLD)) {
> >> >> > +               *free = 0;
> >> >> > +               return 0;
> >> >> > +       }
> >> >> >
> >> >> > -       return 0;
> >> >> > +       return 1;
> >> >>
> >> >> There's no way that zram can even provide a accurate number of free
> >> >> pages, since it can't know how compressible future stored pages will
> >> >> be.  It would be better to simply change this swap_hint from GET_FREE
> >> >> to IS_FULL, and return either true or false.
> >> >
> >> > My plan is that we can give an approximation based on
> >> > orig_data_size/compr_data_size with tweaking zero page and vmscan can use
> >> > the hint from get_nr_swap_pages to throttle file/anon balance but I want to do
> >> > step by step so I didn't include the hint.
> >> > If you are strong against with that in this stage, I can change it and
> >> > try it later with the number.
> >> > Please, say again if you want.
> >>
> >> since as you said zram is the only user of swap_hint, changing it
> >> later shouldn't be a big deal.  And you could have both, IS_FULL and
> >> GET_FREE; since the check in scan_swap_map() really only is checking
> >> for IS_FULL, if you update vmscan later to adjust its file/anon
> >> balance based on GET_FREE, that can be added then with no trouble,
> >> right?
> >
> > Yeb, No problem.
> >
> >>
> >>
> >> >
> >> > Thanks for the review!
> >> >
> >> >
> >> >>
> >> >>
> >> >> >  }
> >> >> >
> >> >> >  static int zram_swap_hint(struct block_device *bdev,
> >> >> > diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
> >> >> > index 779d03fa4360..182a2544751b 100644
> >> >> > --- a/drivers/block/zram/zram_drv.h
> >> >> > +++ b/drivers/block/zram/zram_drv.h
> >> >> > @@ -115,6 +115,7 @@ struct zram {
> >> >> >         u64 disksize;   /* bytes */
> >> >> >         int max_comp_streams;
> >> >> >         struct zram_stats stats;
> >> >> > +       atomic_t alloc_fail;
> >> >> >         /*
> >> >> >          * the number of pages zram can consume for storing compressed data
> >> >> >          */
> >> >> > --
> >> >> > 2.0.0
> >> >> >
> >> >> >>
> >> >> >> heesub
> >> >> >>
> >> >> >> >+
> >> >> >> >+    return 0;
> >> >> >> >+}
> >> >> >> >+
> >> >> >> >  static int zram_swap_hint(struct block_device *bdev,
> >> >> >> >                             unsigned int hint, void *arg)
> >> >> >> >  {
> >> >> >> >@@ -958,6 +974,8 @@ static int zram_swap_hint(struct block_device *bdev,
> >> >> >> >
> >> >> >> >     if (hint == SWAP_SLOT_FREE)
> >> >> >> >             ret = zram_slot_free_notify(bdev, (unsigned long)arg);
> >> >> >> >+    else if (hint == SWAP_GET_FREE)
> >> >> >> >+            ret = zram_get_free_pages(bdev, arg);
> >> >> >> >
> >> >> >> >     return ret;
> >> >> >> >  }
> >> >> >> >
> >> >> >>
> >> >> >
> >> >> > --
> >> >> > Kind regards,
> >> >> > Minchan Kim
> >> >>
> >> >
> >> > --
> >> > Kind regards,
> >> > Minchan Kim
> >>
> >
> > --
> > Kind regards,
> > Minchan Kim
> 

-- 
Kind regards,
Minchan Kim


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC 3/3] zram: add swap_get_free hint
  2014-09-17  7:44                   ` Minchan Kim
@ 2014-09-17 16:28                     ` Dan Streetman
  -1 siblings, 0 replies; 40+ messages in thread
From: Dan Streetman @ 2014-09-17 16:28 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Heesub Shin, Andrew Morton, linux-kernel, Linux-MM, Hugh Dickins,
	Shaohua Li, Jerome Marchand, Sergey Senozhatsky, Nitin Gupta,
	Luigi Semenzato

On Wed, Sep 17, 2014 at 3:44 AM, Minchan Kim <minchan@kernel.org> wrote:
> On Tue, Sep 16, 2014 at 11:58:32AM -0400, Dan Streetman wrote:
>> On Mon, Sep 15, 2014 at 9:21 PM, Minchan Kim <minchan@kernel.org> wrote:
>> > On Mon, Sep 15, 2014 at 12:00:33PM -0400, Dan Streetman wrote:
>> >> On Sun, Sep 14, 2014 at 8:57 PM, Minchan Kim <minchan@kernel.org> wrote:
>> >> > On Sat, Sep 13, 2014 at 03:39:13PM -0400, Dan Streetman wrote:
>> >> >> On Thu, Sep 4, 2014 at 7:59 PM, Minchan Kim <minchan@kernel.org> wrote:
>> >> >> > Hi Heesub,
>> >> >> >
>> >> >> > On Thu, Sep 04, 2014 at 03:26:14PM +0900, Heesub Shin wrote:
>> >> >> >> Hello Minchan,
>> >> >> >>
>> >> >> >> First of all, I agree with the overall purpose of your patch set.
>> >> >> >
>> >> >> > Thank you.
>> >> >> >
>> >> >> >>
>> >> >> >> On 09/04/2014 10:39 AM, Minchan Kim wrote:
>> >> >> >> >This patch implement SWAP_GET_FREE handler in zram so that VM can
>> >> >> >> >know how many zram has freeable space.
>> >> >> >> >VM can use it to stop anonymous reclaiming once zram is full.
>> >> >> >> >
>> >> >> >> >Signed-off-by: Minchan Kim <minchan@kernel.org>
>> >> >> >> >---
>> >> >> >> >  drivers/block/zram/zram_drv.c | 18 ++++++++++++++++++
>> >> >> >> >  1 file changed, 18 insertions(+)
>> >> >> >> >
>> >> >> >> >diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
>> >> >> >> >index 88661d62e46a..8e22b20aa2db 100644
>> >> >> >> >--- a/drivers/block/zram/zram_drv.c
>> >> >> >> >+++ b/drivers/block/zram/zram_drv.c
>> >> >> >> >@@ -951,6 +951,22 @@ static int zram_slot_free_notify(struct block_device *bdev,
>> >> >> >> >     return 0;
>> >> >> >> >  }
>> >> >> >> >
>> >> >> >> >+static int zram_get_free_pages(struct block_device *bdev, long *free)
>> >> >> >> >+{
>> >> >> >> >+    struct zram *zram;
>> >> >> >> >+    struct zram_meta *meta;
>> >> >> >> >+
>> >> >> >> >+    zram = bdev->bd_disk->private_data;
>> >> >> >> >+    meta = zram->meta;
>> >> >> >> >+
>> >> >> >> >+    if (!zram->limit_pages)
>> >> >> >> >+            return 1;
>> >> >> >> >+
>> >> >> >> >+    *free = zram->limit_pages - zs_get_total_pages(meta->mem_pool);
>> >> >> >>
>> >> >> >> Even if 'free' is zero here, there may be free spaces available to
>> >> >> >> store more compressed pages into the zs_pool. I mean calculation
>> >> >> >> above is not quite accurate and wastes memory, but have no better
>> >> >> >> idea for now.
>> >> >> >
>> >> >> > Yeb, good point.
>> >> >> >
>> >> >> > Actually, I thought about that but in this patchset, I wanted to
>> >> >> > go with conservative approach which is a safe guard to prevent
>> >> >> > system hang which is terrible than early OOM kill.
>> >> >> >
>> >> >> > Whole point of this patchset is to add a facility to VM and VM
>> >> >> > collaborates with zram via the interface to avoid worst case
>> >> >> > (ie, system hang) and logic to throttle could be enhanced by
>> >> >> > several approaches in future but I agree my logic was too simple
>> >> >> > and conservative.
>> >> >> >
>> >> >> > We could improve it with [anti|de]fragmentation in future but
>> >> >> > at the moment, below simple heuristic is not too bad for first
>> >> >> > step. :)
>> >> >> >
>> >> >> >
>> >> >> > ---
>> >> >> >  drivers/block/zram/zram_drv.c | 15 ++++++++++-----
>> >> >> >  drivers/block/zram/zram_drv.h |  1 +
>> >> >> >  2 files changed, 11 insertions(+), 5 deletions(-)
>> >> >> >
>> >> >> > diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
>> >> >> > index 8e22b20aa2db..af9dfe6a7d2b 100644
>> >> >> > --- a/drivers/block/zram/zram_drv.c
>> >> >> > +++ b/drivers/block/zram/zram_drv.c
>> >> >> > @@ -410,6 +410,7 @@ static bool zram_free_page(struct zram *zram, size_t index)
>> >> >> >         atomic64_sub(zram_get_obj_size(meta, index),
>> >> >> >                         &zram->stats.compr_data_size);
>> >> >> >         atomic64_dec(&zram->stats.pages_stored);
>> >> >> > +       atomic_set(&zram->alloc_fail, 0);
>> >> >> >
>> >> >> >         meta->table[index].handle = 0;
>> >> >> >         zram_set_obj_size(meta, index, 0);
>> >> >> > @@ -600,10 +601,12 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
>> >> >> >         alloced_pages = zs_get_total_pages(meta->mem_pool);
>> >> >> >         if (zram->limit_pages && alloced_pages > zram->limit_pages) {
>> >> >> >                 zs_free(meta->mem_pool, handle);
>> >> >> > +               atomic_inc(&zram->alloc_fail);
>> >> >> >                 ret = -ENOMEM;
>> >> >> >                 goto out;
>> >> >> >         }
>> >> >>
>> >> >> This isn't going to work well at all with swap.  There will be,
>> >> >> minimum, 32 failures to write a swap page before GET_FREE finally
>> >> >> indicates it's full, and even then a single free during those 32
>> >> >> failures will restart the counter, so it could be dozens or hundreds
>> >> >> (or more) swap write failures before the zram device is marked as
>> >> >> full.  And then, a single zram free will move it back to non-full and
>> >> >> start the write failures over again.
>> >> >>
>> >> >> I think it would be better to just check for actual fullness (i.e.
>> >> >> alloced_pages > limit_pages) at the start of write, and fail if so.
>> >> >> That will allow a single write to succeed when it crosses into
>> >> >> fullness, and the if GET_FREE is changed to a simple IS_FULL and uses
>> >> >> the same check (alloced_pages > limit_pages), then swap shouldn't see
>> >> >> any write failures (or very few), and zram will stay full until enough
>> >> >> pages are freed that it really does move under limit_pages.
>> >> >
>> >> > The alloced_pages > limit_pages doesn't mean zram is full so with your
>> >> > approach, it could kick OOM earlier which is not what we want.
>> >> > Because our product uses zram to delay app killing by low memory killer.
>> >>
>> >> With zram, the meaning of "full" isn't as obvious as other fixed-size
>> >> storage devices.  Obviously, "full" usually means "no more room to
>> >> store anything", while "not full" means "there is room to store
>> >> anything, up to the remaining free size".  With zram, its zsmalloc
>> >> pool size might be over the specified limit, but there will still be
>> >> room to store *some* things - but not *anything*.  Only compressed
>> >> pages that happen to fit inside a class with at least one zspage that
>> >> isn't full.
>> >>
>> >> Clearly, we shouldn't wait to declare zram "full" only once zsmalloc
>> >> is 100% full in all its classes.
>> >>
>> >> What about waiting until there is N number of write failures, like
>> >> this patch?  That doesn't seem very fair to the writer, since each
>> >> write failure will cause them to do extra work (first, in selecting
>> >> what to write, and then in recovering from the failed write).
>> >> However, it will probably squeeze some writes into some of those empty
>> >> spaces in already-allocated zspages.
>> >>
>> >> And declaring zram "full" immediately once the zsmalloc pool size
>> >> increases past the specified limit?  Since zsmalloc's classes almost
>> >> certainly contain some fragmentation, that will waste all the empty
>> >> spaces that could still store more compressed pages.  But, this is the
>> >> limit at which you cannot guarantee all writes to be able to store a
>> >> compressed page - any zsmalloc classes without a partially empty
>> >> zspage will have to increase zsmalloc's size, therefore failing the
>> >> write.
>> >>
>> >> Neither definition of "full" is optimal.  Since in this case we're
>> >> talking about swap, I think forcing swap write failures to happen,
>> >> which with direct reclaim could (I believe) stop everything while the
>> >> write failures continue, should be avoided as much as possible.  Even
>> >> when zram fullness is delayed by N write failures, to try to squeeze
>> >> out as much storage from zsmalloc as possible, when it does eventually
>> >> fill if zram is the only swap device the system will OOM anyway.  And
>> >> if zram isn't the only swap device, but just the first (highest
>> >> priority), then delaying things with unneeded write failures is
>> >> certainly not better than just filling up so swap can move on to the
>> >> next swap device.  The only case where write failures delaying marking
>> >> zram as full will help is if the system stopped right at this point,
>> >> and then started decreasing how much memory was needed.  That seems
>> >> like a very unlikely coincidence, but maybe some testing would help
>> >> determine how bad the write failures affect system
>> >> performance/responsiveness and how long they delay OOM.
>> >
>> > Please, keep in mind that swap is alreay really slow operation but
>> > we want to use it to avoid OOM if possible so I can't buy your early
>> > kill suggestion.
>>
>> I disagree, OOM should be invoked once the system can't proceed with
>> reclaiming memory.  IMHO, repeated swap write failures will cause the
>> system to be unable to reclaim memory.
>
> That's what I want. I'd like to go with OOM once repeated swap write
> failures happen.
> The difference between you and me is how aggressively we should kick
> OOM. Your proposal was so aggressive that it could trigger OOM
> too early, which makes swap inefficient. That's what I'd like to avoid.
>
>>
>> > If a user feel it's really slow for his product,
>> > it means his admin was fail. He should increase the limit of zram
>> > dynamically or statically(zram already support that ways).
>> >
>> > The thing I'd like to solve in this patchset is to avoid system hang
>> > where admin cannot do anyting, even ctrl+c, which is thing should
>> > support in OS level.
>>
>> what's better - failing a lot of swap writes, or marking the swap
>> device as full?  As I said if zram is the *only* swap device in the
>> system, maybe that makes sense (although it's still questionable).  If
>> zram is only the first swap device, and there's a backup swap device
>> (presumably that just writes to disk), then it will be *much* better
>> to simply fail over to that, instead of (repeatedly) failing a lot of
>> swap writes.
>
> Actually, I haven't heard of such a use case so far, but I can't ignore it
> because it's a perfectly doable configuration, so I agree we need some knob.
>
>>
>> Especially once direct reclaim is reached, failing swap writes is
>> probably going to make the system unresponsive.  Personally I think
>> moving to OOM (or the next swap device) is better.
>
> As I said, that's what I want! But your suggestion was too aggressive.
> The system can still have resources which can be freed easily (e.g., page cache,
> purgeable memory, or unimportant processes that could be killed).

Ok, I think we agree - I'm not against some write failures, I just
worry about "too many" (where I can't define "too many" ;-) of them,
since each write failure doesn't make any progress in reclaiming
memory for the process(es) that are waiting for it.

Also, when you hit the write errors, I assume you saw a lot of:
Write-error on swap-device (%u:%u:%Lu)
messages?  Obviously that's expected, but maybe it would be good to
add a check for swap_hint IS_FULL there, and skip printing the alert
if so...?
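
Roughly what I'm picturing, as a user-space sketch only (end_swap_write()
and swap_device_is_full() below are made-up stand-ins, not the real
page_io or swap_hint code):

#include <stdbool.h>
#include <stdio.h>

/* stand-in for asking the swap bdev whether it reported "full" */
static bool swap_device_is_full(void)
{
	return true;	/* pretend the device said IS_FULL */
}

/* stand-in for the write-completion path that prints the alert today */
static void end_swap_write(int error, unsigned int major, unsigned int minor,
			   unsigned long long sector)
{
	if (!error)
		return;

	/* expected failure while the device is full: skip the noise */
	if (swap_device_is_full())
		return;

	printf("Write-error on swap-device (%u:%u:%llu)\n",
	       major, minor, sector);
}

int main(void)
{
	end_swap_write(-5, 254, 0, 1234);	/* prints nothing here */
	return 0;
}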

>
>>
>> If write failures are the direction you go, then IMHO there should *at
>> least* be a zram parameter to allow the admin to choose to immediately
>> fail or continue with write failures.
>
> Agree.
>
>>
>>
>> >
>> >>
>> >> Since there may be different use cases that desire different things,
>> >> maybe there should be a zram runtime (or buildtime) config to choose
>> >> exactly how it decides it's full?  Either full after N write failures,
>> >> or full when alloced>limit?  That would allow the user to either defer
>> >> getting full as long as possible (at the possible cost of system
>> >> unresponsiveness during those write failures), or to just move
>> >> immediately to zram being full as soon as it can't guarantee that each
>> >> write will succeed.
>> >
>> > Hmm, I thought it and was going to post it when I send v1.
>> > My idea was this.
>>
>> what i actually meant was more like this, where ->stop_using_when_full
>> is a user-configurable param:
>>
>> bool zram_is_full(...)
>> {
>>   if (zram->stop_using_when_full) {
>>     /* for this, allow 1 write to succeed past limit_pages */
>>     return zs_get_total_pages(zram) > zram->limit_pages;
>>   } else {
>>     return zram->alloc_fail > ALLOC_FAIL_THRESHOLD;
>>   }
>> }
>
> To me, that's too simple, so there is no way to tune zram fullness.
> How about this one?
>
> bool zram_is_full(...)
> {
>         unsigned long total_pages;
>         if (!zram->limit_pages)
>                 return false;
>
>         total_pages = zs_get_total_pages(zram);
>         if (total_pages >= zram->limit_pages &&

Just to clarify - this also implies that zram_bvec_write() will allow
(at least) one write to succeed past limit_pages (which I agree
with)...

>                 (100 * (compr_data_size >> PAGE_SHIFT) / total_pages) > FRAG_THRESH_HOLD)

I assume FRAG_THRESH_HOLD will be a runtime user-configurable param?

Also, strictly in terms of semantics, it seems like this is the reverse
of fragmentation, i.e. high fragmentation should indicate lots of
empty space, while low fragmentation should indicate compr_data_size
is almost equal to total_pages.  Not a big deal, but it may cause
confusion.  Maybe call it "efficiency" or "compaction"?  Or invert the
calculation (quick arithmetic example further below), e.g.
  (100 * (total_pages - compr_pages) / total_pages) < FRAG_THRESH_HOLD

>                 return true;
>
>         if (zram->alloc_fail > FULL_THRESH_HOLD)
>                 return true;
>
>         return false;
> }
>
> So if someone wants to avoid write failures and can live with an earlier OOM,
> he can set FRAG_THRESH_HOLD to 0.
> Any thoughts?

Yep that looks good.

One minor English correction: "threshold" is a single word.
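
(Quick user-space arithmetic, nothing zram-specific, just to show that the
two forms above are complements of each other; the numbers are arbitrary
examples:)

#include <stdio.h>

int main(void)
{
	unsigned long total = 1000, compr = 820;	/* example numbers */

	unsigned long fill  = 100 * compr / total;		/* 82 */
	unsigned long empty = 100 * (total - compr) / total;	/* 18 */

	/* "fill > 80" and "empty < 20" describe the same pool state;
	 * only the name changes: the first reads as packing/efficiency,
	 * the second as the room left over by fragmentation. */
	printf("fill=%lu%% empty=%lu%%\n", fill, empty);
	return 0;
}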

>
>>
>> >
>> > int zram_get_free_pages(...)
>> > {
>> >         if (zram->limit_pages &&
>> >                 zram->alloc_fail > FULL_THRESH_HOLD &&
>> >                 (100 * compr_data_size >> PAGE_SHIFT /
>> >                         zs_get_total_pages(zram)) > FRAG_THRESH_HOLD) {
>>
>> well...i think this implementation has both downsides; it forces write
>> failures to happen, but also it doesn't guarantee being full after
>> FULL_THRESHOLD write failures.  If the fragmentation level never
>> reaches FRAG_THRESHOLD, it'll fail writes forever.  I can't think of
>> any way that using the amount of fragmentation will work, because you
>> can't guarantee it will be reached.  The incoming pages to compress
>> may all fall into classes that are already full.
>>
>> with zsmalloc compaction, it would be possible to know that a certain
>> fragmentation threshold could be reached, but without it that's not a
>> promise zsmalloc can keep.  And we definitely don't want to fail swap
>> writes forever.
>>
>>
>> >                         *free = 0;
>> >                         return 0;
>> >         }
>> >         ..
>> > }
>> >
>> > Maybe we could export FRAG_THRESHOLD.
>> >
>> >>
>> >>
>> >>
>> >> >
>> >> >>
>> >> >>
>> >> >>
>> >> >> >
>> >> >> > +       atomic_set(&zram->alloc_fail, 0);
>> >> >> >         update_used_max(zram, alloced_pages);
>> >> >> >
>> >> >> >         cmem = zs_map_object(meta->mem_pool, handle, ZS_MM_WO);
>> >> >> > @@ -951,6 +954,7 @@ static int zram_slot_free_notify(struct block_device *bdev,
>> >> >> >         return 0;
>> >> >> >  }
>> >> >> >
>> >> >> > +#define FULL_THRESH_HOLD 32
>> >> >> >  static int zram_get_free_pages(struct block_device *bdev, long *free)
>> >> >> >  {
>> >> >> >         struct zram *zram;
>> >> >> > @@ -959,12 +963,13 @@ static int zram_get_free_pages(struct block_device *bdev, long *free)
>> >> >> >         zram = bdev->bd_disk->private_data;
>> >> >> >         meta = zram->meta;
>> >> >> >
>> >> >> > -       if (!zram->limit_pages)
>> >> >> > -               return 1;
>> >> >> > -
>> >> >> > -       *free = zram->limit_pages - zs_get_total_pages(meta->mem_pool);
>> >> >> > +       if (zram->limit_pages &&
>> >> >> > +               (atomic_read(&zram->alloc_fail) > FULL_THRESH_HOLD)) {
>> >> >> > +               *free = 0;
>> >> >> > +               return 0;
>> >> >> > +       }
>> >> >> >
>> >> >> > -       return 0;
>> >> >> > +       return 1;
>> >> >>
>> >> >> There's no way that zram can even provide a accurate number of free
>> >> >> pages, since it can't know how compressible future stored pages will
>> >> >> be.  It would be better to simply change this swap_hint from GET_FREE
>> >> >> to IS_FULL, and return either true or false.
>> >> >
>> >> > My plan is that we can give an approximation based on
>> >> > orig_data_size/compr_data_size with tweaking zero page and vmscan can use
>> >> > the hint from get_nr_swap_pages to throttle file/anon balance but I want to do
>> >> > step by step so I didn't include the hint.
>> >> > If you are strong against with that in this stage, I can change it and
>> >> > try it later with the number.
>> >> > Please, say again if you want.
>> >>
>> >> since as you said zram is the only user of swap_hint, changing it
>> >> later shouldn't be a big deal.  And you could have both, IS_FULL and
>> >> GET_FREE; since the check in scan_swap_map() really only is checking
>> >> for IS_FULL, if you update vmscan later to adjust its file/anon
>> >> balance based on GET_FREE, that can be added then with no trouble,
>> >> right?
>> >
>> > Yeb, No problem.
>> >
>> >>
>> >>
>> >> >
>> >> > Thanks for the review!
>> >> >
>> >> >
>> >> >>
>> >> >>
>> >> >> >  }
>> >> >> >
>> >> >> >  static int zram_swap_hint(struct block_device *bdev,
>> >> >> > diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
>> >> >> > index 779d03fa4360..182a2544751b 100644
>> >> >> > --- a/drivers/block/zram/zram_drv.h
>> >> >> > +++ b/drivers/block/zram/zram_drv.h
>> >> >> > @@ -115,6 +115,7 @@ struct zram {
>> >> >> >         u64 disksize;   /* bytes */
>> >> >> >         int max_comp_streams;
>> >> >> >         struct zram_stats stats;
>> >> >> > +       atomic_t alloc_fail;
>> >> >> >         /*
>> >> >> >          * the number of pages zram can consume for storing compressed data
>> >> >> >          */
>> >> >> > --
>> >> >> > 2.0.0
>> >> >> >
>> >> >> >>
>> >> >> >> heesub
>> >> >> >>
>> >> >> >> >+
>> >> >> >> >+    return 0;
>> >> >> >> >+}
>> >> >> >> >+
>> >> >> >> >  static int zram_swap_hint(struct block_device *bdev,
>> >> >> >> >                             unsigned int hint, void *arg)
>> >> >> >> >  {
>> >> >> >> >@@ -958,6 +974,8 @@ static int zram_swap_hint(struct block_device *bdev,
>> >> >> >> >
>> >> >> >> >     if (hint == SWAP_SLOT_FREE)
>> >> >> >> >             ret = zram_slot_free_notify(bdev, (unsigned long)arg);
>> >> >> >> >+    else if (hint == SWAP_GET_FREE)
>> >> >> >> >+            ret = zram_get_free_pages(bdev, arg);
>> >> >> >> >
>> >> >> >> >     return ret;
>> >> >> >> >  }
>> >> >> >> >
>> >> >> >>
>> >> >> >
>> >> >> > --
>> >> >> > Kind regards,
>> >> >> > Minchan Kim
>> >> >>
>> >> >
>> >> > --
>> >> > Kind regards,
>> >> > Minchan Kim
>> >>
>> >
>> > --
>> > Kind regards,
>> > Minchan Kim
>>
>
> --
> Kind regards,
> Minchan Kim

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC 3/3] zram: add swap_get_free hint
  2014-09-17 16:28                     ` Dan Streetman
@ 2014-09-19  6:14                       ` Minchan Kim
  -1 siblings, 0 replies; 40+ messages in thread
From: Minchan Kim @ 2014-09-19  6:14 UTC (permalink / raw)
  To: Dan Streetman
  Cc: Heesub Shin, Andrew Morton, linux-kernel, Linux-MM, Hugh Dickins,
	Shaohua Li, Jerome Marchand, Sergey Senozhatsky, Nitin Gupta,
	Luigi Semenzato

On Wed, Sep 17, 2014 at 12:28:59PM -0400, Dan Streetman wrote:
> On Wed, Sep 17, 2014 at 3:44 AM, Minchan Kim <minchan@kernel.org> wrote:
> > On Tue, Sep 16, 2014 at 11:58:32AM -0400, Dan Streetman wrote:
> >> On Mon, Sep 15, 2014 at 9:21 PM, Minchan Kim <minchan@kernel.org> wrote:
> >> > On Mon, Sep 15, 2014 at 12:00:33PM -0400, Dan Streetman wrote:
> >> >> On Sun, Sep 14, 2014 at 8:57 PM, Minchan Kim <minchan@kernel.org> wrote:
> >> >> > On Sat, Sep 13, 2014 at 03:39:13PM -0400, Dan Streetman wrote:
> >> >> >> On Thu, Sep 4, 2014 at 7:59 PM, Minchan Kim <minchan@kernel.org> wrote:
> >> >> >> > Hi Heesub,
> >> >> >> >
> >> >> >> > On Thu, Sep 04, 2014 at 03:26:14PM +0900, Heesub Shin wrote:
> >> >> >> >> Hello Minchan,
> >> >> >> >>
> >> >> >> >> First of all, I agree with the overall purpose of your patch set.
> >> >> >> >
> >> >> >> > Thank you.
> >> >> >> >
> >> >> >> >>
> >> >> >> >> On 09/04/2014 10:39 AM, Minchan Kim wrote:
> >> >> >> >> >This patch implement SWAP_GET_FREE handler in zram so that VM can
> >> >> >> >> >know how many zram has freeable space.
> >> >> >> >> >VM can use it to stop anonymous reclaiming once zram is full.
> >> >> >> >> >
> >> >> >> >> >Signed-off-by: Minchan Kim <minchan@kernel.org>
> >> >> >> >> >---
> >> >> >> >> >  drivers/block/zram/zram_drv.c | 18 ++++++++++++++++++
> >> >> >> >> >  1 file changed, 18 insertions(+)
> >> >> >> >> >
> >> >> >> >> >diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> >> >> >> >> >index 88661d62e46a..8e22b20aa2db 100644
> >> >> >> >> >--- a/drivers/block/zram/zram_drv.c
> >> >> >> >> >+++ b/drivers/block/zram/zram_drv.c
> >> >> >> >> >@@ -951,6 +951,22 @@ static int zram_slot_free_notify(struct block_device *bdev,
> >> >> >> >> >     return 0;
> >> >> >> >> >  }
> >> >> >> >> >
> >> >> >> >> >+static int zram_get_free_pages(struct block_device *bdev, long *free)
> >> >> >> >> >+{
> >> >> >> >> >+    struct zram *zram;
> >> >> >> >> >+    struct zram_meta *meta;
> >> >> >> >> >+
> >> >> >> >> >+    zram = bdev->bd_disk->private_data;
> >> >> >> >> >+    meta = zram->meta;
> >> >> >> >> >+
> >> >> >> >> >+    if (!zram->limit_pages)
> >> >> >> >> >+            return 1;
> >> >> >> >> >+
> >> >> >> >> >+    *free = zram->limit_pages - zs_get_total_pages(meta->mem_pool);
> >> >> >> >>
> >> >> >> >> Even if 'free' is zero here, there may be free spaces available to
> >> >> >> >> store more compressed pages into the zs_pool. I mean calculation
> >> >> >> >> above is not quite accurate and wastes memory, but have no better
> >> >> >> >> idea for now.
> >> >> >> >
> >> >> >> > Yeb, good point.
> >> >> >> >
> >> >> >> > Actually, I thought about that but in this patchset, I wanted to
> >> >> >> > go with conservative approach which is a safe guard to prevent
> >> >> >> > system hang which is terrible than early OOM kill.
> >> >> >> >
> >> >> >> > Whole point of this patchset is to add a facility to VM and VM
> >> >> >> > collaborates with zram via the interface to avoid worst case
> >> >> >> > (ie, system hang) and logic to throttle could be enhanced by
> >> >> >> > several approaches in future but I agree my logic was too simple
> >> >> >> > and conservative.
> >> >> >> >
> >> >> >> > We could improve it with [anti|de]fragmentation in future but
> >> >> >> > at the moment, below simple heuristic is not too bad for first
> >> >> >> > step. :)
> >> >> >> >
> >> >> >> >
> >> >> >> > ---
> >> >> >> >  drivers/block/zram/zram_drv.c | 15 ++++++++++-----
> >> >> >> >  drivers/block/zram/zram_drv.h |  1 +
> >> >> >> >  2 files changed, 11 insertions(+), 5 deletions(-)
> >> >> >> >
> >> >> >> > diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> >> >> >> > index 8e22b20aa2db..af9dfe6a7d2b 100644
> >> >> >> > --- a/drivers/block/zram/zram_drv.c
> >> >> >> > +++ b/drivers/block/zram/zram_drv.c
> >> >> >> > @@ -410,6 +410,7 @@ static bool zram_free_page(struct zram *zram, size_t index)
> >> >> >> >         atomic64_sub(zram_get_obj_size(meta, index),
> >> >> >> >                         &zram->stats.compr_data_size);
> >> >> >> >         atomic64_dec(&zram->stats.pages_stored);
> >> >> >> > +       atomic_set(&zram->alloc_fail, 0);
> >> >> >> >
> >> >> >> >         meta->table[index].handle = 0;
> >> >> >> >         zram_set_obj_size(meta, index, 0);
> >> >> >> > @@ -600,10 +601,12 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
> >> >> >> >         alloced_pages = zs_get_total_pages(meta->mem_pool);
> >> >> >> >         if (zram->limit_pages && alloced_pages > zram->limit_pages) {
> >> >> >> >                 zs_free(meta->mem_pool, handle);
> >> >> >> > +               atomic_inc(&zram->alloc_fail);
> >> >> >> >                 ret = -ENOMEM;
> >> >> >> >                 goto out;
> >> >> >> >         }
> >> >> >>
> >> >> >> This isn't going to work well at all with swap.  There will be,
> >> >> >> minimum, 32 failures to write a swap page before GET_FREE finally
> >> >> >> indicates it's full, and even then a single free during those 32
> >> >> >> failures will restart the counter, so it could be dozens or hundreds
> >> >> >> (or more) swap write failures before the zram device is marked as
> >> >> >> full.  And then, a single zram free will move it back to non-full and
> >> >> >> start the write failures over again.
> >> >> >>
> >> >> >> I think it would be better to just check for actual fullness (i.e.
> >> >> >> alloced_pages > limit_pages) at the start of write, and fail if so.
> >> >> >> That will allow a single write to succeed when it crosses into
> >> >> >> fullness, and then if GET_FREE is changed to a simple IS_FULL and uses
> >> >> >> the same check (alloced_pages > limit_pages), then swap shouldn't see
> >> >> >> any write failures (or very few), and zram will stay full until enough
> >> >> >> pages are freed that it really does move under limit_pages.
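
(For reference, a rough sketch of the approach Dan describes above -- one
"over the limit" test shared by the write path and an IS_FULL-style hint;
illustration only, not part of this series:)

static bool zram_over_limit(struct zram *zram)
{
        /* "full" == zsmalloc pool already grew past the configured limit */
        return zram->limit_pages &&
               zs_get_total_pages(zram->meta->mem_pool) > zram->limit_pages;
}

/*
 * ... called at the top of zram_bvec_write(), e.g.
 *
 *      if (zram_over_limit(zram))
 *              return -ENOMEM;
 *
 * and returned as the answer to an IS_FULL swap hint.
 */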
> >> >> >
> >> >> > alloced_pages > limit_pages doesn't mean zram is full, so with your
> >> >> > approach it could kick OOM earlier than we want, because our product
> >> >> > uses zram to delay app killing by the low memory killer.
> >> >>
> >> >> With zram, the meaning of "full" isn't as obvious as other fixed-size
> >> >> storage devices.  Obviously, "full" usually means "no more room to
> >> >> store anything", while "not full" means "there is room to store
> >> >> anything, up to the remaining free size".  With zram, its zsmalloc
> >> >> pool size might be over the specified limit, but there will still be
> >> >> room to store *some* things - but not *anything*.  Only compressed
> >> >> pages that happen to fit inside a class with at least one zspage that
> >> >> isn't full.
> >> >>
> >> >> Clearly, we shouldn't wait to declare zram "full" only once zsmalloc
> >> >> is 100% full in all its classes.
> >> >>
> >> >> What about waiting until there is N number of write failures, like
> >> >> this patch?  That doesn't seem very fair to the writer, since each
> >> >> write failure will cause them to do extra work (first, in selecting
> >> >> what to write, and then in recovering from the failed write).
> >> >> However, it will probably squeeze some writes into some of those empty
> >> >> spaces in already-allocated zspages.
> >> >>
> >> >> And declaring zram "full" immediately once the zsmalloc pool size
> >> >> increases past the specified limit?  Since zsmalloc's classes almost
> >> >> certainly contain some fragmentation, that will waste all the empty
> >> >> spaces that could still store more compressed pages.  But, this is the
> >> >> limit at which you cannot guarantee all writes to be able to store a
> >> >> compressed page - any zsmalloc classes without a partially empty
> >> >> zspage will have to increase zsmalloc's size, therefore failing the
> >> >> write.
> >> >>
> >> >> Neither definition of "full" is optimal.  Since in this case we're
> >> >> talking about swap, I think forcing swap write failures to happen,
> >> >> which with direct reclaim could (I believe) stop everything while the
> >> >> write failures continue, should be avoided as much as possible.  Even
> >> >> when zram fullness is delayed by N write failures, to try to squeeze
> >> >> out as much storage from zsmalloc as possible, when it does eventually
> >> >> fill if zram is the only swap device the system will OOM anyway.  And
> >> >> if zram isn't the only swap device, but just the first (highest
> >> >> priority), then delaying things with unneeded write failures is
> >> >> certainly not better than just filling up so swap can move on to the
> >> >> next swap device.  The only case where write failures delaying marking
> >> >> zram as full will help is if the system stopped right at this point,
> >> >> and then started decreasing how much memory was needed.  That seems
> >> >> like a very unlikely coincidence, but maybe some testing would help
> >> >> determine how badly the write failures affect system
> >> >> performance/responsiveness and how long they delay OOM.
> >> >
> >> > Please keep in mind that swap is already a really slow operation, but
> >> > we want to use it to avoid OOM if possible, so I can't buy your early
> >> > kill suggestion.
> >>
> >> I disagree, OOM should be invoked once the system can't proceed with
> >> reclaiming memory.  IMHO, repeated swap write failures will cause the
> >> system to be unable to reclaim memory.
> >
> > That's what I want. I'd like to go with OOM once repeated swap write
> > failures happen.
> > The difference between you and me is how aggressively we should
> > kick OOM. Your proposal was too aggressive, so it can trigger OOM
> > too early, which makes swap inefficient. That's what I'd like to avoid.
> >
> >>
> >> > If a user feels it's really slow for his product,
> >> > it means his admin failed. He should increase the limit of zram
> >> > dynamically or statically (zram already supports both ways).
> >> >
> >> > The thing I'd like to solve in this patchset is to avoid a system hang
> >> > where the admin cannot do anything, even ctrl+c, which is something
> >> > that should be handled at the OS level.
> >>
> >> what's better - failing a lot of swap writes, or marking the swap
> >> device as full?  As I said if zram is the *only* swap device in the
> >> system, maybe that makes sense (although it's still questionable).  If
> >> zram is only the first swap device, and there's a backup swap device
> >> (presumably that just writes to disk), then it will be *much* better
> >> to simply fail over to that, instead of (repeatedly) failing a lot of
> >> swap writes.
> >
> > Actually, I haven't heard of such a usecase until now, but I can't ignore
> > it because it's a perfectly doable configuration, so I agree we need some knob.
> >
> >>
> >> Especially once direct reclaim is reached, failing swap writes is
> >> probably going to make the system unresponsive.  Personally I think
> >> moving to OOM (or the next swap device) is better.
> >
> > As I said, that's what I want! But your suggestion was too aggressive.
> > The system can still have resources that are easy to free (ex, page cache,
> > purgeable memory, or unimportant processes that could be killed).
> 
> Ok, I think we agree - I'm not against some write failures, I just
> worry about "too many" (where I can't define "too many" ;-) of them,
> since each write failure doesn't make any progress in reclaiming
> memory for the process(es) that are waiting for it.
> 
> Also when you got the write errors, I assume you saw a lot of:
> Write-error on swap-device (%u:%u:%Lu)
> messages?  Obviously that's expected, but maybe it would be good to
> add a check for swap_hint IS_FULL there, and skip printing the alert
> if so...?

If it's really a problem, we could add ratelimiting there, like buffer_io_error does.
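
(A minimal sketch of that, assuming the message stays where it is today,
in end_swap_bio_write() of mm/page_io.c, and only gains ratelimiting;
illustration only, not a patch in this series:)

        /*
         * Ratelimit the swap write-error message in the same spirit as
         * buffer_io_error(); the format and arguments mirror the
         * existing alert. Needs <linux/ratelimit.h>.
         */
        printk_ratelimited(KERN_ALERT
                "Write-error on swap-device (%u:%u:%Lu)\n",
                imajor(bio->bi_bdev->bd_inode),
                iminor(bio->bi_bdev->bd_inode),
                (unsigned long long)bio->bi_iter.bi_sector);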

> 
> >
> >>
> >> If write failures are the direction you go, then IMHO there should *at
> >> least* be a zram parameter to allow the admin to choose to immediately
> >> fail or continue with write failures.
> >
> > Agree.
> >
> >>
> >>
> >> >
> >> >>
> >> >> Since there may be different use cases that desire different things,
> >> >> maybe there should be a zram runtime (or buildtime) config to choose
> >> >> exactly how it decides it's full?  Either full after N write failures,
> >> >> or full when alloced>limit?  That would allow the user to either defer
> >> >> getting full as long as possible (at the possible cost of system
> >> >> unresponsiveness during those write failures), or to just move
> >> >> immediately to zram being full as soon as it can't guarantee that each
> >> >> write will succeed.
> >> >
> >> > Hmm, I thought about it and was going to post it when I send v1.
> >> > My idea was this.
> >>
> >> What I actually meant was more like this, where ->stop_using_when_full
> >> is a user-configurable param:
> >>
> >> bool zram_is_full(...)
> >> {
> >>   if (zram->stop_using_when_full) {
> >>     /* for this, allow 1 write to succeed past limit_pages */
> >>     return zs_get_total_pages(zram) > zram->limit_pages;
> >>   } else {
> >>     return zram->alloc_fail > ALLOC_FAIL_THRESHOLD;
> >>   }
> >> }
> >
> > To me, that's too simple, so there is no way to control zram fullness.
> > How about this one?
> >
> > bool zram_is_full(...)
> > {
> >         unsigned long total_pages;
> >         if (!zram->limit_pages)
> >                 return false;
> >
> >         total_pages = zs_get_total_pages(zram);
> >         if (total_pages >= zram->limit_pages &&
> 
> Just to clarify - this also implies that zram_bvec_write() will allow
> (at least) one write to succeed past limit_pages (which I agree
> with)...

Yeb.

> 
> >                 (100 * (compr_data_size >> PAGE_SHIFT) / total_pages) > FRAG_THRESH_HOLD)
> 
> I assume FRAG_THRESH_HOLD will be a runtime user-configurable param?

Sure.

> 
> Also strictly as far as semantics, it seems like this is the reverse
> of fragmentation, i.e. high fragmentation should indicate lots of
> empty space, while low fragmentation should indicate compr_data_size
> is almost equal to total_pages.  Not a big deal, but may cause
> confusion.  Maybe call it "efficiency" or "compaction"?  Or invert the
> calculation, e.g.
>   (100 * (total_pages - compr_pages) / total_pages) < FRAG_THRESH_HOLD
> 
> >                 return true;
> >
> >         if (zram->alloc_fail > FULL_THRESH_HOLD)
> >                 return true;
> >
> >         return false;
> > }

I'd like to define "fullness":

        fullness = (100 * used space / total space)

if ((100 * compr_pages / total_pages) >= ZRAM_FULLNESS_PERCENT)
        return 1;

It means the higher the threshold is, the later we declare zram full.

And I want to set ZRAM_FULLNESS_PERCENT to 80, which means biasing
toward more memory consumption rather than an early OOM kill.
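
(Putting this together with the alloc_fail check above, a rough sketch of
the whole predicate -- the threshold names and values are placeholders,
not final:)

#define ZRAM_FULLNESS_PERCENT   80
#define FULL_THRESH_HOLD        32

/*
 * Sketch: zram is "full" once the pool has reached the limit *and* is at
 * least ZRAM_FULLNESS_PERCENT utilized, or once we have seen too many
 * allocation failures since the last free.
 */
static bool zram_is_full(struct zram *zram)
{
        unsigned long total_pages, compr_pages;

        if (!zram->limit_pages)
                return false;

        total_pages = zs_get_total_pages(zram->meta->mem_pool);
        compr_pages = atomic64_read(&zram->stats.compr_data_size)
                                >> PAGE_SHIFT;

        if (total_pages >= zram->limit_pages &&
            100 * compr_pages / total_pages >= ZRAM_FULLNESS_PERCENT)
                return true;

        if (atomic_read(&zram->alloc_fail) > FULL_THRESH_HOLD)
                return true;

        return false;
}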

> >
> > So if someone wants to avoid write failures but can bear an early OOM,
> > he can set FRAG_THRESH_HOLD to 0.
> > Any thoughts?
> 
> Yep that looks good.
> 
> One minor english correction, "threshold" is a single word.

Thanks for the review, Dan!

> 
> >
> >>
> >> >
> >> > int zram_get_free_pages(...)
> >> > {
> >> >         if (zram->limit_pages &&
> >> >                 zram->alloc_fail > FULL_THRESH_HOLD &&
> >> >                 (100 * compr_data_size >> PAGE_SHIFT /
> >> >                         zs_get_total_pages(zram)) > FRAG_THRESH_HOLD) {
> >>
> >> Well... I think this implementation has both downsides; it forces write
> >> failures to happen, but also it doesn't guarantee being full after
> >> FULL_THRESHOLD write failures.  If the fragmentation level never
> >> reaches FRAG_THRESHOLD, it'll fail writes forever.  I can't think of
> >> any way that using the amount of fragmentation will work, because you
> >> can't guarantee it will be reached.  The incoming pages to compress
> >> may all fall into classes that are already full.
> >>
> >> with zsmalloc compaction, it would be possible to know that a certain
> >> fragmentation threshold could be reached, but without it that's not a
> >> promise zsmalloc can keep.  And we definitely don't want to fail swap
> >> writes forever.
> >>
> >>
> >> >                         *free = 0;
> >> >                         return 0;
> >> >         }
> >> >         ..
> >> > }
> >> >
> >> > Maybe we could export FRAG_THRESHOLD.
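
(If we do export such a threshold, a sketch of the knob as a per-device
sysfs attribute, following the existing *_show()/*_store() pattern in
zram_drv.c; the attribute name and the zram->full_threshold field are
assumptions for illustration only:)

static ssize_t full_threshold_show(struct device *dev,
                struct device_attribute *attr, char *buf)
{
        struct zram *zram = dev_to_zram(dev);

        return scnprintf(buf, PAGE_SIZE, "%u\n", zram->full_threshold);
}

static ssize_t full_threshold_store(struct device *dev,
                struct device_attribute *attr, const char *buf, size_t len)
{
        struct zram *zram = dev_to_zram(dev);
        unsigned int val;

        /* percentage of pool utilization required before reporting full */
        if (kstrtouint(buf, 10, &val) || val > 100)
                return -EINVAL;

        zram->full_threshold = val;
        return len;
}

static DEVICE_ATTR(full_threshold, S_IRUGO | S_IWUSR,
                full_threshold_show, full_threshold_store);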
> >> >
> >> >>
> >> >>
> >> >>
> >> >> >
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> >
> >> >> >> > +       atomic_set(&zram->alloc_fail, 0);
> >> >> >> >         update_used_max(zram, alloced_pages);
> >> >> >> >
> >> >> >> >         cmem = zs_map_object(meta->mem_pool, handle, ZS_MM_WO);
> >> >> >> > @@ -951,6 +954,7 @@ static int zram_slot_free_notify(struct block_device *bdev,
> >> >> >> >         return 0;
> >> >> >> >  }
> >> >> >> >
> >> >> >> > +#define FULL_THRESH_HOLD 32
> >> >> >> >  static int zram_get_free_pages(struct block_device *bdev, long *free)
> >> >> >> >  {
> >> >> >> >         struct zram *zram;
> >> >> >> > @@ -959,12 +963,13 @@ static int zram_get_free_pages(struct block_device *bdev, long *free)
> >> >> >> >         zram = bdev->bd_disk->private_data;
> >> >> >> >         meta = zram->meta;
> >> >> >> >
> >> >> >> > -       if (!zram->limit_pages)
> >> >> >> > -               return 1;
> >> >> >> > -
> >> >> >> > -       *free = zram->limit_pages - zs_get_total_pages(meta->mem_pool);
> >> >> >> > +       if (zram->limit_pages &&
> >> >> >> > +               (atomic_read(&zram->alloc_fail) > FULL_THRESH_HOLD)) {
> >> >> >> > +               *free = 0;
> >> >> >> > +               return 0;
> >> >> >> > +       }
> >> >> >> >
> >> >> >> > -       return 0;
> >> >> >> > +       return 1;
> >> >> >>
> >> >> >> There's no way that zram can even provide an accurate number of free
> >> >> >> pages, since it can't know how compressible future stored pages will
> >> >> >> be.  It would be better to simply change this swap_hint from GET_FREE
> >> >> >> to IS_FULL, and return either true or false.
> >> >> >
> >> >> > My plan is that we can give an approximation based on
> >> >> > orig_data_size/compr_data_size with some tweaking for zero pages, and
> >> >> > vmscan can use the hint from get_nr_swap_pages to throttle the file/anon
> >> >> > balance, but I want to go step by step, so I didn't include that hint yet.
> >> >> > If you are strongly against that at this stage, I can change it and
> >> >> > try it later with numbers.
> >> >> > Please say so again if you want.
> >> >>
> >> >> since as you said zram is the only user of swap_hint, changing it
> >> >> later shouldn't be a big deal.  And you could have both, IS_FULL and
> >> >> GET_FREE; since the check in scan_swap_map() really only is checking
> >> >> for IS_FULL, if you update vmscan later to adjust its file/anon
> >> >> balance based on GET_FREE, that can be added then with no trouble,
> >> >> right?
> >> >
> >> > Yeb, No problem.
> >> >
> >> >>
> >> >>
> >> >> >
> >> >> > Thanks for the review!
> >> >> >
> >> >> >
> >> >> >>
> >> >> >>
> >> >> >> >  }
> >> >> >> >
> >> >> >> >  static int zram_swap_hint(struct block_device *bdev,
> >> >> >> > diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
> >> >> >> > index 779d03fa4360..182a2544751b 100644
> >> >> >> > --- a/drivers/block/zram/zram_drv.h
> >> >> >> > +++ b/drivers/block/zram/zram_drv.h
> >> >> >> > @@ -115,6 +115,7 @@ struct zram {
> >> >> >> >         u64 disksize;   /* bytes */
> >> >> >> >         int max_comp_streams;
> >> >> >> >         struct zram_stats stats;
> >> >> >> > +       atomic_t alloc_fail;
> >> >> >> >         /*
> >> >> >> >          * the number of pages zram can consume for storing compressed data
> >> >> >> >          */
> >> >> >> > --
> >> >> >> > 2.0.0
> >> >> >> >
> >> >> >> >>
> >> >> >> >> heesub
> >> >> >> >>
> >> >> >> >> >+
> >> >> >> >> >+    return 0;
> >> >> >> >> >+}
> >> >> >> >> >+
> >> >> >> >> >  static int zram_swap_hint(struct block_device *bdev,
> >> >> >> >> >                             unsigned int hint, void *arg)
> >> >> >> >> >  {
> >> >> >> >> >@@ -958,6 +974,8 @@ static int zram_swap_hint(struct block_device *bdev,
> >> >> >> >> >
> >> >> >> >> >     if (hint == SWAP_SLOT_FREE)
> >> >> >> >> >             ret = zram_slot_free_notify(bdev, (unsigned long)arg);
> >> >> >> >> >+    else if (hint == SWAP_GET_FREE)
> >> >> >> >> >+            ret = zram_get_free_pages(bdev, arg);
> >> >> >> >> >
> >> >> >> >> >     return ret;
> >> >> >> >> >  }
> >> >> >> >> >
> >> >> >> >>
> >> >> >> >
> >> >> >> > --
> >> >> >> > Kind regards,
> >> >> >> > Minchan Kim
> >> >> >>
> >> >> >
> >> >> > --
> >> >> > Kind regards,
> >> >> > Minchan Kim
> >> >>
> >> >
> >> > --
> >> > Kind regards,
> >> > Minchan Kim
> >>
> >
> > --
> > Kind regards,
> > Minchan Kim
> 

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2014-09-19  6:14 UTC | newest]

Thread overview: 40+ messages
2014-09-04  1:39 [RFC 0/3] make vm aware of zram-swap Minchan Kim
2014-09-04  1:39 ` Minchan Kim
2014-09-04  1:39 ` [RFC 1/3] zram: generalize swap_slot_free_notify Minchan Kim
2014-09-04  1:39   ` Minchan Kim
2014-09-04  1:39 ` [RFC 2/3] mm: add swap_get_free hint for zram Minchan Kim
2014-09-04  1:39   ` Minchan Kim
2014-09-13 19:01   ` Dan Streetman
2014-09-13 19:01     ` Dan Streetman
2014-09-15  0:30     ` Minchan Kim
2014-09-15  0:30       ` Minchan Kim
2014-09-15 14:53       ` Dan Streetman
2014-09-15 14:53         ` Dan Streetman
2014-09-16  0:33         ` Minchan Kim
2014-09-16  0:33           ` Minchan Kim
2014-09-16 15:09           ` Dan Streetman
2014-09-16 15:09             ` Dan Streetman
2014-09-17  7:14             ` Minchan Kim
2014-09-17  7:14               ` Minchan Kim
2014-09-04  1:39 ` [RFC 3/3] zram: add swap_get_free hint Minchan Kim
2014-09-04  1:39   ` Minchan Kim
2014-09-04  6:26   ` Heesub Shin
2014-09-04  6:26     ` Heesub Shin
2014-09-04 23:59     ` Minchan Kim
2014-09-04 23:59       ` Minchan Kim
2014-09-13 19:39       ` Dan Streetman
2014-09-13 19:39         ` Dan Streetman
2014-09-15  0:57         ` Minchan Kim
2014-09-15  0:57           ` Minchan Kim
2014-09-15 16:00           ` Dan Streetman
2014-09-15 16:00             ` Dan Streetman
2014-09-16  1:21             ` Minchan Kim
2014-09-16  1:21               ` Minchan Kim
2014-09-16 15:58               ` Dan Streetman
2014-09-16 15:58                 ` Dan Streetman
2014-09-17  7:44                 ` Minchan Kim
2014-09-17  7:44                   ` Minchan Kim
2014-09-17 16:28                   ` Dan Streetman
2014-09-17 16:28                     ` Dan Streetman
2014-09-19  6:14                     ` Minchan Kim
2014-09-19  6:14                       ` Minchan Kim
