* [RFC][PATCH v3 0/5] mm/zsmalloc: rework compaction and increase density
@ 2016-03-03 14:45 ` Sergey Senozhatsky
  0 siblings, 0 replies; 50+ messages in thread
From: Sergey Senozhatsky @ 2016-03-03 14:45 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Joonsoo Kim, linux-mm, linux-kernel,
	Sergey Senozhatsky, Sergey Senozhatsky

Hello,

	RFC

Posting as an early preview of zsmalloc compaction and density
improvements.

The patch set will be rebased once Minchan posts his zsmalloc
rework.


zsmalloc knows the watermark after which classes are considered to be
huge -- every object stored in such a class consumes an entire zspage
(which consists of a single order-0 page). zram, however, has its own
statically defined watermark for `bad' compression and stores every
object larger than that watermark as a PAGE_SIZE object, IOW, in a
->huge class; this results in increased memory consumption and memory
wastage. zram's 'bad' watermark is also much lower than zsmalloc's.
Apart from that, 'bad' compressions are not that rare, and expecting
most pages passed to zram to compress to 3/4 of PAGE_SIZE or better is
a bit strange: no compression algorithm guarantees such a ratio. This
patch set inverts the 'huge class watermark' enforcement -- it is
zsmalloc, not zram, that knows better.
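
A minimal sketch of what this inversion looks like on the zram side
(this is essentially what patch 4 of this series does; the numbers are
the current defaults):

	/* before: zram's own static 'bad compression' threshold */
	if (unlikely(clen > max_zpage_size))	/* 3 * PAGE_SIZE / 4 */
		clen = PAGE_SIZE;

	/* after: ask zsmalloc whether this size lands in a ->huge class */
	if (unlikely(zs_huge_object(clen)))
		clen = PAGE_SIZE;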

The patch set reduces the number of huge classes, which saves some
memory.

zsmalloc classes are known to have fragmentation problems, which is why
compaction was added in the first place. The patch set replaces the
existing shrinker-callback-based compaction with a watermark-based one.
zsmalloc now tracks each class's fragmentation level and schedules
compaction work on a per-class basis once a class's fragmentation rises
above the watermark. Instead of compacting the entire pool
class-by-class, we now touch only the classes that are known to be
heavily fragmented.
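
A rough sketch of the scheduling path added in patch 1 (simplified; in
the actual patch the check sits at the end of zs_free() and the work
handler compacts only its own class):

	/* zs_free(): zspage survived the free -- check class fragmentation */
	if (!class_watermark_ok(class) && !class->compact_scheduled) {
		queue_work(system_long_wq, &class->compact_work);
		class->compact_scheduled = true;
	}

	/* workqueue handler */
	static void class_compaction_work(struct work_struct *work)
	{
		struct size_class *class = container_of(work,
					struct size_class, compact_work);

		__zs_compact(class->pool, class);
	}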


All important patches contain test results and test descriptions.
And it seems that previously weak tests (truncate) are no longer
problematic.


v3:
-- use watermark-based per-class compaction (workqueue)
-- remove shrinker compaction callbacks
-- increase order only for huge classes via special #defines
-- renamed zs_get_huge_class_size_watermark() function
-- patches re-ordered

v2:
-- keep ZS_MAX_PAGES_PER_ZSPAGE order of two (Joonsoo)
-- satisfy the ZS_MIN_ALLOC_SIZE alignment requirement
-- do not change ZS_MAX_PAGES_PER_ZSPAGE on PAE/LPAE and
   on PAGE_SHIFT 16 systems (Joonsoo)

Sergey Senozhatsky (5):
  mm/zsmalloc: introduce class auto-compaction
  mm/zsmalloc: remove shrinker compaction callbacks
  mm/zsmalloc: introduce zs_huge_object()
  zram: use zs_huge_object()
  mm/zsmalloc: reduce the number of huge classes

 drivers/block/zram/zram_drv.c |   2 +-
 drivers/block/zram/zram_drv.h |   6 --
 include/linux/zsmalloc.h      |   2 +
 mm/zsmalloc.c                 | 157 ++++++++++++++++++++----------------------
 4 files changed, 78 insertions(+), 89 deletions(-)

-- 
2.8.0.rc0

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [RFC][PATCH v3 1/5] mm/zsmalloc: introduce class auto-compaction
  2016-03-03 14:45 ` Sergey Senozhatsky
@ 2016-03-03 14:45   ` Sergey Senozhatsky
  -1 siblings, 0 replies; 50+ messages in thread
From: Sergey Senozhatsky @ 2016-03-03 14:45 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Joonsoo Kim, linux-mm, linux-kernel,
	Sergey Senozhatsky, Sergey Senozhatsky

zsmalloc classes are known to be affected by internal fragmentation.

For example, /sys/kernel/debug/zsmalloc/zramX/classes
 class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
    54   896           1           12           117         57         26                2       12
...
   107  1744           1           23           196         76         84                3       51
   111  1808           0            0            63         63         28                4        0
   126  2048           0          160           568        408        284                1       80
   144  2336          52          620          8631       5747       4932                4     1648
   151  2448         123          406         10090       8736       6054                3      810
   168  2720           0          512         15738      14926      10492                2      540
   190  3072           0            2           136        130        102                3        3
...

demonstrates that class-896 has 12/26=46% of unused pages, class-2336
has 1648/4932=33% of unused pages, etc. And the more classes we have as
'normal' classes (more than one object per zspage), the bigger this
problem grows. The existing compaction relies on user space (the user
can trigger compaction via zram's `compact' sysfs attribute) or on a
shrinker; it does not happen automatically.

This patch introduces a 'watermark' for unused pages and schedules
compaction work on a per-class basis once a class's fragmentation
becomes too big. Compaction is thus not performed in the context of the
current I/O operation, but later, from workqueue workers.

The current watermark is set to 40% -- if a class has 40+% of
`freeable' pages, then compaction work will be scheduled for it.
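
A rough worked example, using class-896 and class-2336 from the table
above (assuming the `freeable' column corresponds to zs_can_compact()
and `pages_used' to the page count computed in class_watermark_ok()):

  class-896:  100 * 12   / 26   == 46, 46 >= 40 -> compaction scheduled
  class-2336: 100 * 1648 / 4932 == 33, 33 <  40 -> below the watermark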

TEST
====

  2G zram, ext4, lzo

  iozone -t 1 -R -r 64K -s 1200M -I +Z

                        BASE       PATCHED
"  Initial write "   959670.94    966724.62
"        Rewrite "  1276167.62   1237632.88
"           Read "  3334708.25   3345357.50
"        Re-read "  3405310.75   3337137.25
"   Reverse Read "  3284499.75   3241283.50
"    Stride read "  3293417.75   3268364.00
"    Random read "  3255253.50   3241685.00
" Mixed workload "  3274398.00   3231498.00
"   Random write "  1253207.50   1216247.00
"         Pwrite "   873682.25    877045.81
"          Pread "  3173266.00   3318471.75
"         Fwrite "   881278.38    897622.81
"          Fread "  4397147.00   4501131.50

  iozone -t 3 -R -r 64K -s 60M -I +Z

                        BASE       PATCHED
"  Initial write "  1855931.62   1869576.31
"        Rewrite "  2223531.06   2221543.62
"           Read "  7958435.75   8023044.75
"        Re-read "  7912776.75   8068961.00
"   Reverse Read "  7832227.50   7788237.50
"    Stride read "  7952113.50   7919778.00
"    Random read "  7908816.00   7881792.50
" Mixed workload "  6364520.38   6332493.94
"   Random write "  2230115.69   2176777.19
"         Pwrite "  1915939.31   1929464.75
"          Pread "  3857052.91   3840517.91
"         Fwrite "  2271730.44   2272800.31
"          Fread "  9053867.00   8880966.25

Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
---
 mm/zsmalloc.c | 37 +++++++++++++++++++++++++++++++++++++
 1 file changed, 37 insertions(+)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index e72efb1..a4ef7e7 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -219,6 +219,10 @@ struct size_class {
 	int pages_per_zspage;
 	/* huge object: pages_per_zspage == 1 && maxobj_per_zspage == 1 */
 	bool huge;
+
+	bool compact_scheduled;
+	struct zs_pool *pool;
+	struct work_struct compact_work;
 };
 
 /*
@@ -1467,6 +1471,8 @@ static void obj_free(struct zs_pool *pool, struct size_class *class,
 	zs_stat_dec(class, OBJ_USED, 1);
 }
 
+static bool class_watermark_ok(struct size_class *class);
+
 void zs_free(struct zs_pool *pool, unsigned long handle)
 {
 	struct page *first_page, *f_page;
@@ -1495,6 +1501,11 @@ void zs_free(struct zs_pool *pool, unsigned long handle)
 		atomic_long_sub(class->pages_per_zspage,
 				&pool->pages_allocated);
 		free_zspage(first_page);
+	} else {
+		if (!class_watermark_ok(class) && !class->compact_scheduled) {
+			queue_work(system_long_wq, &class->compact_work);
+			class->compact_scheduled = true;
+		}
 	}
 	spin_unlock(&class->lock);
 	unpin_tag(handle);
@@ -1745,6 +1756,19 @@ static unsigned long zs_can_compact(struct size_class *class)
 	return obj_wasted * class->pages_per_zspage;
 }
 
+static bool class_watermark_ok(struct size_class *class)
+{
+	unsigned long pages_used = zs_stat_get(class, OBJ_ALLOCATED);
+
+	pages_used /= get_maxobj_per_zspage(class->size,
+			class->pages_per_zspage) * class->pages_per_zspage;
+
+	if (!pages_used)
+		return true;
+
+	return (100 * zs_can_compact(class) / pages_used) < 40;
+}
+
 static void __zs_compact(struct zs_pool *pool, struct size_class *class)
 {
 	struct zs_compact_control cc;
@@ -1789,9 +1813,17 @@ static void __zs_compact(struct zs_pool *pool, struct size_class *class)
 	if (src_page)
 		putback_zspage(pool, class, src_page);
 
+	class->compact_scheduled = false;
 	spin_unlock(&class->lock);
 }
 
+static void class_compaction_work(struct work_struct *work)
+{
+	struct size_class *class = container_of(work, struct size_class, compact_work);
+
+	__zs_compact(class->pool, class);
+}
+
 unsigned long zs_compact(struct zs_pool *pool)
 {
 	int i;
@@ -1948,6 +1980,9 @@ struct zs_pool *zs_create_pool(const char *name, gfp_t flags)
 		if (pages_per_zspage == 1 &&
 			get_maxobj_per_zspage(size, pages_per_zspage) == 1)
 			class->huge = true;
+
+		INIT_WORK(&class->compact_work, class_compaction_work);
+		class->pool = pool;
 		spin_lock_init(&class->lock);
 		pool->size_class[i] = class;
 
@@ -1990,6 +2025,8 @@ void zs_destroy_pool(struct zs_pool *pool)
 		if (class->index != i)
 			continue;
 
+		cancel_work_sync(&class->compact_work);
+
 		for (fg = 0; fg < _ZS_NR_FULLNESS_GROUPS; fg++) {
 			if (class->fullness_list[fg]) {
 				pr_info("Freeing non-empty class with size %db, fullness group %d\n",
-- 
2.8.0.rc0

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [RFC][PATCH v3 2/5] mm/zsmalloc: remove shrinker compaction callbacks
  2016-03-03 14:45 ` Sergey Senozhatsky
@ 2016-03-03 14:46   ` Sergey Senozhatsky
  -1 siblings, 0 replies; 50+ messages in thread
From: Sergey Senozhatsky @ 2016-03-03 14:46 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Joonsoo Kim, linux-mm, linux-kernel,
	Sergey Senozhatsky, Sergey Senozhatsky

Do not register shrinker compaction callbacks anymore, since we now
schedule per-class compaction work each time a class's fragmentation
value goes above the watermark.

Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
---
 mm/zsmalloc.c | 72 -----------------------------------------------------------
 1 file changed, 72 deletions(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index a4ef7e7..0bb060f 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -256,13 +256,6 @@ struct zs_pool {
 
 	struct zs_pool_stats stats;
 
-	/* Compact classes */
-	struct shrinker shrinker;
-	/*
-	 * To signify that register_shrinker() was successful
-	 * and unregister_shrinker() will not Oops.
-	 */
-	bool shrinker_enabled;
 #ifdef CONFIG_ZSMALLOC_STAT
 	struct dentry *stat_dentry;
 #endif
@@ -1848,64 +1841,6 @@ void zs_pool_stats(struct zs_pool *pool, struct zs_pool_stats *stats)
 }
 EXPORT_SYMBOL_GPL(zs_pool_stats);
 
-static unsigned long zs_shrinker_scan(struct shrinker *shrinker,
-		struct shrink_control *sc)
-{
-	unsigned long pages_freed;
-	struct zs_pool *pool = container_of(shrinker, struct zs_pool,
-			shrinker);
-
-	pages_freed = pool->stats.pages_compacted;
-	/*
-	 * Compact classes and calculate compaction delta.
-	 * Can run concurrently with a manually triggered
-	 * (by user) compaction.
-	 */
-	pages_freed = zs_compact(pool) - pages_freed;
-
-	return pages_freed ? pages_freed : SHRINK_STOP;
-}
-
-static unsigned long zs_shrinker_count(struct shrinker *shrinker,
-		struct shrink_control *sc)
-{
-	int i;
-	struct size_class *class;
-	unsigned long pages_to_free = 0;
-	struct zs_pool *pool = container_of(shrinker, struct zs_pool,
-			shrinker);
-
-	for (i = zs_size_classes - 1; i >= 0; i--) {
-		class = pool->size_class[i];
-		if (!class)
-			continue;
-		if (class->index != i)
-			continue;
-
-		pages_to_free += zs_can_compact(class);
-	}
-
-	return pages_to_free;
-}
-
-static void zs_unregister_shrinker(struct zs_pool *pool)
-{
-	if (pool->shrinker_enabled) {
-		unregister_shrinker(&pool->shrinker);
-		pool->shrinker_enabled = false;
-	}
-}
-
-static int zs_register_shrinker(struct zs_pool *pool)
-{
-	pool->shrinker.scan_objects = zs_shrinker_scan;
-	pool->shrinker.count_objects = zs_shrinker_count;
-	pool->shrinker.batch = 0;
-	pool->shrinker.seeks = DEFAULT_SEEKS;
-
-	return register_shrinker(&pool->shrinker);
-}
-
 /**
  * zs_create_pool - Creates an allocation pool to work from.
  * @flags: allocation flags used to allocate pool metadata
@@ -1994,12 +1929,6 @@ struct zs_pool *zs_create_pool(const char *name, gfp_t flags)
 	if (zs_pool_stat_create(name, pool))
 		goto err;
 
-	/*
-	 * Not critical, we still can use the pool
-	 * and user can trigger compaction manually.
-	 */
-	if (zs_register_shrinker(pool) == 0)
-		pool->shrinker_enabled = true;
 	return pool;
 
 err:
@@ -2012,7 +1941,6 @@ void zs_destroy_pool(struct zs_pool *pool)
 {
 	int i;
 
-	zs_unregister_shrinker(pool);
 	zs_pool_stat_destroy(pool);
 
 	for (i = 0; i < zs_size_classes; i++) {
-- 
2.8.0.rc0

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [RFC][PATCH v3 3/5] mm/zsmalloc: introduce zs_huge_object()
  2016-03-03 14:45 ` Sergey Senozhatsky
@ 2016-03-03 14:46   ` Sergey Senozhatsky
  -1 siblings, 0 replies; 50+ messages in thread
From: Sergey Senozhatsky @ 2016-03-03 14:46 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Joonsoo Kim, linux-mm, linux-kernel,
	Sergey Senozhatsky, Sergey Senozhatsky

zsmalloc knows the watermark after which classes are considered to be
->huge -- every object stored in such a class consumes an entire zspage
(which consists of a single order-0 page). On an x86_64, PAGE_SHIFT 12
box, the first non-huge class size is 3264, so from size 3264 downwards
objects share page(-s) and thus minimize memory wastage.

zram, however, has its own statically defined watermark for `bad'
compression, "3 * PAGE_SIZE / 4 = 3072", and stores every object larger
than this watermark (3072) as a PAGE_SIZE object, IOW, in a ->huge
class; this results in increased memory consumption and memory wastage.
(With a small exception: the 3264 bytes class. zs_malloc() adds
ZS_HANDLE_SIZE to the object's size, so some objects can pass 3072
bytes and get_size_class_index(size) will return the 3264 bytes size
class.)

Introduce a zs_huge_object() function which tells whether an object of
the supplied size belongs to a huge class; zram can now store objects
in ->huge classes only when those objects are larger than
huge_class_size_watermark.
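
For a rough idea of the new boundary: with the first non-huge class
size at 3264 and ZS_HANDLE_SIZE == sizeof(unsigned long) == 8 on
x86_64, huge_class_size_watermark ends up at 3256, so zs_huge_object()
reports 'huge' only for sizes above 3256 instead of zram's old static
3072 cut-off; objects in the 3073-3256 range can now be placed in
normal classes and share pages instead of each consuming a full
order-0 page.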

Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
---
 include/linux/zsmalloc.h |  2 ++
 mm/zsmalloc.c            | 13 +++++++++++++
 2 files changed, 15 insertions(+)

diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h
index 34eb160..7184ee1 100644
--- a/include/linux/zsmalloc.h
+++ b/include/linux/zsmalloc.h
@@ -55,4 +55,6 @@ unsigned long zs_get_total_pages(struct zs_pool *pool);
 unsigned long zs_compact(struct zs_pool *pool);
 
 void zs_pool_stats(struct zs_pool *pool, struct zs_pool_stats *stats);
+
+bool zs_huge_object(size_t sz);
 #endif
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 0bb060f..06a7d87 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -188,6 +188,11 @@ static struct dentry *zs_stat_root;
 static int zs_size_classes;
 
 /*
+ * All classes above this class_size are huge classes
+ */
+static size_t huge_class_size_watermark;
+
+/*
  * We assign a page to ZS_ALMOST_EMPTY fullness group when:
  *	n <= N / f, where
  * n = number of allocated objects
@@ -1244,6 +1249,12 @@ unsigned long zs_get_total_pages(struct zs_pool *pool)
 }
 EXPORT_SYMBOL_GPL(zs_get_total_pages);
 
+bool zs_huge_object(size_t sz)
+{
+	return sz > huge_class_size_watermark;
+}
+EXPORT_SYMBOL_GPL(zs_huge_object);
+
 /**
  * zs_map_object - get address of allocated object from handle.
  * @pool: pool from which the object was allocated
@@ -1922,6 +1933,8 @@ struct zs_pool *zs_create_pool(const char *name, gfp_t flags)
 		pool->size_class[i] = class;
 
 		prev_class = class;
+		if (!class->huge && !huge_class_size_watermark)
+			huge_class_size_watermark = size - ZS_HANDLE_SIZE;
 	}
 
 	pool->flags = flags;
-- 
2.8.0.rc0

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [RFC][PATCH v3 4/5] zram: use zs_huge_object()
  2016-03-03 14:45 ` Sergey Senozhatsky
@ 2016-03-03 14:46   ` Sergey Senozhatsky
  -1 siblings, 0 replies; 50+ messages in thread
From: Sergey Senozhatsky @ 2016-03-03 14:46 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Joonsoo Kim, linux-mm, linux-kernel,
	Sergey Senozhatsky, Sergey Senozhatsky

From: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>

zram should stop enforcing its own 'bad' object size watermark,
and start using zs_huge_object(). zsmalloc really knows better.

Drop `max_zpage_size' and use zs_huge_object() instead.

Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
---
 drivers/block/zram/zram_drv.c | 2 +-
 drivers/block/zram/zram_drv.h | 6 ------
 2 files changed, 1 insertion(+), 7 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 46055db..bb81b1b 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -714,7 +714,7 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
 		goto out;
 	}
 	src = zstrm->buffer;
-	if (unlikely(clen > max_zpage_size)) {
+	if (unlikely(zs_huge_object(clen))) {
 		clen = PAGE_SIZE;
 		if (is_partial_io(bvec))
 			src = uncmem;
diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
index 8e92339..8879161 100644
--- a/drivers/block/zram/zram_drv.h
+++ b/drivers/block/zram/zram_drv.h
@@ -23,12 +23,6 @@
 /*-- Configurable parameters */
 
 /*
- * Pages that compress to size greater than this are stored
- * uncompressed in memory.
- */
-static const size_t max_zpage_size = PAGE_SIZE / 4 * 3;
-
-/*
  * NOTE: max_zpage_size must be less than or equal to:
  *   ZS_MAX_ALLOC_SIZE. Otherwise, zs_malloc() would
  * always return failure.
-- 
2.8.0.rc0

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [RFC][PATCH v3 5/5] mm/zsmalloc: reduce the number of huge classes
  2016-03-03 14:45 ` Sergey Senozhatsky
@ 2016-03-03 14:46   ` Sergey Senozhatsky
  -1 siblings, 0 replies; 50+ messages in thread
From: Sergey Senozhatsky @ 2016-03-03 14:46 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Joonsoo Kim, linux-mm, linux-kernel,
	Sergey Senozhatsky, Sergey Senozhatsky

From: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>

The existing limit of at most 4 pages per zspage sets a tight bound
on ->huge classes, which results in increased memory consumption.

On x86_64, PAGE_SHIFT 12, the ->huge class_size range is 3280-4096.
Each ->huge zspage has only one order-0 page and can store only one
object, so huge classes increase zsmalloc memory usage. For instance,
we store 3408 bytes objects as PAGE_SIZE objects, while in fact each of
those objects has 4096 - 3408 = 688 bytes of spare space; 5 such
objects have 5*688 bytes of spare space -- enough to store an
additional 3408 bytes object without requesting a new order-0 page. In
general, turning a ->huge class into a normal one saves PAGE_SIZE bytes
every time a "PAGE_SIZE/(PAGE_SIZE - CLASS_SIZE)"-th object is stored.

The maximum number of order-0 pages in a zspage is limited by
ZS_MAX_ZSPAGE_ORDER (a zspage can have up to 1 << ZS_MAX_ZSPAGE_ORDER
order-0 pages). Increasing ZS_MAX_ZSPAGE_ORDER permits us to have fewer
->huge classes, because some of them can now form a 'normal' zspage
consisting of several order-0 pages. However, a larger
ZS_MAX_ZSPAGE_ORDER also changes the characteristics of 'normal'
classes. For instance, with ZS_MAX_ZSPAGE_ORDER set to 4, class-176
will grow to 11 order-0 pages per zspage from its current limit of 4
order-0 pages per zspage. While the new value gives better density,
because in 11 order-0 pages we can store 4096 * 11 / 176 == 256 objects
and 256 * 176 == 11 * 4096 (zero wasted bytes in every FULL zspage),
the class's increased memory requirement is considered to have a
negative impact (bigger than the gain). To avoid this, new
ZS_MAX_HUGE_ZSPAGE_ORDER and ZS_MAX_PAGES_PER_HUGE_ZSPAGE defines are
introduced and used for ->huge classes only; 'normal' classes are still
configured using the existing ZS_MAX_ZSPAGE_ORDER and
ZS_MAX_PAGES_PER_ZSPAGE.

Class configuration in get_pages_per_zspage() thus now has a fallback
scenario -- we first attempt to configure a class as a normal class
using the ZS_MAX_ZSPAGE_ORDER limit, and only if the class ends up
being ->huge do we attempt to configure it using the
ZS_MAX_HUGE_ZSPAGE_ORDER limit.

ZS_MAX_HUGE_ZSPAGE_ORDER set to 4 gives us a new ->huge class size
range of 3856-4096 bytes (on a PAGE_SHIFT 12 system).
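
As a rough illustration of why the ->huge range shrinks (numbers taken
from the PATCHED pool stats below): class-3840, previously ->huge, can
now use 15 order-0 pages per zspage, and 15 * 4096 == 61440 == 16 * 3840,
i.e. 16 objects per zspage with zero wasted bytes in a full zspage.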

TESTS
=====

1) copy linux directory with source and object files

/sys/block/zram0/mm_stat

BASE
2613661696 1786896890 1812230144        0 1812230144     4313      106        2

PATCHED
2616717312 1766397396 1793818624        0 1793818624     3569      147        1

pool stats for this test:

BASE
...
   168  2720           0            0         12405      12405       8270                2        0
   190  3072           0            1          7328       7327       5496                3        0
   202  3264           0            1           150        148        120                4        0
   254  4096           0            0        325705     325705     325705                1        0

 Total                46           53        638809     638101     442439                         6

PATCHED
...
   202  3264           0            1          5065       5062       4052                4        0
   206  3328           0            1          2512       2506       2041               13        0
   207  3344           0            1           759        754        621                9        0
   208  3360           0            1           765        749        630               14        0
   211  3408           0            1          2454       2449       2045                5        0
   212  3424           0            1           836        829        704               16        0
   214  3456           0            1          1742       1735       1474               11        0
   217  3504           0            1          2765       2760       2370                6        0
   219  3536           1            0          1890       1887       1638               13        0
   222  3584           0            0          2720       2720       2380                7        0
   223  3600           0            1           918        906        810               15        0
   225  3632           0            1          1908       1902       1696                8        0
   228  3680           0            1          2860       2857       2574                9        0
   230  3712           0            1          1892       1883       1720               10        0
   232  3744           0            1          1812       1801       1661               11        0
   234  3776           1            0          1794       1793       1656               12        0
   235  3792           0            1           882        875        819               13        0
   236  3808           0            1           870        860        812               14        0
   238  3840           1            0          1664       1663       1560               15        0
   254  4096           0            0        289864     289864     289864                1        0

 Total                25           83        639985     638847     437944                        17

= memory saving: (1812230144 - 1793818624) / 4096 = 4495 order-0 pages

2) run a test script

 a) for i in {2..7} do
   -- create text files (50% of disk size), create binary files (50% of disk size)
   -- truncate all files to 1/$i of original size; remove files;

 b) create text files (50% of disk size), create binary files (50% of disk size)
    - compress (gzip -9) text files
    - create a copy of every compressed file, using original files' names,
      so gunzip will have to overwrite files during decompression
    - decompress archives, overwriting already existing files

Showing only zs_get_total_pages() and zram->max_used_pages stats (/sys/block/zram0/mm_stat)

                                  BASE                      PATCHED
INITIAL STATE           1620959232 / 1620959232     1539022848 / 1539022848
TRUNCATE BIN 1/2        1128337408 / 1621217280     1114112000 / 1539305472
TRUNCATE TEXT 1/2        627945472 / 1621217280      613818368 / 1539305472

INITIAL STATE           1626198016 / 1626198016     1544245248 / 1544245248
TRUNCATE BIN 1/3        1028653056 / 1626238976     1029984256 / 1544269824
TRUNCATE TEXT 1/3        355729408 / 1626238976      357408768 / 1544269824

INITIAL STATE           1626730496 / 1626730496     1544654848 / 1544654848
TRUNCATE BIN 1/4        1021796352 / 1626763264     1021341696 / 1544663040
TRUNCATE TEXT 1/4        264744960 / 1626763264      265011200 / 1544663040

INITIAL STATE           1626726400 / 1626763264     1544794112 / 1544794112
TRUNCATE BIN 1/5        1021652992 / 1626763264     1021386752 / 1544802304
TRUNCATE TEXT 1/5        214376448 / 1626763264      214519808 / 1544802304

INITIAL STATE           1626853376 / 1626853376     1544835072 / 1544835072
TRUNCATE BIN 1/6        1021714432 / 1626853376     1021464576 / 1544863744
TRUNCATE TEXT 1/6        180908032 / 1626853376      181075968 / 1544863744

INITIAL STATE           1626726400 / 1626853376     1544822784 / 1544863744
TRUNCATE BIN 1/7        1021595648 / 1626853376     1021612032 / 1544863744
TRUNCATE TEXT 1/7        156188672 / 1626853376      156327936 / 1544863744

INITIAL STATE           1626796032 / 1626873856     1544904704 / 1544953856
COMPRESS TEXT           1763684352 / 1768726528     1681707008 / 1686708224
DECOMPRESS TEXT         1628348416 / 1768726528     1546612736 / 1686708224

Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
---
 mm/zsmalloc.c | 35 +++++++++++++++++++++++++----------
 1 file changed, 25 insertions(+), 10 deletions(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 06a7d87..d5aae38 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -73,13 +73,6 @@
  */
 #define ZS_ALIGN		8
 
-/*
- * A single 'zspage' is composed of up to 2^N discontiguous 0-order (single)
- * pages. ZS_MAX_ZSPAGE_ORDER defines upper limit on N.
- */
-#define ZS_MAX_ZSPAGE_ORDER 2
-#define ZS_MAX_PAGES_PER_ZSPAGE (_AC(1, UL) << ZS_MAX_ZSPAGE_ORDER)
-
 #define ZS_HANDLE_SIZE (sizeof(unsigned long))
 
 /*
@@ -95,7 +88,7 @@
 
 #ifndef MAX_PHYSMEM_BITS
 #ifdef CONFIG_HIGHMEM64G
-#define MAX_PHYSMEM_BITS 36
+#define MAX_PHYSMEM_BITS	36
 #else /* !CONFIG_HIGHMEM64G */
 /*
  * If this definition of MAX_PHYSMEM_BITS is used, OBJ_INDEX_BITS will just
@@ -104,8 +97,19 @@
 #define MAX_PHYSMEM_BITS BITS_PER_LONG
 #endif
 #endif
+
 #define _PFN_BITS		(MAX_PHYSMEM_BITS - PAGE_SHIFT)
 
+
+/*
+ * A single 'zspage' is composed of up to 2^N discontiguous 0-order (single)
+ * pages. ZS_MAX_ZSPAGE_ORDER defines upper limit on N.
+ */
+#define ZS_MAX_ZSPAGE_ORDER	2
+#define ZS_MAX_HUGE_ZSPAGE_ORDER	4
+#define ZS_MAX_PAGES_PER_ZSPAGE (_AC(1, UL) << ZS_MAX_ZSPAGE_ORDER)
+#define ZS_MAX_PAGES_PER_HUGE_ZSPAGE (_AC(1, UL) << ZS_MAX_HUGE_ZSPAGE_ORDER)
+
 /*
  * Memory for allocating for handle keeps object position by
  * encoding <page, obj_idx> and the encoded value has a room
@@ -752,13 +756,13 @@ out:
  * link together 3 PAGE_SIZE sized pages to form a zspage
  * since then we can perfectly fit in 8 such objects.
  */
-static int get_pages_per_zspage(int class_size)
+static int __get_pages_per_zspage(int class_size, int max_pages)
 {
 	int i, max_usedpc = 0;
 	/* zspage order which gives maximum used size per KB */
 	int max_usedpc_order = 1;
 
-	for (i = 1; i <= ZS_MAX_PAGES_PER_ZSPAGE; i++) {
+	for (i = 1; i <= max_pages; i++) {
 		int zspage_size;
 		int waste, usedpc;
 
@@ -775,6 +779,17 @@ static int get_pages_per_zspage(int class_size)
 	return max_usedpc_order;
 }
 
+static int get_pages_per_zspage(int class_size)
+{
+	int num = __get_pages_per_zspage(class_size,
+			ZS_MAX_PAGES_PER_ZSPAGE);
+
+	if (num == 1 && get_maxobj_per_zspage(class_size, num) == 1)
+		num = __get_pages_per_zspage(class_size,
+				ZS_MAX_PAGES_PER_HUGE_ZSPAGE);
+	return num;
+}
+
 /*
  * A single 'zspage' is composed of many system pages which are
  * linked together using fields in struct page. This function finds
-- 
2.8.0.rc0

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [RFC][PATCH v3 5/5] mm/zsmalloc: reduce the number of huge classes
@ 2016-03-03 14:46   ` Sergey Senozhatsky
  0 siblings, 0 replies; 50+ messages in thread
From: Sergey Senozhatsky @ 2016-03-03 14:46 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Joonsoo Kim, linux-mm, linux-kernel,
	Sergey Senozhatsky, Sergey Senozhatsky

From: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>

The existing limit of max 4 pages per zspage sets a tight limit
on ->huge classes, which results in increased memory consumption.

On x86_64, PAGE_SHIFT 12, ->huge class_size range is 3280-4096.
Each ->huge zspage has only one order-0 page and can store only
one object, so in the huge classes increase zsmalloc memory usage.
For instance, we store 3408 bytes objects as PAGE_SIZE objects,
while in fact each of those objects has 4096 - 3408 = 688 bytes
of spare space, so 5 objects will have 5*688 bytes of spare
space - enough to store an additional 3408 bytes objects with out
requesting a new order-0 page. In general, turning a ->huge class
into a normal will save PAGE_SIZE bytes every time
"PAGE_SIZE/(PAGE_SIZE - CLASS_SIZE)"-th object is stored.

The maximum number of order-0 pages in zspages is limited by
ZS_MAX_ZSPAGE_ORDER (zspage can have up to 1 << ZS_MAX_ZSPAGE_ORDER
order-0 pages). Increasing ZS_MAX_ZSPAGE_ORDER permits us to have
less ->huge classes, because some of them now can form a 'normal'
zspage consisting of several order-0 pages, however, a larger
ZS_MAX_ZSPAGE_ORDER will change characteristics of 'normal'
classes. For instance, with ZS_MAX_ZSPAGE_ORDER set to 4, class-176
will grow to 11 order-0 pages per zspage from its current limit of 4
order-0 pages per zspage. While the new value gives better density,
because in 11 order-0 pages we can store 4096 * 11 / 176 == 256
objects and 256 * 176 == 11 * 4096 (zero wasted bytes in every FULL
zspage), the increased class's memory requirement is considered to
have a negative impact (bigger than the gain). To avoid this, new
ZS_MAX_HUGE_ZSPAGE_ORDER and ZS_MAX_PAGES_PER_HUGE_ZSPAGE defines
are introduced, which will be used for ->huge classes only; 'normal'
classes will still be configured using the existing ZS_MAX_ZSPAGE_ORDER
and ZS_MAX_PAGES_PER_ZSPAGE.

Class configuration in get_pages_per_zspage(), thus, now has a
fallback scenario -- we attempt to configure class as a normal
class first using ZS_MAX_ZSPAGE_ORDER limit and only if that
class ended up being ->huge we attempt to configure it using
ZS_MAX_HUGE_ZSPAGE_ORDER limit.

ZS_MAX_HUGE_ZSPAGE_ORDER set to 4 give us a new ->huge classes
size range of: 3856-4096 bytes (on a PAGE_SHIFT 12 system).

TESTS
=====

1) copy linux directory with source and object files

/sys/block/zram0/mm_stat

BASE
2613661696 1786896890 1812230144        0 1812230144     4313      106        2

PATCHED
2616717312 1766397396 1793818624        0 1793818624     3569      147        1

pool stats for this test:

BASE
...
   168  2720           0            0         12405      12405       8270                2        0
   190  3072           0            1          7328       7327       5496                3        0
   202  3264           0            1           150        148        120                4        0
   254  4096           0            0        325705     325705     325705                1        0

 Total                46           53        638809     638101     442439                         6

PATCHED
...
   202  3264           0            1          5065       5062       4052                4        0
   206  3328           0            1          2512       2506       2041               13        0
   207  3344           0            1           759        754        621                9        0
   208  3360           0            1           765        749        630               14        0
   211  3408           0            1          2454       2449       2045                5        0
   212  3424           0            1           836        829        704               16        0
   214  3456           0            1          1742       1735       1474               11        0
   217  3504           0            1          2765       2760       2370                6        0
   219  3536           1            0          1890       1887       1638               13        0
   222  3584           0            0          2720       2720       2380                7        0
   223  3600           0            1           918        906        810               15        0
   225  3632           0            1          1908       1902       1696                8        0
   228  3680           0            1          2860       2857       2574                9        0
   230  3712           0            1          1892       1883       1720               10        0
   232  3744           0            1          1812       1801       1661               11        0
   234  3776           1            0          1794       1793       1656               12        0
   235  3792           0            1           882        875        819               13        0
   236  3808           0            1           870        860        812               14        0
   238  3840           1            0          1664       1663       1560               15        0
   254  4096           0            0        289864     289864     289864                1        0

 Total                25           83        639985     638847     437944                        17

= memory saving: (1812230144 - 1793818624) / 4096 = 4495 order-0 pages

2) run a test script

 a) for i in {2..7} do
   -- create text files (50% of disk size), create binary files (50% of disk size)
   -- truncate all files to 1/$i of original size; remove files;

 b) create text files (50% of disk size), create binary files (50% of disk size)
    - compress (gzip -9) text files
    - create a copy of every compressed file, using original files' names,
      so gunzip will have to overwrite files during decompression
    - decompress archives, overwriting already existing files

Showing only zs_get_total_pages() and zram->max_used_pages stats (/sys/block/zram0/mm_stat)

                                  BASE                      PATCHED
INITIAL STATE           1620959232 / 1620959232     1539022848 / 1539022848
TRUNCATE BIN 1/2        1128337408 / 1621217280     1114112000 / 1539305472
TRUNCATE TEXT 1/2        627945472 / 1621217280      613818368 / 1539305472

INITIAL STATE           1626198016 / 1626198016     1544245248 / 1544245248
TRUNCATE BIN 1/3        1028653056 / 1626238976     1029984256 / 1544269824
TRUNCATE TEXT 1/3        355729408 / 1626238976      357408768 / 1544269824

INITIAL STATE           1626730496 / 1626730496     1544654848 / 1544654848
TRUNCATE BIN 1/4        1021796352 / 1626763264     1021341696 / 1544663040
TRUNCATE TEXT 1/4        264744960 / 1626763264      265011200 / 1544663040

INITIAL STATE           1626726400 / 1626763264     1544794112 / 1544794112
TRUNCATE BIN 1/5        1021652992 / 1626763264     1021386752 / 1544802304
TRUNCATE TEXT 1/5        214376448 / 1626763264      214519808 / 1544802304

INITIAL STATE           1626853376 / 1626853376     1544835072 / 1544835072
TRUNCATE BIN 1/6        1021714432 / 1626853376     1021464576 / 1544863744
TRUNCATE TEXT 1/6        180908032 / 1626853376      181075968 / 1544863744

INITIAL STATE           1626726400 / 1626853376     1544822784 / 1544863744
TRUNCATE BIN 1/7        1021595648 / 1626853376     1021612032 / 1544863744
TRUNCATE TEXT 1/7        156188672 / 1626853376      156327936 / 1544863744

INITIAL STATE           1626796032 / 1626873856     1544904704 / 1544953856
COMPRESS TEXT           1763684352 / 1768726528     1681707008 / 1686708224
DECOMPRESS TEXT         1628348416 / 1768726528     1546612736 / 1686708224

Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
---
 mm/zsmalloc.c | 35 +++++++++++++++++++++++++----------
 1 file changed, 25 insertions(+), 10 deletions(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 06a7d87..d5aae38 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -73,13 +73,6 @@
  */
 #define ZS_ALIGN		8
 
-/*
- * A single 'zspage' is composed of up to 2^N discontiguous 0-order (single)
- * pages. ZS_MAX_ZSPAGE_ORDER defines upper limit on N.
- */
-#define ZS_MAX_ZSPAGE_ORDER 2
-#define ZS_MAX_PAGES_PER_ZSPAGE (_AC(1, UL) << ZS_MAX_ZSPAGE_ORDER)
-
 #define ZS_HANDLE_SIZE (sizeof(unsigned long))
 
 /*
@@ -95,7 +88,7 @@
 
 #ifndef MAX_PHYSMEM_BITS
 #ifdef CONFIG_HIGHMEM64G
-#define MAX_PHYSMEM_BITS 36
+#define MAX_PHYSMEM_BITS	36
 #else /* !CONFIG_HIGHMEM64G */
 /*
  * If this definition of MAX_PHYSMEM_BITS is used, OBJ_INDEX_BITS will just
@@ -104,8 +97,19 @@
 #define MAX_PHYSMEM_BITS BITS_PER_LONG
 #endif
 #endif
+
 #define _PFN_BITS		(MAX_PHYSMEM_BITS - PAGE_SHIFT)
 
+
+/*
+ * A single 'zspage' is composed of up to 2^N discontiguous 0-order (single)
+ * pages. ZS_MAX_ZSPAGE_ORDER defines upper limit on N.
+ */
+#define ZS_MAX_ZSPAGE_ORDER	2
+#define ZS_MAX_HUGE_ZSPAGE_ORDER	4
+#define ZS_MAX_PAGES_PER_ZSPAGE (_AC(1, UL) << ZS_MAX_ZSPAGE_ORDER)
+#define ZS_MAX_PAGES_PER_HUGE_ZSPAGE (_AC(1, UL) << ZS_MAX_HUGE_ZSPAGE_ORDER)
+
 /*
  * Memory for allocating for handle keeps object position by
  * encoding <page, obj_idx> and the encoded value has a room
@@ -752,13 +756,13 @@ out:
  * link together 3 PAGE_SIZE sized pages to form a zspage
  * since then we can perfectly fit in 8 such objects.
  */
-static int get_pages_per_zspage(int class_size)
+static int __get_pages_per_zspage(int class_size, int max_pages)
 {
 	int i, max_usedpc = 0;
 	/* zspage order which gives maximum used size per KB */
 	int max_usedpc_order = 1;
 
-	for (i = 1; i <= ZS_MAX_PAGES_PER_ZSPAGE; i++) {
+	for (i = 1; i <= max_pages; i++) {
 		int zspage_size;
 		int waste, usedpc;
 
@@ -775,6 +779,17 @@ static int get_pages_per_zspage(int class_size)
 	return max_usedpc_order;
 }
 
+static int get_pages_per_zspage(int class_size)
+{
+	int num = __get_pages_per_zspage(class_size,
+			ZS_MAX_PAGES_PER_ZSPAGE);
+
+	if (num == 1 && get_maxobj_per_zspage(class_size, num) == 1)
+		num = __get_pages_per_zspage(class_size,
+				ZS_MAX_PAGES_PER_HUGE_ZSPAGE);
+	return num;
+}
+
 /*
  * A single 'zspage' is composed of many system pages which are
  * linked together using fields in struct page. This function finds
-- 
2.8.0.rc0

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [RFC][PATCH v3 1/5] mm/zsmalloc: introduce class auto-compaction
  2016-03-03 14:45   ` Sergey Senozhatsky
@ 2016-03-14  6:17     ` Minchan Kim
  -1 siblings, 0 replies; 50+ messages in thread
From: Minchan Kim @ 2016-03-14  6:17 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Andrew Morton, Joonsoo Kim, linux-mm, linux-kernel, Sergey Senozhatsky

Hey Sergey,

Sorry for the late review.

On Thu, Mar 03, 2016 at 11:45:59PM +0900, Sergey Senozhatsky wrote:
> zsmalloc classes are known to be affected by internal fragmentation.
> 
> For example, /sys/kernel/debug/zsmalloc/zramX/classes
>  class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
>     54   896           1           12           117         57         26                2       12
> ...
>    107  1744           1           23           196         76         84                3       51
>    111  1808           0            0            63         63         28                4        0
>    126  2048           0          160           568        408        284                1       80
>    144  2336          52          620          8631       5747       4932                4     1648
>    151  2448         123          406         10090       8736       6054                3      810
>    168  2720           0          512         15738      14926      10492                2      540
>    190  3072           0            2           136        130        102                3        3
> ...
> 
> demonstrates that class-896 has 12/26=46% of unused pages, class-2336 has
> 1648/4932=33% of unused pages, etc. And the more classes we will have as
> 'normal' classes (more than one object per-zspage) the bigger this problem
> will grow. The existing compaction relies on a user space (user can trigger
> compaction via `compact' zram's sysfs attr) or a shrinker; it does not
> happen automatically.
> 
> This patch introduces a 'watermark' value of unused pages and schedules a
> compaction work on a per-class basis once class's fragmentation becomes
> too big. So compaction is not performed in current I/O operation context,
> but in workqueue workers later.
> 
> The current watermark is set to 40% -- if class has 40+% of `freeable'
> pages then compaction work will be scheduled.

Could you explain why you selected a per-class watermark?
My plan was to kick a background work based on the total fragmented memory
(i.e., when used_pages/allocated_pages drops below some threshold).

IOW, if used_pages/allocated_pages is less than some ratio, we kick a
background job, marking the index of the size class that was just freed,
and the job then scans size classes circularly starting from that index.
As well, we should put an upper bound on the number of zspages to scan to
make it deterministic.

What do you think about it?

> 
> TEST
> ====
> 
>   2G zram, ext4, lz0
> 
>   iozone -t 1 -R -r 64K -s 1200M -I +Z
> 
>                         BASE       PATCHED
> "  Initial write "   959670.94    966724.62
> "        Rewrite "  1276167.62   1237632.88
> "           Read "  3334708.25   3345357.50
> "        Re-read "  3405310.75   3337137.25
> "   Reverse Read "  3284499.75   3241283.50
> "    Stride read "  3293417.75   3268364.00
> "    Random read "  3255253.50   3241685.00
> " Mixed workload "  3274398.00   3231498.00
> "   Random write "  1253207.50   1216247.00
> "         Pwrite "   873682.25    877045.81
> "          Pread "  3173266.00   3318471.75
> "         Fwrite "   881278.38    897622.81
> "          Fread "  4397147.00   4501131.50
> 
>   iozone -t 3 -R -r 64K -s 60M -I +Z
> 
>                         BASE       PATCHED
> "  Initial write "  1855931.62   1869576.31
> "        Rewrite "  2223531.06   2221543.62
> "           Read "  7958435.75   8023044.75
> "        Re-read "  7912776.75   8068961.00
> "   Reverse Read "  7832227.50   7788237.50
> "    Stride read "  7952113.50   7919778.00
> "    Random read "  7908816.00   7881792.50
> " Mixed workload "  6364520.38   6332493.94
> "   Random write "  2230115.69   2176777.19
> "         Pwrite "  1915939.31   1929464.75
> "          Pread "  3857052.91   3840517.91
> "         Fwrite "  2271730.44   2272800.31
> "          Fread "  9053867.00   8880966.25
> 
> Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>

> ---
>  mm/zsmalloc.c | 37 +++++++++++++++++++++++++++++++++++++
>  1 file changed, 37 insertions(+)
> 
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index e72efb1..a4ef7e7 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -219,6 +219,10 @@ struct size_class {
>  	int pages_per_zspage;
>  	/* huge object: pages_per_zspage == 1 && maxobj_per_zspage == 1 */
>  	bool huge;
> +
> +	bool compact_scheduled;
> +	struct zs_pool *pool;
> +	struct work_struct compact_work;
>  };
>  
>  /*
> @@ -1467,6 +1471,8 @@ static void obj_free(struct zs_pool *pool, struct size_class *class,
>  	zs_stat_dec(class, OBJ_USED, 1);
>  }
>  
> +static bool class_watermark_ok(struct size_class *class);
> +
>  void zs_free(struct zs_pool *pool, unsigned long handle)
>  {
>  	struct page *first_page, *f_page;
> @@ -1495,6 +1501,11 @@ void zs_free(struct zs_pool *pool, unsigned long handle)
>  		atomic_long_sub(class->pages_per_zspage,
>  				&pool->pages_allocated);
>  		free_zspage(first_page);
> +	} else {
> +		if (!class_watermark_ok(class) && !class->compact_scheduled) {
> +			queue_work(system_long_wq, &class->compact_work);
> +			class->compact_scheduled = true;
> +		}
>  	}
>  	spin_unlock(&class->lock);
>  	unpin_tag(handle);
> @@ -1745,6 +1756,19 @@ static unsigned long zs_can_compact(struct size_class *class)
>  	return obj_wasted * class->pages_per_zspage;
>  }
>  
> +static bool class_watermark_ok(struct size_class *class)
> +{
> +	unsigned long pages_used = zs_stat_get(class, OBJ_ALLOCATED);
> +
> +	pages_used /= get_maxobj_per_zspage(class->size,
> +			class->pages_per_zspage) * class->pages_per_zspage;
> +
> +	if (!pages_used)
> +		return true;
> +
> +	return (100 * zs_can_compact(class) / pages_used) < 40;
> +}
> +
>  static void __zs_compact(struct zs_pool *pool, struct size_class *class)
>  {
>  	struct zs_compact_control cc;
> @@ -1789,9 +1813,17 @@ static void __zs_compact(struct zs_pool *pool, struct size_class *class)
>  	if (src_page)
>  		putback_zspage(pool, class, src_page);
>  
> +	class->compact_scheduled = false;
>  	spin_unlock(&class->lock);
>  }
>  
> +static void class_compaction_work(struct work_struct *work)
> +{
> +	struct size_class *class = container_of(work, struct size_class, compact_work);
> +
> +	__zs_compact(class->pool, class);
> +}
> +
>  unsigned long zs_compact(struct zs_pool *pool)
>  {
>  	int i;
> @@ -1948,6 +1980,9 @@ struct zs_pool *zs_create_pool(const char *name, gfp_t flags)
>  		if (pages_per_zspage == 1 &&
>  			get_maxobj_per_zspage(size, pages_per_zspage) == 1)
>  			class->huge = true;
> +
> +		INIT_WORK(&class->compact_work, class_compaction_work);
> +		class->pool = pool;
>  		spin_lock_init(&class->lock);
>  		pool->size_class[i] = class;
>  
> @@ -1990,6 +2025,8 @@ void zs_destroy_pool(struct zs_pool *pool)
>  		if (class->index != i)
>  			continue;
>  
> +		cancel_work_sync(&class->compact_work);
> +
>  		for (fg = 0; fg < _ZS_NR_FULLNESS_GROUPS; fg++) {
>  			if (class->fullness_list[fg]) {
>  				pr_info("Freeing non-empty class with size %db, fullness group %d\n",
> -- 
> 2.8.0.rc0
> 

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC][PATCH v3 2/5] mm/zsmalloc: remove shrinker compaction callbacks
  2016-03-03 14:46   ` Sergey Senozhatsky
@ 2016-03-14  6:32     ` Minchan Kim
  -1 siblings, 0 replies; 50+ messages in thread
From: Minchan Kim @ 2016-03-14  6:32 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Andrew Morton, Joonsoo Kim, linux-mm, linux-kernel, Sergey Senozhatsky

On Thu, Mar 03, 2016 at 11:46:00PM +0900, Sergey Senozhatsky wrote:
> Do not register shrinker compaction callbacks anymore, since
> now we shedule class compaction work each time its fragmentation
> value goes above the watermark.

I suggested removing the shrinker compaction, but while reviewing your
first patch in this thread, I thought we need an upper bound on how many
zspages to compact so the background work can easily bail out for latency
reasons. IOW, the work could give up the job. In such a case, we might
need a fall-back scheme to continue the job. And I think that could be
a shrinker.

What do you think?

> 
> Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
> ---
>  mm/zsmalloc.c | 72 -----------------------------------------------------------
>  1 file changed, 72 deletions(-)
> 
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index a4ef7e7..0bb060f 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -256,13 +256,6 @@ struct zs_pool {
>  
>  	struct zs_pool_stats stats;
>  
> -	/* Compact classes */
> -	struct shrinker shrinker;
> -	/*
> -	 * To signify that register_shrinker() was successful
> -	 * and unregister_shrinker() will not Oops.
> -	 */
> -	bool shrinker_enabled;
>  #ifdef CONFIG_ZSMALLOC_STAT
>  	struct dentry *stat_dentry;
>  #endif
> @@ -1848,64 +1841,6 @@ void zs_pool_stats(struct zs_pool *pool, struct zs_pool_stats *stats)
>  }
>  EXPORT_SYMBOL_GPL(zs_pool_stats);
>  
> -static unsigned long zs_shrinker_scan(struct shrinker *shrinker,
> -		struct shrink_control *sc)
> -{
> -	unsigned long pages_freed;
> -	struct zs_pool *pool = container_of(shrinker, struct zs_pool,
> -			shrinker);
> -
> -	pages_freed = pool->stats.pages_compacted;
> -	/*
> -	 * Compact classes and calculate compaction delta.
> -	 * Can run concurrently with a manually triggered
> -	 * (by user) compaction.
> -	 */
> -	pages_freed = zs_compact(pool) - pages_freed;
> -
> -	return pages_freed ? pages_freed : SHRINK_STOP;
> -}
> -
> -static unsigned long zs_shrinker_count(struct shrinker *shrinker,
> -		struct shrink_control *sc)
> -{
> -	int i;
> -	struct size_class *class;
> -	unsigned long pages_to_free = 0;
> -	struct zs_pool *pool = container_of(shrinker, struct zs_pool,
> -			shrinker);
> -
> -	for (i = zs_size_classes - 1; i >= 0; i--) {
> -		class = pool->size_class[i];
> -		if (!class)
> -			continue;
> -		if (class->index != i)
> -			continue;
> -
> -		pages_to_free += zs_can_compact(class);
> -	}
> -
> -	return pages_to_free;
> -}
> -
> -static void zs_unregister_shrinker(struct zs_pool *pool)
> -{
> -	if (pool->shrinker_enabled) {
> -		unregister_shrinker(&pool->shrinker);
> -		pool->shrinker_enabled = false;
> -	}
> -}
> -
> -static int zs_register_shrinker(struct zs_pool *pool)
> -{
> -	pool->shrinker.scan_objects = zs_shrinker_scan;
> -	pool->shrinker.count_objects = zs_shrinker_count;
> -	pool->shrinker.batch = 0;
> -	pool->shrinker.seeks = DEFAULT_SEEKS;
> -
> -	return register_shrinker(&pool->shrinker);
> -}
> -
>  /**
>   * zs_create_pool - Creates an allocation pool to work from.
>   * @flags: allocation flags used to allocate pool metadata
> @@ -1994,12 +1929,6 @@ struct zs_pool *zs_create_pool(const char *name, gfp_t flags)
>  	if (zs_pool_stat_create(name, pool))
>  		goto err;
>  
> -	/*
> -	 * Not critical, we still can use the pool
> -	 * and user can trigger compaction manually.
> -	 */
> -	if (zs_register_shrinker(pool) == 0)
> -		pool->shrinker_enabled = true;
>  	return pool;
>  
>  err:
> @@ -2012,7 +1941,6 @@ void zs_destroy_pool(struct zs_pool *pool)
>  {
>  	int i;
>  
> -	zs_unregister_shrinker(pool);
>  	zs_pool_stat_destroy(pool);
>  
>  	for (i = 0; i < zs_size_classes; i++) {
> -- 
> 2.8.0.rc0
> 

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC][PATCH v3 3/5] mm/zsmalloc: introduce zs_huge_object()
  2016-03-03 14:46   ` Sergey Senozhatsky
@ 2016-03-14  6:53     ` Minchan Kim
  -1 siblings, 0 replies; 50+ messages in thread
From: Minchan Kim @ 2016-03-14  6:53 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Andrew Morton, Joonsoo Kim, linux-mm, linux-kernel, Sergey Senozhatsky

On Thu, Mar 03, 2016 at 11:46:01PM +0900, Sergey Senozhatsky wrote:
> zsmalloc knows the watermark after which classes are considered
> to be ->huge -- every object stored consumes the entire zspage (which
> consist of a single order-0 page). On x86_64, PAGE_SHIFT 12 box, the
> first non-huge class size is 3264, so starting down from size 3264,
> objects share page(-s) and thus minimize memory wastage.
> 
> zram, however, has its own statically defined watermark for `bad'
> compression "3 * PAGE_SIZE / 4 = 3072", and stores every object
> larger than this watermark (3072) as a PAGE_SIZE, object, IOW,
> to a ->huge class, this results in increased memory consumption and
> memory wastage. (With a small exception: 3264 bytes class. zs_malloc()
> adds ZS_HANDLE_SIZE to the object's size, so some objects can pass
> 3072 bytes and get_size_class_index(size) will return 3264 bytes size
> class).
> 
> Introduce zs_huge_object() function which tells whether the supplied
> object's size belongs to a huge class; so zram now can store objects
> to ->huge clases only when those objects have sizes greater than
> huge_class_size_watermark.

I understand the problem you pointed out but I don't like this way.

The huge class is an internal thing in zsmalloc, so zram shouldn't be
coupled with it. Zram just uses zsmalloc to minimize memory wastage, which
is all zram should know about zsmalloc.

Instead, how about changing max_zpage_size?

        static const size_t max_zpage_size = 4096;

So, if compression doesn't help memory efficiency, we don't need to pay
the decompression overhead. Only in that case do we store the page
uncompressed.
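
For illustration, a schematic (userspace) version of the decision this
suggestion implies for zram's write path; the helper name and the sample
lengths are made up, only max_zpage_size = PAGE_SIZE comes from the
suggestion above:

#include <stdio.h>

#define PAGE_SIZE 4096

static const size_t max_zpage_size = PAGE_SIZE;

/* returns the length that would actually be handed to zsmalloc */
static size_t stored_len(size_t compressed_len)
{
	/* compression gained nothing (or even grew the data):
	 * keep the page as-is and skip decompression on reads */
	if (compressed_len >= max_zpage_size)
		return PAGE_SIZE;

	return compressed_len;	/* store the compressed object */
}

int main(void)
{
	size_t samples[] = { 1200, 3100, 3900, 4090 };
	size_t i;

	for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
		printf("clen %4zu -> stored %4zu\n",
		       samples[i], stored_len(samples[i]));
	return 0;
}

Note how, say, a 3900-byte object would now be stored as-is by zsmalloc
rather than being padded out to PAGE_SIZE by zram.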

For the other huge size classes (e.g., PAGE_SIZE / 4 * 3 ~ PAGE_SIZE),
you sent patch 5/5 to reduce the wasted memory, so I think that is a good
trade-off between memory efficiency and decompression overhead.

Thanks.

> 
> Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
> ---
>  include/linux/zsmalloc.h |  2 ++
>  mm/zsmalloc.c            | 13 +++++++++++++
>  2 files changed, 15 insertions(+)
> 
> diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h
> index 34eb160..7184ee1 100644
> --- a/include/linux/zsmalloc.h
> +++ b/include/linux/zsmalloc.h
> @@ -55,4 +55,6 @@ unsigned long zs_get_total_pages(struct zs_pool *pool);
>  unsigned long zs_compact(struct zs_pool *pool);
>  
>  void zs_pool_stats(struct zs_pool *pool, struct zs_pool_stats *stats);
> +
> +bool zs_huge_object(size_t sz);
>  #endif
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index 0bb060f..06a7d87 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -188,6 +188,11 @@ static struct dentry *zs_stat_root;
>  static int zs_size_classes;
>  
>  /*
> + * All classes above this class_size are huge classes
> + */
> +static size_t huge_class_size_watermark;
> +
> +/*
>   * We assign a page to ZS_ALMOST_EMPTY fullness group when:
>   *	n <= N / f, where
>   * n = number of allocated objects
> @@ -1244,6 +1249,12 @@ unsigned long zs_get_total_pages(struct zs_pool *pool)
>  }
>  EXPORT_SYMBOL_GPL(zs_get_total_pages);
>  
> +bool zs_huge_object(size_t sz)
> +{
> +	return sz > huge_class_size_watermark;
> +}
> +EXPORT_SYMBOL_GPL(zs_huge_object);
> +
>  /**
>   * zs_map_object - get address of allocated object from handle.
>   * @pool: pool from which the object was allocated
> @@ -1922,6 +1933,8 @@ struct zs_pool *zs_create_pool(const char *name, gfp_t flags)
>  		pool->size_class[i] = class;
>  
>  		prev_class = class;
> +		if (!class->huge && !huge_class_size_watermark)
> +			huge_class_size_watermark = size - ZS_HANDLE_SIZE;
>  	}
>  
>  	pool->flags = flags;
> -- 
> 2.8.0.rc0
> 

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC][PATCH v3 1/5] mm/zsmalloc: introduce class auto-compaction
  2016-03-14  6:17     ` Minchan Kim
@ 2016-03-14  7:41       ` Sergey Senozhatsky
  -1 siblings, 0 replies; 50+ messages in thread
From: Sergey Senozhatsky @ 2016-03-14  7:41 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Sergey Senozhatsky, Andrew Morton, Joonsoo Kim, linux-mm,
	linux-kernel, Sergey Senozhatsky

Hello Minchan,

On (03/14/16 15:17), Minchan Kim wrote:
[..]
> > demonstrates that class-896 has 12/26=46% of unused pages, class-2336 has
> > 1648/4932=33% of unused pages, etc. And the more classes we will have as
> > 'normal' classes (more than one object per-zspage) the bigger this problem
> > will grow. The existing compaction relies on a user space (user can trigger
> > compaction via `compact' zram's sysfs attr) or a shrinker; it does not
> > happen automatically.
> > 
> > This patch introduces a 'watermark' value of unused pages and schedules a
> > compaction work on a per-class basis once class's fragmentation becomes
> > too big. So compaction is not performed in current I/O operation context,
> > but in workqueue workers later.
> > 
> > The current watermark is set to 40% -- if class has 40+% of `freeable'
> > pages then compaction work will be scheduled.
> 
> Could you explain why you select per-class watermark?

yes,

we do less work this way -- we scan and compact only one class, instead
of locking and compacting all of them, which sounds reasonable.


> Because my plan was we kick background work based on total fragmented memory
> (i.e., considering used_pages/allocated_pages < some threshold).

if we know that a particular class B is fragmented and the rest of them
are just fine, then we can compact only that class B, skipping the extra
work.

> IOW, if used_pages/allocated_pages is less than some ratio,
> we kick background job with marking index of size class just freed
> and then the job scans size_class from the index circulary.
>
> As well, we should put a upper bound to scan zspages to make it
> deterministic.

you mean that __zs_compact(), instead of just checking the per-class
zs_can_compact(), should check the global pool ratio and bail out once
compaction of class Z has dropped the overall fragmentation ratio
below some watermark?

my logic was that
 -- suppose we have class A with a fragmentation ratio of 49% and class B
 with 8% of wasted pages, so the overall pool fragmentation is
 (50 + 10)/ 2 < 30%, while we still have an almost 50% fragmented class.
 if the aim is to reduce memory wastage, then per-class watermarks
 seem to be more flexible.
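
For illustration only, a tiny userspace computation of that point; the two
classes and their numbers are hypothetical:

#include <stdio.h>

/* hypothetical classes: freeable ("wasted") pages vs pages used */
struct cls { const char *name; unsigned long wasted, used; };

int main(void)
{
	struct cls c[] = {
		{ "A", 49, 100 },	/* 49% of its pages are freeable */
		{ "B",  8, 100 },	/*  8% */
	};
	unsigned long wasted = 0, used = 0;
	unsigned int i;

	for (i = 0; i < 2; i++) {
		printf("class %s: %lu%% fragmented\n",
		       c[i].name, 100 * c[i].wasted / c[i].used);
		wasted += c[i].wasted;
		used += c[i].used;
	}
	/* the pool-wide average hides class A's 49% */
	printf("pool   : %lu%% fragmented\n", 100 * wasted / used);
	return 0;
}

With these numbers a pool-wide threshold of ~30% would not trigger, even
though class A is still ~49% fragmented.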

> What do you think about it?

	-ss

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC][PATCH v3 2/5] mm/zsmalloc: remove shrinker compaction callbacks
  2016-03-14  6:32     ` Minchan Kim
@ 2016-03-14  7:45       ` Sergey Senozhatsky
  -1 siblings, 0 replies; 50+ messages in thread
From: Sergey Senozhatsky @ 2016-03-14  7:45 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Sergey Senozhatsky, Andrew Morton, Joonsoo Kim, linux-mm,
	linux-kernel, Sergey Senozhatsky

On (03/14/16 15:32), Minchan Kim wrote:
> On Thu, Mar 03, 2016 at 11:46:00PM +0900, Sergey Senozhatsky wrote:
> > Do not register shrinker compaction callbacks anymore, since
> > now we shedule class compaction work each time its fragmentation
> > value goes above the watermark.
> 
> I suggested to remove shrinker compaction but while I review your
> first patch in this thread, I thought we need upper-bound to
> compact zspage so background work can bail out for latency easily.
> IOW, the work could give up the job. In such case, we might need
> fall-back scheme to continue the job. And I think that could be
> a shrinker.
> 
> What do you think?

wouldn't this unnecessarily complicate the whole thing? we would
have
 a) a compaction that can be triggered by used space
 b) a compaction from zs_free() that can bail out
 c) a compaction triggered by the shrinker.

all three can run simultaneously.


_if_ we can keep every class below its watermark, we can reduce the
need for "c)".

	-ss

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC][PATCH v3 3/5] mm/zsmalloc: introduce zs_huge_object()
  2016-03-14  6:53     ` Minchan Kim
@ 2016-03-14  8:08       ` Sergey Senozhatsky
  -1 siblings, 0 replies; 50+ messages in thread
From: Sergey Senozhatsky @ 2016-03-14  8:08 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Sergey Senozhatsky, Andrew Morton, Joonsoo Kim, linux-mm,
	linux-kernel, Sergey Senozhatsky

On (03/14/16 15:53), Minchan Kim wrote:
[..]
> On Thu, Mar 03, 2016 at 11:46:01PM +0900, Sergey Senozhatsky wrote:
> > zsmalloc knows the watermark after which classes are considered
> > to be ->huge -- every object stored consumes the entire zspage (which
> > consist of a single order-0 page). On x86_64, PAGE_SHIFT 12 box, the
> > first non-huge class size is 3264, so starting down from size 3264,
> > objects share page(-s) and thus minimize memory wastage.
> > 
> > zram, however, has its own statically defined watermark for `bad'
> > compression "3 * PAGE_SIZE / 4 = 3072", and stores every object
> > larger than this watermark (3072) as a PAGE_SIZE, object, IOW,
> > to a ->huge class, this results in increased memory consumption and
> > memory wastage. (With a small exception: 3264 bytes class. zs_malloc()
> > adds ZS_HANDLE_SIZE to the object's size, so some objects can pass
> > 3072 bytes and get_size_class_index(size) will return 3264 bytes size
> > class).
> > 
> > Introduce zs_huge_object() function which tells whether the supplied
> > object's size belongs to a huge class; so zram now can store objects
> > to ->huge clases only when those objects have sizes greater than
> > huge_class_size_watermark.
> 
> I understand the problem you pointed out but I don't like this way.
> 
> Huge class is internal thing in zsmalloc so zram shouldn't be coupled
> with it. Zram uses just zsmalloc to minimize meory wastage which is
> all zram should know about zsmalloc.

well, zram is already coupled with zsmalloc(), and it always has been;
that's the reality. there are zs_foo() calls, and not a single
zpool_foo() call. I'm not in love with zs_huge_object() either, but
it's much better than forcing zsmalloc to be less efficient based
on some pretty random expectations (no offense).

> Instead, how about changing max_zpage_size?
> 
>         static const size_t max_zpage_size = 4096;
> 
> So, if compression doesn't help memory efficiency, we don't
> need to have decompress overhead. Only that case, we store
> decompressed page.

hm, disabling this zram feature entirely... this can do the trick,
I think. zswap is quite happy not having any expectations about
"how effectively an unknown compression algorithm will compress
an unknown data set", and that's the "right" thing to do here;
we can't count on anything.


> For other huge size class(e.g., PAGE_SIZE / 4 * 3 ~ PAGE_SIZE),
> you sent a patch to reduce waste memory as 5/5 so I think it's
> a good justification between memory efficiency VS.
> decompress overhead.

so the plan is to raise max_zpage_size to PAGE_SIZE and to increase
the number of huge classes, so zsmalloc can be more helpful. sounds
good to me.


> Thanks.

Thanks.

	-ss

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC][PATCH v3 1/5] mm/zsmalloc: introduce class auto-compaction
  2016-03-14  7:41       ` Sergey Senozhatsky
@ 2016-03-14  8:20         ` Sergey Senozhatsky
  -1 siblings, 0 replies; 50+ messages in thread
From: Sergey Senozhatsky @ 2016-03-14  8:20 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Minchan Kim, Sergey Senozhatsky, Andrew Morton, Joonsoo Kim,
	linux-mm, linux-kernel

On (03/14/16 16:41), Sergey Senozhatsky wrote:
[..]
> you mean that __zs_compact() instead of just checking per-class
> zs_can_compact() should check global pool ratio and bail out if
> compaction of class Z has dropped the overall fragmentation ratio
> below some watermark?
> 
> my logic was that
>  -- suppose we have class A with fragmentation ratio 49% and class B
>  with 8% of wasted pages, so the overall pool fragmentation is
>  (50 + 10)/ 2 < 30%, while we still have almost 50% fragmented class.

  "(49 + 8) / 2 < 30%"

	-ss

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC][PATCH v3 1/5] mm/zsmalloc: introduce class auto-compaction
  2016-03-14  7:41       ` Sergey Senozhatsky
@ 2016-03-15  0:46         ` Minchan Kim
  -1 siblings, 0 replies; 50+ messages in thread
From: Minchan Kim @ 2016-03-15  0:46 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Sergey Senozhatsky, Andrew Morton, Joonsoo Kim, linux-mm, linux-kernel

On Mon, Mar 14, 2016 at 04:41:59PM +0900, Sergey Senozhatsky wrote:
> Hello Minchan,
> 
> On (03/14/16 15:17), Minchan Kim wrote:
> [..]
> > > demonstrates that class-896 has 12/26=46% of unused pages, class-2336 has
> > > 1648/4932=33% of unused pages, etc. And the more classes we will have as
> > > 'normal' classes (more than one object per-zspage) the bigger this problem
> > > will grow. The existing compaction relies on a user space (user can trigger
> > > compaction via `compact' zram's sysfs attr) or a shrinker; it does not
> > > happen automatically.
> > > 
> > > This patch introduces a 'watermark' value of unused pages and schedules a
> > > compaction work on a per-class basis once class's fragmentation becomes
> > > too big. So compaction is not performed in current I/O operation context,
> > > but in workqueue workers later.
> > > 
> > > The current watermark is set to 40% -- if class has 40+% of `freeable'
> > > pages then compaction work will be scheduled.
> > 
> > Could you explain why you select per-class watermark?
> 
> yes,
> 
> we do less work this way - scan and compact only one class, instead
> of locking and compacting all of them; which sounds reasonable.

Hmm, it consumes more memory (i.e., (sizeof(work_struct) + sizeof(void *)
+ sizeof(bool)) * NR_CLASS) as well as kicking off many works, up to
NR_CLASS of them. I didn't test your patch, but I guess I can construct
a worst-case scenario.

* make every class fragmented to just under 40%
* on the 40% boundary, repeatedly alloc/free in every class so that every
  free can schedule a work if one was not already scheduled
* even though a class's fragmentation is high, it's not a problem if the
  class consumes only a small amount of memory

I guess it could cause a degradation if I test it with a zsmalloc
microbenchmark.

As well, although I don't know the workqueue internals well, these days
I have seen a few mails related to workqueues (maybe vmstat) having some
trouble when system memory pressure is heavy, IIRC.

My approach is as follows, for example.

Let's make a global ratio; let's say it's 4M.
If zs_free() (or something) notices that the current fragmentation is
over 4M, it kicks a background compaction job.
The job scans from the highest class to the lower ones and compacts
zspages in each size_class until it meets the high watermark (e.g.,
4M + 4M/2 = 6M of fragmentation).
And in the middle of background compaction, if we find we have scanned
too much (e.g., 256 zspages or something), we just bail out of the job
for latency reasons and reschedule it for next time. At the next run,
we can continue from the last size class.
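
For illustration only, a toy userspace model of one possible reading of
this scheme -- a budget-limited scan that resumes from the class where the
previous run stopped. Every name, constant and the "compaction" itself are
stand-ins, not zsmalloc code:

#include <stdio.h>

#define NR_CLASS	255
#define SCAN_BUDGET	256	/* "zspages" looked at per job run */

/* toy model: freeable (wasted) pages per size class */
static unsigned long can_compact[NR_CLASS];
static int resume_idx;		/* where the previous run stopped */

static unsigned long total_wasted(void)
{
	unsigned long sum = 0;
	int i;

	for (i = 0; i < NR_CLASS; i++)
		sum += can_compact[i];
	return sum;
}

/*
 * One background-job run: walk the classes circularly starting from
 * resume_idx, "compact" them, and stop once the scan budget is spent
 * or the pool-wide waste has dropped below the target.
 */
static void compaction_job(unsigned long target)
{
	unsigned long scanned = 0;
	int i;

	for (i = 0; i < NR_CLASS && scanned < SCAN_BUDGET &&
		    total_wasted() > target; i++) {
		int idx = (resume_idx + i) % NR_CLASS;

		scanned += can_compact[idx];	/* count the zspages we touch */
		can_compact[idx] = 0;		/* pretend we compacted them all */
	}
	resume_idx = (resume_idx + i) % NR_CLASS;
}

int main(void)
{
	int i, run;

	/* a heavily fragmented pool: 5 freeable pages in every class */
	for (i = 0; i < NR_CLASS; i++)
		can_compact[i] = 5;

	for (run = 1; total_wasted() > 100; run++) {
		compaction_job(100);
		printf("run %d: %lu wasted pages left, next start class %d\n",
		       run, total_wasted(), resume_idx);
	}
	return 0;
}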

I know your concern is the unnecessary scanning, but I'm not sure it can
affect performance, even if we try to evaluate it with a microbenchmark.
It just loops and checks zs_can_compact() for 255 size classes.

If you still don't like this approach, we can implement both
solutions and test/compare. ;-)

> 
> 
> > Because my plan was we kick background work based on total fragmented memory
> > (i.e., considering used_pages/allocated_pages < some threshold).
> 
> if we know that a particular class B is fragmented and the rest of them
> are just fine, then we can compact only that class B, skipping extra job.
> 
> > IOW, if used_pages/allocated_pages is less than some ratio,
> > we kick background job with marking index of size class just freed
> > and then the job scans size_class from the index circulary.
> >
> > As well, we should put a upper bound to scan zspages to make it
> > deterministic.
> 
> you mean that __zs_compact() instead of just checking per-class
> zs_can_compact() should check global pool ratio and bail out if
> compaction of class Z has dropped the overall fragmentation ratio
> below some watermark?

My comment above should answer this question.

> 
> my logic was that
>  -- suppose we have class A with fragmentation ratio 49% and class B
>  with 8% of wasted pages, so the overall pool fragmentation is
>  (50 + 10)/ 2 < 30%, while we still have almost 50% fragmented class.
>  if the aim is to reduce the memory wastage then per-class watermarks
>  seem to be more flexible.
> 
> > What do you think about it?
> 
> 	-ss

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC][PATCH v3 2/5] mm/zsmalloc: remove shrinker compaction callbacks
  2016-03-14  7:45       ` Sergey Senozhatsky
@ 2016-03-15  0:52         ` Minchan Kim
  -1 siblings, 0 replies; 50+ messages in thread
From: Minchan Kim @ 2016-03-15  0:52 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Sergey Senozhatsky, Andrew Morton, Joonsoo Kim, linux-mm, linux-kernel

On Mon, Mar 14, 2016 at 04:45:23PM +0900, Sergey Senozhatsky wrote:
> On (03/14/16 15:32), Minchan Kim wrote:
> > On Thu, Mar 03, 2016 at 11:46:00PM +0900, Sergey Senozhatsky wrote:
> > > Do not register shrinker compaction callbacks anymore, since
> > > now we shedule class compaction work each time its fragmentation
> > > value goes above the watermark.
> > 
> > I suggested to remove shrinker compaction but while I review your
> > first patch in this thread, I thought we need upper-bound to
> > compact zspage so background work can bail out for latency easily.
> > IOW, the work could give up the job. In such case, we might need
> > fall-back scheme to continue the job. And I think that could be
> > a shrinker.
> > 
> > What do you think?
> 
> wouldn't this unnecessarily complicate the whole thing? we would
> have
>  a) a compaction that can be triggered by used space

Maybe, user space? :)

>  b) a compaction from zs_free() that can bail out
>  c) a compaction triggered by the shrinker.
> 
> all 3 three can run simultaneously.

Yeb.

> 
> 
> _if_ we can keep every class below its watermark, we can reduce the
> need of "c)".

But the problem is timing. We cannot guarantee when background
compaction triggers, while the shrinker interoperates with the VM,
so there we should do the job instantly for the system.
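
IOW, keep a synchronous path, roughly what the current shrinker
callback already does (very rough sketch, return value handling
glossed over):

static unsigned long zs_shrinker_scan(struct shrinker *shrinker,
				      struct shrink_control *sc)
{
	struct zs_pool *pool = container_of(shrinker, struct zs_pool,
					    shrinker);

	/* direct compaction: do the work right here, in the reclaim
	 * context, instead of waiting for a background worker */
	return zs_compact(pool);
}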


> 
> 	-ss

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC][PATCH v3 3/5] mm/zsmalloc: introduce zs_huge_object()
  2016-03-14  8:08       ` Sergey Senozhatsky
@ 2016-03-15  0:54         ` Minchan Kim
  -1 siblings, 0 replies; 50+ messages in thread
From: Minchan Kim @ 2016-03-15  0:54 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Sergey Senozhatsky, Andrew Morton, Joonsoo Kim, linux-mm, linux-kernel

On Mon, Mar 14, 2016 at 05:08:43PM +0900, Sergey Senozhatsky wrote:
> On (03/14/16 15:53), Minchan Kim wrote:
> [..]
> > On Thu, Mar 03, 2016 at 11:46:01PM +0900, Sergey Senozhatsky wrote:
> > > zsmalloc knows the watermark after which classes are considered
> > > to be ->huge -- every object stored consumes the entire zspage (which
> > > consist of a single order-0 page). On x86_64, PAGE_SHIFT 12 box, the
> > > first non-huge class size is 3264, so starting down from size 3264,
> > > objects share page(-s) and thus minimize memory wastage.
> > > 
> > > zram, however, has its own statically defined watermark for `bad'
> > > compression "3 * PAGE_SIZE / 4 = 3072", and stores every object
> > > larger than this watermark (3072) as a PAGE_SIZE, object, IOW,
> > > to a ->huge class, this results in increased memory consumption and
> > > memory wastage. (With a small exception: 3264 bytes class. zs_malloc()
> > > adds ZS_HANDLE_SIZE to the object's size, so some objects can pass
> > > 3072 bytes and get_size_class_index(size) will return 3264 bytes size
> > > class).
> > > 
> > > Introduce zs_huge_object() function which tells whether the supplied
> > > object's size belongs to a huge class; so zram now can store objects
> > > to ->huge clases only when those objects have sizes greater than
> > > huge_class_size_watermark.
> > 
> > I understand the problem you pointed out but I don't like this way.
> > 
> > Huge class is internal thing in zsmalloc so zram shouldn't be coupled
> > with it. Zram uses just zsmalloc to minimize meory wastage which is
> > all zram should know about zsmalloc.
> 
> well, zram already coupled with zsmalloc() and it has always been,
> that's the reality. there are zs_foo() calls, and not a single one
> zpool_foo() call. I'm not in love with zs_huge_object() either, but
> that's much better than forcing zsmalloc to be less efficient based
> on some pretty random expectations (no offense).
> 
> > Instead, how about changing max_zpage_size?
> > 
> >         static const size_t max_zpage_size = 4096;
> > 
> > So, if compression doesn't help memory efficiency, we don't
> > need to have decompress overhead. Only that case, we store
> > decompressed page.
> 
> hm, disabling this zram future entirely... this can do the trick,
> I think. zswap is quite happy not having any expectations on
> "how effectively an unknown compression algorithm will compress
> an unknown data set", and that's the "right" thing to do here,
> we can't count on anything.
> 
> 
> > For other huge size class(e.g., PAGE_SIZE / 4 * 3 ~ PAGE_SIZE),
> > you sent a patch to reduce waste memory as 5/5 so I think it's
> > a good justification between memory efficiency VS.
> > decompress overhead.
> 
> so the plan is to raise max_zpage_size to PAGE_SIZE and to increase
> the number of huge classes, so zsmalloc can be more helpful. sounds
> good to me.

nod.

Thanks!

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC][PATCH v3 2/5] mm/zsmalloc: remove shrinker compaction callbacks
  2016-03-15  0:52         ` Minchan Kim
@ 2016-03-15  1:05           ` Sergey Senozhatsky
  -1 siblings, 0 replies; 50+ messages in thread
From: Sergey Senozhatsky @ 2016-03-15  1:05 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Sergey Senozhatsky, Sergey Senozhatsky, Andrew Morton,
	Joonsoo Kim, linux-mm, linux-kernel

On (03/15/16 09:52), Minchan Kim wrote:
[..]
> > > I suggested to remove shrinker compaction but while I review your
> > > first patch in this thread, I thought we need upper-bound to
> > > compact zspage so background work can bail out for latency easily.
> > > IOW, the work could give up the job. In such case, we might need
> > > fall-back scheme to continue the job. And I think that could be
> > > a shrinker.
> > > 
> > > What do you think?
> > 
> > wouldn't this unnecessarily complicate the whole thing? we would
> > have
> >  a) a compaction that can be triggered by used space
> 
> Maybe, user space? :)

haha, yes!  sorry, I make quite a lot of typos.

> >  b) a compaction from zs_free() that can bail out
> >  c) a compaction triggered by the shrinker.
> > 
> > all 3 three can run simultaneously.
> 
> Yeb.
> 
> > 
> > 
> > _if_ we can keep every class below its watermark, we can reduce the
> > need of "c)".
> 
> But the problem is timing. We cannot guarantee when background
> compaction triggers while shrinker is interop with VM so we should
> do the job instantly for the system.

we can have a per-pool compaction kthread that we wake_up()
every time we need compaction, with no dependency on the
workqueue or the shrinker.
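
something along these lines (just a sketch; compact_wait,
compact_requested and compact_tsk are made-up pool members):

static int zs_compact_thread(void *data)
{
	struct zs_pool *pool = data;

	while (!kthread_should_stop()) {
		wait_event_interruptible(pool->compact_wait,
					 pool->compact_requested ||
					 kthread_should_stop());
		pool->compact_requested = false;
		zs_compact(pool);
	}
	return 0;
}

/* at pool creation time */
	init_waitqueue_head(&pool->compact_wait);
	pool->compact_tsk = kthread_run(zs_compact_thread, pool,
					"zs_compact/%s", pool->name);

/* whenever a class looks too fragmented, e.g. in zs_free() */
	pool->compact_requested = true;
	wake_up(&pool->compact_wait);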

	-ss

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC][PATCH v3 1/5] mm/zsmalloc: introduce class auto-compaction
  2016-03-15  0:46         ` Minchan Kim
@ 2016-03-15  1:33           ` Sergey Senozhatsky
  -1 siblings, 0 replies; 50+ messages in thread
From: Sergey Senozhatsky @ 2016-03-15  1:33 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Sergey Senozhatsky, Sergey Senozhatsky, Andrew Morton,
	Joonsoo Kim, linux-mm, linux-kernel

On (03/15/16 09:46), Minchan Kim wrote:
[..]
> > yes,
> > 
> > we do less work this way - scan and compact only one class, instead
> > of locking and compacting all of them; which sounds reasonable.
> 
> Hmm,, It consumes more memory(i.e., sizeof(work_struct) + sizeof(void *)
> + sizeof(bool) * NR_CLASS) as well as kicking many work up to NR_CLASS.

yes, it does. I'm not really happy with it either.

> I didn't test your patch but I guess I can make worst case scenario.
> 
> * make every class fragmented under 40%
> * On the 40% boundary, repeated alloc/free of every class so every free
>   can schedule work if it was not scheduled.
> * Although class fragment is too high, it's not a problem if the class
>   consumes small amount of memory.

hm, in this scenario both solutions are less than perfect. if we
jump X times over the 40% watermark, we end up with X*NR_CLASS
compaction scans. the difference is that we queue fewer works, yes,
but then we don't have to use a workqueue in the first place;
compaction can be done asynchronously by the pool's dedicated
kthread, so we will just wake_up() the process.

> I guess it can make degradation if I try to test on zsmalloc
> microbenchmark.
> 
> As well, although I don't know workqueue internal well, thesedays,
> I saw a few of mails related to workqueue(maybe, vmstat) and it had
> some trouble if system memory pressure is heavy IIRC.

yes, you are right. the wq API provides the WQ_MEM_RECLAIM bit for
this case -- a rescuer kthread that will be woken up to process
the queued works.
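
e.g. something like (sketch; the wq and work fields are made-up
names):

/* a dedicated workqueue with a rescuer thread, so queued compaction
 * works can still make forward progress under memory pressure */
pool->compact_wq = alloc_workqueue("zs_compact", WQ_MEM_RECLAIM, 0);
if (!pool->compact_wq)
	goto err;

/* later, instead of schedule_work() */
queue_work(pool->compact_wq, &class->compact_work);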

> My approach is as follows, for exmaple.
>
> Let's make a global ratio. Let's say it's 4M.

ok. should it depend on pool size?  min(20% of pool_size, XXMB)?

> If zs_free(or something) realizes current fragment is over 4M,
> kick compacion backgroud job.

yes, zs_free() is the only place that introduces fragmentation.

> The job scans from highest to lower class and compact zspages
> in each size_class until it meets high watermark(e.g, 4M + 4M /2 =
> 6M fragment ratio).

ok.

> And in the middle of background compaction, if we find it's too
> many scan(e.g., 256 zspages or somethings), just bail out the
> job for the latency and reschedule it for next time. At the next
> time, we can continue from the last size class.

ok. I'd probably prefer simpler rules here:
-- bail out because it has compacted XXMB,
   so the fragmentation ratio is *expected* to be below the watermark
-- bail out because there is nothing to scan anymore
   compaction is executed concurrently with zs_free()/zs_malloc()
   calls, so it's harder to control/guarantee some global state.

overall, no real objections. this approach can work, I think. need
to test it.

> I know your concern is unncessary scan but I'm not sure it can
> affect performance although we try to evaluate performance with
> microbenchmark. It just loops and check with zs_can_compact
> for 255 size class.

	-ss

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC][PATCH v3 2/5] mm/zsmalloc: remove shrinker compaction callbacks
  2016-03-15  1:05           ` Sergey Senozhatsky
@ 2016-03-15  2:19             ` Minchan Kim
  -1 siblings, 0 replies; 50+ messages in thread
From: Minchan Kim @ 2016-03-15  2:19 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Sergey Senozhatsky, Andrew Morton, Joonsoo Kim, linux-mm, linux-kernel

On Tue, Mar 15, 2016 at 10:05:42AM +0900, Sergey Senozhatsky wrote:
> On (03/15/16 09:52), Minchan Kim wrote:
> [..]
> > > > I suggested to remove shrinker compaction but while I review your
> > > > first patch in this thread, I thought we need upper-bound to
> > > > compact zspage so background work can bail out for latency easily.
> > > > IOW, the work could give up the job. In such case, we might need
> > > > fall-back scheme to continue the job. And I think that could be
> > > > a shrinker.
> > > > 
> > > > What do you think?
> > > 
> > > wouldn't this unnecessarily complicate the whole thing? we would
> > > have
> > >  a) a compaction that can be triggered by used space
> > 
> > Maybe, user space? :)
> 
> haha, yes!  sorry, I do quite a lot of typos.
> 
> > >  b) a compaction from zs_free() that can bail out
> > >  c) a compaction triggered by the shrinker.
> > > 
> > > all 3 three can run simultaneously.
> > 
> > Yeb.
> > 
> > > 
> > > 
> > > _if_ we can keep every class below its watermark, we can reduce the
> > > need of "c)".
> > 
> > But the problem is timing. We cannot guarantee when background
> > compaction triggers while shrinker is interop with VM so we should
> > do the job instantly for the system.
> 
> we can have pool's compaction-kthread that we will wake_up()
> every time we need a compaction, with no dependency on workqueue
> or shrinker.

Hmm, I don't think that can work either, because wake_up() doesn't
guarantee instant execution of the thread.
I think it would be better to have both direct and background
compaction.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC][PATCH v3 1/5] mm/zsmalloc: introduce class auto-compaction
  2016-03-15  1:33           ` Sergey Senozhatsky
@ 2016-03-15  6:17             ` Minchan Kim
  -1 siblings, 0 replies; 50+ messages in thread
From: Minchan Kim @ 2016-03-15  6:17 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Sergey Senozhatsky, Andrew Morton, Joonsoo Kim, linux-mm, linux-kernel

On Tue, Mar 15, 2016 at 10:33:03AM +0900, Sergey Senozhatsky wrote:
> On (03/15/16 09:46), Minchan Kim wrote:
> [..]
> > > yes,
> > > 
> > > we do less work this way - scan and compact only one class, instead
> > > of locking and compacting all of them; which sounds reasonable.
> > 
> > Hmm,, It consumes more memory(i.e., sizeof(work_struct) + sizeof(void *)
> > + sizeof(bool) * NR_CLASS) as well as kicking many work up to NR_CLASS.
> 
> yes, it does. not really happy with it either.
> 
> > I didn't test your patch but I guess I can make worst case scenario.
> > 
> > * make every class fragmented under 40%
> > * On the 40% boundary, repeated alloc/free of every class so every free
> >   can schedule work if it was not scheduled.
> > * Although class fragment is too high, it's not a problem if the class
> >   consumes small amount of memory.
> 
> hm, in this scenario both solutions are less than perfect. we jump
> X times over 40% margin, we have X*NR_CLASS compaction scans in the
> end. the difference is that we queue less works, yes, but we don't
> have to use workqueue in the first place; compaction can be done
> asynchronously by a pool's dedicated kthread. so we will just
> wake_up() the process.

Hmm, a kthread looks over-engineered to me. If we want to create a new
kthread in the system, I guess we will have to persuade many people to
get it merged. Surely, we should explain why it couldn't be done with
the existing machinery (e.g., a workqueue).

I think your workqueue approach is fine with me.
The only problem I can see with it is that we cannot start compaction
instantly when we want to, so my conclusion is that we need both
direct and background compaction.

For the shrinker and the user-space trigger knob, we could compact
directly in that context, while zs_free() could queue a background
compaction job.

> 
> > I guess it can make degradation if I try to test on zsmalloc
> > microbenchmark.
> > 
> > As well, although I don't know workqueue internal well, thesedays,
> > I saw a few of mails related to workqueue(maybe, vmstat) and it had
> > some trouble if system memory pressure is heavy IIRC.
> 
> yes, you are right. wq provides WQ_MEM_RECLAIM bit for this
> case -- a special kthread that it will wake up to process works.
> 
> > My approach is as follows, for exmaple.
> >
> > Let's make a global ratio. Let's say it's 4M.
> 
> ok. should it depend on pool size?  min(20% of pool_size, XXMB)?

Maybe that could be a knob, but we need to think more about what the
default should be. At this moment, the clear thing is that we should
prevent frequent ping-pong background compaction when repeated
alloc/free keeps dancing on the threshold boundary.

> 
> > If zs_free(or something) realizes current fragment is over 4M,
> > kick compacion backgroud job.
> 
> yes, zs_free() is the only place that introduces fragmentation.
> 
> > The job scans from highest to lower class and compact zspages
> > in each size_class until it meets high watermark(e.g, 4M + 4M /2 =
> > 6M fragment ratio).
> 
> ok.
> 
> > And in the middle of background compaction, if we find it's too
> > many scan(e.g., 256 zspages or somethings), just bail out the
> > job for the latency and reschedule it for next time. At the next
> > time, we can continue from the last size class.
> 
> ok. I'd probably prefer more simple rules here:
> -- bail out because it has compacted XXMB
>    so the fragmentation ratio is *expected* to be below the watermark

We need a high watermark to stop compaction.
It will prevent background compaction from triggering too frequently.
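
IOW, two thresholds with some gap between them; one possible
arrangement (made-up names and numbers, just to illustrate the
hysteresis):

/* kick the job only once fragmentation grows past the high
 * threshold, and let the job run until fragmentation is back
 * under the low one, so alloc/free dancing on a single boundary
 * cannot re-trigger the work over and over */
#define ZS_COMPACT_WM_HIGH_MB	6	/* made up */
#define ZS_COMPACT_WM_LOW_MB	4	/* made up */

	/* in zs_free() */
	if (pool_wasted_mb(pool) > ZS_COMPACT_WM_HIGH_MB)
		schedule_work(&pool->compact_work);

	/* in the work (plus some scan-budget bail-out, omitted) */
	while (pool_wasted_mb(pool) > ZS_COMPACT_WM_LOW_MB)
		compact_next_class(pool);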


> -- nothing to scan anymore
>    compaction is executed concurrently with zs_free()/zs_malloc()
>    calls, it's harder to control/guarantee some global state.
> 
> overall, no real objections. this approach can work, I think. need
> to test it.

Thanks, Sergey!

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC][PATCH v3 1/5] mm/zsmalloc: introduce class auto-compaction
  2016-03-15  6:17             ` Minchan Kim
@ 2016-03-17  1:29               ` Sergey Senozhatsky
  -1 siblings, 0 replies; 50+ messages in thread
From: Sergey Senozhatsky @ 2016-03-17  1:29 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Sergey Senozhatsky, Sergey Senozhatsky, Andrew Morton,
	Joonsoo Kim, linux-mm, linux-kernel

Hello Minchan,

On (03/15/16 15:17), Minchan Kim wrote:
[..]
> > hm, in this scenario both solutions are less than perfect. we jump
> > X times over 40% margin, we have X*NR_CLASS compaction scans in the
> > end. the difference is that we queue less works, yes, but we don't
> > have to use workqueue in the first place; compaction can be done
> > asynchronously by a pool's dedicated kthread. so we will just
> > wake_up() the process.
> 
> Hmm, kthread is over-engineered to me. If we want to create new kthread
> in the system, I guess we should persuade many people to merge in.
> Surely, we should have why it couldn't be done by others(e.g., workqueue).
> 
> I think your workqueue approach is good to me.
> Only problem I can see with it is we cannot start compaction when
> we want instantly so my conclusion is we need both direct and
> background compaction.

well, if we keep the shrinker callbacks then it's not such a huge
issue, IMHO. for that type of forward-progress guarantee we can have
our own dedicated workqueue with a rescuer thread (WQ_MEM_RECLAIM).

> > > If zs_free(or something) realizes current fragment is over 4M,
> > > kick compacion backgroud job.
> > 
> > yes, zs_free() is the only place that introduces fragmentation.
> > 
> > > The job scans from highest to lower class and compact zspages
> > > in each size_class until it meets high watermark(e.g, 4M + 4M /2 =
> > > 6M fragment ratio).

just a thought... I think it'll be tricky to implement this. We scan
classes from HIGH class_size to SMALL class_size, counting each class's
fragmentation value and re-calculating the global fragmentation all the
time; once the global fragmentation passes the watermark, we start
compacting from HIGH to SMALL. the problem here is that as soon as we
have calculated class B's fragmentation index and moved on to class A,
we can't trust B anymore. classes are not locked and absolutely free to
change. so the global fragmentation index will likely be inaccurate.

so I'm thinking about triggering a global compaction from zs_free() (to
queue fewer works), but instead of calculating a global watermark and
compacting afterwards, just compact every class that has fragmentation
over XY% (for example 30%). "iterate from HI to LO and compact
everything that is too fragmented".

we still need some sort of a pool->compact_ts timestamp to prevent too
frequent compaction jobs.
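
roughly (sketch; class_frag_ratio(), compact_work, compact_ts and
the interval are made up, the rest are existing zsmalloc internals):

#define ZS_COMPACT_INTERVAL	(10 * HZ)	/* made up */

static void zs_global_compact_work(struct work_struct *work)
{
	struct zs_pool *pool = container_of(work, struct zs_pool,
					    compact_work);
	int i;

	/* throttle: don't re-compact the pool too often */
	if (time_before(jiffies, pool->compact_ts + ZS_COMPACT_INTERVAL))
		return;
	pool->compact_ts = jiffies;

	/* from HI to LO, compact every class that is too fragmented */
	for (i = zs_size_classes - 1; i >= 0; i--) {
		struct size_class *class = pool->size_class[i];

		if (class->index != i)
			continue;	/* skip merged duplicates */

		if (class_frag_ratio(class) > 30)
			__zs_compact(pool, class);
	}
}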

	-ss

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC][PATCH v3 1/5] mm/zsmalloc: introduce class auto-compaction
  2016-03-17  1:29               ` Sergey Senozhatsky
@ 2016-03-18  1:17                 ` Minchan Kim
  -1 siblings, 0 replies; 50+ messages in thread
From: Minchan Kim @ 2016-03-18  1:17 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Sergey Senozhatsky, Andrew Morton, Joonsoo Kim, linux-mm, linux-kernel

Hi Sergey,

On Thu, Mar 17, 2016 at 10:29:30AM +0900, Sergey Senozhatsky wrote:
> Hello Minchan,
> 
> On (03/15/16 15:17), Minchan Kim wrote:
> [..]
> > > hm, in this scenario both solutions are less than perfect. we jump
> > > X times over 40% margin, we have X*NR_CLASS compaction scans in the
> > > end. the difference is that we queue less works, yes, but we don't
> > > have to use workqueue in the first place; compaction can be done
> > > asynchronously by a pool's dedicated kthread. so we will just
> > > wake_up() the process.
> > 
> > Hmm, kthread is over-engineered to me. If we want to create new kthread
> > in the system, I guess we should persuade many people to merge in.
> > Surely, we should have why it couldn't be done by others(e.g., workqueue).
> > 
> > I think your workqueue approach is good to me.
> > Only problem I can see with it is we cannot start compaction when
> > we want instantly so my conclusion is we need both direct and
> > background compaction.
> 
> well, if we will keep the shrinker callbacks then it's not such a huge
> issue, IMHO. for that type of forward progress guarantees we can have
> our own, dedicated, workqueue with a rescuer thread (WQ_MEM_RECLAIM).

What I meant by direct compaction is the shrinker, while background
compaction is the workqueue.
So do you mean that you agree to keep the shrinker?
And do you want to use a workqueue with WQ_MEM_RECLAIM rather than
a new kthread?

> 
> > > > If zs_free(or something) realizes current fragment is over 4M,
> > > > kick compacion backgroud job.
> > > 
> > > yes, zs_free() is the only place that introduces fragmentation.
> > > 
> > > > The job scans from highest to lower class and compact zspages
> > > > in each size_class until it meets high watermark(e.g, 4M + 4M /2 =
> > > > 6M fragment ratio).
> 
> just thought... I think it'll be tricky to implement this. We scan classes
> from HIGH class_size to SMALL class_size, counting fragmentation value and
> re-calculating the global fragmentation all the time; once the global
> fragmentation passes the watermark, we start compacting from HIGH to
> SMALL. the problem here is that as soon as we calculated the class B
> fragmentation index and moved to class A we can't trust B anymore. classes
> are not locked and absolutely free to change. so the global fragmentation
> index likely will be inaccurate.
> 

Actually, I don't think such inaccuracy will cause big trouble here.
But how about this simple idea?

If zs_free() finds that the wasted space is bigger than a
user-defined threshold (e.g., 10M), zs_free() can queue work for
background compaction (that background compaction workqueue should
probably be WQ_MEM_RECLAIM | WQ_CPU_INTENSIVE?). Once that work is
executed, it compacts all size_classes unconditionally.

With it, we get less background compaction and a simpler algorithm,
and thanks to WQ_CPU_INTENSIVE the work does not get in the way of
other works.
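
Something like this (untested sketch; zs_compact_wq, compact_work,
pool_wasted_bytes() and compact_threshold are made-up names):

static struct workqueue_struct *zs_compact_wq;

/* at init time */
	zs_compact_wq = alloc_workqueue("zs_compact",
					WQ_MEM_RECLAIM | WQ_CPU_INTENSIVE,
					1);

static void zs_pool_compact_work(struct work_struct *work)
{
	struct zs_pool *pool = container_of(work, struct zs_pool,
					    compact_work);

	/* compact every size_class unconditionally */
	zs_compact(pool);
}

/* in zs_free(), after the object has been freed */
	if (pool_wasted_bytes(pool) > pool->compact_threshold) /* e.g. 10M */
		queue_work(zs_compact_wq, &pool->compact_work);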

> so I'm thinking about triggering a global compaction from zs_free() (to
> queue less works), but instead of calculating global watermark and compacting
> afterwards, just compact every class that has fragmentation over XY% (for
> example 30%). "iterate from HI to LO and compact everything that is too
> fragmented".

The problem with that approach is that we may end up compacting only a
small size class whose fragmentation ratio is higher, while skipping a
bigger size class whose ratio is lower but whose compaction benefit is
larger. With that, we would keep needing background work until the pool
meets the user-defined global threshold.

> 
> we still need some sort of a pool->compact_ts timestamp to prevent too
> frequent compaction jobs.

Yes, we need some sort of throttling mechanism. Need time to think more. :)

Thanks.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC][PATCH v3 1/5] mm/zsmalloc: introduce class auto-compaction
  2016-03-18  2:00                   ` Sergey Senozhatsky
@ 2016-03-18  4:03                     ` Minchan Kim
  -1 siblings, 0 replies; 50+ messages in thread
From: Minchan Kim @ 2016-03-18  4:03 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Sergey Senozhatsky, Andrew Morton, Joonsoo Kim, linux-mm, linux-kernel

On Fri, Mar 18, 2016 at 11:00:29AM +0900, Sergey Senozhatsky wrote:
> Hi,
> 
> On (03/18/16 10:17), Minchan Kim wrote:
> > > > > hm, in this scenario both solutions are less than perfect. we jump
> > > > > X times over 40% margin, we have X*NR_CLASS compaction scans in the
> > > > > end. the difference is that we queue fewer works, yes, but we don't
> > > > > have to use workqueue in the first place; compaction can be done
> > > > > asynchronously by a pool's dedicated kthread. so we will just
> > > > > wake_up() the process.
> > > > 
> > > > Hmm, a kthread looks over-engineered to me. If we want to create a new
> > > > kthread in the system, I guess we would have to persuade many people to
> > > > merge it. Surely, we should explain why it couldn't be done by other
> > > > means (e.g., a workqueue).
> > > > 
> > > > I think your workqueue approach looks good to me.
> > > > The only problem I can see with it is that we cannot start compaction
> > > > instantly when we want to, so my conclusion is that we need both direct
> > > > and background compaction.
> > > 
> > > well, if we keep the shrinker callbacks then it's not such a huge
> > > issue, IMHO. for that type of forward progress guarantee we can have
> > > our own, dedicated workqueue with a rescuer thread (WQ_MEM_RECLAIM).
> > 
> > What I meant by direct compaction is the shrinker, while background
> > compaction is the workqueue.
> > So do you mean that you agree to keep the shrinker?
> 
> hm, probably yes, hard to say. we don't yet have a solution for background
> compaction.

Even if we introduce proper background compaction in the future, we still
need a direct compaction solution, too.

> 
> > And do you want to use workqueue with WQ_MEM_RECLAIM rather than
> > new kthread?
> 
> I have some concerns here. WQ_MEM_RECLAIM implies that there is a kthread
> attached to wq, a rescuer thread, which will be idle until wq declares mayday.
> But the kthread will be allocated anyway. And we can queue only one global
> compaction work at a time; so wq does not buy us a lot here and a simple
> wake_up_process() looks much better. it makes sense to use a wq if we can have
> N compaction jobs queued, like I did in my initial patch, but otherwise
> it's sort of overkill, isn't it?

So do you mean you want a kthread for zsmalloc?
Does that mean that if we create several instances of zram, we get
zsmallocd-1, zsmallocd-2 and so on? And even with our own kthread, we
still have to create it, and it will sit idle yet stay allocated anyway.
It's the same.

Frankly speaking, I don't understand why we should use WQ_MEM_RECLAIM.
We don't need to guarantee that the work will be executed, IMO, because
we have direct compaction as a fallback.

If we can use a normal wq rather than WQ_MEM_RECLAIM, the wq doesn't need
its own kthread attached to the work. Right? If so, we can do away with
that resource reservation problem.
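
A sketch of the difference under discussion, with illustrative names: a
plain workqueue (no WQ_MEM_RECLAIM) creates no rescuer kthread, and
queue_work() already guarantees that the same work item is never queued
twice.

static struct workqueue_struct *zs_compact_wq;

static int zs_compact_wq_init(void)
{
	/* no WQ_MEM_RECLAIM, so no rescuer kthread is created */
	zs_compact_wq = alloc_workqueue("zs_compact", WQ_UNBOUND, 0);
	return zs_compact_wq ? 0 : -ENOMEM;
}

static void zs_kick_compaction(struct zs_pool *pool)
{
	/* returns false, without re-queueing, if the work is still pending */
	queue_work(zs_compact_wq, &pool->compact_work);
}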

> 
> > > just thought... I think it'll be tricky to implement this. We scan classes
> > > from HIGH class_size to SMALL class_size, counting fragmentation value and
> > > re-calculating the global fragmentation all the time; once the global
> > > fragmentation passes the watermark, we start compacting from HIGH to
> > > SMALL. the problem here is that as soon as we calculated the class B
> > > fragmentation index and moved to class A we can't trust B anymore. classes
> > > are not locked and absolutely free to change. so the global fragmentation
> > > index likely will be inaccurate.
> > > 
> > 
> > Actually, I don't think such inaccuracy will cause big trouble here.
> > But how about this simple idea?
> > 
> > If zs_free finds that the wasted space is bigger than a user-defined
> > threshold (e.g., 10M), it can queue work for background compaction
> > (that background compaction work should be WQ_MEM_RECLAIM |
> > WQ_CPU_INTENSIVE?). Once that work is executed, the work compacts
> > all size_classes unconditionally.
> 
> ok. global pool stats will give us a fragmentation index, so we can
> start compaction when the entire pool passes the watermark, not an
> individual class.
> 
> > With it, we get less background compaction and a simpler algorithm,
> 
> so you want to have
> 
> 	zs_free()
> 		check pool watermark
> 			queue class compaction

No, don't queue class compaction.

> 			queue pool compaction

Yes. queue pool compaction.

> 
> ?
> 
> I think a simpler one would be to just queue a global compaction: if the
> pool is fragmented -- compact everything, like we do in the shrinker
> callback.

That's what I said. :)

> 
> > and, thanks to WQ_CPU_INTENSIVE, no harm to other works.
> > 
> > > so I'm thinking about triggering a global compaction from zs_free() (to
> > > queue fewer works), but instead of calculating the global watermark and compacting
> > > afterwards, just compact every class that has fragmentation over XY% (for
> > > example 30%). "iterate from HI to LO and compact everything that is too
> > > fragmented".
> > 
> > The problem with that approach is that we may end up compacting only a
> > small size class whose fragmentation ratio is higher than that of a
> > bigger size class, even though its compaction benefit is smaller than
> > that of the bigger, less fragmented class.
> > With that, we would continue to need the background work until it meets
> > the user-defined global threshold.
> 
> good point.
> 
> > > 
> > > we still need some sort of a pool->compact_ts timestamp to prevent too
> > > frequent compaction jobs.
> > 
> > Yes, we need some kind of throttling mechanism. Need time to think more. :)
> 
> yes, need to think more :)
> 
> 	-ss

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC][PATCH v3 1/5] mm/zsmalloc: introduce class auto-compaction
  2016-03-18  4:03                     ` Minchan Kim
@ 2016-03-18  4:10                       ` Sergey Senozhatsky
  -1 siblings, 0 replies; 50+ messages in thread
From: Sergey Senozhatsky @ 2016-03-18  4:10 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Sergey Senozhatsky, Sergey Senozhatsky, Andrew Morton,
	Joonsoo Kim, linux-mm, linux-kernel

On (03/18/16 13:03), Minchan Kim wrote:
[..]
> > I have some concerns here. WQ_MEM_RECLAIM implies that there is a kthread
> > attached to wq, a rescuer thread, which will be idle until wq declares mayday.
> > But the kthread will be allocated anyway. And we can queue only one global
> > compaction work at a time; so wq does not buy us a lot here and a simple
> > wake_up_process() looks much better. it makes sense to use a wq if we can have
> > N compaction jobs queued, like I did in my initial patch, but otherwise
> > it's sort of overkill, isn't it?
[..]
> If we can use a normal wq rather than WQ_MEM_RECLAIM, the wq doesn't need
> its own kthread attached to the work. Right? If so, we can do away with
> that resource reservation problem.

right. if the shrinker callbacks stay around (and it seems
they will), then we don't have to guarantee any forward
progress in background compaction. so yes, we can use a normal
wq and there is no need for WQ_MEM_RECLAIM.
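
For reference, a simplified sketch of that direct-compaction fallback;
zsmalloc's real shrinker callback does a bit more bookkeeping, so treat
this as an outline only.

static unsigned long zs_shrinker_scan(struct shrinker *shrinker,
				      struct shrink_control *sc)
{
	struct zs_pool *pool = container_of(shrinker, struct zs_pool,
					    shrinker);

	/*
	 * Direct compaction: runs synchronously in the reclaim context
	 * that invoked the shrinker, so no forward-progress guarantee
	 * is needed from the background worker.
	 */
	return zs_compact(pool);
}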

[..]
> > so you want to have
> > 
> > 	zs_free()
> > 		check pool watermark
> > 			queue class compaction
> 
> No, don't queue class compaction.
> 
> > 			queue pool compaction
> 
> Yes. queue pool compaction.
> 
> > 
> > ?
> > 
> > I think a simpler one would be to just queue a global compaction: if the
> > pool is fragmented -- compact everything, like we do in the shrinker
> > callback.
> 
> That's what I said. :)

ah, ok.

	-ss

^ permalink raw reply	[flat|nested] 50+ messages in thread

end of thread, other threads:[~2016-03-18  4:09 UTC | newest]

Thread overview: 50+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-03-03 14:45 [RFC][PATCH v3 0/5] mm/zsmalloc: rework compaction and increase density Sergey Senozhatsky
2016-03-03 14:45 ` Sergey Senozhatsky
2016-03-03 14:45 ` [RFC][PATCH v3 1/5] mm/zsmalloc: introduce class auto-compaction Sergey Senozhatsky
2016-03-03 14:45   ` Sergey Senozhatsky
2016-03-14  6:17   ` Minchan Kim
2016-03-14  6:17     ` Minchan Kim
2016-03-14  7:41     ` Sergey Senozhatsky
2016-03-14  7:41       ` Sergey Senozhatsky
2016-03-14  8:20       ` Sergey Senozhatsky
2016-03-14  8:20         ` Sergey Senozhatsky
2016-03-15  0:46       ` Minchan Kim
2016-03-15  0:46         ` Minchan Kim
2016-03-15  1:33         ` Sergey Senozhatsky
2016-03-15  1:33           ` Sergey Senozhatsky
2016-03-15  6:17           ` Minchan Kim
2016-03-15  6:17             ` Minchan Kim
2016-03-17  1:29             ` Sergey Senozhatsky
2016-03-17  1:29               ` Sergey Senozhatsky
2016-03-18  1:17               ` Minchan Kim
2016-03-18  1:17                 ` Minchan Kim
2016-03-18  2:00                 ` Sergey Senozhatsky
2016-03-18  2:00                   ` Sergey Senozhatsky
2016-03-18  4:03                   ` Minchan Kim
2016-03-18  4:03                     ` Minchan Kim
2016-03-18  4:10                     ` Sergey Senozhatsky
2016-03-18  4:10                       ` Sergey Senozhatsky
2016-03-03 14:46 ` [RFC][PATCH v3 2/5] mm/zsmalloc: remove shrinker compaction callbacks Sergey Senozhatsky
2016-03-03 14:46   ` Sergey Senozhatsky
2016-03-14  6:32   ` Minchan Kim
2016-03-14  6:32     ` Minchan Kim
2016-03-14  7:45     ` Sergey Senozhatsky
2016-03-14  7:45       ` Sergey Senozhatsky
2016-03-15  0:52       ` Minchan Kim
2016-03-15  0:52         ` Minchan Kim
2016-03-15  1:05         ` Sergey Senozhatsky
2016-03-15  1:05           ` Sergey Senozhatsky
2016-03-15  2:19           ` Minchan Kim
2016-03-15  2:19             ` Minchan Kim
2016-03-03 14:46 ` [RFC][PATCH v3 3/5] mm/zsmalloc: introduce zs_huge_object() Sergey Senozhatsky
2016-03-03 14:46   ` Sergey Senozhatsky
2016-03-14  6:53   ` Minchan Kim
2016-03-14  6:53     ` Minchan Kim
2016-03-14  8:08     ` Sergey Senozhatsky
2016-03-14  8:08       ` Sergey Senozhatsky
2016-03-15  0:54       ` Minchan Kim
2016-03-15  0:54         ` Minchan Kim
2016-03-03 14:46 ` [RFC][PATCH v3 4/5] zram: use zs_huge_object() Sergey Senozhatsky
2016-03-03 14:46   ` Sergey Senozhatsky
2016-03-03 14:46 ` [RFC][PATCH v3 5/5] mm/zsmalloc: reduce the number of huge classes Sergey Senozhatsky
2016-03-03 14:46   ` Sergey Senozhatsky
