linux-kernel.vger.kernel.org archive mirror
* [PATCHv4 0/9] zsmalloc/zram: configurable zspage size
@ 2022-10-31  5:40 Sergey Senozhatsky
  2022-10-31  5:41 ` [PATCHv4 1/9] zram: add size class equals check into recompression Sergey Senozhatsky
                   ` (9 more replies)
  0 siblings, 10 replies; 34+ messages in thread
From: Sergey Senozhatsky @ 2022-10-31  5:40 UTC (permalink / raw)
  To: Andrew Morton, Minchan Kim
  Cc: Nitin Gupta, linux-kernel, linux-mm, Sergey Senozhatsky

	Hello,

	Some use-cases and/or data patterns may benefit from
larger zspages. Currently the limit on the number of physical
pages that can be linked into a zspage is hardcoded to 4. A higher
limit changes key characteristics of a number of the size
classes, improving compactness of the pool and reducing the
amount of memory the zsmalloc pool uses. More on this in the
0002 commit message.

v4:
-- Fixed type of the max_pages_per_zspage (kbuild reported a
   "warning: right shift count >= width of type" warning)
-- Renamed max_pages_per_zspage variable

v3:
-- Removed lots of text from 0002 commit message. Now it's shorter
   and simpler.

v2:
-- Cherry picked a patch from Alexey (minor code tweaks to move
   it ahead of this series)
-- zsmalloc no longer requires the pages-per-zspage limit to be a
   power-of-2 value, and overall doesn't use "order" any more
-- zram no longer requires a "zspage order" (power of 2) value and
   instead accepts an integer in the [1,16] range
-- There is no global huge_class_size in zsmalloc anymore.
   huge_class_size is per-pool, since it depends on pages-per-zspage,
   which can be different for different pools.
-- There is no global huge_class_size in zram anymore. It should
   be per-pool (per-device).
-- Updated documentation
-- Fixed documentation htmldocs warning (Stephen)
-- Dropped get_pages_per_zspage() patch
-- Renamed zram sysfs knob (device attribute)
-- Re-worked "synthetic test" section in the first commit: more numbers,
   object distribution analysis, etc.

Alexey Romanov (1):
  zram: add size class equals check into recompression

Sergey Senozhatsky (8):
  zsmalloc: turn zspage order into runtime variable
  zsmalloc: move away from page order defines
  zsmalloc: make huge class watermark zs_pool member
  zram: huge size watermark cannot be global
  zsmalloc: pass limit on pages per-zspage to zs_create_pool()
  zram: add pages_per_pool_page device attribute
  Documentation: document zram pages_per_pool_page attribute
  zsmalloc: break out of loop when found perfect zspage order

 Documentation/admin-guide/blockdev/zram.rst |  38 +++++--
 drivers/block/zram/zram_drv.c               |  63 +++++++++--
 drivers/block/zram/zram_drv.h               |   7 ++
 include/linux/zsmalloc.h                    |  14 ++-
 mm/zsmalloc.c                               | 112 +++++++++++++-------
 5 files changed, 176 insertions(+), 58 deletions(-)

-- 
2.38.1.273.g43a17bfeac-goog



* [PATCHv4 1/9] zram: add size class equals check into recompression
  2022-10-31  5:40 [PATCHv4 0/9] zsmalloc/zram: configurable zspage size Sergey Senozhatsky
@ 2022-10-31  5:41 ` Sergey Senozhatsky
  2022-10-31  5:41 ` [PATCHv4 2/9] zsmalloc: turn zspage order into runtime variable Sergey Senozhatsky
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 34+ messages in thread
From: Sergey Senozhatsky @ 2022-10-31  5:41 UTC (permalink / raw)
  To: Andrew Morton, Minchan Kim
  Cc: Nitin Gupta, linux-kernel, linux-mm, Alexey Romanov, Sergey Senozhatsky

From: Alexey Romanov <avromanov@sberdevices.ru>

It makes no sense to recompress an object if it ends up in the
same size class: we get no memory gain, but we do pay a CPU time
overhead for inserting this object into a zspage and decompressing
it afterwards. For example, an object that compressed to 1568 bytes
and recompresses to 1552 bytes still lands in the same (merged)
size class, so nothing is saved.

Signed-off-by: Alexey Romanov <avromanov@sberdevices.ru>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
---
 drivers/block/zram/zram_drv.c |  5 +++++
 include/linux/zsmalloc.h      |  2 ++
 mm/zsmalloc.c                 | 21 +++++++++++++++++++++
 3 files changed, 28 insertions(+)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 364323713393..fd31beb6491a 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1632,6 +1632,8 @@ static int zram_recompress(struct zram *zram, u32 index, struct page *page,
 	unsigned long handle_next;
 	unsigned int comp_len_next;
 	unsigned int comp_len_prev;
+	unsigned int class_index_prev;
+	unsigned int class_index_next;
 	struct zcomp_strm *zstrm;
 	void *src, *dst;
 	int ret;
@@ -1656,6 +1658,8 @@ static int zram_recompress(struct zram *zram, u32 index, struct page *page,
 	ret = zcomp_compress(zstrm, src, &comp_len_next);
 	kunmap_atomic(src);
 
+	class_index_prev = zs_lookup_class_index(zram->mem_pool, comp_len_prev);
+	class_index_next = zs_lookup_class_index(zram->mem_pool, comp_len_next);
 	/*
 	 * Either a compression error or we failed to compress the object
 	 * in a way that will save us memory. Mark the object so that we
@@ -1663,6 +1667,7 @@ static int zram_recompress(struct zram *zram, u32 index, struct page *page,
 	 */
 	if (comp_len_next >= huge_class_size ||
 	    comp_len_next >= comp_len_prev ||
+	    class_index_next >= class_index_prev ||
 	    ret) {
 		zram_set_flag(zram, index, ZRAM_RECOMP_SKIP);
 		zram_clear_flag(zram, index, ZRAM_IDLE);
diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h
index 2a430e713ce5..a48cd0ffe57d 100644
--- a/include/linux/zsmalloc.h
+++ b/include/linux/zsmalloc.h
@@ -55,5 +55,7 @@ void zs_unmap_object(struct zs_pool *pool, unsigned long handle);
 unsigned long zs_get_total_pages(struct zs_pool *pool);
 unsigned long zs_compact(struct zs_pool *pool);
 
+unsigned int zs_lookup_class_index(struct zs_pool *pool, unsigned int size);
+
 void zs_pool_stats(struct zs_pool *pool, struct zs_pool_stats *stats);
 #endif
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index d03941cace2c..065744b7e9d8 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -1205,6 +1205,27 @@ static bool zspage_full(struct size_class *class, struct zspage *zspage)
 	return get_zspage_inuse(zspage) == class->objs_per_zspage;
 }
 
+/**
+ * zs_lookup_class_index() - Returns index of the zsmalloc &size_class
+ * that holds objects of the provided size.
+ * @pool: zsmalloc pool to use
+ * @size: object size
+ *
+ * Context: Any context.
+ *
+ * Return: the index of the zsmalloc &size_class that holds objects of the
+ * provided size.
+ */
+unsigned int zs_lookup_class_index(struct zs_pool *pool, unsigned int size)
+{
+	struct size_class *class;
+
+	class = pool->size_class[get_size_class_index(size)];
+
+	return class->index;
+}
+EXPORT_SYMBOL_GPL(zs_lookup_class_index);
+
 unsigned long zs_get_total_pages(struct zs_pool *pool)
 {
 	return atomic_long_read(&pool->pages_allocated);
-- 
2.38.1.273.g43a17bfeac-goog



* [PATCHv4 2/9] zsmalloc: turn zspage order into runtime variable
  2022-10-31  5:40 [PATCHv4 0/9] zsmalloc/zram: configurable zspage size Sergey Senozhatsky
  2022-10-31  5:41 ` [PATCHv4 1/9] zram: add size class equals check into recompression Sergey Senozhatsky
@ 2022-10-31  5:41 ` Sergey Senozhatsky
  2022-11-10 21:59   ` Minchan Kim
  2022-10-31  5:41 ` [PATCHv4 3/9] zsmalloc: move away from page order defines Sergey Senozhatsky
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 34+ messages in thread
From: Sergey Senozhatsky @ 2022-10-31  5:41 UTC (permalink / raw)
  To: Andrew Morton, Minchan Kim
  Cc: Nitin Gupta, linux-kernel, linux-mm, Sergey Senozhatsky

zsmalloc has 255 size classes. Size classes contain a number of zspages,
which store objects of the same size. A zspage can consist of up to four
physical pages. The exact (optimal) zspage size is calculated for each
size class during zsmalloc pool creation.

As a reasonable optimization, zsmalloc merges size classes that have
similar characteristics: number of pages per zspage and number of
objects zspage can store.

For example, let's look at the following size classes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
..
   94  1536           0            0             0          0          0                3        0
  100  1632           0            0             0          0          0                2        0
..

Size classes #95-99 are merged with size class #100. That is, each time
we store an object of size, say, 1568 bytes, instead of using class #96
we end up storing it in size class #100. Class #100 is for objects of
1632 bytes in size, hence every 1568-byte object wastes 1632-1568=64 bytes.
Class #100 zspages consist of 2 physical pages and can hold 5 objects.
When we need to store, say, 13 objects of size 1568 we end up allocating
three zspages; in other words, 6 physical pages.

However, if we look closer at size class #96 (which should hold objects
of size 1568 bytes) and trace get_pages_per_zspage():

    pages per zspage      wasted bytes     used%
           1                  960           76
           2                  352           95
           3                 1312           89
           4                  704           95
           5                   96           99

We'd notice that the optimal zspage configuration for this class is
5 physical pages, but currently we never let zspages consist of more
than 4 pages. A 5-page class #96 configuration would store 13 objects
of size 1568 in a single zspage, allocating 5 physical pages, as opposed
to the 6 physical pages that class #100 would allocate.
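
For reference, the numbers in the table above can be reproduced with a
trivial userspace sketch (assuming 4K pages) that mirrors the waste
calculation of get_pages_per_zspage():

	#include <stdio.h>

	int main(void)
	{
		unsigned int class_size = 1568, page_size = 4096;
		unsigned int i;

		for (i = 1; i <= 5; i++) {
			unsigned int zspage_size = i * page_size;
			unsigned int waste = zspage_size % class_size;
			unsigned int usedpc;

			/* percentage of the zspage actually used by objects */
			usedpc = (zspage_size - waste) * 100 / zspage_size;
			printf("%u pages: wasted %u, used %u%%\n",
			       i, waste, usedpc);
		}
		return 0;
	}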

A higher order zspage for class #96 also changes its key characteristics:
pages per-zspage and objects per-zspage. As a result, classes #96 and #100
are no longer merged, which gives us a more compact zsmalloc pool.

Of course, the described effect does not apply only to size classes #96
and #100. We still merge classes, but less often. In other words, classes
are grouped in a more compact way, which decreases memory wastage:

zspage order               # unique size classes
     2                                69
     3                               123
     4                               191

Let's take a closer look at the bottom of /sys/kernel/debug/zsmalloc/zram0/classes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
  202  3264           0            0             0          0          0                4        0
  254  4096           0            0             0          0          0                1        0
...

For exactly the same reason - the maximum of 4 pages per zspage - the last
non-huge size class is #202, which stores objects of size 3264 bytes. Any
object larger than 3264 bytes, hence, is considered to be huge and lands in
size class #254, which uses a whole physical page to store every object. To
put it slightly differently - objects in huge classes don't share physical
pages.

3264 bytes is too low of a watermark and we have too many huge classes:
classes from #203 to #254. Similarly to size class #96 above, higher order
zspages change the key characteristics of some of those huge size classes,
and those classes then become normal classes, in which stored objects share
physical pages.

Hence yet another consequence of higher order zspages: the huge size
class watermark moves up, we have fewer huge classes and we store large
objects in a more compact way.

For order 3, the huge class watermark becomes 3632 bytes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
  202  3264           0            0             0          0          0                4        0
  211  3408           0            0             0          0          0                5        0
  217  3504           0            0             0          0          0                6        0
  222  3584           0            0             0          0          0                7        0
  225  3632           0            0             0          0          0                8        0
  254  4096           0            0             0          0          0                1        0
...

For order 4, the huge class watermark becomes 3840 bytes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
  202  3264           0            0             0          0          0                4        0
  206  3328           0            0             0          0          0               13        0
  207  3344           0            0             0          0          0                9        0
  208  3360           0            0             0          0          0               14        0
  211  3408           0            0             0          0          0                5        0
  212  3424           0            0             0          0          0               16        0
  214  3456           0            0             0          0          0               11        0
  217  3504           0            0             0          0          0                6        0
  219  3536           0            0             0          0          0               13        0
  222  3584           0            0             0          0          0                7        0
  223  3600           0            0             0          0          0               15        0
  225  3632           0            0             0          0          0                8        0
  228  3680           0            0             0          0          0                9        0
  230  3712           0            0             0          0          0               10        0
  232  3744           0            0             0          0          0               11        0
  234  3776           0            0             0          0          0               12        0
  235  3792           0            0             0          0          0               13        0
  236  3808           0            0             0          0          0               14        0
  238  3840           0            0             0          0          0               15        0
  254  4096           0            0             0          0          0                1        0
...

TESTS
=====

The test untars linux-6.0.tar.xz and compiles the kernel.

zram is configured as a block device with an ext4 file system and the
lzo-rle compression algorithm. We captured /sys/block/zram0/mm_stat
after every test and rebooted the VM.

orig_data_size       mem_used_total     mem_used_max       pages_compacted
          compr_data_size         mem_limit         same_pages       huge_pages

ORDER 2 (BASE) zspage

1691791360 628086729 655171584        0 655171584       60        0    34043
1691787264 628089196 655175680        0 655175680       60        0    34046
1691803648 628098840 655187968        0 655187968       59        0    34047
1691795456 628091503 655183872        0 655183872       60        0    34044
1691799552 628086877 655183872        0 655183872       60        0    34047

ORDER 3 zspage

1691803648 627792993 641794048        0 641794048       60        0    33591
1691787264 627779342 641708032        0 641708032       59        0    33591
1691811840 627786616 641769472        0 641769472       60        0    33591
1691803648 627794468 641818624        0 641818624       59        0    33592
1691783168 627780882 641794048        0 641794048       61        0    33591

ORDER 4 zspage

1691803648 627726635 639655936        0 639655936       60        0    33435
1691811840 627733348 639643648        0 639643648       61        0    33434
1691795456 627726290 639614976        0 639614976       60        0    33435
1691803648 627730458 639688704        0 639688704       60        0    33434
1691811840 627727771 639688704        0 639688704       60        0    33434

Order 3 and order 4 show a statistically significant improvement in
the `mem_used_max` metric.

T-test for order 3:

x order-2-maxmem
+ order-3-maxmem
    N           Min           Max        Median           Avg        Stddev
x   5 6.5517158e+08 6.5518797e+08 6.5518387e+08  6.551806e+08     6730.4157
+   5 6.4170803e+08 6.4181862e+08 6.4179405e+08 6.4177684e+08     42210.666
Difference at 95.0% confidence
	-1.34038e+07 +/- 44080.7
	-2.04581% +/- 0.00672802%
	(Student's t, pooled s = 30224.5)
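
For reference, the interval above is the standard two-sample Student's t
confidence interval with a pooled standard deviation:

	s_p = sqrt(((n_x - 1)*s_x^2 + (n_y - 1)*s_y^2) / (n_x + n_y - 2))

	CI = (avg_y - avg_x) +/- t(0.975, n_x + n_y - 2) * s_p * sqrt(1/n_x + 1/n_y)

E.g. with n_x = n_y = 5, s_p = 30224.5 and t(0.975, 8) ~= 2.306, the
half-width is 2.306 * 30224.5 * sqrt(2/5) ~= 44081, which matches the
+/- 44080.7 above.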

T-test for order 4:

x order-2-maxmem
+ order-4-maxmem
    N           Min           Max        Median           Avg        Stddev
x   5 6.5517158e+08 6.5518797e+08 6.5518387e+08  6.551806e+08     6730.4157
+   5 6.3961498e+08  6.396887e+08 6.3965594e+08 6.3965839e+08     31408.602
Difference at 95.0% confidence
	-1.55222e+07 +/- 33126.2
	-2.36915% +/- 0.00505604%
	(Student's t, pooled s = 22713.4)

This test tends to benefit more from order 4 zspages, due to the test's
data patterns.

zsmalloc object distribution analysis
=====================================

Order 2 (4 pages per zspage) tends to put many objects in size class 2048,
into which size classes #112-#125 are merged:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
    71  1168           0            0          6146       6146       1756                2        0
    74  1216           0            1          4560       4552       1368                3        0
    76  1248           0            1          2938       2934        904                4        0
    83  1360           0            0         10971      10971       3657                1        0
    91  1488           0            0         16126      16126       5864                4        0
    94  1536           0            1          5912       5908       2217                3        0
   100  1632           0            0         11990      11990       4796                2        0
   107  1744           0            1         15771      15768       6759                3        0
   111  1808           0            1         10386      10380       4616                4        0
   126  2048           0            0         45444      45444      22722                1        0
   144  2336           0            0         47446      47446      27112                4        0
   151  2448           1            0         10760      10759       6456                3        0
   168  2720           0            0         10173      10173       6782                2        0
   190  3072           0            1          1700       1697       1275                3        0
   202  3264           0            1           290        286        232                4        0
   254  4096           0            0         34051      34051      34051                1        0

Order 3 (8 pages per zspage) changed the pool characteristics and unmerged
some of the size classes, which resulted in fewer objects being put into
size class 2048, because lower size classes are now available for more
compact object storage:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
    71  1168           0            1          2996       2994        856                2        0
    72  1184           0            1          1632       1609        476                7        0
    73  1200           1            0          1445       1442        425                5        0
    74  1216           0            0          1510       1510        453                3        0
    75  1232           0            1          1495       1479        455                7        0
    76  1248           0            1          1456       1451        448                4        0
    78  1280           0            1          3040       3033        950                5        0
    79  1296           0            1          1584       1571        504                7        0
    83  1360           0            0          6375       6375       2125                1        0
    84  1376           0            1          1817       1796        632                8        0
    87  1424           0            1          6020       6006       2107                7        0
    88  1440           0            1          2108       2101        744                6        0
    89  1456           0            1          2072       2064        740                5        0
    91  1488           0            1          4169       4159       1516                4        0
    92  1504           0            1          2014       2007        742                7        0
    94  1536           0            1          3904       3900       1464                3        0
    95  1552           0            1          1890       1873        720                8        0
    96  1568           0            1          1963       1958        755                5        0
    97  1584           0            1          1980       1974        770                7        0
   100  1632           0            1          6190       6187       2476                2        0
   103  1680           0            0          6477       6477       2667                7        0
   104  1696           0            1          2256       2253        940                5        0
   105  1712           0            1          2356       2340        992                8        0
   107  1744           1            0          4697       4696       2013                3        0
   110  1792           0            1          7744       7734       3388                7        0
   111  1808           0            1          2655       2649       1180                4        0
   114  1856           0            1          8371       8365       3805                5        0
   116  1888           1            0          5863       5862       2706                6        0
   117  1904           0            1          2955       2942       1379                7        0
   118  1920           0            1          3009       2997       1416                8        0
   126  2048           0            0         25276      25276      12638                1        0
   128  2080           0            1          6060       6052       3232                8        0
   129  2096           1            0          3081       3080       1659                7        0
   134  2176           0            1         14835      14830       7912                8        0
   135  2192           0            1          2769       2758       1491                7        0
   137  2224           0            1          5082       5077       2772                6        0
   140  2272           0            1          7236       7232       4020                5        0
   144  2336           0            1          8428       8423       4816                4        0
   147  2384           0            1          5316       5313       3101                7        0
   151  2448           0            1          5445       5443       3267                3        0
   155  2512           0            0          4121       4121       2536                8        0
   158  2560           0            1          2208       2205       1380                5        0
   160  2592           0            0          1133       1133        721                7        0
   168  2720           0            0          2712       2712       1808                2        0
   177  2864           1            0          1100       1098        770                7        0
   180  2912           0            1           189        183        135                5        0
   184  2976           0            1           176        166        128                8        0
   190  3072           0            0           252        252        189                3        0
   197  3184           0            1           198        192        154                7        0
   202  3264           0            1           100         96         80                4        0
   211  3408           0            1           210        208        175                5        0
   217  3504           0            1            98         94         84                6        0
   222  3584           0            0           104        104         91                7        0
   225  3632           0            1            54         50         48                8        0
   254  4096           0            0         33591      33591      33591                1        0

Note that the huge size watermark is now above 3632 and there are a number
of new normal classes available that previously were merged with the huge
class. For instance, size class #211 holds 210 objects of size 3408 and
uses 175 physical pages (each 5-page zspage holds 6 such objects, so 35
zspages take 175 pages), while previously we would have used 210 physical
pages for those objects.

Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
---
 include/linux/zsmalloc.h | 12 +++++++
 mm/zsmalloc.c            | 73 +++++++++++++++++++++++-----------------
 2 files changed, 54 insertions(+), 31 deletions(-)

diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h
index a48cd0ffe57d..6cd1d95b928a 100644
--- a/include/linux/zsmalloc.h
+++ b/include/linux/zsmalloc.h
@@ -33,6 +33,18 @@ enum zs_mapmode {
 	 */
 };
 
+#define ZS_PAGE_ORDER_2		2
+#define ZS_PAGE_ORDER_4		4
+
+/*
+ * A single 'zspage' is composed of up to 2^N discontiguous 0-order (single)
+ * pages. ZS_MAX_PAGE_ORDER defines upper limit on N, ZS_MIN_PAGE_ORDER
+ * defines lower limit on N. ZS_DEFAULT_PAGE_ORDER is recommended value.
+ */
+#define ZS_MIN_PAGE_ORDER	ZS_PAGE_ORDER_2
+#define ZS_MAX_PAGE_ORDER	ZS_PAGE_ORDER_4
+#define ZS_DEFAULT_PAGE_ORDER	ZS_PAGE_ORDER_2
+
 struct zs_pool_stats {
 	/* How many pages were migrated (freed) */
 	atomic_long_t pages_compacted;
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 065744b7e9d8..a9773566f85b 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -74,12 +74,7 @@
  */
 #define ZS_ALIGN		8
 
-/*
- * A single 'zspage' is composed of up to 2^N discontiguous 0-order (single)
- * pages. ZS_MAX_ZSPAGE_ORDER defines upper limit on N.
- */
-#define ZS_MAX_ZSPAGE_ORDER 2
-#define ZS_MAX_PAGES_PER_ZSPAGE (_AC(1, UL) << ZS_MAX_ZSPAGE_ORDER)
+#define ZS_MAX_PAGES_PER_ZSPAGE	(_AC(1, UL) << ZS_MAX_PAGE_ORDER)
 
 #define ZS_HANDLE_SIZE (sizeof(unsigned long))
 
@@ -124,10 +119,8 @@
 #define ISOLATED_BITS	3
 #define MAGIC_VAL_BITS	8
 
-#define MAX(a, b) ((a) >= (b) ? (a) : (b))
-/* ZS_MIN_ALLOC_SIZE must be multiple of ZS_ALIGN */
-#define ZS_MIN_ALLOC_SIZE \
-	MAX(32, (ZS_MAX_PAGES_PER_ZSPAGE << PAGE_SHIFT >> OBJ_INDEX_BITS))
+#define ZS_MIN_ALLOC_SIZE	32U
+
 /* each chunk includes extra space to keep handle */
 #define ZS_MAX_ALLOC_SIZE	PAGE_SIZE
 
@@ -141,12 +134,10 @@
  *    determined). NOTE: all those class sizes must be set as multiple of
  *    ZS_ALIGN to make sure link_free itself never has to span 2 pages.
  *
- *  ZS_MIN_ALLOC_SIZE and ZS_SIZE_CLASS_DELTA must be multiple of ZS_ALIGN
- *  (reason above)
+ *  pool->min_alloc_size (ZS_MIN_ALLOC_SIZE) and ZS_SIZE_CLASS_DELTA must
+ *  be multiple of ZS_ALIGN (reason above)
  */
 #define ZS_SIZE_CLASS_DELTA	(PAGE_SIZE >> CLASS_BITS)
-#define ZS_SIZE_CLASSES	(DIV_ROUND_UP(ZS_MAX_ALLOC_SIZE - ZS_MIN_ALLOC_SIZE, \
-				      ZS_SIZE_CLASS_DELTA) + 1)
 
 enum fullness_group {
 	ZS_EMPTY,
@@ -230,12 +221,15 @@ struct link_free {
 struct zs_pool {
 	const char *name;
 
-	struct size_class *size_class[ZS_SIZE_CLASSES];
+	struct size_class **size_class;
 	struct kmem_cache *handle_cachep;
 	struct kmem_cache *zspage_cachep;
 
 	atomic_long_t pages_allocated;
 
+	u32 num_size_classes;
+	u32 min_alloc_size;
+
 	struct zs_pool_stats stats;
 
 	/* Compact classes */
@@ -523,15 +517,15 @@ static void set_zspage_mapping(struct zspage *zspage,
  * classes depending on its size. This function returns index of the
  * size class which has chunk size big enough to hold the given size.
  */
-static int get_size_class_index(int size)
+static int get_size_class_index(struct zs_pool *pool, int size)
 {
 	int idx = 0;
 
-	if (likely(size > ZS_MIN_ALLOC_SIZE))
-		idx = DIV_ROUND_UP(size - ZS_MIN_ALLOC_SIZE,
+	if (likely(size > pool->min_alloc_size))
+		idx = DIV_ROUND_UP(size - pool->min_alloc_size,
 				ZS_SIZE_CLASS_DELTA);
 
-	return min_t(int, ZS_SIZE_CLASSES - 1, idx);
+	return min_t(int, pool->num_size_classes - 1, idx);
 }
 
 /* type can be of enum type class_stat_type or fullness_group */
@@ -591,7 +585,7 @@ static int zs_stats_size_show(struct seq_file *s, void *v)
 			"obj_allocated", "obj_used", "pages_used",
 			"pages_per_zspage", "freeable");
 
-	for (i = 0; i < ZS_SIZE_CLASSES; i++) {
+	for (i = 0; i < pool->num_size_classes; i++) {
 		class = pool->size_class[i];
 
 		if (class->index != i)
@@ -777,13 +771,13 @@ static enum fullness_group fix_fullness_group(struct size_class *class,
  * link together 3 PAGE_SIZE sized pages to form a zspage
  * since then we can perfectly fit in 8 such objects.
  */
-static int get_pages_per_zspage(int class_size)
+static int get_pages_per_zspage(u32 class_size, u32 num_pages)
 {
 	int i, max_usedpc = 0;
 	/* zspage order which gives maximum used size per KB */
 	int max_usedpc_order = 1;
 
-	for (i = 1; i <= ZS_MAX_PAGES_PER_ZSPAGE; i++) {
+	for (i = 1; i <= num_pages; i++) {
 		int zspage_size;
 		int waste, usedpc;
 
@@ -1220,7 +1214,7 @@ unsigned int zs_lookup_class_index(struct zs_pool *pool, unsigned int size)
 {
 	struct size_class *class;
 
-	class = pool->size_class[get_size_class_index(size)];
+	class = pool->size_class[get_size_class_index(pool, size)];
 
 	return class->index;
 }
@@ -1431,7 +1425,7 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
 
 	/* extra space in chunk to keep the handle */
 	size += ZS_HANDLE_SIZE;
-	class = pool->size_class[get_size_class_index(size)];
+	class = pool->size_class[get_size_class_index(pool, size)];
 
 	/* class->lock effectively protects the zpage migration */
 	spin_lock(&class->lock);
@@ -1980,7 +1974,7 @@ static void async_free_zspage(struct work_struct *work)
 	struct zs_pool *pool = container_of(work, struct zs_pool,
 					free_work);
 
-	for (i = 0; i < ZS_SIZE_CLASSES; i++) {
+	for (i = 0; i < pool->num_size_classes; i++) {
 		class = pool->size_class[i];
 		if (class->index != i)
 			continue;
@@ -2129,7 +2123,7 @@ unsigned long zs_compact(struct zs_pool *pool)
 	struct size_class *class;
 	unsigned long pages_freed = 0;
 
-	for (i = ZS_SIZE_CLASSES - 1; i >= 0; i--) {
+	for (i = pool->num_size_classes - 1; i >= 0; i--) {
 		class = pool->size_class[i];
 		if (class->index != i)
 			continue;
@@ -2173,7 +2167,7 @@ static unsigned long zs_shrinker_count(struct shrinker *shrinker,
 	struct zs_pool *pool = container_of(shrinker, struct zs_pool,
 			shrinker);
 
-	for (i = ZS_SIZE_CLASSES - 1; i >= 0; i--) {
+	for (i = pool->num_size_classes - 1; i >= 0; i--) {
 		class = pool->size_class[i];
 		if (class->index != i)
 			continue;
@@ -2215,11 +2209,27 @@ struct zs_pool *zs_create_pool(const char *name)
 	int i;
 	struct zs_pool *pool;
 	struct size_class *prev_class = NULL;
+	unsigned long num_pages;
 
 	pool = kzalloc(sizeof(*pool), GFP_KERNEL);
 	if (!pool)
 		return NULL;
 
+	num_pages = 1UL << ZS_DEFAULT_PAGE_ORDER;
+	/* min_alloc_size must be multiple of ZS_ALIGN */
+	pool->min_alloc_size = num_pages << PAGE_SHIFT >> OBJ_INDEX_BITS;
+	pool->min_alloc_size = max(pool->min_alloc_size, ZS_MIN_ALLOC_SIZE);
+
+	pool->num_size_classes =
+		DIV_ROUND_UP(ZS_MAX_ALLOC_SIZE - pool->min_alloc_size,
+			     ZS_SIZE_CLASS_DELTA) + 1;
+
+	pool->size_class = kmalloc_array(pool->num_size_classes,
+					 sizeof(struct size_class *),
+					 GFP_KERNEL | __GFP_ZERO);
+	if (!pool->size_class)
+		goto err;
+
 	init_deferred_free(pool);
 	rwlock_init(&pool->migrate_lock);
 
@@ -2234,17 +2244,17 @@ struct zs_pool *zs_create_pool(const char *name)
 	 * Iterate reversely, because, size of size_class that we want to use
 	 * for merging should be larger or equal to current size.
 	 */
-	for (i = ZS_SIZE_CLASSES - 1; i >= 0; i--) {
+	for (i = pool->num_size_classes - 1; i >= 0; i--) {
 		int size;
 		int pages_per_zspage;
 		int objs_per_zspage;
 		struct size_class *class;
 		int fullness = 0;
 
-		size = ZS_MIN_ALLOC_SIZE + i * ZS_SIZE_CLASS_DELTA;
+		size = pool->min_alloc_size + i * ZS_SIZE_CLASS_DELTA;
 		if (size > ZS_MAX_ALLOC_SIZE)
 			size = ZS_MAX_ALLOC_SIZE;
-		pages_per_zspage = get_pages_per_zspage(size);
+		pages_per_zspage = get_pages_per_zspage(size, num_pages);
 		objs_per_zspage = pages_per_zspage * PAGE_SIZE / size;
 
 		/*
@@ -2328,7 +2338,7 @@ void zs_destroy_pool(struct zs_pool *pool)
 	zs_flush_migration(pool);
 	zs_pool_stat_destroy(pool);
 
-	for (i = 0; i < ZS_SIZE_CLASSES; i++) {
+	for (i = 0; i < pool->num_size_classes; i++) {
 		int fg;
 		struct size_class *class = pool->size_class[i];
 
@@ -2348,6 +2358,7 @@ void zs_destroy_pool(struct zs_pool *pool)
 	}
 
 	destroy_cache(pool);
+	kfree(pool->size_class);
 	kfree(pool->name);
 	kfree(pool);
 }
-- 
2.38.1.273.g43a17bfeac-goog



* [PATCHv4 3/9] zsmalloc: move away from page order defines
  2022-10-31  5:40 [PATCHv4 0/9] zsmalloc/zram: configurable zspage size Sergey Senozhatsky
  2022-10-31  5:41 ` [PATCHv4 1/9] zram: add size class equals check into recompression Sergey Senozhatsky
  2022-10-31  5:41 ` [PATCHv4 2/9] zsmalloc: turn zspage order into runtime variable Sergey Senozhatsky
@ 2022-10-31  5:41 ` Sergey Senozhatsky
  2022-11-10 22:02   ` Minchan Kim
  2022-10-31  5:41 ` [PATCHv4 4/9] zsmalloc: make huge class watermark zs_pool member Sergey Senozhatsky
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 34+ messages in thread
From: Sergey Senozhatsky @ 2022-10-31  5:41 UTC (permalink / raw)
  To: Andrew Morton, Minchan Kim
  Cc: Nitin Gupta, linux-kernel, linux-mm, Sergey Senozhatsky

There is no reason for us to require pages per-zspage to be a
power of two. Rename macros and use plain limit numbers there
instead of 2^N values. This will let us have more tunable
limits.

Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
---
 include/linux/zsmalloc.h | 16 +++++++---------
 mm/zsmalloc.c            |  4 +---
 2 files changed, 8 insertions(+), 12 deletions(-)

diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h
index 6cd1d95b928a..b6b8654a2d45 100644
--- a/include/linux/zsmalloc.h
+++ b/include/linux/zsmalloc.h
@@ -33,17 +33,15 @@ enum zs_mapmode {
 	 */
 };
 
-#define ZS_PAGE_ORDER_2		2
-#define ZS_PAGE_ORDER_4		4
-
 /*
- * A single 'zspage' is composed of up to 2^N discontiguous 0-order (single)
- * pages. ZS_MAX_PAGE_ORDER defines upper limit on N, ZS_MIN_PAGE_ORDER
- * defines lower limit on N. ZS_DEFAULT_PAGE_ORDER is recommended value.
+ * A single 'zspage' is composed of up to N discontiguous 0-order
+ * (single) pages. ZS_MAX_PAGES_PER_ZSPAGE defines upper limit on N,
+ * ZS_MIN_PAGES_PER_ZSPAGE defines lower limit on N.
+ * ZS_DEFAULT_PAGES_PER_ZSPAGE is a recommended value.
  */
-#define ZS_MIN_PAGE_ORDER	ZS_PAGE_ORDER_2
-#define ZS_MAX_PAGE_ORDER	ZS_PAGE_ORDER_4
-#define ZS_DEFAULT_PAGE_ORDER	ZS_PAGE_ORDER_2
+#define ZS_MIN_PAGES_PER_ZSPAGE	1
+#define ZS_MAX_PAGES_PER_ZSPAGE	16
+#define ZS_DEFAULT_PAGES_PER_ZSPAGE	4
 
 struct zs_pool_stats {
 	/* How many pages were migrated (freed) */
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index a9773566f85b..5f79223e7bfe 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -74,8 +74,6 @@
  */
 #define ZS_ALIGN		8
 
-#define ZS_MAX_PAGES_PER_ZSPAGE	(_AC(1, UL) << ZS_MAX_PAGE_ORDER)
-
 #define ZS_HANDLE_SIZE (sizeof(unsigned long))
 
 /*
@@ -2215,7 +2213,7 @@ struct zs_pool *zs_create_pool(const char *name)
 	if (!pool)
 		return NULL;
 
-	num_pages = 1UL << ZS_DEFAULT_PAGE_ORDER;
+	num_pages = ZS_DEFAULT_PAGES_PER_ZSPAGE;
 	/* min_alloc_size must be multiple of ZS_ALIGN */
 	pool->min_alloc_size = num_pages << PAGE_SHIFT >> OBJ_INDEX_BITS;
 	pool->min_alloc_size = max(pool->min_alloc_size, ZS_MIN_ALLOC_SIZE);
-- 
2.38.1.273.g43a17bfeac-goog



* [PATCHv4 4/9] zsmalloc: make huge class watermark zs_pool member
  2022-10-31  5:40 [PATCHv4 0/9] zsmalloc/zram: configurable zspage size Sergey Senozhatsky
                   ` (2 preceding siblings ...)
  2022-10-31  5:41 ` [PATCHv4 3/9] zsmalloc: move away from page order defines Sergey Senozhatsky
@ 2022-10-31  5:41 ` Sergey Senozhatsky
  2022-11-10 22:25   ` Minchan Kim
  2022-10-31  5:41 ` [PATCHv4 5/9] zram: huge size watermark cannot be global Sergey Senozhatsky
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 34+ messages in thread
From: Sergey Senozhatsky @ 2022-10-31  5:41 UTC (permalink / raw)
  To: Andrew Morton, Minchan Kim
  Cc: Nitin Gupta, linux-kernel, linux-mm, Sergey Senozhatsky

We will permit per-pool configuration of the pages per-zspage value,
which changes the characteristics of the size classes and moves the
huge class size watermark around. Thus huge class size needs to be
a per-pool variable.

Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
---
 mm/zsmalloc.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 5f79223e7bfe..d329bd673baa 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -178,7 +178,6 @@ static struct dentry *zs_stat_root;
  * (see: fix_fullness_group())
  */
 static const int fullness_threshold_frac = 4;
-static size_t huge_class_size;
 
 struct size_class {
 	spinlock_t lock;
@@ -227,6 +226,7 @@ struct zs_pool {
 
 	u32 num_size_classes;
 	u32 min_alloc_size;
+	size_t huge_class_size;
 
 	struct zs_pool_stats stats;
 
@@ -1350,7 +1350,7 @@ EXPORT_SYMBOL_GPL(zs_unmap_object);
  */
 size_t zs_huge_class_size(struct zs_pool *pool)
 {
-	return huge_class_size;
+	return pool->huge_class_size;
 }
 EXPORT_SYMBOL_GPL(zs_huge_class_size);
 
@@ -2262,8 +2262,8 @@ struct zs_pool *zs_create_pool(const char *name)
 		 * endup in the huge class.
 		 */
 		if (pages_per_zspage != 1 && objs_per_zspage != 1 &&
-				!huge_class_size) {
-			huge_class_size = size;
+				!pool->huge_class_size) {
+			pool->huge_class_size = size;
 			/*
 			 * The object uses ZS_HANDLE_SIZE bytes to store the
 			 * handle. We need to subtract it, because zs_malloc()
@@ -2273,7 +2273,7 @@ struct zs_pool *zs_create_pool(const char *name)
 			 * class because it grows by ZS_HANDLE_SIZE extra bytes
 			 * right before class lookup.
 			 */
-			huge_class_size -= (ZS_HANDLE_SIZE - 1);
+			pool->huge_class_size -= (ZS_HANDLE_SIZE - 1);
 		}
 
 		/*
-- 
2.38.1.273.g43a17bfeac-goog



* [PATCHv4 5/9] zram: huge size watermark cannot be global
  2022-10-31  5:40 [PATCHv4 0/9] zsmalloc/zram: configurable zspage size Sergey Senozhatsky
                   ` (3 preceding siblings ...)
  2022-10-31  5:41 ` [PATCHv4 4/9] zsmalloc: make huge class watermark zs_pool member Sergey Senozhatsky
@ 2022-10-31  5:41 ` Sergey Senozhatsky
  2022-10-31  5:41 ` [PATCHv4 6/9] zsmalloc: pass limit on pages per-zspage to zs_create_pool() Sergey Senozhatsky
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 34+ messages in thread
From: Sergey Senozhatsky @ 2022-10-31  5:41 UTC (permalink / raw)
  To: Andrew Morton, Minchan Kim
  Cc: Nitin Gupta, linux-kernel, linux-mm, Sergey Senozhatsky

ZRAM will pass a pool-specific limit on the number of pages
per-zspage, which will affect the pool's characteristics, namely
the huge size class watermark value. Move huge_class_size to
struct zram, because this value can now be unique to the pool
(zram device).

Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
---
 drivers/block/zram/zram_drv.c | 13 +++----------
 drivers/block/zram/zram_drv.h |  5 +++++
 2 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index fd31beb6491a..90b0c66bbd5b 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -50,12 +50,6 @@ static const char *default_comp_algs[ZRAM_MAX_ZCOMPS] = {
 
 /* Module params (documentation at end) */
 static unsigned int num_devices = 1;
-/*
- * Pages that compress to sizes equals or greater than this are stored
- * uncompressed in memory.
- */
-static size_t huge_class_size;
-
 static const struct block_device_operations zram_devops;
 
 static void zram_free_page(struct zram *zram, size_t index);
@@ -1259,8 +1253,7 @@ static bool zram_meta_alloc(struct zram *zram, u64 disksize)
 		return false;
 	}
 
-	if (!huge_class_size)
-		huge_class_size = zs_huge_class_size(zram->mem_pool);
+	zram->huge_class_size = zs_huge_class_size(zram->mem_pool);
 	return true;
 }
 
@@ -1488,7 +1481,7 @@ static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec,
 		return ret;
 	}
 
-	if (comp_len >= huge_class_size)
+	if (comp_len >= zram->huge_class_size)
 		comp_len = PAGE_SIZE;
 	/*
 	 * handle allocation has 2 paths:
@@ -1665,7 +1658,7 @@ static int zram_recompress(struct zram *zram, u32 index, struct page *page,
 	 * in a way that will save us memory. Mark the object so that we
 	 * don't attempt to re-compress it again (RECOMP_SKIP).
 	 */
-	if (comp_len_next >= huge_class_size ||
+	if (comp_len_next >= zram->huge_class_size ||
 	    comp_len_next >= comp_len_prev ||
 	    class_index_next >= class_index_prev ||
 	    ret) {
diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
index 09b9ceb5dfa3..9d6fcfdf7aa7 100644
--- a/drivers/block/zram/zram_drv.h
+++ b/drivers/block/zram/zram_drv.h
@@ -120,6 +120,11 @@ struct zram {
 	 */
 	u64 disksize;	/* bytes */
 	const char *comp_algs[ZRAM_MAX_ZCOMPS];
+	/*
+	 * Pages that compress to sizes equal or greater than this are stored
+	 * uncompressed in memory.
+	 */
+	size_t huge_class_size;
 	/*
 	 * zram is claimed so open request will be failed
 	 */
-- 
2.38.1.273.g43a17bfeac-goog



* [PATCHv4 6/9] zsmalloc: pass limit on pages per-zspage to zs_create_pool()
  2022-10-31  5:40 [PATCHv4 0/9] zsmalloc/zram: configurable zspage size Sergey Senozhatsky
                   ` (4 preceding siblings ...)
  2022-10-31  5:41 ` [PATCHv4 5/9] zram: huge size watermark cannot be global Sergey Senozhatsky
@ 2022-10-31  5:41 ` Sergey Senozhatsky
  2022-11-09  6:24   ` Sergey Senozhatsky
  2022-11-11  2:10   ` Minchan Kim
  2022-10-31  5:41 ` [PATCHv4 7/9] zram: add pages_per_pool_page device attribute Sergey Senozhatsky
                   ` (3 subsequent siblings)
  9 siblings, 2 replies; 34+ messages in thread
From: Sergey Senozhatsky @ 2022-10-31  5:41 UTC (permalink / raw)
  To: Andrew Morton, Minchan Kim
  Cc: Nitin Gupta, linux-kernel, linux-mm, Sergey Senozhatsky

Allow the zsmalloc pool owner to specify the max number of pages
per-zspage (during pool creation), so that different pools can
have different characteristics.

By default we pass ZS_DEFAULT_PAGES_PER_ZSPAGE, which is 4 (and
matches the current order-2 zspage limit).
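
A caller-side sketch of the new interface (pool name and error handling
here are illustrative):

	struct zs_pool *pool;

	pool = zs_create_pool("my_pool", ZS_DEFAULT_PAGES_PER_ZSPAGE);
	if (!pool)
		return -ENOMEM;
	/* NULL is also returned for an out-of-range pages limit */
	...
	zs_destroy_pool(pool);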

Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
---
 drivers/block/zram/zram_drv.c |  3 ++-
 include/linux/zsmalloc.h      |  2 +-
 mm/zsmalloc.c                 | 11 +++++++----
 3 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 90b0c66bbd5b..bec02f636bce 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1247,7 +1247,8 @@ static bool zram_meta_alloc(struct zram *zram, u64 disksize)
 	if (!zram->table)
 		return false;
 
-	zram->mem_pool = zs_create_pool(zram->disk->disk_name);
+	zram->mem_pool = zs_create_pool(zram->disk->disk_name,
+					ZS_DEFAULT_PAGES_PER_ZSPAGE);
 	if (!zram->mem_pool) {
 		vfree(zram->table);
 		return false;
diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h
index b6b8654a2d45..28f2b9cb1c47 100644
--- a/include/linux/zsmalloc.h
+++ b/include/linux/zsmalloc.h
@@ -50,7 +50,7 @@ struct zs_pool_stats {
 
 struct zs_pool;
 
-struct zs_pool *zs_create_pool(const char *name);
+struct zs_pool *zs_create_pool(const char *name, unsigned long num_pages);
 void zs_destroy_pool(struct zs_pool *pool);
 
 unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t flags);
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index d329bd673baa..42987a913f45 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -366,7 +366,7 @@ static void *zs_zpool_create(const char *name, gfp_t gfp,
 	 * different contexts and its caller must provide a valid
 	 * gfp mask.
 	 */
-	return zs_create_pool(name);
+	return zs_create_pool(name, ZS_DEFAULT_PAGES_PER_ZSPAGE);
 }
 
 static void zs_zpool_destroy(void *pool)
@@ -2195,6 +2195,7 @@ static int zs_register_shrinker(struct zs_pool *pool)
 /**
  * zs_create_pool - Creates an allocation pool to work from.
  * @name: pool name to be created
+ * @num_pages: maximum number of pages per-zspage
  *
  * This function must be called before anything when using
  * the zsmalloc allocator.
@@ -2202,18 +2203,20 @@ static int zs_register_shrinker(struct zs_pool *pool)
  * On success, a pointer to the newly created pool is returned,
  * otherwise NULL.
  */
-struct zs_pool *zs_create_pool(const char *name)
+struct zs_pool *zs_create_pool(const char *name, unsigned long num_pages)
 {
 	int i;
 	struct zs_pool *pool;
 	struct size_class *prev_class = NULL;
-	unsigned long num_pages;
+
+	if (WARN_ON(num_pages < ZS_MIN_PAGES_PER_ZSPAGE ||
+		    num_pages > ZS_MAX_PAGES_PER_ZSPAGE))
+		return NULL;
 
 	pool = kzalloc(sizeof(*pool), GFP_KERNEL);
 	if (!pool)
 		return NULL;
 
-	num_pages = ZS_DEFAULT_PAGES_PER_ZSPAGE;
 	/* min_alloc_size must be multiple of ZS_ALIGN */
 	pool->min_alloc_size = num_pages << PAGE_SHIFT >> OBJ_INDEX_BITS;
 	pool->min_alloc_size = max(pool->min_alloc_size, ZS_MIN_ALLOC_SIZE);
-- 
2.38.1.273.g43a17bfeac-goog



* [PATCHv4 7/9] zram: add pages_per_pool_page device attribute
  2022-10-31  5:40 [PATCHv4 0/9] zsmalloc/zram: configurable zspage size Sergey Senozhatsky
                   ` (5 preceding siblings ...)
  2022-10-31  5:41 ` [PATCHv4 6/9] zsmalloc: pass limit on pages per-zspage to zs_create_pool() Sergey Senozhatsky
@ 2022-10-31  5:41 ` Sergey Senozhatsky
  2022-11-09  4:34   ` Sergey Senozhatsky
  2022-10-31  5:41 ` [PATCHv4 8/9] Documentation: document zram pages_per_pool_page attribute Sergey Senozhatsky
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 34+ messages in thread
From: Sergey Senozhatsky @ 2022-10-31  5:41 UTC (permalink / raw)
  To: Andrew Morton, Minchan Kim
  Cc: Nitin Gupta, linux-kernel, linux-mm, Sergey Senozhatsky

Add a new sysfs knob that allows user-space to set the zsmalloc
pages per-zspage limit on a per-device basis.

Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
---
 drivers/block/zram/zram_drv.c | 44 ++++++++++++++++++++++++++++++++++-
 drivers/block/zram/zram_drv.h |  2 ++
 2 files changed, 45 insertions(+), 1 deletion(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index bec02f636bce..cf9d3474b80c 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1180,6 +1180,45 @@ static ssize_t mm_stat_show(struct device *dev,
 	return ret;
 }
 
+static ssize_t pages_per_pool_page_show(struct device *dev,
+					struct device_attribute *attr,
+					char *buf)
+{
+	u32 val;
+	struct zram *zram = dev_to_zram(dev);
+
+	down_read(&zram->init_lock);
+	val = zram->pages_per_pool_page;
+	up_read(&zram->init_lock);
+
+	return scnprintf(buf, PAGE_SIZE, "%u\n", val);
+}
+
+static ssize_t pages_per_pool_page_store(struct device *dev,
+					 struct device_attribute *attr,
+					 const char *buf, size_t len)
+{
+	struct zram *zram = dev_to_zram(dev);
+	u32 val;
+
+	if (kstrtou32(buf, 10, &val))
+		return -EINVAL;
+
+	if (val < ZS_MIN_PAGES_PER_ZSPAGE || val > ZS_MAX_PAGES_PER_ZSPAGE)
+		return -EINVAL;
+
+	down_read(&zram->init_lock);
+	if (init_done(zram)) {
+		up_read(&zram->init_lock);
+		return -EINVAL;
+	}
+
+	zram->pages_per_pool_page = val;
+	up_read(&zram->init_lock);
+
+	return len;
+}
+
 #ifdef CONFIG_ZRAM_WRITEBACK
 #define FOUR_K(x) ((x) * (1 << (PAGE_SHIFT - 12)))
 static ssize_t bd_stat_show(struct device *dev,
@@ -1248,7 +1287,7 @@ static bool zram_meta_alloc(struct zram *zram, u64 disksize)
 		return false;
 
 	zram->mem_pool = zs_create_pool(zram->disk->disk_name,
-					ZS_DEFAULT_PAGES_PER_ZSPAGE);
+					zram->pages_per_pool_page);
 	if (!zram->mem_pool) {
 		vfree(zram->table);
 		return false;
@@ -2174,6 +2213,7 @@ static DEVICE_ATTR_RW(writeback_limit_enable);
 static DEVICE_ATTR_RW(recomp_algorithm);
 static DEVICE_ATTR_WO(recompress);
 #endif
+static DEVICE_ATTR_RW(pages_per_pool_page);
 
 static struct attribute *zram_disk_attrs[] = {
 	&dev_attr_disksize.attr,
@@ -2201,6 +2241,7 @@ static struct attribute *zram_disk_attrs[] = {
 	&dev_attr_recomp_algorithm.attr,
 	&dev_attr_recompress.attr,
 #endif
+	&dev_attr_pages_per_pool_page.attr,
 	NULL,
 };
 
@@ -2238,6 +2279,7 @@ static int zram_add(void)
 		goto out_free_idr;
 	}
 
+	zram->pages_per_pool_page = ZS_DEFAULT_PAGES_PER_ZSPAGE;
 	zram->disk->major = zram_major;
 	zram->disk->first_minor = device_id;
 	zram->disk->minors = 1;
diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
index 9d6fcfdf7aa7..bdfc9bf0bdd5 100644
--- a/drivers/block/zram/zram_drv.h
+++ b/drivers/block/zram/zram_drv.h
@@ -120,6 +120,8 @@ struct zram {
 	 */
 	u64 disksize;	/* bytes */
 	const char *comp_algs[ZRAM_MAX_ZCOMPS];
+
+	u32 pages_per_pool_page;
 	/*
 	 * Pages that compress to sizes equal or greater than this are stored
 	 * uncompressed in memory.
-- 
2.38.1.273.g43a17bfeac-goog



* [PATCHv4 8/9] Documentation: document zram pages_per_pool_page attribute
  2022-10-31  5:40 [PATCHv4 0/9] zsmalloc/zram: configurable zspage size Sergey Senozhatsky
                   ` (6 preceding siblings ...)
  2022-10-31  5:41 ` [PATCHv4 7/9] zram: add pages_per_pool_page device attribute Sergey Senozhatsky
@ 2022-10-31  5:41 ` Sergey Senozhatsky
  2022-11-11  2:20   ` Minchan Kim
  2022-10-31  5:41 ` [PATCHv4 9/9] zsmalloc: break out of loop when found perfect zspage order Sergey Senozhatsky
  2022-11-10 22:44 ` [PATCHv4 0/9] zsmalloc/zram: configurable zspage size Minchan Kim
  9 siblings, 1 reply; 34+ messages in thread
From: Sergey Senozhatsky @ 2022-10-31  5:41 UTC (permalink / raw)
  To: Andrew Morton, Minchan Kim
  Cc: Nitin Gupta, linux-kernel, linux-mm, Sergey Senozhatsky

Provide simple documentation for the pages_per_pool_page ZRAM
device attribute.

Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
---
 Documentation/admin-guide/blockdev/zram.rst | 38 ++++++++++++++++-----
 1 file changed, 30 insertions(+), 8 deletions(-)

diff --git a/Documentation/admin-guide/blockdev/zram.rst b/Documentation/admin-guide/blockdev/zram.rst
index 010fb05a5999..4cb287520d45 100644
--- a/Documentation/admin-guide/blockdev/zram.rst
+++ b/Documentation/admin-guide/blockdev/zram.rst
@@ -112,7 +112,29 @@ to list all of them using, for instance, /proc/crypto or any other
 method. This, however, has an advantage of permitting the usage of
 custom crypto compression modules (implementing S/W or H/W compression).
 
-4) Set Disksize
+4) Set pages per-pool page limit: Optional
+==========================================
+
+A zsmalloc zspage can consist of up to ZS_DEFAULT_PAGES_PER_ZSPAGE (single)
+physical pages. The exact number is calculated for each zsmalloc size
+class during zsmalloc pool creation. ZRAM provides the pages_per_pool_page
+device attribute that lets one adjust that limit (the maximum possible value
+is ZS_MAX_PAGES_PER_ZSPAGE). The default limit is considered to be good
+enough, so tweak this value only when the changes in zsmalloc size class
+characteristics are beneficial for your data patterns. The limit on the
+pages per zspage currently must be in the [1,16] range; the default value
+is 4.
+
+Examples::
+
+	#show current zsmalloc pages per-pool page limit
+	cat /sys/block/zramX/pages_per_pool_page
+	4
+
+	#set zsmalloc pages per-pool page limit
+	echo 8 > /sys/block/zramX/pages_per_pool_page
+
+5) Set Disksize
 ===============
 
 Set disk size by writing the value to sysfs node 'disksize'.
@@ -132,7 +154,7 @@ There is little point creating a zram of greater than twice the size of memory
 since we expect a 2:1 compression ratio. Note that zram uses about 0.1% of the
 size of the disk when not in use so a huge zram is wasteful.
 
-5) Set memory limit: Optional
+6) Set memory limit: Optional
 =============================
 
 Set memory limit by writing the value to sysfs node 'mem_limit'.
@@ -151,7 +173,7 @@ Examples::
 	# To disable memory limit
 	echo 0 > /sys/block/zram0/mem_limit
 
-6) Activate
+7) Activate
 ===========
 
 ::
@@ -162,7 +184,7 @@ Examples::
 	mkfs.ext4 /dev/zram1
 	mount /dev/zram1 /tmp
 
-7) Add/remove zram devices
+8) Add/remove zram devices
 ==========================
 
 zram provides a control interface, which enables dynamic (on-demand) device
@@ -182,7 +204,7 @@ execute::
 
 	echo X > /sys/class/zram-control/hot_remove
 
-8) Stats
+9) Stats
 ========
 
 Per-device statistics are exported as various nodes under /sys/block/zram<id>/
@@ -283,15 +305,15 @@ a single line of text and contains the following stats separated by whitespace:
 		Unit: 4K bytes
  ============== =============================================================
 
-9) Deactivate
-=============
+10) Deactivate
+==============
 
 ::
 
 	swapoff /dev/zram0
 	umount /dev/zram1
 
-10) Reset
+11) Reset
 =========
 
 	Write any positive value to 'reset' sysfs node::
-- 
2.38.1.273.g43a17bfeac-goog



* [PATCHv4 9/9] zsmalloc: break out of loop when found perfect zspage order
  2022-10-31  5:40 [PATCHv4 0/9] zsmalloc/zram: configurable zspage size Sergey Senozhatsky
                   ` (7 preceding siblings ...)
  2022-10-31  5:41 ` [PATCHv4 8/9] Documentation: document zram pages_per_pool_page attribute Sergey Senozhatsky
@ 2022-10-31  5:41 ` Sergey Senozhatsky
  2022-11-10 22:44 ` [PATCHv4 0/9] zsmalloc/zram: configurable zspage size Minchan Kim
  9 siblings, 0 replies; 34+ messages in thread
From: Sergey Senozhatsky @ 2022-10-31  5:41 UTC (permalink / raw)
  To: Andrew Morton, Minchan Kim
  Cc: Nitin Gupta, linux-kernel, linux-mm, Sergey Senozhatsky

If we find a zspage configuration that gives us a perfect 100%
used percentage (zero wasted space) then there is no point in
trying any other configuration. For example, for class size 2048
a single 4K page already gives 100% utilization, so there is no
need to probe 2..16 page configurations.

Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
---
 mm/zsmalloc.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 42987a913f45..a40c548520d3 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -787,6 +787,9 @@ static int get_pages_per_zspage(u32 class_size, u32 num_pages)
 			max_usedpc = usedpc;
 			max_usedpc_order = i;
 		}
+
+		if (usedpc == 100)
+			break;
 	}
 
 	return max_usedpc_order;
-- 
2.38.1.273.g43a17bfeac-goog



* Re: [PATCHv4 7/9] zram: add pages_per_pool_page device attribute
  2022-10-31  5:41 ` [PATCHv4 7/9] zram: add pages_per_pool_page device attribute Sergey Senozhatsky
@ 2022-11-09  4:34   ` Sergey Senozhatsky
  0 siblings, 0 replies; 34+ messages in thread
From: Sergey Senozhatsky @ 2022-11-09  4:34 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Nitin Gupta, linux-kernel, linux-mm, Sergey Senozhatsky

On (22/10/31 14:41), Sergey Senozhatsky wrote:
[..]
>  	zram->mem_pool = zs_create_pool(zram->disk->disk_name,
> -					ZS_DEFAULT_PAGES_PER_ZSPAGE);
> +					zram->pages_per_pool_page);
>  	if (!zram->mem_pool) {
>  		vfree(zram->table);
>  		return false;
> @@ -2174,6 +2213,7 @@ static DEVICE_ATTR_RW(writeback_limit_enable);
>  static DEVICE_ATTR_RW(recomp_algorithm);
>  static DEVICE_ATTR_WO(recompress);
>  #endif
> +static DEVICE_ATTR_RW(pages_per_pool_page);

Maybe we can have a more generic "allocator_tunables" device attribute
that supports named parameters instead. E.g.

	pool_page_len_limit=INT

And more in the future.

I'm having a hard time coming up with good names here. max_pages_per_zspage
is too low-level and exposes zsmalloc internals, while in theory zram
can use different allocators at some point, and those allocators can
have different tunables.
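
E.g. (both the attribute and the parameter name here are hypothetical,
purely to illustrate the idea):

	echo "pool_page_len_limit=8" > /sys/block/zram0/allocator_tunables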

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCHv4 6/9] zsmalloc: pass limit on pages per-zspage to zs_create_pool()
  2022-10-31  5:41 ` [PATCHv4 6/9] zsmalloc: pass limit on pages per-zspage to zs_create_pool() Sergey Senozhatsky
@ 2022-11-09  6:24   ` Sergey Senozhatsky
  2022-11-11 17:14     ` Minchan Kim
  2022-11-11  2:10   ` Minchan Kim
  1 sibling, 1 reply; 34+ messages in thread
From: Sergey Senozhatsky @ 2022-11-09  6:24 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Sergey Senozhatsky, Nitin Gupta, linux-kernel, linux-mm

On (22/10/31 14:41), Sergey Senozhatsky wrote:
[..]
> -struct zs_pool *zs_create_pool(const char *name)
> +struct zs_pool *zs_create_pool(const char *name, unsigned long num_pages)
>  {
>  	int i;
>  	struct zs_pool *pool;
>  	struct size_class *prev_class = NULL;
> -	unsigned long num_pages;
> +
> +	if (WARN_ON(num_pages < ZS_MIN_PAGES_PER_ZSPAGE ||
> +		    num_pages > ZS_MAX_PAGES_PER_ZSPAGE))
> +		return NULL;

I tend to think that creating a `struct zs_tunables` would be better. For
the time being zs_tunables will contain only one member, max_zspage_len,
but it can be extended in the future.
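
A minimal sketch of what that could look like (nothing below is in the
posted series; the names are just from this thread):

	struct zs_tunables {
		/* max number of physical pages per zspage */
		unsigned long max_zspage_len;
	};

	struct zs_pool *zs_create_pool(const char *name,
				       const struct zs_tunables *tunables);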

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCHv4 2/9] zsmalloc: turn zspage order into runtime variable
  2022-10-31  5:41 ` [PATCHv4 2/9] zsmalloc: turn zspage order into runtime variable Sergey Senozhatsky
@ 2022-11-10 21:59   ` Minchan Kim
  2022-11-11 10:38     ` Sergey Senozhatsky
  0 siblings, 1 reply; 34+ messages in thread
From: Minchan Kim @ 2022-11-10 21:59 UTC (permalink / raw)
  To: Sergey Senozhatsky; +Cc: Andrew Morton, Nitin Gupta, linux-kernel, linux-mm

On Mon, Oct 31, 2022 at 02:41:01PM +0900, Sergey Senozhatsky wrote:
> zsmalloc has 255 size classes. Size classes contain a number of zspages,
> which store objects of the same size. zspage can consist of up to four
> physical pages. The exact (most optimal) zspage size is calculated for
> each size class during zsmalloc pool creation.
> 
> As a reasonable optimization, zsmalloc merges size classes that have
> similar characteristics: number of pages per zspage and number of
> objects zspage can store.
> 
> For example, let's look at the following size classes:
> 
> class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
> ..
>    94  1536           0            0             0          0          0                3        0
>   100  1632           0            0             0          0          0                2        0
> ..
> 
> Size classes #95-99 are merged with size class #100. That is, each time
> we store an object of size, say, 1568 bytes instead of using class #96
> we end up storing it in size class #100. Class #100 is for objects of
> 1632 bytes in size, hence every 1568 bytes object wastes 1632-1568 bytes.
> Class #100 zspages consist of 2 physical pages and can hold 5 objects.
> When we need to store, say, 13 objects of size 1568 we end up allocating
> three zspages; in other words, 6 physical pages.
> 
> However, if we look closer at size class #96 (which should hold objects
> of size 1568 bytes) and trace get_pages_per_zspage():
> 
>     pages per zspage      wasted bytes     used%
>            1                  960           76
>            2                  352           95
>            3                 1312           89
>            4                  704           95
>            5                   96           99
> 
> We'd notice that the optimal zspage configuration for this class is
> when it consists of 5 physical pages, but currently we never let zspages
> consist of more than 4 pages. A 5 page class #96 configuration would
> store 13 objects of size 1568 in a single zspage, allocating 5 physical
> pages, as opposed to 6 physical pages that class #100 will allocate.
> 
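(For reference, the table above follows from get_pages_per_zspage()
arithmetic, assuming 4K pages: for a zspage of i physical pages,
waste = (i * 4096) % 1568 and used% = (i * 4096 - waste) * 100 / (i * 4096);
e.g. for i = 5: waste = 20480 % 1568 = 96, used% = 20384 * 100 / 20480 = 99.)
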
> A higher order zspage for class #96 also changes its key characteristics:
> pages per-zspage and objects per-zspage. As a result classes #96 and #100
> are not merged anymore, which gives us more compact zsmalloc.
> 
> Of course the described effect does not apply only to size classes #96
> and #100. We still merge classes, but less often. In other words, classes
> are grouped in a more compact way, which decreases memory wastage:
> 
> zspage order               # unique size classes
>      2                                69
>      3                               123
>      4                               191
> 
> Let's take a closer look at the bottom of /sys/kernel/debug/zsmalloc/zram0/classes:
> 
> class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
> ...
>   202  3264           0            0             0          0          0                4        0
>   254  4096           0            0             0          0          0                1        0
> ...
> 
> For exactly the same reason - maximum 4 pages per zspage - the last non-huge
> size class is #202, which stores objects of size 3264 bytes. Any object
> larger than 3264 bytes, hence, is considered to be huge and lands in size
> class #254, which uses a whole physical page to store every object. To put
> it slightly differently - objects in huge classes don't share physical pages.
> 
> 3264 bytes is too low of a watermark and we have too many huge classes:
> classes from #203 to #254. Similarly to size class #96 above, higher order
> zspages change key characteristics for some of those huge size classes and
> thus those classes become normal classes, where stored objects share physical
> pages.
> 
> Hence yet another consequence of higher order zspages: we move the huge
> size class watermark with higher order zspages, have fewer huge classes and
> store large objects in a more compact way.
> 
> For order 3, huge class watermark becomes 3632 bytes:
> 
> class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
> ...
>   202  3264           0            0             0          0          0                4        0
>   211  3408           0            0             0          0          0                5        0
>   217  3504           0            0             0          0          0                6        0
>   222  3584           0            0             0          0          0                7        0
>   225  3632           0            0             0          0          0                8        0
>   254  4096           0            0             0          0          0                1        0
> ...
> 
> For order 4, huge class watermark becomes 3840 bytes:
> 
> class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
> ...
>   202  3264           0            0             0          0          0                4        0
>   206  3328           0            0             0          0          0               13        0
>   207  3344           0            0             0          0          0                9        0
>   208  3360           0            0             0          0          0               14        0
>   211  3408           0            0             0          0          0                5        0
>   212  3424           0            0             0          0          0               16        0
>   214  3456           0            0             0          0          0               11        0
>   217  3504           0            0             0          0          0                6        0
>   219  3536           0            0             0          0          0               13        0
>   222  3584           0            0             0          0          0                7        0
>   223  3600           0            0             0          0          0               15        0
>   225  3632           0            0             0          0          0                8        0
>   228  3680           0            0             0          0          0                9        0
>   230  3712           0            0             0          0          0               10        0
>   232  3744           0            0             0          0          0               11        0
>   234  3776           0            0             0          0          0               12        0
>   235  3792           0            0             0          0          0               13        0
>   236  3808           0            0             0          0          0               14        0
>   238  3840           0            0             0          0          0               15        0
>   254  4096           0            0             0          0          0                1        0
> ...
> 
> TESTS
> =====
> 
> Test untars linux-6.0.tar.xz and compiles the kernel.
> 
> zram is configured as a block device with ext4 file system, lzo-rle
> compression algorithm. We captured /sys/block/zram0/mm_stat after
> every test and rebooted the VM.
> 
> orig_data_size       mem_used_total     mem_used_max       pages_compacted
>           compr_data_size         mem_limit         same_pages       huge_pages
> 
> ORDER 2 (BASE) zspage
> 
> 1691791360 628086729 655171584        0 655171584       60        0    34043
> 1691787264 628089196 655175680        0 655175680       60        0    34046
> 1691803648 628098840 655187968        0 655187968       59        0    34047
> 1691795456 628091503 655183872        0 655183872       60        0    34044
> 1691799552 628086877 655183872        0 655183872       60        0    34047
> 
> ORDER 3 zspage
> 
> 1691803648 627792993 641794048        0 641794048       60        0    33591
> 1691787264 627779342 641708032        0 641708032       59        0    33591
> 1691811840 627786616 641769472        0 641769472       60        0    33591
> 1691803648 627794468 641818624        0 641818624       59        0    33592
> 1691783168 627780882 641794048        0 641794048       61        0    33591
> 
> ORDER 4 zspage
> 
> 1691803648 627726635 639655936        0 639655936       60        0    33435
> 1691811840 627733348 639643648        0 639643648       61        0    33434
> 1691795456 627726290 639614976        0 639614976       60        0    33435
> 1691803648 627730458 639688704        0 639688704       60        0    33434
> 1691811840 627727771 639688704        0 639688704       60        0    33434
> 
> Order 3 and order 4 show statistically significant improvement in
> `mem_used_max` metrics.
> 
> T-test for order 3:
> 
> x order-2-maxmem
> + order-3-maxmem
>     N           Min           Max        Median           Avg        Stddev
> x   5 6.5517158e+08 6.5518797e+08 6.5518387e+08  6.551806e+08     6730.4157
> +   5 6.4170803e+08 6.4181862e+08 6.4179405e+08 6.4177684e+08     42210.666
> Difference at 95.0% confidence
> 	-1.34038e+07 +/- 44080.7
> 	-2.04581% +/- 0.00672802%
> 	(Student's t, pooled s = 30224.5)
> 
> T-test for order 4:
> 
> x order-2-maxmem
> + order-4-maxmem
>     N           Min           Max        Median           Avg        Stddev
> x   5 6.5517158e+08 6.5518797e+08 6.5518387e+08  6.551806e+08     6730.4157
> +   5 6.3961498e+08  6.396887e+08 6.3965594e+08 6.3965839e+08     31408.602
> Difference at 95.0% confidence
> 	-1.55222e+07 +/- 33126.2
> 	-2.36915% +/- 0.00505604%
> 	(Student's t, pooled s = 22713.4)
> 
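The comparison output above is in ministat(1) format; assuming the raw
mem_used_max samples are stored one per line per file, it can be
reproduced with:

	ministat -w 74 order-2-maxmem order-3-maxmem
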
> This test tends to benefit more from order 4 zspages, due to the test's
> data patterns.
> 
> zsmalloc object distribution analysis
> =============================================================================
> 
> Order 2 (4 pages per zspage) tends to put many objects in size class 2048,
> which is merged with size classes #112-#125:
> 
> class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
> ...
>     71  1168           0            0          6146       6146       1756                2        0
>     74  1216           0            1          4560       4552       1368                3        0
>     76  1248           0            1          2938       2934        904                4        0
>     83  1360           0            0         10971      10971       3657                1        0
>     91  1488           0            0         16126      16126       5864                4        0
>     94  1536           0            1          5912       5908       2217                3        0
>    100  1632           0            0         11990      11990       4796                2        0
>    107  1744           0            1         15771      15768       6759                3        0
>    111  1808           0            1         10386      10380       4616                4        0
>    126  2048           0            0         45444      45444      22722                1        0
>    144  2336           0            0         47446      47446      27112                4        0
>    151  2448           1            0         10760      10759       6456                3        0
>    168  2720           0            0         10173      10173       6782                2        0
>    190  3072           0            1          1700       1697       1275                3        0
>    202  3264           0            1           290        286        232                4        0
>    254  4096           0            0         34051      34051      34051                1        0
> 
> Order 3 (8 pages per zspage) changed pool characteristics and unmerged
> some of the size classes, which resulted in fewer objects being put into
> size class 2048, because lower size classes are now available
> for more compact object storage:
> 
> class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
> ...
>     71  1168           0            1          2996       2994        856                2        0
>     72  1184           0            1          1632       1609        476                7        0
>     73  1200           1            0          1445       1442        425                5        0
>     74  1216           0            0          1510       1510        453                3        0
>     75  1232           0            1          1495       1479        455                7        0
>     76  1248           0            1          1456       1451        448                4        0
>     78  1280           0            1          3040       3033        950                5        0
>     79  1296           0            1          1584       1571        504                7        0
>     83  1360           0            0          6375       6375       2125                1        0
>     84  1376           0            1          1817       1796        632                8        0
>     87  1424           0            1          6020       6006       2107                7        0
>     88  1440           0            1          2108       2101        744                6        0
>     89  1456           0            1          2072       2064        740                5        0
>     91  1488           0            1          4169       4159       1516                4        0
>     92  1504           0            1          2014       2007        742                7        0
>     94  1536           0            1          3904       3900       1464                3        0
>     95  1552           0            1          1890       1873        720                8        0
>     96  1568           0            1          1963       1958        755                5        0
>     97  1584           0            1          1980       1974        770                7        0
>    100  1632           0            1          6190       6187       2476                2        0
>    103  1680           0            0          6477       6477       2667                7        0
>    104  1696           0            1          2256       2253        940                5        0
>    105  1712           0            1          2356       2340        992                8        0
>    107  1744           1            0          4697       4696       2013                3        0
>    110  1792           0            1          7744       7734       3388                7        0
>    111  1808           0            1          2655       2649       1180                4        0
>    114  1856           0            1          8371       8365       3805                5        0
>    116  1888           1            0          5863       5862       2706                6        0
>    117  1904           0            1          2955       2942       1379                7        0
>    118  1920           0            1          3009       2997       1416                8        0
>    126  2048           0            0         25276      25276      12638                1        0
>    128  2080           0            1          6060       6052       3232                8        0
>    129  2096           1            0          3081       3080       1659                7        0
>    134  2176           0            1         14835      14830       7912                8        0
>    135  2192           0            1          2769       2758       1491                7        0
>    137  2224           0            1          5082       5077       2772                6        0
>    140  2272           0            1          7236       7232       4020                5        0
>    144  2336           0            1          8428       8423       4816                4        0
>    147  2384           0            1          5316       5313       3101                7        0
>    151  2448           0            1          5445       5443       3267                3        0
>    155  2512           0            0          4121       4121       2536                8        0
>    158  2560           0            1          2208       2205       1380                5        0
>    160  2592           0            0          1133       1133        721                7        0
>    168  2720           0            0          2712       2712       1808                2        0
>    177  2864           1            0          1100       1098        770                7        0
>    180  2912           0            1           189        183        135                5        0
>    184  2976           0            1           176        166        128                8        0
>    190  3072           0            0           252        252        189                3        0
>    197  3184           0            1           198        192        154                7        0
>    202  3264           0            1           100         96         80                4        0
>    211  3408           0            1           210        208        175                5        0
>    217  3504           0            1            98         94         84                6        0
>    222  3584           0            0           104        104         91                7        0
>    225  3632           0            1            54         50         48                8        0
>    254  4096           0            0         33591      33591      33591                1        0
> 
> Note, the huge size watermark is above 3632 and there are a number of new
> normal classes available that previously were merged with the huge class.
> For instance, size class #211 holds 210 objects of size 3408 and uses 175
> physical pages, while previously for those objects we would have used 210
> physical pages.
> 
> Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
> ---
>  include/linux/zsmalloc.h | 12 +++++++
>  mm/zsmalloc.c            | 73 +++++++++++++++++++++++-----------------
>  2 files changed, 54 insertions(+), 31 deletions(-)
> 
> diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h
> index a48cd0ffe57d..6cd1d95b928a 100644
> --- a/include/linux/zsmalloc.h
> +++ b/include/linux/zsmalloc.h
> @@ -33,6 +33,18 @@ enum zs_mapmode {
>  	 */
>  };
>  
> +#define ZS_PAGE_ORDER_2		2
> +#define ZS_PAGE_ORDER_4		4
> +
> +/*
> + * A single 'zspage' is composed of up to 2^N discontiguous 0-order (single)
> + * pages. ZS_MAX_PAGE_ORDER defines upper limit on N, ZS_MIN_PAGE_ORDER
> + * defines lower limit on N. ZS_DEFAULT_PAGE_ORDER is recommended value.

It gives the impression:

   2^2 <= the page number of a zspage <= 2^4

I think that's not what you want to describe. How about?

A single 'zspage' is composed of up to 2^N discontiguous 0-order (single)
pages and the N can be from ZS_MIN_PAGE_ORDER to ZS_MAX_PAGE_ORDER.

> + */
> +#define ZS_MIN_PAGE_ORDER	ZS_PAGE_ORDER_2
> +#define ZS_MAX_PAGE_ORDER	ZS_PAGE_ORDER_4
> +#define ZS_DEFAULT_PAGE_ORDER	ZS_PAGE_ORDER_2

#define ZS_MIN_PAGE_ORDER	2

We can use the number directly instead of another wrapper, at least
in this patch. (Just in case: if you want to extend it in a later patch,
please do it in that patch.)

> +
>  struct zs_pool_stats {
>  	/* How many pages were migrated (freed) */
>  	atomic_long_t pages_compacted;
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index 065744b7e9d8..a9773566f85b 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -74,12 +74,7 @@
>   */
>  #define ZS_ALIGN		8
>  
> -/*
> - * A single 'zspage' is composed of up to 2^N discontiguous 0-order (single)
> - * pages. ZS_MAX_ZSPAGE_ORDER defines upper limit on N.
> - */
> -#define ZS_MAX_ZSPAGE_ORDER 2
> -#define ZS_MAX_PAGES_PER_ZSPAGE (_AC(1, UL) << ZS_MAX_ZSPAGE_ORDER)
> +#define ZS_MAX_PAGES_PER_ZSPAGE	(_AC(1, UL) << ZS_MAX_PAGE_ORDER)
>  
>  #define ZS_HANDLE_SIZE (sizeof(unsigned long))
>  
> @@ -124,10 +119,8 @@
>  #define ISOLATED_BITS	3
>  #define MAGIC_VAL_BITS	8
>  
> -#define MAX(a, b) ((a) >= (b) ? (a) : (b))
> -/* ZS_MIN_ALLOC_SIZE must be multiple of ZS_ALIGN */
> -#define ZS_MIN_ALLOC_SIZE \
> -	MAX(32, (ZS_MAX_PAGES_PER_ZSPAGE << PAGE_SHIFT >> OBJ_INDEX_BITS))
> +#define ZS_MIN_ALLOC_SIZE	32U

Let's have some comment here to say that this is not the final value;
the final one is supposed to be pool->min_alloc_size.
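
E.g., one possible wording (just a suggestion, not from the series):

	/*
	 * ZS_MIN_ALLOC_SIZE is only a lower bound; the effective minimum
	 * allocation size is computed at pool creation time and stored in
	 * pool->min_alloc_size.
	 */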

> +
>  /* each chunk includes extra space to keep handle */
>  #define ZS_MAX_ALLOC_SIZE	PAGE_SIZE
>  
> @@ -141,12 +134,10 @@
>   *    determined). NOTE: all those class sizes must be set as multiple of
>   *    ZS_ALIGN to make sure link_free itself never has to span 2 pages.
>   *
> - *  ZS_MIN_ALLOC_SIZE and ZS_SIZE_CLASS_DELTA must be multiple of ZS_ALIGN
> - *  (reason above)
> + *  pool->min_alloc_size (ZS_MIN_ALLOC_SIZE) and ZS_SIZE_CLASS_DELTA must
> + *  be multiple of ZS_ALIGN (reason above)
>   */
>  #define ZS_SIZE_CLASS_DELTA	(PAGE_SIZE >> CLASS_BITS)
> -#define ZS_SIZE_CLASSES	(DIV_ROUND_UP(ZS_MAX_ALLOC_SIZE - ZS_MIN_ALLOC_SIZE, \
> -				      ZS_SIZE_CLASS_DELTA) + 1)
>  
>  enum fullness_group {
>  	ZS_EMPTY,
> @@ -230,12 +221,15 @@ struct link_free {
>  struct zs_pool {
>  	const char *name;
>  
> -	struct size_class *size_class[ZS_SIZE_CLASSES];
> +	struct size_class **size_class;
>  	struct kmem_cache *handle_cachep;
>  	struct kmem_cache *zspage_cachep;
>  
>  	atomic_long_t pages_allocated;
>  
> +	u32 num_size_classes;
> +	u32 min_alloc_size;

Please use int. From this patch, I couldn't figure out why we need a
variable in the pool. Let's have the change in the patch where
you really have the use case.

> +
>  	struct zs_pool_stats stats;
>  
>  	/* Compact classes */
> @@ -523,15 +517,15 @@ static void set_zspage_mapping(struct zspage *zspage,
>   * classes depending on its size. This function returns index of the
>   * size class which has chunk size big enough to hold the given size.
>   */
> -static int get_size_class_index(int size)
> +static int get_size_class_index(struct zs_pool *pool, int size)
>  {
>  	int idx = 0;
>  
> -	if (likely(size > ZS_MIN_ALLOC_SIZE))
> -		idx = DIV_ROUND_UP(size - ZS_MIN_ALLOC_SIZE,
> +	if (likely(size > pool->min_alloc_size))
> +		idx = DIV_ROUND_UP(size - pool->min_alloc_size,
>  				ZS_SIZE_CLASS_DELTA);
>  
> -	return min_t(int, ZS_SIZE_CLASSES - 1, idx);
> +	return min_t(int, pool->num_size_classes - 1, idx);
>  }
>  
>  /* type can be of enum type class_stat_type or fullness_group */
> @@ -591,7 +585,7 @@ static int zs_stats_size_show(struct seq_file *s, void *v)
>  			"obj_allocated", "obj_used", "pages_used",
>  			"pages_per_zspage", "freeable");
>  
> -	for (i = 0; i < ZS_SIZE_CLASSES; i++) {
> +	for (i = 0; i < pool->num_size_classes; i++) {
>  		class = pool->size_class[i];
>  
>  		if (class->index != i)
> @@ -777,13 +771,13 @@ static enum fullness_group fix_fullness_group(struct size_class *class,
>   * link together 3 PAGE_SIZE sized pages to form a zspage
>   * since then we can perfectly fit in 8 such objects.
>   */
> -static int get_pages_per_zspage(int class_size)
> +static int get_pages_per_zspage(u32 class_size, u32 num_pages)

Let's just use int instead of u32.

Why do you need a num_pages argument instead of using 1UL << ZS_DEFAULT_PAGE_ORDER?
It looks like a static value.

>  {
>  	int i, max_usedpc = 0;
>  	/* zspage order which gives maximum used size per KB */
>  	int max_usedpc_order = 1;
>  
> -	for (i = 1; i <= ZS_MAX_PAGES_PER_ZSPAGE; i++) {
> +	for (i = 1; i <= num_pages; i++) {
>  		int zspage_size;
>  		int waste, usedpc;
>  
> @@ -1220,7 +1214,7 @@ unsigned int zs_lookup_class_index(struct zs_pool *pool, unsigned int size)
>  {
>  	struct size_class *class;
>  
> -	class = pool->size_class[get_size_class_index(size)];
> +	class = pool->size_class[get_size_class_index(pool, size)];
>  
>  	return class->index;
>  }
> @@ -1431,7 +1425,7 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
>  
>  	/* extra space in chunk to keep the handle */
>  	size += ZS_HANDLE_SIZE;
> -	class = pool->size_class[get_size_class_index(size)];
> +	class = pool->size_class[get_size_class_index(pool, size)];
>  
>  	/* class->lock effectively protects the zpage migration */
>  	spin_lock(&class->lock);
> @@ -1980,7 +1974,7 @@ static void async_free_zspage(struct work_struct *work)
>  	struct zs_pool *pool = container_of(work, struct zs_pool,
>  					free_work);
>  
> -	for (i = 0; i < ZS_SIZE_CLASSES; i++) {
> +	for (i = 0; i < pool->num_size_classes; i++) {
>  		class = pool->size_class[i];
>  		if (class->index != i)
>  			continue;
> @@ -2129,7 +2123,7 @@ unsigned long zs_compact(struct zs_pool *pool)
>  	struct size_class *class;
>  	unsigned long pages_freed = 0;
>  
> -	for (i = ZS_SIZE_CLASSES - 1; i >= 0; i--) {
> +	for (i = pool->num_size_classes - 1; i >= 0; i--) {
>  		class = pool->size_class[i];
>  		if (class->index != i)
>  			continue;
> @@ -2173,7 +2167,7 @@ static unsigned long zs_shrinker_count(struct shrinker *shrinker,
>  	struct zs_pool *pool = container_of(shrinker, struct zs_pool,
>  			shrinker);
>  
> -	for (i = ZS_SIZE_CLASSES - 1; i >= 0; i--) {
> +	for (i = pool->num_size_classes - 1; i >= 0; i--) {
>  		class = pool->size_class[i];
>  		if (class->index != i)
>  			continue;
> @@ -2215,11 +2209,27 @@ struct zs_pool *zs_create_pool(const char *name)
>  	int i;
>  	struct zs_pool *pool;
>  	struct size_class *prev_class = NULL;
> +	unsigned long num_pages;
>  
>  	pool = kzalloc(sizeof(*pool), GFP_KERNEL);
>  	if (!pool)
>  		return NULL;
>  
> +	num_pages = 1UL << ZS_DEFAULT_PAGE_ORDER;
> +	/* min_alloc_size must be multiple of ZS_ALIGN */
> +	pool->min_alloc_size = num_pages << PAGE_SHIFT >> OBJ_INDEX_BITS;
> +	pool->min_alloc_size = max(pool->min_alloc_size, ZS_MIN_ALLOC_SIZE);
> +
> +	pool->num_size_classes =
> +		DIV_ROUND_UP(ZS_MAX_ALLOC_SIZE - pool->min_alloc_size,
> +			     ZS_SIZE_CLASS_DELTA) + 1;
> +
> +	pool->size_class = kmalloc_array(pool->num_size_classes,
> +					 sizeof(struct size_class *),
> +					 GFP_KERNEL | __GFP_ZERO);
> +	if (!pool->size_class)
> +		goto err;
> +
>  	init_deferred_free(pool);
>  	rwlock_init(&pool->migrate_lock);
>  
> @@ -2234,17 +2244,17 @@ struct zs_pool *zs_create_pool(const char *name)
>  	 * Iterate reversely, because, size of size_class that we want to use
>  	 * for merging should be larger or equal to current size.
>  	 */
> -	for (i = ZS_SIZE_CLASSES - 1; i >= 0; i--) {
> +	for (i = pool->num_size_classes - 1; i >= 0; i--) {
>  		int size;
>  		int pages_per_zspage;
>  		int objs_per_zspage;
>  		struct size_class *class;
>  		int fullness = 0;
>  
> -		size = ZS_MIN_ALLOC_SIZE + i * ZS_SIZE_CLASS_DELTA;
> +		size = pool->min_alloc_size + i * ZS_SIZE_CLASS_DELTA;
>  		if (size > ZS_MAX_ALLOC_SIZE)
>  			size = ZS_MAX_ALLOC_SIZE;
> -		pages_per_zspage = get_pages_per_zspage(size);
> +		pages_per_zspage = get_pages_per_zspage(size, num_pages);
>  		objs_per_zspage = pages_per_zspage * PAGE_SIZE / size;
>  
>  		/*
> @@ -2328,7 +2338,7 @@ void zs_destroy_pool(struct zs_pool *pool)
>  	zs_flush_migration(pool);
>  	zs_pool_stat_destroy(pool);
>  
> -	for (i = 0; i < ZS_SIZE_CLASSES; i++) {
> +	for (i = 0; i < pool->num_size_classes; i++) {
>  		int fg;
>  		struct size_class *class = pool->size_class[i];
>  
> @@ -2348,6 +2358,7 @@ void zs_destroy_pool(struct zs_pool *pool)
>  	}
>  
>  	destroy_cache(pool);
> +	kfree(pool->size_class);
>  	kfree(pool->name);
>  	kfree(pool);
>  }
> -- 
> 2.38.1.273.g43a17bfeac-goog
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCHv4 3/9] zsmalloc: move away from page order defines
  2022-10-31  5:41 ` [PATCHv4 3/9] zsmalloc: move away from page order defines Sergey Senozhatsky
@ 2022-11-10 22:02   ` Minchan Kim
  0 siblings, 0 replies; 34+ messages in thread
From: Minchan Kim @ 2022-11-10 22:02 UTC (permalink / raw)
  To: Sergey Senozhatsky; +Cc: Andrew Morton, Nitin Gupta, linux-kernel, linux-mm

On Mon, Oct 31, 2022 at 02:41:02PM +0900, Sergey Senozhatsky wrote:
> There is no reason for us to require pages per-zspage to be a
> power of two. Rename macros and use plain limit numbers there
> instead of 2 ^ N values. This will let us to have more tunable
> limits.
> 
> Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>

Looks good to me.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCHv4 4/9] zsmalloc: make huge class watermark zs_pool member
  2022-10-31  5:41 ` [PATCHv4 4/9] zsmalloc: make huge class watermark zs_pool member Sergey Senozhatsky
@ 2022-11-10 22:25   ` Minchan Kim
  2022-11-11  1:07     ` Sergey Senozhatsky
  0 siblings, 1 reply; 34+ messages in thread
From: Minchan Kim @ 2022-11-10 22:25 UTC (permalink / raw)
  To: Sergey Senozhatsky; +Cc: Andrew Morton, Nitin Gupta, linux-kernel, linux-mm

On Mon, Oct 31, 2022 at 02:41:03PM +0900, Sergey Senozhatsky wrote:
> We will permit per-pool configuration of pages per-zspage value,
> which changes characteristics of the classes and moves around
> huge class size watermark. Thus huge class size needs to be
> a per-pool variable.

I think part of the code in the previous patch should move here, since
you are creating the feature in this patch.

BTW, I am wondering whether we really need to jump to the per-pool config
option over a global golden ratio and/or a smarter approach that
optimizes transparently depending on how much memory we have
wasted.

> 
> Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
> ---
>  mm/zsmalloc.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index 5f79223e7bfe..d329bd673baa 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -178,7 +178,6 @@ static struct dentry *zs_stat_root;
>   * (see: fix_fullness_group())
>   */
>  static const int fullness_threshold_frac = 4;
> -static size_t huge_class_size;
>  
>  struct size_class {
>  	spinlock_t lock;
> @@ -227,6 +226,7 @@ struct zs_pool {
>  
>  	u32 num_size_classes;
>  	u32 min_alloc_size;
> +	size_t huge_class_size;
>  
>  	struct zs_pool_stats stats;
>  
> @@ -1350,7 +1350,7 @@ EXPORT_SYMBOL_GPL(zs_unmap_object);
>   */
>  size_t zs_huge_class_size(struct zs_pool *pool)
>  {
> -	return huge_class_size;
> +	return pool->huge_class_size;
>  }
>  EXPORT_SYMBOL_GPL(zs_huge_class_size);
>  
> @@ -2262,8 +2262,8 @@ struct zs_pool *zs_create_pool(const char *name)
>  		 * endup in the huge class.
>  		 */
>  		if (pages_per_zspage != 1 && objs_per_zspage != 1 &&
> -				!huge_class_size) {
> -			huge_class_size = size;
> +				!pool->huge_class_size) {
> +			pool->huge_class_size = size;
>  			/*
>  			 * The object uses ZS_HANDLE_SIZE bytes to store the
>  			 * handle. We need to subtract it, because zs_malloc()
> @@ -2273,7 +2273,7 @@ struct zs_pool *zs_create_pool(const char *name)
>  			 * class because it grows by ZS_HANDLE_SIZE extra bytes
>  			 * right before class lookup.
>  			 */
> -			huge_class_size -= (ZS_HANDLE_SIZE - 1);
> +			pool->huge_class_size -= (ZS_HANDLE_SIZE - 1);
>  		}
>  
>  		/*
> -- 
> 2.38.1.273.g43a17bfeac-goog
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCHv4 0/9] zsmalloc/zram: configurable zspage size
  2022-10-31  5:40 [PATCHv4 0/9] zsmalloc/zram: configurable zspage size Sergey Senozhatsky
                   ` (8 preceding siblings ...)
  2022-10-31  5:41 ` [PATCHv4 9/9] zsmalloc: break out of loop when found perfect zspage order Sergey Senozhatsky
@ 2022-11-10 22:44 ` Minchan Kim
  2022-11-11  0:56   ` Sergey Senozhatsky
  9 siblings, 1 reply; 34+ messages in thread
From: Minchan Kim @ 2022-11-10 22:44 UTC (permalink / raw)
  To: Sergey Senozhatsky; +Cc: Andrew Morton, Nitin Gupta, linux-kernel, linux-mm

On Mon, Oct 31, 2022 at 02:40:59PM +0900, Sergey Senozhatsky wrote:
> 	Hello,
> 
> 	Some use-cases and/or data patterns may benefit from
> larger zspages. Currently the limit on the number of physical
> pages that are linked into a zspage is hardcoded to 4. Higher
> limit changes key characteristics of a number of the size
> classes, improving compactness of the pool and reducing the
> amount of memory zsmalloc pool uses. More on this in 0002
> commit message.

Hi Sergey,

I think the idea of breaking the fixed number of subpages in a zspage is
a really good start for further optimization. However, I worry
about introducing a per-pool config at this stage. How about
introducing just one golden value for the zspage size?
order-3 or 4 in Kconfig, keeping the default of 2?

And then we can make more of an effort to auto-tune based on
the wasted memory and the number of size classes on the
fly. A good thing is that we have an indirection
table (handle <-> zspage), so we can move objects at any time;
I think we can end up with a better approach.
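
Under that proposal, something like (a hypothetical Kconfig entry, not
posted anywhere):

	config ZSMALLOC_ZSPAGE_ORDER
		int "Maximum zspage order (a zspage spans up to 2^N 0-order pages)"
		depends on ZSMALLOC
		range 2 4
		default 2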

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCHv4 0/9] zsmalloc/zram: configurable zspage size
  2022-11-10 22:44 ` [PATCHv4 0/9] zsmalloc/zram: configurable zspage size Minchan Kim
@ 2022-11-11  0:56   ` Sergey Senozhatsky
  2022-11-11 17:03     ` Minchan Kim
  0 siblings, 1 reply; 34+ messages in thread
From: Sergey Senozhatsky @ 2022-11-11  0:56 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Sergey Senozhatsky, Andrew Morton, Nitin Gupta, linux-kernel, linux-mm

Hi,

On (22/11/10 14:44), Minchan Kim wrote:
> On Mon, Oct 31, 2022 at 02:40:59PM +0900, Sergey Senozhatsky wrote:
> > 	Hello,
> > 
> > 	Some use-cases and/or data patterns may benefit from
> > larger zspages. Currently the limit on the number of physical
> > pages that are linked into a zspage is hardcoded to 4. Higher
> > limit changes key characteristics of a number of the size
> > classes, improving compactness of the pool and reducing the
> > amount of memory zsmalloc pool uses. More on this in 0002
> > commit message.
> 
> Hi Sergey,
> 
> I think the idea of breaking the fixed number of subpages in a zspage is
> a really good start for further optimization. However, I worry
> about introducing a per-pool config at this stage. How about
> introducing just one golden value for the zspage size?
> order-3 or 4 in Kconfig, keeping the default of 2?

Sorry, not sure I'm following. So you want a .config value
for zspage limit? I really like the sysfs knob, because then
one may set values on a per-device basis (if they have multiple
zram devices in a system with different data patterns):

	zram0 which is used as a swap device uses, say, 4
	zram1 which is vfat block device uses, say, 6
	zram2 which is ext4 block device uses, say, 8

The whole point of the series is that one single value does
not fit all purposes. There is no silver bullet.
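
With the knob from this series that would be, say:

	echo 4 > /sys/block/zram0/pages_per_pool_page
	echo 6 > /sys/block/zram1/pages_per_pool_page
	echo 8 > /sys/block/zram2/pages_per_pool_page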

> And then we can make more of an effort to auto-tune based on
> the wasted memory and the number of size classes on the
> fly. A good thing is that we have an indirection
> table (handle <-> zspage), so we can move objects at any time;
> I think we can end up with a better approach.

It still needs to be per zram device (per zspool). A sysfs knob
doesn't stop us from having auto-tuned values in the future.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCHv4 4/9] zsmalloc: make huge class watermark zs_pool member
  2022-11-10 22:25   ` Minchan Kim
@ 2022-11-11  1:07     ` Sergey Senozhatsky
  0 siblings, 0 replies; 34+ messages in thread
From: Sergey Senozhatsky @ 2022-11-11  1:07 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Sergey Senozhatsky, Andrew Morton, Nitin Gupta, linux-kernel, linux-mm

On (22/11/10 14:25), Minchan Kim wrote:
> On Mon, Oct 31, 2022 at 02:41:03PM +0900, Sergey Senozhatsky wrote:
> > We will permit per-pool configuration of pages per-zspage value,
> > which changes characteristics of the classes and moves around
> > huge class size watermark. Thus huge class size needs to be
> > a per-pool variable.
> 
> I think part of the code in the previous patch should move here, since
> you are creating the feature in this patch.

What do you mean? This patch - make huge_class_size a pool value - looks
completely independent to me.

> BTW, I am wondering whether we really need to jump to the per-pool config
> option over a global golden ratio and/or a smarter approach that
> optimizes transparently depending on how much memory we have
> wasted.

I like the per-zspool value.

Dynamic zspage sizing is going to be very, very difficult, if possible at
all. With different zspage limits we create different size class clusters
and we also move the huge size class watermark. So if we, say, increase
the zspage length value, then we have more size classes; but in order for
us to actually start saving memory we need to move objects that waste
memory in the previous cluster configuration to the new classes. It's even
more complex with huge objects. When we, say, move the huge size class
watermark from 3264 to 3632, then in order to actually save memory we need
to recompress huge objects and put them into size classes that are between
3264 and 3632.

And that's only half of it. We can also lower the zspage length limit, so
that we have fewer size classes (because they merge more) and the huge
size class watermark moves from 3632 back to 3264. How do we handle this?

I really think that per-zspool knob is the easiest way. And it doesn't
block us from doing any improvements in the future.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCHv4 6/9] zsmalloc: pass limit on pages per-zspage to zs_create_pool()
  2022-10-31  5:41 ` [PATCHv4 6/9] zsmalloc: pass limit on pages per-zspage to zs_create_pool() Sergey Senozhatsky
  2022-11-09  6:24   ` Sergey Senozhatsky
@ 2022-11-11  2:10   ` Minchan Kim
  2022-11-11 10:32     ` Sergey Senozhatsky
  1 sibling, 1 reply; 34+ messages in thread
From: Minchan Kim @ 2022-11-11  2:10 UTC (permalink / raw)
  To: Sergey Senozhatsky; +Cc: Andrew Morton, Nitin Gupta, linux-kernel, linux-mm

On Mon, Oct 31, 2022 at 02:41:05PM +0900, Sergey Senozhatsky wrote:
> Allow zsmalloc pool owner to specify max number of pages
> per-zspage (during pool creation), so that different pools
> can have different characteristics.
> 
> By default we pass ZS_DEFAULT_PAGES_PER_ZSPAGE which is 4
> (matches the current order 2 zspages limit).

How could a user decide what's the best size for their workload?

> 
> Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
> ---
>  drivers/block/zram/zram_drv.c |  3 ++-
>  include/linux/zsmalloc.h      |  2 +-
>  mm/zsmalloc.c                 | 11 +++++++----
>  3 files changed, 10 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index 90b0c66bbd5b..bec02f636bce 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -1247,7 +1247,8 @@ static bool zram_meta_alloc(struct zram *zram, u64 disksize)
>  	if (!zram->table)
>  		return false;
>  
> -	zram->mem_pool = zs_create_pool(zram->disk->disk_name);
> +	zram->mem_pool = zs_create_pool(zram->disk->disk_name,
> +					ZS_DEFAULT_PAGES_PER_ZSPAGE);
>  	if (!zram->mem_pool) {
>  		vfree(zram->table);
>  		return false;
> diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h
> index b6b8654a2d45..28f2b9cb1c47 100644
> --- a/include/linux/zsmalloc.h
> +++ b/include/linux/zsmalloc.h
> @@ -50,7 +50,7 @@ struct zs_pool_stats {
>  
>  struct zs_pool;
>  
> -struct zs_pool *zs_create_pool(const char *name);
> +struct zs_pool *zs_create_pool(const char *name, unsigned long num_pages);
>  void zs_destroy_pool(struct zs_pool *pool);
>  
>  unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t flags);
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index d329bd673baa..42987a913f45 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -366,7 +366,7 @@ static void *zs_zpool_create(const char *name, gfp_t gfp,
>  	 * different contexts and its caller must provide a valid
>  	 * gfp mask.
>  	 */
> -	return zs_create_pool(name);
> +	return zs_create_pool(name, ZS_DEFAULT_PAGES_PER_ZSPAGE);
>  }
>  
>  static void zs_zpool_destroy(void *pool)
> @@ -2195,6 +2195,7 @@ static int zs_register_shrinker(struct zs_pool *pool)
>  /**
>   * zs_create_pool - Creates an allocation pool to work from.
>   * @name: pool name to be created
> + * @num_pages: maximum number of pages per-zspage

How about "max_page_chain:"? 

>   *
>   * This function must be called before anything when using
>   * the zsmalloc allocator.
> @@ -2202,18 +2203,20 @@ static int zs_register_shrinker(struct zs_pool *pool)
>   * On success, a pointer to the newly created pool is returned,
>   * otherwise NULL.
>   */
> -struct zs_pool *zs_create_pool(const char *name)
> +struct zs_pool *zs_create_pool(const char *name, unsigned long num_pages)
>  {
>  	int i;
>  	struct zs_pool *pool;
>  	struct size_class *prev_class = NULL;
> -	unsigned long num_pages;
> +
> +	if (WARN_ON(num_pages < ZS_MIN_PAGES_PER_ZSPAGE ||
> +		    num_pages > ZS_MAX_PAGES_PER_ZSPAGE))
> +		return NULL;
>  
>  	pool = kzalloc(sizeof(*pool), GFP_KERNEL);
>  	if (!pool)
>  		return NULL;
>  
> -	num_pages = ZS_DEFAULT_PAGES_PER_ZSPAGE;
>  	/* min_alloc_size must be multiple of ZS_ALIGN */
>  	pool->min_alloc_size = num_pages << PAGE_SHIFT >> OBJ_INDEX_BITS;
>  	pool->min_alloc_size = max(pool->min_alloc_size, ZS_MIN_ALLOC_SIZE);
> -- 
> 2.38.1.273.g43a17bfeac-goog
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCHv4 8/9] Documentation: document zram pages_per_pool_page attribute
  2022-10-31  5:41 ` [PATCHv4 8/9] Documentation: document zram pages_per_pool_page attribute Sergey Senozhatsky
@ 2022-11-11  2:20   ` Minchan Kim
  2022-11-11 10:34     ` Sergey Senozhatsky
  0 siblings, 1 reply; 34+ messages in thread
From: Minchan Kim @ 2022-11-11  2:20 UTC (permalink / raw)
  To: Sergey Senozhatsky; +Cc: Andrew Morton, Nitin Gupta, linux-kernel, linux-mm

On Mon, Oct 31, 2022 at 02:41:07PM +0900, Sergey Senozhatsky wrote:
> Provide a simple documentation for pages_per_pool_page ZRAM
> device attribute.
> 
> Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
> ---
>  Documentation/admin-guide/blockdev/zram.rst | 38 ++++++++++++++++-----
>  1 file changed, 30 insertions(+), 8 deletions(-)
> 
> diff --git a/Documentation/admin-guide/blockdev/zram.rst b/Documentation/admin-guide/blockdev/zram.rst
> index 010fb05a5999..4cb287520d45 100644
> --- a/Documentation/admin-guide/blockdev/zram.rst
> +++ b/Documentation/admin-guide/blockdev/zram.rst
> @@ -112,7 +112,29 @@ to list all of them using, for instance, /proc/crypto or any other
>  method. This, however, has an advantage of permitting the usage of
>  custom crypto compression modules (implementing S/W or H/W compression).
>  
> -4) Set Disksize
> +4) Set pages per-pool page limit: Optional
> +==========================================
> +
> +zsmalloc zspages can consist of up to ZS_DEFAULT_PAGES_PER_ZSPAGE (single)
> +physical pages. The exact number is calculated for each zsmalloc size
> +class during zsmalloc pool creation. ZRAM provides pages_per_pool_page
> +device attribute that lets one adjust that limit (maximum possible value
> +is ZS_MAX_PAGES_PER_ZSPAGE). The default limit is considered to be good
> +enough, so tweak this value only when the changes in zsmalloc size classes
> +characteristics are beneficial for your data patterns. The limit on the
> +pages per zspage (currently) should be in the [1,16] range; default value
> +is 4.

I think we need to introduce pros and cons for the user to decide, since
this is not familiar territory for admins. I think it would need more
explanation of zsmalloc internals (especially zspages and size classes).

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCHv4 6/9] zsmalloc: pass limit on pages per-zspage to zs_create_pool()
  2022-11-11  2:10   ` Minchan Kim
@ 2022-11-11 10:32     ` Sergey Senozhatsky
  0 siblings, 0 replies; 34+ messages in thread
From: Sergey Senozhatsky @ 2022-11-11 10:32 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Sergey Senozhatsky, Andrew Morton, Nitin Gupta, linux-kernel, linux-mm

On (22/11/10 18:10), Minchan Kim wrote:
> On Mon, Oct 31, 2022 at 02:41:05PM +0900, Sergey Senozhatsky wrote:
> > Allow zsmalloc pool owner to specify max number of pages
> > per-zspage (during pool creation), so that different pools
> > can have different characteristics.
> > 
> > By default we pass ZS_DEFAULT_PAGES_PER_ZSPAGE which is 4
> > (matches the current order 2 zspages limit).
> 
> How could a user decide what's the best size for their workload?

[..]

For starters, in a similar manner to what I showed during our meeting.
They can run tests, gather stats (zsmalloc objects distribution),
analyze where most of the objects sit, how things change when we
have different cluster configurations, and so on.

But more importantly: they need lots of zramX mm_stat data, which is
perfectly traceable and collectable during fleet A/B testing: when a
number of devices get randomly assigned to different experiments and
receive different zspage len configurations, which they feed to zram
sysfs knobs during system startup (when the init script configures zram).
And then look at statistically significant improvements or regressions.

This is how things are done in ChromeOS and I'm sure in many other places.

In this regard, finding the best zspage len value is no different from
finding the best zram disksize, or the best compression
algorithm. Exactly the same approach - feed different configurations to
devices and then analyze the data. Look at mm_stat-s before and after the
experiment,
per device class/type.

We can discuss in more detail internally.
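
For instance, one experiment arm's init script might do (values are
arbitrary; note that the limit has to be set before disksize, since per
this series the pool is created when disksize is set):

	modprobe zram num_devices=1
	echo 8 > /sys/block/zram0/pages_per_pool_page
	echo lzo-rle > /sys/block/zram0/comp_algorithm
	echo 4G > /sys/block/zram0/disksize
	mkswap /dev/zram0
	swapon /dev/zram0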

> >  static void zs_zpool_destroy(void *pool)
> > @@ -2195,6 +2195,7 @@ static int zs_register_shrinker(struct zs_pool *pool)
> >  /**
> >   * zs_create_pool - Creates an allocation pool to work from.
> >   * @name: pool name to be created
> > + * @num_pages: maximum number of pages per-zspage
> 
> How about "max_page_chain:"? 

OK.

Do you dislike the idea of creating a `struct zs_tunables` which will hold
all the fields that we can tune? Then zsmalloc users can pass that
struct (a pointer to it) to zs_create_pool().

There can be various tunables, like policy changes: do we use a static
zspool configuration or a dynamic one, and so on.

On the zram side, we can have a generic sysfs knob: allocator_tuning,
which will accept named params, the same way we did it for
recomp_algorithm and recompress.

	echo "tuneable=VAL tunealbe=VAL" > /sys/block/zramX/allocator_tuning

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCHv4 8/9] Documentation: document zram pages_per_pool_page attribute
  2022-11-11  2:20   ` Minchan Kim
@ 2022-11-11 10:34     ` Sergey Senozhatsky
  0 siblings, 0 replies; 34+ messages in thread
From: Sergey Senozhatsky @ 2022-11-11 10:34 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Sergey Senozhatsky, Andrew Morton, Nitin Gupta, linux-kernel, linux-mm

On (22/11/10 18:20), Minchan Kim wrote:
> > diff --git a/Documentation/admin-guide/blockdev/zram.rst b/Documentation/admin-guide/blockdev/zram.rst
> > index 010fb05a5999..4cb287520d45 100644
> > --- a/Documentation/admin-guide/blockdev/zram.rst
> > +++ b/Documentation/admin-guide/blockdev/zram.rst
> > @@ -112,7 +112,29 @@ to list all of them using, for instance, /proc/crypto or any other
> >  method. This, however, has an advantage of permitting the usage of
> >  custom crypto compression modules (implementing S/W or H/W compression).
> >  
> > -4) Set Disksize
> > +4) Set pages per-pool page limit: Optional
> > +==========================================
> > +
> > +zsmalloc pages can consist of up to ZS_DEFAULT_PAGES_PER_ZSPAGE (single)
> > +physical pages. The exact number is calculated for each zsmalloc size
> > +class during zsmalloc pool creation. ZRAM provides pages_per_pool_page
> > +device attribute that lets one adjust that limit (maximum possible value
> > +is ZS_MAX_PAGES_PER_ZSPAGE). The default limit is considered to be good
> > +enough, so tweak this value only when the changes in zsmalloc size classes
> > +characteristics are beneficial for your data patterns. The limit on the
> > +pages per zspages (currently) should be in [1,16] range; default value
> > +is 4.
> 
> I think we need to introduce pros and cons for the user to decide, since
> this is not familiar territory for admins. I think it would need more
> explanation of zsmalloc internals (especially zspages and size classes).

OK, agreed. I have quite a bit of info in the 0002 commit message.
I can copy-paste some of those bits and edit them. We also have
some info in the internal doc, which I can also use as a "source of
inspiration".

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCHv4 2/9] zsmalloc: turn zspage order into runtime variable
  2022-11-10 21:59   ` Minchan Kim
@ 2022-11-11 10:38     ` Sergey Senozhatsky
  2022-11-11 17:09       ` Minchan Kim
  0 siblings, 1 reply; 34+ messages in thread
From: Sergey Senozhatsky @ 2022-11-11 10:38 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Sergey Senozhatsky, Andrew Morton, Nitin Gupta, linux-kernel, linux-mm

On (22/11/10 13:59), Minchan Kim wrote:
[..]
> > +#define ZS_PAGE_ORDER_2		2
> > +#define ZS_PAGE_ORDER_4		4
> > +
> > +/*
> > + * A single 'zspage' is composed of up to 2^N discontiguous 0-order (single)
> > + * pages. ZS_MAX_PAGE_ORDER defines upper limit on N, ZS_MIN_PAGE_ORDER
> > + * defines lower limit on N. ZS_DEFAULT_PAGE_ORDER is recommended value.
> 
> It gives the impression:
> 
> >    2^2 <= the page number of a zspage <= 2^4
> 
> I think that's not what you want to describe. How about?
> 
> A single 'zspage' is composed of up to 2^N discontiguous 0-order (single)
> pages and the N can be from ZS_MIN_PAGE_ORDER to ZS_MAX_PAGE_ORDER.

OK.

> > + */
> > +#define ZS_MIN_PAGE_ORDER	ZS_PAGE_ORDER_2
> > +#define ZS_MAX_PAGE_ORDER	ZS_PAGE_ORDER_4
> > +#define ZS_DEFAULT_PAGE_ORDER	ZS_PAGE_ORDER_2
> 
> #define ZS_MIN_PAGE_ORDER	2
> 
> We can use the number directly instead of another wrapper, at least
> in this patch. (Just in case: if you want to extend it in a later patch,
> please do it in that patch.)

OK.

[..]
> > -#define MAX(a, b) ((a) >= (b) ? (a) : (b))
> > -/* ZS_MIN_ALLOC_SIZE must be multiple of ZS_ALIGN */
> > -#define ZS_MIN_ALLOC_SIZE \
> > -	MAX(32, (ZS_MAX_PAGES_PER_ZSPAGE << PAGE_SHIFT >> OBJ_INDEX_BITS))
> > +#define ZS_MIN_ALLOC_SIZE	32U
> 
> Let's have some comment here to say that this is not the final value;
> the final one is supposed to be pool->min_alloc_size.

OK.

[..]
> >  enum fullness_group {
> >  	ZS_EMPTY,
> > @@ -230,12 +221,15 @@ struct link_free {
> >  struct zs_pool {
> >  	const char *name;
> >  
> > -	struct size_class *size_class[ZS_SIZE_CLASSES];
> > +	struct size_class **size_class;
> >  	struct kmem_cache *handle_cachep;
> >  	struct kmem_cache *zspage_cachep;
> >  
> >  	atomic_long_t pages_allocated;
> >  
> > +	u32 num_size_classes;
> > +	u32 min_alloc_size;
> 
> Please use int.

OK. Any reason why we don't want u32? I thought that
s16/u16/s32/u32/etc. is the new normal.

> From this patch, I couldn't figure out why we need a
> variable in the pool. Let's have the change in the patch where
> you really have the use case.

Let me take a look.

> > -static int get_pages_per_zspage(int class_size)
> > +static int get_pages_per_zspage(u32 class_size, u32 num_pages)
> 
> Let's just use int instead of u32.
> 
> Why do you need a num_pages argument instead of using 1UL << ZS_DEFAULT_PAGE_ORDER?
> It looks like a static value.

It is static right now, but a couple of patches later it'll become
dynamic.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCHv4 0/9] zsmalloc/zram: configurable zspage size
  2022-11-11  0:56   ` Sergey Senozhatsky
@ 2022-11-11 17:03     ` Minchan Kim
  2022-11-14  3:53       ` Sergey Senozhatsky
                         ` (3 more replies)
  0 siblings, 4 replies; 34+ messages in thread
From: Minchan Kim @ 2022-11-11 17:03 UTC (permalink / raw)
  To: Sergey Senozhatsky; +Cc: Andrew Morton, Nitin Gupta, linux-kernel, linux-mm

On Fri, Nov 11, 2022 at 09:56:36AM +0900, Sergey Senozhatsky wrote:
> Hi,
> 
> On (22/11/10 14:44), Minchan Kim wrote:
> > On Mon, Oct 31, 2022 at 02:40:59PM +0900, Sergey Senozhatsky wrote:
> > > 	Hello,
> > > 
> > > 	Some use-cases and/or data patterns may benefit from
> > > larger zspages. Currently the limit on the number of physical
> > > pages that are linked into a zspage is hardcoded to 4. Higher
> > > limit changes key characteristics of a number of the size
> > > classes, improving compactness of the pool and reducing the
> > > amount of memory zsmalloc pool uses. More on this in 0002
> > > commit message.
> > 
> > Hi Sergey,
> > 
> > I think the idea of breaking the fixed number of subpages in a zspage is
> > a really good start for further optimization. However, I worry
> > about introducing a per-pool config at this stage. How about
> > introducing just one golden value for the zspage size?
> > order-3 or 4 in Kconfig, keeping the default of 2?
> 
> Sorry, not sure I'm following. So you want a .config value
> for zspage limit? I really like the sysfs knob, because then
> one may set values on a per-device basis (if they have multiple
> zram devices in a system with different data patterns):

Yes, I wanted to have just a global policy that drives zsmalloc smarter
without requiring a big effort from the user to decide the right tuning
value (I thought the decision process would be quite painful for normal
users who don't have enough resources), since zsmalloc's design makes
that possible. But as an interim solution until we prove there is no
regression, just provide the config and then remove it later when we add
aggressive zspage compaction (if necessary, please see below), since a
config is easier to deprecate than a sysfs knob.

> 
> 	zram0 which is used as a swap device uses, say, 4
> 	zram1 which is vfat block device uses, say, 6
> 	zram2 which is ext4 block device uses, say, 8
> 
> The whole point of the series is that one single value does
> not fit all purposes. There is no silver bullet.

I understand what you want to achieve with a per-pool config exposing
the knob to the user, but my worry is still how the user could decide the
best fit, since workloads are so dynamic. Some groups have enough
resources to run fleet experiments while many others don't, so if we
really need the per-pool config step, at the least I'd like to provide a
default guide for users in the documentation along with the tunable knobs
for experimenting. Maybe we can suggest 4 for the swap case and 8 for the
fs case.

I don't object to the sysfs knobs for those use cases, but can't we deal
with the issue in a better way?

In general, the bigger pages_per_zspage, the more memory we save. It
would be the same with slab_order in the slab allocator, but slab has a
limit due to high-order allocation costs and the internal fragmentation
of bigger-order slabs. However, zsmalloc is different in that it doesn't
expose memory addresses directly, and it knows when an object is
accessed by the user. And it doesn't need high-order allocations,
either. That's how zsmalloc can support object migration and page
migration. With those features, theoretically, zsmalloc doesn't need a
limit on pages_per_zspage, so I am looking forward to seeing zsmalloc
handle the memory fragmentation problem in a better way.
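
To illustrate (a simplified sketch of what alloc_zspage() in
mm/zsmalloc.c effectively does; error handling and page linking are
omitted here):

	/*
	 * A zspage of pages_per_zspage pages is assembled from that many
	 * independent order-0 allocations, so there is never a high-order
	 * allocation to fail or to fragment memory.
	 */
	for (i = 0; i < class->pages_per_zspage; i++) {
		struct page *page = alloc_page(gfp);

		if (!page)
			goto out_free;	/* undo the previous allocations */
		pages[i] = page;
	}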

The only concern with a bigger pages_per_zspage (e.g., 8 or 16) is
exhausting memory when zram is used for swap. That use case aims to
help under memory pressure, but in the worst case, the bigger
pages_per_zspage is, the higher the chance of running out of memory.
However, we could bound the worst-case memory consumption to

for class in classes:
    wasted_bytes += class->pages_per_zspage * PAGE_SIZE - an object size

with *aggressive zpage compaction*. Right now we rely on the shrinker
(it might already be enough) to trigger it, but we could change the
policy to compact once the wasted memory in a size class crosses a
threshold we define for the zram fs use case, since that case would be
used without memory pressure.
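
A sketch of what such a policy check could look like (only
pool->num_size_classes and class->size exist in this series;
pages_used, objs_used and compact_threshold are made-up names, used
here purely for illustration):

	/*
	 * Hypothetical: report when the bytes wasted across all size
	 * classes cross a per-pool threshold, so the caller can kick
	 * compaction.
	 */
	static bool zs_wasted_above_threshold(struct zs_pool *pool)
	{
		unsigned long wasted_bytes = 0;
		int i;

		for (i = 0; i < pool->num_size_classes; i++) {
			struct size_class *class = pool->size_class[i];

			/* pages backing the class minus bytes in live objects */
			wasted_bytes += class->pages_used * PAGE_SIZE -
					class->objs_used * class->size;
		}

		return wasted_bytes > pool->compact_threshold;
	}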

What do you think?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCHv4 2/9] zsmalloc: turn zspage order into runtime variable
  2022-11-11 10:38     ` Sergey Senozhatsky
@ 2022-11-11 17:09       ` Minchan Kim
  2022-11-14  3:55         ` Sergey Senozhatsky
  0 siblings, 1 reply; 34+ messages in thread
From: Minchan Kim @ 2022-11-11 17:09 UTC (permalink / raw)
  To: Sergey Senozhatsky; +Cc: Andrew Morton, Nitin Gupta, linux-kernel, linux-mm

On Fri, Nov 11, 2022 at 07:38:10PM +0900, Sergey Senozhatsky wrote:
< snip >

> [..]
> > >  enum fullness_group {
> > >  	ZS_EMPTY,
> > > @@ -230,12 +221,15 @@ struct link_free {
> > >  struct zs_pool {
> > >  	const char *name;
> > >  
> > > -	struct size_class *size_class[ZS_SIZE_CLASSES];
> > > +	struct size_class **size_class;
> > >  	struct kmem_cache *handle_cachep;
> > >  	struct kmem_cache *zspage_cachep;
> > >  
> > >  	atomic_long_t pages_allocated;
> > >  
> > > +	u32 num_size_classes;
> > > +	u32 min_alloc_size;
> > 
> > Please use int.
> 
> OK. Any reason why we don't want u32? I thought that
> s16/u16/s32/u32/etc. are the new normal.

Oh, I didn't know the new normal.

# ag u32 mm/ | wc -l 
65

Then, I'd like to use int to be consistent with others.

> 
> > From this patch, I couldn't figure out why we need the
> > variable in the pool. Let's make the change in the patch where
> > you really need it for a use case.
> 
> Let me take a look.
> 
> > > -static int get_pages_per_zspage(int class_size)
> > > +static int get_pages_per_zspage(u32 class_size, u32 num_pages)
> > 
> > Let's just use int instead of u32
> > 
> > Why do you need the num_pages argument instead of using 1UL << ZS_DEFAULT_PAGE_ORDER?
> > It looks like a static value.
> 
> It is static right now, but in a couple of patches it'll become
> dynamic.

Then, please make the change in the patch where you actually use it, to
make review easier.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCHv4 6/9] zsmalloc: pass limit on pages per-zspage to zs_create_pool()
  2022-11-09  6:24   ` Sergey Senozhatsky
@ 2022-11-11 17:14     ` Minchan Kim
  0 siblings, 0 replies; 34+ messages in thread
From: Minchan Kim @ 2022-11-11 17:14 UTC (permalink / raw)
  To: Sergey Senozhatsky; +Cc: Andrew Morton, Nitin Gupta, linux-kernel, linux-mm

On Wed, Nov 09, 2022 at 03:24:43PM +0900, Sergey Senozhatsky wrote:
> On (22/10/31 14:41), Sergey Senozhatsky wrote:
> [..]
> > -struct zs_pool *zs_create_pool(const char *name)
> > +struct zs_pool *zs_create_pool(const char *name, unsigned long num_pages)
> >  {
> >  	int i;
> >  	struct zs_pool *pool;
> >  	struct size_class *prev_class = NULL;
> > -	unsigned long num_pages;
> > +
> > +	if (WARN_ON(num_pages < ZS_MIN_PAGES_PER_ZSPAGE ||
> > +		    num_pages > ZS_MAX_PAGES_PER_ZSPAGE))
> > +		return NULL;
> 
> I tend to think that creating `struct zs_tunables` would be better. For
> the time being, zs_tunables will contain only one member, max_zspage_len,
> but it can be extended in the future.

+1 zs_tunables if we go that way.
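
Something along these lines, perhaps (just a sketch of the shape, using
the member name suggested above):

	struct zs_tunables {
		/* limit on the number of 0-order pages in one zspage */
		unsigned long max_zspage_len;
		/* room for future knobs, e.g. a compaction threshold */
	};

	struct zs_pool *zs_create_pool(const char *name,
				       const struct zs_tunables *tunables);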

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCHv4 0/9] zsmalloc/zram: configurable zspage size
  2022-11-11 17:03     ` Minchan Kim
@ 2022-11-14  3:53       ` Sergey Senozhatsky
  2022-11-14  7:55       ` Sergey Senozhatsky
                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 34+ messages in thread
From: Sergey Senozhatsky @ 2022-11-14  3:53 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Sergey Senozhatsky, Andrew Morton, Nitin Gupta, linux-kernel, linux-mm

Hi Minchan,

On (22/11/11 09:03), Minchan Kim wrote:
> > Sorry, not sure I'm following. So you want a .config value
> > for zspage limit? I really like the sysfs knob, because then
> > one may set values on a per-device basis (if they have multiple
> > zram devices in a system with different data patterns):
> 
> Yes, I wanted to have just a global policy that drives zsmalloc smarter
> without requiring a big effort from users to decide on the right tuning
> value (I thought the decision process would be quite painful for normal
> users who don't have enough resources), since zsmalloc's design makes it
> possible. But as an interim solution, until we prove there is no
> regression, we could just provide a config option and then remove it
> later when we add aggressive zpage compaction (if necessary, please see
> below), since that is easier than deprecating a sysfs knob.

[..]

> I understand what you want to achieve with a per-pool config exposing
> the knob to the user, but my worry is still how a user could decide on
> the best fit, since workloads are so dynamic. Some groups have enough
> resources to run fleet experiments while many others don't, so if we
> really need the per-pool config step, at least I'd like to provide a
> default guide for users in the documentation along with the tunable
> knobs for experimenting. Maybe we can suggest 4 for the swap case and 8
> for the fs case.
> 
> I don't disagree with the sysfs knobs for these use cases, but can't we
> deal with the issue in a better way?

[..]

> with *aggressive zpage compaction*. Right now we rely on the shrinker
> (it might already be enough) to trigger it, but we could change the
> policy to compact once the wasted memory in a size class crosses a
> threshold we define for the zram fs use case, since that case would be
> used without memory pressure.
> 
> What do you think?

This is tricky. I didn't want us to come up with any sort of policies
based on assumptions. For instance, we know that SUSE uses zram with fs
under severe memory pressure (so severe that they immediately noticed
when we removed the zsmalloc handle allocation slow path and reported
a regression), so the assumption that the fs zram use case is not
memory sensitive does not always hold.

There are too many variables. We have different data patterns, yes, but
even same data patterns have different characteristics when compressed
with different algorithms; then we also have different host states
(memory pressure, etc.) and so on.

I think it'll be safer for us to approach this the other way around.
We can (that's what I was going to do) reach out to people (Android,
SUSE, Meta, ChromeOS, Google cloud, WebOS, Tizen) and ask them to run
experiments (try out various numbers). Then (several months later) we
can take a look at the data - what numbers work for which workloads,
and then we can introduce/change policies, based on evidence and real
use cases. Who knows, maybe a zspage_chain_size of 6 can become the new
default, and then we can add a .config policy, maybe 7 or 8. Or maybe we
won't find a single number that works equally well for everyone (even
in similar use cases).

This is where a sysfs knob is very useful. Unlike .config, which has no
flexibility, especially when your entire fleet uses the same .config for
all builds, a sysfs knob lets people run numerous A/B tests
simultaneously (not to mention that some setups have many zram devices,
which can have different zspage_chain_size-s). And we don't even need to
deprecate it if we introduce a generic knob like allocator_tunables,
which accepts `key=val` tuples. Then we can just deprecate a specific
`key`.
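
For instance, the store side of such a knob could parse tuples like
this (a sketch; the attribute and key names are made up, and the [1,16]
range follows the limits used in this series):

	static ssize_t allocator_tunables_store(struct device *dev,
						struct device_attribute *attr,
						const char *buf, size_t len)
	{
		unsigned int val;

		if (sscanf(buf, "max_pages_per_zspage=%u", &val) == 1) {
			if (val < 1 || val > 16)
				return -EINVAL;
			/* apply to the device's pool on (re)init */
			return len;
		}

		/* unknown key: new keys can be added, old ones deprecated */
		return -EINVAL;
	}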

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCHv4 2/9] zsmalloc: turn zspage order into runtime variable
  2022-11-11 17:09       ` Minchan Kim
@ 2022-11-14  3:55         ` Sergey Senozhatsky
  0 siblings, 0 replies; 34+ messages in thread
From: Sergey Senozhatsky @ 2022-11-14  3:55 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Sergey Senozhatsky, Andrew Morton, Nitin Gupta, linux-kernel, linux-mm

On (22/11/11 09:09), Minchan Kim wrote:
> > [..]
> > OK. Any reason why we don't want u32? I thought that
> > s16/u16/s32/u32/etc. are the new normal.
> 
> Oh, I didn't know the new normal.
> 
> # ag u32 mm/ | wc -l 
> 65
> 
> Then, I'd like to use int to be consistent with others.

OK.

> > > Let's just use int instead of u32
> > > 
> > > Why do you need the num_pages argument instead of using 1UL << ZS_DEFAULT_PAGE_ORDER?
> > > It looks like a static value.
> > 
> > It is static right now, but in a couple of patches it'll become
> > dynamic.
> 
> Then, please make the change in the patch where you actually use it, to
> make review easier.

OK.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCHv4 0/9] zsmalloc/zram: configurable zspage size
  2022-11-11 17:03     ` Minchan Kim
  2022-11-14  3:53       ` Sergey Senozhatsky
@ 2022-11-14  7:55       ` Sergey Senozhatsky
  2022-11-14  8:37       ` Sergey Senozhatsky
  2022-11-15  6:01       ` Sergey Senozhatsky
  3 siblings, 0 replies; 34+ messages in thread
From: Sergey Senozhatsky @ 2022-11-14  7:55 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Sergey Senozhatsky, Andrew Morton, Nitin Gupta, linux-kernel, linux-mm

On (22/11/11 09:03), Minchan Kim wrote:
[..]
> The only concern with a bigger pages_per_zspage (e.g., 8 or 16) is
> exhausting memory when zram is used for swap. That use case aims to
> help under memory pressure, but in the worst case, the bigger
> pages_per_zspage is, the higher the chance of running out of memory.

It's hard to speak in concrete terms here. What locally may look
like a less optimal configuration can result in a more optimal configuration
globally.

Yes, some zspage_chains get longer, but in return we have very different
clustering and zspool performance/configuration.

For example, a synthetic test on my host.

zspage_chain_size 4
-------------------

zsmalloc classes
 class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
 ...
 Total                13           51        413836     412973     159955                         3

zram mm_stat
1691783168 628083717 655175680        0 655175680       60        0    34048    34049

zspage_chain_size 8
-------------------

zsmalloc classes
 class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
 ...
 Total                18           87        414852     412978     156666                         0

zram mm_stat
1691803648 627793930 641703936        0 641703936       60        0    33591    33591


Note that we have a lower "pages_used" value for the same amount of
stored data: down to 156666 from 159955 pages, i.e. about 3289 pages
(~12.8 MiB with 4K pages, roughly a 2% reduction).

So it *could be* that longer zspage_chains are beneficial even in
memory-sensitive cases, but we need more data on this, so that we can
speak "statistically".

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCHv4 0/9] zsmalloc/zram: configurable zspage size
  2022-11-11 17:03     ` Minchan Kim
  2022-11-14  3:53       ` Sergey Senozhatsky
  2022-11-14  7:55       ` Sergey Senozhatsky
@ 2022-11-14  8:37       ` Sergey Senozhatsky
  2022-11-15  6:01       ` Sergey Senozhatsky
  3 siblings, 0 replies; 34+ messages in thread
From: Sergey Senozhatsky @ 2022-11-14  8:37 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Sergey Senozhatsky, Andrew Morton, Nitin Gupta, linux-kernel, linux-mm

On (22/11/11 09:03), Minchan Kim wrote:
[..]
> for class in classes:
>     wasted_bytes += class->pages_per_zspage * PAGE_SIZE - an object size
> 
> with *aggressive zpage compaction*. Right now we rely on the shrinker
> (it might already be enough) to trigger it, but we could change the
> policy to compact once the wasted memory in a size class crosses a
> threshold

That threshold can be another tunable in the zramX/allocator_tunables
sysfs knob and in struct zs_tunables.

But overall it sounds like a bigger project for some time next year.
We already have the zs_compact() sysfs knob, so user-space can invoke
it as often as it wants to (not aware if anyone does, btw); maybe the
new compaction should be something slightly different. I don't have any
ideas yet. One way or the other, it can still use the same sysfs knob :)

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCHv4 0/9] zsmalloc/zram: configurable zspage size
  2022-11-11 17:03     ` Minchan Kim
                         ` (2 preceding siblings ...)
  2022-11-14  8:37       ` Sergey Senozhatsky
@ 2022-11-15  6:01       ` Sergey Senozhatsky
  2022-11-15  7:59         ` Sergey Senozhatsky
  3 siblings, 1 reply; 34+ messages in thread
From: Sergey Senozhatsky @ 2022-11-15  6:01 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Sergey Senozhatsky, Andrew Morton, Nitin Gupta, linux-kernel, linux-mm

On (22/11/11 09:03), Minchan Kim wrote:
[..]
> for class in classes:
>     wasted_bytes += class->pages_per_zspage * PAGE_SIZE - an object size
> 
> with *aggressive zpage compaction*. Right now we rely on the shrinker
> (it might already be enough) to trigger it, but we could change the
> policy to compact once the wasted memory in a size class crosses a
> threshold

Compaction does something good only when we can release a zspage in the
end. Otherwise we just hold the global pool->lock (assuming that we land
the zsmalloc writeback series) and simply move objects around zspages.
So the ability to limit zspage chain size can still be valuable, on
another level, as a measure to reduce our dependency on compaction
success.

Maybe we can make compaction slightly more successful. For instance,
we could start moving objects not only within zspages of the same size
class, but also to class size + X (upper size classes). As an example,
when all zspages in a class are almost full, but class size + 1 has
almost-empty zspages. In other words, sort of as if those classes had
been merged (a virtual merge). A single pool->lock would be handy for
that.
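
Purely illustrative, the destination lookup could be something like
this (get_first_zspage() and X are made-up names; ZS_ALMOST_EMPTY is an
existing fullness group):

	/*
	 * Hypothetical "virtual merge": if the source class has nowhere
	 * to compact into, look for room in the next few (larger) size
	 * classes and migrate objects there.
	 */
	struct zspage *dst_zspage = NULL;
	int i;

	for (i = src + 1; i < min(src + 1 + X, pool->num_size_classes); i++) {
		struct size_class *dst_class = pool->size_class[i];

		dst_zspage = get_first_zspage(dst_class, ZS_ALMOST_EMPTY);
		if (dst_zspage)
			break;	/* move objects from src into dst_zspage */
	}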

But this is more of a research project (an intern project?), with an
unclear outcome and ETA. I think in the meantime we can let people start
experimenting with various zspage chain sizes, so that maybe at some
point we can arrive at a new "default" value for all zspools, higher
than the current 4, which has been around for many years. I can't think,
at present, of a better way forward.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCHv4 0/9] zsmalloc/zram: configurable zspage size
  2022-11-15  6:01       ` Sergey Senozhatsky
@ 2022-11-15  7:59         ` Sergey Senozhatsky
  2022-11-15 23:23           ` Minchan Kim
  0 siblings, 1 reply; 34+ messages in thread
From: Sergey Senozhatsky @ 2022-11-15  7:59 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Sergey Senozhatsky, Andrew Morton, Nitin Gupta, linux-kernel, linux-mm

On (22/11/15 15:01), Sergey Senozhatsky wrote:
> On (22/11/11 09:03), Minchan Kim wrote:
> [..]
> > for class in classes:
> >     wasted_bytes += class->pages_per_zspage * PAGE_SIZE - an object size
> > 
> > with *aggressive zpage compaction*. Right now we rely on the shrinker
> > (it might already be enough) to trigger it, but we could change the
> > policy to compact once the wasted memory in a size class crosses a
> > threshold
> 
> Compaction does something good only when we can release a zspage in the
> end. Otherwise we just hold the global pool->lock (assuming that we land
> the zsmalloc writeback series) and simply move objects around zspages.
> So the ability to limit zspage chain size can still be valuable, on
> another level, as a measure to reduce our dependency on compaction
> success.
> 
> Maybe we can make compaction slightly more successful. For instance,
> we could start moving objects not only within zspages of the same size
> class, but also to class size + X (upper size classes). As an example,
> when all zspages in a class are almost full, but class size + 1 has
> almost-empty zspages. In other words, sort of as if those classes had
> been merged (a virtual merge). A single pool->lock would be handy for
> that.

What I'm trying to say here is that "aggressiveness of compaction"
should probably be measured not by compaction frequency, but by the
overall cost of compaction operations.

An aggressive compaction frequency doesn't help us much if the state of
the pool doesn't change significantly between compactions. E.g. if we do
10 compaction calls, only the first one potentially compacts some
zspages; the remaining ones don't do anything.

The cost of compaction operations is a measure of how hard compaction
tries. Does it move objects to neighbouring classes, and so on? Maybe we
can do something here.

But then the question is: how do we make sure that we don't drain the
battery too fast? And perhaps some other questions too.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCHv4 0/9] zsmalloc/zram: configurable zspage size
  2022-11-15  7:59         ` Sergey Senozhatsky
@ 2022-11-15 23:23           ` Minchan Kim
  2022-11-16  0:52             ` Sergey Senozhatsky
  0 siblings, 1 reply; 34+ messages in thread
From: Minchan Kim @ 2022-11-15 23:23 UTC (permalink / raw)
  To: Sergey Senozhatsky; +Cc: Andrew Morton, Nitin Gupta, linux-kernel, linux-mm

On Tue, Nov 15, 2022 at 04:59:29PM +0900, Sergey Senozhatsky wrote:
> On (22/11/15 15:01), Sergey Senozhatsky wrote:
> > On (22/11/11 09:03), Minchan Kim wrote:
> > [..]
> > > for class in classes:
> > >     wasted_bytes += class->pages_per_zspage * PAGE_SIZE - an object size
> > > 
> > > with *aggressive zpage compaction*. Right now we rely on the shrinker
> > > (it might already be enough) to trigger it, but we could change the
> > > policy to compact once the wasted memory in a size class crosses a
> > > threshold
> > 
> > Compaction does something good only when we can release a zspage in the
> > end. Otherwise we just hold the global pool->lock (assuming that we land
> > the zsmalloc writeback series) and simply move objects around zspages.
> > So the ability to limit zspage chain size can still be valuable, on
> > another level, as a measure to reduce our dependency on compaction
> > success.
> > 
> > Maybe we can make compaction slightly more successful. For instance,
> > we could start moving objects not only within zspages of the same size
> > class, but also to class size + X (upper size classes). As an example,
> > when all zspages in a class are almost full, but class size + 1 has
> > almost-empty zspages. In other words, sort of as if those classes had
> > been merged (a virtual merge). A single pool->lock would be handy for
> > that.
> 
> What I'm trying to say here is that "aggressiveness of compaction"
> should probably be measured not by compaction frequency, but by the
> overall cost of compaction operations.
> 
> An aggressive compaction frequency doesn't help us much if the state of
> the pool doesn't change significantly between compactions. E.g. if we do
> 10 compaction calls, only the first one potentially compacts some
> zspages; the remaining ones don't do anything.
> 
> The cost of compaction operations is a measure of how hard compaction
> tries. Does it move objects to neighbouring classes, and so on? Maybe we
> can do something here.
> 
> But then the question is: how do we make sure that we don't drain the
> battery too fast? And perhaps some other questions too.
> 

Sure, once we start talking about battery, there are a lot of things we
need to consider, not only the direct zram costs but also the indirect
costs caused by memory pressure and workload patterns. Those are not
things we can control, and they would consume much more battery. I
understand your concern, but I also question whether a per-device sysfs
knob can solve the issue, since workloads are too dynamic even on the
same swap file/fs. I'd like to try finding a sweet spot in general. If
that turns out to be too hard, then we need to introduce the knob with a
reasonable guideline for how to find the right value.

Let me try to get data under an Android workload on how much just
blindly increasing ZS_MAX_PAGES_PER_ZSPAGE changes things.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCHv4 0/9] zsmalloc/zram: configurable zspage size
  2022-11-15 23:23           ` Minchan Kim
@ 2022-11-16  0:52             ` Sergey Senozhatsky
  0 siblings, 0 replies; 34+ messages in thread
From: Sergey Senozhatsky @ 2022-11-16  0:52 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Sergey Senozhatsky, Andrew Morton, Nitin Gupta, linux-kernel, linux-mm

On (22/11/15 15:23), Minchan Kim wrote:
> Sure, once we start talking about battery, there are a lot of things we
> need to consider, not only the direct zram costs but also the indirect
> costs caused by memory pressure and workload patterns. Those are not
> things we can control, and they would consume much more battery. I
> understand your concern, but I also question whether a per-device sysfs
> knob can solve the issue, since workloads are too dynamic even on the
> same swap file/fs. I'd like to try finding a sweet spot in general. If
> that turns out to be too hard, then we need to introduce the knob with a
> reasonable guideline for how to find the right value.
> 
> Let me try to get data under an Android workload on how much just
> blindly increasing ZS_MAX_PAGES_PER_ZSPAGE changes things.

I don't want to push for the sysfs knob.

What I like about a sysfs knob vs Kconfig is that sysfs is opt-in. We
can ask folks to try things out; people will know what to look at, they
will keep an eye on metrics, and then they come back to us. So we can
sit down, look at the numbers and draw some conclusions. Kconfig is not
opt-in. It'll happen for everyone, as a policy, transparently, and then
we rely on:
a) people tracking metrics that they were not asked to track
b) people noticing changes (positive or negative) in metrics that they
   don't keep an eye on
c) people figuring out that the change in metrics is related to the
   zsmalloc Kconfig (and that's a very non-obvious conclusion)
d) people reaching out to us

That's way too much to rely on. Chances are we will never hear back.

I understand that you don't like sysfs, and it's probably not the best
thing, but Kconfig is not better. I like the opt-in nature of sysfs: if
you change it, then you know what you are doing.

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2022-11-16  0:53 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-10-31  5:40 [PATCHv4 0/9] zsmalloc/zram: configurable zspage size Sergey Senozhatsky
2022-10-31  5:41 ` [PATCHv4 1/9] zram: add size class equals check into recompression Sergey Senozhatsky
2022-10-31  5:41 ` [PATCHv4 2/9] zsmalloc: turn zspage order into runtime variable Sergey Senozhatsky
2022-11-10 21:59   ` Minchan Kim
2022-11-11 10:38     ` Sergey Senozhatsky
2022-11-11 17:09       ` Minchan Kim
2022-11-14  3:55         ` Sergey Senozhatsky
2022-10-31  5:41 ` [PATCHv4 3/9] zsmalloc: move away from page order defines Sergey Senozhatsky
2022-11-10 22:02   ` Minchan Kim
2022-10-31  5:41 ` [PATCHv4 4/9] zsmalloc: make huge class watermark zs_pool member Sergey Senozhatsky
2022-11-10 22:25   ` Minchan Kim
2022-11-11  1:07     ` Sergey Senozhatsky
2022-10-31  5:41 ` [PATCHv4 5/9] zram: huge size watermark cannot be global Sergey Senozhatsky
2022-10-31  5:41 ` [PATCHv4 6/9] zsmalloc: pass limit on pages per-zspage to zs_create_pool() Sergey Senozhatsky
2022-11-09  6:24   ` Sergey Senozhatsky
2022-11-11 17:14     ` Minchan Kim
2022-11-11  2:10   ` Minchan Kim
2022-11-11 10:32     ` Sergey Senozhatsky
2022-10-31  5:41 ` [PATCHv4 7/9] zram: add pages_per_pool_page device attribute Sergey Senozhatsky
2022-11-09  4:34   ` Sergey Senozhatsky
2022-10-31  5:41 ` [PATCHv4 8/9] Documentation: document zram pages_per_pool_page attribute Sergey Senozhatsky
2022-11-11  2:20   ` Minchan Kim
2022-11-11 10:34     ` Sergey Senozhatsky
2022-10-31  5:41 ` [PATCHv4 9/9] zsmalloc: break out of loop when found perfect zspage order Sergey Senozhatsky
2022-11-10 22:44 ` [PATCHv4 0/9] zsmalloc/zram: configurable zspage size Minchan Kim
2022-11-11  0:56   ` Sergey Senozhatsky
2022-11-11 17:03     ` Minchan Kim
2022-11-14  3:53       ` Sergey Senozhatsky
2022-11-14  7:55       ` Sergey Senozhatsky
2022-11-14  8:37       ` Sergey Senozhatsky
2022-11-15  6:01       ` Sergey Senozhatsky
2022-11-15  7:59         ` Sergey Senozhatsky
2022-11-15 23:23           ` Minchan Kim
2022-11-16  0:52             ` Sergey Senozhatsky
