* [PATCH v4 0/5] Implement MTE tag compression for swapped pages
From: Alexander Potapenko @ 2023-07-20 17:39 UTC
  To: glider, catalin.marinas, will, pcc, andreyknvl,
	andriy.shevchenko, linux, yury.norov
  Cc: linux-kernel, linux-arm-kernel, eugenis, syednwaris, william.gray

Currently, when MTE pages are swapped out, their tags are kept in memory,
occupying 128 bytes per page. This is especially problematic for devices that
use zram-backed in-memory swap, because tags stored uncompressed in the heap
effectively reduce the amount of available swap memory.

The RLE-based EA0 algorithm suggested by Evgenii Stepanov and implemented in
this patch series efficiently compresses 128-byte tag buffers, achieving a
practical compression ratio between 2.5x and 20x. In most cases the compressed
data fits into a 63-bit XArray value, so no extra memory allocation is needed.

Our measurements show that EA0 compresses these buffers better than the
existing kernel compression algorithms (LZ4, LZO, LZ4HC, ZSTD), because it
specifically targets 128-byte buffers.

To implement compression/decompression, we also extend <linux/bitmap.h>
with helpers that set/get bit values at arbitrary positions in the map.

We refactor arch/arm64/mm/mteswap.c to support both the compressed
(CONFIG_ARM64_MTE_COMP) and non-compressed cases. For the former, in
addition to tag compression, we move tag allocation from kmalloc() to
separate kmem caches, providing greater locality and relaxing the
alignment requirements.

v4:
 - addressed review comments from Andy Shevchenko and Yury Norov
 - added Documentation/arch/arm64/mte-tag-compression.rst

v3:
 - as suggested by Andy Shevchenko, use
   bitmap_get_value()/bitmap_set_value() written by Syed Nayyar Waris
 - switched to unsigned long to reduce typecasts
 - simplified the compression code

v2:
 - as suggested by Yury Norov, replace the poorly implemented struct
   bitq with <linux/bitmap.h>



Alexander Potapenko (4):
  lib/test_bitmap: add tests for bitmap_{set,get}_value()
  arm64: mte: implement CONFIG_ARM64_MTE_COMP
  arm64: mte: add a test for MTE tags compression
  arm64: mte: add compression support to mteswap.c

Syed Nayyar Waris (1):
  lib/bitmap: add bitmap_{set,get}_value()

 Documentation/arch/arm64/index.rst            |   1 +
 .../arch/arm64/mte-tag-compression.rst        | 212 ++++++++++
 arch/arm64/Kconfig                            |  19 +
 arch/arm64/include/asm/mtecomp.h              |  13 +
 arch/arm64/mm/Makefile                        |   7 +
 arch/arm64/mm/mtecomp.c                       | 382 ++++++++++++++++++
 arch/arm64/mm/mteswap.c                       |  20 +-
 arch/arm64/mm/mteswap.h                       |  12 +
 arch/arm64/mm/mteswap_comp.c                  |  54 +++
 arch/arm64/mm/mteswap_nocomp.c                |  38 ++
 arch/arm64/mm/test_mtecomp.c                  | 217 ++++++++++
 include/linux/bitmap.h                        |  60 +++
 lib/test_bitmap.c                             |  81 ++++
 13 files changed, 1105 insertions(+), 11 deletions(-)
 create mode 100644 Documentation/arch/arm64/mte-tag-compression.rst
 create mode 100644 arch/arm64/include/asm/mtecomp.h
 create mode 100644 arch/arm64/mm/mtecomp.c
 create mode 100644 arch/arm64/mm/mteswap.h
 create mode 100644 arch/arm64/mm/mteswap_comp.c
 create mode 100644 arch/arm64/mm/mteswap_nocomp.c
 create mode 100644 arch/arm64/mm/test_mtecomp.c

-- 
2.41.0.487.g6d72f3e995-goog


* [PATCH v4 1/5] lib/bitmap: add bitmap_{set,get}_value()
From: Alexander Potapenko @ 2023-07-20 17:39 UTC
  To: glider, catalin.marinas, will, pcc, andreyknvl,
	andriy.shevchenko, linux, yury.norov
  Cc: linux-kernel, linux-arm-kernel, eugenis, syednwaris,
	william.gray, Arnd Bergmann

From: Syed Nayyar Waris <syednwaris@gmail.com>

The two new functions allow setting/getting values of length up to
BITS_PER_LONG bits at an arbitrary position in the bitmap; see the usage
sketch below.

The code was taken from "bitops: Introduce the for_each_set_clump macro"
by Syed Nayyar Waris with a couple of minor changes:
 - instead of using roundup(), which adds an unnecessary dependency
   on <linux/math.h>, we calculate space as BITS_PER_LONG-offset;
 - indentation is reduced by not using else-clauses (suggested by
   checkpatch for bitmap_get_value())
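
For illustration, the intended semantics can be sketched as follows (an
illustrative snippet using the v4 names bitmap_read()/bitmap_write(); the
example() wrapper is hypothetical and not part of this patch):

	static void example(void)
	{
		/* Two words, so that a value can cross the word boundary. */
		DECLARE_BITMAP(map, BITS_PER_LONG * 2) = { 0 };
		unsigned long v;

		/* Write a 5-bit value at bit offset 60: on 64-bit hosts it
		 * spans the boundary between the two bitmap words. */
		bitmap_write(map, 0b10101, 60, 5);
		v = bitmap_read(map, 60, 5); /* v == 0b10101 */
	}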

Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Syed Nayyar Waris <syednwaris@gmail.com>
Signed-off-by: William Breathitt Gray <william.gray@linaro.org>
Link: https://lore.kernel.org/lkml/fe12eedf3666f4af5138de0e70b67a07c7f40338.1592224129.git.syednwaris@gmail.com/
Suggested-by: Yury Norov <yury.norov@gmail.com>
Co-developed-by: Alexander Potapenko <glider@google.com>
Signed-off-by: Alexander Potapenko <glider@google.com>

---
v4:
 - Address comments by Andy Shevchenko and Yury Norov:
   - prevent passing values >= 64 to GENMASK()
   - fix commit authorship
   - change comments
   - check for unlikely(nbits==0)
   - drop unnecessary const declarations
   - fix kernel-doc comments
   - rename bitmap_{get,set}_value() to bitmap_{read,write}()
---
 include/linux/bitmap.h | 60 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 60 insertions(+)

diff --git a/include/linux/bitmap.h b/include/linux/bitmap.h
index 03644237e1efb..bc21c09a2e038 100644
--- a/include/linux/bitmap.h
+++ b/include/linux/bitmap.h
@@ -76,7 +76,11 @@ struct device;
  *  bitmap_to_arr32(buf, src, nbits)            Copy nbits from buf to u32[] dst
  *  bitmap_to_arr64(buf, src, nbits)            Copy nbits from buf to u64[] dst
  *  bitmap_get_value8(map, start)               Get 8bit value from map at start
+ *  bitmap_read(map, start, nbits)              Read an nbits-sized value from
+ *                                              map at start
  *  bitmap_set_value8(map, value, start)        Set 8bit value to map at start
+ *  bitmap_write(map, value, start, nbits)      Write an nbits-sized value to
+ *                                              map at start
  *
  * Note, bitmap_zero() and bitmap_fill() operate over the region of
  * unsigned longs, that is, bits behind bitmap till the unsigned long
@@ -583,6 +587,33 @@ static inline unsigned long bitmap_get_value8(const unsigned long *map,
 	return (map[index] >> offset) & 0xFF;
 }
 
+/**
+ * bitmap_read - read a value of n-bits from the memory region
+ * @map: address to the bitmap memory region
+ * @start: bit offset of the n-bit value
+ * @nbits: size of value in bits, up to BITS_PER_LONG
+ *
+ * Returns: value of nbits located at the @start bit offset within the @map
+ * memory region.
+ */
+static inline unsigned long bitmap_read(const unsigned long *map,
+					unsigned long start,
+					unsigned long nbits)
+{
+	size_t index = BIT_WORD(start);
+	unsigned long offset = start % BITS_PER_LONG;
+	unsigned long space = BITS_PER_LONG - offset;
+	unsigned long value_low, value_high;
+
+	if (unlikely(!nbits))
+		return 0;
+	if (space >= nbits)
+		return (map[index] >> offset) & GENMASK(nbits - 1, 0);
+	value_low = map[index] & BITMAP_FIRST_WORD_MASK(start);
+	value_high = map[index + 1] & BITMAP_LAST_WORD_MASK(start + nbits);
+	return (value_low >> offset) | (value_high << space);
+}
+
 /**
  * bitmap_set_value8 - set an 8-bit value within a memory region
  * @map: address to the bitmap memory region
@@ -599,6 +630,35 @@ static inline void bitmap_set_value8(unsigned long *map, unsigned long value,
 	map[index] |= value << offset;
 }
 
+/**
+ * bitmap_write - write n-bit value within a memory region
+ * @map: address to the bitmap memory region
+ * @value: value of nbits
+ * @start: bit offset of the n-bit value
+ * @nbits: size of value in bits, up to BITS_PER_LONG
+ */
+static inline void bitmap_write(unsigned long *map,
+				unsigned long value,
+				unsigned long start, unsigned long nbits)
+{
+	size_t index = BIT_WORD(start);
+	unsigned long offset = start % BITS_PER_LONG;
+	unsigned long space = BITS_PER_LONG - offset;
+
+	if (unlikely(!nbits))
+		return;
+	value &= GENMASK(nbits - 1, 0);
+	if (space >= nbits) {
+		map[index] &= ~(GENMASK(nbits - 1, 0) << offset);
+		map[index] |= value << offset;
+		return;
+	}
+	map[index] &= ~BITMAP_FIRST_WORD_MASK(start);
+	map[index] |= value << offset;
+	map[index + 1] &= ~BITMAP_LAST_WORD_MASK(start + nbits);
+	map[index + 1] |= (value >> space);
+}
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* __LINUX_BITMAP_H */
-- 
2.41.0.487.g6d72f3e995-goog



* [PATCH v4 2/5] lib/test_bitmap: add tests for bitmap_{set,get}_value()
From: Alexander Potapenko @ 2023-07-20 17:39 UTC
  To: glider, catalin.marinas, will, pcc, andreyknvl,
	andriy.shevchenko, linux, yury.norov
  Cc: linux-kernel, linux-arm-kernel, eugenis, syednwaris, william.gray

Add basic tests ensuring that values can be written and read back at
arbitrary positions in the bitmap, including those spanning into adjacent
unsigned longs.

Signed-off-by: Alexander Potapenko <glider@google.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>

---
This patch was previously called
"lib/test_bitmap: add tests for bitmap_{set,get}_value_unaligned"
(https://lore.kernel.org/lkml/20230713125706.2884502-3-glider@google.com/)

v4:
 - Address comments by Andy Shevchenko: added Reviewed-by: and a link to
   the previous discussion
 - Address comments by Yury Norov:
   - expand the bitmap to catch more corner cases
   - add code testing that bitmap_set_value() does not touch adjacent
     bits
   - add code testing the nbits==0 case
   - rename bitmap_{get,set}_value() to bitmap_{read,write}()

v3:
 - switch to using bitmap_{set,get}_value()
 - change the expected bit pattern in test_set_get_value(),
   as the test was incorrectly assuming 0 is the LSB.
---
 lib/test_bitmap.c | 81 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 81 insertions(+)

diff --git a/lib/test_bitmap.c b/lib/test_bitmap.c
index 187f5b2db4cf1..601000c7799df 100644
--- a/lib/test_bitmap.c
+++ b/lib/test_bitmap.c
@@ -71,6 +71,17 @@ __check_eq_uint(const char *srcfile, unsigned int line,
 	return true;
 }
 
+static bool __init
+__check_eq_ulong(const char *srcfile, unsigned int line,
+		 const unsigned long exp_ulong, unsigned long x)
+{
+	if (exp_ulong != x) {
+		pr_err("[%s:%u] expected %lu, got %lu\n",
+			srcfile, line, exp_ulong, x);
+		return false;
+	}
+	return true;
+}
 
 static bool __init
 __check_eq_bitmap(const char *srcfile, unsigned int line,
@@ -186,6 +197,7 @@ __check_eq_str(const char *srcfile, unsigned int line,
 	})
 
 #define expect_eq_uint(...)		__expect_eq(uint, ##__VA_ARGS__)
+#define expect_eq_ulong(...)		__expect_eq(ulong, ##__VA_ARGS__)
 #define expect_eq_bitmap(...)		__expect_eq(bitmap, ##__VA_ARGS__)
 #define expect_eq_pbl(...)		__expect_eq(pbl, ##__VA_ARGS__)
 #define expect_eq_u32_array(...)	__expect_eq(u32_array, ##__VA_ARGS__)
@@ -1222,6 +1234,73 @@ static void __init test_bitmap_const_eval(void)
 	BUILD_BUG_ON(~var != ~BIT(25));
 }
 
+/*
+ * Test bitmap should be big enough to include the cases when start is not in
+ * the first word, and start+nbits lands in the following word.
+ */
+#define TEST_BIT_LEN (BITS_PER_LONG * 3)
+#define TEST_BYTE_LEN (BITS_TO_LONGS(TEST_BIT_LEN) * sizeof(unsigned long))
+static void __init test_set_get_value(void)
+{
+	DECLARE_BITMAP(bitmap, TEST_BIT_LEN);
+	DECLARE_BITMAP(exp_bitmap, TEST_BIT_LEN);
+	/* Prevent constant folding. */
+	volatile unsigned long zero_bits = 0;
+	unsigned long val, bit;
+	int i;
+
+	/* Setting/getting zero bits should not crash the kernel. */
+	bitmap_write(NULL, 0, 0, zero_bits);
+	val = bitmap_read(NULL, 0, zero_bits);
+	expect_eq_ulong(0, val);
+
+	/*
+	 * Ensure that bitmap_read() reads the same value that was previously
+	 * written, and two consecutive values are correctly merged.
+	 * The resulting bit pattern is asymmetric to rule out possible issues
+	 * with bit numeration order.
+	 */
+	for (i = 0; i < TEST_BIT_LEN - 7; i++) {
+		bitmap_zero(bitmap, TEST_BIT_LEN);
+		bitmap_write(bitmap, 0b10101UL, i, 5);
+		val = bitmap_read(bitmap, i, 5);
+		expect_eq_ulong(0b10101UL, val);
+		bitmap_write(bitmap, 0b101UL, i + 5, 3);
+		val = bitmap_read(bitmap, i + 5, 3);
+		expect_eq_ulong(0b101UL, val);
+		val = bitmap_read(bitmap, i, 8);
+		expect_eq_ulong(0b10110101UL, val);
+	}
+
+	/*
+	 * Check that setting a single bit does not accidentally touch the
+	 * adjacent bits.
+	 */
+	for (i = 0; i < TEST_BIT_LEN; i++) {
+		/*
+		 * A 0b10101010 pattern to catch both 0s flipped to 1s and
+		 * vice versa.
+		 */
+		memset(bitmap, 0xaa, TEST_BYTE_LEN);
+		memset(exp_bitmap, 0xaa, TEST_BYTE_LEN);
+		for (bit = 0; bit <= 1; bit++) {
+			bitmap_write(bitmap, bit, i, 1);
+			__assign_bit(i, exp_bitmap, bit);
+			expect_eq_bitmap(exp_bitmap, bitmap, TEST_BIT_LEN);
+		}
+	}
+
+	/* Ensure setting 0 bits does not change anything. */
+	memset(bitmap, 0xaa, TEST_BYTE_LEN);
+	memset(exp_bitmap, 0xaa, TEST_BYTE_LEN);
+	for (i = 0; i < TEST_BIT_LEN; i++) {
+		bitmap_write(bitmap, ~0UL, i, 0);
+		expect_eq_bitmap(exp_bitmap, bitmap, TEST_BIT_LEN);
+	}
+}
+#undef TEST_BYTE_LEN
+#undef TEST_BIT_LEN
+
 static void __init selftest(void)
 {
 	test_zero_clear();
@@ -1249,6 +1328,8 @@ static void __init selftest(void)
 	test_for_each_clear_bitrange_from();
 	test_for_each_set_clump8();
 	test_for_each_set_bit_wrap();
+
+	test_set_get_value();
 }
 
 KSTM_MODULE_LOADERS(test_bitmap);
-- 
2.41.0.487.g6d72f3e995-goog



* [PATCH v4 3/5] arm64: mte: implement CONFIG_ARM64_MTE_COMP
From: Alexander Potapenko @ 2023-07-20 17:39 UTC
  To: glider, catalin.marinas, will, pcc, andreyknvl,
	andriy.shevchenko, linux, yury.norov
  Cc: linux-kernel, linux-arm-kernel, eugenis, syednwaris, william.gray

This config option implements an algorithm that compresses ARM MTE memory
tags when pages are swapped out.

The algorithm is based on RLE and specifically targets 128-byte buffers
of tags corresponding to a single page. In the common case a buffer
can be compressed into 63 bits, making it possible to store it without
additional memory allocation.

Suggested-by: Evgenii Stepanov <eugenis@google.com>
Signed-off-by: Alexander Potapenko <glider@google.com>

---
 v4:
  - Addressed comments by Andy Shevchenko:
    - expanded "MTE" to "Memory Tagging Extension" in Kconfig
    - fixed kernel-doc comments, moved them to C source
    - changed variables to unsigned where applicable
    - some code simplifications, fewer unnecessary assignments
    - added the mte_largest_idx_bits() helper
    - added namespace prefixes to all functions
    - added missing headers (but removed bits.h)
  - Addressed comments by Yury Norov:
    - removed test-only functions from mtecomp.h
    - dropped the algorithm name (all functions are now prefixed with
      "mte")
    - added more comments
    - got rid of MTE_RANGES_INLINE
    - renamed bitmap_{get,set}_value() to bitmap_{read,write}()
    - moved the big comment explaining the algorithm to
      Documentation/arch/arm64/mte-tag-compression.rst, expanded it,
      added a link to it from Documentation/arch/arm64/index.rst
    - removed hardcoded ranges from mte_alloc_size()/mte_size_to_ranges()

 v3:
  - Addressed comments by Andy Shevchenko:
   - use bitmap_{set,get}_value() written by Syed Nayyar Waris
   - switched to unsigned long everywhere (fewer casts)
   - simplified the code, removed redundant checks
   - dropped ea0_compress_inline()
 - added bit size constants and helpers to access the bitmap
 - explicitly initialize all compressed sizes in ea0_compress_to_buf()
 - initialize all handle bits

 v2:
  - as suggested by Yury Norov, switched from struct bitq (which is
    not needed anymore) to <linux/bitmap.h>
  - add missing symbol exports
---
 Documentation/arch/arm64/index.rst            |   1 +
 .../arch/arm64/mte-tag-compression.rst        | 212 ++++++++++
 arch/arm64/Kconfig                            |   9 +
 arch/arm64/include/asm/mtecomp.h              |  13 +
 arch/arm64/mm/Makefile                        |   1 +
 arch/arm64/mm/mtecomp.c                       | 382 ++++++++++++++++++
 6 files changed, 618 insertions(+)
 create mode 100644 Documentation/arch/arm64/mte-tag-compression.rst
 create mode 100644 arch/arm64/include/asm/mtecomp.h
 create mode 100644 arch/arm64/mm/mtecomp.c

diff --git a/Documentation/arch/arm64/index.rst b/Documentation/arch/arm64/index.rst
index d08e924204bf1..bf6c1583233a9 100644
--- a/Documentation/arch/arm64/index.rst
+++ b/Documentation/arch/arm64/index.rst
@@ -19,6 +19,7 @@ ARM64 Architecture
     legacy_instructions
     memory
     memory-tagging-extension
+    mte-tag-compression
     perf
     pointer-authentication
     ptdump
diff --git a/Documentation/arch/arm64/mte-tag-compression.rst b/Documentation/arch/arm64/mte-tag-compression.rst
new file mode 100644
index 0000000000000..af6716d53c1a8
--- /dev/null
+++ b/Documentation/arch/arm64/mte-tag-compression.rst
@@ -0,0 +1,212 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================================================
+Tag Compression for Memory Tagging Extension (MTE)
+==================================================
+
+This document describes the algorithm used to compress the memory tags of the
+ARM Memory Tagging Extension (MTE).
+
+Introduction
+============
+
+MTE assigns tags to memory pages: for 4K pages those tags occupy 128 bytes
+(256 4-bit tags each corresponding to a 16-byte MTE granule). By default, MTE
+carves out 3.125% (1/16) of the available physical memory to store the tags.
+
+When MTE pages are saved to swap, their tags need to be stored in kernel
+memory. If the system swap is used heavily, these tags may take a substantial
+portion of the physical memory, which in the case of a zram-backed swap may
+even exceed the memory used to store the swapped pages themselves. To reduce
+memory waste, ``CONFIG_ARM64_MTE_COMP`` allows the kernel to store the tags in
+compressed form.
+
+Implementation details
+======================
+
+The algorithm attempts to compress a 128-byte (``MTE_GRANULES_PER_PAGE / 2``)
+array of tags into a smaller byte sequence that can be stored in a
+16-, 32-, or 64-byte buffer. A special case is storing the tags inline in
+an 8-byte pointer.
+
+Tag manipulation and storage
+----------------------------
+
+Tags for swapped pages are stored in an XArray that maps swap entries to 63-bit
+values (see ``arch/arm64/mm/mteswap.c``). In the case when
+``CONFIG_ARM64_MTE_COMP=n``, these values contain pointers to 128-byte buffers
+allocated with kmalloc(). Otherwise, they are 63-bit handles used by the
+functions declared in ``arch/arm64/include/asm/mtecomp.h``:
+
+- mte_compress() compresses the given 128-byte ``tags`` buffer, allocates
+  storage for it, and returns an opaque handle addressing that storage;
+- mte_decompress() decompresses the tags addressed by ``handle``
+  and fills the 128-byte ``tags`` buffer;
+- mte_release_handle() releases the storage handle returned by
+  mte_compress() (so that this handle cannot be used anymore);
+- mte_storage_size() calculates the size occupied by the tags addressed
+  by ``handle``.
+
+Depending on the size of the compressed data, ``mte_compress()`` stores it in
+one of the size classes backed by kmem caches: ``mte-tags-16``,
+``mte-tags-32``, ``mte-tags-64``, or ``mte-tags-128`` (for data that cannot
+be compressed into 64 bytes and is therefore stored uncompressed).
+In the common case the tags can be compressed into 8 bytes; they are then
+stored in the handle itself.
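+
+A caller is expected to use these functions roughly as follows (a simplified
+sketch; the actual swap integration is in ``arch/arm64/mm/mteswap_comp.c``)::
+
+  u8 tags[128];
+  unsigned long handle;
+
+  handle = mte_compress(tags);   /* swap-out: compress the 128 tag bytes */
+  /* ... keep the handle in the XArray while the page is swapped out ... */
+  mte_decompress(handle, tags);  /* swap-in: restore the tags */
+  mte_release_handle(handle);    /* free the out-of-line storage, if any */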
+
+Handle format
+-------------
+
+The handle returned by ``mte_compress()`` is an ``unsigned long`` that has its
+bit 63 set to 0 (XArray entries must not exceed ``LONG_MAX``)::
+
+   63  62    60  ...   2         0
+  +---+--------+-----+------------+
+  | 0 | INLINE | ... | SIZE_CLASS |
+  +---+--------+-----+------------+
+
+Bits ``62..60`` are the inline/out-of-line marker: if they are all set to 1,
+the data is stored out-of-line in the buffer pointed to by
+``(handle | BIT(63)) & ~7UL``. Otherwise, the data is stored inline in the
+handle itself.
+
+Bits ``2..0`` denote the size class for out-of-line allocations:
+
+- ``0b001`` for ``mte-tags-16``;
+- ``0b010`` for ``mte-tags-32``;
+- ``0b100`` for ``mte-tags-64``;
+- ``0b000`` for ``mte-tags-128``.
+
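+For illustration, the out-of-line storage pointer can be recovered from a
+handle as follows (a sketch mirroring ``mte_storage()`` in
+``arch/arm64/mm/mtecomp.c``)::
+
+  if ((handle & GENMASK_ULL(62, 60)) == GENMASK_ULL(62, 60)) {
+          /* Out-of-line: restore bit 63, mask off the size class bits. */
+          void *storage = (void *)((handle & ~7UL) | BIT_ULL(63));
+          unsigned int size_class = handle & 7;
+  }
+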
+
+Tag compression
+---------------
+
+The compression algorithm is a variation of RLE (run-length encoding) and works
+as follows.
+
+1. The input array of 128 bytes is transformed into tag ranges (two arrays:
+   ``r_tags[]`` containing tag values and ``r_sizes[]`` containing range
+   lengths) by ``mte_tags_to_ranges()``. Note that ``r_sizes[]`` sums up to 256.
+
+2. The index of the largest element of ``r_sizes[]`` is stored in
+   ``largest_idx``. The element itself is dropped from ``r_sizes[]``, because
+   it can be reconstructed as 256 minus the sum of the remaining elements.
+   Note that none of the remaining ``r_sizes[]`` elements exceeds 127.
+
+3. Depending on the number ``N`` of ranges, a storage class is picked::
+
+           N <= 6:  8 bytes (inline case, no allocation required);
+      6 < N <= 11: 16 bytes
+     11 < N <= 23: 32 bytes
+     23 < N <= 46: 64 bytes
+     46 < N:       128 bytes (no compression will be performed)
+
+(See `Why these numbers?`_ below).
+
+4. For the inline case, the following values are stored packed in the 8-byte
+   handle (``i<size>`` means a ``<size>``-bit unsigned integer)::
+
+      largest_idx : i4
+     r_tags[0..5] : i4 x 6
+    r_sizes[0..4] : i7 x 5
+
+   (if N is less than 6, ``r_tags`` and ``r_sizes`` are padded with zero
+   values)
+
+   Because ``largest_idx`` is <= 5, bit 63 of the handle is always 0 (so it
+   can be stored in the XArray), and bits 62..60 cannot all be 1, so it can be
+   distinguished from a kernel pointer.
+
+5. For the out-of-line case, the storage is allocated from one of the
+   ``mte-tags-{16,32,64,128}`` kmem caches. The resulting pointer is aligned
+   on 8 bytes, so its bits 2..0 can be used to store the size class (see above).
+
+   Bit 63 of the pointer is zeroed out, so that it can be stored in the
+   XArray.
+
+6. The data layout in the allocated storage is as follows::
+
+        largest_idx : i6
+       r_tags[0..N] : i4 x N
+    r_sizes[0..N-1] : i7 x (N-1)
+
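+As a worked example, consider a page whose 256 tags are all ``0x0``:
+``mte_tags_to_ranges()`` produces a single range (``r_tags[0] = 0``,
+``r_sizes[0] = 256``), so ``N = 1`` and the inline case applies. The largest
+element (index 0) is dropped, and every packed field (``largest_idx`` and the
+zero-padded ``r_tags[]``/``r_sizes[]``) is zero, so the resulting handle is
+simply ``0``.
+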
+Tag decompression
+-----------------
+
+The decompression algorithm performs the steps below.
+
+1. Decide if data is stored inline (bits ``62..60`` of the handle ``!= 0b111``)
+   or out-of line.
+
+2. For the inline case, treat the handle itself as the input buffer.
+
+3. For the out-of-line case, look at bits ``2..0`` of the handle to determine
+   the input buffer length. To obtain the pointer to the input buffer, unset
+   bits ``2..0`` of the handle and set bit ``63``.
+
+4. If the input buffer is 128 bytes long, copy its contents to the output
+   buffer.
+
+5. Otherwise, read ``largest_idx``, ``r_tags[]`` and ``r_sizes[]`` from the
+   input buffer. Calculate the removed largest element of ``r_sizes[]`` as
+   ``largest = 256 - sum(r_sizes)`` and insert it into ``r_sizes`` at
+   position ``largest_idx``.
+
+6. For each ``r_sizes[i] > 0``, add a 4-bit value ``r_tags[i]`` to the output
+   buffer ``r_sizes[i]`` times.
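+
+Continuing the worked example above: the handle ``0`` has bits ``62..60``
+equal to ``0b000``, so the data is inline and the handle itself is the input
+buffer. ``largest_idx`` reads back as 0 and the stored ``r_sizes[]`` fields
+read back as 0, so the reconstructed largest element is ``256 - 0 = 256``,
+yielding 256 tags of value ``0x0``.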
+
+
+Why these numbers?
+------------------
+
+To be able to reconstruct N tag ranges from the compressed data, we need to
+store ``largest_idx``, ``r_tags[N]``, and ``r_sizes[N-1]``. Knowing that the
+sizes do not exceed 127, those can be packed into 7 bits, whereas a single tag
+occupies 4 bits, and ``largest_idx`` cannot take more than 8 bits.
+
+Now, for each ``S``-byte size class it is possible to find the maximal number
+``M`` such that ``8 + 4 * M + 7 * (M - 1) <= 8 * S``,
+i.e. ``M = (8 * S - 1) / 11``::
+
+ +-------------+----+--------------+
+ | Buffer size |  M | Storage bits |
+ +-------------+----+--------------+
+ |          8  |  5 |          56  |
+ |         16  | 11 |         122  |
+ |         32  | 23 |         254  |
+ |         64  | 46 |         507  |
+ +-------------+----+--------------+
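+
+(For example, for ``S = 16``: ``M = (8 * 16 - 1) / 11 = 11``, and the stored
+fields take ``8 + 4 * 11 + 7 * 10 = 122`` bits, which fit into 128.)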
+
+Note that ``M`` (and therefore ``largest_idx``) always fits into 6 bits. In
+the inline case it is even guaranteed to fit into 3 bits, which lets us
+squeeze an extra range into an 8-byte buffer. Because the inline case
+requires bit 63 of the handle to be zero, we add that bit to ``largest_idx``,
+knowing it will not be used.
+
+For the revised ``largest_idx`` sizes, we now pick the maximal number ``N``
+such that ``L + 4 * N + 7 * (N - 1) <= 8 * S``, where ``L = 4`` in the inline
+case and ``L = 6`` otherwise.
+In other words, ``N = (8 * S + 7 - L) / 11``, therefore::
+
+  +-------------+----+--------------+
+  | Buffer size |  N | Storage bits |
+  +-------------+----+--------------+
+  |          8  |  6 |          63  |
+  |         16  | 11 |         120  |
+  |         32  | 23 |         252  |
+  |         64  | 46 |         505  |
+  +-------------+----+--------------+
+
+
+Note
+----
+
+Tag compression and decompression implicitly rely on the fixed MTE tag size
+(4 bits) and number of tags per page. Should these values change, the algorithm
+may need to be revised.
+
+
+Programming Interface
+=====================
+
+ .. kernel-doc:: arch/arm64/mm/mtecomp.c
+
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index a2511b30d0f67..3ac6e302b1509 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -2093,6 +2093,15 @@ config ARM64_EPAN
 	  if the cpu does not implement the feature.
 endmenu # "ARMv8.7 architectural features"
 
+config ARM64_MTE_COMP
+	bool "Tag compression for ARM64 Memory Tagging Extension"
+	default y
+	depends on ARM64_MTE
+	help
+	  Enable tag compression support for ARM64 Memory Tagging Extension.
+
+	  128-byte tag buffers corresponding to 4K pages can be compressed to save heap memory.
+
 config ARM64_SVE
 	bool "ARM Scalable Vector Extension support"
 	default y
diff --git a/arch/arm64/include/asm/mtecomp.h b/arch/arm64/include/asm/mtecomp.h
new file mode 100644
index 0000000000000..71552bc429882
--- /dev/null
+++ b/arch/arm64/include/asm/mtecomp.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef __ASM_MTECOMP_H
+#define __ASM_MTECOMP_H
+
+#include <linux/types.h>
+
+unsigned long mte_compress(u8 *tags);
+bool mte_decompress(unsigned long handle, u8 *tags);
+void mte_release_handle(unsigned long handle);
+size_t mte_storage_size(unsigned long handle);
+
+#endif // __ASM_MTECOMP_H
diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
index dbd1bc95967d0..46778f6dd83c2 100644
--- a/arch/arm64/mm/Makefile
+++ b/arch/arm64/mm/Makefile
@@ -10,6 +10,7 @@ obj-$(CONFIG_TRANS_TABLE)	+= trans_pgd.o
 obj-$(CONFIG_TRANS_TABLE)	+= trans_pgd-asm.o
 obj-$(CONFIG_DEBUG_VIRTUAL)	+= physaddr.o
 obj-$(CONFIG_ARM64_MTE)		+= mteswap.o
+obj-$(CONFIG_ARM64_MTE_COMP)	+= mtecomp.o
 KASAN_SANITIZE_physaddr.o	+= n
 
 obj-$(CONFIG_KASAN)		+= kasan_init.o
diff --git a/arch/arm64/mm/mtecomp.c b/arch/arm64/mm/mtecomp.c
new file mode 100644
index 0000000000000..6d27f69f9521f
--- /dev/null
+++ b/arch/arm64/mm/mtecomp.c
@@ -0,0 +1,382 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+/*
+ * MTE tag compression algorithm.
+ * Proposed by Evgenii Stepanov <eugenis@google.com>
+ * See Documentation/arch/arm64/mte-tag-compression.rst for more details.
+ */
+
+#include <linux/bitmap.h>
+#include <linux/bitops.h>
+#include <linux/export.h>
+#include <linux/gfp.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/string.h>
+#include <linux/types.h>
+
+#include <asm/mtecomp.h>
+
+/* The handle must fit into an XArray value. */
+#define MTE_HANDLE_MASK GENMASK_ULL(62, 0)
+
+/* Out-of-line handles have 0b111 in bits 62..60. */
+#define MTE_NOINLINE_MASK GENMASK_ULL(62, 60)
+
+/* Cache index is stored in the lowest pointer bits. */
+#define MTE_CACHE_ID_MASK GENMASK_ULL(2, 0)
+
+/*
+ * Four separate caches to store out-of-line data:
+ *  0: mte-tags-128
+ *  1: mte-tags-16
+ *  2: mte-tags-32
+ *  3: mte-tags-64
+ */
+#define MTECOMP_NUM_CACHES 4
+static struct kmem_cache *mtecomp_caches[MTECOMP_NUM_CACHES];
+
+/*
+ * Sizes of compressed values. These depend on MTE_TAG_SIZE and
+ * MTE_GRANULES_PER_PAGE.
+ */
+#define MTE_BITS_PER_TAG 4
+#define MTE_BITS_PER_SIZE 7
+#define MTE_BITS_PER_LARGEST_IDX_INLINE 4
+#define MTE_BITS_PER_LARGEST_IDX 6
+
+/* Translate allocation size into mtecomp_caches[] index. */
+static unsigned int mte_size_to_cache_id(size_t len)
+{
+	if (len < 128)
+		return fls(len >> 4);
+	return 0;
+}
+
+/* Translate mtecomp_caches[] index into allocation size. */
+static size_t mte_cache_id_to_size(unsigned int id)
+{
+	if (id == 0)
+		return 128;
+	return 8 << id;
+}
+
+/**
+ * mte_tags_to_ranges() - break @tags into arrays of tag ranges.
+ * @tags: 128-byte array containing 256 MTE tags.
+ * @out_tags: u8 array to store the tag of every range.
+ * @out_sizes: unsigned short array to store the size of every range.
+ * @out_len: length of @out_tags and @out_sizes (input/output parameter:
+ *           initially the capacity of out_tags[] and out_sizes[], set to the
+ *           number of ranges on return).
+ */
+void mte_tags_to_ranges(u8 *tags, u8 *out_tags, unsigned short *out_sizes,
+			size_t *out_len)
+{
+	u8 prev_tag = tags[0] / 16; /* First tag in the array. */
+	unsigned int cur_idx = 0, i, j;
+	u8 cur_tag;
+
+	memset(out_tags, 0, array_size(*out_len, sizeof(*out_tags)));
+	memset(out_sizes, 0, array_size(*out_len, sizeof(*out_sizes)));
+
+	out_tags[0] = prev_tag;
+	for (i = 0; i < MTE_PAGE_TAG_STORAGE; i++) {
+		for (j = 0; j < 2; j++) {
+			cur_tag = j ? (tags[i] % 16) : (tags[i] / 16);
+			if (cur_tag == prev_tag) {
+				out_sizes[cur_idx]++;
+			} else {
+				cur_idx++;
+				prev_tag = cur_tag;
+				out_tags[cur_idx] = prev_tag;
+				out_sizes[cur_idx] = 1;
+			}
+		}
+	}
+	*out_len = cur_idx + 1;
+}
+EXPORT_SYMBOL_NS(mte_tags_to_ranges, MTECOMP);
+
+/**
+ * mte_ranges_to_tags() - fill @tags using given tag ranges.
+ * @r_tags: u8[] containing the tag of every range.
+ * @r_sizes: unsigned short[] containing the size of every range.
+ * @r_len: length of @r_tags and @r_sizes.
+ * @tags: 128-byte array to write the tags to.
+ */
+void mte_ranges_to_tags(u8 *r_tags, unsigned short *r_sizes, size_t r_len,
+			u8 *tags)
+{
+	unsigned int i, j, pos = 0;
+	u8 prev;
+
+	for (i = 0; i < r_len; i++) {
+		for (j = 0; j < r_sizes[i]; j++) {
+			if (pos % 2)
+				tags[pos / 2] = (prev << 4) | r_tags[i];
+			else
+				prev = r_tags[i];
+			pos++;
+		}
+	}
+}
+EXPORT_SYMBOL_NS(mte_ranges_to_tags, MTECOMP);
+
+/*
+ * Translate allocation size into maximum number of ranges that it can hold.
+ *
+ * It is the biggest number N such that:
+ *   4 + 4*N + 7*(N-1) <= 63 bits, for the inline case,
+ * or
+ *   6 + 4*N + 7*(N-1) <= array size in bits (128, 256, or 512),
+ * for the out-of-line case.
+ */
+static size_t mte_size_to_ranges(size_t size)
+{
+	size_t largest_bits;
+	size_t ret = 0;
+
+	largest_bits = (size == 8) ? MTE_BITS_PER_LARGEST_IDX_INLINE :
+				     MTE_BITS_PER_LARGEST_IDX;
+	ret = (size * 8 + MTE_BITS_PER_SIZE - largest_bits) /
+	      (MTE_BITS_PER_TAG + MTE_BITS_PER_SIZE);
+	return ret;
+}
+
+/* Translate @num_ranges into the allocation size needed to hold them. */
+static size_t mte_alloc_size(unsigned int num_ranges)
+{
+	size_t sizes[4] = { 8, 16, 32, 64 };
+	unsigned int i;
+
+	for (i = 0; i < ARRAY_SIZE(sizes); i++) {
+		if (num_ranges <= mte_size_to_ranges(sizes[i]))
+			return sizes[i];
+	}
+	return 128;
+}
+
+/* Is the data stored inline in the handle itself? */
+static bool mte_is_inline(unsigned long handle)
+{
+	return (handle & MTE_NOINLINE_MASK) != MTE_NOINLINE_MASK;
+}
+
+/**
+ * mte_storage_size() - calculate the memory occupied by compressed tags.
+ * @handle: storage handle returned by mte_compress().
+ * Returns: size of the storage used for @handle.
+ */
+size_t mte_storage_size(unsigned long handle)
+{
+	if (mte_is_inline(handle))
+		return 8;
+	return mte_cache_id_to_size(handle & MTE_CACHE_ID_MASK);
+}
+EXPORT_SYMBOL_NS(mte_storage_size, MTECOMP);
+
+static void mte_bitmap_write(unsigned long *bitmap, unsigned long value,
+			     unsigned long *pos, unsigned long bits)
+{
+	bitmap_write(bitmap, value, *pos, bits);
+	*pos += bits;
+}
+
+static inline unsigned long mte_largest_idx_bits(size_t size)
+{
+	if (size == 8)
+		return MTE_BITS_PER_LARGEST_IDX_INLINE;
+	return MTE_BITS_PER_LARGEST_IDX;
+}
+
+/* Compress ranges into the buffer that can accommodate up to max_ranges. */
+static void mte_compress_to_buf(size_t len, u8 *tags, unsigned short *sizes,
+				unsigned long *bitmap, size_t size)
+{
+	unsigned long bit_pos = 0, l_bits;
+	unsigned int largest_idx, i;
+	unsigned short largest = 0;
+	size_t max_ranges;
+
+	for (i = 0; i < len; i++) {
+		if (sizes[i] > largest) {
+			largest = sizes[i];
+			largest_idx = i;
+		}
+	}
+	l_bits = mte_largest_idx_bits(size);
+	max_ranges = mte_size_to_ranges(size);
+	mte_bitmap_write(bitmap, largest_idx, &bit_pos, l_bits);
+	for (i = 0; i < len; i++)
+		mte_bitmap_write(bitmap, tags[i], &bit_pos, MTE_BITS_PER_TAG);
+	for (i = len; i < max_ranges; i++)
+		mte_bitmap_write(bitmap, 0, &bit_pos, MTE_BITS_PER_TAG);
+	for (i = 0; i < len; i++) {
+		if (i != largest_idx)
+			mte_bitmap_write(bitmap, sizes[i], &bit_pos,
+					 MTE_BITS_PER_SIZE);
+	}
+	for (i = len; i < max_ranges; i++)
+		mte_bitmap_write(bitmap, 0, &bit_pos, MTE_BITS_PER_SIZE);
+}
+
+/**
+ * mte_compress() - compress the given tag array.
+ * @tags: 128-byte array to read the tags from.
+ *
+ * Compresses the tags and returns a 64-bit opaque handle pointing to the
+ * tag storage. May allocate memory, which is freed by mte_release_handle().
+ * Returns: 64-bit tag storage handle.
+ */
+unsigned long mte_compress(u8 *tags)
+{
+	unsigned short r_sizes[256];
+	struct kmem_cache *cache;
+	unsigned long *storage;
+	unsigned int cache_id;
+	size_t alloc_size;
+	u8 r_tags[256];
+	size_t r_len;
+	/*
+	 * mte_compress_to_buf() only initializes the bits that mte_decompress()
+	 * will read. But when the tags are stored in the handle itself, it must
+	 * have all its bits initialized.
+	 */
+	unsigned long result = 0;
+
+	r_len = sizeof(r_tags);
+	mte_tags_to_ranges(tags, r_tags, r_sizes, &r_len);
+	alloc_size = mte_alloc_size(r_len);
+	if (alloc_size == 8) {
+		mte_compress_to_buf(r_len, r_tags, r_sizes, &result,
+				    alloc_size);
+		return result;
+	}
+	cache_id = mte_size_to_cache_id(alloc_size);
+	cache = mtecomp_caches[cache_id];
+	storage = kmem_cache_alloc(cache, GFP_KERNEL);
+	if (alloc_size < 128) {
+		/* alloc_size is always a multiple of sizeof(unsigned long). */
+		mte_compress_to_buf(r_len, r_tags, r_sizes, storage,
+				    alloc_size);
+		return ((unsigned long)storage | cache_id) & MTE_HANDLE_MASK;
+	}
+	memcpy(storage, tags, alloc_size);
+	return (unsigned long)storage & MTE_HANDLE_MASK;
+}
+EXPORT_SYMBOL_NS(mte_compress, MTECOMP);
+
+static unsigned long mte_bitmap_read(const unsigned long *bitmap,
+				     unsigned long *pos, unsigned long bits)
+{
+	unsigned long result;
+
+	result = bitmap_read(bitmap, *pos, bits);
+	*pos += bits;
+	return result;
+}
+
+/* Decompress the contents of the given buffer into @tags. */
+static bool mte_decompress_from_buf(const unsigned long *bitmap, size_t size,
+				    u8 *tags)
+{
+	unsigned short r_sizes[46], sum = 0;
+	unsigned long bit_pos = 0, l_bits;
+	unsigned int largest_idx, i;
+	size_t max_ranges;
+	u8 r_tags[46];
+
+	max_ranges = mte_size_to_ranges(size);
+	l_bits = mte_largest_idx_bits(size);
+	largest_idx = mte_bitmap_read(bitmap, &bit_pos, l_bits);
+	for (i = 0; i < max_ranges; i++)
+		r_tags[i] = mte_bitmap_read(bitmap, &bit_pos, MTE_BITS_PER_TAG);
+	for (i = 0; i < max_ranges; i++) {
+		if (i == largest_idx)
+			continue;
+		r_sizes[i] =
+			mte_bitmap_read(bitmap, &bit_pos, MTE_BITS_PER_SIZE);
+		if (!r_sizes[i]) {
+			max_ranges = i;
+			break;
+		}
+		sum += r_sizes[i];
+	}
+	if (sum >= 256)
+		return false;
+	r_sizes[largest_idx] = 256 - sum;
+	mte_ranges_to_tags(r_tags, r_sizes, max_ranges, tags);
+	return true;
+}
+
+/* Get pointer to the out-of-line storage from a handle. */
+static void *mte_storage(unsigned long handle)
+{
+	if (mte_is_inline(handle))
+		return NULL;
+	return (void *)((handle & (~MTE_CACHE_ID_MASK)) | BIT_ULL(63));
+}
+
+/**
+ * mte_decompress() - decompress the tag array addressed by the handle.
+ * @handle: handle returned by mte_compress().
+ * @tags: 128-byte array to write the tags to.
+ *
+ * Reads the compressed data and writes it into the user-supplied tag array.
+ * Returns: true on success, false on error.
+ */
+bool mte_decompress(unsigned long handle, u8 *tags)
+{
+	unsigned long *storage = mte_storage(handle);
+	size_t size = mte_storage_size(handle);
+
+	switch (size) {
+	case 8:
+		return mte_decompress_from_buf(&handle, size, tags);
+	case 128:
+		memcpy(tags, storage, size);
+		return true;
+	default:
+		return mte_decompress_from_buf(storage, size, tags);
+	}
+}
+EXPORT_SYMBOL_NS(mte_decompress, MTECOMP);
+
+/**
+ * mte_release_handle() - release the handle returned by mte_compress().
+ * @handle: handle returned by mte_compress().
+ */
+void mte_release_handle(unsigned long handle)
+{
+	struct kmem_cache *c;
+	void *storage;
+	size_t size;
+
+	storage = mte_storage(handle);
+	if (!storage)
+		return;
+
+	size = mte_storage_size(handle);
+	c = mtecomp_caches[mte_size_to_cache_id(size)];
+	kmem_cache_free(c, storage);
+}
+EXPORT_SYMBOL_NS(mte_release_handle, MTECOMP);
+
+/* Set up mtecomp_caches[]. */
+static int mtecomp_init(void)
+{
+	unsigned int i;
+	char name[16];
+	size_t size;
+
+	static_assert(MTE_PAGE_TAG_STORAGE == 128);
+	static_assert(MTE_TAG_SIZE == MTE_BITS_PER_TAG);
+	for (i = 0; i < MTECOMP_NUM_CACHES; i++) {
+		size = mte_cache_id_to_size(i);
+		snprintf(name, sizeof(name), "mte-tags-%zu", size);
+		mtecomp_caches[i] =
+			kmem_cache_create(name, size, size, 0, NULL);
+	}
+	return 0;
+}
+module_init(mtecomp_init);
-- 
2.41.0.487.g6d72f3e995-goog


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v4 3/5] arm64: mte: implement CONFIG_ARM64_MTE_COMP
@ 2023-07-20 17:39   ` Alexander Potapenko
  0 siblings, 0 replies; 61+ messages in thread
From: Alexander Potapenko @ 2023-07-20 17:39 UTC (permalink / raw)
  To: glider, catalin.marinas, will, pcc, andreyknvl,
	andriy.shevchenko, linux, yury.norov
  Cc: linux-kernel, linux-arm-kernel, eugenis, syednwaris, william.gray

The config implements the algorithm compressing memory tags for ARM MTE
during swapping.

The algorithm is based on RLE and specifically targets 128-byte buffers
of tags corresponding to a single page. In the common case a buffer
can be compressed into 63 bits, making it possible to store it without
additional memory allocation.

Suggested-by: Evgenii Stepanov <eugenis@google.com>
Signed-off-by: Alexander Potapenko <glider@google.com>

---
 v4:
  - Addressed comments by Andy Shevchenko:
    - expanded "MTE" to "Memory Tagging Extension" in Kconfig
    - fixed kernel-doc comments, moved them to C source
    - changed variables to unsigned where applicable
    - some code simplifications, fewer unnecessary assignments
    - added the mte_largest_idx_bits() helper
    - added namespace prefixes to all functions
    - added missing headers (but removed bits.h)
  - Addressed comments by Yury Norov:
    - removed test-only functions from mtecomp.h
    - dropped the algoritm name (all functions are now prefixed with
      "mte")
    - added more comments
    - got rid of MTE_RANGES_INLINE
    - renamed bitmap_{get,set}_value() to bitmap_{read,write}()
    - moved the big comment explaining the algorithm to
      Documentation/arch/arm64/mte-tag-compression.rst, expanded it,
      add a link to it from Documentation/arch/arm64/index.rst
    - removed hardcoded ranges from mte_alloc_size()/mte_size_to_ranges()

 v3:
  - Addressed comments by Andy Shevchenko:
   - use bitmap_{set,get}_value() writte by Syed Nayyar Waris
   - switched to unsigned long everywhere (fewer casts)
   - simplified the code, removed redundant checks
   - dropped ea0_compress_inline()
 - added bit size constants and helpers to access the bitmap
 - explicitly initialize all compressed sizes in ea0_compress_to_buf()
 - initialize all handle bits

 v2:
  - as suggested by Yury Norov, switched from struct bitq (which is
    not needed anymore) to <linux/bitmap.h>
  - add missing symbol exports
---
 Documentation/arch/arm64/index.rst            |   1 +
 .../arch/arm64/mte-tag-compression.rst        | 212 ++++++++++
 arch/arm64/Kconfig                            |   9 +
 arch/arm64/include/asm/mtecomp.h              |  13 +
 arch/arm64/mm/Makefile                        |   1 +
 arch/arm64/mm/mtecomp.c                       | 382 ++++++++++++++++++
 6 files changed, 618 insertions(+)
 create mode 100644 Documentation/arch/arm64/mte-tag-compression.rst
 create mode 100644 arch/arm64/include/asm/mtecomp.h
 create mode 100644 arch/arm64/mm/mtecomp.c

diff --git a/Documentation/arch/arm64/index.rst b/Documentation/arch/arm64/index.rst
index d08e924204bf1..bf6c1583233a9 100644
--- a/Documentation/arch/arm64/index.rst
+++ b/Documentation/arch/arm64/index.rst
@@ -19,6 +19,7 @@ ARM64 Architecture
     legacy_instructions
     memory
     memory-tagging-extension
+    mte-tag-compression
     perf
     pointer-authentication
     ptdump
diff --git a/Documentation/arch/arm64/mte-tag-compression.rst b/Documentation/arch/arm64/mte-tag-compression.rst
new file mode 100644
index 0000000000000..af6716d53c1a8
--- /dev/null
+++ b/Documentation/arch/arm64/mte-tag-compression.rst
@@ -0,0 +1,212 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================================================
+Tag Compression for Memory Tagging Extension (MTE)
+==================================================
+
+This document describes the algorithm used to compress memory tags for the
+ARM Memory Tagging Extension (MTE).
+
+Introduction
+============
+
+MTE assigns tags to memory pages: for 4K pages those tags occupy 128 bytes
+(256 4-bit tags each corresponding to a 16-byte MTE granule). By default, MTE
+carves out 3.125% (1/16) of the available physical memory to store the tags.
+
+When MTE pages are saved to swap, their tags need to be stored in the kernel
+memory. If the system swap is used heavily, these tags may take a substantial
+portion of the physical memory, which in the case of a zram-backed swap may
+even exceed the memory used to store the swapped pages themselves. To reduce
+memory waste, ``CONFIG_ARM64_MTE_COMP`` allows the kernel to store the tags in
+compressed form.
+
+Implementation details
+======================
+
+The algorithm attempts to compress a 128-byte (``MTE_GRANULES_PER_PAGE / 2``)
+array of tags into a smaller byte sequence that can be stored in a
+16-, 32-, or 64-byte buffer. A special case is storing the tags inline in
+the 8-byte handle itself.
+
+Tag manipulation and storage
+----------------------------
+
+Tags for swapped pages are stored in an XArray that maps swap entries to 63-bit
+values (see ``arch/arm64/mm/mteswap.c``). In the case when
+``CONFIG_ARM64_MTE_COMP=n``, these values contain pointers to 128-byte buffers
+allocated with kmalloc(). Otherwise, they are 63-bit handles used by the
+functions declared in ``arch/arm64/include/asm/mtecomp.h``:
+
+- mte_compress() compresses the given 128-byte ``tags`` buffer, allocates
+  storage for it, and returns an opaque handle addressing that storage;
+- mte_decompress() decompresses the tags addressed by ``handle``
+  and fills the 128-byte ``tags`` buffer;
+- mte_release_handle() releases the storage handle returned by
+  mte_compress() (so that this handle cannot be used anymore);
+- mte_storage_size() calculates the size occupied by the tags addressed
+  by ``handle``.
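+
+For illustration, a compress/decompress round trip could look as follows
+(hypothetical caller, error handling omitted)::
+
+  u8 tags[128], restored[128];
+  unsigned long handle;
+
+  handle = mte_compress(tags);
+  if (mte_decompress(handle, restored)) {
+          /* restored[] now matches tags[]. */
+  }
+  mte_release_handle(handle);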
+
+Depending on the size of compressed data, ``mte_compress()`` stores it in one of
+the size classes backed by kmem caches: ``mte-tags-16``, ``mte-tags-32``,
+``mte-tags-64``, or ``mte-tags-128`` (for the data that cannot be compressed
+into 64 bytes and is stored uncompressed).
+In a practical common case the tags can be compressed into 8 bytes; they are
+then stored in the handle itself.
+
+Handle format
+-------------
+
+The handle returned by ``mte_compress()`` is an ``unsigned long`` that has its
+bit 63 set to 0 (XArray entries must not exceed ``LONG_MAX``)::
+
+   63  62    60  ...   2         0
+  +---+--------+-----+------------+
+  | 0 | INLINE | ... | SIZE_CLASS |
+  +---+--------+-----+------------+
+
+Bits ``62..60`` are the inline/out-of-line marker: if they are all set to 1, the
+data is stored out-of-line in the buffer pointed to by
+``(handle | BIT(63)) & ~7UL``. Otherwise, the data is stored inline in the
+handle itself.
+
+Bits ``2..0`` denote the size class for out-of-line allocations:
+
+- ``0b001`` for ``mte-tags-16``;
+- ``0b010`` for ``mte-tags-32``;
+- ``0b100`` for ``mte-tags-64``;
+- ``0b000`` for ``mte-tags-128``.
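+
+For illustration, the decoding described above can be written as follows (the
+variable names are ours, not part of the API)::
+
+  bool is_inline = ((handle >> 60) & 7) != 7;
+  /* Only meaningful in the out-of-line case: */
+  void *storage = (void *)((handle | BIT(63)) & ~7UL);
+  unsigned int size_class = handle & 7;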
+
+
+Tag compression
+---------------
+
+The compression algorithm is a variation of RLE (run-length encoding) and works
+as follows.
+
+1. The input array of 128 bytes is transformed into tag ranges (two arrays:
+   ``r_tags[]`` containing tag values and ``r_sizes[]`` containing range
+   lengths) by ``mte_tags_to_ranges()``. Note that ``r_sizes[]`` sums up to 256.
+
+2. The index of the largest element of ``r_sizes[]`` is stored in
+   ``largest_idx``. The element itself is dropped from ``r_sizes[]``, because
+   it can be reconstructed from the sum of the remaining elements. Note that
+   now none of the remaining ``r_sizes[]`` elements exceeds 127.
+
+3. Depending on the number ``N`` of ranges, a storage class is picked::
+
+           N <= 6:  8 bytes (inline case, no allocation required);
+      6 < N <= 11: 16 bytes
+     11 < N <= 23: 32 bytes
+     23 < N <= 46: 64 bytes
+     46 < N:       128 bytes (no compression will be performed)
+
+(See `Why these numbers?`_ below).
+
+4. For the inline case, the following values are stored packed in the 8-byte
+   handle (``i<size>`` means a ``<size>``-bit unsigned integer)::
+
+      largest_idx : i4
+     r_tags[0..5] : i4 x 6
+    r_sizes[0..4] : i7 x 5
+
+   (if N is less than 6, ``r_tags`` and ``r_sizes`` are padded up with zero
+   values)
+
+   Because ``largest_idx`` is <= 5, bit 63 of the handle is always 0 (so it
+   can be stored in the XArray), and bits 62..60 cannot all be 1, so it can be
+   distinguished from a kernel pointer.
+
+5. For the out-of-line case, the storage is allocated from one of the
+   ``mte-tags-{16,32,64,128}`` kmem caches. The resulting pointer is aligned
+   on 8 bytes, so its bits 2..0 can be used to store the size class (see above).
+
+   Bit 63 of the pointer is zeroed out, so that it can be stored in XArray.
+
+6. The data layout in the allocated storage is as follows::
+
+        largest_idx : i6
+       r_tags[0..N] : i4 x N
+    r_sizes[0..N-1] : i7 x (N-1)
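+
+For example, a page whose first 16 granules (256 bytes) have tag 0xA and the
+rest tag 0x0 produces two ranges: ``r_tags = {0xA, 0x0}`` and
+``r_sizes = {16, 240}``. Here ``largest_idx = 1`` and the element 240 is
+dropped, so the whole page compresses into the 8-byte inline case.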
+
+Tag decompression
+-----------------
+
+The decompression algorithm performs the steps below.
+
+1. Decide if data is stored inline (bits ``62..60`` of the handle ``!= 0b111``)
+   or out-of-line.
+
+2. For the inline case, treat the handle itself as the input buffer.
+
+3. For the out-of-line case, look at bits ``2..0`` of the handle to understand
+   the input buffer length. To obtain the pointer to the input buffer, unset
+   bits ``2..0`` of the handle and set bit ``63``.
+
+4. If the input buffer is 128 bytes long, copy its contents to the output
+   buffer.
+
+5. Otherwise, read ``largest_idx``, ``r_tags[]`` and ``r_sizes[]`` from the
+   input buffer. Calculate the removed largest element of ``r_sizes[]`` as
+   ``largest = 256 - sum(r_sizes)`` and insert it into ``r_sizes`` at
+   position ``largest_idx``.
+
+6. For each ``r_sizes[i] > 0``, add a 4-bit value ``r_tags[i]`` to the output
+   buffer ``r_sizes[i]`` times.
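+
+Continuing the example above, the decompressor reads ``largest_idx = 1`` and
+``r_sizes[0] = 16``; the next stored size is 0, which the implementation
+treats as the end of the range list. The removed element is then restored as
+``r_sizes[1] = 256 - 16 = 240``, and the two ranges are expanded back into
+256 tags.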
+
+
+Why these numbers?
+------------------
+
+To be able to reconstruct N tag ranges from the compressed data, we need to
+store ``largest_idx``, ``r_tags[N]``, and ``r_sizes[N-1]``. Knowing that the
+sizes do not exceed 127, each of them can be packed into 7 bits, whereas a
+single tag occupies 4 bits, and ``largest_idx`` cannot take more than 8 bits.
+
+Now, for each ``S``-byte size class it is possible to find the maximal number
+``M`` such that ``8 + 4 * M + 7 * (M - 1) <= 8 * S``,
+i.e. ``M = (8 * S - 1) / 11``::
+
+ +-------------+----+--------------+
+ | Buffer size |  M | Storage bits |
+ +-------------+----+--------------+
+ |          8  |  5 |          56  |
+ |         16  | 11 |         122  |
+ |         32  | 23 |         254  |
+ |         64  | 46 |         507  |
+ +-------------+----+--------------+
+
+Note that ``M`` (and therefore ``largest_idx``) always fits into 6 bits. For
+the inline case it is even guaranteed to fit into 3 bits, which lets us
+squeeze an extra range into an 8-byte buffer. Because the inline case requires
+bit 63 of the handle to be zero, we add that bit to ``largest_idx``, knowing
+it will not be used.
+
+With the revised ``largest_idx`` sizes, we now pick the maximal number ``N``
+such that ``L + 4 * N + 7 * (N - 1) <= 8 * S``, where ``L = 4`` in the inline
+case and ``L = 6`` otherwise.
+In other words, ``N = (8 * S + 7 - L) / 11``, therefore::
+
+  +-------------+----+--------------+
+  | Buffer size |  N | Storage bits |
+  +-------------+----+--------------+
+  |          8  |  6 |          63  |
+  |         16  | 11 |         120  |
+  |         32  | 23 |         252  |
+  |         64  | 46 |         505  |
+  +-------------+----+--------------+
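+
+For example, for ``S = 8`` (the inline case): ``N = (8 * 8 + 7 - 4) / 11 = 6``,
+matching the first row of the table.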
+
+
+Note
+----
+
+Tag compression and decompression implicitly rely on the fixed MTE tag size
+(4 bits) and number of tags per page. Should these values change, the algorithm
+may need to be revised.
+
+
+Programming Interface
+=====================
+
+ .. kernel-doc:: arch/arm64/mm/mtecomp.c
+
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index a2511b30d0f67..3ac6e302b1509 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -2093,6 +2093,15 @@ config ARM64_EPAN
 	  if the cpu does not implement the feature.
 endmenu # "ARMv8.7 architectural features"
 
+config ARM64_MTE_COMP
+	bool "Tag compression for ARM64 Memory Tagging Extension"
+	default y
+	depends on ARM64_MTE
+	help
+	  Enable tag compression support for ARM64 Memory Tagging Extension.
+
+	  128-byte tag buffers corresponding to 4K pages can be compressed to save heap memory.
+
 config ARM64_SVE
 	bool "ARM Scalable Vector Extension support"
 	default y
diff --git a/arch/arm64/include/asm/mtecomp.h b/arch/arm64/include/asm/mtecomp.h
new file mode 100644
index 0000000000000..71552bc429882
--- /dev/null
+++ b/arch/arm64/include/asm/mtecomp.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef __ASM_MTECOMP_H
+#define __ASM_MTECOMP_H
+
+#include <linux/types.h>
+
+unsigned long mte_compress(u8 *tags);
+bool mte_decompress(unsigned long handle, u8 *tags);
+void mte_release_handle(unsigned long handle);
+size_t mte_storage_size(unsigned long handle);
+
+#endif // __ASM_MTECOMP_H
diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
index dbd1bc95967d0..46778f6dd83c2 100644
--- a/arch/arm64/mm/Makefile
+++ b/arch/arm64/mm/Makefile
@@ -10,6 +10,7 @@ obj-$(CONFIG_TRANS_TABLE)	+= trans_pgd.o
 obj-$(CONFIG_TRANS_TABLE)	+= trans_pgd-asm.o
 obj-$(CONFIG_DEBUG_VIRTUAL)	+= physaddr.o
 obj-$(CONFIG_ARM64_MTE)		+= mteswap.o
+obj-$(CONFIG_ARM64_MTE_COMP)	+= mtecomp.o
 KASAN_SANITIZE_physaddr.o	+= n
 
 obj-$(CONFIG_KASAN)		+= kasan_init.o
diff --git a/arch/arm64/mm/mtecomp.c b/arch/arm64/mm/mtecomp.c
new file mode 100644
index 0000000000000..6d27f69f9521f
--- /dev/null
+++ b/arch/arm64/mm/mtecomp.c
@@ -0,0 +1,382 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+/*
+ * MTE tag compression algorithm.
+ * Proposed by Evgenii Stepanov <eugenis@google.com>
+ * See Documentation/arch/arm64/mte-tag-compression.rst for more details.
+ */
+
+#include <linux/bitmap.h>
+#include <linux/bitops.h>
+#include <linux/export.h>
+#include <linux/gfp.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/string.h>
+#include <linux/types.h>
+
+#include <asm/mtecomp.h>
+
+/* The handle must fit into an XArray value. */
+#define MTE_HANDLE_MASK GENMASK_ULL(62, 0)
+
+/* Out-of-line handles have 0b111 in bits 62..60. */
+#define MTE_NOINLINE_MASK GENMASK_ULL(62, 60)
+
+/* Cache index is stored in the lowest pointer bits. */
+#define MTE_CACHE_ID_MASK GENMASK_ULL(2, 0)
+
+/*
+ * Four separate caches to store out-of-line data:
+ *  0: mte-tags-128
+ *  1: mte-tags-16
+ *  2: mte-tags-32
+ *  3: mte-tags-64
+ */
+#define MTECOMP_NUM_CACHES 4
+static struct kmem_cache *mtecomp_caches[MTECOMP_NUM_CACHES];
+
+/*
+ * Sizes of compressed values. These depend on MTE_TAG_SIZE and
+ * MTE_GRANULES_PER_PAGE.
+ */
+#define MTE_BITS_PER_TAG 4
+#define MTE_BITS_PER_SIZE 7
+#define MTE_BITS_PER_LARGEST_IDX_INLINE 4
+#define MTE_BITS_PER_LARGEST_IDX 6
+
+/* Translate allocation size into mtecomp_caches[] index. */
+static unsigned int mte_size_to_cache_id(size_t len)
+{
+	if (len < 128)
+		return fls(len >> 4);
+	return 0;
+}
+
+/* Translate mtecomp_caches[] index into allocation size. */
+static size_t mte_cache_id_to_size(unsigned int id)
+{
+	if (id == 0)
+		return 128;
+	return 8 << id;
+}
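+
+/*
+ * Example round trips: 16 -> 1 -> 16, 32 -> 2 -> 32, 64 -> 3 -> 64;
+ * id 0 is reserved for the uncompressed 128-byte case.
+ */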
+
+/**
+ * mte_tags_to_ranges() - break @tags into arrays of tag ranges.
+ * @tags: 128-byte array containing 256 MTE tags.
+ * @out_tags: u8 array to store the tag of every range.
+ * @out_sizes: unsigned short array to store the size of every range.
+ * @out_len: on input, the capacity of @out_tags and @out_sizes; on output,
+ *           the number of ranges written to them.
+ */
+void mte_tags_to_ranges(u8 *tags, u8 *out_tags, unsigned short *out_sizes,
+			size_t *out_len)
+{
+	u8 prev_tag = tags[0] / 16; /* First tag in the array. */
+	unsigned int cur_idx = 0, i, j;
+	u8 cur_tag;
+
+	memset(out_tags, 0, array_size(*out_len, sizeof(*out_tags)));
+	memset(out_sizes, 0, array_size(*out_len, sizeof(*out_sizes)));
+
+	out_tags[0] = prev_tag;
+	for (i = 0; i < MTE_PAGE_TAG_STORAGE; i++) {
+		for (j = 0; j < 2; j++) {
+			cur_tag = j ? (tags[i] % 16) : (tags[i] / 16);
+			if (cur_tag == prev_tag) {
+				out_sizes[cur_idx]++;
+			} else {
+				cur_idx++;
+				prev_tag = cur_tag;
+				out_tags[cur_idx] = prev_tag;
+				out_sizes[cur_idx] = 1;
+			}
+		}
+	}
+	*out_len = cur_idx + 1;
+}
+EXPORT_SYMBOL_NS(mte_tags_to_ranges, MTECOMP);
+
+/**
+ * mte_ranges_to_tags() - fill @tags using given tag ranges.
+ * @r_tags: u8[] containing the tag of every range.
+ * @r_sizes: unsigned short[] containing the size of every range.
+ * @r_len: length of @r_tags and @r_sizes.
+ * @tags: 128-byte array to write the tags to.
+ */
+void mte_ranges_to_tags(u8 *r_tags, unsigned short *r_sizes, size_t r_len,
+			u8 *tags)
+{
+	unsigned int i, j, pos = 0;
+	u8 prev;
+
+	for (i = 0; i < r_len; i++) {
+		for (j = 0; j < r_sizes[i]; j++) {
+			if (pos % 2)
+				tags[pos / 2] = (prev << 4) | r_tags[i];
+			else
+				prev = r_tags[i];
+			pos++;
+		}
+	}
+}
+EXPORT_SYMBOL_NS(mte_ranges_to_tags, MTECOMP);
+
+/*
+ * Translate allocation size into maximum number of ranges that it can hold.
+ *
+ * It is the biggest number N such that:
+ *   4 + 4*N + 7*(N-1) <= 63 bits, for the inline case,
+ * or
+ *   6 + 4*N + 7*(N-1) <= array size in bits (128, 256, or 512),
+ * for the out-of-line case.
+ */
+static size_t mte_size_to_ranges(size_t size)
+{
+	size_t largest_bits;
+	size_t ret = 0;
+
+	largest_bits = (size == 8) ? MTE_BITS_PER_LARGEST_IDX_INLINE :
+				     MTE_BITS_PER_LARGEST_IDX;
+	ret = (size * 8 + MTE_BITS_PER_SIZE - largest_bits) /
+	      (MTE_BITS_PER_TAG + MTE_BITS_PER_SIZE);
+	return ret;
+}
+
+/* Translate @num_ranges into the allocation size needed to hold them. */
+static size_t mte_alloc_size(unsigned int num_ranges)
+{
+	size_t sizes[4] = { 8, 16, 32, 64 };
+	unsigned int i;
+
+	for (i = 0; i < ARRAY_SIZE(sizes); i++) {
+		if (num_ranges <= mte_size_to_ranges(sizes[i]))
+			return sizes[i];
+	}
+	return 128;
+}
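+
+/*
+ * With the current constants this gives: up to 6 ranges -> 8 bytes (inline),
+ * up to 11 -> 16, up to 23 -> 32, up to 46 -> 64; more than 46 ranges are
+ * stored uncompressed in 128 bytes.
+ */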
+
+/* Is the data stored inline in the handle itself? */
+static bool mte_is_inline(unsigned long handle)
+{
+	return (handle & MTE_NOINLINE_MASK) != MTE_NOINLINE_MASK;
+}
+
+/**
+ * mte_storage_size() - calculate the memory occupied by compressed tags.
+ * @handle: storage handle returned by mte_compress.
+ * Returns: size of the storage used for @handle.
+ */
+size_t mte_storage_size(unsigned long handle)
+{
+	if (mte_is_inline(handle))
+		return 8;
+	return mte_cache_id_to_size(handle & MTE_CACHE_ID_MASK);
+}
+EXPORT_SYMBOL_NS(mte_storage_size, MTECOMP);
+
+static void mte_bitmap_write(unsigned long *bitmap, unsigned long value,
+			     unsigned long *pos, unsigned long bits)
+{
+	bitmap_write(bitmap, value, *pos, bits);
+	*pos += bits;
+}
+
+static inline unsigned long mte_largest_idx_bits(size_t size)
+{
+	if (size == 8)
+		return MTE_BITS_PER_LARGEST_IDX_INLINE;
+	return MTE_BITS_PER_LARGEST_IDX;
+}
+
+/* Compress ranges into the buffer that can accommodate up to max_ranges. */
+static void mte_compress_to_buf(size_t len, u8 *tags, unsigned short *sizes,
+				unsigned long *bitmap, size_t size)
+{
+	unsigned long bit_pos = 0, l_bits;
+	unsigned int largest_idx, i;
+	unsigned short largest = 0;
+	size_t max_ranges;
+
+	for (i = 0; i < len; i++) {
+		if (sizes[i] > largest) {
+			largest = sizes[i];
+			largest_idx = i;
+		}
+	}
+	l_bits = mte_largest_idx_bits(size);
+	max_ranges = mte_size_to_ranges(size);
+	mte_bitmap_write(bitmap, largest_idx, &bit_pos, l_bits);
+	for (i = 0; i < len; i++)
+		mte_bitmap_write(bitmap, tags[i], &bit_pos, MTE_BITS_PER_TAG);
+	for (i = len; i < max_ranges; i++)
+		mte_bitmap_write(bitmap, 0, &bit_pos, MTE_BITS_PER_TAG);
+	for (i = 0; i < len; i++) {
+		if (i != largest_idx)
+			mte_bitmap_write(bitmap, sizes[i], &bit_pos,
+					 MTE_BITS_PER_SIZE);
+	}
+	for (i = len; i < max_ranges; i++)
+		mte_bitmap_write(bitmap, 0, &bit_pos, MTE_BITS_PER_SIZE);
+}
+
+/**
+ * mte_compress() - compress the given tag array.
+ * @tags: 128-byte array to read the tags from.
+ *
+ * Compresses the tags and returns a 64-bit opaque handle pointing to the
+ * tag storage. May allocate memory, which is freed by @mte_release_handle().
+ * Returns: 64-bit tag storage handle.
+ */
+unsigned long mte_compress(u8 *tags)
+{
+	unsigned short r_sizes[256];
+	struct kmem_cache *cache;
+	unsigned long *storage;
+	unsigned int cache_id;
+	size_t alloc_size;
+	u8 r_tags[256];
+	size_t r_len;
+	/*
+	 * mte_compress_to_buf() only initializes the bits that mte_decompress()
+	 * will read. But when the tags are stored in the handle itself, it must
+	 * have all its bits initialized.
+	 */
+	unsigned long result = 0;
+
+	r_len = sizeof(r_tags);
+	mte_tags_to_ranges(tags, r_tags, r_sizes, &r_len);
+	alloc_size = mte_alloc_size(r_len);
+	if (alloc_size == 8) {
+		mte_compress_to_buf(r_len, r_tags, r_sizes, &result,
+				    alloc_size);
+		return result;
+	}
+	cache_id = mte_size_to_cache_id(alloc_size);
+	cache = mtecomp_caches[cache_id];
+	storage = kmem_cache_alloc(cache, GFP_KERNEL);
+	if (alloc_size < 128) {
+		/* alloc_size is always a multiple of sizeof(unsigned long). */
+		mte_compress_to_buf(r_len, r_tags, r_sizes, storage,
+				    alloc_size);
+		return ((unsigned long)storage | cache_id) & MTE_HANDLE_MASK;
+	}
+	memcpy(storage, tags, alloc_size);
+	return (unsigned long)storage & MTE_HANDLE_MASK;
+}
+EXPORT_SYMBOL_NS(mte_compress, MTECOMP);
+
+static unsigned long mte_bitmap_read(const unsigned long *bitmap,
+				     unsigned long *pos, unsigned long bits)
+{
+	unsigned long result;
+
+	result = bitmap_read(bitmap, *pos, bits);
+	*pos += bits;
+	return result;
+}
+
+/* Decompress the contents of the given buffer into @tags. */
+static bool mte_decompress_from_buf(const unsigned long *bitmap, size_t size,
+				    u8 *tags)
+{
+	unsigned short r_sizes[46], sum = 0;
+	unsigned long bit_pos = 0, l_bits;
+	unsigned int largest_idx, i;
+	size_t max_ranges;
+	u8 r_tags[46];
+
+	max_ranges = mte_size_to_ranges(size);
+	l_bits = mte_largest_idx_bits(size);
+	largest_idx = mte_bitmap_read(bitmap, &bit_pos, l_bits);
+	for (i = 0; i < max_ranges; i++)
+		r_tags[i] = mte_bitmap_read(bitmap, &bit_pos, MTE_BITS_PER_TAG);
+	for (i = 0; i < max_ranges; i++) {
+		if (i == largest_idx)
+			continue;
+		r_sizes[i] =
+			mte_bitmap_read(bitmap, &bit_pos, MTE_BITS_PER_SIZE);
+		if (!r_sizes[i]) {
+			max_ranges = i;
+			break;
+		}
+		sum += r_sizes[i];
+	}
+	if (sum >= 256)
+		return false;
+	r_sizes[largest_idx] = 256 - sum;
+	mte_ranges_to_tags(r_tags, r_sizes, max_ranges, tags);
+	return true;
+}
+
+/* Get pointer to the out-of-line storage from a handle. */
+static void *mte_storage(unsigned long handle)
+{
+	if (mte_is_inline(handle))
+		return NULL;
+	return (void *)((handle & (~MTE_CACHE_ID_MASK)) | BIT_ULL(63));
+}
+
+/**
+ * mte_decompress() - decompress the tag array addressed by the handle.
+ * @handle: handle returned by mte_compress()
+ * @tags: 128-byte array to write the tags to.
+ *
+ * Reads the compressed data and writes it into the user-supplied tag array.
+ * Returns: true on success, false on error.
+ */
+bool mte_decompress(unsigned long handle, u8 *tags)
+{
+	unsigned long *storage = mte_storage(handle);
+	size_t size = mte_storage_size(handle);
+
+	switch (size) {
+	case 8:
+		return mte_decompress_from_buf(&handle, size, tags);
+	case 128:
+		memcpy(tags, storage, size);
+		return true;
+	default:
+		return mte_decompress_from_buf(storage, size, tags);
+	}
+}
+EXPORT_SYMBOL_NS(mte_decompress, MTECOMP);
+
+/**
+ * mte_release_handle() - release the handle returned by mte_compress().
+ * @handle: handle returned by mte_compress().
+ */
+void mte_release_handle(unsigned long handle)
+{
+	struct kmem_cache *c;
+	void *storage;
+	size_t size;
+
+	storage = mte_storage(handle);
+	if (!storage)
+		return;
+
+	size = mte_storage_size(handle);
+	c = mtecomp_caches[mte_size_to_cache_id(size)];
+	kmem_cache_free(c, storage);
+}
+EXPORT_SYMBOL_NS(mte_release_handle, MTECOMP);
+
+/* Set up mtecomp_caches[]. */
+static int mtecomp_init(void)
+{
+	unsigned int i;
+	char name[16];
+	size_t size;
+
+	static_assert(MTE_PAGE_TAG_STORAGE == 128);
+	static_assert(MTE_TAG_SIZE == MTE_BITS_PER_TAG);
+	for (i = 0; i < MTECOMP_NUM_CACHES; i++) {
+		size = mte_cache_id_to_size(i);
+		snprintf(name, sizeof(name), "mte-tags-%zu", size);
+		mtecomp_caches[i] =
+			kmem_cache_create(name, size, size, 0, NULL);
+	}
+	return 0;
+}
+module_init(mtecomp_init);
-- 
2.41.0.487.g6d72f3e995-goog


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v4 4/5] arm64: mte: add a test for MTE tags compression
  2023-07-20 17:39 ` Alexander Potapenko
@ 2023-07-20 17:39   ` Alexander Potapenko
  -1 siblings, 0 replies; 61+ messages in thread
From: Alexander Potapenko @ 2023-07-20 17:39 UTC (permalink / raw)
  To: glider, catalin.marinas, will, pcc, andreyknvl,
	andriy.shevchenko, linux, yury.norov
  Cc: linux-kernel, linux-arm-kernel, eugenis, syednwaris, william.gray

Ensure that tag sequences containing alternating values are compressed
to buffers of expected size and correctly decompressed afterwards.

Signed-off-by: Alexander Potapenko <glider@google.com>

---
 v4:
  - addressed comments by Andy Shevchenko:
    - expanded MTE to "Memory Tagging Extension" in Kconfig
    - changed signed variables to unsigned where applicable
    - added missing header dependencies

  - addressed comments by Yury Norov:
    - moved test-only declarations from mtecomp.h into this test
    - switched to the new "mte"-prefixed function names, dropped the
      mentions of "EA0"
    - added test_tag_to_ranges_n()

 v3:
  - addressed comments by Andy Shevchenko in another patch:
   - switched from u64 to unsigned long
   - added MODULE_IMPORT_NS(MTECOMP)
   - fixed includes order
---
 arch/arm64/Kconfig           |  10 ++
 arch/arm64/mm/Makefile       |   1 +
 arch/arm64/mm/test_mtecomp.c | 217 +++++++++++++++++++++++++++++++++++
 3 files changed, 228 insertions(+)
 create mode 100644 arch/arm64/mm/test_mtecomp.c

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 3ac6e302b1509..2432475003054 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -2102,6 +2102,16 @@ config ARM64_MTE_COMP
 
 	  128-byte tag buffers corresponding to 4K pages can be compressed to save heap memory.
 
+config ARM64_MTE_COMP_KUNIT_TEST
+	tristate "Test tag compression for ARM64 Memory Tagging Extension" if !KUNIT_ALL_TESTS
+	default KUNIT_ALL_TESTS
+	depends on KUNIT && ARM64_MTE_COMP
+	help
+	  Test MTE compression algorithm enabled by CONFIG_ARM64_MTE_COMP.
+
+	  Ensure that tag sequences containing alternating values are compressed
+	  to buffers of expected size and correctly decompressed afterwards.
+
 config ARM64_SVE
 	bool "ARM Scalable Vector Extension support"
 	default y
diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
index 46778f6dd83c2..170dc62b010b9 100644
--- a/arch/arm64/mm/Makefile
+++ b/arch/arm64/mm/Makefile
@@ -11,6 +11,7 @@ obj-$(CONFIG_TRANS_TABLE)	+= trans_pgd-asm.o
 obj-$(CONFIG_DEBUG_VIRTUAL)	+= physaddr.o
 obj-$(CONFIG_ARM64_MTE)		+= mteswap.o
 obj-$(CONFIG_ARM64_MTE_COMP)	+= mtecomp.o
+obj-$(CONFIG_ARM64_MTE_COMP_KUNIT_TEST) += test_mtecomp.o
 KASAN_SANITIZE_physaddr.o	+= n
 
 obj-$(CONFIG_KASAN)		+= kasan_init.o
diff --git a/arch/arm64/mm/test_mtecomp.c b/arch/arm64/mm/test_mtecomp.c
new file mode 100644
index 0000000000000..185eb6cb73650
--- /dev/null
+++ b/arch/arm64/mm/test_mtecomp.c
@@ -0,0 +1,217 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Test cases for MTE tags compression algorithm.
+ */
+
+#include <linux/bits.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/string.h>
+#include <kunit/test.h>
+#include <linux/types.h>
+
+#include <asm/mtecomp.h>
+
+/* Functions exported from mtecomp.c for this test. */
+void mte_tags_to_ranges(u8 *tags, u8 *out_tags, unsigned short *out_sizes,
+			size_t *out_len);
+void mte_ranges_to_tags(u8 *r_tags, unsigned short *r_sizes, size_t r_len,
+			u8 *tags);
+
+/*
+ * Test that mte_tags_to_ranges() produces a single range for a zero-filled tag
+ * buffer.
+ */
+static void test_tags_to_ranges_zero(struct kunit *test)
+{
+	u8 tags[128], dtags[128];
+	unsigned short r_sizes[256];
+	size_t r_len = 256;
+	u8 r_tags[256];
+
+	memset(tags, 0, 128);
+	mte_tags_to_ranges(tags, r_tags, r_sizes, &r_len);
+	KUNIT_EXPECT_EQ(test, r_len, 1);
+	KUNIT_EXPECT_EQ(test, r_tags[0], 0);
+	KUNIT_EXPECT_EQ(test, r_sizes[0], 256);
+	mte_ranges_to_tags(r_tags, r_sizes, r_len, dtags);
+	KUNIT_EXPECT_EQ(test, memcmp(tags, dtags, 128), 0);
+}
+
+/*
+ * Test that a small number of different tags is correctly transformed into
+ * ranges.
+ */
+static void test_tags_to_ranges_simple(struct kunit *test)
+{
+	u8 tags[128], dtags[128];
+	const u8 ex_tags[] = { 0xa, 0x0, 0xa, 0xb, 0x0 };
+	const unsigned short ex_sizes[] = { 1, 2, 2, 1, 250 };
+	unsigned short r_sizes[256];
+	size_t r_len = 256;
+	u8 r_tags[256];
+
+	memset(tags, 0, 128);
+	tags[0] = 0xa0;
+	tags[1] = 0x0a;
+	tags[2] = 0xab;
+	mte_tags_to_ranges(tags, r_tags, r_sizes, &r_len);
+	KUNIT_EXPECT_EQ(test, r_len, 5);
+	KUNIT_EXPECT_EQ(test, memcmp(r_tags, ex_tags, sizeof(ex_tags)), 0);
+	KUNIT_EXPECT_EQ(test, memcmp(r_sizes, ex_sizes, sizeof(ex_sizes)), 0);
+	mte_ranges_to_tags(r_tags, r_sizes, r_len, dtags);
+	KUNIT_EXPECT_EQ(test, memcmp(tags, dtags, 128), 0);
+}
+
+/* Test that repeated 0xa0 byte produces 256 ranges of length 1. */
+static void test_tags_to_ranges_repeated(struct kunit *test)
+{
+	u8 tags[128], dtags[128];
+	unsigned short r_sizes[256];
+	size_t r_len = 256;
+	u8 r_tags[256];
+
+	memset(tags, 0xa0, 128);
+	mte_tags_to_ranges(tags, r_tags, r_sizes, &r_len);
+	KUNIT_EXPECT_EQ(test, r_len, 256);
+	mte_ranges_to_tags(r_tags, r_sizes, r_len, dtags);
+	KUNIT_EXPECT_EQ(test, memcmp(tags, dtags, 128), 0);
+}
+
+/* Generate a buffer that will contain @nranges of tag ranges. */
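+/* E.g. nranges == 4 yields tags = {0xab, 0xa0, 0x00, ...}, i.e. the nibble
+ * sequence a, b, a, 0, 0, ..., which contains exactly four tag ranges.
+ */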
+static void gen_tag_range_helper(u8 *tags, int nranges)
+{
+	unsigned int i;
+
+	memset(tags, 0, 128);
+	if (nranges > 1) {
+		nranges--;
+		for (i = 0; i < nranges / 2; i++)
+			tags[i] = 0xab;
+		if (nranges % 2)
+			tags[nranges / 2] = 0xa0;
+	}
+}
+
+/*
+ * Test that mte_tags_to_ranges()/mte_ranges_to_tags() work for various
+ * r_len values.
+ */
+static void test_tag_to_ranges_n(struct kunit *test)
+{
+	unsigned short r_sizes[256];
+	u8 tags[128], dtags[128];
+	unsigned int i, j, sum;
+	size_t r_len = 256;
+	u8 r_tags[256];
+
+	for (i = 1; i <= 256; i++) {
+		gen_tag_range_helper(tags, i);
+		mte_tags_to_ranges(tags, r_tags, r_sizes, &r_len);
+		sum = 0;
+		for (j = 0; j < r_len; j++)
+			sum += r_sizes[j];
+		KUNIT_EXPECT_EQ(test, sum, 256);
+		mte_ranges_to_tags(r_tags, r_sizes, r_len, dtags);
+		KUNIT_EXPECT_EQ(test, memcmp(tags, dtags, 128), 0);
+	}
+}
+
+/* Test that a zero-filled array is compressed into inline storage. */
+static void test_compress_zero(struct kunit *test)
+{
+	u8 tags[128], dtags[128];
+	unsigned long handle;
+
+	memset(tags, 0, 128);
+	handle = mte_compress(tags);
+	KUNIT_EXPECT_EQ(test, handle & BIT_ULL(63), 0);
+	/* Tags are stored inline. */
+	KUNIT_EXPECT_EQ(test, mte_storage_size(handle), 8);
+	KUNIT_EXPECT_TRUE(test, mte_decompress(handle, dtags));
+	KUNIT_EXPECT_EQ(test, memcmp(tags, dtags, 128), 0);
+}
+
+/*
+ * Test that a very small number of tag ranges ends up compressed into 8 bytes.
+ */
+static void test_compress_simple(struct kunit *test)
+{
+	u8 tags[128], dtags[128];
+	unsigned long handle;
+
+	memset(tags, 0, 128);
+	tags[0] = 0xa0;
+	tags[1] = 0x0a;
+	tags[2] = 0xab;
+
+	handle = mte_compress(tags);
+	KUNIT_EXPECT_EQ(test, handle & BIT_ULL(63), 0);
+	/* Tags are stored inline. */
+	KUNIT_EXPECT_EQ(test, mte_storage_size(handle), 8);
+	KUNIT_EXPECT_TRUE(test, mte_decompress(handle, dtags));
+	KUNIT_EXPECT_EQ(test, memcmp(tags, dtags, 128), 0);
+}
+
+/*
+ * Test that a buffer containing @nranges ranges compresses into @exp_size
+ * bytes and decompresses into the original tag sequence.
+ */
+static void compress_range_helper(struct kunit *test, int nranges,
+				  size_t exp_size)
+{
+	u8 tags[128], dtags[128];
+	unsigned long handle;
+
+	gen_tag_range_helper(tags, nranges);
+	handle = mte_compress(tags);
+	KUNIT_EXPECT_EQ(test, handle & BIT_ULL(63), 0);
+	KUNIT_EXPECT_EQ(test, mte_storage_size(handle), exp_size);
+	KUNIT_EXPECT_TRUE(test, mte_decompress(handle, dtags));
+	KUNIT_EXPECT_EQ(test, memcmp(tags, dtags, 128), 0);
+}
+
+/*
+ * Test that every number of tag ranges is correctly compressed and
+ * decompressed.
+ */
+static void test_compress_ranges(struct kunit *test)
+{
+	size_t exp_size;
+	unsigned int i;
+
+	for (i = 1; i <= 256; i++) {
+		if (i < 7)
+			exp_size = 8;
+		else if (i < 12)
+			exp_size = 16;
+		else if (i < 24)
+			exp_size = 32;
+		else if (i < 47)
+			exp_size = 64;
+		else
+			exp_size = 128;
+		compress_range_helper(test, i, exp_size);
+	}
+}
+
+static struct kunit_case mtecomp_test_cases[] = {
+	KUNIT_CASE(test_tags_to_ranges_zero),
+	KUNIT_CASE(test_tags_to_ranges_simple),
+	KUNIT_CASE(test_tags_to_ranges_repeated),
+	KUNIT_CASE(test_tag_to_ranges_n),
+	KUNIT_CASE(test_compress_zero),
+	KUNIT_CASE(test_compress_simple),
+	KUNIT_CASE(test_compress_ranges),
+	{}
+};
+
+static struct kunit_suite mtecomp_test_suite = {
+	.name = "mtecomp",
+	.test_cases = mtecomp_test_cases,
+};
+kunit_test_suites(&mtecomp_test_suite);
+
+MODULE_IMPORT_NS(MTECOMP);
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Alexander Potapenko <glider@google.com>");
-- 
2.41.0.487.g6d72f3e995-goog


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v4 5/5] arm64: mte: add compression support to mteswap.c
  2023-07-20 17:39 ` Alexander Potapenko
@ 2023-07-20 17:39   ` Alexander Potapenko
  -1 siblings, 0 replies; 61+ messages in thread
From: Alexander Potapenko @ 2023-07-20 17:39 UTC (permalink / raw)
  To: glider, catalin.marinas, will, pcc, andreyknvl,
	andriy.shevchenko, linux, yury.norov
  Cc: linux-kernel, linux-arm-kernel, eugenis, syednwaris, william.gray

Define the internal mteswap.h interface:
 - _mte_alloc_and_save_tags()
 - _mte_free_saved_tags()
 - _mte_restore_tags()

, that encapsulates saving tags for a struct page (together with memory
allocation), restoring tags, and deleting the storage allocated for them.

These functions accept opaque pointers, which may point to 128-byte
tag buffers, as well as smaller buffers containing compressed tags, or
have compressed tags stored directly in them.

The existing code from mteswap.c operating with uncompressed tags is split
away into mteswap_nocomp.c, and the newly introduced mteswap_comp.c
provides compression support. The latter implementation
is picked if CONFIG_ARM64_MTE_COMP=y.

Soon after booting Android, tag compression reduces the memory spent by
mteswap.c on tag allocations by ~2.5x. As uptime grows, the savings reach 20x
and more.

Signed-off-by: Alexander Potapenko <glider@google.com>

---
 v4:
  - minor code simplifications suggested by Andy Shevchenko, added
    missing header dependencies
  - changed compression API names to reflect modifications made to
    memcomp.h (as suggested by Yury Norov)

 v3:
  - Addressed comments by Andy Shevchenko in another patch:
   - fixed includes order
   - replaced u64 with unsigned long
   - added MODULE_IMPORT_NS(MTECOMP)
---
 arch/arm64/mm/Makefile         |  5 ++++
 arch/arm64/mm/mteswap.c        | 20 ++++++-------
 arch/arm64/mm/mteswap.h        | 12 ++++++++
 arch/arm64/mm/mteswap_comp.c   | 54 ++++++++++++++++++++++++++++++++++
 arch/arm64/mm/mteswap_nocomp.c | 38 ++++++++++++++++++++++++
 5 files changed, 118 insertions(+), 11 deletions(-)
 create mode 100644 arch/arm64/mm/mteswap.h
 create mode 100644 arch/arm64/mm/mteswap_comp.c
 create mode 100644 arch/arm64/mm/mteswap_nocomp.c

diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
index 170dc62b010b9..46a798e2b67cb 100644
--- a/arch/arm64/mm/Makefile
+++ b/arch/arm64/mm/Makefile
@@ -11,6 +11,11 @@ obj-$(CONFIG_TRANS_TABLE)	+= trans_pgd-asm.o
 obj-$(CONFIG_DEBUG_VIRTUAL)	+= physaddr.o
 obj-$(CONFIG_ARM64_MTE)		+= mteswap.o
 obj-$(CONFIG_ARM64_MTE_COMP)	+= mtecomp.o
+ifdef CONFIG_ARM64_MTE_COMP
+obj-$(CONFIG_ARM64_MTE)		+= mteswap_comp.o
+else
+obj-$(CONFIG_ARM64_MTE)		+= mteswap_nocomp.o
+endif
 obj-$(CONFIG_ARM64_MTE_COMP_KUNIT_TEST) += test_mtecomp.o
 KASAN_SANITIZE_physaddr.o	+= n
 
diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c
index cd508ba80ab1b..9d8f87fd191a2 100644
--- a/arch/arm64/mm/mteswap.c
+++ b/arch/arm64/mm/mteswap.c
@@ -5,8 +5,11 @@
 #include <linux/slab.h>
 #include <linux/swap.h>
 #include <linux/swapops.h>
+
 #include <asm/mte.h>
 
+#include "mteswap.h"
+
 static DEFINE_XARRAY(mte_pages);
 
 void *mte_allocate_tag_storage(void)
@@ -27,20 +30,18 @@ int mte_save_tags(struct page *page)
 	if (!page_mte_tagged(page))
 		return 0;
 
-	tag_storage = mte_allocate_tag_storage();
+	tag_storage = _mte_alloc_and_save_tags(page);
 	if (!tag_storage)
 		return -ENOMEM;
 
-	mte_save_page_tags(page_address(page), tag_storage);
-
 	/* page_private contains the swap entry.val set in do_swap_page */
 	ret = xa_store(&mte_pages, page_private(page), tag_storage, GFP_KERNEL);
 	if (WARN(xa_is_err(ret), "Failed to store MTE tags")) {
-		mte_free_tag_storage(tag_storage);
+		_mte_free_saved_tags(tag_storage);
 		return xa_err(ret);
 	} else if (ret) {
 		/* Entry is being replaced, free the old entry */
-		mte_free_tag_storage(ret);
+		_mte_free_saved_tags(ret);
 	}
 
 	return 0;
@@ -53,10 +54,7 @@ void mte_restore_tags(swp_entry_t entry, struct page *page)
 	if (!tags)
 		return;
 
-	if (try_page_mte_tagging(page)) {
-		mte_restore_page_tags(page_address(page), tags);
-		set_page_mte_tagged(page);
-	}
+	_mte_restore_tags(tags, page);
 }
 
 void mte_invalidate_tags(int type, pgoff_t offset)
@@ -64,7 +62,7 @@ void mte_invalidate_tags(int type, pgoff_t offset)
 	swp_entry_t entry = swp_entry(type, offset);
 	void *tags = xa_erase(&mte_pages, entry.val);
 
-	mte_free_tag_storage(tags);
+	_mte_free_saved_tags(tags);
 }
 
 void mte_invalidate_tags_area(int type)
@@ -78,7 +76,7 @@ void mte_invalidate_tags_area(int type)
 	xa_lock(&mte_pages);
 	xas_for_each(&xa_state, tags, last_entry.val - 1) {
 		__xa_erase(&mte_pages, xa_state.xa_index);
-		mte_free_tag_storage(tags);
+		_mte_free_saved_tags(tags);
 	}
 	xa_unlock(&mte_pages);
 }
diff --git a/arch/arm64/mm/mteswap.h b/arch/arm64/mm/mteswap.h
new file mode 100644
index 0000000000000..4c576b76785d1
--- /dev/null
+++ b/arch/arm64/mm/mteswap.h
@@ -0,0 +1,12 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef ARCH_ARM64_MM_MTESWAP_H_
+#define ARCH_ARM64_MM_MTESWAP_H_
+
+struct page;
+
+void *_mte_alloc_and_save_tags(struct page *page);
+void _mte_free_saved_tags(void *tags);
+void _mte_restore_tags(void *tags, struct page *page);
+
+#endif // ARCH_ARM64_MM_MTESWAP_H_
diff --git a/arch/arm64/mm/mteswap_comp.c b/arch/arm64/mm/mteswap_comp.c
new file mode 100644
index 0000000000000..2c4ac5eac9e59
--- /dev/null
+++ b/arch/arm64/mm/mteswap_comp.c
@@ -0,0 +1,54 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/* MTE tag storage management with compression. */
+
+#include <linux/module.h>
+#include <linux/pagemap.h>
+#include <linux/slab.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
+#include <linux/xarray.h>
+
+#include <asm/mte.h>
+#include <asm/mtecomp.h>
+
+#include "mteswap.h"
+
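+/*
+ * mte_compress() always returns a handle with bit 63 clear, so it can be
+ * stored as an XArray value via xa_mk_value().
+ */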
+void *_mte_alloc_and_save_tags(struct page *page)
+{
+	u8 tags[128];
+	unsigned long handle;
+
+	mte_save_page_tags(page_address(page), tags);
+	handle = mte_compress(tags);
+	return xa_mk_value(handle);
+}
+
+void _mte_free_saved_tags(void *storage)
+{
+	unsigned long handle;
+	int size;
+
+	handle = xa_to_value(storage);
+	if (!handle)
+		return;
+	size = mte_storage_size(handle);
+	mte_release_handle(handle);
+}
+
+void _mte_restore_tags(void *tags, struct page *page)
+{
+	unsigned long handle;
+	u8 tags_decomp[128];
+
+	handle = xa_to_value(tags);
+	if (!handle)
+		return;
+	if (!try_page_mte_tagging(page))
+		return;
+	if (!mte_decompress(handle, tags_decomp))
+		return;
+	mte_restore_page_tags(page_address(page), tags_decomp);
+	set_page_mte_tagged(page);
+}
+MODULE_IMPORT_NS(MTECOMP);
diff --git a/arch/arm64/mm/mteswap_nocomp.c b/arch/arm64/mm/mteswap_nocomp.c
new file mode 100644
index 0000000000000..1e665a4b5f940
--- /dev/null
+++ b/arch/arm64/mm/mteswap_nocomp.c
@@ -0,0 +1,38 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/* MTE tag storage management without compression support. */
+
+#include <linux/pagemap.h>
+#include <linux/slab.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
+#include <linux/xarray.h>
+
+#include <asm/mte.h>
+
+#include "mteswap.h"
+
+void *_mte_alloc_and_save_tags(struct page *page)
+{
+	void *storage;
+
+	storage = mte_allocate_tag_storage();
+	if (!storage)
+		return NULL;
+
+	mte_save_page_tags(page_address(page), storage);
+	return storage;
+}
+
+void _mte_free_saved_tags(void *storage)
+{
+	mte_free_tag_storage(storage);
+}
+
+void _mte_restore_tags(void *tags, struct page *page)
+{
+	if (!try_page_mte_tagging(page))
+		return;
+	mte_restore_page_tags(page_address(page), tags);
+	set_page_mte_tagged(page);
+}
-- 
2.41.0.487.g6d72f3e995-goog


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [PATCH v4 3/5] arm64: mte: implement CONFIG_ARM64_MTE_COMP
  2023-07-20 17:39   ` Alexander Potapenko
@ 2023-07-21 11:22     ` Andy Shevchenko
  -1 siblings, 0 replies; 61+ messages in thread
From: Andy Shevchenko @ 2023-07-21 11:22 UTC (permalink / raw)
  To: Alexander Potapenko
  Cc: catalin.marinas, will, pcc, andreyknvl, linux, yury.norov,
	linux-kernel, linux-arm-kernel, eugenis, syednwaris,
	william.gray

On Thu, Jul 20, 2023 at 07:39:54PM +0200, Alexander Potapenko wrote:
> This config option implements an algorithm that compresses memory tags for
> ARM MTE during swapping.
> 
> The algorithm is based on RLE and specifically targets 128-byte buffers
> of tags corresponding to a single page. In the common case a buffer
> can be compressed into 63 bits, making it possible to store it without
> additional memory allocation.

...

> +Programming Interface
> +=====================
> +
> + .. kernel-doc:: arch/arm64/mm/mtecomp.c

:export:

> +

Is it dangling trailing blank line? Drop it.

...

> +#include <linux/bitmap.h>

> +#include <linux/bitops.h>

This is guaranteed to be included by bitmap.h.

> +#include <linux/export.h>
> +#include <linux/gfp.h>
> +#include <linux/module.h>
> +#include <linux/slab.h>
> +#include <linux/string.h>
> +#include <linux/types.h>

...

> +/*
> + * Sizes of compressed values. These depend on MTE_TAG_SIZE and

of the

> + * MTE_GRANULES_PER_PAGE.
> + */

...

> +	u8 prev_tag = tags[0] / 16; /* First tag in the array. */
> +	unsigned int cur_idx = 0, i, j;
> +	u8 cur_tag;


> +	out_tags[0] = prev_tag;

out_tags[cur_idx] ?

> +	for (i = 0; i < MTE_PAGE_TAG_STORAGE; i++) {
> +		for (j = 0; j < 2; j++) {
> +			cur_tag = j ? (tags[i] % 16) : (tags[i] / 16);
> +			if (cur_tag == prev_tag) {
> +				out_sizes[cur_idx]++;

> +			} else {
> +				cur_idx++;
> +				prev_tag = cur_tag;
> +				out_tags[cur_idx] = prev_tag;
> +				out_sizes[cur_idx] = 1;

Looking more at this I think there is still room for improvement. I can't
come up right now with a proposal (lunch time :-), but I would look into

	do {
		...
	} while (i < MTE_...);

approach.

> +			}
> +		}
> +	}

...

> +static size_t mte_size_to_ranges(size_t size)
> +{
> +	size_t largest_bits;

> +	size_t ret = 0;

Redundant assignment. Please, check again all of them.

> +
> +	largest_bits = (size == 8) ? MTE_BITS_PER_LARGEST_IDX_INLINE :
> +				     MTE_BITS_PER_LARGEST_IDX;
> +	ret = (size * 8 + MTE_BITS_PER_SIZE - largest_bits) /

Hmm... I thought that we moved BYTES_TO_BITS() to the generic header...
Okay, never mind.

> +	      (MTE_BITS_PER_TAG + MTE_BITS_PER_SIZE);
> +	return ret;

	return (...) / ...;

> +}

...

> +static size_t mte_alloc_size(unsigned int num_ranges)
> +{
> +	size_t sizes[4] = { 8, 16, 32, 64 };

Hooray! And now it's not needed anymore...

> +	unsigned int i;
> +
> +	for (i = 0; i < ARRAY_SIZE(sizes); i++) {

...as sizes[i] is equivalent to (8 << i).

> +		if (num_ranges <= mte_size_to_ranges(sizes[i]))
> +			return sizes[i];
> +	}
> +	return 128;
> +}

> +}
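
Spelled out, a sketch of the rewrite being hinted at, assuming
mte_size_to_ranges() keeps the signature quoted above:

static size_t mte_alloc_size(unsigned int num_ranges)
{
	unsigned int i;

	/* (8 << i) walks the sizes 8, 16, 32, 64 without a lookup table. */
	for (i = 0; i < 4; i++) {
		if (num_ranges <= mte_size_to_ranges(8 << i))
			return 8 << i;
	}
	return 128;
}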

...

> +/**
> + * mte_compress() - compress the given tag array.
> + * @tags: 128-byte array to read the tags from.
> + *
> + * Compresses the tags and returns a 64-bit opaque handle pointing to the
> + * tag storage. May allocate memory, which is freed by @mte_release_handle().

Add a blank line here.

> + * Returns: 64-bit tag storage handle.
> + */

...

> +	/*
> +	 * mte_compress_to_buf() only initializes the bits that mte_decompress()
> +	 * will read. But when the tags are stored in the handle itself, it must
> +	 * have all its bits initialized.
> +	 */
> +	unsigned long result = 0;

	// Actually it's interesting how it's supposed to work on 32-bit
	// builds...
	DECLARE_BITMAP(result, BITS_PER_LONG);

and then

	bitmap_zero();

?
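
Spelled out, the suggestion amounts to something like this (the call that
actually fills the bitmap is elided, since its signature is not quoted
here):

	DECLARE_BITMAP(result, BITS_PER_LONG);

	bitmap_zero(result, BITS_PER_LONG);
	/* ... mte_compress_to_buf() then writes into 'result' ... */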

...

> +static unsigned long mte_bitmap_read(const unsigned long *bitmap,
> +				     unsigned long *pos, unsigned long bits)
> +{
> +	unsigned long result;
> +
> +	result = bitmap_read(bitmap, *pos, bits);
> +	*pos += bits;
> +	return result;

	unsigned long start = *pos;

	*pos += bits;
	return bitmap_read(bitmap, start, bits);

> +}
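
i.e., the whole helper with that change applied would read:

static unsigned long mte_bitmap_read(const unsigned long *bitmap,
				     unsigned long *pos, unsigned long bits)
{
	unsigned long start = *pos;

	*pos += bits;
	return bitmap_read(bitmap, start, bits);
}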

...

> +	unsigned short r_sizes[46], sum = 0;

See below.

...

It's cleaner and more robust to have

	sum = 0;

here.

> +	for (i = 0; i < max_ranges; i++) {
> +		if (i == largest_idx)
> +			continue;
> +		r_sizes[i] =
> +			mte_bitmap_read(bitmap, &bit_pos, MTE_BITS_PER_SIZE);
> +		if (!r_sizes[i]) {
> +			max_ranges = i;
> +			break;
> +		}
> +		sum += r_sizes[i];
> +	}
> +	if (sum >= 256)
> +		return false;

-- 
With Best Regards,
Andy Shevchenko




* Re: [PATCH v4 4/5] arm64: mte: add a test for MTE tags compression
  2023-07-20 17:39   ` Alexander Potapenko
@ 2023-07-21 11:25     ` Andy Shevchenko
  -1 siblings, 0 replies; 61+ messages in thread
From: Andy Shevchenko @ 2023-07-21 11:25 UTC (permalink / raw)
  To: Alexander Potapenko
  Cc: catalin.marinas, will, pcc, andreyknvl, linux, yury.norov,
	linux-kernel, linux-arm-kernel, eugenis, syednwaris,
	william.gray

On Thu, Jul 20, 2023 at 07:39:55PM +0200, Alexander Potapenko wrote:
> Ensure that tag sequences containing alternating values are compressed
> to buffers of expected size and correctly decompressed afterwards.

...

> +#include <linux/bits.h>
> +#include <linux/module.h>
> +#include <linux/slab.h>
> +#include <linux/string.h>

> +#include <kunit/test.h>

Keep this in a separate group outside of linux/*.

> +#include <linux/types.h>
> +
> +#include <asm/mtecomp.h>

...

> +/* Functions exported from mtecomp.c for this test. */
> +void mte_tags_to_ranges(u8 *tags, u8 *out_tags, unsigned short *out_sizes,
> +			size_t *out_len);
> +void mte_ranges_to_tags(u8 *r_tags, unsigned short *r_sizes, size_t r_len,
> +			u8 *tags);

This is interesting. Have you run `make W=1` on this code?

-- 
With Best Regards,
Andy Shevchenko




* Re: [PATCH v4 1/5] lib/bitmap: add bitmap_{set,get}_value()
  2023-07-20 17:39   ` Alexander Potapenko
@ 2023-07-23  1:57     ` Yury Norov
  -1 siblings, 0 replies; 61+ messages in thread
From: Yury Norov @ 2023-07-23  1:57 UTC (permalink / raw)
  To: Alexander Potapenko
  Cc: catalin.marinas, will, pcc, andreyknvl, andriy.shevchenko, linux,
	linux-kernel, linux-arm-kernel, eugenis, syednwaris,
	william.gray, Arnd Bergmann

On Thu, Jul 20, 2023 at 07:39:52PM +0200, Alexander Potapenko wrote:
> +/**
> + * bitmap_write - write n-bit value within a memory region
> + * @map: address to the bitmap memory region
> + * @value: value of nbits
> + * @start: bit offset of the n-bit value
> + * @nbits: size of value in bits, up to BITS_PER_LONG
> + */
> +static inline void bitmap_write(unsigned long *map,
> +				unsigned long value,
> +				unsigned long start, unsigned long nbits)
> +{
> +	size_t index = BIT_WORD(start);
> +	unsigned long offset = start % BITS_PER_LONG;
> +	unsigned long space = BITS_PER_LONG - offset;
> +
> +	if (unlikely(!nbits))
> +		return;
> +	value &= GENMASK(nbits - 1, 0);

Strictly speaking, a 'value' shouldn't contain set bits beyond nbits,
because otherwise it's an out-of-bounds type of error.

This is kind of a gray zone to me, because it's the caller's responsibility
to provide correct input. But on the other hand, it would be a great
headache debugging corrupted bitmaps.

Now that we've got a single user of the bitmap_write, and even more,
it's wrapped with a helper, I think it would be reasonable to trim a
'value' in the helper, if needed.

Anyways, the comment must warn about that loudly...

> +	if (space >= nbits) {
> +		map[index] &= ~(GENMASK(nbits - 1, 0) << offset);

'GENMASK(nbits - 1, 0) << offset' looks really silly.

> +		map[index] |= value << offset;

Here you commit 2 reads and 2 writes for a word instead of one.
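
For clarity, a sketch of the single read-modify-write form being suggested
here:

	unsigned long w = map[index];

	w &= ~(GENMASK(nbits - 1, 0) << offset);
	w |= value << offset;
	map[index] = w;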

> +		return;
> +	}
> +	map[index] &= ~BITMAP_FIRST_WORD_MASK(start);

~FIRST_WORD is the same as LAST_WORD. I tried the replacement, and it saves
~25 bytes of .text on x86_64.

> +	map[index] |= value << offset;
> +	map[index + 1] &= ~BITMAP_LAST_WORD_MASK(start + nbits);
> +	map[index + 1] |= (value >> space);
> +}
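
The ~FIRST_WORD/LAST_WORD identity relied on above can be sanity-checked in
isolation. A minimal userspace sketch, with both macros re-declared locally
to mirror their kernel definitions (the identity intentionally excludes a
zero in-word offset, which the cross-word path never hits):

#include <stdio.h>

#define BITS_PER_LONG 64
#define BITMAP_FIRST_WORD_MASK(start) (~0UL << ((start) & (BITS_PER_LONG - 1)))
#define BITMAP_LAST_WORD_MASK(nbits)  (~0UL >> (-(nbits) & (BITS_PER_LONG - 1)))

int main(void)
{
	unsigned long k;

	/* For any nonzero in-word offset k, ~FIRST(k) == LAST(k). */
	for (k = 1; k < BITS_PER_LONG; k++)
		if (~BITMAP_FIRST_WORD_MASK(k) != BITMAP_LAST_WORD_MASK(k))
			printf("mismatch at k=%lu\n", k);

	/* At k == 0 the two sides diverge: ~FIRST(0) is 0, LAST(0) is ~0UL. */
	return 0;
}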

With all that I think the implementation should look something like
this:

        unsigned long w, end;

        if (unlikely(nbits == 0))
                return;

        map += BIT_WORD(start);
        start %= BITS_PER_LONG;
        end = start + nbits - 1;

        w = *map & (end < BITS_PER_LONG ? ~GENMASK(end, start) : BITMAP_LAST_WORD_MASK(start));
        *map = w | (value << start);

        if (end < BITS_PER_LONG)
                return;

        w = *++map & BITMAP_FIRST_WORD_MASK(end + 1);
        *map = w | (value >> (BITS_PER_LONG - start));

It's not tested, except the /s/~FIRST_WORD/LAST_WORD/ part, but I expect
it should be more efficient.

Alexander, can you please try the above and compare?

In the previous iteration, I asked you to share disassembly listings for the
functions. Can you please do that now?

Regarding the rest of the series:
 - I still see Evgenii's name in mtecomp.c, and EA0 references;
 - git-am throws warning about trailing line;
 - checkpatch warns 7 times;

Can you fix all the above before sending the new version?

Have you tested the generic part against BE32, BE64 and LE32 architectures,
and the arch part against BE64? If not, please do.

You mention that the compression ratio is 2 to 20x. Can you
share the absolute numbers? If it's 1k vs 2k, I think most people
just don't care...

Can you share the code that you used to measure the compression ratio?
Would it make sense to export the numbers via sysfs?

Thanks,
Yury


* Re: [PATCH v4 2/5] lib/test_bitmap: add tests for bitmap_{set,get}_value()
  2023-07-20 17:39   ` Alexander Potapenko
@ 2023-07-23  2:29     ` Yury Norov
  -1 siblings, 0 replies; 61+ messages in thread
From: Yury Norov @ 2023-07-23  2:29 UTC (permalink / raw)
  To: Alexander Potapenko
  Cc: catalin.marinas, will, pcc, andreyknvl, andriy.shevchenko, linux,
	linux-kernel, linux-arm-kernel, eugenis, syednwaris,
	william.gray

On Thu, Jul 20, 2023 at 07:39:53PM +0200, Alexander Potapenko wrote:
> Add basic tests ensuring that values can be added at arbitrary positions
> of the bitmap, including those spanning into the adjacent unsigned
> longs.
> 
> Signed-off-by: Alexander Potapenko <glider@google.com>
> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>

[...]

> +/*
> + * Test bitmap should be big enough to include the cases when start is not in
> + * the first word, and start+nbits lands in the following word.
> + */
> +#define TEST_BIT_LEN (BITS_PER_LONG * 3)

Why not just 1000? Is your code safe against unaligned bitmaps?

> +#define TEST_BYTE_LEN (BITS_TO_LONGS(TEST_BIT_LEN) * sizeof(unsigned long))

BITS_TO_BYTES

> +static void __init test_set_get_value(void)

test_bitmap_read_write. Here, and in subjects for #1 and #2.

> +{
> +	DECLARE_BITMAP(bitmap, TEST_BIT_LEN);
> +	DECLARE_BITMAP(exp_bitmap, TEST_BIT_LEN);
> +	/* Prevent constant folding. */
> +	volatile unsigned long zero_bits = 0;

Use READ_ONCE() instead of volatile

> +	unsigned long val, bit;
> +	int i;
> +
> +	/* Setting/getting zero bytes should not crash the kernel. */
> +	bitmap_write(NULL, 0, 0, zero_bits);
> +	val = bitmap_read(NULL, 0, zero_bits);
> +	expect_eq_ulong(0, val);

No, val is undefined.

> +
> +	/*
> +	 * Ensure that bitmap_read() reads the same value that was previously
> +	 * written, and two consequent values are correctly merged.
> +	 * The resulting bit pattern is asymmetric to rule out possible issues
> +	 * with bit numeration order.
> +	 */
> +	for (i = 0; i < TEST_BIT_LEN - 7; i++) {

Can you add some empty lines in the block below for the sake of
readability? Maybe after each expect()?

> +		bitmap_zero(bitmap, TEST_BIT_LEN);
> +		bitmap_write(bitmap, 0b10101UL, i, 5);
> +		val = bitmap_read(bitmap, i, 5);
> +		expect_eq_ulong(0b10101UL, val);
> +		bitmap_write(bitmap, 0b101UL, i + 5, 3);
> +		val = bitmap_read(bitmap, i + 5, 3);
> +		expect_eq_ulong(0b101UL, val);
> +		val = bitmap_read(bitmap, i, 8);
> +		expect_eq_ulong(0b10110101UL, val);
> +	}
> +
> +	/*
> +	 * Check that setting a single bit does not accidentally touch the
> +	 * adjacent bits.
> +	 */
> +	for (i = 0; i < TEST_BIT_LEN; i++) {
> +		/*
> +		 * A 0b10101010 pattern to catch both 0s replaced to 1s and vice
> +		 * versa.
> +		 */
> +		memset(bitmap, 0xaa, TEST_BYTE_LEN);
> +		memset(exp_bitmap, 0xaa, TEST_BYTE_LEN);
> +		for (bit = 0; bit <= 1; bit++) {
> +			bitmap_write(bitmap, bit, i, 1);
> +			__assign_bit(i, exp_bitmap, bit);
> +			expect_eq_bitmap(exp_bitmap, bitmap, TEST_BIT_LEN);
> +		}

I suggested the other test:

val = 0xDEADBEEFUL;
for (nbits = 1; nbits <= BITS_PER_LONG; nbits++)
        for (start = 0; start < 1000; start++) {
                if (start + nbits >= 1000)
                        break;

                v = val & GENMASK(nbits - 1, 0);

                memset(bitmap, 0xaa /* also 0xff and 0x00 */, TEST_BYTE_LEN);
                memset(exp_bitmap, 0xaa, TEST_BYTE_LEN);

                for (n = 0; n < nbits; n++)
                        __assign_bit(start + n, exp_bitmap, v & BIT(n));

                bitmap_write(bitmap, v, start, nbits);
                expect_eq_bitmap(exp_bitmap, bitmap, 1000);

                r = bitmap_read(bitmap, start, nbits);
                expect_eq_ulong(r, v);
        }

> +	}
> +
> +	/* Ensure setting 0 bits does not change anything. */
> +	memset(bitmap, 0xaa, TEST_BYTE_LEN);
> +	memset(exp_bitmap, 0xaa, TEST_BYTE_LEN);
> +	for (i = 0; i < TEST_BIT_LEN; i++) {
> +		bitmap_write(bitmap, ~0UL, i, 0);
> +		expect_eq_bitmap(exp_bitmap, bitmap, TEST_BIT_LEN);
> +	}
> +}
> +#undef TEST_BYTE_LEN
> +#undef TEST_BIT_LEN
> +
>  static void __init selftest(void)
>  {
>  	test_zero_clear();
> @@ -1249,6 +1328,8 @@ static void __init selftest(void)
>  	test_for_each_clear_bitrange_from();
>  	test_for_each_set_clump8();
>  	test_for_each_set_bit_wrap();
> +
> +	test_set_get_value();

This should go at the end of the test_bitmap_* section.

Thanks,
Yury


* Re: [PATCH v4 1/5] lib/bitmap: add bitmap_{set,get}_value()
  2023-07-23  1:57     ` Yury Norov
  (?)
@ 2023-07-23 15:38     ` Yury Norov
  2023-09-22  7:48         ` Alexander Potapenko
  -1 siblings, 1 reply; 61+ messages in thread
From: Yury Norov @ 2023-07-23 15:38 UTC (permalink / raw)
  To: Alexander Potapenko
  Cc: catalin.marinas, will, pcc, andreyknvl, andriy.shevchenko, linux,
	linux-kernel, linux-arm-kernel, eugenis, syednwaris,
	william.gray, Arnd Bergmann

On Sat, Jul 22, 2023 at 06:57:26PM -0700, Yury Norov wrote:
> On Thu, Jul 20, 2023 at 07:39:52PM +0200, Alexander Potapenko wrote:
> > +/**
> > + * bitmap_write - write n-bit value within a memory region
> > + * @map: address to the bitmap memory region
> > + * @value: value of nbits
> > + * @start: bit offset of the n-bit value
> > + * @nbits: size of value in bits, up to BITS_PER_LONG
> > + */
> > +static inline void bitmap_write(unsigned long *map,
> > +				unsigned long value,
> > +				unsigned long start, unsigned long nbits)
> > +{
> > +	size_t index = BIT_WORD(start);
> > +	unsigned long offset = start % BITS_PER_LONG;
> > +	unsigned long space = BITS_PER_LONG - offset;
> > +
> > +	if (unlikely(!nbits))
> > +		return;
> > +	value &= GENMASK(nbits - 1, 0);
> 
> Strictly speaking, a 'value' shouldn't contain set bits beyond nbits
> because otherwise it's an out-of-bonds type of error.
> 
> This is kind of gray zone to me, because it's a caller's responsibility
> to provide correct input. But on the other hand, it would be a great
> headache debugging corrupted bitmaps.
> 
> Now that we've got a single user of the bitmap_write, and even more,
> it's wrapped with a helper, I think it would be reasonable to trim a
> 'value' in the helper, if needed.
> 
> Anyways, the comment must warn about that loudly...

OK. I spent a night with that, and I'm still not sure. Pseudo code
that implements it looks like this:

        for (bit = 0; bit < nbits; bit++)
                assign_bit(start + bit, bitmap, val & BIT(bit));

And it ignores trailing bits. So if we're optimizing this pattern,
we'd ignore these bits just as well...

Either way, whatever we decide, let's be clear about our intentions
and mention explicitly that tail bits either must be zero, or are
ignored.

Alexander, can you add the snippet above to the comments for the
bitmap_write() and bitmap_read(), as well as in the test? Also, if we
decide to clear the tail of the input value, would BITMAP_LAST_WORD_MASK()
generate better code than GENMASK(nbits - 1, 0) does?
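
For reference, the alternative trim being asked about is the one-liner
below; a sketch, valid because the early return already guarantees
1 <= nbits <= BITS_PER_LONG, where the two masks coincide:

	value &= BITMAP_LAST_WORD_MASK(nbits);	/* == GENMASK(nbits - 1, 0) here */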

Comments are very much appreciated.

Thanks,
Yury


* Re: [PATCH v4 1/5] lib/bitmap: add bitmap_{set,get}_value()
  2023-07-23  1:57     ` Yury Norov
  (?)
  (?)
@ 2023-07-24  8:36     ` Andy Shevchenko
  2023-07-25  5:04       ` Yury Norov
  -1 siblings, 1 reply; 61+ messages in thread
From: Andy Shevchenko @ 2023-07-24  8:36 UTC (permalink / raw)
  To: Yury Norov
  Cc: Alexander Potapenko, catalin.marinas, will, pcc, andreyknvl,
	linux, linux-kernel, linux-arm-kernel, eugenis, syednwaris,
	william.gray, Arnd Bergmann

On Sat, Jul 22, 2023 at 06:57:23PM -0700, Yury Norov wrote:
> On Thu, Jul 20, 2023 at 07:39:52PM +0200, Alexander Potapenko wrote:

> > +		map[index] &= ~(GENMASK(nbits - 1, 0) << offset);
> 
> 'GENMASK(nbits - 1, 0) << offset' looks really silly.

But you followed the thread to get a clue why it's written in this form, right?

...

> With all that I think the implementation should look something like
> this:

I would go this way if and only if the code generation on main architectures
with both GCC and clang is better.

And maybe even some performance tests need to be provided.

...

> Alexander, can you please try the above and compare?

> In previous iteration, I asked you to share disassembly listings for the
> functions. Can you please do that now?

Exactly what we need before going with your suggestion(s).

-- 
With Best Regards,
Andy Shevchenko




* Re: [PATCH v4 1/5] lib/bitmap: add bitmap_{set,get}_value()
  2023-07-24  8:36     ` Andy Shevchenko
@ 2023-07-25  5:04       ` Yury Norov
  2023-07-25  9:00         ` Andy Shevchenko
  0 siblings, 1 reply; 61+ messages in thread
From: Yury Norov @ 2023-07-25  5:04 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: Alexander Potapenko, catalin.marinas, will, pcc, andreyknvl,
	linux, linux-kernel, linux-arm-kernel, eugenis, syednwaris,
	william.gray, Arnd Bergmann

On Mon, Jul 24, 2023 at 11:36:36AM +0300, Andy Shevchenko wrote:
> On Sat, Jul 22, 2023 at 06:57:23PM -0700, Yury Norov wrote:
> > On Thu, Jul 20, 2023 at 07:39:52PM +0200, Alexander Potapenko wrote:
> 
> > > +		map[index] &= ~(GENMASK(nbits - 1, 0) << offset);
> > 
> > 'GENMASK(nbits - 1, 0) << offset' looks really silly.
> 
> But you followed the thread to get a clue why it's written in this form, right?

Yes, I did. But I don't expect everyone looking at kernel code to spend
time digging up the discussions that explain why that happened. So, at the
very least, it would be good to add a comment.
 
> ...
> 
> > With all that I think the implementation should look something like
> > this:
> 
> I would go this way if and only if the code generation on main architectures
> with both GCC and clang is better.
> 
> And maybe even some performance tests need to be provided.

For the following implementation:

  void my_bitmap_write(unsigned long *map, unsigned long value,
                                 unsigned long start, unsigned long nbits)
  {
          unsigned long w, end;
  
          if (unlikely(nbits == 0))
                  return;
  
          value &= GENMASK(nbits - 1, 0);
  
          map += BIT_WORD(start);
          start %= BITS_PER_LONG;
          end = start + nbits - 1;
  
          w = *map & (end < BITS_PER_LONG ? ~GENMASK(end, start) : BITMAP_LAST_WORD_MASK(start));
          *map = w | (value << start);
  
          if (end < BITS_PER_LONG)
                  return;
  
          w = *++map & ~BITMAP_LAST_WORD_MASK(end + 1 - BITS_PER_LONG);
          *map = w | (value >> (BITS_PER_LONG - start));
  }

This is the bloat-o-meter output:

$ scripts/bloat-o-meter lib/test_bitmap.o.orig lib/test_bitmap.o
add/remove: 8/0 grow/shrink: 1/0 up/down: 2851/0 (2851)
Function                                     old     new   delta
test_bitmap_init                            3846    5484   +1638
test_bitmap_write_perf                         -     401    +401
bitmap_write                                   -     271    +271
my_bitmap_write                                -     248    +248
bitmap_read                                    -     229    +229
__pfx_test_bitmap_write_perf                   -      16     +16
__pfx_my_bitmap_write                          -      16     +16
__pfx_bitmap_write                             -      16     +16
__pfx_bitmap_read                              -      16     +16
Total: Before=36964, After=39815, chg +7.71%

And this is the performance test:

        for (cnt = 0; cnt < 5; cnt++) {
                time = ktime_get();
                for (nbits = 1; nbits <= BITS_PER_LONG; nbits++) {
                        for (i = 0; i < 1000; i++) {
                                if (i + nbits > 1000)
                                        break;
                                bitmap_write(bmap, val, i, nbits);
                        }
                }
                time = ktime_get() - time;
                pr_err("bitmap_write:\t%llu\t", time);

                time = ktime_get();
                for (nbits = 1; nbits <= BITS_PER_LONG; nbits++) {
                        for (i = 0; i < 1000; i++) {
                                if (i + nbits > 1000)
                                        break;
                                my_bitmap_write(bmap, val, i, nbits);
                        }
                }
                time = ktime_get() - time;
                pr_cont("%llu\n", time);
        }

Which on x86_64/kvm with GCC gives:
                                                Orig    My
[    1.630731] test_bitmap: bitmap_write:	299092	252764
[    1.631584] test_bitmap: bitmap_write:	299522	252554
[    1.632429] test_bitmap: bitmap_write:	299171	258665
[    1.633280] test_bitmap: bitmap_write:	299241	252794
[    1.634133] test_bitmap: bitmap_write:	306716	252934

So, it's ~15% difference in performance and 8% in size.

I don't insist on my implementation, but I think we should experiment more
with code generation.

Thanks,
Yury


* Re: [PATCH v4 1/5] lib/bitmap: add bitmap_{set,get}_value()
  2023-07-25  5:04       ` Yury Norov
@ 2023-07-25  9:00         ` Andy Shevchenko
  0 siblings, 0 replies; 61+ messages in thread
From: Andy Shevchenko @ 2023-07-25  9:00 UTC (permalink / raw)
  To: Yury Norov
  Cc: Alexander Potapenko, catalin.marinas, will, pcc, andreyknvl,
	linux, linux-kernel, linux-arm-kernel, eugenis, syednwaris,
	william.gray, Arnd Bergmann

On Mon, Jul 24, 2023 at 10:04:34PM -0700, Yury Norov wrote:
> On Mon, Jul 24, 2023 at 11:36:36AM +0300, Andy Shevchenko wrote:
> > On Sat, Jul 22, 2023 at 06:57:23PM -0700, Yury Norov wrote:
> > > On Thu, Jul 20, 2023 at 07:39:52PM +0200, Alexander Potapenko wrote:

...

> > > 'GENMASK(nbits - 1, 0) << offset' looks really silly.
> > 
> > But you followed the thread to get a clue why it's written in this form, right?
> 
> Yes, I did. But I don't expect everyone looking at kernel code to spend
> time digging up the discussions that explain why that happened. So, at the
> very least, it would be good to add a comment.

See also below.

...

>           w = *map & (end < BITS_PER_LONG ? ~GENMASK(end, start) : BITMAP_LAST_WORD_MASK(start));

This GENMASK() may generate worse code, as the compiler uses more instructions
than with the simpler approach above.

...

> bitmap_write                                   -     271    +271
> my_bitmap_write                                -     248    +248
> bitmap_read                                    -     229    +229

my_ -- means your proposal? Do you mean yours is better in size than the original one?

-- 
With Best Regards,
Andy Shevchenko




* Re: [PATCH v4 1/5] lib/bitmap: add bitmap_{set,get}_value()
  2023-07-23  1:57     ` Yury Norov
                       ` (2 preceding siblings ...)
  (?)
@ 2023-07-26  8:08     ` Alexander Potapenko
  2023-07-27  0:14         ` Yury Norov
  -1 siblings, 1 reply; 61+ messages in thread
From: Alexander Potapenko @ 2023-07-26  8:08 UTC (permalink / raw)
  To: Yury Norov
  Cc: catalin.marinas, will, pcc, andreyknvl, andriy.shevchenko, linux,
	linux-kernel, linux-arm-kernel, eugenis, syednwaris,
	william.gray, Arnd Bergmann

On Sun, Jul 23, 2023 at 3:57 AM Yury Norov <yury.norov@gmail.com> wrote:
>
> On Thu, Jul 20, 2023 at 07:39:52PM +0200, Alexander Potapenko wrote:
> > +/**
> > + * bitmap_write - write n-bit value within a memory region
> > + * @map: address to the bitmap memory region
> > + * @value: value of nbits
> > + * @start: bit offset of the n-bit value
> > + * @nbits: size of value in bits, up to BITS_PER_LONG
> > + */
> > +static inline void bitmap_write(unsigned long *map,
> > +                             unsigned long value,
> > +                             unsigned long start, unsigned long nbits)
> > +{
> > +     size_t index = BIT_WORD(start);
> > +     unsigned long offset = start % BITS_PER_LONG;
> > +     unsigned long space = BITS_PER_LONG - offset;
> > +
> > +     if (unlikely(!nbits))
> > +             return;
> > +     value &= GENMASK(nbits - 1, 0);
>
> Strictly speaking, a 'value' shouldn't contain set bits beyond nbits
> because otherwise it's an out-of-bonds type of error.

I can easily imagine someone passing -1 (or ~0) as a value, but
wanting to only write n bits of it.

>
> This is kind of gray zone to me, because it's a caller's responsibility
> to provide correct input. But on the other hand, it would be a great
> headache debugging corrupted bitmaps.
>
> Now that we've got a single user of the bitmap_write, and even more,
> it's wrapped with a helper, I think it would be reasonable to trim a
> 'value' in the helper, if needed.
>
> Anyways, the comment must warn about that loudly...
>
> > +     if (space >= nbits) {
> > +             map[index] &= ~(GENMASK(nbits - 1, 0) << offset);
>
> 'GENMASK(nbits - 1, 0) << offset' looks really silly.

As noted in the other patch discussion, pulling offset into GENMASK is
actually not an identity transformation, because nbits + offset may
exceed 64, producing an invalid mask.

>
> > +             map[index] |= value << offset;
>
> Here you commit 2 reads and 2 writes for a word instead of one.

There won't be two reads and two writes.
The compiler will read map[index] into a register, apply the mask, OR
in value << offset, and then perform a single write.
Unless map[] is volatile, repeated reads/writes will be optimized
away by any decent compiler.

>
> > +             return;
> > +     }
> > +     map[index] &= ~BITMAP_FIRST_WORD_MASK(start);
>
> ~FIRST_WORD is the same as LAST WORD. I tried to replace, and it saves
> ~25 bytes of .text on x86_64.
>
> > +     map[index] |= value << offset;
> > +     map[index + 1] &= ~BITMAP_LAST_WORD_MASK(start + nbits);
> > +     map[index + 1] |= (value >> space);
> > +}
>
> With all that I think the implementation should look something like
> this:
>
>         unsigned long w, end;
>
>         if (unlikely(nbits == 0))
>                 return;
>
>         map += BIT_WORD(start);
>         start %= BITS_PER_LONG;
>         end = start + nbits - 1;
>
>         w = *map & (end < BITS_PER_LONG ? ~GENMASK(end, start) : BITMAP_LAST_WORD_MASK(start));
>         *map = w | (value << start);
>
>         if (end < BITS_PER_LONG)
>                 return;
>
>         w = *++map & BITMAP_FIRST_WORD_MASK(end + 1);
>         *map = w | (value >> (BITS_PER_LONG - start));
>
> It's not tested, except the /s/~FIRST_WORD/LAST_WORD/ part, but I expect
> it should be more efficient.
>
> Alexander, can you please try the above and compare?

I like the idea of sharing the first write between the branches, and
it can be made even shorter:

===========================================================
void bitmap_write_new(unsigned long *map, unsigned long value,
                      unsigned long start, unsigned long nbits)
{
        unsigned long offset;
        unsigned long space;
        size_t index;
        bool fit;

        if (unlikely(!nbits))
                return;

        value &= GENMASK(nbits - 1, 0);
        offset = start % BITS_PER_LONG;
        space = BITS_PER_LONG - offset;
        index = BIT_WORD(start);
        fit = space >= nbits;

        map[index] &= (fit ? (~(GENMASK(nbits - 1, 0) << offset)) :
                             ~BITMAP_FIRST_WORD_MASK(start));
        map[index] |= value << offset;
        if (fit)
                return;

        map[index + 1] &= ~BITMAP_LAST_WORD_MASK(start + nbits);
        map[index + 1] |= (value >> space);
}
===========================================================

According to Godbolt (https://godbolt.org/z/n5Te779bf), this function
is 32 bytes shorter than yours under x86 Clang, and 8 bytes shorter under
GCC (which, on the other hand, does a poor job optimizing both).

Overall, given that there's currently a single user of these
functions, isn't it premature to optimize them without knowing
anything about their performance?

> In previous iteration, I asked you to share disassembly listings for the
> functions. Can you please do that now?

Will godbolt work for you (see above)?


>
> Regarding the rest of the series:
>  - I still see Evgenii's name in mtecomp.c, and EA0 references;
Will fix, thanks!

>  - git-am throws warning about trailing line;
Interesting. I generate the patches using `git format-patch`, and I
thought `git am` was its inverse. I'll check.

>  - checkpatch warns 7 times;

Indeed, there were warnings that I ignored (e.g. one of them was
caused by me adding extern symbols to the test module, as requested
during the review process). I think I can fix all of them.

>
> Can you fix all the above before sending the new version?
>
> Have you tested generic part against BE32, BE64 and LE32 architectures;
> and arch part against BE64? If not, please do.

BE64 works; I still need to check the 32-bit versions.
Note that MTE is an ARM64 feature (yet we still need to ensure
bitops.h works on 32 bits).

>
> You're mentioning that the compression ratio is 2 to 20x. Can you
> share the absolute numbers? If it's 1k vs 2k, I think most people
> just don't care...

I'll provide the exact numbers with the next patch series. Last time I
checked, the order of magnitude was tens of megabytes.

> Can you share the code that you used to measure the compression ratio?
> Would it make sense to export the numbers via sysfs?

For out-of-line allocations the data can be derived from
/proc/slabinfo, but we don't calculate inline allocations.
Agreed, a debugfs interface won't hurt.


* Re: [PATCH v4 1/5] lib/bitmap: add bitmap_{set,get}_value()
  2023-07-26  8:08     ` Alexander Potapenko
@ 2023-07-27  0:14         ` Yury Norov
  0 siblings, 0 replies; 61+ messages in thread
From: Yury Norov @ 2023-07-27  0:14 UTC (permalink / raw)
  To: Alexander Potapenko
  Cc: catalin.marinas, will, pcc, andreyknvl, andriy.shevchenko, linux,
	linux-kernel, linux-arm-kernel, eugenis, syednwaris,
	william.gray, Arnd Bergmann

On Wed, Jul 26, 2023 at 10:08:28AM +0200, Alexander Potapenko wrote:
> On Sun, Jul 23, 2023 at 3:57 AM Yury Norov <yury.norov@gmail.com> wrote:
> >
> > On Thu, Jul 20, 2023 at 07:39:52PM +0200, Alexander Potapenko wrote:
> > > +/**
> > > + * bitmap_write - write n-bit value within a memory region
> > > + * @map: address to the bitmap memory region
> > > + * @value: value of nbits
> > > + * @start: bit offset of the n-bit value
> > > + * @nbits: size of value in bits, up to BITS_PER_LONG
> > > + */
> > > +static inline void bitmap_write(unsigned long *map,
> > > +                             unsigned long value,
> > > +                             unsigned long start, unsigned long nbits)
> > > +{
> > > +     size_t index = BIT_WORD(start);
> > > +     unsigned long offset = start % BITS_PER_LONG;
> > > +     unsigned long space = BITS_PER_LONG - offset;
> > > +
> > > +     if (unlikely(!nbits))
> > > +             return;
> > > +     value &= GENMASK(nbits - 1, 0);
> >
> > Strictly speaking, a 'value' shouldn't contain set bits beyond nbits
> > because otherwise it's an out-of-bonds type of error.
> 
> I can easily imagine someone passing -1 (or ~0) as a value, but
> wanting to only write n bits of n.

This is an abuse of the new API, because we've already got bitmap_set(). But
whatever, let's keep that masking.

...
 
> I like the idea of sharing the first write between the branches, and
> it can be made even shorter:
> 
> ===========================================================
> void bitmap_write_new(unsigned long *map, unsigned long value,
>                       unsigned long start, unsigned long nbits)
> {
>         unsigned long offset;
>         unsigned long space;
>         size_t index;
>         bool fit;
> 
>         if (unlikely(!nbits))
>                 return;
> 
>         value &= GENMASK(nbits - 1, 0);
>         offset = start % BITS_PER_LONG;
>         space = BITS_PER_LONG - offset;
>         index = BIT_WORD(start);
>         fit = space >= nbits;

        space >= nbits <=>
        BITS_PER_LONG - offset >= nbits <=> 
        offset + nbits <= BITS_PER_LONG

>         map[index] &= (fit ? (~(GENMASK(nbits - 1, 0) << offset)) :

So here GENMASK(nbits + offset - 1, offset) is at max:
GENMASK(BITS_PER_LONG - 1, offset). And it never overflows, which is my
point. Does it make sense?

> ~BITMAP_FIRST_WORD_MASK(start));

As I said, ~BITMAP_FIRST_WORD_MASK() is the same as BITMAP_LAST_WORD_MASK(),
and vice versa.

>         map[index] |= value << offset;
>         if (fit)
>                 return;
> 
>         map[index + 1] &= ~BITMAP_LAST_WORD_MASK(start + nbits);
>         map[index + 1] |= (value >> space);
> }
> ===========================================================
> 
> According to Godbolt (https://godbolt.org/z/n5Te779bf), this function
> is 32 bytes shorter than yours under x86 Clang, and 8 bytes - under
> GCC (which on the other hand does a poor job optimizing both).
> 
> Overall, given that there's currently a single user of these
> functions, isn't it premature to optimize them without knowing
> anything about their performance?
> 
> > In previous iteration, I asked you to share disassembly listings for the
> > functions. Can you please do that now?
> 
> Will godbolt work for you (see above)?

I don't know for how long an external resource will keep the reference
alive. My SSD keeps emails long enough.

...

> > You're mentioning that the compression ratio is 2 to 20x. Can you
> > share the absolute numbers? If it's 1k vs 2k, I think most people
> > just don't care...
> 
> I'll provide the exact numbers with the next patch series. Last time I
> checked, the order of magnitude was tens of megabytes.

That's impressive. A fruitful idea. It would be important for embedded folks
who may disable MTE because of its memory overhead. I think it's worth
mentioning that in Kconfig, together with the associated performance
overhead, if it's ever measurable.

> > Can you share the code that you used to measure the compression ratio?
> > Would it make sense to export the numbers via sysfs?
> 
> For out-of-line allocations the data can be derived from
> /proc/slabinfo, but we don't calculate inline allocations.
> Agreed, a debugfs interface won't hurt.


* Re: [PATCH v4 1/5] lib/bitmap: add bitmap_{set,get}_value()
  2023-07-27  0:14         ` Yury Norov
@ 2023-08-04 16:07           ` Alexander Potapenko
  -1 siblings, 0 replies; 61+ messages in thread
From: Alexander Potapenko @ 2023-08-04 16:07 UTC (permalink / raw)
  To: Yury Norov
  Cc: catalin.marinas, will, pcc, andreyknvl, andriy.shevchenko, linux,
	linux-kernel, linux-arm-kernel, eugenis, syednwaris,
	william.gray, Arnd Bergmann


>         space >= nbits <=>
>         BITS_PER_LONG - offset >= nbits <=>
>         offset + nbits <= BITS_PER_LONG
>
> >         map[index] &= (fit ? (~(GENMASK(nbits - 1, 0) << offset)) :
>
> So here GENMASK(nbits + offset - 1, offset) is at max:
> GENMASK(BITS_PER_LONG - 1, offset). And it never overflows, which is my
> point. Does it make sense?

It indeed does. Perhaps pulling offset inside GENMASK is not a bug
after all (a simple test does not show any difference in their
behavior).
But `GENMASK(nbits - 1 + offset, offset)` blows up the code (see below).
My guess is that this happens because the compiler fails to reuse the
value of `GENMASK(nbits - 1, 0)` used to clamp the value to write, and
calculates `GENMASK(nbits - 1 + offset, offset)` from scratch.

>
> > ~BITMAP_FIRST_WORD_MASK(start));
>
> As I said, ~BITMAP_FIRST_WORD_MASK() is the same as BITMAP_LAST_WORD_MASK()
> and vise-versa.

Surprisingly, ~BITMAP_FIRST_WORD_MASK() generates better code than
BITMAP_LAST_WORD_MASK().

>
> >         map[index] |= value << offset;
> >         if (fit)
> >                 return;
> >
> >         map[index + 1] &= ~BITMAP_LAST_WORD_MASK(start + nbits);

OTOH I managed to shave three more bytes off by replacing
~BITMAP_LAST_WORD_MASK with a BITMAP_FIRST_WORD_MASK here.

> >         map[index + 1] |= (value >> space);
> > }
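
Spelled out, the tail with that replacement presumably becomes:

	map[index + 1] &= BITMAP_FIRST_WORD_MASK(start + nbits);
	map[index + 1] |= (value >> space);

BITMAP_FIRST_WORD_MASK(start + nbits) keeps exactly the bits that
~BITMAP_LAST_WORD_MASK(start + nbits) keeps, since start + nbits is not a
multiple of BITS_PER_LONG on this path.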

I'll post the implementations together with the disassembly below.
I used a Clang 17.0.0 build that is a couple of months behind
upstream, but it still produces consistently shorter code (~48 bytes
less) than trunk GCC on Godbolt.

1. Original implementation of bitmap_write() from this patch - 164
bytes (interestingly, it's 157 bytes with Clang 14.0.6)

==================================================================
void bitmap_write(unsigned long *map, unsigned long value,
                  unsigned long start, unsigned long nbits)
{
        size_t index = BIT_WORD(start);
        unsigned long offset = start % BITS_PER_LONG;
        unsigned long space = BITS_PER_LONG - offset;

        if (unlikely(!nbits))
                return;
        value &= GENMASK(nbits - 1, 0);
        if (space >= nbits) {
                map[index] &= ~(GENMASK(nbits - 1, 0) << offset);
                map[index] |= value << offset;
                return;
        }
        map[index] &= ~BITMAP_FIRST_WORD_MASK(start);
        map[index] |= value << offset;
        map[index + 1] &= ~BITMAP_LAST_WORD_MASK(start + nbits);
        map[index + 1] |= (value >> space);
}
==================================================================
0000000000001200 <bitmap_write>:
    1200: 41 57                push   %r15
    1202: 41 56                push   %r14
    1204: 53                    push   %rbx
    1205: 48 85 c9              test   %rcx,%rcx
    1208: 74 7b                je     1285 <bitmap_write+0x85>
    120a: 48 89 c8              mov    %rcx,%rax
    120d: 49 89 d2              mov    %rdx,%r10
    1210: 49 c1 ea 06          shr    $0x6,%r10
    1214: 41 89 d1              mov    %edx,%r9d
    1217: f6 d9                neg    %cl
    1219: 49 c7 c7 ff ff ff ff mov    $0xffffffffffffffff,%r15
    1220: 49 d3 ef              shr    %cl,%r15
    1223: 41 83 e1 3f          and    $0x3f,%r9d
    1227: 41 b8 40 00 00 00    mov    $0x40,%r8d
    122d: 4c 21 fe              and    %r15,%rsi
    1230: 48 89 f3              mov    %rsi,%rbx
    1233: 44 89 c9              mov    %r9d,%ecx
    1236: 48 d3 e3              shl    %cl,%rbx
    1239: 4d 29 c8              sub    %r9,%r8
    123c: 49 c7 c3 ff ff ff ff mov    $0xffffffffffffffff,%r11
    1243: 4e 8b 34 d7          mov    (%rdi,%r10,8),%r14
    1247: 49 39 c0              cmp    %rax,%r8
    124a: 73 3f                jae    128b <bitmap_write+0x8b>
    124c: 49 c7 c7 ff ff ff ff mov    $0xffffffffffffffff,%r15
    1253: 44 89 c9              mov    %r9d,%ecx
    1256: 49 d3 e7              shl    %cl,%r15
    1259: 49 f7 d7              not    %r15
    125c: 4d 21 fe              and    %r15,%r14
    125f: 49 09 de              or     %rbx,%r14
    1262: 4e 89 34 d7          mov    %r14,(%rdi,%r10,8)
    1266: 01 c2                add    %eax,%edx
    1268: f6 da                neg    %dl
    126a: 89 d1                mov    %edx,%ecx
    126c: 49 d3 eb              shr    %cl,%r11
    126f: 49 f7 d3              not    %r11
    1272: 44 89 c1              mov    %r8d,%ecx
    1275: 48 d3 ee              shr    %cl,%rsi
    1278: 4e 23 5c d7 08        and    0x8(%rdi,%r10,8),%r11
    127d: 4c 09 de              or     %r11,%rsi
    1280: 4a 89 74 d7 08        mov    %rsi,0x8(%rdi,%r10,8)
    1285: 5b                    pop    %rbx
    1286: 41 5e                pop    %r14
    1288: 41 5f                pop    %r15
    128a: c3                    ret
    128b: 44 89 c9              mov    %r9d,%ecx
    128e: 49 d3 e7              shl    %cl,%r15
    1291: 49 f7 d7              not    %r15
    1294: 4d 21 fe              and    %r15,%r14
    1297: 49 09 de              or     %rbx,%r14
    129a: 4e 89 34 d7          mov    %r14,(%rdi,%r10,8)
    129e: 5b                    pop    %rbx
    129f: 41 5e                pop    %r14
    12a1: 41 5f                pop    %r15
    12a3: c3                    ret
==================================================================

2. Your implementation, my_bitmap_write() - 152 bytes (used to be
slightly worse with Clang 14.0.6 - 159 bytes):

==================================================================
void my_bitmap_write(unsigned long *map, unsigned long value,
                     unsigned long start, unsigned long nbits)
{
        unsigned long w, end;

        if (unlikely(nbits == 0))
                return;

        value &= GENMASK(nbits - 1, 0);

        map += BIT_WORD(start);
        start %= BITS_PER_LONG;
        end = start + nbits - 1;

        w = *map & (end < BITS_PER_LONG ? ~GENMASK(end, start) :
BITMAP_LAST_WORD_MASK(start));
        *map = w | (value << start);

        if (end < BITS_PER_LONG)
                return;

        w = *++map & BITMAP_LAST_WORD_MASK(end + 1 - BITS_PER_LONG);
        *map = w | (value >> (BITS_PER_LONG - start));
}
==================================================================
0000000000001160 <my_bitmap_write>:
    1160: 48 85 c9              test   %rcx,%rcx
    1163: 0f 84 8e 00 00 00    je     11f7 <my_bitmap_write+0x97>
    1169: 49 89 c9              mov    %rcx,%r9
    116c: f6 d9                neg    %cl
    116e: 48 d3 e6              shl    %cl,%rsi
    1171: 48 d3 ee              shr    %cl,%rsi
    1174: 49 89 d2              mov    %rdx,%r10
    1177: 49 c1 ea 06          shr    $0x6,%r10
    117b: 89 d0                mov    %edx,%eax
    117d: 83 e0 3f              and    $0x3f,%eax
    1180: 4e 8d 04 08          lea    (%rax,%r9,1),%r8
    1184: 4a 8d 0c 08          lea    (%rax,%r9,1),%rcx
    1188: 48 ff c9              dec    %rcx
    118b: 4e 8b 0c d7          mov    (%rdi,%r10,8),%r9
    118f: 48 83 f9 3f          cmp    $0x3f,%rcx
    1193: 77 29                ja     11be <my_bitmap_write+0x5e>
    1195: 41 f6 d8              neg    %r8b
    1198: 48 c7 c2 ff ff ff ff mov    $0xffffffffffffffff,%rdx
    119f: 44 89 c1              mov    %r8d,%ecx
    11a2: 48 d3 ea              shr    %cl,%rdx
    11a5: 89 c1                mov    %eax,%ecx
    11a7: 48 d3 ea              shr    %cl,%rdx
    11aa: 48 d3 e2              shl    %cl,%rdx
    11ad: 48 d3 e6              shl    %cl,%rsi
    11b0: 48 f7 d2              not    %rdx
    11b3: 49 21 d1              and    %rdx,%r9
    11b6: 4c 09 ce              or     %r9,%rsi
    11b9: 4a 89 34 d7          mov    %rsi,(%rdi,%r10,8)
    11bd: c3                    ret
    11be: f6 da                neg    %dl
    11c0: 89 d1                mov    %edx,%ecx
    11c2: 49 d3 e1              shl    %cl,%r9
    11c5: 49 d3 e9              shr    %cl,%r9
    11c8: 48 89 f2              mov    %rsi,%rdx
    11cb: 89 c1                mov    %eax,%ecx
    11cd: 48 d3 e2              shl    %cl,%rdx
    11d0: 4c 09 ca              or     %r9,%rdx
    11d3: 4a 89 14 d7          mov    %rdx,(%rdi,%r10,8)
    11d7: 4a 8b 54 d7 08        mov    0x8(%rdi,%r10,8),%rdx
    11dc: 41 f6 d8              neg    %r8b
    11df: 44 89 c1              mov    %r8d,%ecx
    11e2: 48 d3 e2              shl    %cl,%rdx
    11e5: 48 d3 ea              shr    %cl,%rdx
    11e8: f6 d8                neg    %al
    11ea: 89 c1                mov    %eax,%ecx
    11ec: 48 d3 ee              shr    %cl,%rsi
    11ef: 48 09 d6              or     %rdx,%rsi
    11f2: 4a 89 74 d7 08        mov    %rsi,0x8(%rdi,%r10,8)
    11f7: c3                    ret
==================================================================

3. My improved version built on top of yours and mentioned above under
the name bitmap_write_new() - 116 bytes:

==================================================================
void bitmap_write_new(unsigned long *map, unsigned long value,
                      unsigned long start, unsigned long nbits)
{
        unsigned long offset;
        unsigned long space;
        size_t index;
        bool fit;

        if (unlikely(!nbits))
                return;

        value &= GENMASK(nbits - 1, 0);
        offset = start % BITS_PER_LONG;
        space = BITS_PER_LONG - offset;
        index = BIT_WORD(start);
        fit = space >= nbits;

        map[index] &= (fit ? (~(GENMASK(nbits - 1, 0) << offset)) :
~BITMAP_FIRST_WORD_MASK(start));
        map[index] |= value << offset;
        if (fit)
                return;

        map[index + 1] &= BITMAP_FIRST_WORD_MASK(start + nbits);
        map[index + 1] |= (value >> space);
}
==================================================================
00000000000012b0 <bitmap_write_new>:
    12b0: 48 85 c9              test   %rcx,%rcx
    12b3: 74 6e                je     1323 <bitmap_write_new+0x73>
    12b5: 48 89 c8              mov    %rcx,%rax
    12b8: f6 d9                neg    %cl
    12ba: 49 c7 c2 ff ff ff ff mov    $0xffffffffffffffff,%r10
    12c1: 49 d3 ea              shr    %cl,%r10
    12c4: 49 c7 c3 ff ff ff ff mov    $0xffffffffffffffff,%r11
    12cb: 4c 21 d6              and    %r10,%rsi
    12ce: 89 d1                mov    %edx,%ecx
    12d0: 83 e1 3f              and    $0x3f,%ecx
    12d3: 41 b8 40 00 00 00    mov    $0x40,%r8d
    12d9: 49 29 c8              sub    %rcx,%r8
    12dc: 49 89 d1              mov    %rdx,%r9
    12df: 49 c1 e9 06          shr    $0x6,%r9
    12e3: 49 39 c0              cmp    %rax,%r8
    12e6: 4d 0f 42 d3          cmovb  %r11,%r10
    12ea: 49 d3 e2              shl    %cl,%r10
    12ed: 49 f7 d2              not    %r10
    12f0: 4e 23 14 cf          and    (%rdi,%r9,8),%r10
    12f4: 49 89 f3              mov    %rsi,%r11
    12f7: 49 d3 e3              shl    %cl,%r11
    12fa: 4d 09 d3              or     %r10,%r11
    12fd: 4e 89 1c cf          mov    %r11,(%rdi,%r9,8)
    1301: 49 39 c0              cmp    %rax,%r8
    1304: 73 1d                jae    1323 <bitmap_write_new+0x73>
    1306: 01 d0                add    %edx,%eax
    1308: 4a 8b 54 cf 08        mov    0x8(%rdi,%r9,8),%rdx
    130d: 89 c1                mov    %eax,%ecx
    130f: 48 d3 ea              shr    %cl,%rdx
    1312: 48 d3 e2              shl    %cl,%rdx
    1315: 44 89 c1              mov    %r8d,%ecx
    1318: 48 d3 ee              shr    %cl,%rsi
    131b: 48 09 d6              or     %rdx,%rsi
    131e: 4a 89 74 cf 08        mov    %rsi,0x8(%rdi,%r9,8)
    1323: c3                    ret
==================================================================

4. The version of bitmap_write_new() that uses `GENMASK(nbits - 1 +
offset, offset)` - 146 bytes:

==================================================================
void bitmap_write_new_shift(unsigned long *map, unsigned long value,
                      unsigned long start, unsigned long nbits)
{
        unsigned long offset;
        unsigned long space;
        size_t index;
        bool fit;

        if (unlikely(!nbits))
                return;

        value &= GENMASK(nbits - 1, 0);
        offset = start % BITS_PER_LONG;
        space = BITS_PER_LONG - offset;
        index = BIT_WORD(start);
        fit = space >= nbits;

        map[index] &= (fit ? ~(GENMASK(nbits - 1 + offset, offset)) :
~BITMAP_FIRST_WORD_MASK(start));
        map[index] |= value << offset;
        if (fit)
                return;

        map[index + 1] &= BITMAP_FIRST_WORD_MASK(start + nbits);
        map[index + 1] |= (value >> space);
}
==================================================================
0000000000001330 <bitmap_write_new_shift>:
    1330: 48 85 c9              test   %rcx,%rcx
    1333: 74 6a                je     139f <bitmap_write_new_shift+0x6f>
    1335: 48 89 c8              mov    %rcx,%rax
    1338: f6 d9                neg    %cl
    133a: 48 d3 e6              shl    %cl,%rsi
    133d: 48 d3 ee              shr    %cl,%rsi
    1340: 41 89 d0              mov    %edx,%r8d
    1343: 41 83 e0 3f          and    $0x3f,%r8d
    1347: 41 b9 40 00 00 00    mov    $0x40,%r9d
    134d: 4d 29 c1              sub    %r8,%r9
    1350: 49 89 d2              mov    %rdx,%r10
    1353: 49 c1 ea 06          shr    $0x6,%r10
    1357: 49 c7 c3 ff ff ff ff mov    $0xffffffffffffffff,%r11
    135e: 44 89 c1              mov    %r8d,%ecx
    1361: 49 d3 e3              shl    %cl,%r11
    1364: 49 39 c1              cmp    %rax,%r9
    1367: 73 37                jae    13a0 <bitmap_write_new_shift+0x70>
    1369: 53                    push   %rbx
    136a: 49 f7 d3              not    %r11
    136d: 4e 23 1c d7          and    (%rdi,%r10,8),%r11
    1371: 48 89 f3              mov    %rsi,%rbx
    1374: 44 89 c1              mov    %r8d,%ecx
    1377: 48 d3 e3              shl    %cl,%rbx
    137a: 4c 09 db              or     %r11,%rbx
    137d: 4a 89 1c d7          mov    %rbx,(%rdi,%r10,8)
    1381: 01 d0                add    %edx,%eax
    1383: 4a 8b 54 d7 08        mov    0x8(%rdi,%r10,8),%rdx
    1388: 89 c1                mov    %eax,%ecx
    138a: 48 d3 ea              shr    %cl,%rdx
    138d: 48 d3 e2              shl    %cl,%rdx
    1390: 44 89 c9              mov    %r9d,%ecx
    1393: 48 d3 ee              shr    %cl,%rsi
    1396: 48 09 d6              or     %rdx,%rsi
    1399: 4a 89 74 d7 08        mov    %rsi,0x8(%rdi,%r10,8)
    139e: 5b                    pop    %rbx
    139f: c3                    ret
    13a0: 44 01 c0              add    %r8d,%eax
    13a3: f6 d8                neg    %al
    13a5: 89 c1                mov    %eax,%ecx
    13a7: 49 d3 e3              shl    %cl,%r11
    13aa: 49 d3 eb              shr    %cl,%r11
    13ad: 49 f7 d3              not    %r11
    13b0: 44 89 c1              mov    %r8d,%ecx
    13b3: 48 d3 e6              shl    %cl,%rsi
    13b6: 4e 23 1c d7          and    (%rdi,%r10,8),%r11
    13ba: 4c 09 de              or     %r11,%rsi
    13bd: 4a 89 34 d7          mov    %rsi,(%rdi,%r10,8)
    13c1: c3                    ret
==================================================================

For posterity, I am also attaching the C file containing the four
implementations together with some code testing them.

PS. I won't be actively working on the patch series till the end of
August, so feel free to ignore this email until then.

[-- Attachment #2: bitmap_write.c --]
[-- Type: text/x-csrc, Size: 4039 bytes --]

#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <stdbool.h>

#define BITS_PER_LONG 64
#define unlikely(x) x
#define UL(x) (x##UL)
#define GENMASK(h, l) \
	(((~UL(0)) - (UL(1) << (l)) + 1) & \
	 (~UL(0) >> (BITS_PER_LONG - 1 - (h))))

#define BIT_WORD(nr)		((nr) / BITS_PER_LONG)
#define BITMAP_FIRST_WORD_MASK(start) (~0UL << ((start) & (BITS_PER_LONG - 1)))
#define BITMAP_LAST_WORD_MASK(nbits) (~0UL >> (-(nbits) & (BITS_PER_LONG - 1)))


void my_bitmap_write(unsigned long *map, unsigned long value,
		     unsigned long start, unsigned long nbits)
{
	unsigned long w, end;

	if (unlikely(nbits == 0))
		return;

	value &= GENMASK(nbits - 1, 0);

	map += BIT_WORD(start);
	start %= BITS_PER_LONG;
	end = start + nbits - 1;

	w = *map & (end < BITS_PER_LONG ? ~GENMASK(end, start) : BITMAP_LAST_WORD_MASK(start));
	*map = w | (value << start);

	if (end < BITS_PER_LONG)
		return;

	w = *++map & BITMAP_LAST_WORD_MASK(end + 1 - BITS_PER_LONG);
	*map = w | (value >> (BITS_PER_LONG - start));
}

void bitmap_write(unsigned long *map, unsigned long value,
		  unsigned long start, unsigned long nbits)
{
	size_t index = BIT_WORD(start);
	unsigned long offset = start % BITS_PER_LONG;
	unsigned long space = BITS_PER_LONG - offset;

	if (unlikely(!nbits))
		return;
	value &= GENMASK(nbits - 1, 0);
	if (space >= nbits) {
		map[index] &= ~(GENMASK(nbits - 1, 0) << offset);
		map[index] |= value << offset;
		return;
	}
	map[index] &= ~BITMAP_FIRST_WORD_MASK(start);
	map[index] |= value << offset;
	map[index + 1] &= ~BITMAP_LAST_WORD_MASK(start + nbits);
	map[index + 1] |= (value >> space);
}

void bitmap_write_new(unsigned long *map, unsigned long value,
		      unsigned long start, unsigned long nbits)
{
	unsigned long offset;
	unsigned long space;
	size_t index;
	bool fit;

	if (unlikely(!nbits))
		return;

	value &= GENMASK(nbits - 1, 0);
	offset = start % BITS_PER_LONG;
	space = BITS_PER_LONG - offset;
	index =	BIT_WORD(start);
	fit = space >= nbits;

	map[index] &= (fit ? (~(GENMASK(nbits - 1, 0) << offset)) : ~BITMAP_FIRST_WORD_MASK(start));
	map[index] |= value << offset;
	if (fit)
		return;

	map[index + 1] &= BITMAP_FIRST_WORD_MASK(start + nbits);
	map[index + 1] |= (value >> space);
}

void bitmap_write_new_shift(unsigned long *map, unsigned long value,
		      unsigned long start, unsigned long nbits)
{
	unsigned long offset;
	unsigned long space;
	size_t index;
	bool fit;

	if (unlikely(!nbits))
		return;

	value &= GENMASK(nbits - 1, 0);
	offset = start % BITS_PER_LONG;
	space = BITS_PER_LONG - offset;
	index =	BIT_WORD(start);
	fit = space >= nbits;

	map[index] &= (fit ? ~(GENMASK(nbits - 1 + offset, offset)) : ~BITMAP_FIRST_WORD_MASK(start));
	map[index] |= value << offset;
	if (fit)
		return;

	map[index + 1] &= BITMAP_FIRST_WORD_MASK(start + nbits);
	map[index + 1] |= (value >> space);
}

#define MAPSIZE 3

void print_map(unsigned long *map, const char *c)
{
	int i;

	printf("%s: ", c);
	for (i = 0; i < MAPSIZE; i++) {
		printf("%lx ", map[i]);
	}
	printf("\n");
}

#define COMPARE(fn1, fn2)		\
do {					\
	unsigned long one[MAPSIZE], two[MAPSIZE];\
	int res;						\
								\
	memset(one, 0, sizeof(unsigned long)*MAPSIZE);		\
	memset(two, 0, sizeof(unsigned long)*MAPSIZE);		\
	fn1(one, value, start, nbits);				\
	fn2(two, value, start, nbits);				\
	res = memcmp(one, two, sizeof(unsigned long)*MAPSIZE);	\
	if (res) {						\
		printf(#fn1 " vs. " #fn2 ": [%lu, %lu]: %d\n", 	\
			start, nbits, res);			\
		print_map(one, #fn1);				\
		print_map(two, #fn2);				\
	}							\
} while (0)

void test(unsigned long value, unsigned long start, unsigned long nbits)
{
	COMPARE(bitmap_write, bitmap_write_new);
	COMPARE(bitmap_write, my_bitmap_write);
	COMPARE(bitmap_write, bitmap_write_new_shift);
}

int main()
{
	unsigned long value = 0xfafafafafafafafaUL;
	unsigned long start, nbits;

	for (start = 0; start <= (MAPSIZE-1)*BITS_PER_LONG; start++) {
		for (nbits = 0; nbits <= BITS_PER_LONG; nbits++) {
			test(value, start, nbits);
		}
	}
	return 0;
}

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v4 1/5] lib/bitmap: add bitmap_{set,get}_value()
  2023-08-04 16:07           ` Alexander Potapenko
@ 2023-08-04 19:55             ` Yury Norov
  0 siblings, 0 replies; 61+ messages in thread
From: Yury Norov @ 2023-08-04 19:55 UTC (permalink / raw)
  To: Alexander Potapenko
  Cc: catalin.marinas, will, pcc, andreyknvl, andriy.shevchenko, linux,
	linux-kernel, linux-arm-kernel, eugenis, syednwaris,
	william.gray, Arnd Bergmann

On Fri, Aug 04, 2023 at 06:07:00PM +0200, Alexander Potapenko wrote:
> >         space >= nbits <=>
> >         BITS_PER_LONG - offset >= nbits <=>
> >         offset + nbits <= BITS_PER_LONG
> >
> > >         map[index] &= (fit ? (~(GENMASK(nbits - 1, 0) << offset)) :
> >
> > So here GENMASK(nbits + offset - 1, offset) is at max:
> > GENMASK(BITS_PER_LONG - 1, offset). And it never overflows, which is my
> > point. Does it make sense?
> 
> It indeed does. Perhaps pulling offset inside GENMASK is not a bug
> after all (a simple test does not show any difference between their
> behavior.
> But `GENMASK(nbits - 1 + offset, offset)` blows up the code (see below).
> My guess is that this happens because the compiler fails to reuse the
> value of `GENMASK(nbits - 1, 0)` used to clamp the value to write, and
> calculates `GENMASK(nbits - 1 + offset, offset)` from scratch.

OK. Can you put a comment explaining this? Or maybe it would be even
better to use BITMAP_LAST_WORD_MASK() here:

         mask = BITMAP_LAST_WORD_MASK(nbits);
         value &= mask;
         ...
         map[index] &= (fit ? ~(mask << offset) : ~BITMAP_FIRST_WORD_MASK(start));
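
Spelled out as a complete function, the suggestion might look like this
(the name bitmap_write_masked() and the exact shape are illustrative
only, reusing the helpers from the attached bitmap_write.c):

void bitmap_write_masked(unsigned long *map, unsigned long value,
                         unsigned long start, unsigned long nbits)
{
        unsigned long offset, space, mask;
        size_t index;
        bool fit;

        if (unlikely(!nbits))
                return;

        /* BITMAP_LAST_WORD_MASK(nbits) == GENMASK(nbits - 1, 0) for
         * 1 <= nbits <= BITS_PER_LONG, so one mask can both clamp the
         * value and clear the destination bits.
         */
        mask = BITMAP_LAST_WORD_MASK(nbits);
        value &= mask;
        offset = start % BITS_PER_LONG;
        space = BITS_PER_LONG - offset;
        index = BIT_WORD(start);
        fit = space >= nbits;

        map[index] &= (fit ? ~(mask << offset) :
                       ~BITMAP_FIRST_WORD_MASK(start));
        map[index] |= value << offset;
        if (fit)
                return;

        map[index + 1] &= BITMAP_FIRST_WORD_MASK(start + nbits);
        map[index + 1] |= (value >> space);
}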

> > > ~BITMAP_FIRST_WORD_MASK(start));
> >
> > As I said, ~BITMAP_FIRST_WORD_MASK() is the same as BITMAP_LAST_WORD_MASK()
> > and vice versa.
> 
> Surprisingly, ~BITMAP_FIRST_WORD_MASK() generates better code than
> BITMAP_LAST_WORD_MASK().
 
Wow... If that's consistent across different compilers/arches, we'd
just drop the latter. Thanks for pointing that out. I'll check.

> > >         map[index] |= value << offset;
> > >         if (fit)
> > >                 return;
> > >
> > >         map[index + 1] &= ~BITMAP_LAST_WORD_MASK(start + nbits);
> 
> OTOH I managed to shave three more bytes off by replacing
> ~BITMAP_LAST_WORD_MASK with a BITMAP_FIRST_WORD_MASK here.
> 
> > >         map[index + 1] |= (value >> space);
> > > }
> 
> I'll post the implementations together with the disassembly below.
> I used some Clang 17.0.0 version that is a couple months behind
> upstream, but that still produces sustainably shorter code (~48 bytes
> less) than the trunk GCC on Godbolt.
> 
> 1. Original implementation of bitmap_write() from this patch - 164
> bytes (interestingly, it's 157 bytes with Clang 14.0.6)

I spotted that too in some other cases. Newer compilers tend to
generate bigger code, but the result usually runs faster. One
particular reason in my case was loop unrolling.

[...]

> 3. My improved version built on top of yours and mentioned above under
> the name bitmap_write_new() - 116 bytes:

30% better in size - that's impressive!
 
> ==================================================================
> void bitmap_write_new(unsigned long *map, unsigned long value,
>                       unsigned long start, unsigned long nbits)
> {
>         unsigned long offset;
>         unsigned long space;
>         size_t index;
>         bool fit;
> 
>         if (unlikely(!nbits))
>                 return;
> 
>         value &= GENMASK(nbits - 1, 0);
>         offset = start % BITS_PER_LONG;
>         space = BITS_PER_LONG - offset;
>         index = BIT_WORD(start);
>         fit = space >= nbits;
> 
>         map[index] &= (fit ? (~(GENMASK(nbits - 1, 0) << offset)) :
> ~BITMAP_FIRST_WORD_MASK(start));
>         map[index] |= value << offset;
>         if (fit)
>                 return;
> 
>         map[index + 1] &= BITMAP_FIRST_WORD_MASK(start + nbits);
>         map[index + 1] |= (value >> space);
> }

Thanks,
Yury

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v4 1/5] lib/bitmap: add bitmap_{set,get}_value()
  2023-08-04 19:55             ` Yury Norov
@ 2023-08-04 20:05               ` Andy Shevchenko
  0 siblings, 0 replies; 61+ messages in thread
From: Andy Shevchenko @ 2023-08-04 20:05 UTC (permalink / raw)
  To: Yury Norov
  Cc: Alexander Potapenko, catalin.marinas, will, pcc, andreyknvl,
	linux, linux-kernel, linux-arm-kernel, eugenis, syednwaris,
	william.gray, Arnd Bergmann

On Fri, Aug 04, 2023 at 12:55:01PM -0700, Yury Norov wrote:
> On Fri, Aug 04, 2023 at 06:07:00PM +0200, Alexander Potapenko wrote:

> > > > ~BITMAP_FIRST_WORD_MASK(start));
> > >
> > > As I said, ~BITMAP_FIRST_WORD_MASK() is the same as BITMAP_LAST_WORD_MASK()
> > > and vice versa.
> > 
> > Surprisingly, ~BITMAP_FIRST_WORD_MASK() generates better code than
> > BITMAP_LAST_WORD_MASK().
>  
> Wow... If that's consistent across different compilers/arches, we'd
> just drop the latter. Thanks for pointing that. I'll check.

It might be...

...

> >         map[index] &= (fit ? (~(GENMASK(nbits - 1, 0) << offset)) :
> > ~BITMAP_FIRST_WORD_MASK(start));

...that both parts of the ternary are using a negation, so the
compiler may apply the negation only once. (I haven't looked at the
assembly.)
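
Something like this, i.e. picking the mask first and negating it once
(just a sketch of the idea in bitmap_write_new() terms, untested):

	unsigned long clear = fit ? (GENMASK(nbits - 1, 0) << offset) :
				    BITMAP_FIRST_WORD_MASK(start);

	map[index] &= ~clear;
	map[index] |= value << offset;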

-- 
With Best Regards,
Andy Shevchenko



^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v4 3/5] arm64: mte: implement CONFIG_ARM64_MTE_COMP
  2023-07-20 17:39   ` Alexander Potapenko
@ 2023-08-18 17:57     ` Catalin Marinas
  -1 siblings, 0 replies; 61+ messages in thread
From: Catalin Marinas @ 2023-08-18 17:57 UTC (permalink / raw)
  To: Alexander Potapenko
  Cc: will, pcc, andreyknvl, andriy.shevchenko, linux, yury.norov,
	linux-kernel, linux-arm-kernel, eugenis, syednwaris,
	william.gray

Just some random comments; I haven't reviewed the series properly.

On Thu, Jul 20, 2023 at 07:39:54PM +0200, Alexander Potapenko wrote:
> diff --git a/Documentation/arch/arm64/mte-tag-compression.rst b/Documentation/arch/arm64/mte-tag-compression.rst
> new file mode 100644
> index 0000000000000..af6716d53c1a8
> --- /dev/null
> +++ b/Documentation/arch/arm64/mte-tag-compression.rst
> @@ -0,0 +1,212 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +==================================================
> +Tag Compression for Memory Tagging Extension (MTE)
> +==================================================
> +
> +This document describes the algorithm used to compress memory tags used by the
> +ARM Memory Tagging Extension (MTE)
> +
> +Introduction
> +============
> +
> +MTE assigns tags to memory pages: for 4K pages those tags occupy 128 bytes
> +(256 4-bit tags each corresponding to a 16-byte MTE granule). By default, MTE
> +carves out 3.125% (1/32) of the available physical memory to store the tags.
> +
> +When MTE pages are saved to swap, their tags need to be stored in the kernel
> +memory. If the system swap is used heavily, these tags may take a substantial
> +portion of the physical memory, which in the case of a zram-backed swap may
> +even exceed the memory used to store the swapped pages themselves.

Hmm, I'm not sure about this claim ;). Is zram really so good that it
manages a 32x compression ratio (4096/128)?

How much would we save if we only do the compression when it can fit in
63 bits?

> +/**
> + * mte_tags_to_ranges() - break @tags into arrays of tag ranges.
> + * @tags: 128-byte array containing 256 MTE tags.
> + * @out_tags: u8 array to store the tag of every range.
> + * @out_sizes: unsigned short array to store the size of every range.
> + * @out_len: length of @out_tags and @out_sizes (output parameter, initially
> + *           equal to lengths of out_tags[] and out_sizes[]).
> + */
> +void mte_tags_to_ranges(u8 *tags, u8 *out_tags, unsigned short *out_sizes,
> +			size_t *out_len)
> +{
> +	u8 prev_tag = tags[0] / 16; /* First tag in the array. */
> +	unsigned int cur_idx = 0, i, j;
> +	u8 cur_tag;
> +
> +	memset(out_tags, 0, array_size(*out_len, sizeof(*out_tags)));
> +	memset(out_sizes, 0, array_size(*out_len, sizeof(*out_sizes)));
> +
> +	out_tags[0] = prev_tag;
> +	for (i = 0; i < MTE_PAGE_TAG_STORAGE; i++) {
> +		for (j = 0; j < 2; j++) {
> +			cur_tag = j ? (tags[i] % 16) : (tags[i] / 16);
> +			if (cur_tag == prev_tag) {
> +				out_sizes[cur_idx]++;
> +			} else {
> +				cur_idx++;
> +				prev_tag = cur_tag;
> +				out_tags[cur_idx] = prev_tag;
> +				out_sizes[cur_idx] = 1;
> +			}
> +		}
> +	}
> +	*out_len = cur_idx + 1;
> +}
> +EXPORT_SYMBOL_NS(mte_tags_to_ranges, MTECOMP);

What's with the exports here? Are we expecting these functions to be
called from loadable modules?

-- 
Catalin

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v4 5/5] arm64: mte: add compression support to mteswap.c
  2023-07-20 17:39   ` Alexander Potapenko
@ 2023-08-18 18:18     ` Catalin Marinas
  0 siblings, 0 replies; 61+ messages in thread
From: Catalin Marinas @ 2023-08-18 18:18 UTC (permalink / raw)
  To: Alexander Potapenko
  Cc: will, pcc, andreyknvl, andriy.shevchenko, linux, yury.norov,
	linux-kernel, linux-arm-kernel, eugenis, syednwaris,
	william.gray

On Thu, Jul 20, 2023 at 07:39:56PM +0200, Alexander Potapenko wrote:
> Soon after booting Android, tag compression saves ~2.5x memory previously
> spent by mteswap.c on tag allocations. With the growing uptime, the
> savings reach 20x and even more.

This sounds like a marketing claim ;). The way I read your statement is
that the memory used for tag storage is 2.5x less with your patches,
and that's great: it means a 2.5x compression ratio on average. But how
does the compression improve to 20x as uptime grows? The number of tag
storage allocations should be proportional to the number of pages
swapped out (not equal, as not all of them would be mapped as
PROT_MTE). So you can indeed have many more pages swapped out than fit
in RAM, and the tag storage can take up space, but I'm not sure which
numbers you divided to get 20x.

Anyway, it would be nice to see the full picture of what the savings
relative to the total RAM is. Given that the swap in this instance is
zram, you have an upper bound of how many pages it can store. I'm just
trying to assess whether the complexity added here is worth it.

Even if maybe not as good as the RLE algorithm here, I was wondering
whether we could use zswap to save the tags together with the page. I
looked into this some time ago and it seemed slightly easier to do for
zswap than for zram. Another option is to make the swap format more
generic to support metadata storage. Yet another option is for zram to
flag that it can compress the metadata together with the data (no swap
format change needed; when decompressing the page, it would populate
the tags as well).

-- 
Catalin

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v4 5/5] arm64: mte: add compression support to mteswap.c
  2023-08-18 18:18     ` Catalin Marinas
@ 2023-09-20 13:26       ` Alexander Potapenko
  -1 siblings, 0 replies; 61+ messages in thread
From: Alexander Potapenko @ 2023-09-20 13:26 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: will, pcc, andreyknvl, andriy.shevchenko, linux, yury.norov,
	linux-kernel, linux-arm-kernel, eugenis, syednwaris,
	william.gray

On Fri, Aug 18, 2023 at 8:18 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
>
> On Thu, Jul 20, 2023 at 07:39:56PM +0200, Alexander Potapenko wrote:
> > Soon after booting Android, tag compression saves ~2.5x memory previously
> > spent by mteswap.c on tag allocations. With the growing uptime, the
> > savings reach 20x and even more.
>
> This sounds like a marketing claim ;). The way I read your statement is
> that the memory used for tag storage is 2.5x less with your patches and
> that's great. It means a 2.5x compression on average. How does the
> compression get so much better to 20x with more uptime?

I am currently looking at this, and I think this happens because some
userspace apps assign zero tags to userspace pages, and those tags are
effectively zero-cost, because they can be compressed into 8 bytes of
the Xarray pointer.
As the uptime grows, the share of such pages also grows.
I agree though that this is a marginal use case, and the average
compression rate is more representative.

> The number of
> tag storage allocations should be proportional to the pages swapped out
> (not equal as not all would be mapped as PROT_MTE).

We think a reasonable upper bound for PROT_MTE pages is 40%, but
currently it is probably more like 20%.


> So you can indeed
> have a lot more pages swapped out than available in RAM, and the tag
> storage can take up space, but I'm not sure which numbers you divided
> to get 20x.

Right now (in kernel 6.5) the amount of memory spent to store the tags
can be calculated as 128 * (number of mte_allocate_tag_storage() calls
- number of mte_free_tag_storage() calls).
In my patchset I calculate the total amount of memory
allocated/deallocated from the mte-tags-N caches and compare that with
128 * (total number of live objects in those caches).
E.g. the stats after booting up the device (~120s uptime) look as follows:

8 bytes: 14007 allocations, 256 deallocations
16 bytes: 1583 allocations, 179 deallocations
32 bytes: 1634 allocations, 205 deallocations
64 bytes: 1538 allocations, 142 deallocations
128 bytes: 10881 allocations, 1340 deallocations
uncompressed tag storage size: 3522688
compressed tag storage size: 1488792

(note 8-byte allocations contribute to uncompressed storage, but not
to compressed storage).

After running various apps, I made the device use almost 19Mb swap space:
8 bytes: 71352 allocations, 8093 deallocations
16 bytes: 5102 allocations, 2598 deallocations
32 bytes: 8206 allocations, 4536 deallocations
64 bytes: 9489 allocations, 5377 deallocations
128 bytes: 43258 allocations, 23364 deallocations
uncompressed tag storage size: 11960192
compressed tag storage size: 2967104

(Note the share of allocations compressed into 8 bytes is slowly growing)
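
(Dividing the uncompressed size by the compressed one gives
3522688 / 1488792 ~ 2.4x for the first snapshot and
11960192 / 2967104 ~ 4.0x for the second.)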

In this case the compression ratio is 4x. I must admit I could not
reproduce the 20x compression this time; thinking about it some more, I
suspect the absolute values were lower in that case (maybe it was
observed at a point when a lot of uncompressed data had been evicted
from swap).

I therefore think I'd better make a more modest claim in the docs/patch
description.



>
> Anyway, it would be nice to see the full picture of what the savings
> relative to the total RAM is. Given that the swap in this instance is
> zram, you have an upper bound of how many pages it can store. I'm just
> trying to assess whether the complexity added here is worth it.
>
> Maybe not as good as the RLE algorithm here, I was wondering whether we
> could use zswap to save the tags together with the page. I looked some
> time ago at it seemed slightly easier for zswap than zram. Another
> option is to make the swap format more generic to support metadata
> storage. Yet another option is for the zram to flag that it can compress
> the metadata together with the data (no swap format change needed; when
> decompressing the page, it populates the tags as well).
>
> --
> Catalin



--
Alexander Potapenko
Software Engineer

Google Germany GmbH
Erika-Mann-Straße, 33
80636 München

Geschäftsführer: Paul Manicle, Liana Sebastian
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg

* Re: [PATCH v4 5/5] arm64: mte: add compression support to mteswap.c
  2023-08-18 18:18     ` Catalin Marinas
@ 2023-09-20 14:22       ` Alexander Potapenko
  -1 siblings, 0 replies; 61+ messages in thread
From: Alexander Potapenko @ 2023-09-20 14:22 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: will, pcc, andreyknvl, andriy.shevchenko, linux, yury.norov,
	linux-kernel, linux-arm-kernel, eugenis, syednwaris,
	william.gray

>
> Anyway, it would be nice to see the full picture of what the savings
> relative to the total RAM is. Given that the swap in this instance is
> zram, you have an upper bound of how many pages it can store. I'm just
> trying to assess whether the complexity added here is worth it.
Assuming an average compression rate of 2.5x and a 20% share of tagged
pages, we'll be saving 0.375% of the swapped memory:
  20% / 32 * (2.5-1)/2.5
With a compression rate of 4x and 40% of PROT_MTE pages, that would be
0.9375%, which is tens of megabytes (though probably still not much in
relative terms).
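
(The 1/32 factor is the per-page tag overhead: 128 bytes of tags per
4096-byte page, i.e. 1/32 of the page; the (2.5-1)/2.5 factor is the
share of that overhead the compression removes.)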

>
> Though maybe not as good as the RLE algorithm here, I was wondering
> whether we could use zswap to save the tags together with the page. I
> looked at it some time ago and it seemed slightly easier to do for
> zswap than for zram.

Android uses zram, so unfortunately we'll need to handle both.

> Another
> option is to make the swap format more generic to support metadata
> storage. Yet another option is for the zram to flag that it can compress
> the metadata together with the data (no swap format change needed; when
> decompressing the page, it populates the tags as well).

I haven't looked into this, but it sounds promising. We'll have to copy
the page contents to a temporary buffer holding both the page data and
the metadata, but that might be OK.
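
Conceptually, the temporary buffer could be laid out like the following
rough sketch (the struct and its name are hypothetical, not from the
series):

	/* Hypothetical buffer for compressing data and tags together. */
	struct tagged_page_buf {
		u8 data[PAGE_SIZE];		/* page contents */
		u8 tags[PAGE_SIZE / 32];	/* 128 bytes of tags per 4K page */
	};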

Another idea you've mentioned in one of the other patches is to only
compress the data if it fits into 63 bits.
Looks like this alone could yield 2x+ compression for 4K pages, and it
would spare us the kmem cache juggling in this patch series.


> --
> Catalin



--
Alexander Potapenko
Software Engineer

Google Germany GmbH
Erika-Mann-Straße, 33
80636 München

Geschäftsführer: Paul Manicle, Liana Sebastian
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg

* Re: [PATCH v4 5/5] arm64: mte: add compression support to mteswap.c
  2023-09-20 13:26       ` Alexander Potapenko
@ 2023-09-20 16:18         ` Alexander Potapenko
  -1 siblings, 0 replies; 61+ messages in thread
From: Alexander Potapenko @ 2023-09-20 16:18 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: will, pcc, andreyknvl, andriy.shevchenko, linux, yury.norov,
	linux-kernel, linux-arm-kernel, eugenis, syednwaris,
	william.gray

On Wed, Sep 20, 2023 at 3:26 PM Alexander Potapenko <glider@google.com> wrote:
>
> On Fri, Aug 18, 2023 at 8:18 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
> >
> > On Thu, Jul 20, 2023 at 07:39:56PM +0200, Alexander Potapenko wrote:
> > > Soon after booting Android, tag compression saves ~2.5x memory previously
> > > spent by mteswap.c on tag allocations. With the growing uptime, the
> > > savings reach 20x and even more.
> >
> > This sounds like a marketing claim ;). The way I read your statement is
> > that the memory used for tag storage is 2.5x less with your patches and
> > that's great. It means a 2.5x compression on average. How does the
> > compression get so much better to 20x with more uptime?
>
> I am currently looking at this, and I think this happens because some
> userspace apps assign zero tags to userspace pages, and those tags are
> effectively zero-cost, because they can be compressed into 8 bytes of
> the Xarray pointer.
> As the uptime grows, the share of such pages also grows.
> I agree though that this is a marginal use case, and the average
> compression rate is more representative.
>
> > The number of
> > tag storage allocations should be proportional to the pages swapped out
> > (not equal as not all would be mapped as PROT_MTE).
>
> We think a reasonable upper bound for PROT_MTE pages is 40%, but
> currently it is probably more like 20%.
>
>
> > So you can indeed
> > have a lot more pages swapped out than available in RAM, and the tag
> > storage can take up space, but I'm not sure which numbers you divided
> > to get 20x.
>
> Right now (in kernel 6.5) the amount of memory spent to store the tags
> can be calculated as 128 * (number of mte_allocate_tag_storage() calls
> - number of mte_free_tag_storage() calls).
> In my patchset I calculate the total amount of memory
> allocated/deallocated from the mte-tags-N caches and compare that with
> 128 * (total number of live objects in those caches).
> E.g. the stats after booting up the device (~120s uptime) look as follows:
>
> 8 bytes: 14007 allocations, 256 deallocations
> 16 bytes: 1583 allocations, 179 deallocations
> 32 bytes: 1634 allocations, 205 deallocations
> 64 bytes: 1538 allocations, 142 deallocations
> 128 bytes: 10881 allocations, 1340 deallocations
> uncompressed tag storage size: 3522688
> compressed tag storage size: 1488792
>
> (note 8-byte allocations contribute to uncompressed storage, but not
> to compressed storage).
>
> After running various apps, I made the device use almost 19Mb swap space:

Sorry, this is by no means 19Mb

> 8 bytes: 71352 allocations, 8093 deallocations
> 16 bytes: 5102 allocations, 2598 deallocations
> 32 bytes: 8206 allocations, 4536 deallocations
> 64 bytes: 9489 allocations, 5377 deallocations
> 128 bytes: 43258 allocations, 23364 deallocations
> uncompressed tag storage size: 11960192
> compressed tag storage size: 2967104

* Re: [PATCH v4 1/5] lib/bitmap: add bitmap_{set,get}_value()
  2023-07-23  1:57     ` Yury Norov
@ 2023-09-22  7:47       ` Alexander Potapenko
  -1 siblings, 0 replies; 61+ messages in thread
From: Alexander Potapenko @ 2023-09-22  7:47 UTC (permalink / raw)
  To: Yury Norov
  Cc: catalin.marinas, will, pcc, andreyknvl, andriy.shevchenko, linux,
	linux-kernel, linux-arm-kernel, eugenis, syednwaris,
	william.gray, Arnd Bergmann

>
> Regarding the rest of the series:
>  - I still see Evgenii's name in mtecomp.c, and EA0 references;

Double-checked: there are none in v5 (only the Suggested-by: tag).

>  - git-am throws warning about trailing line;

I checked locally that `git am` does not warn about the v5 patches. But
given that the patches are generated with `git format-patch`, I suspect
they get garbled when you download them; could that be the case?

>  - checkpatch warns 7 times;

It now warns 4 times: three warnings are about updating MAINTAINERS (I
don't think there's a need for this), and the last one is about
CONFIG_ARM64_MTE_COMP_KUNIT_TEST not having three lines of description
text in Kconfig.

> Can you fix all the above before sending the new version?

> Have you tested generic part against BE32, BE64 and LE32 architectures;
> and arch part against BE64? If not, please do.

I did now.


> You're mentioning that the compression ratio is 2 to 20x. Can you
> share the absolute numbers? If it's 1k vs 2k, I think most people
> just don't care...

In the other thread I mentioned that although 20x compression is
reachable, it may not lead to practical savings. I reworded the
description, having added the absolute numbers.

> Can you share the code that you used to measure the compression ratio?
> Would it make sense to export the numbers via sysfs?

Done in v5

> Thanks,
> Yury



-- 
Alexander Potapenko
Software Engineer

Google Germany GmbH
Erika-Mann-Straße, 33
80636 München

Geschäftsführer: Paul Manicle, Liana Sebastian
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg

* Re: [PATCH v4 1/5] lib/bitmap: add bitmap_{set,get}_value()
  2023-07-23 15:38     ` Yury Norov
@ 2023-09-22  7:48         ` Alexander Potapenko
  0 siblings, 0 replies; 61+ messages in thread
From: Alexander Potapenko @ 2023-09-22  7:48 UTC (permalink / raw)
  To: Yury Norov
  Cc: catalin.marinas, will, pcc, andreyknvl, andriy.shevchenko, linux,
	linux-kernel, linux-arm-kernel, eugenis, syednwaris,
	william.gray, Arnd Bergmann

> Either way, whatever we decide, let's be clear about our intentions
> and mention explicitly that the tail bits either must be zero, or are
> ignored.
>
> Alexander, can you add the snippet above to the comments for the
> bitmap_write() and bitmap_read(), as well as in the test? Also, if we
> decide to clear tail of the input value, would BITMAP_LAST_WORD_MASK()
> generate better code than GENMASK(nbits - 1, 0) does?

Added the snippet above to bitmap_write(). I think, however, that it is
obvious that bitmap_read() reads only nbits?
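
(Regarding the mask choice: BITMAP_LAST_WORD_MASK(nbits) equals
GENMASK(nbits - 1, 0) for 1 <= nbits <= BITS_PER_LONG, so the two are
interchangeable here; they only differ at nbits == 0, which is
presumably rejected by an early return before any mask is computed.)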

> Comments are very appreciated.
>
> Thanks,
> Yury



-- 
Alexander Potapenko
Software Engineer

Google Germany GmbH
Erika-Mann-Straße, 33
80636 München

Geschäftsführer: Paul Manicle, Liana Sebastian
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg

* Re: [PATCH v4 1/5] lib/bitmap: add bitmap_{set,get}_value()
  2023-08-04 19:55             ` Yury Norov
@ 2023-09-22  7:49               ` Alexander Potapenko
  -1 siblings, 0 replies; 61+ messages in thread
From: Alexander Potapenko @ 2023-09-22  7:49 UTC (permalink / raw)
  To: Yury Norov
  Cc: catalin.marinas, will, pcc, andreyknvl, andriy.shevchenko, linux,
	linux-kernel, linux-arm-kernel, eugenis, syednwaris,
	william.gray, Arnd Bergmann

> OK. Can you put a comment explaining this? Or maybe it would be even
> better to use BITMAP_LAST_WORD_MASK() here:
>
>          mask = BITMAP_LAST_WORD_MASK(nbits);
>          value &= mask;
>          ...
>          map[index] &= (fit ? ~(mask << offset) : ...);

I changed GENMASK to BITMAP_LAST_WORD_MASK here, and I think it's
self-explanatory now, WDYT?

* Re: [PATCH v4 2/5] lib/test_bitmap: add tests for bitmap_{set,get}_value()
  2023-07-23  2:29     ` Yury Norov
@ 2023-09-22  7:57       ` Alexander Potapenko
  -1 siblings, 0 replies; 61+ messages in thread
From: Alexander Potapenko @ 2023-09-22  7:57 UTC (permalink / raw)
  To: Yury Norov
  Cc: catalin.marinas, will, pcc, andreyknvl, andriy.shevchenko, linux,
	linux-kernel, linux-arm-kernel, eugenis, syednwaris,
	william.gray

> > +/*
> > + * Test bitmap should be big enough to include the cases when start is not in
> > + * the first word, and start+nbits lands in the following word.
> > + */
> > +#define TEST_BIT_LEN (BITS_PER_LONG * 3)
>
> Why not just 1000? Is your code safe against unaligned bitmaps?

done

> > +#define TEST_BYTE_LEN (BITS_TO_LONGS(TEST_BIT_LEN) * sizeof(unsigned long))
>
> BITS_TO_BYTES

done

>
> > +static void __init test_set_get_value(void)
>
> test_bitmap_read_write. Here, and in subjects for #1 and #2.

done

> > +{
> > +     DECLARE_BITMAP(bitmap, TEST_BIT_LEN);
> > +     DECLARE_BITMAP(exp_bitmap, TEST_BIT_LEN);
> > +     /* Prevent constant folding. */
> > +     volatile unsigned long zero_bits = 0;
>
> Use READ_ONCE() instead of volatile

done

> > +     unsigned long val, bit;
> > +     int i;
> > +
> > +     /* Setting/getting zero bytes should not crash the kernel. */
> > +     bitmap_write(NULL, 0, 0, zero_bits);
> > +     val = bitmap_read(NULL, 0, zero_bits);
> > +     expect_eq_ulong(0, val);
>
> No, val is undefined.

Why? bitmap_read(..., ..., 0) always returns 0.

>
> > +
> > +     /*
> > +      * Ensure that bitmap_read() reads the same value that was previously
> > +      * written, and two consequent values are correctly merged.
> > +      * The resulting bit pattern is asymmetric to rule out possible issues
> > +      * with bit numeration order.
> > +      */
> > +     for (i = 0; i < TEST_BIT_LEN - 7; i++) {
>
> Can you add some empty lines in the block below in sake of
> readability? Maybe after expect()?

Done

(snip)

> I suggested the other test:
>
> val = DEADBEEF;
> for (nbits = 1; nbits <= BITS_PER_LONG; nbits++)
>         for (start = 0; start < 1000; start++) {
>                 if (start + nbits >= 1000)
>                         break;
>
>                 v = val & GENMASK(nbits - 1, 0);
>
>                 memset(bitmap, 0xaa /* also 0xff and 0x00 */, TEST_BYTE_LEN);
>                 memset(exp_bitmap, 0xaa, TEST_BYTE_LEN);
>
>                 for (n = 0; n < nbits; n++)
>                         __assign_bit(start + n, exp_bitmap, v & BIT(n));
>
>                 bitmap_write(bitmap, v, start, nbits);
>                 expect_eq_bitmap(exp_bitmap, bitmap, 1000);
>
>                 r = bitmap_read(bitmap, start, nbits);
>                 expect_eq(r, v);
>         }

I factored out test cases that can benefit from different background
patterns into a separate function and added yours there. Thanks!

> > @@ -1249,6 +1328,8 @@ static void __init selftest(void)
> >       test_for_each_clear_bitrange_from();
> >       test_for_each_set_clump8();
> >       test_for_each_set_bit_wrap();
> > +
> > +     test_set_get_value();
>
> This should append the test_bitmap_* section

Done

> Thanks,
> Yury

Thanks a lot for your comments!

* Re: [PATCH v4 3/5] arm64: mte: implement CONFIG_ARM64_MTE_COMP
  2023-07-21 11:22     ` Andy Shevchenko
@ 2023-09-22  8:03       ` Alexander Potapenko
  -1 siblings, 0 replies; 61+ messages in thread
From: Alexander Potapenko @ 2023-09-22  8:03 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: catalin.marinas, will, pcc, andreyknvl, linux, yury.norov,
	linux-kernel, linux-arm-kernel, eugenis, syednwaris,
	william.gray

On Fri, Jul 21, 2023 at 1:22 PM Andy Shevchenko
<andriy.shevchenko@linux.intel.com> wrote:
>
> On Thu, Jul 20, 2023 at 07:39:54PM +0200, Alexander Potapenko wrote:
> > The config implements the algorithm compressing memory tags for ARM MTE
> > during swapping.
> >
> > The algorithm is based on RLE and specifically targets 128-byte buffers
> > of tags corresponding to a single page. In the common case a buffer
> > can be compressed into 63 bits, making it possible to store it without
> > additional memory allocation.
>
> ...
>
> > +Programming Interface
> > +=====================
> > +
> > + .. kernel-doc:: arch/arm64/mm/mtecomp.c
>
> :export:

Done

> > +
>
> Is it dangling trailing blank line? Drop it.

Sorry, it's hard to attribute this comment. I am assuming it is
related to Documentation/arch/arm64/mte-tag-compression.rst; done.

> ...
>
> > +#include <linux/bitmap.h>
>
> > +#include <linux/bitops.h>
>
> This is guaranteed to be included by bitmap.h.

I think we'd better stick to IWYU (include-what-you-use) here.
Ingo's patch: https://git.kernel.org/pub/scm/linux/kernel/git/mingo/tip.git/commit/?id=32b1e9e4f5774951a3a80604a39fa1f0674c1833
specifically adds bitmap.h where bits.h is already present, without
removing the latter.
Although there might not be general consensus on this in the kernel
right now, I think Ingo's "Fast Kernel Headers" set out a good
direction.

>
> > +/*
> > + * Sizes of compressed values. These depend on MTE_TAG_SIZE and
>
> of the

This comment is gone now

>
> > +     out_tags[0] = prev_tag;
>
> out_tags[cur_idx] ?

Yeah, looks more readable. Done.

>
> > +     for (i = 0; i < MTE_PAGE_TAG_STORAGE; i++) {
> > +             for (j = 0; j < 2; j++) {
> > +                     cur_tag = j ? (tags[i] % 16) : (tags[i] / 16);
> > +                     if (cur_tag == prev_tag) {
> > +                             out_sizes[cur_idx]++;
>
> > +                     } else {
> > +                             cur_idx++;
> > +                             prev_tag = cur_tag;
> > +                             out_tags[cur_idx] = prev_tag;
> > +                             out_sizes[cur_idx] = 1;
>
> Looking more at this I think there is still a room for improvement. I can't
> come up right now with a proposal (lunch time :-), but I would look into
>
>         do {
>                 ...
>         } while (i < MTE_...);
>
> approach.

We can e.g. get rid of the nested loop and iterate over tags instead
of bytes (see v5).
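
Roughly like this (a sketch of the idea only, not the actual v5 code):

	/*
	 * One pass over all 2 * MTE_PAGE_TAG_STORAGE 4-bit tags; tag i
	 * sits in the high (even i) or low (odd i) nibble of byte i / 2.
	 */
	for (i = 0; i < 2 * MTE_PAGE_TAG_STORAGE; i++) {
		cur_tag = (i % 2) ? (tags[i / 2] % 16) : (tags[i / 2] / 16);
		if (cur_tag == prev_tag) {
			out_sizes[cur_idx]++;
		} else {
			cur_idx++;
			prev_tag = cur_tag;
			out_tags[cur_idx] = prev_tag;
			out_sizes[cur_idx] = 1;
		}
	}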


> > +static size_t mte_size_to_ranges(size_t size)
> > +{
> > +     size_t largest_bits;
>
> > +     size_t ret = 0;
>
> Redundant assignment. Please, check again all of them.

Done.

> > +
> > +     largest_bits = (size == 8) ? MTE_BITS_PER_LARGEST_IDX_INLINE :
> > +                                  MTE_BITS_PER_LARGEST_IDX;
> > +     ret = (size * 8 + MTE_BITS_PER_SIZE - largest_bits) /
>
> Hmm... I thought that we moved BYTES_TO_BITS() to the generic header...
> Okay, never mind.

Ack

> > +           (MTE_BITS_PER_TAG + MTE_BITS_PER_SIZE);
> > +     return ret;
>
>         return (...) / ...;

Done

> > +}
>
> ...
>
> > +static size_t mte_alloc_size(unsigned int num_ranges)
> > +{
> > +     size_t sizes[4] = { 8, 16, 32, 64 };
>
> Hooray! And now it's not needed anymore...
>
> > +     unsigned int i;
> > +
> > +     for (i = 0; i < ARRAY_SIZE(sizes); i++) {
>
> ...as sizes[i] is equivalent of (8 << i).

It's gone now.
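
For reference, roughly what the lookup collapses to (a sketch, assuming
the same mte_size_to_ranges() helper and a 128-byte uncompressed
fallback):

	for (i = 0; i < 4; i++)
		if (mte_size_to_ranges(8 << i) >= num_ranges)
			return 8 << i;
	return 128;	/* assumed fallback: keep the tags uncompressed */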

> ...
>
> > +/**
> > + * mte_compress() - compress the given tag array.
> > + * @tags: 128-byte array to read the tags from.
> > + *
> > + * Compresses the tags and returns a 64-bit opaque handle pointing to the
> > + * tag storage. May allocate memory, which is freed by @mte_release_handle().
>
> + blank line here.

Done (here and in other places in the file), but I'm wondering why
https://docs.kernel.org/doc-guide/kernel-doc.html does not mandate it.

>
> > + * Returns: 64-bit tag storage handle.
> > + */
>
> ...
>
> > +     /*
> > +      * mte_compress_to_buf() only initializes the bits that mte_decompress()
> > +      * will read. But when the tags are stored in the handle itself, it must
> > +      * have all its bits initialized.
> > +      */
> > +     unsigned long result = 0;
>
>         // Actually it's interesting how it's supposed to work on 32-bit
>         // builds...

It is not supposed to work on 32-bit.
First, the code is in arch/arm64 :)
Second, 32-bit CPUs do not support MTE (which reserves the four upper
bits of the address).


>
> > +static unsigned long mte_bitmap_read(const unsigned long *bitmap,
> > +                                  unsigned long *pos, unsigned long bits)
> > +{
> > +     unsigned long result;
> > +
> > +     result = bitmap_read(bitmap, *pos, bits);
> > +     *pos += bits;
> > +     return result;
>
>         unsigned long start = *pos;
>
>         *pos += bits;
>         return bitmap_read(bitmap, start, bits);

Done, thanks!

> > +}
>
> ...
>
> > +     unsigned short r_sizes[46], sum = 0;
>
> See below.
>
> ...
>
> It's cleaner and more robust to have
>
>         sum = 0;
>
> here.

Moved it inside the loop init statement.


> --
> With Best Regards,
> Andy Shevchenko
>

Thank you!

* Re: [PATCH v4 3/5] arm64: mte: implement CONFIG_ARM64_MTE_COMP
  2023-08-18 17:57     ` Catalin Marinas
@ 2023-09-22  8:04       ` Alexander Potapenko
  -1 siblings, 0 replies; 61+ messages in thread
From: Alexander Potapenko @ 2023-09-22  8:04 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: will, pcc, andreyknvl, andriy.shevchenko, linux, yury.norov,
	linux-kernel, linux-arm-kernel, eugenis, syednwaris,
	william.gray

> > +MTE assigns tags to memory pages: for 4K pages those tags occupy 128 bytes
> > +(256 4-bit tags each corresponding to a 16-byte MTE granule). By default, MTE
> > +carves out 3.125% (1/16) of the available physical memory to store the tags.
> > +
> > +When MTE pages are saved to swap, their tags need to be stored in the kernel
> > +memory. If the system swap is used heavily, these tags may take a substantial
> > +portion of the physical memory, which in the case of a zram-backed swap may
> > +even exceed the memory used to store the swapped pages themselves.
>
> Hmm, I'm not sure about this claim ;). Is the zram so good that it
> manages a 32x compression (4096/128)?

For trivial data, like zero pages, this is possible in theory, but I
am not sure whether zram takes advantage of this. I removed the claim
:)

>
> How much would we save if we only do the compression when it can fit in
> 63 bits?

Good question. What I've observed is that soon after boot roughly 50%
of tag buffers can be compressed into 63 bits.


> > +void mte_tags_to_ranges(u8 *tags, u8 *out_tags, unsigned short *out_sizes,
> > +                     size_t *out_len)
> > +{
...
> > +}
> > +EXPORT_SYMBOL_NS(mte_tags_to_ranges, MTECOMP);
>
> What's with the exports here? Are we expecting these functions to be
> called from loadable modules?

Yes, this is for the KUnit test module.
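
The test module then imports the namespace with:

	MODULE_IMPORT_NS(MTECOMP);

which is what EXPORT_SYMBOL_NS() buys over plain EXPORT_SYMBOL(): the
symbols stay unavailable to modules that don't explicitly opt in.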

>
> --
> Catalin



-- 
Alexander Potapenko
Software Engineer

Google Germany GmbH
Erika-Mann-Straße, 33
80636 München

Geschäftsführer: Paul Manicle, Liana Sebastian
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg

* Re: [PATCH v4 4/5] arm64: mte: add a test for MTE tags compression
  2023-07-21 11:25     ` Andy Shevchenko
@ 2023-09-22  8:05       ` Alexander Potapenko
  -1 siblings, 0 replies; 61+ messages in thread
From: Alexander Potapenko @ 2023-09-22  8:05 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: catalin.marinas, will, pcc, andreyknvl, linux, yury.norov,
	linux-kernel, linux-arm-kernel, eugenis, syednwaris,
	william.gray

On Fri, Jul 21, 2023 at 1:25 PM Andy Shevchenko
<andriy.shevchenko@linux.intel.com> wrote:
>
> On Thu, Jul 20, 2023 at 07:39:55PM +0200, Alexander Potapenko wrote:
> > Ensure that tag sequences containing alternating values are compressed
> > to buffers of expected size and correctly decompressed afterwards.
>
> ...
>
> > +#include <linux/bits.h>
> > +#include <linux/module.h>
> > +#include <linux/slab.h>
> > +#include <linux/string.h>
>
> > +#include <kunit/test.h>
>
> Keep this in a separate group outside of linux/*.

Done.

>
> > +#include <linux/types.h>
> > +
> > +#include <asm/mtecomp.h>
>
> ...
>
> > +/* Functions exported from mtecomp.c for this test. */
> > +void mte_tags_to_ranges(u8 *tags, u8 *out_tags, unsigned short *out_sizes,
> > +                     size_t *out_len);
> > +void mte_ranges_to_tags(u8 *r_tags, unsigned short *r_sizes, size_t r_len,
> > +                     u8 *tags);
>
> This is interesting. Have you run `make W=1` on this code?

You are right, this is fishy. I added mtecomp.h and moved the
declarations there instead.
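
Roughly like this (a sketch; the guard and exact layout may differ):

	/* arch/arm64/mm/mtecomp.h */
	#ifndef __MM_MTECOMP_H
	#define __MM_MTECOMP_H

	#include <linux/types.h>

	void mte_tags_to_ranges(u8 *tags, u8 *out_tags,
				unsigned short *out_sizes, size_t *out_len);
	void mte_ranges_to_tags(u8 *r_tags, unsigned short *r_sizes,
				size_t r_len, u8 *tags);

	#endif /* __MM_MTECOMP_H */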

* Re: [PATCH v4 2/5] lib/test_bitmap: add tests for bitmap_{set,get}_value()
  2023-09-22  7:57       ` Alexander Potapenko
@ 2023-09-22 13:28         ` Yury Norov
  -1 siblings, 0 replies; 61+ messages in thread
From: Yury Norov @ 2023-09-22 13:28 UTC (permalink / raw)
  To: Alexander Potapenko
  Cc: catalin.marinas, will, pcc, andreyknvl, andriy.shevchenko, linux,
	linux-kernel, linux-arm-kernel, eugenis, syednwaris,
	william.gray

On Fri, Sep 22, 2023 at 09:57:32AM +0200, Alexander Potapenko wrote:
> > > +     unsigned long val, bit;
> > > +     int i;
> > > +
> > > +     /* Setting/getting zero bytes should not crash the kernel. */
> > > +     bitmap_write(NULL, 0, 0, zero_bits);
> > > +     val = bitmap_read(NULL, 0, zero_bits);
> > > +     expect_eq_ulong(0, val);
> >
> > No, val is undefined.
> 
> Why? bitmap_read(..., ..., 0) always returns 0.

Because it's unexpected and most likely wrong to pass 0 bits. We
guarantee that bitmap_read() will return immediately, and will not
touch the memory. But we don't guarantee that we return any specific
value.

It's not a hot path, at least for now, and we can spend a few extra
cycles to clear the output register and return 0, but the user should
not rely on it in any way, especially in a test that is intended to
show an example of using the new API.

Consider a less relaxed environment, where we really have to count
cycles. In such an environment, we'd return the contents of the 1st
input argument, just because it's already in R0, and the compiler
doesn't have to emit:

    mov r0, #0
    ret
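
So a caller that wants a defined value for the nbits == 0 case has to
produce it itself, e.g.:

    unsigned long val = 0;

    if (nbits) /* bitmap_read(map, start, 0) is a no-op */
            val = bitmap_read(map, start, nbits);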

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v4 2/5] lib/test_bitmap: add tests for bitmap_{set,get}_value()
  2023-09-22 13:28         ` Yury Norov
@ 2023-09-27 12:33           ` Alexander Potapenko
  -1 siblings, 0 replies; 61+ messages in thread
From: Alexander Potapenko @ 2023-09-27 12:33 UTC (permalink / raw)
  To: Yury Norov
  Cc: catalin.marinas, will, pcc, andreyknvl, andriy.shevchenko, linux,
	linux-kernel, linux-arm-kernel, eugenis, syednwaris,
	william.gray

On Fri, Sep 22, 2023 at 3:30 PM Yury Norov <yury.norov@gmail.com> wrote:
>
> On Fri, Sep 22, 2023 at 09:57:32AM +0200, Alexander Potapenko wrote:
> > > > +     unsigned long val, bit;
> > > > +     int i;
> > > > +
> > > > +     /* Setting/getting zero bytes should not crash the kernel. */
> > > > +     bitmap_write(NULL, 0, 0, zero_bits);
> > > > +     val = bitmap_read(NULL, 0, zero_bits);
> > > > +     expect_eq_ulong(0, val);
> > >
> > > No, val is undefined.
> >
> > Why? bitmap_read(..., ..., 0) always returns 0.
>
> Because it's unexpected and most likely wrong to pass 0 bits. We
> guarantee that bitmap_read() will return immediately, and will not
> touch the memory. But we don't guarantee that we return any specific
> value.

Fair enough, I'll remove the expect_eq_ulong() from the test in v6 and
will also add the requirement for nbits to be nonzero to the doc
comments for bitmap_read() and bitmap_write().
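
Roughly as follows (my wording, not necessarily the final patch):

    /**
     * bitmap_read - read a value of n bits from the bitmap
     * @map: address of the bitmap
     * @start: bit offset of the value to read
     * @nbits: size of the value in bits, must be nonzero and
     *         no larger than BITS_PER_LONG
     *
     * Returns: @nbits bits of @map starting at @start. The return
     * value for @nbits == 0 is unspecified.
     */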

> It's not a hot path, at least for now, and we can spend a few extra
> cycles to clear the output register and return 0, but the user should
> not rely on it in any way, especially in a test that is intended to
> show an example of using the new API.

Ok, for now I'll keep the "return 0" part.

> Consider a less relaxed environment, where we really have to count
> cycles. In such an environment, we'd return the contents of the 1st
> input argument, just because it's already in R0, and the compiler
> doesn't have to emit:
>
>     mov r0, #0
>     ret

^ permalink raw reply	[flat|nested] 61+ messages in thread

end of thread, other threads:[~2023-09-27 12:34 UTC | newest]

Thread overview: 61+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-07-20 17:39 [PATCH v4 0/5] Implement MTE tag compression for swapped pages Alexander Potapenko
2023-07-20 17:39 ` [PATCH v4 1/5] lib/bitmap: add bitmap_{set,get}_value() Alexander Potapenko
2023-07-23  1:57   ` Yury Norov
2023-07-23 15:38     ` Yury Norov
2023-09-22  7:48       ` Alexander Potapenko
2023-07-24  8:36     ` Andy Shevchenko
2023-07-25  5:04       ` Yury Norov
2023-07-25  9:00         ` Andy Shevchenko
2023-07-26  8:08     ` Alexander Potapenko
2023-07-27  0:14       ` Yury Norov
2023-08-04 16:07         ` Alexander Potapenko
2023-08-04 19:55           ` Yury Norov
2023-08-04 20:05             ` Andy Shevchenko
2023-09-22  7:49             ` Alexander Potapenko
2023-09-22  7:47     ` Alexander Potapenko
2023-07-20 17:39 ` [PATCH v4 2/5] lib/test_bitmap: add tests for bitmap_{set,get}_value() Alexander Potapenko
2023-07-23  2:29   ` Yury Norov
2023-09-22  7:57     ` Alexander Potapenko
2023-09-22 13:28       ` Yury Norov
2023-09-27 12:33         ` Alexander Potapenko
2023-07-20 17:39 ` [PATCH v4 3/5] arm64: mte: implement CONFIG_ARM64_MTE_COMP Alexander Potapenko
2023-07-21 11:22   ` Andy Shevchenko
2023-09-22  8:03     ` Alexander Potapenko
2023-08-18 17:57   ` Catalin Marinas
2023-09-22  8:04     ` Alexander Potapenko
2023-07-20 17:39 ` [PATCH v4 4/5] arm64: mte: add a test for MTE tags compression Alexander Potapenko
2023-07-21 11:25   ` Andy Shevchenko
2023-09-22  8:05     ` Alexander Potapenko
2023-07-20 17:39 ` [PATCH v4 5/5] arm64: mte: add compression support to mteswap.c Alexander Potapenko
2023-08-18 18:18   ` Catalin Marinas
2023-09-20 13:26     ` Alexander Potapenko
2023-09-20 16:18       ` Alexander Potapenko
2023-09-20 14:22     ` Alexander Potapenko
