* [PATCH v3 0/5] Implement MTE tag compression for swapped pages
@ 2023-07-17 11:37 Alexander Potapenko
  2023-07-17 11:37 ` [PATCH v3 1/5] lib/bitmap: add bitmap_{set,get}_value() Alexander Potapenko
                   ` (4 more replies)
  0 siblings, 5 replies; 37+ messages in thread
From: Alexander Potapenko @ 2023-07-17 11:37 UTC (permalink / raw)
  To: glider, catalin.marinas, will, pcc, andreyknvl,
	andriy.shevchenko, linux, yury.norov
  Cc: linux-kernel, linux-arm-kernel, eugenis, syednwaris, william.gray

Currently, when MTE pages are swapped out, the tags are kept in memory,
occupying 128 bytes per page. This is especially problematic for devices
that use zram-backed in-memory swap, because tags stored uncompressed in
the heap effectively reduce the amount of memory available for swap.

The RLE-based EA0 algorithm suggested by Evgenii Stepanov and
implemented in this patch series efficiently compresses 128-byte tag
buffers, achieving a practical compression ratio between 2.5x and 20x.
In most cases the compressed data fits in a 63-bit Xarray value, so no
extra memory allocation is needed.
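
(For the inline case the bit budget works out as follows: a 4-bit index
of the largest range, six 4-bit tag values and five 7-bit range sizes,
i.e. 4 + 6*4 + 5*7 = 63 bits, leaving bit 63 of the Xarray value free.)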

Our measurements show that EA0 provides better compression than existing
kernel compression algorithms (LZ4, LZO, LZ4HC, ZSTD) can offer, because
EA0 specifically targets 128-byte buffers.

To implement compression/decompression, we also extend <linux/bitmap.h>
with helpers to set/get bit values at arbitrary positions in the map.
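
For illustration, a minimal usage sketch of the new helpers (not part of
the series; names as introduced in patch 1):

	DECLARE_BITMAP(map, 128);
	unsigned long val;

	bitmap_zero(map, 128);
	/* Store a 7-bit value at bit offset 60, spanning two words. */
	bitmap_set_value(map, 0x5a, 60, 7);
	val = bitmap_get_value(map, 60, 7);	/* val == 0x5a */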

We refactor arch/arm64/mm/mteswap.c to support both the compressed
(CONFIG_ARM64_MTE_COMP) and non-compressed case. For the former, in
addition to tag compression, we move tag allocation from kmalloc() to
separate kmem caches, providing greater locality and relaxing the
alignment requirements.

v3:
 - as suggested by Andy Shevchenko, use
   bitmap_get_value()/bitmap_set_value() written by Syed Nayyar Waris
 - switched to unsigned long to reduce typecasts
 - simplified the compression code

v2:
 - as suggested by Yury Norov, replace the poorly implemented struct
   bitq with <linux/bitmap.h>

Alexander Potapenko (5):
  lib/bitmap: add bitmap_{set,get}_value()
  lib/test_bitmap: add tests for bitmap_{set,get}_value()
  arm64: mte: implement CONFIG_ARM64_MTE_COMP
  arm64: mte: add a test for MTE tags compression
  arm64: mte: add compression support to mteswap.c

 arch/arm64/Kconfig               |  20 ++
 arch/arm64/include/asm/mtecomp.h |  60 +++++
 arch/arm64/mm/Makefile           |   7 +
 arch/arm64/mm/mtecomp.c          | 406 +++++++++++++++++++++++++++++++
 arch/arm64/mm/mteswap.c          |  20 +-
 arch/arm64/mm/mteswap.h          |  12 +
 arch/arm64/mm/mteswap_comp.c     |  52 ++++
 arch/arm64/mm/mteswap_nocomp.c   |  38 +++
 arch/arm64/mm/test_mtecomp.c     | 177 ++++++++++++++
 include/linux/bitmap.h           |  57 +++++
 lib/test_bitmap.c                |  33 +++
 11 files changed, 871 insertions(+), 11 deletions(-)
 create mode 100644 arch/arm64/include/asm/mtecomp.h
 create mode 100644 arch/arm64/mm/mtecomp.c
 create mode 100644 arch/arm64/mm/mteswap.h
 create mode 100644 arch/arm64/mm/mteswap_comp.c
 create mode 100644 arch/arm64/mm/mteswap_nocomp.c
 create mode 100644 arch/arm64/mm/test_mtecomp.c

-- 
2.41.0.255.g8b1d071c50-goog



* [PATCH v3 1/5] lib/bitmap: add bitmap_{set,get}_value()
  2023-07-17 11:37 [PATCH v3 0/5] Implement MTE tag compression for swapped pages Alexander Potapenko
@ 2023-07-17 11:37 ` Alexander Potapenko
  2023-07-17 13:01   ` Andy Shevchenko
  2023-07-17 15:50   ` Yury Norov
  2023-07-17 11:37 ` [PATCH v3 2/5] lib/test_bitmap: add tests for bitmap_{set,get}_value() Alexander Potapenko
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 37+ messages in thread
From: Alexander Potapenko @ 2023-07-17 11:37 UTC (permalink / raw)
  To: glider, catalin.marinas, will, pcc, andreyknvl,
	andriy.shevchenko, linux, yury.norov
  Cc: linux-kernel, linux-arm-kernel, eugenis, syednwaris,
	william.gray, Arnd Bergmann

The two new functions allow setting/getting values of up to
BITS_PER_LONG bits at an arbitrary position in the bitmap.

The code was taken from "bitops: Introduce the for_each_set_clump macro"
by Syed Nayyar Waris with a couple of minor changes:
 - instead of using roundup(), which adds an unnecessary dependency
   on <linux/math.h>, we calculate space as BITS_PER_LONG-offset;
 - indentation is reduced by not using else-clauses (suggested by
   checkpatch for bitmap_get_value())
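
To illustrate the two-word case (assuming BITS_PER_LONG == 64): writing
a 5-bit value at bit offset 62 leaves space == 2, so bits 1..0 of the
value land in bits 63..62 of map[0], and bits 4..2 land in bits 2..0 of
map[1] (here 'map' is a zero-initialized bitmap of at least two words):

	bitmap_set_value(map, 0b10101, 62, 5);
	/* map[0] bits 63..62 == 0b01, map[1] bits 2..0 == 0b101 */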

Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Syed Nayyar Waris <syednwaris@gmail.com>
Signed-off-by: William Breathitt Gray <william.gray@linaro.org>
Link: https://lore.kernel.org/lkml/fe12eedf3666f4af5138de0e70b67a07c7f40338.1592224129.git.syednwaris@gmail.com/
Suggested-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Alexander Potapenko <glider@google.com>
---
 include/linux/bitmap.h | 57 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 57 insertions(+)

diff --git a/include/linux/bitmap.h b/include/linux/bitmap.h
index 03644237e1efb..4559366084988 100644
--- a/include/linux/bitmap.h
+++ b/include/linux/bitmap.h
@@ -76,7 +76,11 @@ struct device;
  *  bitmap_to_arr32(buf, src, nbits)            Copy nbits from buf to u32[] dst
  *  bitmap_to_arr64(buf, src, nbits)            Copy nbits from buf to u64[] dst
  *  bitmap_get_value8(map, start)               Get 8bit value from map at start
+ *  bitmap_get_value(map, start, nbits)         Get bit value of size 'nbits'
+ *                                              from map at start
  *  bitmap_set_value8(map, value, start)        Set 8bit value to map at start
+ *  bitmap_set_value(map, value, start, nbits)  Set bit value of size 'nbits'
+ *                                              of map at start
  *
  * Note, bitmap_zero() and bitmap_fill() operate over the region of
  * unsigned longs, that is, bits behind bitmap till the unsigned long
@@ -583,6 +587,31 @@ static inline unsigned long bitmap_get_value8(const unsigned long *map,
 	return (map[index] >> offset) & 0xFF;
 }
 
+/**
+ * bitmap_get_value - get a value of n-bits from the memory region
+ * @map: address to the bitmap memory region
+ * @start: bit offset of the n-bit value
+ * @nbits: size of value in bits
+ *
+ * Returns value of nbits located at the @start bit offset within the @map
+ * memory region.
+ */
+static inline unsigned long bitmap_get_value(const unsigned long *map,
+					     unsigned long start,
+					     unsigned long nbits)
+{
+	const size_t index = BIT_WORD(start);
+	const unsigned long offset = start % BITS_PER_LONG;
+	const unsigned long space = BITS_PER_LONG - offset;
+	unsigned long value_low, value_high;
+
+	if (space >= nbits)
+		return (map[index] >> offset) & GENMASK(nbits - 1, 0);
+	value_low = map[index] & BITMAP_FIRST_WORD_MASK(start);
+	value_high = map[index + 1] & BITMAP_LAST_WORD_MASK(start + nbits);
+	return (value_low >> offset) | (value_high << space);
+}
+
 /**
  * bitmap_set_value8 - set an 8-bit value within a memory region
  * @map: address to the bitmap memory region
@@ -599,6 +628,34 @@ static inline void bitmap_set_value8(unsigned long *map, unsigned long value,
 	map[index] |= value << offset;
 }
 
+/**
+ * bitmap_set_value - set n-bit value within a memory region
+ * @map: address to the bitmap memory region
+ * @value: value of nbits
+ * @start: bit offset of the n-bit value
+ * @nbits: size of value in bits
+ */
+static inline void bitmap_set_value(unsigned long *map,
+				    unsigned long value,
+				    unsigned long start, unsigned long nbits)
+{
+	const size_t index = BIT_WORD(start);
+	const unsigned long offset = start % BITS_PER_LONG;
+	const unsigned long space = BITS_PER_LONG - offset;
+
+	value &= GENMASK(nbits - 1, 0);
+
+	if (space >= nbits) {
+		map[index] &= ~(GENMASK(nbits + offset - 1, offset));
+		map[index] |= value << offset;
+		return;
+	}
+	map[index] &= ~BITMAP_FIRST_WORD_MASK(start);
+	map[index] |= value << offset;
+	map[index + 1] &= ~BITMAP_LAST_WORD_MASK(start + nbits);
+	map[index + 1] |= (value >> space);
+}
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* __LINUX_BITMAP_H */
-- 
2.41.0.255.g8b1d071c50-goog



* [PATCH v3 2/5] lib/test_bitmap: add tests for bitmap_{set,get}_value()
  2023-07-17 11:37 [PATCH v3 0/5] Implement MTE tag compression for swapped pages Alexander Potapenko
  2023-07-17 11:37 ` [PATCH v3 1/5] lib/bitmap: add bitmap_{set,get}_value() Alexander Potapenko
@ 2023-07-17 11:37 ` Alexander Potapenko
  2023-07-17 13:04   ` Andy Shevchenko
  2023-07-17 16:11   ` Yury Norov
  2023-07-17 11:37 ` [PATCH v3 3/5] arm64: mte: implement CONFIG_ARM64_MTE_COMP Alexander Potapenko
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 37+ messages in thread
From: Alexander Potapenko @ 2023-07-17 11:37 UTC (permalink / raw)
  To: glider, catalin.marinas, will, pcc, andreyknvl,
	andriy.shevchenko, linux, yury.norov
  Cc: linux-kernel, linux-arm-kernel, eugenis, syednwaris, william.gray

Add basic tests ensuring that values can be added at arbitrary positions
of the bitmap, including those spanning into the adjacent unsigned
longs.

Signed-off-by: Alexander Potapenko <glider@google.com>

---
This patch was previously called
"lib/test_bitmap: add tests for bitmap_{set,get}_value_unaligned"

v3:
 - switch to using bitmap_{set,get}_value()
 - change the expected bit pattern in test_set_get_value(),
   as the test was incorrectly assuming 0 is the LSB.
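
   (For reference: with the new semantics, writing 0b10101 at bit i and
   0b101 at bit i + 5 places the second value at the higher bit offsets,
   so reading 8 bits back at offset i yields 0b10110101.)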
---
 lib/test_bitmap.c | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/lib/test_bitmap.c b/lib/test_bitmap.c
index 187f5b2db4cf1..c2ab54040c249 100644
--- a/lib/test_bitmap.c
+++ b/lib/test_bitmap.c
@@ -71,6 +71,17 @@ __check_eq_uint(const char *srcfile, unsigned int line,
 	return true;
 }
 
+static bool __init
+__check_eq_ulong(const char *srcfile, unsigned int line,
+		 const unsigned long exp_ulong, unsigned long x)
+{
+	if (exp_ulong != x) {
+		pr_err("[%s:%u] expected %lu, got %lu\n",
+			srcfile, line, exp_ulong, x);
+		return false;
+	}
+	return true;
+}
 
 static bool __init
 __check_eq_bitmap(const char *srcfile, unsigned int line,
@@ -186,6 +197,7 @@ __check_eq_str(const char *srcfile, unsigned int line,
 	})
 
 #define expect_eq_uint(...)		__expect_eq(uint, ##__VA_ARGS__)
+#define expect_eq_ulong(...)		__expect_eq(ulong, ##__VA_ARGS__)
 #define expect_eq_bitmap(...)		__expect_eq(bitmap, ##__VA_ARGS__)
 #define expect_eq_pbl(...)		__expect_eq(pbl, ##__VA_ARGS__)
 #define expect_eq_u32_array(...)	__expect_eq(u32_array, ##__VA_ARGS__)
@@ -1222,6 +1234,25 @@ static void __init test_bitmap_const_eval(void)
 	BUILD_BUG_ON(~var != ~BIT(25));
 }
 
+static void __init test_set_get_value(void)
+{
+	DECLARE_BITMAP(bitmap, BITS_PER_LONG * 2);
+	unsigned long val;
+	int i;
+
+	for (i = 0; i < BITS_PER_LONG * 2 - 7; i++) {
+		bitmap_zero(bitmap, BITS_PER_LONG * 2);
+		bitmap_set_value(bitmap, 0b10101UL, i, 5);
+		val = bitmap_get_value(bitmap, i, 5);
+		expect_eq_ulong(0b10101UL, val);
+		bitmap_set_value(bitmap, 0b101UL, i + 5, 3);
+		val = bitmap_get_value(bitmap, i + 5, 3);
+		expect_eq_ulong(0b101UL, val);
+		val = bitmap_get_value(bitmap, i, 8);
+		expect_eq_ulong(0b10110101UL, val);
+	}
+}
+
 static void __init selftest(void)
 {
 	test_zero_clear();
@@ -1249,6 +1280,8 @@ static void __init selftest(void)
 	test_for_each_clear_bitrange_from();
 	test_for_each_set_clump8();
 	test_for_each_set_bit_wrap();
+
+	test_set_get_value();
 }
 
 KSTM_MODULE_LOADERS(test_bitmap);
-- 
2.41.0.255.g8b1d071c50-goog



* [PATCH v3 3/5] arm64: mte: implement CONFIG_ARM64_MTE_COMP
  2023-07-17 11:37 [PATCH v3 0/5] Implement MTE tag compression for swapped pages Alexander Potapenko
  2023-07-17 11:37 ` [PATCH v3 1/5] lib/bitmap: add bitmap_{set,get}_value() Alexander Potapenko
  2023-07-17 11:37 ` [PATCH v3 2/5] lib/test_bitmap: add tests for bitmap_{set,get}_value() Alexander Potapenko
@ 2023-07-17 11:37 ` Alexander Potapenko
  2023-07-17 13:49   ` Andy Shevchenko
  2023-07-19  6:09   ` Yury Norov
  2023-07-17 11:37 ` [PATCH v3 4/5] arm64: mte: add a test for MTE tags compression Alexander Potapenko
  2023-07-17 11:37 ` [PATCH v3 5/5] arm64: mte: add compression support to mteswap.c Alexander Potapenko
  4 siblings, 2 replies; 37+ messages in thread
From: Alexander Potapenko @ 2023-07-17 11:37 UTC (permalink / raw)
  To: glider, catalin.marinas, will, pcc, andreyknvl,
	andriy.shevchenko, linux, yury.norov
  Cc: linux-kernel, linux-arm-kernel, eugenis, syednwaris, william.gray

This config option enables the EA0 algorithm, suggested by Evgenii
Stepanov, to compress the memory tags for ARM MTE during swapping.

The algorithm is based on RLE and specifically targets 128-byte buffers
of tags corresponding to a single page. In the common case a buffer
can be compressed into 63 bits, making it possible to store it without
additional memory allocation.
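
As an illustration (a sketch using the helpers added by this patch), a
page whose first two tag granules are 0xa and whose remaining 254
granules are 0x0 produces just two ranges, which easily fit into the
inline handle:

	u8 tags[128] = { 0xaa, };	/* the other 127 bytes are zero */
	u8 r_tags[256];
	short r_sizes[256];
	int r_len = ARRAY_SIZE(r_tags);

	ea0_tags_to_ranges(tags, r_tags, r_sizes, &r_len);
	/* r_len == 2, r_tags == {0xa, 0x0}, r_sizes == {2, 254} */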

Suggested-by: Evgenii Stepanov <eugenis@google.com>
Signed-off-by: Alexander Potapenko <glider@google.com>

---
 v3:
  - Addressed comments by Andy Shevchenko:
   - use bitmap_{set,get}_value() written by Syed Nayyar Waris
   - switched to unsigned long everywhere (fewer casts)
   - simplified the code, removed redundant checks
   - dropped ea0_compress_inline()
 - added bit size constants and helpers to access the bitmap
 - explicitly initialize all compressed sizes in ea0_compress_to_buf()
 - initialize all handle bits

 v2:
  - as suggested by Yury Norov, switched from struct bitq (which is
    not needed anymore) to <linux/bitmap.h>
  - add missing symbol exports
---
 arch/arm64/Kconfig               |  10 +
 arch/arm64/include/asm/mtecomp.h |  60 +++++
 arch/arm64/mm/Makefile           |   1 +
 arch/arm64/mm/mtecomp.c          | 406 +++++++++++++++++++++++++++++++
 4 files changed, 477 insertions(+)
 create mode 100644 arch/arm64/include/asm/mtecomp.h
 create mode 100644 arch/arm64/mm/mtecomp.c

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index a2511b30d0f67..52cdc7603cf7c 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -2093,6 +2093,16 @@ config ARM64_EPAN
 	  if the cpu does not implement the feature.
 endmenu # "ARMv8.7 architectural features"
 
+config ARM64_MTE_COMP
+	bool "Tag compression for ARM64 MTE"
+	default y
+	depends on ARM64_MTE
+	help
+	  Enable tag compression support for ARM64 MTE.
+
+	  128-byte tag buffers corresponding to 4K pages can be compressed using
+	  the EA0 algorithm to save heap memory.
+
 config ARM64_SVE
 	bool "ARM Scalable Vector Extension support"
 	default y
diff --git a/arch/arm64/include/asm/mtecomp.h b/arch/arm64/include/asm/mtecomp.h
new file mode 100644
index 0000000000000..0c444c0d4ac04
--- /dev/null
+++ b/arch/arm64/include/asm/mtecomp.h
@@ -0,0 +1,60 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef __ASM_MTECOMP_H
+#define __ASM_MTECOMP_H
+
+#include <linux/types.h>
+
+/*
+ * ea0_compress() - compress the given tag array.
+ * @tags: 128-byte array to read the tags from.
+ *
+ * Compresses the tags and returns a 64-bit opaque handle pointing to the
+ * tag storage. May allocate memory, which is freed by @ea0_release_handle().
+ */
+unsigned long ea0_compress(u8 *tags);
+
+/*
+ * ea0_decompress() - decompress the tag array addressed by the handle.
+ * @handle: handle returned by @ea0_compress()
+ * @tags: 128-byte array to write the tags to.
+ *
+ * Reads the compressed data and writes it into the user-supplied tag array.
+ * Returns true on success, false on error.
+ */
+bool ea0_decompress(unsigned long handle, u8 *tags);
+
+/*
+ * ea0_release_handle() - release the handle returned by ea0_compress().
+ * @handle: handle returned by ea0_compress().
+ */
+void ea0_release_handle(unsigned long handle);
+
+/* Functions below are exported for testing purposes. */
+
+/*
+ * ea0_storage_size() - calculate the memory occupied by compressed tags.
+ * @handle: storage handle returned by ea0_compress().
+ */
+int ea0_storage_size(unsigned long handle);
+
+/*
+ * ea0_tags_to_ranges() - break @tags into arrays of tag ranges.
+ * @tags: 128-byte array containing 256 MTE tags.
+ * @out_tags: u8 array to store the tag of every range.
+ * @out_sizes: u16 array to store the size of every range.
+ * @out_len: length of @out_tags and @out_sizes (output parameter, initially
+ *           equal to lengths of out_tags[] and out_sizes[]).
+ */
+void ea0_tags_to_ranges(u8 *tags, u8 *out_tags, short *out_sizes, int *out_len);
+
+/*
+ * ea0_ranges_to_tags() - fill @tags using given tag ranges.
+ * @r_tags: u8[256] containing the tag of every range.
+ * @r_sizes: u16[256] containing the size of every range.
+ * @r_len: length of @r_tags and @r_sizes.
+ * @tags: 128-byte array to write the tags to.
+ */
+void ea0_ranges_to_tags(u8 *r_tags, short *r_sizes, int r_len, u8 *tags);
+
+#endif // __ASM_MTECOMP_H
diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
index dbd1bc95967d0..46778f6dd83c2 100644
--- a/arch/arm64/mm/Makefile
+++ b/arch/arm64/mm/Makefile
@@ -10,6 +10,7 @@ obj-$(CONFIG_TRANS_TABLE)	+= trans_pgd.o
 obj-$(CONFIG_TRANS_TABLE)	+= trans_pgd-asm.o
 obj-$(CONFIG_DEBUG_VIRTUAL)	+= physaddr.o
 obj-$(CONFIG_ARM64_MTE)		+= mteswap.o
+obj-$(CONFIG_ARM64_MTE_COMP)	+= mtecomp.o
 KASAN_SANITIZE_physaddr.o	+= n
 
 obj-$(CONFIG_KASAN)		+= kasan_init.o
diff --git a/arch/arm64/mm/mtecomp.c b/arch/arm64/mm/mtecomp.c
new file mode 100644
index 0000000000000..50a379c035aee
--- /dev/null
+++ b/arch/arm64/mm/mtecomp.c
@@ -0,0 +1,406 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+/*
+ * MTE tag compression algorithm.
+ * Proposed by Evgenii Stepanov <eugenis@google.com>
+ */
+
+/*
+ * EA0 stands for "Evgenii's Algorithm 0", as the initial proposal contained two
+ * compression algorithms.
+ *
+ * The algorithm attempts to compress a 128-byte (MTE_GRANULES_PER_PAGE / 2)
+ * array of tags into a smaller byte sequence that can be stored in a
+ * 16-, 32-, or 64-byte buffer. A special case is storing the tags inline in
+ * an 8-byte pointer.
+ *
+ * We encapsulate tag storage memory management in this module, because it is
+ * tightly coupled with the pointer representation.
+ *   ea0_compress(*tags) takes a 128-byte buffer and returns an opaque value
+ *     that can be stored in Xarray
+ *   ea0_decompress(*ptr, *tags) takes the opaque value and loads the tags into
+ *     the provided 128-byte buffer.
+ *
+ * The compression algorithm works as follows.
+ *
+ * 1. The input array of 128 bytes is transformed into tag ranges (two arrays:
+ *    @r_tags containing tag values and @r_sizes containing range lengths) by
+ *    ea0_tags_to_ranges(). Note that @r_sizes sums up to 256.
+ *
+ * 2. Depending on the number N of ranges, the following storage class is picked:
+ *            N <= 6:  8 bytes (inline case, no allocation required);
+ *       6 < N <= 11: 16 bytes
+ *      11 < N <= 23: 32 bytes
+ *      23 < N <= 46: 64 bytes
+ *      46 < N:       128 bytes (no compression will be performed)
+ *
+ * 3. The index of the largest element of @r_sizes is stored in @largest_idx.
+ *    The element itself is dropped from @r_sizes, because it can be
+ *    reconstructed from the sum of the remaining elements. Note that after
+ *    this, none of the remaining @r_sizes elements is greater than 127.
+ *
+ * 4. For the inline case, the following values are stored in the 8-byte handle:
+ *       largest_idx : i4
+ *      r_tags[0..5] : i4 x 6
+ *     r_sizes[0..4] : i7 x 5
+ *    (if N is less than 6, @r_tags and @r_sizes are padded up with zero values)
+ *
+ *    Because @largest_idx is <= 5, bit 63 of the handle is always 0 (so it can
+ *    be stored in the Xarray), and bits 62..60 cannot all be 1, so it can be
+ *    distinguished from a kernel pointer.
+ *
+ * 5. For the out-of-line case, the storage is allocated from one of the
+ *    "mte-tags-{16,32,64,128}" kmem caches. The resulting pointer is aligned
+ *    on 8 bytes, so its bits 2..0 can be used to store the size class:
+ *     - 0 for 128 bytes
+ *     - 1 for 16
+ *     - 2 for 32
+ *     - 3 for 64.
+ *    Bit 63 of the pointer is zeroed out, so that it can be stored in Xarray.
+ *
+ * 6. The data layout in the allocated storage is as follows:
+ *         largest_idx : i6
+ *      r_tags[0..N-1] : i4 x N
+ *     r_sizes[0..N-2] : i7 x (N-1)
+ *
+ * The decompression algorithm performs the steps below.
+ *
+ * 1. Decide if data is stored inline (bits 62..60 of the handle != 0b111) or
+ *    out-of line.
+ *
+ * 2. For the inline case, treat the handle itself as the input buffer.
+ *
+ * 3. For the out-of-line case, look at bits 2..0 of the handle to understand
+ *    the input buffer length. To obtain the pointer to the input buffer, unset
+ *    bits 2..0 of the handle and set bit 63.
+ *
+ * 4. If the input buffer is 128 byte long, copy its contents to the output
+ *    buffer.
+ *
+ * 5. Otherwise, read @largest_idx, @r_tags and @r_sizes from the input buffer.
+ *    Calculate the removed largest element of @r_sizes:
+ *      largest = 256 - sum(r_sizes)
+ *    and insert it into @r_sizes at position @largest_idx.
+ *
+ * 6. While @r_sizes[i] > 0, add a 4-bit value @r_tags[i] to the output buffer
+ *    @r_sizes[i] times.
+ */
+
+#include <linux/bits.h>
+#include <linux/bitmap.h>
+#include <linux/gfp.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/swab.h>
+#include <linux/string.h>
+#include <linux/types.h>
+
+#include <asm/mtecomp.h>
+
+/* The handle must fit into an Xarray value. */
+#define HANDLE_MASK GENMASK_ULL(62, 0)
+
+/* Out-of-line handles have 0b111 in bits 62..60. */
+#define NOINLINE_MASK GENMASK_ULL(62, 60)
+
+/* Cache index is stored in the lowest pointer bits. */
+#define CACHE_ID_MASK GENMASK_ULL(2, 0)
+
+/*
+ * Four separate caches to store out-of-line data:
+ *  0: mte-tags-128
+ *  1: mte-tags-16
+ *  2: mte-tags-32
+ *  3: mte-tags-64
+ */
+#define NUM_CACHES 4
+static struct kmem_cache *mtecomp_caches[NUM_CACHES];
+
+/*
+ * Sizes of compressed values.
+ */
+#define BITS_PER_TAG 4
+#define BITS_PER_SIZE 7
+#define BITS_PER_LARGEST_IDX_INLINE 4
+#define BITS_PER_LARGEST_IDX 6
+
+/* Translate allocation size into mtecomp_caches[] index. */
+static int ea0_size_to_cache_id(int len)
+{
+	if (len < 128)
+		return fls(len >> 4);
+	return 0;
+}
+
+/* Translate mtecomp_caches[] index into allocation size. */
+static int ea0_cache_id_to_size(int id)
+{
+	if (id == 0)
+		return 128;
+	return 8 << id;
+}
+
+/* Transform tags into tag ranges. */
+void ea0_tags_to_ranges(u8 *tags, u8 *out_tags, short *out_sizes, int *out_len)
+{
+	u8 prev_tag = U8_MAX;
+	int cur_idx = -1;
+	u8 cur_tag;
+	int i, j;
+
+	memset(out_tags, 0, array_size(*out_len, sizeof(*out_tags)));
+	memset(out_sizes, 0, array_size(*out_len, sizeof(*out_sizes)));
+
+	for (i = 0; i < MTE_PAGE_TAG_STORAGE; i++) {
+		for (j = 0; j < 2; j++) {
+			cur_tag = j ? (tags[i] % 16) : (tags[i] / 16);
+			if (cur_tag == prev_tag) {
+				out_sizes[cur_idx]++;
+			} else {
+				cur_idx++;
+				prev_tag = cur_tag;
+				out_tags[cur_idx] = prev_tag;
+				out_sizes[cur_idx] = 1;
+			}
+		}
+	}
+	*out_len = cur_idx + 1;
+}
+EXPORT_SYMBOL_NS(ea0_tags_to_ranges, MTECOMP);
+
+/* Transform tag ranges back into tags. */
+void ea0_ranges_to_tags(u8 *r_tags, short *r_sizes, int r_len, u8 *tags)
+{
+	int i, j, pos = 0;
+	u8 prev;
+
+	for (i = 0; i < r_len; i++) {
+		for (j = 0; j < r_sizes[i]; j++) {
+			if (pos % 2)
+				tags[pos / 2] = (prev << 4) | r_tags[i];
+			else
+				prev = r_tags[i];
+			pos++;
+		}
+	}
+}
+EXPORT_SYMBOL_NS(ea0_ranges_to_tags, MTECOMP);
+
+/* Translate @num_ranges into the allocation size needed to hold them. */
+static int ea0_alloc_size(int num_ranges)
+{
+	if (num_ranges <= 6)
+		return 8;
+	if (num_ranges <= 11)
+		return 16;
+	if (num_ranges <= 23)
+		return 32;
+	if (num_ranges <= 46)
+		return 64;
+	return 128;
+}
+
+/* Translate allocation size into maximum number of ranges that it can hold. */
+static int ea0_size_to_ranges(int size)
+{
+	switch (size) {
+	case 8:
+		return 6;
+	case 16:
+		return 11;
+	case 32:
+		return 23;
+	case 64:
+		return 46;
+	default:
+		return 0;
+	}
+}
+#define RANGES_INLINE ea0_size_to_ranges(8)
+
+/* Is the data stored inline in the handle itself? */
+static bool ea0_is_inline(unsigned long handle)
+{
+	return (handle & NOINLINE_MASK) != NOINLINE_MASK;
+}
+
+/* Get the size of the buffer backing @handle. */
+int ea0_storage_size(unsigned long handle)
+{
+	if (ea0_is_inline(handle))
+		return 8;
+	return ea0_cache_id_to_size(handle & CACHE_ID_MASK);
+}
+EXPORT_SYMBOL_NS(ea0_storage_size, MTECOMP);
+
+static void bitmap_write(unsigned long *bitmap, unsigned long value,
+			 unsigned long *pos, unsigned long bits)
+{
+	bitmap_set_value(bitmap, value, *pos, bits);
+	*pos += bits;
+}
+
+/* Compress ranges into the buffer that can accommodate up to max_ranges. */
+static void ea0_compress_to_buf(int len, u8 *tags, short *sizes,
+				unsigned long *bitmap, int max_ranges)
+{
+	unsigned long bit_pos = 0, l_bits;
+	int largest_idx = -1, i;
+	short largest = 0;
+
+	for (i = 0; i < len; i++) {
+		if (sizes[i] > largest) {
+			largest = sizes[i];
+			largest_idx = i;
+		}
+	}
+	l_bits = (max_ranges == RANGES_INLINE) ? BITS_PER_LARGEST_IDX_INLINE :
+						 BITS_PER_LARGEST_IDX;
+	bitmap_write(bitmap, largest_idx, &bit_pos, l_bits);
+	for (i = 0; i < len; i++)
+		bitmap_write(bitmap, tags[i], &bit_pos, BITS_PER_TAG);
+	for (i = len; i < max_ranges; i++)
+		bitmap_write(bitmap, 0, &bit_pos, BITS_PER_TAG);
+	for (i = 0; i < len; i++) {
+		if (i == largest_idx)
+			continue;
+		bitmap_write(bitmap, sizes[i], &bit_pos, BITS_PER_SIZE);
+	}
+	for (i = len; i < max_ranges; i++)
+		bitmap_write(bitmap, 0, &bit_pos, BITS_PER_SIZE);
+}
+
+/* Compress @tags and return a handle. */
+unsigned long ea0_compress(u8 *tags)
+{
+	int alloc_size, cache_id;
+	struct kmem_cache *cache;
+	short r_sizes[256];
+	u8 r_tags[256];
+	int r_len = ARRAY_SIZE(r_tags);
+	unsigned long *storage;
+	/*
+	 * ea0_compress_to_buf() only initializes the bits that ea0_decompress()
+	 * will read. But when the tags are stored in the handle itself, it must
+	 * have all its bits initialized.
+	 */
+	unsigned long result = 0;
+
+	ea0_tags_to_ranges(tags, r_tags, r_sizes, &r_len);
+	alloc_size = ea0_alloc_size(r_len);
+	if (alloc_size == 8) {
+		ea0_compress_to_buf(r_len, r_tags, r_sizes, &result,
+				    RANGES_INLINE);
+		return result;
+	}
+	cache_id = ea0_size_to_cache_id(alloc_size);
+	cache = mtecomp_caches[cache_id];
+	storage = kmem_cache_alloc(cache, GFP_KERNEL);
+	if (alloc_size < 128) {
+		/* alloc_size is always a multiple of sizeof(unsigned long). */
+		ea0_compress_to_buf(r_len, r_tags, r_sizes, storage,
+				    ea0_size_to_ranges(alloc_size));
+		return ((unsigned long)storage | cache_id) & HANDLE_MASK;
+	}
+	memcpy(storage, tags, alloc_size);
+	return (unsigned long)storage & HANDLE_MASK;
+}
+EXPORT_SYMBOL_NS(ea0_compress, MTECOMP);
+
+static unsigned long bitmap_read(const unsigned long *bitmap,
+				 unsigned long *pos, unsigned long bits)
+{
+	unsigned long result;
+
+	result = bitmap_get_value(bitmap, *pos, bits);
+	*pos += bits;
+	return result;
+}
+
+/* Decompress the contents of the given buffer into @tags. */
+static bool ea0_decompress_from_buf(const unsigned long *bitmap, int max_ranges,
+				    u8 *tags)
+{
+	int largest_idx, i;
+	short r_sizes[46], sum = 0;
+	u8 r_tags[46];
+	unsigned long bit_pos = 0, l_bits;
+
+	l_bits = (max_ranges == RANGES_INLINE) ? BITS_PER_LARGEST_IDX_INLINE :
+						 BITS_PER_LARGEST_IDX;
+	largest_idx = bitmap_read(bitmap, &bit_pos, l_bits);
+	for (i = 0; i < max_ranges; i++)
+		r_tags[i] = bitmap_read(bitmap, &bit_pos, BITS_PER_TAG);
+	for (i = 0; i < max_ranges; i++) {
+		if (i == largest_idx)
+			continue;
+		r_sizes[i] = bitmap_read(bitmap, &bit_pos, BITS_PER_SIZE);
+		if (!r_sizes[i]) {
+			max_ranges = i;
+			break;
+		}
+		sum += r_sizes[i];
+	}
+	if (sum >= 256)
+		return false;
+	r_sizes[largest_idx] = 256 - sum;
+	ea0_ranges_to_tags(r_tags, r_sizes, max_ranges, tags);
+	return true;
+}
+
+/* Get pointer to the out-of-line storage from a handle. */
+static void *ea0_storage(unsigned long handle)
+{
+	if (ea0_is_inline(handle))
+		return NULL;
+	return (void *)((handle & (~CACHE_ID_MASK)) | BIT_ULL(63));
+}
+
+/* Decompress tags from the buffer referenced by @handle. */
+bool ea0_decompress(unsigned long handle, u8 *tags)
+{
+	unsigned long *storage = ea0_storage(handle);
+	int size = ea0_storage_size(handle);
+
+	if (size == 128) {
+		memcpy(tags, storage, size);
+		return true;
+	}
+	if (size == 8)
+		return ea0_decompress_from_buf(&handle, RANGES_INLINE, tags);
+	return ea0_decompress_from_buf(storage, ea0_size_to_ranges(size), tags);
+}
+EXPORT_SYMBOL_NS(ea0_decompress, MTECOMP);
+
+/* Release the memory referenced by @handle. */
+void ea0_release_handle(unsigned long handle)
+{
+	void *storage = ea0_storage(handle);
+	int size = ea0_storage_size(handle);
+	struct kmem_cache *c;
+
+	if (!storage)
+		return;
+
+	c = mtecomp_caches[ea0_size_to_cache_id(size)];
+	kmem_cache_free(c, storage);
+}
+EXPORT_SYMBOL_NS(ea0_release_handle, MTECOMP);
+
+/* Set up mtecomp_caches[]. */
+static int mtecomp_init(void)
+{
+	char name[16];
+	int size;
+	int i;
+
+	BUILD_BUG_ON(MTE_PAGE_TAG_STORAGE != 128);
+	for (i = 0; i < NUM_CACHES; i++) {
+		size = ea0_cache_id_to_size(i);
+		snprintf(name, ARRAY_SIZE(name), "mte-tags-%d", size);
+		mtecomp_caches[i] =
+			kmem_cache_create(name, size, size, 0, NULL);
+	}
+	return 0;
+}
+module_init(mtecomp_init);
-- 
2.41.0.255.g8b1d071c50-goog



* [PATCH v3 4/5] arm64: mte: add a test for MTE tags compression
  2023-07-17 11:37 [PATCH v3 0/5] Implement MTE tag compression for swapped pages Alexander Potapenko
                   ` (2 preceding siblings ...)
  2023-07-17 11:37 ` [PATCH v3 3/5] arm64: mte: implement CONFIG_ARM64_MTE_COMP Alexander Potapenko
@ 2023-07-17 11:37 ` Alexander Potapenko
  2023-07-17 11:37 ` [PATCH v3 5/5] arm64: mte: add compression support to mteswap.c Alexander Potapenko
  4 siblings, 0 replies; 37+ messages in thread
From: Alexander Potapenko @ 2023-07-17 11:37 UTC (permalink / raw)
  To: glider, catalin.marinas, will, pcc, andreyknvl,
	andriy.shevchenko, linux, yury.norov
  Cc: linux-kernel, linux-arm-kernel, eugenis, syednwaris, william.gray

Ensure that tag sequences containing alternating values are compressed
to buffers of the expected size and correctly decompressed afterwards.
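
For example, a buffer containing 5 tag ranges is expected to compress
into the 8-byte inline handle, 20 ranges into a 32-byte buffer, and
anything above 46 ranges to stay uncompressed in a 128-byte buffer (see
compress_range_helper() below).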

Signed-off-by: Alexander Potapenko <glider@google.com>

---
 v3:
  - addressed comments by Andy Shevchenko in another patch:
   - switched from u64 to unsigned long
   - added MODULE_IMPORT_NS(MTECOMP)
   - fixed includes order
---
 arch/arm64/Kconfig           |  10 ++
 arch/arm64/mm/Makefile       |   1 +
 arch/arm64/mm/test_mtecomp.c | 177 +++++++++++++++++++++++++++++++++++
 3 files changed, 188 insertions(+)
 create mode 100644 arch/arm64/mm/test_mtecomp.c

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 52cdc7603cf7c..a6f1a36e8d27b 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -2103,6 +2103,16 @@ config ARM64_MTE_COMP
 	  128-byte tag buffers corresponding to 4K pages can be compressed using
 	  the EA0 algorithm to save heap memory.
 
+config ARM64_MTE_COMP_KUNIT_TEST
+	tristate "Test tag compression for ARM64 MTE" if !KUNIT_ALL_TESTS
+	default KUNIT_ALL_TESTS
+	depends on KUNIT && ARM64_MTE_COMP
+	help
+	  Test EA0 compression algorithm enabled by CONFIG_ARM64_MTE_COMP.
+
+	  Ensure that tag sequences containing alternating values are compressed
+	  to buffers of the expected size and correctly decompressed afterwards.
+
 config ARM64_SVE
 	bool "ARM Scalable Vector Extension support"
 	default y
diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
index 46778f6dd83c2..170dc62b010b9 100644
--- a/arch/arm64/mm/Makefile
+++ b/arch/arm64/mm/Makefile
@@ -11,6 +11,7 @@ obj-$(CONFIG_TRANS_TABLE)	+= trans_pgd-asm.o
 obj-$(CONFIG_DEBUG_VIRTUAL)	+= physaddr.o
 obj-$(CONFIG_ARM64_MTE)		+= mteswap.o
 obj-$(CONFIG_ARM64_MTE_COMP)	+= mtecomp.o
+obj-$(CONFIG_ARM64_MTE_COMP_KUNIT_TEST) += test_mtecomp.o
 KASAN_SANITIZE_physaddr.o	+= n
 
 obj-$(CONFIG_KASAN)		+= kasan_init.o
diff --git a/arch/arm64/mm/test_mtecomp.c b/arch/arm64/mm/test_mtecomp.c
new file mode 100644
index 0000000000000..bb71c6224e34a
--- /dev/null
+++ b/arch/arm64/mm/test_mtecomp.c
@@ -0,0 +1,177 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Test cases for EA0, the compression algorithm for MTE tags.
+ */
+
+#include <kunit/test.h>
+#include <linux/slab.h>
+#include <linux/types.h>
+
+#include <asm/mtecomp.h>
+
+/*
+ * Test that ea0_tags_to_ranges() produces a single range for a zero-filled tag
+ * buffer.
+ */
+static void test_tags_to_ranges_zero(struct kunit *test)
+{
+	u8 tags[128], dtags[128];
+	short r_sizes[256];
+	int r_len = 256;
+	u8 r_tags[256];
+
+	memset(tags, 0, 128);
+	ea0_tags_to_ranges(tags, r_tags, r_sizes, &r_len);
+	KUNIT_EXPECT_EQ(test, r_len, 1);
+	KUNIT_EXPECT_EQ(test, r_tags[0], 0);
+	KUNIT_EXPECT_EQ(test, r_sizes[0], 256);
+	ea0_ranges_to_tags(r_tags, r_sizes, r_len, dtags);
+	KUNIT_EXPECT_EQ(test, memcmp(tags, dtags, 128), 0);
+}
+
+/*
+ * Test that a small number of different tags is correctly transformed into
+ * ranges.
+ */
+static void test_tags_to_ranges_simple(struct kunit *test)
+{
+	u8 tags[128], dtags[128];
+	const u8 ex_tags[] = { 0xa, 0x0, 0xa, 0xb, 0x0 };
+	const short ex_sizes[] = { 1, 2, 2, 1, 250 };
+	short r_sizes[256];
+	int r_len = 256;
+	u8 r_tags[256];
+
+	memset(tags, 0, 128);
+	tags[0] = 0xa0;
+	tags[1] = 0x0a;
+	tags[2] = 0xab;
+	ea0_tags_to_ranges(tags, r_tags, r_sizes, &r_len);
+	KUNIT_EXPECT_EQ(test, r_len, 5);
+	KUNIT_EXPECT_EQ(test, memcmp(r_tags, ex_tags, sizeof(ex_tags)), 0);
+	KUNIT_EXPECT_EQ(test, memcmp(r_sizes, ex_sizes, sizeof(ex_sizes)), 0);
+	ea0_ranges_to_tags(r_tags, r_sizes, r_len, dtags);
+	KUNIT_EXPECT_EQ(test, memcmp(tags, dtags, 128), 0);
+}
+
+/* Test that repeated 0xa0 byte produces 256 ranges of length 1. */
+static void test_tags_to_ranges_repeated(struct kunit *test)
+{
+	u8 tags[128], dtags[128];
+	short r_sizes[256];
+	int r_len = 256;
+	u8 r_tags[256];
+
+	memset(tags, 0xa0, 128);
+	ea0_tags_to_ranges(tags, r_tags, r_sizes, &r_len);
+	KUNIT_EXPECT_EQ(test, r_len, 256);
+	ea0_ranges_to_tags(r_tags, r_sizes, r_len, dtags);
+	KUNIT_EXPECT_EQ(test, memcmp(tags, dtags, 128), 0);
+}
+
+/* Test that a zero-filled array is compressed into inline storage. */
+static void test_compress_zero(struct kunit *test)
+{
+	u8 tags[128], dtags[128];
+	unsigned long handle;
+
+	memset(tags, 0, 128);
+	handle = ea0_compress(tags);
+	KUNIT_EXPECT_EQ(test, handle & BIT_ULL(63), 0);
+	/* Tags are stored inline. */
+	KUNIT_EXPECT_EQ(test, ea0_storage_size(handle), 8);
+	KUNIT_EXPECT_TRUE(test, ea0_decompress(handle, dtags));
+	KUNIT_EXPECT_EQ(test, memcmp(tags, dtags, 128), 0);
+}
+
+/*
+ * Test that a very small number of tag ranges ends up compressed into 8 bytes.
+ */
+static void test_compress_simple(struct kunit *test)
+{
+	u8 tags[128], dtags[128];
+	unsigned long handle;
+
+	memset(tags, 0, 128);
+	tags[0] = 0xa0;
+	tags[1] = 0x0a;
+	tags[2] = 0xab;
+
+	handle = ea0_compress(tags);
+	KUNIT_EXPECT_EQ(test, handle & BIT_ULL(63), 0);
+	/* Tags are stored inline. */
+	KUNIT_EXPECT_EQ(test, ea0_storage_size(handle), 8);
+	KUNIT_EXPECT_TRUE(test, ea0_decompress(handle, dtags));
+	KUNIT_EXPECT_EQ(test, memcmp(tags, dtags, 128), 0);
+}
+
+/*
+ * Generate a buffer that will contain @nranges of tag ranges, test that it
+ * compresses into @exp_size bytes and decompresses into the original tag
+ * sequence.
+ */
+static void compress_range_helper(struct kunit *test, int nranges, int exp_size)
+{
+	u8 tags[128], dtags[128];
+	unsigned long handle;
+	int i;
+
+	memset(tags, 0, 128);
+
+	if (nranges > 1) {
+		nranges--;
+		for (i = 0; i < nranges / 2; i++)
+			tags[i] = 0xab;
+		if (nranges % 2)
+			tags[nranges / 2] = 0xa0;
+	}
+
+	handle = ea0_compress(tags);
+	KUNIT_EXPECT_EQ(test, handle & BIT_ULL(63), 0);
+	KUNIT_EXPECT_EQ(test, ea0_storage_size(handle), exp_size);
+	KUNIT_EXPECT_TRUE(test, ea0_decompress(handle, dtags));
+	KUNIT_EXPECT_EQ(test, memcmp(tags, dtags, 128), 0);
+}
+
+/*
+ * Test that every number of tag ranges is correctly compressed and
+ * decompressed.
+ */
+static void test_compress_ranges(struct kunit *test)
+{
+	int i, exp_size;
+
+	for (i = 1; i <= 256; i++) {
+		if (i < 7)
+			exp_size = 8;
+		else if (i < 12)
+			exp_size = 16;
+		else if (i < 24)
+			exp_size = 32;
+		else if (i < 47)
+			exp_size = 64;
+		else
+			exp_size = 128;
+		compress_range_helper(test, i, exp_size);
+	}
+}
+
+static struct kunit_case mtecomp_test_cases[] = {
+	KUNIT_CASE(test_tags_to_ranges_zero),
+	KUNIT_CASE(test_tags_to_ranges_simple),
+	KUNIT_CASE(test_tags_to_ranges_repeated),
+	KUNIT_CASE(test_compress_zero),
+	KUNIT_CASE(test_compress_simple),
+	KUNIT_CASE(test_compress_ranges),
+	{}
+};
+
+static struct kunit_suite mtecomp_test_suite = {
+	.name = "mtecomp",
+	.test_cases = mtecomp_test_cases,
+};
+kunit_test_suites(&mtecomp_test_suite);
+
+MODULE_IMPORT_NS(MTECOMP);
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Alexander Potapenko <glider@google.com>");
-- 
2.41.0.255.g8b1d071c50-goog



* [PATCH v3 5/5] arm64: mte: add compression support to mteswap.c
  2023-07-17 11:37 [PATCH v3 0/5] Implement MTE tag compression for swapped pages Alexander Potapenko
                   ` (3 preceding siblings ...)
  2023-07-17 11:37 ` [PATCH v3 4/5] arm64: mte: add a test for MTE tags compression Alexander Potapenko
@ 2023-07-17 11:37 ` Alexander Potapenko
  2023-07-17 13:53   ` Andy Shevchenko
  4 siblings, 1 reply; 37+ messages in thread
From: Alexander Potapenko @ 2023-07-17 11:37 UTC (permalink / raw)
  To: glider, catalin.marinas, will, pcc, andreyknvl,
	andriy.shevchenko, linux, yury.norov
  Cc: linux-kernel, linux-arm-kernel, eugenis, syednwaris, william.gray

Define the internal mteswap.h interface:
 - _mte_alloc_and_save_tags()
 - _mte_free_saved_tags()
 - _mte_restore_tags()

These functions encapsulate saving the tags for a struct page (together
with the memory allocation), restoring the tags, and freeing the storage
allocated for them.

They accept opaque pointers, which may point to 128-byte tag buffers or
to smaller buffers containing compressed tags, or may have the
compressed tags stored directly in them.
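
A sketch of the resulting swap-out/swap-in flow (simplified from the
mteswap.c diff below):

	/* mte_save_tags(): */
	tag_storage = _mte_alloc_and_save_tags(page);
	ret = xa_store(&mte_pages, page_private(page), tag_storage,
		       GFP_KERNEL);

	/* mte_restore_tags(): */
	tags = xa_load(&mte_pages, entry.val);
	if (tags)
		_mte_restore_tags(tags, page);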

The existing code from mteswap.c operating with uncompressed tags is split
away into mteswap_nocomp.c, and the newly introduced mteswap_comp.c
provides compression with the EA0 algorithm. The latter implementation
is picked if CONFIG_ARM64_MTE_COMP=y.

Soon after booting Android, tag compression reduces the memory spent by
mteswap.c on tag allocations by ~2.5x. As uptime grows, the savings
reach 20x and even more.

Signed-off-by: Alexander Potapenko <glider@google.com>

---
 v3:
  - Addressed comments by Andy Shevchenko in another patch:
   - fixed includes order
   - replaced u64 with unsigned long
   - added MODULE_IMPORT_NS(MTECOMP)
---
 arch/arm64/mm/Makefile         |  5 ++++
 arch/arm64/mm/mteswap.c        | 20 ++++++-------
 arch/arm64/mm/mteswap.h        | 12 ++++++++
 arch/arm64/mm/mteswap_comp.c   | 52 ++++++++++++++++++++++++++++++++++
 arch/arm64/mm/mteswap_nocomp.c | 38 +++++++++++++++++++++++++
 5 files changed, 116 insertions(+), 11 deletions(-)
 create mode 100644 arch/arm64/mm/mteswap.h
 create mode 100644 arch/arm64/mm/mteswap_comp.c
 create mode 100644 arch/arm64/mm/mteswap_nocomp.c

diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
index 170dc62b010b9..46a798e2b67cb 100644
--- a/arch/arm64/mm/Makefile
+++ b/arch/arm64/mm/Makefile
@@ -11,6 +11,11 @@ obj-$(CONFIG_TRANS_TABLE)	+= trans_pgd-asm.o
 obj-$(CONFIG_DEBUG_VIRTUAL)	+= physaddr.o
 obj-$(CONFIG_ARM64_MTE)		+= mteswap.o
 obj-$(CONFIG_ARM64_MTE_COMP)	+= mtecomp.o
+ifdef CONFIG_ARM64_MTE_COMP
+obj-$(CONFIG_ARM64_MTE)		+= mteswap_comp.o
+else
+obj-$(CONFIG_ARM64_MTE)		+= mteswap_nocomp.o
+endif
 obj-$(CONFIG_ARM64_MTE_COMP_KUNIT_TEST) += test_mtecomp.o
 KASAN_SANITIZE_physaddr.o	+= n
 
diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c
index cd508ba80ab1b..9d8f87fd191a2 100644
--- a/arch/arm64/mm/mteswap.c
+++ b/arch/arm64/mm/mteswap.c
@@ -5,8 +5,11 @@
 #include <linux/slab.h>
 #include <linux/swap.h>
 #include <linux/swapops.h>
+
 #include <asm/mte.h>
 
+#include "mteswap.h"
+
 static DEFINE_XARRAY(mte_pages);
 
 void *mte_allocate_tag_storage(void)
@@ -27,20 +30,18 @@ int mte_save_tags(struct page *page)
 	if (!page_mte_tagged(page))
 		return 0;
 
-	tag_storage = mte_allocate_tag_storage();
+	tag_storage = _mte_alloc_and_save_tags(page);
 	if (!tag_storage)
 		return -ENOMEM;
 
-	mte_save_page_tags(page_address(page), tag_storage);
-
 	/* page_private contains the swap entry.val set in do_swap_page */
 	ret = xa_store(&mte_pages, page_private(page), tag_storage, GFP_KERNEL);
 	if (WARN(xa_is_err(ret), "Failed to store MTE tags")) {
-		mte_free_tag_storage(tag_storage);
+		_mte_free_saved_tags(tag_storage);
 		return xa_err(ret);
 	} else if (ret) {
 		/* Entry is being replaced, free the old entry */
-		mte_free_tag_storage(ret);
+		_mte_free_saved_tags(ret);
 	}
 
 	return 0;
@@ -53,10 +54,7 @@ void mte_restore_tags(swp_entry_t entry, struct page *page)
 	if (!tags)
 		return;
 
-	if (try_page_mte_tagging(page)) {
-		mte_restore_page_tags(page_address(page), tags);
-		set_page_mte_tagged(page);
-	}
+	_mte_restore_tags(tags, page);
 }
 
 void mte_invalidate_tags(int type, pgoff_t offset)
@@ -64,7 +62,7 @@ void mte_invalidate_tags(int type, pgoff_t offset)
 	swp_entry_t entry = swp_entry(type, offset);
 	void *tags = xa_erase(&mte_pages, entry.val);
 
-	mte_free_tag_storage(tags);
+	_mte_free_saved_tags(tags);
 }
 
 void mte_invalidate_tags_area(int type)
@@ -78,7 +76,7 @@ void mte_invalidate_tags_area(int type)
 	xa_lock(&mte_pages);
 	xas_for_each(&xa_state, tags, last_entry.val - 1) {
 		__xa_erase(&mte_pages, xa_state.xa_index);
-		mte_free_tag_storage(tags);
+		_mte_free_saved_tags(tags);
 	}
 	xa_unlock(&mte_pages);
 }
diff --git a/arch/arm64/mm/mteswap.h b/arch/arm64/mm/mteswap.h
new file mode 100644
index 0000000000000..bf25f2b3e75a4
--- /dev/null
+++ b/arch/arm64/mm/mteswap.h
@@ -0,0 +1,12 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef ARCH_ARM64_MM_MTESWAP_H_
+#define ARCH_ARM64_MM_MTESWAP_H_
+
+#include <linux/mm_types.h>
+
+void *_mte_alloc_and_save_tags(struct page *page);
+void _mte_free_saved_tags(void *tags);
+void _mte_restore_tags(void *tags, struct page *page);
+
+#endif // ARCH_ARM64_MM_MTESWAP_H_
diff --git a/arch/arm64/mm/mteswap_comp.c b/arch/arm64/mm/mteswap_comp.c
new file mode 100644
index 0000000000000..7a369c3fb9c94
--- /dev/null
+++ b/arch/arm64/mm/mteswap_comp.c
@@ -0,0 +1,52 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/* MTE tag storage management with EA0 compression. */
+
+#include <linux/pagemap.h>
+#include <linux/slab.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
+#include <linux/xarray.h>
+
+#include <asm/mte.h>
+#include <asm/mtecomp.h>
+
+#include "mteswap.h"
+
+void *_mte_alloc_and_save_tags(struct page *page)
+{
+	u8 tags[128];
+	unsigned long handle;
+
+	mte_save_page_tags(page_address(page), tags);
+	handle = ea0_compress(tags);
+	return xa_mk_value(handle);
+}
+
+void _mte_free_saved_tags(void *storage)
+{
+	unsigned long handle = xa_to_value(storage);
+	int size;
+
+	if (!handle)
+		return;
+	size = ea0_storage_size(handle);
+	ea0_release_handle(handle);
+}
+
+void _mte_restore_tags(void *tags, struct page *page)
+{
+	unsigned long handle = xa_to_value(tags);
+	u8 tags_decomp[128];
+
+	if (!handle)
+		return;
+
+	if (try_page_mte_tagging(page)) {
+		if (!ea0_decompress(handle, tags_decomp))
+			return;
+		mte_restore_page_tags(page_address(page), tags_decomp);
+		set_page_mte_tagged(page);
+	}
+}
+MODULE_IMPORT_NS(MTECOMP);
diff --git a/arch/arm64/mm/mteswap_nocomp.c b/arch/arm64/mm/mteswap_nocomp.c
new file mode 100644
index 0000000000000..32733998b1879
--- /dev/null
+++ b/arch/arm64/mm/mteswap_nocomp.c
@@ -0,0 +1,38 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/* MTE tag storage management without compression support. */
+
+#include <linux/pagemap.h>
+#include <linux/slab.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
+#include <linux/xarray.h>
+
+#include <asm/mte.h>
+
+#include "mteswap.h"
+
+void *_mte_alloc_and_save_tags(struct page *page)
+{
+	void *storage;
+
+	storage = mte_allocate_tag_storage();
+	if (!storage)
+		return NULL;
+
+	mte_save_page_tags(page_address(page), storage);
+	return storage;
+}
+
+void _mte_free_saved_tags(void *storage)
+{
+	mte_free_tag_storage(storage);
+}
+
+void _mte_restore_tags(void *tags, struct page *page)
+{
+	if (try_page_mte_tagging(page)) {
+		mte_restore_page_tags(page_address(page), tags);
+		set_page_mte_tagged(page);
+	}
+}
-- 
2.41.0.255.g8b1d071c50-goog



* Re: [PATCH v3 1/5] lib/bitmap: add bitmap_{set,get}_value()
  2023-07-17 11:37 ` [PATCH v3 1/5] lib/bitmap: add bitmap_{set,get}_value() Alexander Potapenko
@ 2023-07-17 13:01   ` Andy Shevchenko
  2023-07-17 14:14     ` Alexander Potapenko
  2023-07-17 15:50   ` Yury Norov
  1 sibling, 1 reply; 37+ messages in thread
From: Andy Shevchenko @ 2023-07-17 13:01 UTC (permalink / raw)
  To: Alexander Potapenko
  Cc: catalin.marinas, will, pcc, andreyknvl, linux, yury.norov,
	linux-kernel, linux-arm-kernel, eugenis, syednwaris,
	william.gray, Arnd Bergmann

On Mon, Jul 17, 2023 at 01:37:04PM +0200, Alexander Potapenko wrote:
> The two new functions allow setting/getting values of length up to
> BITS_PER_LONG bits at arbitrary position in the bitmap.
> 
> The code was taken from "bitops: Introduce the for_each_set_clump macro"
> by Syed Nayyar Waris with a couple of minor changes:

Since the changes are minor, please make sure that the authorship is kept
untouched.

>  - instead of using roundup(), which adds an unnecessary dependency
>    on <linux/math.h>, we calculate space as BITS_PER_LONG-offset;
>  - indentation is reduced by not using else-clauses (suggested by
>    checkpatch for bitmap_get_value())

> Cc: Arnd Bergmann <arnd@arndb.de>

You can use --cc to `git send-email` instead of polluting the commit message.

> Signed-off-by: Syed Nayyar Waris <syednwaris@gmail.com>
> Signed-off-by: William Breathitt Gray <william.gray@linaro.org>
> Link: https://lore.kernel.org/lkml/fe12eedf3666f4af5138de0e70b67a07c7f40338.1592224129.git.syednwaris@gmail.com/
> Suggested-by: Yury Norov <yury.norov@gmail.com>
> Signed-off-by: Alexander Potapenko <glider@google.com>

With the above, I think you can also add a Co-developed-by tag (as changes
were made).

...

> +static inline void bitmap_set_value(unsigned long *map,
> +				    unsigned long value,
> +				    unsigned long start, unsigned long nbits)
> +{
> +	const size_t index = BIT_WORD(start);
> +	const unsigned long offset = start % BITS_PER_LONG;
> +	const unsigned long space = BITS_PER_LONG - offset;
> +
> +	value &= GENMASK(nbits - 1, 0);
> +
> +	if (space >= nbits) {

> +		map[index] &= ~(GENMASK(nbits + offset - 1, offset));

I remember that this construction may produce horrible code on some
architectures with some version(s) of the compiler (*).

To fix that I found an easy refactoring:

		map[index] &= ~(GENMASK(nbits - 1, 0) << offset);


(*) don't remember the actual versions, though, but anyway...

> +		map[index] |= value << offset;
> +		return;
> +	}
> +	map[index] &= ~BITMAP_FIRST_WORD_MASK(start);
> +	map[index] |= value << offset;
> +	map[index + 1] &= ~BITMAP_LAST_WORD_MASK(start + nbits);
> +	map[index + 1] |= (value >> space);
> +}

-- 
With Best Regards,
Andy Shevchenko




* Re: [PATCH v3 2/5] lib/test_bitmap: add tests for bitmap_{set,get}_value()
  2023-07-17 11:37 ` [PATCH v3 2/5] lib/test_bitmap: add tests for bitmap_{set,get}_value() Alexander Potapenko
@ 2023-07-17 13:04   ` Andy Shevchenko
  2023-07-18 10:19     ` Alexander Potapenko
  2023-07-17 16:11   ` Yury Norov
  1 sibling, 1 reply; 37+ messages in thread
From: Andy Shevchenko @ 2023-07-17 13:04 UTC (permalink / raw)
  To: Alexander Potapenko
  Cc: catalin.marinas, will, pcc, andreyknvl, linux, yury.norov,
	linux-kernel, linux-arm-kernel, eugenis, syednwaris,
	william.gray

On Mon, Jul 17, 2023 at 01:37:05PM +0200, Alexander Potapenko wrote:
> Add basic tests ensuring that values can be added at arbitrary positions
> of the bitmap, including those spanning into the adjacent unsigned
> longs.

I'm always in favour of a new test case!
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>

> Signed-off-by: Alexander Potapenko <glider@google.com>
> 
> ---
> This patch was previously called
> "lib/test_bitmap: add tests for bitmap_{set,get}_value_unaligned"

Hint: you may always just link to the lore mail archive for easier access
and handling. Also, with `b4` at hand, the lore link can be used to
resurrect a discussion (in case it's needed).

> v3:
>  - switch to using bitmap_{set,get}_value()
>  - change the expected bit pattern in test_set_get_value(),
>    as the test was incorrectly assuming 0 is the LSB.
> ---
>  lib/test_bitmap.c | 33 +++++++++++++++++++++++++++++++++
>  1 file changed, 33 insertions(+)
> 
> diff --git a/lib/test_bitmap.c b/lib/test_bitmap.c
> index 187f5b2db4cf1..c2ab54040c249 100644
> --- a/lib/test_bitmap.c
> +++ b/lib/test_bitmap.c
> @@ -71,6 +71,17 @@ __check_eq_uint(const char *srcfile, unsigned int line,
>  	return true;
>  }
>  
> +static bool __init
> +__check_eq_ulong(const char *srcfile, unsigned int line,
> +		 const unsigned long exp_ulong, unsigned long x)
> +{
> +	if (exp_ulong != x) {
> +		pr_err("[%s:%u] expected %lu, got %lu\n",
> +			srcfile, line, exp_ulong, x);
> +		return false;
> +	}
> +	return true;
> +}
>  
>  static bool __init
>  __check_eq_bitmap(const char *srcfile, unsigned int line,
> @@ -186,6 +197,7 @@ __check_eq_str(const char *srcfile, unsigned int line,
>  	})
>  
>  #define expect_eq_uint(...)		__expect_eq(uint, ##__VA_ARGS__)
> +#define expect_eq_ulong(...)		__expect_eq(ulong, ##__VA_ARGS__)
>  #define expect_eq_bitmap(...)		__expect_eq(bitmap, ##__VA_ARGS__)
>  #define expect_eq_pbl(...)		__expect_eq(pbl, ##__VA_ARGS__)
>  #define expect_eq_u32_array(...)	__expect_eq(u32_array, ##__VA_ARGS__)
> @@ -1222,6 +1234,25 @@ static void __init test_bitmap_const_eval(void)
>  	BUILD_BUG_ON(~var != ~BIT(25));
>  }
>  
> +static void __init test_set_get_value(void)
> +{
> +	DECLARE_BITMAP(bitmap, BITS_PER_LONG * 2);
> +	unsigned long val;
> +	int i;
> +
> +	for (i = 0; i < BITS_PER_LONG * 2 - 7; i++) {
> +		bitmap_zero(bitmap, BITS_PER_LONG * 2);
> +		bitmap_set_value(bitmap, 0b10101UL, i, 5);
> +		val = bitmap_get_value(bitmap, i, 5);
> +		expect_eq_ulong(0b10101UL, val);
> +		bitmap_set_value(bitmap, 0b101UL, i + 5, 3);
> +		val = bitmap_get_value(bitmap, i + 5, 3);
> +		expect_eq_ulong(0b101UL, val);
> +		val = bitmap_get_value(bitmap, i, 8);
> +		expect_eq_ulong(0b10110101UL, val);
> +	}
> +}
> +
>  static void __init selftest(void)
>  {
>  	test_zero_clear();
> @@ -1249,6 +1280,8 @@ static void __init selftest(void)
>  	test_for_each_clear_bitrange_from();
>  	test_for_each_set_clump8();
>  	test_for_each_set_bit_wrap();
> +
> +	test_set_get_value();
>  }
>  
>  KSTM_MODULE_LOADERS(test_bitmap);
> -- 
> 2.41.0.255.g8b1d071c50-goog
> 

-- 
With Best Regards,
Andy Shevchenko




* Re: [PATCH v3 3/5] arm64: mte: implement CONFIG_ARM64_MTE_COMP
  2023-07-17 11:37 ` [PATCH v3 3/5] arm64: mte: implement CONFIG_ARM64_MTE_COMP Alexander Potapenko
@ 2023-07-17 13:49   ` Andy Shevchenko
  2023-07-18 15:33     ` Alexander Potapenko
  2023-07-19  6:09   ` Yury Norov
  1 sibling, 1 reply; 37+ messages in thread
From: Andy Shevchenko @ 2023-07-17 13:49 UTC (permalink / raw)
  To: Alexander Potapenko
  Cc: catalin.marinas, will, pcc, andreyknvl, linux, yury.norov,
	linux-kernel, linux-arm-kernel, eugenis, syednwaris,
	william.gray

On Mon, Jul 17, 2023 at 01:37:06PM +0200, Alexander Potapenko wrote:
> The config implements the EA0 algorithm suggested by Evgenii Stepanov
> to compress the memory tags for ARM MTE during swapping.
> 
> The algorithm is based on RLE and specifically targets 128-byte buffers
> of tags corresponding to a single page. In the common case a buffer
> can be compressed into 63 bits, making it possible to store it without
> additional memory allocation.

...

> +config ARM64_MTE_COMP
> +	bool "Tag compression for ARM64 MTE"

At least here, make sure everybody understands what you are talking about. WTF
MTE is?

> +	default y
> +	depends on ARM64_MTE
> +	help
> +	  Enable tag compression support for ARM64 MTE.
> +
> +	  128-byte tag buffers corresponding to 4K pages can be compressed using
> +	  the EA0 algorithm to save heap memory.

>  config ARM64_SVE
>  	bool "ARM Scalable Vector Extension support"

You see the difference?

...

> +/*

Have you deliberately made it NON-kernel-doc? If so, why? And why does it
have so many similarities with the above-mentioned format?

> + * ea0_compress() - compress the given tag array.
> + * @tags: 128-byte array to read the tags from.
> + *
> + * Compresses the tags and returns a 64-bit opaque handle pointing to the
> + * tag storage. May allocate memory, which is freed by @ea0_release_handle().
> + */
> +unsigned long ea0_compress(u8 *tags);
> +
> +/*
> + * ea0_decompress() - decompress the tag array addressed by the handle.
> + * @handle: handle returned by @ea0_compress()
> + * @tags: 128-byte array to write the tags to.
> + *
> + * Reads the compressed data and writes it into the user-supplied tag array.
> + * Returns true on success, false on error.

In case you are going to make them real kernel-doc:s, make sure the
kernel-doc validator doesn't warn. Here, for example, the Return section is
missing. The easy fix is to add ':' after 'Returns'. The same goes for the
rest of the function descriptions. Also, why did you put the descriptions
into the header file? It's a bit unusual for exported ones.

> + */

...

> +/*
> + * ea0_tags_to_ranges() - break @tags into arrays of tag ranges.
> + * @tags: 128-byte array containing 256 MTE tags.
> + * @out_tags: u8 array to store the tag of every range.
> + * @out_sizes: u16 array to store the size of every range.

u16? I don't see it.

> + * @out_len: length of @out_tags and @out_sizes (output parameter, initially
> + *           equal to lengths of out_tags[] and out_sizes[]).
> + */

> +/*
> + * ea0_ranges_to_tags() - fill @tags using given tag ranges.
> + * @r_tags: u8[256] containing the tag of every range.
> + * @r_sizes: u16[256] containing the size of every range.

Ditto.

> + * @r_len: length of @r_tags and @r_sizes.
> + * @tags: 128-byte array to write the tags to.
> + */
> +void ea0_ranges_to_tags(u8 *r_tags, short *r_sizes, int r_len, u8 *tags);

In both cases a signed integer may be promoted with sign extension. Is that a problem here?

...

> +/*
> + * EA0 stands for "Evgenii's Algorithm 0", as the initial proposal contained two
> + * compression algorithms.
> + *
> + * The algorithm attempts to compress a 128-byte (MTE_GRANULES_PER_PAGE / 2)
> + * array of tags into a smaller byte sequence that can be stored in a
> + * 16-, 32-, or 64-byte buffer. A special case is storing the tags inline in
> + * an 8-byte pointer.
> + *
> + * We encapsulate tag storage memory management in this module, because it is
> + * tightly coupled with the pointer representation.

> + *   ea0_compress(*tags) takes a 128-byte buffer and returns an opaque value

ea0_compress() is the usual way we refer to functions. Let the tools make
the necessary references.

> + *     that can be stored in Xarray
> + *   ea0_decompress(*ptr, *tags) takes the opaque value and loads the tags into

Ditto. And so on.

> + *     the provided 128-byte buffer.
> + *
> + * The compression algorithm works as follows.
> + *
> + * 1. The input array of 128 bytes is transformed into tag ranges (two arrays:
> + *    @r_tags containing tag values and @r_sizes containing range lengths) by
> + *    ea0_tags_to_ranges(). Note that @r_sizes sums up to 256.
> + *
> + * 2. Depending on the number N of ranges, the following storage class is picked:
> + *            N <= 6:  8 bytes (inline case, no allocation required);
> + *       6 < N <= 11: 16 bytes
> + *      11 < N <= 23: 32 bytes
> + *      23 < N <= 46: 64 bytes
> + *      46 < N:       128 bytes (no compression will be performed)
> + *
> + * 3. The index of the largest element of @r_sizes is stored in @largest_idx.
> + *    The element itself is dropped from @r_sizes, because it can be
> + *    reconstructed from the sum of the remaining elements. Note that after
> + *    this, none of the remaining @r_sizes elements is greater than 127.
> + *
> + * 4. For the inline case, the following values are stored in the 8-byte handle:
> + *       largest_idx : i4
> + *      r_tags[0..5] : i4 x 6
> + *     r_sizes[0..4] : i7 x 5
> + *    (if N is less than 6, @r_tags and @r_sizes are padded up with zero values)
> + *
> + *    Because @largest_idx is <= 5, bit 63 of the handle is always 0 (so it can
> + *    be stored in the Xarray), and bits 62..60 cannot all be 1, so it can be
> + *    distinguished from a kernel pointer.
> + *
> + * 5. For the out-of-line case, the storage is allocated from one of the
> + *    "mte-tags-{16,32,64,128}" kmem caches. The resulting pointer is aligned
> + *    on 8 bytes, so its bits 2..0 can be used to store the size class:

> + *     - 0 for 128 bytes
> + *     - 1 for 16
> + *     - 2 for 32
> + *     - 4 for 64.

Is this chosen deliberately (for performance)? Otherwise, why not put them in
naturally growing exponential order?

> + *    Bit 63 of the pointer is zeroed out, so that it can be stored in Xarray.
> + *
> + * 6. The data layout in the allocated storage is as follows:
> + *         largest_idx : i6
> + *        r_tags[0..N] : i4 x N
> + *     r_sizes[0..N-1] : i7 x (N-1)
> + *
> + * The decompression algorithm performs the steps below.
> + *
> + * 1. Decide if data is stored inline (bits 62..60 of the handle != 0b111) or
> + *    out-of-line.
> + *
> + * 2. For the inline case, treat the handle itself as the input buffer.
> + *
> + * 3. For the out-of-line case, look at bits 2..0 of the handle to understand
> + *    the input buffer length. To obtain the pointer to the input buffer, unset
> + *    bits 2..0 of the handle and set bit 63.
> + *
> + * 4. If the input buffer is 128 byte long, copy its contents to the output
> + *    buffer.
> + *
> + * 5. Otherwise, read @largest_idx, @r_tags and @r_sizes from the input buffer.
> + *    Calculate the removed largest element of @r_sizes:
> + *      largest = 256 - sum(r_sizes)
> + *    and insert it into @r_sizes at position @largest_idx.
> + *
> + * 6. While @r_sizes[i] > 0, add a 4-bit value @r_tags[i] to the output buffer
> + *    @r_sizes[i] times.
> + */
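
As a reading aid, steps 1-3 of the decompression map to something like the
following (my sketch, based purely on the description above; the helper name
is an assumption, not the patch code):

	/* Inline if bits 62..60 of the handle are not all ones. */
	bool inline_case = ((handle >> 60) & 7) != 7;

	if (!inline_case) {
		/* Bits 2..0 hold the size class of the out-of-line buffer. */
		size_t size = ea0_cache_id_to_size(handle & 7);
		/* Unset bits 2..0, set bit 63 to recover the kernel pointer. */
		unsigned long *buf = (unsigned long *)((handle & ~7UL) | BIT(63));
		...
	}

Note also that the inline layout in step 4 adds up to 4 + 6*4 + 5*7 = 63 bits,
which is consistent with bit 63 always staying zero.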

...

> +#include <linux/bits.h>
> +#include <linux/bitmap.h>

bitmap.h guarantees that bits.h will be included.

> +#include <linux/gfp.h>
> +#include <linux/module.h>
> +#include <linux/slab.h>
> +#include <linux/swab.h>
> +#include <linux/string.h>
> +#include <linux/types.h>

...

> +void ea0_tags_to_ranges(u8 *tags, u8 *out_tags, short *out_sizes, int *out_len)
> +{
> +	u8 prev_tag = U8_MAX;

> +	int cur_idx = -1;

Under which circumstances does this assignment make sense?

> +	u8 cur_tag;
> +	int i, j;
> +
> +	memset(out_tags, 0, array_size(*out_len, sizeof(*out_tags)));
> +	memset(out_sizes, 0, array_size(*out_len, sizeof(*out_sizes)));
> +
> +	for (i = 0; i < MTE_PAGE_TAG_STORAGE; i++) {
> +		for (j = 0; j < 2; j++) {
> +			cur_tag = j ? (tags[i] % 16) : (tags[i] / 16);
> +			if (cur_tag == prev_tag) {
> +				out_sizes[cur_idx]++;

Who guarantees this one is not [-1]?

> +			} else {

> +				cur_idx++;

Aha, the above seems a bit prone to out-of-bounds errors. Can you make it
unsigned and start from 0?

> +				prev_tag = cur_tag;
> +				out_tags[cur_idx] = prev_tag;
> +				out_sizes[cur_idx] = 1;
> +			}
> +		}
> +	}
> +	*out_len = cur_idx + 1;
> +}
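
To make the nibble walk concrete, a toy example (mine, not from the patch):
for a 2-byte input {0x11, 0x12} the tags are visited high nibble first as
1, 1, 1, 2, so the function yields out_tags = {1, 2}, out_sizes = {3, 1}
and *out_len = 2.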

...

> +void ea0_ranges_to_tags(u8 *r_tags, short *r_sizes, int r_len, u8 *tags)
> +{
> +	int i, j, pos = 0;

Wouldn't it be more correct to have this assignment inside the first for-loop?

> +	u8 prev;
> +
> +	for (i = 0; i < r_len; i++) {
> +		for (j = 0; j < r_sizes[i]; j++) {
> +			if (pos % 2)
> +				tags[pos / 2] = (prev << 4) | r_tags[i];
> +			else
> +				prev = r_tags[i];
> +			pos++;
> +		}
> +	}
> +}
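
Running the same toy example backwards: r_tags = {1, 2}, r_sizes = {3, 1},
r_len = 2 emits the nibbles 1, 1, 1, 2 and packs them two per byte, high
nibble first, giving tags = {0x11, 0x12}.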

...

> +#define RANGES_INLINE ea0_size_to_ranges(8)

Don't forget to undef it when not needed.

...

> +static void bitmap_write(unsigned long *bitmap, unsigned long value,
> +			 unsigned long *pos, unsigned long bits)

Please don't use the reserved namespace. Yours is ea0, so use it:
ea0_bitmap_write()! Same for other similarly named functions.

...

> +	unsigned long bit_pos = 0, l_bits;
> +	int largest_idx = -1, i;
> +	short largest = 0;

Here and elsewhere, please double-check the correctness and/or necessity of
the signedness and assignments of local variables.

...

> +	for (i = 0; i < len; i++) {
> +		if (sizes[i] > largest) {

Here

		if (largest >= sizes[i])
			continue;
makes sense, but...

> +			largest = sizes[i];
> +			largest_idx = i;
> +		}
> +	}

...

> +	for (i = 0; i < len; i++) {
> +		if (i == largest_idx)
> +			continue;
> +		bitmap_write(bitmap, sizes[i], &bit_pos, BITS_PER_SIZE);

...here I would do the opposite, since it's a one-liner.

> +	}

...

> +	u8 r_tags[256];
> +	int r_len = ARRAY_SIZE(r_tags);

sizeof() ?

...

> +	l_bits = (max_ranges == RANGES_INLINE) ? BITS_PER_LARGEST_IDX_INLINE :
> +						 BITS_PER_LARGEST_IDX;

Is it a dup? Perhaps a helper for this?

Seems BITS_PER_TAG, BITS_PER_SIZE and the rest should also be namespaced,
EA0_BITS_...

...

> +bool ea0_decompress(unsigned long handle, u8 *tags)
> +{
> +	unsigned long *storage = ea0_storage(handle);
> +	int size = ea0_storage_size(handle);
> +
> +	if (size == 128) {
> +		memcpy(tags, storage, size);
> +		return true;
> +	}
> +	if (size == 8)
> +		return ea0_decompress_from_buf(&handle, RANGES_INLINE, tags);

Maybe

	switch (ea0_storage_size(handle)) {
		...
	default:
	}

?
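
Spelled out, just as a sketch of the suggestion (reusing the already
computed 'size'):

	switch (size) {
	case 8:
		return ea0_decompress_from_buf(&handle, RANGES_INLINE, tags);
	case 128:
		memcpy(tags, storage, size);
		return true;
	default:
		return ea0_decompress_from_buf(storage, ea0_size_to_ranges(size), tags);
	}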

> +	return ea0_decompress_from_buf(storage, ea0_size_to_ranges(size), tags);
> +}

...

> +void ea0_release_handle(unsigned long handle)
> +{
> +	void *storage = ea0_storage(handle);
> +	int size = ea0_storage_size(handle);
> +	struct kmem_cache *c;

> +	if (!storage)
> +		return;

I find it slightly easier to maintain in this form:

	struct kmem_cache *c;
	void *storage;
	int size;

	storage = ea0_storage(handle);
	if (!storage)
		return;

	size = ea0_storage_size(handle);

> +	c = mtecomp_caches[ea0_size_to_cache_id(size)];
> +	kmem_cache_free(c, storage);
> +}

...

> +static int mtecomp_init(void)
> +{
> +	char name[16];
> +	int size;
> +	int i;
> +
> +	BUILD_BUG_ON(MTE_PAGE_TAG_STORAGE != 128);

Why not static_assert()?
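
I.e., something like:

	static_assert(MTE_PAGE_TAG_STORAGE == 128);

which also works at file scope, outside of any function body.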

> +	for (i = 0; i < NUM_CACHES; i++) {
> +		size = ea0_cache_id_to_size(i);
> +		snprintf(name, ARRAY_SIZE(name), "mte-tags-%d", size);

sizeof() will work the same way without the need to include kernel.h.

> +		mtecomp_caches[i] =
> +			kmem_cache_create(name, size, size, 0, NULL);
> +	}
> +	return 0;
> +}

-- 
With Best Regards,
Andy Shevchenko



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 5/5] arm64: mte: add compression support to mteswap.c
  2023-07-17 11:37 ` [PATCH v3 5/5] arm64: mte: add compression support to mteswap.c Alexander Potapenko
@ 2023-07-17 13:53   ` Andy Shevchenko
  2023-07-18 10:48     ` Alexander Potapenko
  0 siblings, 1 reply; 37+ messages in thread
From: Andy Shevchenko @ 2023-07-17 13:53 UTC (permalink / raw)
  To: Alexander Potapenko
  Cc: catalin.marinas, will, pcc, andreyknvl, linux, yury.norov,
	linux-kernel, linux-arm-kernel, eugenis, syednwaris,
	william.gray

On Mon, Jul 17, 2023 at 01:37:08PM +0200, Alexander Potapenko wrote:
> Define the internal mteswap.h interface:
>  - _mte_alloc_and_save_tags()
>  - _mte_free_saved_tags()
>  - _mte_restore_tags()
> 
> which encapsulate saving tags for a struct page (together with memory
> allocation), restoring tags, and deleting the storage allocated for them.
> 
> These functions accept opaque pointers, which may point to 128-byte
> tag buffers, as well as smaller buffers containing compressed tags, or
> have compressed tags stored directly in them.
> 
> The existing code from mteswap.c operating with uncompressed tags is split
> away into mteswap_nocomp.c, and the newly introduced mteswap_comp.c
> provides compression with the EA0 algorithm. The latter implementation
> is picked if CONFIG_ARM64_MTE_COMP=y.
> 
> Soon after booting Android, tag compression saves ~2.5x memory previously
> spent by mteswap.c on tag allocations. As uptime grows, the
> savings reach 20x and even more.

...

> +#ifndef ARCH_ARM64_MM_MTESWAP_H_
> +#define ARCH_ARM64_MM_MTESWAP_H_

> +#include <linux/mm_types.h>

But you don't actually use that.

struct page;

A forward declaration is enough.

> +void *_mte_alloc_and_save_tags(struct page *page);
> +void _mte_free_saved_tags(void *tags);
> +void _mte_restore_tags(void *tags, struct page *page);
> +
> +#endif // ARCH_ARM64_MM_MTESWAP_H_

...

> +void _mte_free_saved_tags(void *storage)
> +{
> +	unsigned long handle = xa_to_value(storage);
> +	int size;
> +
> +	if (!handle)
> +		return;

Perhaps

	unsigned long handle;

	handle = xa_to_value(storage);
	if (!handle)
		return;

> +	size = ea0_storage_size(handle);
> +	ea0_release_handle(handle);
> +}

> +void _mte_restore_tags(void *tags, struct page *page)
> +{

As per above.

> +	if (try_page_mte_tagging(page)) {
> +		if (!ea0_decompress(handle, tags_decomp))
> +			return;
> +		mte_restore_page_tags(page_address(page), tags_decomp);
> +		set_page_mte_tagged(page);
> +	}

I think you may drop an indentation level by

	if (!try_page_mte_tagging(page))
		return;

> +}
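
With both early returns applied, the whole body would read (a sketch that
just combines the quoted lines, elided setup kept as is):

	void _mte_restore_tags(void *tags, struct page *page)
	{
		...
		if (!try_page_mte_tagging(page))
			return;
		if (!ea0_decompress(handle, tags_decomp))
			return;
		mte_restore_page_tags(page_address(page), tags_decomp);
		set_page_mte_tagged(page);
	}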

...

> +void _mte_restore_tags(void *tags, struct page *page)
> +{
> +	if (try_page_mte_tagging(page)) {
> +		mte_restore_page_tags(page_address(page), tags);
> +		set_page_mte_tagged(page);
> +	}

Ditto.

> +}

-- 
With Best Regards,
Andy Shevchenko



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 1/5] lib/bitmap: add bitmap_{set,get}_value()
  2023-07-17 13:01   ` Andy Shevchenko
@ 2023-07-17 14:14     ` Alexander Potapenko
  2023-07-17 14:29       ` Andy Shevchenko
  2023-07-17 14:53       ` Alexander Potapenko
  0 siblings, 2 replies; 37+ messages in thread
From: Alexander Potapenko @ 2023-07-17 14:14 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: catalin.marinas, will, pcc, andreyknvl, linux, yury.norov,
	linux-kernel, linux-arm-kernel, eugenis, syednwaris,
	william.gray, Arnd Bergmann

>
> > Cc: Arnd Bergmann <arnd@arndb.de>
>
> You can use --cc to `git send-email` instead of polluting the commit message.

Right. But as far as I can tell, certain kernel devs prefer to be CCed
on the whole series, whereas others do not want to see anything but
the actual patch they were interested in.
I am not sure about Arnd's preferences, so I just decided to keep the
tag from the original patch by Syed Nayyar Waris (which I also
consider to be an indication of the fact "that potentially interested
parties have been included in the discussion" per
https://www.kernel.org/doc/html/v4.17/process/submitting-patches.html#when-to-use-acked-by-cc-and-co-developed-by)

> > Signed-off-by: Syed Nayyar Waris <syednwaris@gmail.com>
> > Signed-off-by: William Breathitt Gray <william.gray@linaro.org>
> > Link: https://lore.kernel.org/lkml/fe12eedf3666f4af5138de0e70b67a07c7f40338.1592224129.git.syednwaris@gmail.com/
> > Suggested-by: Yury Norov <yury.norov@gmail.com>
> > Signed-off-by: Alexander Potapenko <glider@google.com>
>
> With above, I think you can also add Co-developed-by (as the changes were
> made).

Ok, will do.

> ...
>
> > +static inline void bitmap_set_value(unsigned long *map,
> > +                                 unsigned long value,
> > +                                 unsigned long start, unsigned long nbits)
> > +{
> > +     const size_t index = BIT_WORD(start);
> > +     const unsigned long offset = start % BITS_PER_LONG;
> > +     const unsigned long space = BITS_PER_LONG - offset;
> > +
> > +     value &= GENMASK(nbits - 1, 0);
> > +
> > +     if (space >= nbits) {
>
> > +             map[index] &= ~(GENMASK(nbits + offset - 1, offset));
>
> I remember that this construction may bring horrible code on some architectures
> with some version(s) of the compiler (*).

Wow, even the trunk Clang and GCC seem to generate better code for
your version of this line: https://godbolt.org/z/36Kqxhe6j

> To fix that I found an easy refactoring:
>
>                 map[index] &= ~(GENMASK(nbits, 0) << offset);
>

I'll take this one.

> (*) don't remember the actual versions, though, but anyway...
>
> > +             map[index] |= value << offset;
> > +             return;
> > +     }
> > +     map[index] &= ~BITMAP_FIRST_WORD_MASK(start);
> > +     map[index] |= value << offset;
> > +     map[index + 1] &= ~BITMAP_LAST_WORD_MASK(start + nbits);
> > +     map[index + 1] |= (value >> space);
> > +}
>
> --
> With Best Regards,
> Andy Shevchenko
>
>


-- 
Alexander Potapenko
Software Engineer

Google Germany GmbH
Erika-Mann-Straße, 33
80636 München

Geschäftsführer: Paul Manicle, Liana Sebastian
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 1/5] lib/bitmap: add bitmap_{set,get}_value()
  2023-07-17 14:14     ` Alexander Potapenko
@ 2023-07-17 14:29       ` Andy Shevchenko
  2023-07-17 14:31         ` Andy Shevchenko
  2023-07-17 14:53       ` Alexander Potapenko
  1 sibling, 1 reply; 37+ messages in thread
From: Andy Shevchenko @ 2023-07-17 14:29 UTC (permalink / raw)
  Cc: catalin.marinas, will, pcc, andreyknvl, linux, yury.norov,
	linux-kernel, linux-arm-kernel, eugenis, syednwaris,
	william.gray, Arnd Bergmann

On Mon, Jul 17, 2023 at 04:14:57PM +0200, Alexander Potapenko wrote:

+Cc: Nathan (on code generation question below)

...

> > > Cc: Arnd Bergmann <arnd@arndb.de>
> >
> > You can use --cc to `git send-email` instead of polluting the commit message.
> 
> Right. But as far as I can tell, certain kernel devs prefer to be CCed
> on the whole series, whereas others do not want to see anything but
> the actual patch they were interested in.
> I am not sure about Arnd's preferences, so I just decided to keep the
> tag from the original patch by Syed Nayyar Waris (which I also
> consider to be an indication of the fact "that potentially interested
> parties have been included in the discussion" per
> https://www.kernel.org/doc/html/v4.17/process/submitting-patches.html#when-to-use-acked-by-cc-and-co-developed-by)

My personal statistics from the field are that more than 90% of maintainers would
like to receive 100% of the series. Of course it depends on the series (if it's
treewide, I will agree with you). Another point in favor of my suggestion is that
Arnd is an SoC tree maintainer, where ARM is one of the biggest players, so I think
he would like to see arm*-prefixed patches anyway.

...

> > > +             map[index] &= ~(GENMASK(nbits + offset - 1, offset));
> >
> > I remember that this construction may bring horrible code on some architectures
> > with some version(s) of the compiler (*).
> 
> Wow, even the trunk Clang and GCC seem to generate better code for
> your version of this line: https://godbolt.org/z/36Kqxhe6j

Wow, indeed! Perhaps it's time to report this to the Clang and GCC people. I
believe the root cause is that in the original version the compiler can't prove
that 'l' is constant for GENMASK().

> > To fix that I found an easy refactoring:
> >
> >                 map[index] &= ~(GENMASK(nbits, 0) << offset);
> 
> I'll take this one.
> 
> > (*) don't remember the actual versions, though, but anyway...

-- 
With Best Regards,
Andy Shevchenko



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 1/5] lib/bitmap: add bitmap_{set,get}_value()
  2023-07-17 14:29       ` Andy Shevchenko
@ 2023-07-17 14:31         ` Andy Shevchenko
  2023-07-17 16:15           ` Yury Norov
  0 siblings, 1 reply; 37+ messages in thread
From: Andy Shevchenko @ 2023-07-17 14:31 UTC (permalink / raw)
  To: catalin.marinas, will, pcc, andreyknvl, linux, yury.norov,
	linux-kernel, linux-arm-kernel, eugenis, syednwaris,
	william.gray, Arnd Bergmann

On Mon, Jul 17, 2023 at 05:29:12PM +0300, Andy Shevchenko wrote:
> On Mon, Jul 17, 2023 at 04:14:57PM +0200, Alexander Potapenko wrote:

...

> > > > +             map[index] &= ~(GENMASK(nbits + offset - 1, offset));
> > >
> > > I remember that this construction may bring horrible code on some architectures
> > > with some version(s) of the compiler (*).
> > 
> > Wow, even the trunk Clang and GCC seem to generate better code for
> > your version of this line: https://godbolt.org/z/36Kqxhe6j
> 
> Wow, indeed! Perhaps it's time to report this to the Clang and GCC people. I
> believe the root cause is that in the original version the compiler can't prove
> that 'l' is constant for GENMASK().
> 
> > > To fix that I found an easy refactoring:
> > >
> > >                 map[index] &= ~(GENMASK(nbits, 0) << offset);

It should be nbits - 1, btw. In any case, the generated code still seems better.

> > I'll take this one.
> > 
> > > (*) don't remember the actual versions, though, but anyway...

-- 
With Best Regards,
Andy Shevchenko



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 1/5] lib/bitmap: add bitmap_{set,get}_value()
  2023-07-17 14:14     ` Alexander Potapenko
  2023-07-17 14:29       ` Andy Shevchenko
@ 2023-07-17 14:53       ` Alexander Potapenko
  2023-07-17 15:03         ` Andy Shevchenko
  1 sibling, 1 reply; 37+ messages in thread
From: Alexander Potapenko @ 2023-07-17 14:53 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: catalin.marinas, will, pcc, andreyknvl, linux, yury.norov,
	linux-kernel, linux-arm-kernel, eugenis, syednwaris,
	william.gray, Arnd Bergmann

> >
> > I remember that this construction may bring horrible code on some architectures
> > with some version(s) of the compiler (*).
>
> Wow, even the trunk Clang and GCC seem to generate better code for
> your version of this line: https://godbolt.org/z/36Kqxhe6j

Oh wait.
First, my Godbolt reproducer is incorrect, it is using sizeof(unsigned
long) instead of 8 * sizeof(unsigned long) for BITS_PER_LONG.

>
> > To fix that I found an easy refactoring:
> >
> >                 map[index] &= ~(GENMASK(nbits, 0) << offset);
> >

Second, the line above should probably be:
  map[index] &= ~(GENMASK(nbits - 1, 0) << offset);

, right?

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 1/5] lib/bitmap: add bitmap_{set,get}_value()
  2023-07-17 14:53       ` Alexander Potapenko
@ 2023-07-17 15:03         ` Andy Shevchenko
  2023-07-17 16:29           ` Alexander Potapenko
  0 siblings, 1 reply; 37+ messages in thread
From: Andy Shevchenko @ 2023-07-17 15:03 UTC (permalink / raw)
  To: Alexander Potapenko
  Cc: catalin.marinas, will, pcc, andreyknvl, linux, yury.norov,
	linux-kernel, linux-arm-kernel, eugenis, syednwaris,
	william.gray, Arnd Bergmann

On Mon, Jul 17, 2023 at 04:53:51PM +0200, Alexander Potapenko wrote:

...

> > > I remember that this construction may bring horrible code on some architectures
> > > with some version(s) of the compiler (*).
> >
> > Wow, even the trunk Clang and GCC seem to generate better code for
> > your version of this line: https://godbolt.org/z/36Kqxhe6j
> 
> Oh wait.
> First, my Godbolt reproducer is incorrect, it is using sizeof(unsigned
> long) instead of 8 * sizeof(unsigned long) for BITS_PER_LONG.

Still slightly better. And note that the same GENMASK() is used at the
beginning of the function. The compiler might actually cache it.

> > > To fix that I found an easy refactoring:
> > >
> > >                 map[index] &= ~(GENMASK(nbits, 0) << offset);
> > >
> 
> Second, the line above should probably be:
>   map[index] &= ~(GENMASK(nbits - 1, 0) << offset);
> 
> , right?

Yes.

-- 
With Best Regards,
Andy Shevchenko



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 1/5] lib/bitmap: add bitmap_{set,get}_value()
  2023-07-17 11:37 ` [PATCH v3 1/5] lib/bitmap: add bitmap_{set,get}_value() Alexander Potapenko
  2023-07-17 13:01   ` Andy Shevchenko
@ 2023-07-17 15:50   ` Yury Norov
  2023-07-18  9:30     ` Alexander Potapenko
  1 sibling, 1 reply; 37+ messages in thread
From: Yury Norov @ 2023-07-17 15:50 UTC (permalink / raw)
  To: Alexander Potapenko
  Cc: catalin.marinas, will, pcc, andreyknvl, andriy.shevchenko, linux,
	linux-kernel, linux-arm-kernel, eugenis, syednwaris,
	william.gray, Arnd Bergmann

On Mon, Jul 17, 2023 at 01:37:04PM +0200, Alexander Potapenko wrote:
> The two new functions allow setting/getting values of length up to
> BITS_PER_LONG bits at arbitrary position in the bitmap.
> 
> The code was taken from "bitops: Introduce the for_each_set_clump macro"
> by Syed Nayyar Waris with a couple of minor changes:
>  - instead of using roundup(), which adds an unnecessary dependency
>    on <linux/math.h>, we calculate space as BITS_PER_LONG-offset;
>  - indentation is reduced by not using else-clauses (suggested by
>    checkpatch for bitmap_get_value())

Please preserve Syed's authorship ('From' field in git log).

> Cc: Arnd Bergmann <arnd@arndb.de>
> Signed-off-by: Syed Nayyar Waris <syednwaris@gmail.com>
> Signed-off-by: William Breathitt Gray <william.gray@linaro.org>
> Link: https://lore.kernel.org/lkml/fe12eedf3666f4af5138de0e70b67a07c7f40338.1592224129.git.syednwaris@gmail.com/
> Suggested-by: Yury Norov <yury.norov@gmail.com>
> Signed-off-by: Alexander Potapenko <glider@google.com>
> ---
>  include/linux/bitmap.h | 57 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 57 insertions(+)
> 
> diff --git a/include/linux/bitmap.h b/include/linux/bitmap.h
> index 03644237e1efb..4559366084988 100644
> --- a/include/linux/bitmap.h
> +++ b/include/linux/bitmap.h
> @@ -76,7 +76,11 @@ struct device;
>   *  bitmap_to_arr32(buf, src, nbits)            Copy nbits from buf to u32[] dst
>   *  bitmap_to_arr64(buf, src, nbits)            Copy nbits from buf to u64[] dst
>   *  bitmap_get_value8(map, start)               Get 8bit value from map at start
> + *  bitmap_get_value(map, start, nbits)         Get bit value of size 'nbits'
> + *                                              from map at start
>   *  bitmap_set_value8(map, value, start)        Set 8bit value to map at start
> + *  bitmap_set_value(map, value, start, nbits)  Set bit value of size 'nbits'
> + *                                              of map at start

The 'bit value of size' sounds more confusing than it should. The size
of a bit is actually a bit... Can you rephrase? Moreover, 'set bits' has
the meaning of actually setting them, i.e. switching them to '1'. Maybe:
"Copy 'nbits' to bitmap starting at 'start'"?

>   *
>   * Note, bitmap_zero() and bitmap_fill() operate over the region of
>   * unsigned longs, that is, bits behind bitmap till the unsigned long
> @@ -583,6 +587,31 @@ static inline unsigned long bitmap_get_value8(const unsigned long *map,
>  	return (map[index] >> offset) & 0xFF;
>  }
>  
> +/**
> + * bitmap_get_value - get a value of n-bits from the memory region
> + * @map: address to the bitmap memory region
> + * @start: bit offset of the n-bit value
> + * @nbits: size of value in bits


 * @nbits: size of value in bits, up to BITS_PER_LONG

> + *
> + * Returns value of nbits located at the @start bit offset within the @map
> + * memory region.
> + */
> +static inline unsigned long bitmap_get_value(const unsigned long *map,
> +					     unsigned long start,
> +					     unsigned long nbits)
> +{
> +	const size_t index = BIT_WORD(start);
> +	const unsigned long offset = start % BITS_PER_LONG;
> +	const unsigned long space = BITS_PER_LONG - offset;
> +	unsigned long value_low, value_high;
> +
> +	if (space >= nbits)
> +		return (map[index] >> offset) & GENMASK(nbits - 1, 0);
> +	value_low = map[index] & BITMAP_FIRST_WORD_MASK(start);
> +	value_high = map[index + 1] & BITMAP_LAST_WORD_MASK(start + nbits);
> +	return (value_low >> offset) | (value_high << space);
> +}

When nbits == 0, copy-like functions shouldn't touch any memory. See how
other bitmap and find_bit functions handle it.
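
Concretely: with nbits == 0 the 'space >= nbits' branch is always taken, so
map[index] is still dereferenced, and GENMASK(nbits - 1, 0) expands with an
underflowed high bound.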

Thanks,
Yury

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 2/5] lib/test_bitmap: add tests for bitmap_{set,get}_value()
  2023-07-17 11:37 ` [PATCH v3 2/5] lib/test_bitmap: add tests for bitmap_{set,get}_value() Alexander Potapenko
  2023-07-17 13:04   ` Andy Shevchenko
@ 2023-07-17 16:11   ` Yury Norov
  2023-07-17 16:28     ` Andy Shevchenko
  2023-07-17 16:42     ` Alexander Potapenko
  1 sibling, 2 replies; 37+ messages in thread
From: Yury Norov @ 2023-07-17 16:11 UTC (permalink / raw)
  To: Alexander Potapenko
  Cc: catalin.marinas, will, pcc, andreyknvl, andriy.shevchenko, linux,
	linux-kernel, linux-arm-kernel, eugenis, syednwaris,
	william.gray

On Mon, Jul 17, 2023 at 01:37:05PM +0200, Alexander Potapenko wrote:
> Add basic tests ensuring that values can be added at arbitrary positions
> of the bitmap, including those spanning into the adjacent unsigned
> longs.
> 
> Signed-off-by: Alexander Potapenko <glider@google.com>
 
Thanks for the test!

> ---
> This patch was previously called
> "lib/test_bitmap: add tests for bitmap_{set,get}_value_unaligned"
> 
> v3:
>  - switch to using bitmap_{set,get}_value()
>  - change the expected bit pattern in test_set_get_value(),
>    as the test was incorrectly assuming 0 is the LSB.
> ---
>  lib/test_bitmap.c | 33 +++++++++++++++++++++++++++++++++
>  1 file changed, 33 insertions(+)
> 
> diff --git a/lib/test_bitmap.c b/lib/test_bitmap.c
> index 187f5b2db4cf1..c2ab54040c249 100644
> --- a/lib/test_bitmap.c
> +++ b/lib/test_bitmap.c
> @@ -71,6 +71,17 @@ __check_eq_uint(const char *srcfile, unsigned int line,
>  	return true;
>  }
>  
> +static bool __init
> +__check_eq_ulong(const char *srcfile, unsigned int line,
> +		 const unsigned long exp_ulong, unsigned long x)
> +{
> +	if (exp_ulong != x) {
> +		pr_err("[%s:%u] expected %lu, got %lu\n",
> +			srcfile, line, exp_ulong, x);
> +		return false;
> +	}
> +	return true;
> +}
>  
>  static bool __init
>  __check_eq_bitmap(const char *srcfile, unsigned int line,
> @@ -186,6 +197,7 @@ __check_eq_str(const char *srcfile, unsigned int line,
>  	})
>  
>  #define expect_eq_uint(...)		__expect_eq(uint, ##__VA_ARGS__)
> +#define expect_eq_ulong(...)		__expect_eq(ulong, ##__VA_ARGS__)
>  #define expect_eq_bitmap(...)		__expect_eq(bitmap, ##__VA_ARGS__)
>  #define expect_eq_pbl(...)		__expect_eq(pbl, ##__VA_ARGS__)
>  #define expect_eq_u32_array(...)	__expect_eq(u32_array, ##__VA_ARGS__)
> @@ -1222,6 +1234,25 @@ static void __init test_bitmap_const_eval(void)
>  	BUILD_BUG_ON(~var != ~BIT(25));
>  }
>  
> +static void __init test_set_get_value(void)
> +{
> +	DECLARE_BITMAP(bitmap, BITS_PER_LONG * 2);

It's too short. Can you make it long enough to ensure it works as
expected when start is not in the 1st word and start+nbits lands in
the following word?

> +	unsigned long val;
> +	int i;
> +
> +	for (i = 0; i < BITS_PER_LONG * 2 - 7; i++) {
> +		bitmap_zero(bitmap, BITS_PER_LONG * 2);
> +		bitmap_set_value(bitmap, 0b10101UL, i, 5);
> +		val = bitmap_get_value(bitmap, i, 5);
> +		expect_eq_ulong(0b10101UL, val);

Can you also check that the rest of the bitmap is untouched?
Something like:

	DECLARE_BITMAP(bitmap, ...);
	DECLARE_BITMAP(orig, ...);

        memset(orig, 0x5a, ...);
        memset(bitmap, 0x5a, ...);

        for (j = start; j < start + nbits; j++)
                if (val & BIT(j - start))
                        __set_bit(j, orig);
                else
                        __clear_bit(j, orig);

        bitmap_set_value(bitmap, val, start, nbits);
        expect_eq_bitmap(orig, bitmap, ...);

I like this kind of testing because it gives people a better
understanding of what happens behind all those optimization tricks.

Thanks,
Yury

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 1/5] lib/bitmap: add bitmap_{set,get}_value()
  2023-07-17 14:31         ` Andy Shevchenko
@ 2023-07-17 16:15           ` Yury Norov
  0 siblings, 0 replies; 37+ messages in thread
From: Yury Norov @ 2023-07-17 16:15 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: catalin.marinas, will, pcc, andreyknvl, linux, linux-kernel,
	linux-arm-kernel, eugenis, syednwaris, william.gray,
	Arnd Bergmann

On Mon, Jul 17, 2023 at 05:31:44PM +0300, Andy Shevchenko wrote:
> On Mon, Jul 17, 2023 at 05:29:12PM +0300, Andy Shevchenko wrote:
> > On Mon, Jul 17, 2023 at 04:14:57PM +0200, Alexander Potapenko wrote:
> 
> ...
> 
> > > > > +             map[index] &= ~(GENMASK(nbits + offset - 1, offset));
> > > >
> > > > I remember that this construction may bring horrible code on some architectures
> > > > with some version(s) of the compiler (*).
> > > 
> > > Wow, even the trunk Clang and GCC seem to generate better code for
> > > your version of this line: https://godbolt.org/z/36Kqxhe6j
> > 
> > Wow, indeed! Perhaps it's time to report this to the Clang and GCC people. I
> > believe the root cause is that in the original version the compiler can't prove
> > that 'l' is constant for GENMASK().
> > 
> > > > To fix that I found an easy refactoring:
> > > >
> > > >                 map[index] &= ~(GENMASK(nbits, 0) << offset);
> 
> It should be nbits - 1, btw. In any case, the generated code still seems better.
 
Yep.

Alexander, for the next round, can you please show the disassembly of these
functions for compile-time- and run-time-defined start and nbits?

Thanks,
Yury

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 2/5] lib/test_bitmap: add tests for bitmap_{set,get}_value()
  2023-07-17 16:11   ` Yury Norov
@ 2023-07-17 16:28     ` Andy Shevchenko
  2023-07-17 16:42     ` Alexander Potapenko
  1 sibling, 0 replies; 37+ messages in thread
From: Andy Shevchenko @ 2023-07-17 16:28 UTC (permalink / raw)
  To: Yury Norov
  Cc: Alexander Potapenko, catalin.marinas, will, pcc, andreyknvl,
	linux, linux-kernel, linux-arm-kernel, eugenis, syednwaris,
	william.gray

On Mon, Jul 17, 2023 at 09:11:50AM -0700, Yury Norov wrote:
> On Mon, Jul 17, 2023 at 01:37:05PM +0200, Alexander Potapenko wrote:

...

>                 if (val & BIT(j - start))
>                         __set_bit(j, orig);
>                 else
>                         __clear_bit(j, orig);

JFYI: __assign_bit() can be used here.
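
I.e. the loop body collapses to:

	__assign_bit(j, orig, val & BIT(j - start));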

-- 
With Best Regards,
Andy Shevchenko



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 1/5] lib/bitmap: add bitmap_{set,get}_value()
  2023-07-17 15:03         ` Andy Shevchenko
@ 2023-07-17 16:29           ` Alexander Potapenko
  0 siblings, 0 replies; 37+ messages in thread
From: Alexander Potapenko @ 2023-07-17 16:29 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: catalin.marinas, will, pcc, andreyknvl, linux, yury.norov,
	linux-kernel, linux-arm-kernel, eugenis, syednwaris,
	william.gray, Arnd Bergmann

On Mon, Jul 17, 2023 at 5:03 PM Andy Shevchenko
<andriy.shevchenko@linux.intel.com> wrote:
>
> On Mon, Jul 17, 2023 at 04:53:51PM +0200, Alexander Potapenko wrote:
>
> ...
>
> > > > I remember that this construction may bring horrible code on some architectures
> > > > with some version(s) of the compiler (*).
> > >
> > > Wow, even the trunk Clang and GCC seem to generate better code for
> > > your version of this line: https://godbolt.org/z/36Kqxhe6j
> >
> > Oh wait.
> > First, my Godbolt reproducer is incorrect, it is using sizeof(unsigned
> > long) instead of 8 * sizeof(unsigned long) for BITS_PER_LONG.
>
> Still slightly better. And note that the same GENMASK() is used at the
> beginning of the function. The compiler might actually cache it.

I think that the compiler might consider this transformation invalid
because "GENMASK(nbits - 1, 0) << offset" and "GENMASK(nbits + offset
- 1, offset)" have different values in the case "nbits + offset"
exceed 64.
Restricting the domain of bitmap_set_value() might result in better
code, but it is easier to stick to your version, which prevents the
overflow.
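
For example (numbers are mine): with offset == 8 and nbits == 60,
GENMASK(nbits + offset - 1, offset) becomes GENMASK(67, 8), whose expansion
shifts by an out-of-range amount and is undefined, while
GENMASK(59, 0) << 8 is well-defined and simply drops the top bits.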

> > > > To fix that I found an easy refactoring:
> > > >
> > > >                 map[index] &= ~(GENMASK(nbits, 0) << offset);
> > > >
> >
> > Second, the line above should probably be:
> >   map[index] &= ~(GENMASK(nbits - 1, 0) << offset);
> >
> > , right?
>
> Yes.

This "nbits vs. nbits - 1" was not detected by test_set_get_value(),
so I'll add an extra test case for it.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 2/5] lib/test_bitmap: add tests for bitmap_{set,get}_value()
  2023-07-17 16:11   ` Yury Norov
  2023-07-17 16:28     ` Andy Shevchenko
@ 2023-07-17 16:42     ` Alexander Potapenko
  1 sibling, 0 replies; 37+ messages in thread
From: Alexander Potapenko @ 2023-07-17 16:42 UTC (permalink / raw)
  To: Yury Norov
  Cc: catalin.marinas, will, pcc, andreyknvl, andriy.shevchenko, linux,
	linux-kernel, linux-arm-kernel, eugenis, syednwaris,
	william.gray

On Mon, Jul 17, 2023 at 6:11 PM Yury Norov <yury.norov@gmail.com> wrote:
>
> On Mon, Jul 17, 2023 at 01:37:05PM +0200, Alexander Potapenko wrote:
> > Add basic tests ensuring that values can be added at arbitrary positions
> > of the bitmap, including those spanning into the adjacent unsigned
> > longs.
> >
> > Signed-off-by: Alexander Potapenko <glider@google.com>
>
> Thanks for the test!
>
> > ---
> > This patch was previously called
> > "lib/test_bitmap: add tests for bitmap_{set,get}_value_unaligned"
> >
> > v3:
> >  - switch to using bitmap_{set,get}_value()
> >  - change the expected bit pattern in test_set_get_value(),
> >    as the test was incorrectly assuming 0 is the LSB.
> > ---
> >  lib/test_bitmap.c | 33 +++++++++++++++++++++++++++++++++
> >  1 file changed, 33 insertions(+)
> >
> > diff --git a/lib/test_bitmap.c b/lib/test_bitmap.c
> > index 187f5b2db4cf1..c2ab54040c249 100644
> > --- a/lib/test_bitmap.c
> > +++ b/lib/test_bitmap.c
> > @@ -71,6 +71,17 @@ __check_eq_uint(const char *srcfile, unsigned int line,
> >       return true;
> >  }
> >
> > +static bool __init
> > +__check_eq_ulong(const char *srcfile, unsigned int line,
> > +              const unsigned long exp_ulong, unsigned long x)
> > +{
> > +     if (exp_ulong != x) {
> > +             pr_err("[%s:%u] expected %lu, got %lu\n",
> > +                     srcfile, line, exp_ulong, x);
> > +             return false;
> > +     }
> > +     return true;
> > +}
> >
> >  static bool __init
> >  __check_eq_bitmap(const char *srcfile, unsigned int line,
> > @@ -186,6 +197,7 @@ __check_eq_str(const char *srcfile, unsigned int line,
> >       })
> >
> >  #define expect_eq_uint(...)          __expect_eq(uint, ##__VA_ARGS__)
> > +#define expect_eq_ulong(...)         __expect_eq(ulong, ##__VA_ARGS__)
> >  #define expect_eq_bitmap(...)                __expect_eq(bitmap, ##__VA_ARGS__)
> >  #define expect_eq_pbl(...)           __expect_eq(pbl, ##__VA_ARGS__)
> >  #define expect_eq_u32_array(...)     __expect_eq(u32_array, ##__VA_ARGS__)
> > @@ -1222,6 +1234,25 @@ static void __init test_bitmap_const_eval(void)
> >       BUILD_BUG_ON(~var != ~BIT(25));
> >  }
> >
> > +static void __init test_set_get_value(void)
> > +{
> > +     DECLARE_BITMAP(bitmap, BITS_PER_LONG * 2);
>
> It's too short. Can you make it long enough to ensure it works as
> expected when start is not in the 1st word and start+nbits lands in
> the following word?
>
> > +     unsigned long val;
> > +     int i;
> > +
> > +     for (i = 0; i < BITS_PER_LONG * 2 - 7; i++) {
> > +             bitmap_zero(bitmap, BITS_PER_LONG * 2);
> > +             bitmap_set_value(bitmap, 0b10101UL, i, 5);
> > +             val = bitmap_get_value(bitmap, i, 5);
> > +             expect_eq_ulong(0b10101UL, val);
>
> Can you also check that the rest of the bitmap is untouched?
> Something like:
>
>         DECLARE_BITMAP(bitmap, ...);
>         DECLARE_BITMAP(orig, ...);
>
>         memset(orig, 0x5a, ...);
>         memset(bitmap, 0x5a, ...);
>
>         for (j = start; j < start + nbits; j++)
>                 if (val & BIT(j - start))
>                         __set_bit(j, orig);
>                 else
>                         __clear_bit(j, orig);
>
>         bitmap_set_value(bitmap, val, start, nbits);
>         expect_eq_bitmap(orig, bitmap, ...);
>
> I like this kind of testing because it gives people a better
> understanding of what happens behind all those optimization tricks.

Will do. In fact the difference between GENMASK(n, 0) and GENMASK(n-1,
0) discussed in the other patch requires exactly this kind of testing.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 1/5] lib/bitmap: add bitmap_{set,get}_value()
  2023-07-17 15:50   ` Yury Norov
@ 2023-07-18  9:30     ` Alexander Potapenko
  2023-07-18 14:01       ` Andy Shevchenko
  0 siblings, 1 reply; 37+ messages in thread
From: Alexander Potapenko @ 2023-07-18  9:30 UTC (permalink / raw)
  To: Yury Norov
  Cc: catalin.marinas, will, pcc, andreyknvl, andriy.shevchenko, linux,
	linux-kernel, linux-arm-kernel, eugenis, syednwaris,
	william.gray, Arnd Bergmann

On Mon, Jul 17, 2023 at 5:51 PM Yury Norov <yury.norov@gmail.com> wrote:
>
> On Mon, Jul 17, 2023 at 01:37:04PM +0200, Alexander Potapenko wrote:
> > The two new functions allow setting/getting values of length up to
> > BITS_PER_LONG bits at arbitrary position in the bitmap.
> >
> > The code was taken from "bitops: Introduce the for_each_set_clump macro"
> > by Syed Nayyar Waris with a couple of minor changes:
> >  - instead of using roundup(), which adds an unnecessary dependency
> >    on <linux/math.h>, we calculate space as BITS_PER_LONG-offset;
> >  - indentation is reduced by not using else-clauses (suggested by
> >    checkpatch for bitmap_get_value())
>
> Please preserve Syed's authorship ('From' field in git log).
Done

> >   *  bitmap_set_value8(map, value, start)        Set 8bit value to map at start
> > + *  bitmap_set_value(map, value, start, nbits)  Set bit value of size 'nbits'
> > + *                                              of map at start
>
> The 'bit value of size' sounds more confusing than it should. The size
> of a bit is actually a bit... Can you rephrase?
How about "Get an nbits-sized value from map at start" and "Set an
nbits-sized value to map at start"?

> Moreover, 'set bits' has
> the meaning of actually setting them, i.e. switching them to '1'. Maybe:
> "Copy 'nbits' to bitmap starting at 'start'"?

Right now it is in line with the comment for bitmap_set_value8 (and
the names of the functions also have _set_ in them).
Shall I also change that comment?
WDYT about "Put an nbits-sized value to map at start"?


> > +/**
> > + * bitmap_get_value - get a value of n-bits from the memory region
> > + * @map: address to the bitmap memory region
> > + * @start: bit offset of the n-bit value
> > + * @nbits: size of value in bits
>
>
>  * @nbits: size of value in bits, up to BITS_PER_LONG
Ok

> > + *
> > + * Returns value of nbits located at the @start bit offset within the @map
> > + * memory region.
> > + */
> > +static inline unsigned long bitmap_get_value(const unsigned long *map,
> > +                                          unsigned long start,
> > +                                          unsigned long nbits)
> > +{
> > +     const size_t index = BIT_WORD(start);
> > +     const unsigned long offset = start % BITS_PER_LONG;
> > +     const unsigned long space = BITS_PER_LONG - offset;
> > +     unsigned long value_low, value_high;
> > +
> > +     if (space >= nbits)
> > +             return (map[index] >> offset) & GENMASK(nbits - 1, 0);
> > +     value_low = map[index] & BITMAP_FIRST_WORD_MASK(start);
> > +     value_high = map[index + 1] & BITMAP_LAST_WORD_MASK(start + nbits);
> > +     return (value_low >> offset) | (value_high << space);
> > +}
>
> When nbits == 0, copy-like functions shouldn't touch any memory. See how
> other bitmap and find_bit functions handle it.

I think this is different from what other bitmap functions do, but it
should be enough to bail out on !nbits, i.e.:

    if (!nbits)
        return 0;

You probably meant adding a __builtin_constant_p() (which is used all
over the place in bitmap.h), but:
 - the compiler won't have a problem optimizing away the code for a
constant nbits=0;
 - we anyway need a dynamic check for the case where nbits is not constant
(for both bitmap_get_value() and bitmap_set_value(), I assume).

What do you think?

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 2/5] lib/test_bitmap: add tests for bitmap_{set,get}_value()
  2023-07-17 13:04   ` Andy Shevchenko
@ 2023-07-18 10:19     ` Alexander Potapenko
  0 siblings, 0 replies; 37+ messages in thread
From: Alexander Potapenko @ 2023-07-18 10:19 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: catalin.marinas, will, pcc, andreyknvl, linux, yury.norov,
	linux-kernel, linux-arm-kernel, eugenis, syednwaris,
	william.gray

On Mon, Jul 17, 2023 at 3:04 PM Andy Shevchenko
<andriy.shevchenko@linux.intel.com> wrote:
>
> On Mon, Jul 17, 2023 at 01:37:05PM +0200, Alexander Potapenko wrote:
> > Add basic tests ensuring that values can be added at arbitrary positions
> > of the bitmap, including those spanning into the adjacent unsigned
> > longs.
>
> I always in favour of a new test case!
> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
>
> > Signed-off-by: Alexander Potapenko <glider@google.com>
> >
> > ---
> > This patch was previously called
> > "lib/test_bitmap: add tests for bitmap_{set,get}_value_unaligned"
>
> Hint, you may always just link to lore mail archive for easier access and
> handling. Also with `b4` at hand the lore link can be used to resurrect
> a discussion (in case it's needed).

Will add the link in v4
(guess you didn't want it in the final patch description, correct?)

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 5/5] arm64: mte: add compression support to mteswap.c
  2023-07-17 13:53   ` Andy Shevchenko
@ 2023-07-18 10:48     ` Alexander Potapenko
  2023-07-18 14:13       ` Andy Shevchenko
  0 siblings, 1 reply; 37+ messages in thread
From: Alexander Potapenko @ 2023-07-18 10:48 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: catalin.marinas, will, pcc, andreyknvl, linux, yury.norov,
	linux-kernel, linux-arm-kernel, eugenis, syednwaris,
	william.gray

> > +#include <linux/mm_types.h>
>
> But you don't actually use that.
>
> struct page;
>
> A forward declaration is enough.
Fair enough.
>
> > +void *_mte_alloc_and_save_tags(struct page *page);
> > +void _mte_free_saved_tags(void *tags);
> > +void _mte_restore_tags(void *tags, struct page *page);
> > +
> > +#endif // ARCH_ARM64_MM_MTESWAP_H_
>
> ...
>
> > +void _mte_free_saved_tags(void *storage)
> > +{
> > +     unsigned long handle = xa_to_value(storage);
> > +     int size;
> > +
> > +     if (!handle)
> > +             return;
>
> Perhaps
>
>         unsigned long handle;
>
>         handle = xa_to_value(storage);
>         if (!handle)
>                 return;

I don't have a strong preference and am happy to change this, but, out
of curiosity, why do you think it is better?
This pattern (calling (even non-)trivial functions when declaring
variables) is widely used across the kernel.
Or is it just for consistency with how `handle` is used in the rest of the file?


> > +void _mte_restore_tags(void *tags, struct page *page)
> > +{
>
> As per above.

Ack

> > +     if (try_page_mte_tagging(page)) {
> > +             if (!ea0_decompress(handle, tags_decomp))
> > +                     return;
> > +             mte_restore_page_tags(page_address(page), tags_decomp);
> > +             set_page_mte_tagged(page);
> > +     }
>
> I think you may drop an indentation level by
>
>         if (!try_page_mte_tagging(page))
>                 return;
>
> > +}

Ack

> ...
>
> > +void _mte_restore_tags(void *tags, struct page *page)
> > +{
> > +     if (try_page_mte_tagging(page)) {
> > +             mte_restore_page_tags(page_address(page), tags);
> > +             set_page_mte_tagged(page);
> > +     }
>
> Ditto.
Thanks!

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 1/5] lib/bitmap: add bitmap_{set,get}_value()
  2023-07-18  9:30     ` Alexander Potapenko
@ 2023-07-18 14:01       ` Andy Shevchenko
  2023-07-18 17:03         ` Yury Norov
  0 siblings, 1 reply; 37+ messages in thread
From: Andy Shevchenko @ 2023-07-18 14:01 UTC (permalink / raw)
  To: Alexander Potapenko
  Cc: Yury Norov, catalin.marinas, will, pcc, andreyknvl, linux,
	linux-kernel, linux-arm-kernel, eugenis, syednwaris,
	william.gray, Arnd Bergmann

On Tue, Jul 18, 2023 at 11:30:00AM +0200, Alexander Potapenko wrote:
> On Mon, Jul 17, 2023 at 5:51 PM Yury Norov <yury.norov@gmail.com> wrote:
> > On Mon, Jul 17, 2023 at 01:37:04PM +0200, Alexander Potapenko wrote:

...

> > When nbits == 0, copy-like functions shouldn't touch any memory. See how
> > other bitmap and find_bit functions handle it.
> 
> I think this is different from what other bitmap functions do, but it
> should be enough to bail out on !nbits, i.e.:
> 
>     if (!nbits)
>         return 0;
> 
> You probably meant adding a __builtin_constant_p() (which is used all
> over the place in bitmap.h), but:
>  - the compiler won't have a problem optimizing away the code for a
> constant nbits=0;
>  - we anyway need a dynamic check for the case where nbits is not constant
> (for both bitmap_get_value() and bitmap_set_value(), I assume).
> 
> What do you think?

The idea behind it is to eliminate the check completely for the cases where
nbits != 0. In your case the dynamic check will still be there. That's what
we want to avoid.
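
For reference, the bitmap headers already have an idiom for exactly that
(quoting from memory, so double-check the exact form):

	#define small_const_nbits(nbits) \
		(__builtin_constant_p(nbits) && (nbits) <= BITS_PER_LONG && (nbits) > 0)

It lets the compiler drop such checks entirely for compile-time constant
sizes.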

-- 
With Best Regards,
Andy Shevchenko



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 5/5] arm64: mte: add compression support to mteswap.c
  2023-07-18 10:48     ` Alexander Potapenko
@ 2023-07-18 14:13       ` Andy Shevchenko
  0 siblings, 0 replies; 37+ messages in thread
From: Andy Shevchenko @ 2023-07-18 14:13 UTC (permalink / raw)
  To: Alexander Potapenko
  Cc: catalin.marinas, will, pcc, andreyknvl, linux, yury.norov,
	linux-kernel, linux-arm-kernel, eugenis, syednwaris,
	william.gray

On Tue, Jul 18, 2023 at 12:48:00PM +0200, Alexander Potapenko wrote:

...

> > > +void _mte_free_saved_tags(void *storage)
> > > +{
> > > +     unsigned long handle = xa_to_value(storage);
> > > +     int size;
> > > +
> > > +     if (!handle)
> > > +             return;
> >
> > Perhaps
> >
> >         unsigned long handle;
> >
> >         handle = xa_to_value(storage);
> >         if (!handle)
> >                 return;
> 
> I don't have a strong preference and am happy to change this, but, out
> of curiosity, why do you think it is better?
> This pattern (calling (even non-)trivial functions when declaring
> variables) is widely used across the kernel.
> Or is it just for consistency with how `handle` is used in the rest of the file?

Ah, it's purely about maintenance and error-proneness, in case some code is
sneaked in between.

Imagine that you have


	foo = bar(x);
	...many lines that for some reason don't make one page on the screen...
	if (!foo)
		...do something...

Now, if due to an unsuccessful rebase or an inexperienced developer we get

	foo = bar(x);
	...part 1 of many lines that for some reason don't make one page on the screen...
	baz(foo);
	...part 2 of many lines that for some reason don't make one page on the screen...
	if (!foo)
		...do something...

the compiler will eliminate the check, and you've got a mine planted on a nice
minefield!

> > > +}

-- 
With Best Regards,
Andy Shevchenko



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 3/5] arm64: mte: implement CONFIG_ARM64_MTE_COMP
  2023-07-17 13:49   ` Andy Shevchenko
@ 2023-07-18 15:33     ` Alexander Potapenko
  2023-07-18 17:17       ` Andy Shevchenko
  0 siblings, 1 reply; 37+ messages in thread
From: Alexander Potapenko @ 2023-07-18 15:33 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: catalin.marinas, will, pcc, andreyknvl, linux, yury.norov,
	linux-kernel, linux-arm-kernel, eugenis, syednwaris,
	william.gray

On Mon, Jul 17, 2023 at 3:49 PM Andy Shevchenko
<andriy.shevchenko@linux.intel.com> wrote:
>
> On Mon, Jul 17, 2023 at 01:37:06PM +0200, Alexander Potapenko wrote:
> > The config implements the EA0 algorithm suggested by Evgenii Stepanov
> > to compress the memory tags for ARM MTE during swapping.
> >
> > The algorithm is based on RLE and specifically targets 128-byte buffers
> > of tags corresponding to a single page. In the common case a buffer
> > can be compressed into 63 bits, making it possible to store it without
> > additional memory allocation.
>
> ...
>
> > +config ARM64_MTE_COMP
> > +     bool "Tag compression for ARM64 MTE"
>
> At least here, make sure everybody understands what you are talking about. WTF
> is MTE?
Replaced with "Memory Tagging Extension"


> ...
>
> > +/*
>
> Did you deliberately make it NON-kernel-doc? If so, why? And why does it
> have so many similarities with the above-mentioned format?

No, just from lack of habit.

> > + * ea0_compress() - compress the given tag array.
> > + * @tags: 128-byte array to read the tags from.
> > + *
> > + * Compresses the tags and returns a 64-bit opaque handle pointing to the
> > + * tag storage. May allocate memory, which is freed by @ea0_release_handle().
> > + */
> > +unsigned long ea0_compress(u8 *tags);
> > +
> > +/*
> > + * ea0_decompress() - decompress the tag array addressed by the handle.
> > + * @handle: handle returned by @ea0_decompress()
> > + * @tags: 128-byte array to write the tags to.
> > + *
> > + * Reads the compressed data and writes it into the user-supplied tag array.
> > + * Returns true on success, false on error.
>
> In case you are going to make them real kernel-docs, make sure the kernel-doc
> validator doesn't warn.

I'll try to.
However it doesn't seem to be very picky.

  $ scripts/kernel-doc -v  -none arch/arm64/include/asm/mtecomp.h

warns about e.g. parameter name mismatch, but does not care about the
missing return section.

> Here, for example, the return section is missing. The easy
> fix is to add ':' after 'Returns'. Same for the rest of the function descriptions.
Done

> Also,
> why did you put the descriptions into the header file? It's a bit unusual for
> exported ones.

https://docs.kernel.org/doc-guide/kernel-doc.html was not specific on
this, and I thought anyone wanting to understand how an interface
works would prefer reading the header rather than the implementation.
I can move the comments to arch/arm64/mm/mtecomp.c if you think it's a
better place for them.

> > +/*
> > + * ea0_tags_to_ranges() - break @tags into arrays of tag ranges.
> > + * @tags: 128-byte array containing 256 MTE tags.
> > + * @out_tags: u8 array to store the tag of every range.
> > + * @out_sizes: u16 array to store the size of every range.
>
> u16? I don't see it.
Fixed.


> > +/*
> > + * ea0_ranges_to_tags() - fill @tags using given tag ranges.
> > + * @r_tags: u8[256] containing the tag of every range.
> > + * @r_sizes: u16[256] containing the size of every range.
>
> Ditto.
Fixed.

>
> > + * @r_len: length of @r_tags and @r_sizes.
> > + * @tags: 128-byte array to write the tags to.
> > + */
> > +void ea0_ranges_to_tags(u8 *r_tags, short *r_sizes, int r_len, u8 *tags);
>
> In both cases a signed integer may be promoted with sign extension. Is that a problem here?
Not sure if you mean r_len or r_sizes, but all those values are >= 0
and <= 256, so there should be no problems.
(shorts could have been ints as well, we are just saving some stack
space in ea0_compress()/ea0_decompress()).

> > + *
> > + * We encapsulate tag storage memory management in this module, because it is
> > + * tightly coupled with the pointer representation.
>
> > + *   ea0_compress(*tags) takes a 128-byte buffer and returns an opaque value
>
> ea0_compress() is the usual way we refer to functions. Let the tools make
> the necessary references.

Done.

>
> > + *     that can be stored in Xarray
> > + *   ea0_decompress(*ptr, *tags) takes the opaque value and loads the tags into
>
> Ditto. And so on.
Done.


> > + * 5. For the out-of-line case, the storage is allocated from one of the
> > + *    "mte-tags-{16,32,64,128}" kmem caches. The resulting pointer is aligned
> > + *    on 8 bytes, so its bits 2..0 can be used to store the size class:
>
> > + *     - 0 for 128 bytes
> > + *     - 1 for 16
> > + *     - 2 for 32
> > + *     - 4 for 64.
>
> Is this chosen deliberately (for performance)? Otherwise, why not put them in
> naturally growing exponential order?

This way pointers to uncompressed buffers do not have extra data stored in them.
This property does not have any immediate use though.


> ...
>
> > +#include <linux/bits.h>
> > +#include <linux/bitmap.h>
>
> bitmap.h guarantees that bits.h will be included.

I am following the IWYU principle here, and I believe it's a good thing to do.
I've seen cases where these transitive dependencies rotted after some
refactoring, but the fact was only noticed in certain configurations.
Also, there's ongoing work by Ingo Molnar to speed up kernel builds
by optimizing headers
(https://lwn.net/ml/linux-kernel/YdIfz+LMewetSaEB@gmail.com/), and it
relies on explicit dependencies which are easier to untangle.

>
> ...
>
> > +void ea0_tags_to_ranges(u8 *tags, u8 *out_tags, short *out_sizes, int *out_len)
> > +{
> > +     u8 prev_tag = U8_MAX;
>
> > +     int cur_idx = -1;
>
> Under which circumstances does this assignment make sense?
This value is never used to index the array; it is incremented on the
first loop iteration.

> > +     for (i = 0; i < MTE_PAGE_TAG_STORAGE; i++) {
> > +             for (j = 0; j < 2; j++) {
> > +                     cur_tag = j ? (tags[i] % 16) : (tags[i] / 16);
> > +                     if (cur_tag == prev_tag) {
> > +                             out_sizes[cur_idx]++;
>
> Who guarantees this one is not [-1]?
The fact that prev_tag is initialized to U8_MAX, whereas cur_tag <= 15.


> > +                     } else {
>
> > +                             cur_idx++;
>
> Aha, the above seems a bit prone to out-of-bounds errors.
Not really, but since you stumbled upon it, I'd better make it more readable.

> Can you make it
> unsigned and start from 0?

Changed to start with 0, but I am a bit hesitant about making it
unsigned: it is now no more special than a loop variable.

> > +void ea0_ranges_to_tags(u8 *r_tags, short *r_sizes, int r_len, u8 *tags)
> > +{
> > +     int i, j, pos = 0;
>
> Wouldn't it be more correct to have this assignment inside the first for-loop?

Do you mean setting it back to 0 on every iteration of the outer loop?
We sure don't want that, since pos is the location in the tags[] array
where the next tag is written.
If what you meant is doing "for (i = 0, pos = 0; ...)" this is a
question of preference, but I can do that.


>
> > +#define RANGES_INLINE ea0_size_to_ranges(8)
>
> Don't forget to undef it when not needed.

Ok, will do.
Shall I undef the constants above as well (e.g. BITS_PER_TAG)?
The intuitive answer is "no", but then there should be some difference
between those and RANGES_INLINE?

>
> ...
>
> > +static void bitmap_write(unsigned long *bitmap, unsigned long value,
> > +                      unsigned long *pos, unsigned long bits)
>
> Please don't use the reserved namespace. Yours is ea0, so use it:
> ea0_bitmap_write()! Same for other similarly named functions.

Done.
However, there are two parallel namespaces now: "ea0" for the
compression algorithm, and "mtecomp" for the module initialization and
data structures.
Dunno if it makes sense to merge them (and rename the .c file accordingly).

> ...
>
> > +     unsigned long bit_pos = 0, l_bits;
> > +     int largest_idx = -1, i;
> > +     short largest = 0;
>
> Here and elsewhere, please double-check the correctness and/or necessity of
> the signedness and assignments of local variables.

I replaced a bunch of ints with size_t where it made sense (basically
all array indices remain ints, but all sizes are size_t now).
Also removed the assignment to largest_idx above.


> Here
>
>                 if (largest >= sizes[i])
>                         continue;
> makes sense, but...
>
> > +                     largest = sizes[i];
> > +                     largest_idx = i;
> > +             }
> > +     }
>
> ...
>
> > +     for (i = 0; i < len; i++) {
> > +             if (i == largest_idx)
> > +                     continue;
> > +             bitmap_write(bitmap, sizes[i], &bit_pos, BITS_PER_SIZE);
>
> ...here I would do the opposite, since it's a one-liner.

Agreed.

>
> > +     }
>
> ...
>
> > +     u8 r_tags[256];
> > +     int r_len = ARRAY_SIZE(r_tags);
>
> sizeof() ?
No, it is the length of the arrays (both r_tags and r_sizes).
Below you make a good point that it would spare us a kernel.h dependency,
but to avoid making the code error-prone we probably shouldn't assume
r_tags is a byte array: ARRAY_SIZE() stays correct even if the element
type changes, while sizeof() would not.


> > +     l_bits = (max_ranges == RANGES_INLINE) ? BITS_PER_LARGEST_IDX_INLINE :
> > +                                              BITS_PER_LARGEST_IDX;
>
> Is it a dup? Perhaps a helper for this?
Do you mean it is the same in ea0_compress_to_buf() and
ea0_decompress_from_buf()?
Added a helper.
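Something like this (hypothetical name):

        /* Bits used to store the index of the largest range. */
        static unsigned long ea0_largest_idx_bits(int max_ranges)
        {
                return (max_ranges == RANGES_INLINE) ?
                       BITS_PER_LARGEST_IDX_INLINE : BITS_PER_LARGEST_IDX;
        }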

> Seems BITS_PER_TAG, BITS_PER_SIZE and the rest should also be namespaced,
> EA0_BITS_...

Done.

> ...
>
> > +bool ea0_decompress(unsigned long handle, u8 *tags)
> > +{
> > +     unsigned long *storage = ea0_storage(handle);
> > +     int size = ea0_storage_size(handle);
> > +
> > +     if (size == 128) {
> > +             memcpy(tags, storage, size);
> > +             return true;
> > +     }
> > +     if (size == 8)
> > +             return ea0_decompress_from_buf(&handle, RANGES_INLINE, tags);
>
> Maybe
>
>         switch (ea0_storage_size(handle)) {
>                 ...
>         default:
>         }
>
> ?

Sounds reasonable, done.

> > +void ea0_release_handle(unsigned long handle)
> > +{
> > +     void *storage = ea0_storage(handle);
> > +     int size = ea0_storage_size(handle);
> > +     struct kmem_cache *c;
>
> > +     if (!storage)
> > +             return;
>
> I find it slightly better for maintainability in this form:
>
>         struct kmem_cache *c;
>         void *storage;
>         int size;
>
>         storage = ea0_storage(handle);
>         if (!storage)
>                 return;
>
>         size = ea0_storage_size(handle);

Done.

> > +static int mtecomp_init(void)
> > +{
> > +     char name[16];
> > +     int size;
> > +     int i;
> > +
> > +     BUILD_BUG_ON(MTE_PAGE_TAG_STORAGE != 128);
>
> Why not static_assert()?
>
> > +     for (i = 0; i < NUM_CACHES; i++) {
> > +             size = ea0_cache_id_to_size(i);
> > +             snprintf(name, ARRAY_SIZE(name), "mte-tags-%d", size);
>
> sizeof() will work the same way without the need to have kernel.h included.

Done.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 1/5] lib/bitmap: add bitmap_{set,get}_value()
  2023-07-18 14:01       ` Andy Shevchenko
@ 2023-07-18 17:03         ` Yury Norov
  2023-07-18 17:20           ` Andy Shevchenko
  2023-07-19  9:00           ` Alexander Potapenko
  0 siblings, 2 replies; 37+ messages in thread
From: Yury Norov @ 2023-07-18 17:03 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: Alexander Potapenko, catalin.marinas, will, pcc, andreyknvl,
	linux, linux-kernel, linux-arm-kernel, eugenis, syednwaris,
	william.gray, Arnd Bergmann

On Tue, Jul 18, 2023 at 05:01:28PM +0300, Andy Shevchenko wrote:
> On Tue, Jul 18, 2023 at 11:30:00AM +0200, Alexander Potapenko wrote:
> > On Mon, Jul 17, 2023 at 5:51 PM Yury Norov <yury.norov@gmail.com> wrote:
> > > On Mon, Jul 17, 2023 at 01:37:04PM +0200, Alexander Potapenko wrote:
> 
> ...
> 
> > > When nbits == 0, copy-like functions shouldn't touch any memory. See how
> > > other bitmap and find_bit functions handle it.
> > 
> > I think this is different from what other bitmap functions do, but it
> > should be enough to bail out on !nbits, i.e.:
> > 
> >     if (!nbits)
> >         return 0;
> > 
> > You probably meant adding a __builtin_constant_p() (which is used all
> > over the place in bitmap.h), but:

No, I didn't mean that.

> >  - the compiler won't have problem optimizing away the code for a
> > constant nbits=0;

Look at your code, having nbits == 0 in mind:

       const size_t index = BIT_WORD(start);
       const unsigned long offset = start % BITS_PER_LONG;
       const unsigned long space = BITS_PER_LONG - offset;
       unsigned long value_low, value_high;

       if (space >= nbits) // This is always the case
               return (map[index] >> offset) & GENMASK(nbits - 1, 0);
       ...             ^^                      ^^
                       Unconditional fetch     Wshift-count-overflow

Thanks to the GENMASK() implementation, you'll be warned by GENMASK_INPUT_CHECK()
if nbits is a compile-time constant. If it's a runtime value, it's pure undefined
behavior, not to mention the useless, expensive and dangerous fetch.

> >  - we anyway need a dynamic check for the case nbits is not constant
> > (for both bitmap_get_value() and bitmap_set_value(), I assume).
> > 
> > What do you think?

I think that instead of speculation, it's better to cover nbits == 0
with explicit tests for both run- and compile-time. That way you're
always on the safe side.

bitmap_get_val(NULL, 0, 0) shouldn't crash the kernel.
 
> The idea behind it is to eliminate the code completely for the cases where
> nbits != 0. In your case the dynamic check will be there. That's what we
> want to avoid.

Alexander is right - we can't avoid testing against 0 if we need to
test for 0... In case of other functions we have inline and outline
implementations, controlled by small_const_nbits().

As you can see, small_const_nbits() tests against 0 explicitly,
although it's free at compile time. But if nbits == 0, we pick the
outline version of a function regardless.
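For reference, the check is essentially this (modulo where exactly it
lives in the headers):

        #define small_const_nbits(nbits) \
                (__builtin_constant_p(nbits) && (nbits) <= BITS_PER_LONG && \
                 (nbits) > 0)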

In their turn, the outline versions test nbits against 0 again,
but most of the time implicitly.

In the case of bitmap_set_val, we touch at most 2 words, and there's
no reason for an outline version, so we have to test nbits against 0
inside the inline code.

Having all that in mind, and because nbits == 0 is most likely an
error, we should follow these rules:
 - no memory must be touched, as we're potentially in an error condition
   and the pointer may be corrupted;
 - the cost of the check must be as minimal as possible.

So I suggest:

        if (unlikely(nbits == 0))
                return;

For readers that would literally mean: we don't expect that, and we find
it suspicious, but we'll handle it as correctly as we can.
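I.e., the inline helper could look along these lines (a sketch with the
masking spelled out; the exact details are up to your patch):

        static inline void bitmap_set_value(unsigned long *map,
                                            unsigned long value,
                                            unsigned long start,
                                            unsigned long nbits)
        {
                const size_t index = BIT_WORD(start);
                const unsigned long offset = start % BITS_PER_LONG;
                const unsigned long space = BITS_PER_LONG - offset;

                /* Suspicious, but handled: no fetch, no GENMASK(-1, 0). */
                if (unlikely(nbits == 0))
                        return;

                value &= GENMASK(nbits - 1, 0);

                if (space >= nbits) {
                        /* Value fits into one word. */
                        map[index] &= ~(GENMASK(nbits - 1, 0) << offset);
                        map[index] |= value << offset;
                } else {
                        /* Value spills into the next word. */
                        map[index] &= ~BITMAP_FIRST_WORD_MASK(start);
                        map[index] |= value << offset;
                        map[index + 1] &= ~BITMAP_LAST_WORD_MASK(start + nbits);
                        map[index + 1] |= value >> space;
                }
        }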

By the way, Alexander, please drop those 'const' things. Those are for
pointers or some global variables, not for inline functions with 4
lines of code. (If you think it helps the code to be safe, then no - it's
unsafe even with the consts.)

Thanks,
Yury

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 3/5] arm64: mte: implement CONFIG_ARM64_MTE_COMP
  2023-07-18 15:33     ` Alexander Potapenko
@ 2023-07-18 17:17       ` Andy Shevchenko
  2023-07-19 12:16         ` Alexander Potapenko
  0 siblings, 1 reply; 37+ messages in thread
From: Andy Shevchenko @ 2023-07-18 17:17 UTC (permalink / raw)
  To: Alexander Potapenko
  Cc: catalin.marinas, will, pcc, andreyknvl, linux, yury.norov,
	linux-kernel, linux-arm-kernel, eugenis, syednwaris,
	william.gray

On Tue, Jul 18, 2023 at 05:33:37PM +0200, Alexander Potapenko wrote:
> On Mon, Jul 17, 2023 at 3:49 PM Andy Shevchenko
> <andriy.shevchenko@linux.intel.com> wrote:
> > On Mon, Jul 17, 2023 at 01:37:06PM +0200, Alexander Potapenko wrote:

...

> However it doesn't seem to be very picky.
> 
>   $ scripts/kernel-doc -v  -none arch/arm64/include/asm/mtecomp.h
> 
> warns about e.g. parameter name mismatch, but does not care about the
> missing return section.

-Wreturn is missing

...

> > Also
> > why you put the descriptions in to the header file? It's a bit unusual for the
> > exported ones.
> 
> https://docs.kernel.org/doc-guide/kernel-doc.html was not specific on
> this, and I thought anyone wanting to understand how an interface
> works would prefer reading the header rather than the implementation.
> I can move the comments to arch/arm64/mm/mtecomp.c if you think it's a
> better place for them.

With the kernel doc in the C file you may also comment the internal ones and
generate documentation only for the exported ones. This is not possible with a header.

...

> > > +void ea0_ranges_to_tags(u8 *r_tags, short *r_sizes, int r_len, u8 *tags);
> > In both cases a signed integer may be promoted with its sign. Is it a problem here?
> Not sure if you mean r_len or r_sizes,

Mostly about the latter.

> but all those values are >= 0
> and <= 256, so there should be no problems.
> (shorts could have been ints as well, we are just saving some stack
> space in ea0_compress()/ea0_decompress()).

Then why are they signed? Please justify that.
Signedness is prone to subtle and hard-to-hunt errors due to integer promotions.

...

> > > +#include <linux/bits.h>
> > > +#include <linux/bitmap.h>
> >
> > bitmap guarantees that bits.h will be included.
> 
> I am following the IWYU principle here, and I believe it's a good thing to do.
> I've seen cases where these transitive dependencies rotted after some
> refactoring, but the fact was only noticed in certain configurations.
> Also, there's an ongoing work by Ingo Molnar to speed up kernel builds
> by optimizing headers
> (https://lwn.net/ml/linux-kernel/YdIfz+LMewetSaEB@gmail.com/), and it
> relies on explicit dependencies which are easier to untangle.

Yes, but we have some guarantees. E.g., we don't include compiler*.h
when types.h is included, because of the guarantees.

Otherwise your code misses _a lot_ of headers.

...

> > Can you make it unsigned and start from 0?
> 
> Changed to start with 0, but I am a bit hesitant about making it
> unsigned: it is now no more special than a loop variable.

Signedness is a beast in C; it always needs additional justification.

...

> > > +     int i, j, pos = 0;
> >
> > Wouldn't it be more correct to have this assignment inside the first for-loop?
> 
> Do you mean setting it back to 0 on every iteration of the outer loop?

Yes.

> We sure don't want that, since pos is the location in the tags[] array
> where the next tag is written.

OK!

...

> > > +#define RANGES_INLINE ea0_size_to_ranges(8)
> >
> > Don't forget to undef it when not needed.
> 
> Ok, will do.

> Shall I undef the constants above as well (e.g. BITS_PER_TAG)?
> The intuitive answer is "no",

Correct.

> but then there should be some difference between those and RANGES_INLINE?

Yes, one is a widely used constant and one is a _localized_ helper.

...

> > > +static void bitmap_write(unsigned long *bitmap, unsigned long value,
> > > +                      unsigned long *pos, unsigned long bits)
> >
> > Please, don't use reserved namespace. Yours is ea0, use it:
> > ea0_bitmap_write()! Same to other similarly named functions.
> 
> Done.
> However there are two parallel namespaces now: "ea0" for the
> compression algorithm, and "memcomp" for the module initialization and
> data structures.
> Dunno if it makes sense to merge them (and rename the .c file accordingly).

Your choice. My point: just prefix it with something meaningful.

...

> > > +     u8 r_tags[256];
> > > +     int r_len = ARRAY_SIZE(r_tags);
> >
> No, it is the length of the arrays (both r_tags and r_sizes).
> Below you make a good point it will spare us a kernel.h dependency,
> but for the sake of keeping the code error-proof we probably shouldn't
> assume r_tags is a byte array.

It's common practice even outside of the Linux kernel to use sizeof() on
char arrays. I don't see the point of having the ARRAY_SIZE() complication here.

...

> > > +             snprintf(name, ARRAY_SIZE(name), "mte-tags-%d", size);

You see, if you grep for similar calls, I'm pretty sure sizeof() will outnumber
ARRAY_SIZE() by about two orders of magnitude, if the latter occurs even once.


-- 
With Best Regards,
Andy Shevchenko



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 1/5] lib/bitmap: add bitmap_{set,get}_value()
  2023-07-18 17:03         ` Yury Norov
@ 2023-07-18 17:20           ` Andy Shevchenko
  2023-07-19  9:00           ` Alexander Potapenko
  1 sibling, 0 replies; 37+ messages in thread
From: Andy Shevchenko @ 2023-07-18 17:20 UTC (permalink / raw)
  To: Yury Norov
  Cc: Alexander Potapenko, catalin.marinas, will, pcc, andreyknvl,
	linux, linux-kernel, linux-arm-kernel, eugenis, syednwaris,
	william.gray, Arnd Bergmann

On Tue, Jul 18, 2023 at 10:03:25AM -0700, Yury Norov wrote:
> On Tue, Jul 18, 2023 at 05:01:28PM +0300, Andy Shevchenko wrote:
> > On Tue, Jul 18, 2023 at 11:30:00AM +0200, Alexander Potapenko wrote:

...

> > The idea behind it is to eliminate the code completely for the cases where
> > nbits != 0. In your case the dynamic check will be there. That's what we
> > want to avoid.
> 
> Alexander is right - we can't avoid testing against 0 if we need to
> test for 0... In case of other functions we have inline and outline
> implementations, controlled by small_const_nbits().
> 
> As you can see, small_const_nbits() tests against 0 explicitly,
> although it's free at compile time. But if nbits == 0, we pick the
> outline version of a function regardless.
> 
> In their turn, the outline versions test nbits against 0 again,
> but most of the time implicitly.
> 
> In the case of bitmap_set_val, we touch at most 2 words, and there's
> no reason for an outline version, so we have to test nbits against 0
> inside the inline code.
> 
> Having all that in mind, and because nbits == 0 is most likely an
> error, we should follow these rules:
>  - no memory must be touched, as we're potentially in an error condition
>    and the pointer may be corrupted;
>  - the cost of the check must be as minimal as possible.
> 
> So I suggest:
> 
>         if (unlikely(nbits == 0))
>                 return;
> 
> For readers that would literally mean: we don't expect that, and we find
> it suspicious, but we'll handle it as correctly as we can.

Okay, thank you for elaborated answer.

-- 
With Best Regards,
Andy Shevchenko



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 3/5] arm64: mte: implement CONFIG_ARM64_MTE_COMP
  2023-07-17 11:37 ` [PATCH v3 3/5] arm64: mte: implement CONFIG_ARM64_MTE_COMP Alexander Potapenko
  2023-07-17 13:49   ` Andy Shevchenko
@ 2023-07-19  6:09   ` Yury Norov
  2023-07-19 14:00     ` Alexander Potapenko
  2023-07-19 20:32     ` Evgenii Stepanov
  1 sibling, 2 replies; 37+ messages in thread
From: Yury Norov @ 2023-07-19  6:09 UTC (permalink / raw)
  To: Alexander Potapenko
  Cc: catalin.marinas, will, pcc, andreyknvl, andriy.shevchenko, linux,
	linux-kernel, linux-arm-kernel, eugenis, syednwaris,
	william.gray

On Mon, Jul 17, 2023 at 01:37:06PM +0200, Alexander Potapenko wrote:
> The config implements the EA0 algorithm suggested by Evgenii Stepanov
> to compress the memory tags for ARM MTE during swapping.
> 
> The algorithm is based on RLE and specifically targets 128-byte buffers
> of tags corresponding to a single page. In the common case a buffer
> can be compressed into 63 bits, making it possible to store it without
> additional memory allocation.
> 
> Suggested-by: Evgenii Stepanov <eugenis@google.com>
> Signed-off-by: Alexander Potapenko <glider@google.com>
> 
> ---
>  v3:
>   - Addressed comments by Andy Shevchenko:
>    - use bitmap_{set,get}_value() written by Syed Nayyar Waris
>    - switched to unsigned long everywhere (fewer casts)
>    - simplified the code, removed redundant checks
>    - dropped ea0_compress_inline()
>  - added bit size constants and helpers to access the bitmap
>  - explicitly initialize all compressed sizes in ea0_compress_to_buf()
>  - initialize all handle bits
> 
>  v2:
>   - as suggested by Yuri Norov, switched from struct bitq (which is
>     not needed anymore) to <linux/bitmap.h>
>   - add missing symbol exports
> ---
>  arch/arm64/Kconfig               |  10 +
>  arch/arm64/include/asm/mtecomp.h |  60 +++++
>  arch/arm64/mm/Makefile           |   1 +
>  arch/arm64/mm/mtecomp.c          | 406 +++++++++++++++++++++++++++++++
>  4 files changed, 477 insertions(+)
>  create mode 100644 arch/arm64/include/asm/mtecomp.h
>  create mode 100644 arch/arm64/mm/mtecomp.c
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index a2511b30d0f67..52cdc7603cf7c 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -2093,6 +2093,16 @@ config ARM64_EPAN
>  	  if the cpu does not implement the feature.
>  endmenu # "ARMv8.7 architectural features"
>  
> +config ARM64_MTE_COMP
> +	bool "Tag compression for ARM64 MTE"
> +	default y
> +	depends on ARM64_MTE
> +	help
> +	  Enable tag compression support for ARM64 MTE.
> +
> +	  128-byte tag buffers corresponding to 4K pages can be compressed using
> +	  the EA0 algorithm to save heap memory.
> +
>  config ARM64_SVE
>  	bool "ARM Scalable Vector Extension support"
>  	default y
> diff --git a/arch/arm64/include/asm/mtecomp.h b/arch/arm64/include/asm/mtecomp.h
> new file mode 100644
> index 0000000000000..0c444c0d4ac04
> --- /dev/null
> +++ b/arch/arm64/include/asm/mtecomp.h
> @@ -0,0 +1,60 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#ifndef __ASM_MTECOMP_H
> +#define __ASM_MTECOMP_H
> +
> +#include <linux/types.h>
> +
> +/*
> + * ea0_compress() - compress the given tag array.
> + * @tags: 128-byte array to read the tags from.
> + *
> + * Compresses the tags and returns a 64-bit opaque handle pointing to the
> + * tag storage. May allocate memory, which is freed by @ea0_release_handle().
> + */
> +unsigned long ea0_compress(u8 *tags);
> +
> +/*
> + * ea0_decompress() - decompress the tag array addressed by the handle.
> + * @handle: handle returned by @ea0_compress()
> + * @tags: 128-byte array to write the tags to.
> + *
> + * Reads the compressed data and writes it into the user-supplied tag array.
> + * Returns true on success, false on error.
> + */
> +bool ea0_decompress(unsigned long handle, u8 *tags);
> +
> +/*
> + * ea0_release_handle() - release the handle returned by ea0_compress().
> + * @handle: handle returned by ea0_compress().
> + */
> +void ea0_release_handle(unsigned long handle);
> +
> +/* Functions below are exported for testing purposes. */

Then declare them in a separate local header or simply in the test, but
please not in a public header.

> +
> +/*
> + * ea0_storage_size() - calculate the memory occupied by compressed tags.
> + * @handle: storage handle returned by ea0_compress.
> + */
> +int ea0_storage_size(unsigned long handle);
> +
> +/*
> + * ea0_tags_to_ranges() - break @tags into arrays of tag ranges.
> + * @tags: 128-byte array containing 256 MTE tags.
> + * @out_tags: u8 array to store the tag of every range.
> + * @out_sizes: u16 array to store the size of every range.
> + * @out_len: length of @out_tags and @out_sizes (output parameter, initially
> + *           equal to lengths of out_tags[] and out_sizes[]).
> + */
> +void ea0_tags_to_ranges(u8 *tags, u8 *out_tags, short *out_sizes, int *out_len);
> +
> +/*
> + * ea0_ranges_to_tags() - fill @tags using given tag ranges.
> + * @r_tags: u8[256] containing the tag of every range.
> + * @r_sizes: u16[256] containing the size of every range.
> + * @r_len: length of @r_tags and @r_sizes.
> + * @tags: 128-byte array to write the tags to.
> + */
> +void ea0_ranges_to_tags(u8 *r_tags, short *r_sizes, int r_len, u8 *tags);
> +
> +#endif // __ASM_MTECOMP_H
> diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
> index dbd1bc95967d0..46778f6dd83c2 100644
> --- a/arch/arm64/mm/Makefile
> +++ b/arch/arm64/mm/Makefile
> @@ -10,6 +10,7 @@ obj-$(CONFIG_TRANS_TABLE)	+= trans_pgd.o
>  obj-$(CONFIG_TRANS_TABLE)	+= trans_pgd-asm.o
>  obj-$(CONFIG_DEBUG_VIRTUAL)	+= physaddr.o
>  obj-$(CONFIG_ARM64_MTE)		+= mteswap.o
> +obj-$(CONFIG_ARM64_MTE_COMP)	+= mtecomp.o
>  KASAN_SANITIZE_physaddr.o	+= n
>  
>  obj-$(CONFIG_KASAN)		+= kasan_init.o
> diff --git a/arch/arm64/mm/mtecomp.c b/arch/arm64/mm/mtecomp.c
> new file mode 100644
> index 0000000000000..50a379c035aee
> --- /dev/null
> +++ b/arch/arm64/mm/mtecomp.c
> @@ -0,0 +1,406 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +
> +/*
> + * MTE tag compression algorithm.
> + * Proposed by Evgenii Stepanov <eugenis@google.com>
> + */
> +
> +/*
> + * EA0 stands for "Evgenii's Algorithm 0", as the initial proposal contained two
> + * compression algorithms.

This is the 4th time I see Mr. Stepanov's credentials in the patch.
I've no doubts he's a worthy gentleman, but please avoid mentioning
people in source code. Suggested-by is enough. IIRC, the rule for
that has existed for about a decade.

For the purpose of namespacing, the mte_compress/mte_decompress would
sound better.

> + *
> + * The algorithm attempts to compress a 128-byte (MTE_GRANULES_PER_PAGE / 2)
> + * array of tags into a smaller byte sequence that can be stored in a
> + * 16-, 32-, or 64-byte buffer. A special case is storing the tags inline in
> + * an 8-byte pointer.
> + *
> + * We encapsulate tag storage memory management in this module, because it is
> + * tightly coupled with the pointer representation.
> + *   ea0_compress(*tags) takes a 128-byte buffer and returns an opaque value
> + *     that can be stored in Xarray
> + *   ea0_decompress(*ptr, *tags) takes the opaque value and loads the tags into
> + *     the provided 128-byte buffer.
> + *
> + * The compression algorithm works as follows.
> + *
> + * 1. The input array of 128 bytes is transformed into tag ranges (two arrays:
> + *    @r_tags containing tag values and @r_sizes containing range lengths) by
> + *    ea0_tags_to_ranges(). Note that @r_sizes sums up to 256.
> + *
> + * 2. Depending on the number N of ranges, the following storage class is picked:
> + *            N <= 6:  8 bytes (inline case, no allocation required);
> + *       6 < N <= 11: 16 bytes
> + *      11 < N <= 23: 32 bytes
> + *      23 < N <= 46: 64 bytes
> + *      46 < N:       128 bytes (no compression will be performed)
> + *
> + * 3. The number of the largest element of @r_sizes is stored in @largest_idx.
> + *    The element itself is thrown away from @r_sizes, because it can be
> + *    reconstructed from the sum of the remaining elements. Note that now none
> + *    of the remaining @r_sizes elements is greater than 127.
> + *
> + * 4. For the inline case, the following values are stored in the 8-byte handle:
> + *       largest_idx : i4
> + *      r_tags[0..5] : i4 x 6
> + *     r_sizes[0..4] : i7 x 5
> + *    (if N is less than 6, @r_tags and @r_sizes are padded up with zero values)
> + *
> + *    Because @largest_idx is <= 5, bit 63 of the handle is always 0 (so it can
> + *    be stored in the Xarray), and bits 62..60 cannot all be 1, so it can be
> + *    distinguished from a kernel pointer.

I honestly tried to understand... For example, what does the
        'r_sizes[0..4] : i7 x 5'
mean? An array of 5 elements, 17 bits each? But that alone is greater
than the size of a pointer... Oh gosh...

Moreover, MTE tags are all 4-bits.

It seems like text without pictures is above my mental abilities. Can you
please illustrate it? For example, from #4, I (hopefully correctly)
realized that:

Inline frame format:
   0                                                    60       62  63
 +---------------------------------------------------------------------+
 |              No idea what happens here             |    Lidx    | X |
 +---------------------------------------------------------------------+
 63    : X      :  RAZ : Reserved for Xarray.
 60-62 : Lidx   : 0..5 : Largest element index. 
                     6 : Reserved
                     7 : Invalid handle

> + *
> + * 5. For the out-of-line case, the storage is allocated from one of the
> + *    "mte-tags-{16,32,64,128}" kmem caches. The resulting pointer is aligned
> + *    on 8 bytes, so its bits 2..0 can be used to store the size class:
> + *     - 0 for 128 bytes
> + *     - 1 for 16
> + *     - 2 for 32
> + *     - 3 for 64.
> + *    Bit 63 of the pointer is zeroed out, so that it can be stored in Xarray.
> + *
> + * 6. The data layout in the allocated storage is as follows:
> + *         largest_idx : i6
> + *        r_tags[0..N] : i4 x N
> + *     r_sizes[0..N-1] : i7 x (N-1)
> + *
> + * The decompression algorithm performs the steps below.
> + *
> + * 1. Decide if data is stored inline (bits 62..60 of the handle != 0b111) or
> + *    out-of-line.
> + *
> + * 2. For the inline case, treat the handle itself as the input buffer.
> + *
> + * 3. For the out-of-line case, look at bits 2..0 of the handle to understand
> + *    the input buffer length. To obtain the pointer to the input buffer, unset
> + *    bits 2..0 of the handle and set bit 63.
> + *
> + * 4. If the input buffer is 128 byte long, copy its contents to the output
> + *    buffer.
> + *
> + * 5. Otherwise, read @largest_idx, @r_tags and @r_sizes from the input buffer.
> + *    Calculate the removed largest element of @r_sizes:
> + *      largest = 256 - sum(r_sizes)
> + *    and insert it into @r_sizes at position @largest_idx.
> + *
> + * 6. While @r_sizes[i] > 0, add a 4-bit value @r_tags[i] to the output buffer
> + *    @r_sizes[i] times.
> + */
> +

Because of its size, I believe this comment is worth putting in the Docs;
moreover, we already have "Documentation/arch/arm64/memory-tagging-extension.rst".
Why don't you add an 'MTE Compression' section in there?

> +#include <linux/bits.h>
> +#include <linux/bitmap.h>
> +#include <linux/gfp.h>
> +#include <linux/module.h>
> +#include <linux/slab.h>
> +#include <linux/swab.h>
> +#include <linux/string.h>
> +#include <linux/types.h>

Nit: Alphabetic order?

Andy is right, bitmap.h includes bits.h, no need to include both. And if
your code gets broken one day, it's the bitmap maintainers' job to fix it.

> +
> +#include <asm/mtecomp.h>
> +
> +/* The handle must fit into an Xarray value. */
> +#define HANDLE_MASK GENMASK_ULL(62, 0)
> +
> +/* Out-of-line handles have 0b111 in bits 62..60. */
> +#define NOINLINE_MASK GENMASK_ULL(62, 60)
> +
> +/* Cache index is stored in the lowest pointer bits. */
> +#define CACHE_ID_MASK GENMASK_ULL(2, 0)
> +
> +/*
> + * Four separate caches to store out-of-line data:
> + *  0: mte-tags-128
> + *  1: mte-tags-16
> + *  2: mte-tags-32
> + *  3: mte-tags-64
> + */
> +#define NUM_CACHES 4
> +static struct kmem_cache *mtecomp_caches[NUM_CACHES];
> +
> +/*
> + * Sizes of compressed values.
> + */
> +#define BITS_PER_TAG 4
> +#define BITS_PER_SIZE 7
> +#define BITS_PER_LARGEST_IDX_INLINE 4
> +#define BITS_PER_LARGEST_IDX 6

But in the comment you say that the largest index in the inline frame is 3 bits long.

> +
> +/* Translate allocation size into mtecomp_caches[] index. */
> +static int ea0_size_to_cache_id(int len)

Here and everywhere, do you need signed values? If not, unsigned int.

> +{
> +	if (len < 128)
> +		return fls(len >> 4);
> +	return 0;
> +}
> +
> +/* Translate mtecomp_caches[] index into allocation size. */
> +static int ea0_cache_id_to_size(int id)
> +{
> +	if (id == 0)
> +		return 128;
> +	return 8 << id;
> +}
> +
> +/* Transform tags into tag ranges. */
> +void ea0_tags_to_ranges(u8 *tags, u8 *out_tags, short *out_sizes, int *out_len)
> +{
> +	u8 prev_tag = U8_MAX;
> +	int cur_idx = -1;
> +	u8 cur_tag;
> +	int i, j;
> +
> +	memset(out_tags, 0, array_size(*out_len, sizeof(*out_tags)));
> +	memset(out_sizes, 0, array_size(*out_len, sizeof(*out_sizes)));
> +
> +	for (i = 0; i < MTE_PAGE_TAG_STORAGE; i++) {
> +		for (j = 0; j < 2; j++) {
> +			cur_tag = j ? (tags[i] % 16) : (tags[i] / 16);
> +			if (cur_tag == prev_tag) {
> +				out_sizes[cur_idx]++;
> +			} else {
> +				cur_idx++;
> +				prev_tag = cur_tag;
> +				out_tags[cur_idx] = prev_tag;
> +				out_sizes[cur_idx] = 1;
> +			}
> +		}
> +	}
> +	*out_len = cur_idx + 1;
> +}
> +EXPORT_SYMBOL_NS(ea0_tags_to_ranges, MTECOMP);
> +
> +/* Transform tag ranges back into tags. */
> +void ea0_ranges_to_tags(u8 *r_tags, short *r_sizes, int r_len, u8 *tags)
> +{
> +	int i, j, pos = 0;
> +	u8 prev;
> +
> +	for (i = 0; i < r_len; i++) {
> +		for (j = 0; j < r_sizes[i]; j++) {
> +			if (pos % 2)
> +				tags[pos / 2] = (prev << 4) | r_tags[i];
> +			else
> +				prev = r_tags[i];
> +			pos++;
> +		}
> +	}
> +}
> +EXPORT_SYMBOL_NS(ea0_ranges_to_tags, MTECOMP);

Because I didn't understand the compressed frame format, not sure I
can understand this logic...

> +
> +/* Translate @num_ranges into the allocation size needed to hold them. */
> +static int ea0_alloc_size(int num_ranges)
> +{
> +	if (num_ranges <= 6)
> +		return 8;
> +	if (num_ranges <= 11)
> +		return 16;
> +	if (num_ranges <= 23)
> +		return 32;
> +	if (num_ranges <= 46)
> +		return 64;
> +	return 128;
> +}
> +
> +/* Translate allocation size into maximum number of ranges that it can hold. */
> +static int ea0_size_to_ranges(int size)
> +{
> +	switch (size) {
> +	case 8:
> +		return 6;
> +	case 16:
> +		return 11;
> +	case 32:
> +		return 23;
> +	case 64:
> +		return 46;
> +	default:
> +		return 0;
> +	}
> +}

I wonder if there's a math formula here? Can you explain where
those numbers come from?

> +#define RANGES_INLINE ea0_size_to_ranges(8)

Maybe

 #define RANGES_INLINE (6)

> +
> +/* Is the data stored inline in the handle itself? */
> +static bool ea0_is_inline(unsigned long handle)
> +{
> +	return (handle & NOINLINE_MASK) != NOINLINE_MASK;
> +}
> +
> +/* Get the size of the buffer backing @handle. */
> +int ea0_storage_size(unsigned long handle)
> +{
> +	if (ea0_is_inline(handle))
> +		return 8;
> +	return ea0_cache_id_to_size(handle & CACHE_ID_MASK);
> +}
> +EXPORT_SYMBOL_NS(ea0_storage_size, MTECOMP);
> +
> +static void bitmap_write(unsigned long *bitmap, unsigned long value,
> +			 unsigned long *pos, unsigned long bits)

Please don't steal prefixes. But the idea is good. For the next
iteration, let's rename bitmap_set_value() to bitmap_write()?
So that your function will be an mte_bitmap_write().

Thanks,
Yury

> +{
> +	bitmap_set_value(bitmap, value, *pos, bits);
> +	*pos += bits;
> +}
> +
> +/* Compress ranges into the buffer that can accommodate up to max_ranges. */
> +static void ea0_compress_to_buf(int len, u8 *tags, short *sizes,
> +				unsigned long *bitmap, int max_ranges)
> +{
> +	unsigned long bit_pos = 0, l_bits;
> +	int largest_idx = -1, i;
> +	short largest = 0;
> +
> +	for (i = 0; i < len; i++) {
> +		if (sizes[i] > largest) {
> +			largest = sizes[i];
> +			largest_idx = i;
> +		}
> +	}
> +	l_bits = (max_ranges == RANGES_INLINE) ? BITS_PER_LARGEST_IDX_INLINE :
> +						 BITS_PER_LARGEST_IDX;
> +	bitmap_write(bitmap, largest_idx, &bit_pos, l_bits);
> +	for (i = 0; i < len; i++)
> +		bitmap_write(bitmap, tags[i], &bit_pos, BITS_PER_TAG);
> +	for (i = len; i < max_ranges; i++)
> +		bitmap_write(bitmap, 0, &bit_pos, BITS_PER_TAG);
> +	for (i = 0; i < len; i++) {
> +		if (i == largest_idx)
> +			continue;
> +		bitmap_write(bitmap, sizes[i], &bit_pos, BITS_PER_SIZE);
> +	}
> +	for (i = len; i < max_ranges; i++)
> +		bitmap_write(bitmap, 0, &bit_pos, BITS_PER_SIZE);
> +}
> +
> +/* Compress @tags and return a handle. */
> +unsigned long ea0_compress(u8 *tags)
> +{
> +	int alloc_size, cache_id;
> +	struct kmem_cache *cache;
> +	short r_sizes[256];
> +	u8 r_tags[256];
> +	int r_len = ARRAY_SIZE(r_tags);
> +	unsigned long *storage;
> +	/*
> +	 * ea0_compress_to_buf() only initializes the bits that ea0_decompress()
> +	 * will read. But when the tags are stored in the handle itself, it must
> +	 * have all its bits initialized.
> +	 */
> +	unsigned long result = 0;
> +
> +	ea0_tags_to_ranges(tags, r_tags, r_sizes, &r_len);
> +	alloc_size = ea0_alloc_size(r_len);
> +	if (alloc_size == 8) {
> +		ea0_compress_to_buf(r_len, r_tags, r_sizes, &result,
> +				    RANGES_INLINE);
> +		return result;
> +	}
> +	cache_id = ea0_size_to_cache_id(alloc_size);
> +	cache = mtecomp_caches[cache_id];
> +	storage = kmem_cache_alloc(cache, GFP_KERNEL);
> +	if (alloc_size < 128) {
> +		/* alloc_size is always a multiple of sizeof(unsigned long). */
> +		ea0_compress_to_buf(r_len, r_tags, r_sizes, storage,
> +				    ea0_size_to_ranges(alloc_size));
> +		return ((unsigned long)storage | cache_id) & HANDLE_MASK;
> +	}
> +	memcpy(storage, tags, alloc_size);
> +	return (unsigned long)storage & HANDLE_MASK;
> +}
> +EXPORT_SYMBOL_NS(ea0_compress, MTECOMP);
> +
> +static unsigned long bitmap_read(const unsigned long *bitmap,
> +				 unsigned long *pos, unsigned long bits)
> +{
> +	unsigned long result;
> +
> +	result = bitmap_get_value(bitmap, *pos, bits);
> +	*pos += bits;
> +	return result;
> +}
> +
> +/* Decompress the contents of the given buffer into @tags. */
> +static bool ea0_decompress_from_buf(const unsigned long *bitmap, int max_ranges,
> +				    u8 *tags)
> +{
> +	int largest_idx, i;
> +	short r_sizes[46], sum = 0;
> +	u8 r_tags[46];
> +	unsigned long bit_pos = 0, l_bits;
> +
> +	l_bits = (max_ranges == RANGES_INLINE) ? BITS_PER_LARGEST_IDX_INLINE :
> +						 BITS_PER_LARGEST_IDX;
> +	largest_idx = bitmap_read(bitmap, &bit_pos, l_bits);
> +	for (i = 0; i < max_ranges; i++)
> +		r_tags[i] = bitmap_read(bitmap, &bit_pos, BITS_PER_TAG);
> +	for (i = 0; i < max_ranges; i++) {
> +		if (i == largest_idx)
> +			continue;
> +		r_sizes[i] = bitmap_read(bitmap, &bit_pos, BITS_PER_SIZE);
> +		if (!r_sizes[i]) {
> +			max_ranges = i;
> +			break;
> +		}
> +		sum += r_sizes[i];
> +	}
> +	if (sum >= 256)
> +		return false;
> +	r_sizes[largest_idx] = 256 - sum;
> +	ea0_ranges_to_tags(r_tags, r_sizes, max_ranges, tags);
> +	return true;
> +}
> +
> +/* Get pointer to the out-of-line storage from a handle. */
> +static void *ea0_storage(unsigned long handle)
> +{
> +	if (ea0_is_inline(handle))
> +		return NULL;
> +	return (void *)((handle & (~CACHE_ID_MASK)) | BIT_ULL(63));
> +}
> +
> +/* Decompress tags from the buffer referenced by @handle. */
> +bool ea0_decompress(unsigned long handle, u8 *tags)
> +{
> +	unsigned long *storage = ea0_storage(handle);
> +	int size = ea0_storage_size(handle);
> +
> +	if (size == 128) {
> +		memcpy(tags, storage, size);
> +		return true;
> +	}
> +	if (size == 8)
> +		return ea0_decompress_from_buf(&handle, RANGES_INLINE, tags);
> +	return ea0_decompress_from_buf(storage, ea0_size_to_ranges(size), tags);
> +}
> +EXPORT_SYMBOL_NS(ea0_decompress, MTECOMP);
> +
> +/* Release the memory referenced by @handle. */
> +void ea0_release_handle(unsigned long handle)
> +{
> +	void *storage = ea0_storage(handle);
> +	int size = ea0_storage_size(handle);
> +	struct kmem_cache *c;
> +
> +	if (!storage)
> +		return;
> +
> +	c = mtecomp_caches[ea0_size_to_cache_id(size)];
> +	kmem_cache_free(c, storage);
> +}
> +EXPORT_SYMBOL_NS(ea0_release_handle, MTECOMP);
> +
> +/* Set up mtecomp_caches[]. */
> +static int mtecomp_init(void)
> +{
> +	char name[16];
> +	int size;
> +	int i;
> +
> +	BUILD_BUG_ON(MTE_PAGE_TAG_STORAGE != 128);
> +	for (i = 0; i < NUM_CACHES; i++) {
> +		size = ea0_cache_id_to_size(i);
> +		snprintf(name, ARRAY_SIZE(name), "mte-tags-%d", size);
> +		mtecomp_caches[i] =
> +			kmem_cache_create(name, size, size, 0, NULL);
> +	}
> +	return 0;
> +}
> +module_init(mtecomp_init);
> -- 
> 2.41.0.255.g8b1d071c50-goog

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 1/5] lib/bitmap: add bitmap_{set,get}_value()
  2023-07-18 17:03         ` Yury Norov
  2023-07-18 17:20           ` Andy Shevchenko
@ 2023-07-19  9:00           ` Alexander Potapenko
  1 sibling, 0 replies; 37+ messages in thread
From: Alexander Potapenko @ 2023-07-19  9:00 UTC (permalink / raw)
  To: Yury Norov
  Cc: Andy Shevchenko, catalin.marinas, will, pcc, andreyknvl, linux,
	linux-kernel, linux-arm-kernel, eugenis, syednwaris,
	william.gray, Arnd Bergmann

>
> Thanks to the GENMASK() implementation, you'll be warned by GENMASK_INPUT_CHECK()
> if nbits is a compile-time constant. If it's a runtime value, it's pure undefined
> behavior, not to mention the useless, expensive and dangerous fetch.
>
> > >  - we anyway need a dynamic check for the case nbits is not constant
> > > (for both bitmap_get_value() and bitmap_set_value(), I assume).
> > >
> > > What do you think?
>
> I think that instead of speculation, it's better to cover nbits == 0
> with explicit tests for both run- and compile-time. That way you're
> always on the safe side.

You are right. I added tests for these cases.

> bitmap_get_val(NULL, 0, 0) shouldn't crash the kernel.

Haha, the compiler is smart enough to not crash the kernel in this case.
But passing zero via a volatile variable did the trick.
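Roughly like this (a sketch; the actual test has to follow the
test_bitmap.c conventions, so its exact shape may differ):

        volatile unsigned long nbits = 0;  /* defeats compile-time folding */

        /* Must neither crash nor touch the map. */
        bitmap_set_value(NULL, 0, 0, nbits);
        WARN_ON(bitmap_get_value(NULL, 0, nbits) != 0);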

>
> > The idea behind it is to eliminate the code completely for the cases where
> > nbits != 0. In your case the dynamic check will be there. That's what we
> > want to avoid.
>
> Alexander is right - we can't avoid testing against 0 if we need to
> test for 0... In case of other functions we have inline and outline
> implementations, controlled by small_const_nbits().
>
> As you can see, small_const_nbits() tests against 0 explicitly,
> although it's free at compile time. But if nbits == 0, we pick the
> outline version of a function regardless.
>
> In their turn, the outline versions test nbits against 0 again,
> but most of the time implicitly.
>
> In the case of bitmap_set_val, we touch at most 2 words, and there's
> no reason for an outline version, so we have to test nbits against 0
> inside the inline code.
>
> Having all that in mind, and because nbits == 0 is most likely an
> error, we should follow these rules:
>  - no memory must be touched, as we're potentially in an error condition
>    and the pointer may be corrupted;
>  - the cost of the check must be as minimal as possible.
>
> So I suggest:
>
>         if (unlikely(nbits == 0))
>                 return;

Sounds good, I'll add unlikely() around the check.
Thanks for the explanation!

>
> For readers that would literally mean: we don't expect that, and we find
> it suspicious, but we'll handle it as correctly as we can.
>
> By the way, Alexander, please drop those 'const' things. Those are for
> pointers or some global variables, not for inline functions with 4
> lines of code. (If you think it helps the code to be safe, then no - it's
> unsafe even with the consts.)

These consts are from the original Syed's patch and were probably
added for consistency with bitmap_{set,get}_value8().
But, okay, I'll remove them.




--
Alexander Potapenko
Software Engineer

Google Germany GmbH
Erika-Mann-Straße, 33
80636 München

Geschäftsführer: Paul Manicle, Liana Sebastian
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 3/5] arm64: mte: implement CONFIG_ARM64_MTE_COMP
  2023-07-18 17:17       ` Andy Shevchenko
@ 2023-07-19 12:16         ` Alexander Potapenko
  0 siblings, 0 replies; 37+ messages in thread
From: Alexander Potapenko @ 2023-07-19 12:16 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: catalin.marinas, will, pcc, andreyknvl, linux, yury.norov,
	linux-kernel, linux-arm-kernel, eugenis, syednwaris,
	william.gray

On Tue, Jul 18, 2023 at 7:18 PM Andy Shevchenko
<andriy.shevchenko@linux.intel.com> wrote:

> > However it doesn't seem to be very picky.
> >
> >   $ scripts/kernel-doc -v  -none arch/arm64/include/asm/mtecomp.h
> >
> > warns about e.g. parameter name mismatch, but does not care about the
> > missing return section.
>
> -Wreturn is missing

Nice, adding it (or -Wall) uncovered more problems.

> ...
>
> > > Also
> > > why you put the descriptions in to the header file? It's a bit unusual for the
> > > exported ones.
> >
> > https://docs.kernel.org/doc-guide/kernel-doc.html was not specific on
> > this, and I thought anyone wanting to understand how an interface
> > works would prefer reading the header rather than the implementation.
> > I can move the comments to arch/arm64/mm/mtecomp.c if you think it's a
> > better place for them.
>
> With the kernel doc in the C file you may also comment the internal ones and
> generate documentation only for the exported ones. This is not possible with a header.

I see.
After some hesitation and looking at the existing practices in the
kernel, I moved the kernel doc comments to mtecomp.c



> > but all those values are >= 0
> > and <= 256, so there should be no problems.
> > (shorts could have been ints as well, we are just saving some stack
> > space in ea0_compress()/ea0_decompress()).
>
> Then why are they signed? Please justify that.
> Signedness is prone to subtle and hard-to-hunt errors due to integer promotions.

Switched to unsigned shorts.

> ...
>
> > > > +#include <linux/bits.h>
> > > > +#include <linux/bitmap.h>
> > >
> > > bitmap guarantees that bits.h will be included.
> >
> > I am following the IWYU principle here, and I believe it's a good thing to do.
> > I've seen cases where these transitive dependencies rotted after some
> > refactoring, but the fact was only noticed in certain configurations.
> > Also, there's an ongoing work by Ingo Molnar to speed up kernel builds
> > by optimizing headers
> > (https://lwn.net/ml/linux-kernel/YdIfz+LMewetSaEB@gmail.com/), and it
> > relies on explicit dependencies which are easier to untangle.
>
> Yes, but we have some guarantees. E.g., we don't include compiler*.h
> when types.h is included, because of the guarantees.
Ok, I removed bits.h

> Otherwise your code misses _a lot_ of headers.

... and tried to expand the list of headers where applicable.

>
> ...
>
> > > Can you make it unsigned and start from 0?
> >
> > Changed to start with 0, but I am a bit hesitant about making it
> > unsigned: it is now no more special than a loop variable.
>
> Signedness is a beast in C; it always needs additional justification.

Changed all ints to unsigned, because, well, why not :)


> > Shall I undef the constants above as well (e.g. BITS_PER_TAG)?
> > The intuitive answer is "no",
>
> Correct.
>
> > but then there should be some difference between those and RANGES_INLINE?
>
> Yes, one is a widely used constant and one is a _localized_ helper.

Ack

>
> ...
>
> > > > +static void bitmap_write(unsigned long *bitmap, unsigned long value,
> > > > +                      unsigned long *pos, unsigned long bits)
> > >
> > > Please, don't use reserved namespace. Yours is ea0, use it:
> > > ea0_bitmap_write()! Same to other similarly named functions.
> >
> > Done.
> > However there are two parallel namespaces now: "ea0" for the
> > compression algorithm, and "mtecomp" for the module initialization and
> > data structures.
> > Dunno if it makes sense to merge them (and rename the .c file accordingly).
>
> Your choice. My point: just prefix it with something meaningful.
Fine, I'll go with "ea0" for the algorithm and "mtecomp" for the
module-specific stuff.
Let's also see what ARM maintainers say.

> ...
>
> > > > +     u8 r_tags[256];
> > > > +     int r_len = ARRAY_SIZE(r_tags);
> > >
> > No, it is the length of the arrays (both r_tags and r_sizes).
> > Below you make a good point it will spare us a kernel.h dependency,
> > but for the sake of keeping the code error-prone we probably shouldn't
> > assume r_tags is a byte array.
>
> It's common practice even outside of the Linux kernel to use sizeof() on
> char arrays. I don't see the point of having the ARRAY_SIZE() complication here.

Fixed.

> ...
>
> > > > +             snprintf(name, ARRAY_SIZE(name), "mte-tags-%d", size);
>
> You see, if you grep for similar calls, I'm pretty sure sizeof() will outnumber
> ARRAY_SIZE() by about two orders of magnitude, if the latter occurs even once.

You are right, it is 3700+ vs. 66 :)
Changed ARRAY_SIZE() to sizeof() here as well.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 3/5] arm64: mte: implement CONFIG_ARM64_MTE_COMP
  2023-07-19  6:09   ` Yury Norov
@ 2023-07-19 14:00     ` Alexander Potapenko
  2023-07-19 21:06       ` Yury Norov
  2023-07-19 20:32     ` Evgenii Stepanov
  1 sibling, 1 reply; 37+ messages in thread
From: Alexander Potapenko @ 2023-07-19 14:00 UTC (permalink / raw)
  To: Yury Norov
  Cc: catalin.marinas, will, pcc, andreyknvl, andriy.shevchenko, linux,
	linux-kernel, linux-arm-kernel, eugenis, syednwaris,
	william.gray

On Wed, Jul 19, 2023 at 8:09 AM Yury Norov <yury.norov@gmail.com> wrote:
...
> > +
> > +/* Functions below are exported for testing purposes. */
>
> Then declare them in a separate local header or simply in the test, but
> please not in a public header.

Fair enough, moved them to the test.

> > +
> > +/*
> > + * EA0 stands for "Evgenii's Algorithm 0", as the initial proposal contained two
> > + * compression algorithms.
>
> This is the 4th time I see mr. Stepanov's credentials in the patch.
> I've no doubts he's a worthy gentleman but please avoid mentioning
> people in source code. Suggested-by is enough. IIRC, the rule for
> that exists for about decade.
>
> For the purpose of namespacing, the mte_compress/mte_decompress would
> sound better.

This indeed makes sense; "EA0" is not widely recognizable, and we are
unlikely to have more than one MTE compression algorithm anyway.
I changed "ea0" to "mte" in the patch series.



> > + *
> > + * 4. For the inline case, the following values are stored in the 8-byte handle:
> > + *       largest_idx : i4
> > + *      r_tags[0..5] : i4 x 6
> > + *     r_sizes[0..4] : i7 x 5
> > + *    (if N is less than 6, @r_tags and @r_sizes are padded up with zero values)
> > + *
> > + *    Because @largest_idx is <= 5, bit 63 of the handle is always 0 (so it can
> > + *    be stored in the Xarray), and bits 62..60 cannot all be 1, so it can be
> > + *    distinguished from a kernel pointer.
>
> I honestly tried to understand... For example, what does the
>         'r_sizes[0..4] : i7 x 5'
> mean? An array of 5 elements, 17 bits each? But that alone is greater
> than the size of a pointer... Oh gosh...

iN (note that it is a small i, not a 1) is LLVM notation for an N-bit
integer type.
There's no such thing in C syntax, and describing the data layout
using bitfields would be painful.
Would it help if I add a comment explaining that iN is an N-bit
integer? Or do you prefer something like

  r_sizes[0..4] : 5 x 7 bits

?
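Concretely, the inline case packs 4 + 6*4 + 5*7 = 63 bits in the order
the compressor writes them. A sketch (assuming r_sizes already has the
largest element dropped):

        unsigned long handle = 0, pos = 0;
        int i;

        bitmap_write(&handle, largest_idx, &pos, 4); /* largest_idx  : i4     */
        for (i = 0; i < 6; i++)                      /* r_tags[0..5] : i4 x 6 */
                bitmap_write(&handle, r_tags[i], &pos, 4);
        for (i = 0; i < 5; i++)                      /* r_sizes[0..4]: i7 x 5 */
                bitmap_write(&handle, r_sizes[i], &pos, 7);
        /* pos == 63 here, so bit 63 stays 0 and the handle fits an Xarray value. */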

>
> Moreover, MTE tags are all 4-bits.
>
> It seems like text without pictures is above my mental abilities. Can you
> please illustrate it? For example, from the #4, I (hopefully correctly)
> realized that:
>
> Inline frame format:
>    0                                                    60       62  63
>  +---------------------------------------------------------------------+
I think it's more natural to number bits from 63 to 0

>  |              No idea what happens here             |    Lidx    | X |
>  +---------------------------------------------------------------------+
>  63    : X      :  RAZ : Reserved for Xarray.
>  60-62 : Lidx   : 0..5 : Largest element index.

There's some mismatch now between this picture, where Lidx is i3, and
the implementation that treats it as an i4, merging bit 63 into it.
Maybe we can avoid this by not splitting off bit 63?

>                      6 : Reserved
>                      7 : Invalid handle

No, 7 means "treat this handle as an out-of-line one". It is still
valid, but instead of tag data it contains a pointer.

>
> Because of its size, I believe this comment is worth putting in the Docs;
> moreover, we already have "Documentation/arch/arm64/memory-tagging-extension.rst".
> Why don't you add an 'MTE Compression' section in there?

Good idea, will do!

>
> > +#include <linux/bits.h>
> > +#include <linux/bitmap.h>
> > +#include <linux/gfp.h>
> > +#include <linux/module.h>
> > +#include <linux/slab.h>
> > +#include <linux/swab.h>
> > +#include <linux/string.h>
> > +#include <linux/types.h>
>
> Nit: Alphabetic order?
Ack.

>
> Andy is right, bitmap.h includes bits.h, no need to include both. And if
> your code gets broken one day, it's the bitmap maintainers' job to fix it.
Ack.


> > +
> > +/*
> > + * Sizes of compressed values.
> > + */
> > +#define BITS_PER_TAG 4
> > +#define BITS_PER_SIZE 7
> > +#define BITS_PER_LARGEST_IDX_INLINE 4
> > +#define BITS_PER_LARGEST_IDX 6
>
> But in the comment you say that the largest index in the inline frame is 3 bits long.

The comment says "i4" (see my comments above)

> > +
> > +/* Translate allocation size into mtecomp_caches[] index. */
> > +static int ea0_size_to_cache_id(int len)
>
> Here and everywhere, do you need signed values? If not, unsigned int.

Done as part of fixing Andy's comments

> > +
> > +/* Transform tag ranges back into tags. */
> > +void ea0_ranges_to_tags(u8 *r_tags, short *r_sizes, int r_len, u8 *tags)
> > +{
> > +     int i, j, pos = 0;
> > +     u8 prev;
> > +
> > +     for (i = 0; i < r_len; i++) {
> > +             for (j = 0; j < r_sizes[i]; j++) {
> > +                     if (pos % 2)
> > +                             tags[pos / 2] = (prev << 4) | r_tags[i];
> > +                     else
> > +                             prev = r_tags[i];
> > +                     pos++;
> > +             }
> > +     }
> > +}
> > +EXPORT_SYMBOL_NS(ea0_ranges_to_tags, MTECOMP);
>
> Because I didn't understand the compressed frame format, not sure I
> can understand this logic...
Hope the above comments will help, if not - please let me know.

> > +
> > +/* Translate @num_ranges into the allocation size needed to hold them. */
> > +static int ea0_alloc_size(int num_ranges)
> > +{
> > +     if (num_ranges <= 6)
> > +             return 8;
> > +     if (num_ranges <= 11)
> > +             return 16;
> > +     if (num_ranges <= 23)
> > +             return 32;
> > +     if (num_ranges <= 46)
> > +             return 64;
> > +     return 128;
> > +}
> > +
> > +/* Translate allocation size into maximum number of ranges that it can hold. */
> > +static int ea0_size_to_ranges(int size)
> > +{
> > +     switch (size) {
> > +     case 8:
> > +             return 6;
> > +     case 16:
> > +             return 11;
> > +     case 32:
> > +             return 23;
> > +     case 64:
> > +             return 46;
> > +     default:
> > +             return 0;
> > +     }
> > +}
>
> > I wonder if there's a math formula here? Can you explain where
> > those numbers come from?

I'll add a comment there.
Basically, for the inline case it is the biggest N such that 4 + 4*N +
7*(N-1) <= 63
(not 64, because Xarray steals one bit from us)

For the out-of-line case it is the biggest N such that 6 + 4*N + 7*(N-1)
<= array size in bits (128, 256, or 512).
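So the per-size tables are just precomputed solutions of that inequality;
a hypothetical helper (not part of the patch) would be:

        /* Biggest N such that hdr + N * 4 + (N - 1) * 7 <= capacity. */
        static unsigned int max_ranges(unsigned int capacity, unsigned int hdr)
        {
                return (capacity - hdr + 7) / (4 + 7);
        }

        /*
         * max_ranges(63, 4)  == 6  (inline)
         * max_ranges(128, 6) == 11
         * max_ranges(256, 6) == 23
         * max_ranges(512, 6) == 46
         */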

>
> > +#define RANGES_INLINE ea0_size_to_ranges(8)
>
> Maybe
>
>  #define RANGES_INLINE (6)

Fair enough.


> > +
> > +static void bitmap_write(unsigned long *bitmap, unsigned long value,
> > +                      unsigned long *pos, unsigned long bits)
>
> Please don't steal prefixes. But the idea is good. For the next
> iteration, let's rename bitmap_set_value() to bitmap_write()?
> So that your function will be an mte_bitmap_write().

I don't object, but it diverges from bitmap_set_value8() now.
Will do.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 3/5] arm64: mte: implement CONFIG_ARM64_MTE_COMP
  2023-07-19  6:09   ` Yury Norov
  2023-07-19 14:00     ` Alexander Potapenko
@ 2023-07-19 20:32     ` Evgenii Stepanov
  1 sibling, 0 replies; 37+ messages in thread
From: Evgenii Stepanov @ 2023-07-19 20:32 UTC (permalink / raw)
  To: Yury Norov
  Cc: Alexander Potapenko, catalin.marinas, will, pcc, andreyknvl,
	andriy.shevchenko, linux, linux-kernel, linux-arm-kernel,
	syednwaris, william.gray

On Tue, Jul 18, 2023 at 11:09 PM Yury Norov <yury.norov@gmail.com> wrote:
> This is the 4th time I see Mr. Stepanov's credentials in the patch.
> I've no doubts he's a worthy gentleman, but please avoid mentioning
> people in source code. Suggested-by is enough. IIRC, the rule for
> that has existed for about a decade.

Thank you for your kind words :)
Indeed, Suggested-by is more than enough.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 3/5] arm64: mte: implement CONFIG_ARM64_MTE_COMP
  2023-07-19 14:00     ` Alexander Potapenko
@ 2023-07-19 21:06       ` Yury Norov
  2023-07-20 12:00         ` Alexander Potapenko
  0 siblings, 1 reply; 37+ messages in thread
From: Yury Norov @ 2023-07-19 21:06 UTC (permalink / raw)
  To: Alexander Potapenko
  Cc: catalin.marinas, will, pcc, andreyknvl, andriy.shevchenko, linux,
	linux-kernel, linux-arm-kernel, eugenis, syednwaris,
	william.gray

On Wed, Jul 19, 2023 at 04:00:14PM +0200, Alexander Potapenko wrote:
> > > + * 4. For the inline case, the following values are stored in the 8-byte handle:
> > > + *       largest_idx : i4
> > > + *      r_tags[0..5] : i4 x 6
> > > + *     r_sizes[0..4] : i7 x 5
> > > + *    (if N is less than 6, @r_tags and @r_sizes are padded up with zero values)
> > > + *
> > > + *    Because @largest_idx is <= 5, bit 63 of the handle is always 0 (so it can
> > > + *    be stored in the Xarray), and bits 62..60 cannot all be 1, so it can be
> > > + *    distinguished from a kernel pointer.
> >
> > I honestly tried to understand... For example, what does the
> >         'r_sizes[0..4] : i7 x 5'
> > mean? An array of 5 elements, 17 bits each? But that alone is greater
> > than the size of a pointer... Oh gosh...
> 
> iN (note that it is a small i, not a 1) is LLVM notation for an N-bit
> integer type.
> There's no such thing in C syntax, and describing the data layout
> using bitfields would be painful.
> Would it help if I add a comment explaining that iN is an N-bit
> integer? Or do you prefer something like
> 
>   r_sizes[0..4] : 5 x 7 bits
> 
> ?

Yes, that would help.
 
> >
> > Moreover, MTE tags are all 4-bits.
> >
> > It seems like text without pictures is above my mental abilities. Can you
> > please illustrate it? For example, from #4, I (hopefully correctly)
> > realized that:
> >
> > Inline frame format:
> >    0                                                    60       62  63
> >  +---------------------------------------------------------------------+
> I think it's more natural to number bits from 63 to 0
> 
> >  |              No idea what happens here             |    Lidx    | X |
> >  +---------------------------------------------------------------------+
> >  63    : X      :  RAZ : Reserved for Xarray.
> >  60-62 : Lidx   : 0..5 : Largest element index.
> 
> There's some mismatch now between this picture, where Lidx is i3, and
> the implementation that treats it as an i4, merging bit 63 into it.
> Maybe we can avoid this by not splitting off bit 63?
> 
> >                      6 : Reserved
> >                      7 : Invalid handle
> 
> No, 7 means "treat this handle as an out-of-line one". It is still
> valid, but instead of tag data it contains a pointer.

So, it's an invalid combination for an _inline_ handle, right? Anyways, I'm
waiting for the updated docs; hopefully they will bring some light.

> > > +
> > > +/* Transform tag ranges back into tags. */
> > > +void ea0_ranges_to_tags(u8 *r_tags, short *r_sizes, int r_len, u8 *tags)
> > > +{
> > > +     int i, j, pos = 0;
> > > +     u8 prev;
> > > +
> > > +     for (i = 0; i < r_len; i++) {
> > > +             for (j = 0; j < r_sizes[i]; j++) {
> > > +                     if (pos % 2)
> > > +                             tags[pos / 2] = (prev << 4) | r_tags[i];
> > > +                     else
> > > +                             prev = r_tags[i];
> > > +                     pos++;

This code flushes tags at every 2nd iteration. Is it true that
there's always an even number of iterations, i.e. that r_sizes is always
even, assuming r_len can be 1?

If not, it's possible to lose a tag; consider r_len == 1 and
r_sizes[0] == 1.

If yes, you can simplify:

     for (i = 0; i < r_len; i++)
             for (j = 0; j < r_sizes[i] / 2; j++)
                     tags[pos++] = (r_tags[i] << 4) | r_tags[i];

Anyways, in the test can you run all possible combinations?

> > > +             }
> > > +     }
> > > +}
> > > +EXPORT_SYMBOL_NS(ea0_ranges_to_tags, MTECOMP);
> >
> > Because I didn't understand the compressed frame format, not sure I
> > can understand this logic...
> Hope the above comments will help, if not - please let me know.
> 
> > > +
> > > +/* Translate @num_ranges into the allocation size needed to hold them. */
> > > +static int ea0_alloc_size(int num_ranges)
> > > +{
> > > +     if (num_ranges <= 6)
> > > +             return 8;
> > > +     if (num_ranges <= 11)
> > > +             return 16;
> > > +     if (num_ranges <= 23)
> > > +             return 32;
> > > +     if (num_ranges <= 46)
> > > +             return 64;
> > > +     return 128;
> > > +}
> > > +
> > > +/* Translate allocation size into maximum number of ranges that it can hold. */
> > > +static int ea0_size_to_ranges(int size)
> > > +{
> > > +     switch (size) {
> > > +     case 8:
> > > +             return 6;
> > > +     case 16:
> > > +             return 11;
> > > +     case 32:
> > > +             return 23;
> > > +     case 64:
> > > +             return 46;
> > > +     default:
> > > +             return 0;
> > > +     }
> > > +}
> >
> > I wonder if there's a math formula here? Can you explain where from
> > those numbers come?
> 
> I'll add a comment there.
> Basically, for the inline case it is the biggest N such that 4 + 4*N +
> 7*(N-1) <= 63
>
> (not 64, because Xarray steals one bit from us)
> 
> For the out-of-line case it is the biggest N such that 6 + 4*N + 7*(N-1)
> <= array size in bits (128, 256, or 512).

So it's: N <= (BIT_CAPACITY - HDRSZ + SZ) / (TAGSZ + SZ)

Doesn't look like rocket science...

Why don't you code the formula instead of the results? Are there any
difficulties? Things may change. For example, in the next spec they
may bring 3- or 5-bit tags, and your compressor will break loudly
with the hardcoded numbers.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 3/5] arm64: mte: implement CONFIG_ARM64_MTE_COMP
  2023-07-19 21:06       ` Yury Norov
@ 2023-07-20 12:00         ` Alexander Potapenko
  0 siblings, 0 replies; 37+ messages in thread
From: Alexander Potapenko @ 2023-07-20 12:00 UTC (permalink / raw)
  To: Yury Norov
  Cc: catalin.marinas, will, pcc, andreyknvl, andriy.shevchenko, linux,
	linux-kernel, linux-arm-kernel, eugenis, syednwaris,
	william.gray

On Wed, Jul 19, 2023 at 11:06 PM Yury Norov <yury.norov@gmail.com> wrote:
>
> > Would it help if I add a comment explaining that iN is an N-bit
> > integer? Or do you prefer something like
> >
> >   r_sizes[0..4] : 5 x 7 bits
> >
> > ?
>
> Yes, that would help.
Ok.


> > >                      7 : Invalid handler
> >
> > No, 7 means "treat this handle as an out-of-line one". It is still
> > valid, but instead of tag data it contains a pointer.
>
> So it's an invalid combination for an _inline_ handler, right?
Yes, that's right.

> Anyway, I'm
> waiting for the updated docs; hopefully they will shed some light on this.
It's on the way.

> > > > +
> > > > +/* Transform tag ranges back into tags. */
> > > > +void ea0_ranges_to_tags(u8 *r_tags, short *r_sizes, int r_len, u8 *tags)
> > > > +{
> > > > +     int i, j, pos = 0;
> > > > +     u8 prev;
> > > > +
> > > > +     for (i = 0; i < r_len; i++) {
> > > > +             for (j = 0; j < r_sizes[i]; j++) {
> > > > +                     if (pos % 2)
> > > > +                             tags[pos / 2] = (prev << 4) | r_tags[i];
> > > > +                     else
> > > > +                             prev = r_tags[i];
> > > > +                     pos++;
>
> This code flushes a tag byte on every 2nd iteration. Is it true that
> there's always an even number of iterations, i.e. that the r_sizes
> values are always even, given that r_len can be as small as 1?
>
> If not, it's possible to lose a tag: consider r_len == 1 and
> r_sizes[0] == 1.

By design, ea0_ranges_to_tags() only accepts ranges that sum up to 256
(as produced by ea0_tags_to_ranges()). We could add a check for that,
but given that it is an internal helper, a comment should suffice.

But r_sizes[i] does not have to be even (otherwise we could've used
this fact for a more efficient compression).
For example, if the array of tags is {0xab, 0x0, 0x0, 0x0...}, then
r_len = 3, r_sizes[] = {1, 1, 254}, r_tags = {0x0a, 0x0b, 0x0}.
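
The check mentioned above could look like the following minimal sketch
(ea0_ranges_are_valid() is a hypothetical name; 256 is the number of
4-bit tags covering a 4K page):

	/* Verify the invariant that the range sizes sum up to 256 tags. */
	static bool ea0_ranges_are_valid(const short *r_sizes, int r_len)
	{
		int i, sum = 0;

		for (i = 0; i < r_len; i++)
			sum += r_sizes[i];
		return sum == 256;
	}

An even total is also what guarantees that the nibble-packing loop in
ea0_ranges_to_tags() never ends with a half-written byte.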

>
> If yes, you can simplify:
>
>      for (i = 0; i < r_len; i++)
>              for (j = 0; j < r_sizes[i]; j += 2)
>                      tags[pos++] = (r_tags[i] << 4) | r_tags[i];
>
> Anyway, can you run all possible combinations in the test?

I will extract code from compress_range_helper() to generate different
numbers of ranges and test against those.
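
Roughly along these lines (a sketch of hypothetical test code: only the
ea0_ranges_to_tags() signature is taken from the patch, and a bare
WARN_ON() stands in for whatever assertion the test harness provides):

	/* Compare ea0_ranges_to_tags() against a trivial reference expansion. */
	static void check_ranges_to_tags(u8 *r_tags, short *r_sizes, int r_len)
	{
		u8 expect[128] = {}, got[128] = {};
		int i, j, pos = 0;

		for (i = 0; i < r_len; i++)
			for (j = 0; j < r_sizes[i]; j++, pos++)
				expect[pos / 2] |= (pos % 2) ?
					r_tags[i] : r_tags[i] << 4;

		ea0_ranges_to_tags(r_tags, r_sizes, r_len, got);
		WARN_ON(memcmp(expect, got, sizeof(expect)));
	}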


> > > > +/* Translate allocation size into maximum number of ranges that it can hold. */
> > > > +static int ea0_size_to_ranges(int size)
> > > > +{
> > > > +     switch (size) {
> > > > +     case 8:
> > > > +             return 6;
> > > > +     case 16:
> > > > +             return 11;
> > > > +     case 32:
> > > > +             return 23;
> > > > +     case 64:
> > > > +             return 46;
> > > > +     default:
> > > > +             return 0;
> > > > +     }
> > > > +}
> > >
> > > I wonder if there's a math formula here? Can you explain where
> > > those numbers come from?
> >
> > I'll add a comment there.
> > Basically, for the inline case it is the biggest N such that
> > 4 + 4*N + 7*(N-1) <= 63
> >
> > (not 64, because Xarray steals one bit from us)
> >
> > For the out-of-line case it is the biggest N such that
> > 6 + 4*N + 7*(N-1) <= the array size in bits (128, 256, or 512).
>
> So it's: N <= (BIT_CAPACITY - HDR + SZ) / (TAGSZ + SZ), where HDR is
> the 4- or 6-bit header, TAGSZ = 4 and SZ = 7.
>
> Doesn't look like rocket science...

Note that the width of largest_idx is picked based on the largest N for
64-byte buffers.
And the width of the r_sizes[] entries depends on the number of tags
that fit into 128 bytes: once we set aside the largest range, each of
the remaining ones fits into 7 bits.
But this will not hold for e.g. 3-bit tags (we'd need 8-bit sizes then).

Changing mte_size_to_ranges() to "(size * 8 + MTE_BITS_PER_SIZE -
largest_bits) / (MTE_BITS_PER_TAG + MTE_BITS_PER_SIZE)" indeed looks
better than hardcoded numbers, but we should keep in mind that the
parameters depend on each other and cannot be easily changed.
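
For illustration, a formula-based version could look like this sketch
(largest_bits is assumed to be 4 for the inline case and 6 otherwise;
the constant names mirror the expression quoted above):

	#define MTE_BITS_PER_TAG  4
	#define MTE_BITS_PER_SIZE 7

	/* Derive the range capacity from the buffer size in bytes. */
	static int mte_size_to_ranges(int size)
	{
		int largest_bits = (size == 8) ? 4 : 6;

		return (size * 8 + MTE_BITS_PER_SIZE - largest_bits) /
		       (MTE_BITS_PER_TAG + MTE_BITS_PER_SIZE);
	}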

>
> Why don't you code the formula instead of the results? Are there any
> difficulties? Things may change. For example, the next spec revision
> may bring 3- or 5-bit tags, and your compressor will break loudly
> with its hardcoded numbers.

The main reason for the compressor to break loudly would be the
hardcoded assumption that each tag consists of 4 bits.
I don't think it is practical to plan for that number changing (by
e.g. using bitmap_read/bitmap_write to convert tags into ranges) - we'd
rather have a static assert for that.
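
E.g. something like (a sketch; MTE_BITS_PER_TAG is the constant
mentioned above):

	#include <linux/build_bug.h>

	/* The packing code assumes two 4-bit tags per byte. */
	static_assert(MTE_BITS_PER_TAG == 4);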


-- 
Alexander Potapenko
Software Engineer

Google Germany GmbH
Erika-Mann-Straße, 33
80636 München

Geschäftsführer: Paul Manicle, Liana Sebastian
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg

^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2023-07-20 12:01 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
2023-07-17 11:37 [PATCH v3 0/5] Implement MTE tag compression for swapped pages Alexander Potapenko
2023-07-17 11:37 ` [PATCH v3 1/5] lib/bitmap: add bitmap_{set,get}_value() Alexander Potapenko
2023-07-17 13:01   ` Andy Shevchenko
2023-07-17 14:14     ` Alexander Potapenko
2023-07-17 14:29       ` Andy Shevchenko
2023-07-17 14:31         ` Andy Shevchenko
2023-07-17 16:15           ` Yury Norov
2023-07-17 14:53       ` Alexander Potapenko
2023-07-17 15:03         ` Andy Shevchenko
2023-07-17 16:29           ` Alexander Potapenko
2023-07-17 15:50   ` Yury Norov
2023-07-18  9:30     ` Alexander Potapenko
2023-07-18 14:01       ` Andy Shevchenko
2023-07-18 17:03         ` Yury Norov
2023-07-18 17:20           ` Andy Shevchenko
2023-07-19  9:00           ` Alexander Potapenko
2023-07-17 11:37 ` [PATCH v3 2/5] lib/test_bitmap: add tests for bitmap_{set,get}_value() Alexander Potapenko
2023-07-17 13:04   ` Andy Shevchenko
2023-07-18 10:19     ` Alexander Potapenko
2023-07-17 16:11   ` Yury Norov
2023-07-17 16:28     ` Andy Shevchenko
2023-07-17 16:42     ` Alexander Potapenko
2023-07-17 11:37 ` [PATCH v3 3/5] arm64: mte: implement CONFIG_ARM64_MTE_COMP Alexander Potapenko
2023-07-17 13:49   ` Andy Shevchenko
2023-07-18 15:33     ` Alexander Potapenko
2023-07-18 17:17       ` Andy Shevchenko
2023-07-19 12:16         ` Alexander Potapenko
2023-07-19  6:09   ` Yury Norov
2023-07-19 14:00     ` Alexander Potapenko
2023-07-19 21:06       ` Yury Norov
2023-07-20 12:00         ` Alexander Potapenko
2023-07-19 20:32     ` Evgenii Stepanov
2023-07-17 11:37 ` [PATCH v3 4/5] arm64: mte: add a test for MTE tags compression Alexander Potapenko
2023-07-17 11:37 ` [PATCH v3 5/5] arm64: mte: add compression support to mteswap.c Alexander Potapenko
2023-07-17 13:53   ` Andy Shevchenko
2023-07-18 10:48     ` Alexander Potapenko
2023-07-18 14:13       ` Andy Shevchenko
