* [PATCH v6 0/5] Implement MTE tag compression for swapped pages
@ 2023-10-06 13:45 ` Alexander Potapenko
  0 siblings, 0 replies; 28+ messages in thread
From: Alexander Potapenko @ 2023-10-06 13:45 UTC (permalink / raw)
  To: glider, catalin.marinas, will, pcc, andreyknvl,
	andriy.shevchenko, aleksander.lobakin, linux, yury.norov
  Cc: linux-kernel, linux-arm-kernel, eugenis, syednwaris, william.gray

Currently, when MTE pages are swapped out, the tags are kept in memory,
occupying PAGE_SIZE/32 bytes per page. This is especially problematic for
devices that use zram-backed in-memory swap, because tags stored
uncompressed in the heap effectively reduce the amount of memory available
for swap.

The RLE-based algorithm suggested by Evgenii Stepanov and implemented in
this patch series efficiently compresses these fixed-size tag buffers,
achieving a practical compression ratio between 2.5x and 4x. In most cases
the compressed data fits into a 63-bit XArray value, so no extra memory
allocation is needed.

Our measurements show that for these buffers the proposed algorithm
provides better compression than the existing kernel compression
algorithms (LZ4, LZO, LZ4HC, ZSTD).

To implement compression/decompression, we also extend <linux/bitmap.h>
with functions that read/write bit values at arbitrary positions in the
bitmap.

We refactor arch/arm64/mm/mteswap.c to support both the compressed
(CONFIG_ARM64_MTE_COMP) and non-compressed cases. For the former, in
addition to tag compression, we move tag allocation from kmalloc() to
separate kmem caches, providing greater locality and relaxing the
alignment requirements.
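
For illustration, here is a rough sketch (not part of the series) of how a
caller might use the <asm/mtecomp.h> API when saving and restoring the tags
of a single page. The helper names are made up, error handling is omitted,
and MTE_PAGE_TAG_STORAGE is assumed to come from the MTE headers:

  #include <asm/mte.h>
  #include <asm/mtecomp.h>

  /* Hypothetical swap-out path: compress a page's tags into a handle. */
  static unsigned long demo_save_tags(u8 tags[MTE_PAGE_TAG_STORAGE])
  {
          /* Returns a 63-bit handle; may allocate from a kmem cache. */
          return mte_compress(tags);
  }

  /* Hypothetical swap-in path: restore the tags, then drop the storage. */
  static void demo_restore_tags(unsigned long handle,
                                u8 tags[MTE_PAGE_TAG_STORAGE])
  {
          mte_decompress(handle, tags);
          mte_release_handle(handle);
  }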

v6:
 - fixed comments by Yury Norov
 - fixed handling of sizes divisible by MTE_GRANULES_PER_PAGE / 2
   (caught while testing on a real device)

v5:
 - fixed comments by Andy Shevchenko, Catalin Marinas, and Yury Norov
 - added support for 16K and 64K pages
 - more efficient bitmap_write() implementation

v4:
 - fixed a bunch of comments by Andy Shevchenko and Yury Norov
 - added Documentation/arch/arm64/mte-tag-compression.rst

v3:
 - as suggested by Andy Shevchenko, use
   bitmap_get_value()/bitmap_set_value() written by Syed Nayyar Waris
 - switched to unsigned long to reduce typecasts
 - simplified the compression code

v2:
 - as suggested by Yury Norov, replace the poorly implemented struct
   bitq with <linux/bitmap.h>



Alexander Potapenko (4):
  lib/test_bitmap: add tests for bitmap_{read,write}()
  arm64: mte: implement CONFIG_ARM64_MTE_COMP
  arm64: mte: add a test for MTE tags compression
  arm64: mte: add compression support to mteswap.c

Syed Nayyar Waris (1):
  lib/bitmap: add bitmap_{read,write}()

 Documentation/arch/arm64/index.rst            |   1 +
 .../arch/arm64/mte-tag-compression.rst        | 266 +++++++++
 arch/arm64/Kconfig                            |  21 +
 arch/arm64/include/asm/mtecomp.h              |  13 +
 arch/arm64/mm/Makefile                        |   7 +
 arch/arm64/mm/mtecomp.c                       | 524 ++++++++++++++++++
 arch/arm64/mm/mtecomp.h                       |  12 +
 arch/arm64/mm/mteswap.c                       |  20 +-
 arch/arm64/mm/mteswap.h                       |  12 +
 arch/arm64/mm/mteswap_comp.c                  |  60 ++
 arch/arm64/mm/mteswap_nocomp.c                |  38 ++
 arch/arm64/mm/test_mtecomp.c                  | 377 +++++++++++++
 include/linux/bitmap.h                        |  68 +++
 lib/test_bitmap.c                             | 118 ++++
 14 files changed, 1526 insertions(+), 11 deletions(-)
 create mode 100644 Documentation/arch/arm64/mte-tag-compression.rst
 create mode 100644 arch/arm64/include/asm/mtecomp.h
 create mode 100644 arch/arm64/mm/mtecomp.c
 create mode 100644 arch/arm64/mm/mtecomp.h
 create mode 100644 arch/arm64/mm/mteswap.h
 create mode 100644 arch/arm64/mm/mteswap_comp.c
 create mode 100644 arch/arm64/mm/mteswap_nocomp.c
 create mode 100644 arch/arm64/mm/test_mtecomp.c

-- 
2.42.0.609.gbb76f46606-goog



* [PATCH v6 1/5] lib/bitmap: add bitmap_{read,write}()
  2023-10-06 13:45 ` Alexander Potapenko
@ 2023-10-06 13:45   ` Alexander Potapenko
  -1 siblings, 0 replies; 28+ messages in thread
From: Alexander Potapenko @ 2023-10-06 13:45 UTC (permalink / raw)
  To: glider, catalin.marinas, will, pcc, andreyknvl,
	andriy.shevchenko, aleksander.lobakin, linux, yury.norov
  Cc: linux-kernel, linux-arm-kernel, eugenis, syednwaris,
	william.gray, Arnd Bergmann

From: Syed Nayyar Waris <syednwaris@gmail.com>

The two new functions allow reading/writing values of length up to
BITS_PER_LONG bits at an arbitrary position in the bitmap.

The code was taken from "bitops: Introduce the for_each_set_clump macro"
by Syed Nayyar Waris with a number of changes and simplifications:
 - instead of using roundup(), which adds an unnecessary dependency
   on <linux/math.h>, we calculate space as BITS_PER_LONG-offset;
 - indentation is reduced by not using else-clauses (suggested by
   checkpatch for bitmap_get_value());
 - bitmap_get_value()/bitmap_set_value() are renamed to bitmap_read()
   and bitmap_write();
 - some redundant computations are omitted.
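
A minimal usage sketch (illustrative only, not part of this patch): pack
two small values into a bitmap and read them back:

  DECLARE_BITMAP(map, 64);

  bitmap_zero(map, 64);
  bitmap_write(map, 0x5a, 3, 8);  /* 8-bit value at bit offset 3 */
  bitmap_write(map, 0x3, 11, 2);  /* 2-bit value right after it */
  /* Both reads return the previously written values. */
  WARN_ON(bitmap_read(map, 3, 8) != 0x5a);
  WARN_ON(bitmap_read(map, 11, 2) != 0x3);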

Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Syed Nayyar Waris <syednwaris@gmail.com>
Signed-off-by: William Breathitt Gray <william.gray@linaro.org>
Link: https://lore.kernel.org/lkml/fe12eedf3666f4af5138de0e70b67a07c7f40338.1592224129.git.syednwaris@gmail.com/
Suggested-by: Yury Norov <yury.norov@gmail.com>
Co-developed-by: Alexander Potapenko <glider@google.com>
Signed-off-by: Alexander Potapenko <glider@google.com>

---
This patch was previously called "lib/bitmap: add
bitmap_{set,get}_value()"
(https://lore.kernel.org/lkml/20230720173956.3674987-2-glider@google.com/)

v6:
 - As suggested by Yury Norov, do not require bitmap_read(..., 0) to
   return 0.

v5:
 - Address comments by Yury Norov:
   - updated code comments and patch title/description
   - replace GENMASK(nbits - 1, 0) with BITMAP_LAST_WORD_MASK(nbits)
   - more compact bitmap_write() implementation

v4:
 - Address comments by Andy Shevchenko and Yury Norov:
   - prevent passing values >= 64 to GENMASK()
   - fix commit authorship
   - change comments
   - check for unlikely(nbits==0)
   - drop unnecessary const declarations
   - fix kernel-doc comments
   - rename bitmap_{get,set}_value() to bitmap_{read,write}()
---
 include/linux/bitmap.h | 68 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 68 insertions(+)

diff --git a/include/linux/bitmap.h b/include/linux/bitmap.h
index 03644237e1efb..e72c054d21d48 100644
--- a/include/linux/bitmap.h
+++ b/include/linux/bitmap.h
@@ -76,7 +76,11 @@ struct device;
  *  bitmap_to_arr32(buf, src, nbits)            Copy nbits from buf to u32[] dst
  *  bitmap_to_arr64(buf, src, nbits)            Copy nbits from buf to u64[] dst
  *  bitmap_get_value8(map, start)               Get 8bit value from map at start
+ *  bitmap_read(map, start, nbits)              Read an nbits-sized value from
+ *                                              map at start
  *  bitmap_set_value8(map, value, start)        Set 8bit value to map at start
+ *  bitmap_write(map, value, start, nbits)      Write an nbits-sized value to
+ *                                              map at start
  *
  * Note, bitmap_zero() and bitmap_fill() operate over the region of
  * unsigned longs, that is, bits behind bitmap till the unsigned long
@@ -583,6 +587,33 @@ static inline unsigned long bitmap_get_value8(const unsigned long *map,
 	return (map[index] >> offset) & 0xFF;
 }
 
+/**
+ * bitmap_read - read a value of n-bits from the memory region
+ * @map: address to the bitmap memory region
+ * @start: bit offset of the n-bit value
+ * @nbits: size of value in bits, nonzero, up to BITS_PER_LONG
+ *
+ * Returns: value of nbits located at the @start bit offset within the @map
+ * memory region.
+ */
+static inline unsigned long bitmap_read(const unsigned long *map,
+					unsigned long start,
+					unsigned long nbits)
+{
+	size_t index = BIT_WORD(start);
+	unsigned long offset = start % BITS_PER_LONG;
+	unsigned long space = BITS_PER_LONG - offset;
+	unsigned long value_low, value_high;
+
+	if (unlikely(!nbits))
+		return 0;
+	if (space >= nbits)
+		return (map[index] >> offset) & GENMASK(nbits - 1, 0);
+	value_low = map[index] & BITMAP_FIRST_WORD_MASK(start);
+	value_high = map[index + 1] & BITMAP_LAST_WORD_MASK(start + nbits);
+	return (value_low >> offset) | (value_high << space);
+}
+
 /**
  * bitmap_set_value8 - set an 8-bit value within a memory region
  * @map: address to the bitmap memory region
@@ -599,6 +630,43 @@ static inline void bitmap_set_value8(unsigned long *map, unsigned long value,
 	map[index] |= value << offset;
 }
 
+/**
+ * bitmap_write - write n-bit value within a memory region
+ * @map: address to the bitmap memory region
+ * @value: value to write, clamped to nbits
+ * @start: bit offset of the n-bit value
+ * @nbits: size of value in bits, nonzero, up to BITS_PER_LONG.
+ *
+ * bitmap_write() behaves similarly to @nbits calls of assign_bit(), i.e. bits
+ * beyond @nbits are ignored:
+ *
+ *   for (bit = 0; bit < nbits; bit++)
+ *           assign_bit(start + bit, bitmap, val & BIT(bit));
+ */
+static inline void bitmap_write(unsigned long *map,
+				unsigned long value,
+				unsigned long start, unsigned long nbits)
+{
+	size_t index = BIT_WORD(start);
+	unsigned long offset = start % BITS_PER_LONG;
+	unsigned long space = BITS_PER_LONG - offset;
+	unsigned long mask;
+
+	if (unlikely(!nbits))
+		return;
+	mask = BITMAP_LAST_WORD_MASK(nbits);
+	value &= mask;
+	if (space >= nbits) {
+		map[index] &= ~(mask << offset);
+		map[index] |= value << offset;
+		return;
+	}
+	map[index] &= ~BITMAP_FIRST_WORD_MASK(start);
+	map[index] |= value << offset;
+	map[index + 1] &= ~BITMAP_LAST_WORD_MASK(start + nbits);
+	map[index + 1] |= (value >> space);
+}
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* __LINUX_BITMAP_H */
-- 
2.42.0.609.gbb76f46606-goog



* [PATCH v6 2/5] lib/test_bitmap: add tests for bitmap_{read,write}()
  2023-10-06 13:45 ` Alexander Potapenko
@ 2023-10-06 13:45   ` Alexander Potapenko
  -1 siblings, 0 replies; 28+ messages in thread
From: Alexander Potapenko @ 2023-10-06 13:45 UTC (permalink / raw)
  To: glider, catalin.marinas, will, pcc, andreyknvl,
	andriy.shevchenko, aleksander.lobakin, linux, yury.norov
  Cc: linux-kernel, linux-arm-kernel, eugenis, syednwaris, william.gray

Add basic tests ensuring that values can be written at arbitrary positions
of the bitmap, including those spanning into the adjacent unsigned longs.

Signed-off-by: Alexander Potapenko <glider@google.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>

---
This patch was previously called
"lib/test_bitmap: add tests for bitmap_{set,get}_value()"
(https://lore.kernel.org/lkml/20230720173956.3674987-3-glider@google.com/)
and
"lib/test_bitmap: add tests for bitmap_{set,get}_value_unaligned"
(https://lore.kernel.org/lkml/20230713125706.2884502-3-glider@google.com/)

v6:
 - use bitmap API to initialize test bitmaps
 - as requested by Yury Norov, do not check the return value of
   bitmap_read(..., 0)
 - fix a compiler warning on 32-bit systems

v5:
 - update patch title
 - address Yury Norov's comments:
   - rename the test cases
   - factor out test_bitmap_write_helper() to test writing over
     different background patterns;
   - add a test case copying a nontrivial value bit-by-bit;
   - drop volatile

v4:
 - Address comments by Andy Shevchenko: added Reviewed-by: and a link to
   the previous discussion
 - Address comments by Yury Norov:
   - expand the bitmap to catch more corner cases
   - add code testing that bitmap_set_value() does not touch adjacent
     bits
   - add code testing the nbits==0 case
   - rename bitmap_{get,set}_value() to bitmap_{read,write}()

v3:
 - switch to using bitmap_{set,get}_value()
 - change the expected bit pattern in test_set_get_value(),
   as the test was incorrectly assuming 0 is the LSB.
---
 lib/test_bitmap.c | 118 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 118 insertions(+)

diff --git a/lib/test_bitmap.c b/lib/test_bitmap.c
index 187f5b2db4cf1..172ecf3292e46 100644
--- a/lib/test_bitmap.c
+++ b/lib/test_bitmap.c
@@ -71,6 +71,17 @@ __check_eq_uint(const char *srcfile, unsigned int line,
 	return true;
 }
 
+static bool __init
+__check_eq_ulong(const char *srcfile, unsigned int line,
+		 const unsigned long exp_ulong, unsigned long x)
+{
+	if (exp_ulong != x) {
+		pr_err("[%s:%u] expected %lu, got %lu\n",
+			srcfile, line, exp_ulong, x);
+		return false;
+	}
+	return true;
+}
 
 static bool __init
 __check_eq_bitmap(const char *srcfile, unsigned int line,
@@ -186,6 +197,7 @@ __check_eq_str(const char *srcfile, unsigned int line,
 	})
 
 #define expect_eq_uint(...)		__expect_eq(uint, ##__VA_ARGS__)
+#define expect_eq_ulong(...)		__expect_eq(ulong, ##__VA_ARGS__)
 #define expect_eq_bitmap(...)		__expect_eq(bitmap, ##__VA_ARGS__)
 #define expect_eq_pbl(...)		__expect_eq(pbl, ##__VA_ARGS__)
 #define expect_eq_u32_array(...)	__expect_eq(u32_array, ##__VA_ARGS__)
@@ -1222,6 +1234,111 @@ static void __init test_bitmap_const_eval(void)
 	BUILD_BUG_ON(~var != ~BIT(25));
 }
 
+/*
+ * Test bitmap should be big enough to include the cases when start is not in
+ * the first word, and start+nbits lands in the following word.
+ */
+#define TEST_BIT_LEN (1000)
+
+/*
+ * Helper function to test bitmap_write() overwriting the chosen byte pattern.
+ */
+static void __init test_bitmap_write_helper(const char *pattern)
+{
+	DECLARE_BITMAP(bitmap, TEST_BIT_LEN);
+	DECLARE_BITMAP(exp_bitmap, TEST_BIT_LEN);
+	DECLARE_BITMAP(pat_bitmap, TEST_BIT_LEN);
+	unsigned long w, r, bit;
+	int i, n, nbits;
+
+	/*
+	 * Only parse the pattern once and store the result in the intermediate
+	 * bitmap.
+	 */
+	bitmap_parselist(pattern, pat_bitmap, TEST_BIT_LEN);
+
+	/*
+	 * Check that setting a single bit does not accidentally touch the
+	 * adjacent bits.
+	 */
+	for (i = 0; i < TEST_BIT_LEN; i++) {
+		bitmap_copy(bitmap, pat_bitmap, TEST_BIT_LEN);
+		bitmap_copy(exp_bitmap, pat_bitmap, TEST_BIT_LEN);
+		for (bit = 0; bit <= 1; bit++) {
+			bitmap_write(bitmap, bit, i, 1);
+			__assign_bit(i, exp_bitmap, bit);
+			expect_eq_bitmap(exp_bitmap, bitmap,
+					 TEST_BIT_LEN);
+		}
+	}
+
+	/* Ensure setting 0 bits does not change anything. */
+	bitmap_copy(bitmap, pat_bitmap, TEST_BIT_LEN);
+	bitmap_copy(exp_bitmap, pat_bitmap, TEST_BIT_LEN);
+	for (i = 0; i < TEST_BIT_LEN; i++) {
+		bitmap_write(bitmap, ~0UL, i, 0);
+		expect_eq_bitmap(exp_bitmap, bitmap, TEST_BIT_LEN);
+	}
+
+	for (nbits = BITS_PER_LONG; nbits >= 1; nbits--) {
+		w = IS_ENABLED(CONFIG_64BIT) ? 0xdeadbeefdeadbeefUL
+					     : 0xdeadbeefUL;
+		w >>= (BITS_PER_LONG - nbits);
+		for (i = 0; i <= TEST_BIT_LEN - nbits; i++) {
+			bitmap_copy(bitmap, pat_bitmap, TEST_BIT_LEN);
+			bitmap_copy(exp_bitmap, pat_bitmap, TEST_BIT_LEN);
+			for (n = 0; n < nbits; n++)
+				__assign_bit(i + n, exp_bitmap, w & BIT(n));
+			bitmap_write(bitmap, w, i, nbits);
+			expect_eq_bitmap(exp_bitmap, bitmap, TEST_BIT_LEN);
+			r = bitmap_read(bitmap, i, nbits);
+			expect_eq_ulong(r, w);
+		}
+	}
+}
+
+static void __init test_bitmap_read_write(void)
+{
+	unsigned char *pattern[3] = {"", "all:1/2", "all"};
+	DECLARE_BITMAP(bitmap, TEST_BIT_LEN);
+	unsigned long zero_bits = 0;
+	unsigned long val;
+	int i, pi;
+
+	/*
+	 * Setting/getting zero bytes should not crash the kernel.
+	 * READ_ONCE() prevents constant folding.
+	 */
+	bitmap_write(NULL, 0, 0, READ_ONCE(zero_bits));
+	/* Return value of bitmap_read() is undefined here. */
+	bitmap_read(NULL, 0, READ_ONCE(zero_bits));
+
+	/*
+	 * Ensure that bitmap_read() reads the same value that was previously
+	 * written, and two consequent values are correctly merged.
+	 * The resulting bit pattern is asymmetric to rule out possible issues
+	 * with bit numeration order.
+	 */
+	for (i = 0; i < TEST_BIT_LEN - 7; i++) {
+		bitmap_zero(bitmap, TEST_BIT_LEN);
+
+		bitmap_write(bitmap, 0b10101UL, i, 5);
+		val = bitmap_read(bitmap, i, 5);
+		expect_eq_ulong(0b10101UL, val);
+
+		bitmap_write(bitmap, 0b101UL, i + 5, 3);
+		val = bitmap_read(bitmap, i + 5, 3);
+		expect_eq_ulong(0b101UL, val);
+
+		val = bitmap_read(bitmap, i, 8);
+		expect_eq_ulong(0b10110101UL, val);
+	}
+
+	for (pi = 0; pi < ARRAY_SIZE(pattern); pi++)
+		test_bitmap_write_helper(pattern[pi]);
+}
+#undef TEST_BIT_LEN
+
 static void __init selftest(void)
 {
 	test_zero_clear();
@@ -1237,6 +1354,7 @@ static void __init selftest(void)
 	test_bitmap_cut();
 	test_bitmap_print_buf();
 	test_bitmap_const_eval();
+	test_bitmap_read_write();
 
 	test_find_nth_bit();
 	test_for_each_set_bit();
-- 
2.42.0.609.gbb76f46606-goog



* [PATCH v6 3/5] arm64: mte: implement CONFIG_ARM64_MTE_COMP
  2023-10-06 13:45 ` Alexander Potapenko
@ 2023-10-06 13:45   ` Alexander Potapenko
  -1 siblings, 0 replies; 28+ messages in thread
From: Alexander Potapenko @ 2023-10-06 13:45 UTC (permalink / raw)
  To: glider, catalin.marinas, will, pcc, andreyknvl,
	andriy.shevchenko, aleksander.lobakin, linux, yury.norov
  Cc: linux-kernel, linux-arm-kernel, eugenis, syednwaris, william.gray

This config option implements an algorithm that compresses memory tags for
ARM MTE during swapping.

The algorithm is based on RLE and specifically targets buffers of tags
corresponding to a single page. In the common case a buffer can be
compressed into 63 bits, making it possible to store it without
additional memory allocation.
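
As an illustration of the range representation used by the algorithm (a
sketch, not part of the patch itself): on a 4K page whose first three
granules have tag 0x3 and the rest tag 0x0, mte_tags_to_ranges() produces
two ranges:

  u8 tags[MTE_PAGE_TAG_STORAGE];  /* 0x33, 0x30, 0x00, 0x00, ... */
  u8 r_tags[MTE_GRANULES_PER_PAGE];
  unsigned short r_sizes[MTE_GRANULES_PER_PAGE];
  size_t r_len = ARRAY_SIZE(r_tags);

  mte_tags_to_ranges(tags, r_tags, r_sizes, &r_len);
  /* Now r_len == 2, r_tags == {0x3, 0x0}, r_sizes == {3, 253}. */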

Suggested-by: Evgenii Stepanov <eugenis@google.com>
Signed-off-by: Alexander Potapenko <glider@google.com>

---
 v6:
  - shuffle bits in inline handles so that they can't be confused with
    canonical pointers;
  - use kmem_cache_zalloc() to allocate compressed storage
  - correctly handle range size overflow
  - minor documentation fixes, clarify the special cases

 v5:
  - make code PAGE_SIZE-agnostic, remove hardcoded constants, updated
    the docs
  - implement debugfs interface
  - Address comments by Andy Shevchenko:
    - update description of mtecomp.c
    - remove redundant assignments, simplify mte_tags_to_ranges()
    - various code simplifications
    - introduce mtecomp.h
    - add :export: to Documentation/arch/arm64/mte-tag-compression.rst

 v4:
  - Addressed comments by Andy Shevchenko:
    - expanded "MTE" to "Memory Tagging Extension" in Kconfig
    - fixed kernel-doc comments, moved them to C source
    - changed variables to unsigned where applicable
    - some code simplifications, fewer unnecessary assignments
    - added the mte_largest_idx_bits() helper
    - added namespace prefixes to all functions
    - added missing headers (but removed bits.h)
  - Addressed comments by Yury Norov:
    - removed test-only functions from mtecomp.h
    - dropped the algorithm name (all functions are now prefixed with
      "mte")
    - added more comments
    - got rid of MTE_RANGES_INLINE
    - renamed bitmap_{get,set}_value() to bitmap_{read,write}()
    - moved the big comment explaining the algorithm to
      Documentation/arch/arm64/mte-tag-compression.rst, expanded it,
      add a link to it from Documentation/arch/arm64/index.rst
    - removed hardcoded ranges from mte_alloc_size()/mte_size_to_ranges()

 v3:
  - Addressed comments by Andy Shevchenko:
   - use bitmap_{set,get}_value() written by Syed Nayyar Waris
   - switched to unsigned long everywhere (fewer casts)
   - simplified the code, removed redundant checks
   - dropped ea0_compress_inline()
 - added bit size constants and helpers to access the bitmap
 - explicitly initialize all compressed sizes in ea0_compress_to_buf()
 - initialize all handle bits

 v2:
  - as suggested by Yury Norov, switched from struct bitq (which is
    not needed anymore) to <linux/bitmap.h>
  - add missing symbol exports
---
 Documentation/arch/arm64/index.rst            |   1 +
 .../arch/arm64/mte-tag-compression.rst        | 266 +++++++++
 arch/arm64/Kconfig                            |  11 +
 arch/arm64/include/asm/mtecomp.h              |  13 +
 arch/arm64/mm/Makefile                        |   1 +
 arch/arm64/mm/mtecomp.c                       | 524 ++++++++++++++++++
 arch/arm64/mm/mtecomp.h                       |  12 +
 7 files changed, 828 insertions(+)
 create mode 100644 Documentation/arch/arm64/mte-tag-compression.rst
 create mode 100644 arch/arm64/include/asm/mtecomp.h
 create mode 100644 arch/arm64/mm/mtecomp.c
 create mode 100644 arch/arm64/mm/mtecomp.h

diff --git a/Documentation/arch/arm64/index.rst b/Documentation/arch/arm64/index.rst
index d08e924204bf1..bf6c1583233a9 100644
--- a/Documentation/arch/arm64/index.rst
+++ b/Documentation/arch/arm64/index.rst
@@ -19,6 +19,7 @@ ARM64 Architecture
     legacy_instructions
     memory
     memory-tagging-extension
+    mte-tag-compression
     perf
     pointer-authentication
     ptdump
diff --git a/Documentation/arch/arm64/mte-tag-compression.rst b/Documentation/arch/arm64/mte-tag-compression.rst
new file mode 100644
index 0000000000000..05022c3f67d74
--- /dev/null
+++ b/Documentation/arch/arm64/mte-tag-compression.rst
@@ -0,0 +1,266 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================================================
+Tag Compression for Memory Tagging Extension (MTE)
+==================================================
+
+This document describes the algorithm used to compress memory tags for the
+ARM Memory Tagging Extension (MTE).
+
+Introduction
+============
+
+MTE assigns tags to memory pages: for 4K pages those tags occupy 128 bytes
+(256 4-bit tags each corresponding to a 16-byte MTE granule), for 16K pages -
+512 bytes, for 64K pages - 2048 bytes. By default, MTE carves out 3.125% (1/16)
+of the available physical memory to store the tags.
+
+When MTE pages are saved to swap, their tags need to be stored in the kernel
+memory. If the system swap is used heavily, these tags may take a substantial
+portion of the physical memory. To reduce memory waste,
+``CONFIG_ARM64_MTE_COMP`` allows the kernel to store the tags in compressed
+form.
+
+Implementation details
+======================
+
+The algorithm attempts to compress an array of ``MTE_PAGE_TAG_STORAGE``
+tag bytes into a byte sequence that can be stored in one of the smaller size
+class allocations (for 4K pages those are 16-, 32-, or 64-byte allocations).
+A special case is storing the tags inline in an 8-byte pointer.
+
+Tag manipulation and storage
+----------------------------
+
+Tags for swapped pages are stored in an XArray that maps swap entries to 63-bit
+values (see ``arch/arm64/mm/mteswap.c``). In the case when
+``CONFIG_ARM64_MTE_COMP=n``, these values contain pointers to uncompressed
+buffers allocated with kmalloc(). Otherwise, they are 63-bit handles used by the
+functions declared in ``arch/arm64/include/asm/mtecomp.h``:
+
+- mte_compress() compresses the given ``MTE_PAGE_TAG_STORAGE``-byte ``tags``
+  buffer, allocates storage for it, and returns an opaque handle addressing
+  that storage;
+- mte_decompress() decompresses the tags addressed by ``handle``
+  and fills the ``MTE_PAGE_TAG_STORAGE``-byte ``tags`` buffer;
+- mte_release_handle() releases the storage handle returned by
+  mte_compress() (so that this handle cannot be used anymore);
+- mte_storage_size() calculates the size occupied by the tags addressed
+  by ``handle``.
+
+Depending on the size of compressed data, mte_compress() stores it in one of
+the size classes backed by kmem caches: ``mte-tags-{16,32,64,128}`` for the
+4K-page case (``mte-tags-128`` being used for the data that cannot be compressed
+into 64 bytes and is stored uncompressed).
+A practical common case allows the tags to be compressed into 8 bytes - then
+they are stored in the handle itself.
+
+Handle format
+-------------
+
+The handle returned by mte_compress() is an ``unsigned long`` that has its
+bit 63 set to 0 (XArray entries must not exceed ``LONG_MAX``)::
+
+   63  62    60  ...   2         0
+  +---+--------+-----+------------+
+  | 0 | INLINE | ... |  SIZE_LOG  |
+  +---+--------+-----+------------+
+
+Bits ``62..60`` are the inline/out-of-line marker: if they are all set to 1, the
+data is stored out-of-line in the buffer pointed to by
+``(handle | BIT(63)) & ~7UL``. Otherwise, the data is stored inline in the
+handle itself.
+
+Bits ``2..0`` denote the size for out-of-line allocations::
+
+  size = 16 << (handle & 0b111)
+
+
+Tag compression
+---------------
+
+The compression algorithm is a variation of RLE (run-length encoding) and works
+as follows (we'll be considering 4K pages and 128-byte tag buffers, but the same
+approach scales to 16- and 64K pages):
+
+1. The input array of 128 (``MTE_PAGE_TAG_STORAGE``) bytes is transformed into
+   tag ranges (two arrays: ``r_tags[]`` containing tag values and ``r_sizes[]``
+   containing range lengths) by mte_tags_to_ranges(). Note that
+   ``r_sizes[]`` sums up to 256 (``MTE_GRANULES_PER_PAGE``).
+
+   If ``r_sizes[]`` consists of a single element (``MTE_GRANULES_PER_PAGE``),
+   the corresponding range is split into two halves, i.e.::
+
+     r_sizes_new[2] = { MTE_GRANULES_PER_PAGE / 2, MTE_GRANULES_PER_PAGE / 2 };
+     r_tags_new[2] = { r_tags[0], r_tags[0] };
+
+2. The number of the largest element of ``r_sizes[]`` is stored in
+   ``largest_idx``. The element itself is thrown away from ``r_sizes[]``,
+   because it can be reconstructed from the sum of the remaining elements. Note
+   that now none of the remaining ``r_sizes[]`` elements exceeds
+   ``MTE_GRANULES_PER_PAGE / 2``.
+
+3. Depending on the number ``N`` of ranges, a storage class is picked::
+
+           N <= 6:  8 bytes (inline case, no allocation required);
+      6 < N <= 11: 16 bytes
+     11 < N <= 23: 32 bytes
+     23 < N <= 46: 64 bytes
+     46 < N:       128 bytes (no compression will be performed)
+
+  (See `Why these numbers?`_ below).
+
+4. For the inline case, the following values are stored packed in the 8-byte
+   handle (``i<size>`` means a ``<size>``-bit unsigned integer) treated as a
+   bitmap (see ``include/linux/bitmap.h``)::
+
+      largest_idx : i4
+     r_tags[0..5] : i4 x 6
+    r_sizes[0..4] : i7 x 5
+
+   (if N is less than 6, ``r_tags`` and ``r_sizes`` are padded up with zero
+   values)
+
+   Bitmaps are written from lower bits to higher ones, so the order of the
+   records in the actual 8-byte handle is reversed. Therefore an additional
+   transformation is applied to bring largest_idx to bits ``63..60``.
+   Because ``largest_idx`` is <= 5, bit 63 of the handle is now always 0 (so the
+   handle can be stored in an Xarray), and bits ``62..60`` cannot all be 1 (so
+   the handle can be distinguished from a kernel pointer).
+
+5. For the out-of-line case, the storage is allocated from one of the
+   ``mte-tags-{16,32,64,128}`` kmem caches. The resulting pointer is aligned
+   on 8 bytes, so its bits 2..0 can be used to store the size class (see above).
+
+   Bit 63 of the pointer is zeroed out, so that it can be stored in XArray.
+
+6. The data layout in the allocated storage is as follows::
+
+        largest_idx : i6
+       r_tags[0..N] : i4 x N
+    r_sizes[0..N-1] : i7 x (N-1)
+
+   Range size of ``MTE_GRANULES_PER_PAGE / 2`` (at most one) does not fit into
+   i7 and will be written as 0. This case is handled separately by the
+   decompressing procedure.
+
+Tag decompression
+-----------------
+
+The decompression algorithm performs the steps below.
+
+1. Decide if data is stored inline (bits ``62..60`` of the handle ``!= 0b111``)
+   or out-of-line.
+
+2. For the inline case:
+  1. Move the top four bits back to the end of the handle.
+  2. Treat the handle itself as the input buffer.
+
+3. For the out-of-line case, look at bits ``2..0`` of the handle to understand
+   the input buffer length. To obtain the pointer to the input buffer, unset
+   bits ``2..0`` of the handle and set bit ``63``.
+
+4. If the input buffer is 128 bytes long, copy its contents to the output
+   buffer.
+
+5. Otherwise, read ``largest_idx``, ``r_tags[]`` and ``r_sizes[]`` from the
+   input buffer.
+
+   If ``largest_idx`` is zero, and all ``r_sizes[]`` are zero, set
+   ``r_sizes[0] = MTE_GRANULES_PER_PAGE / 2``.
+
+   Calculate the removed largest element of ``r_sizes[]`` as
+   ``largest = 256 - sum(r_sizes)`` and insert it into ``r_sizes`` at
+   position ``largest_idx``.
+
+6. For each ``r_sizes[i] > 0``, add a 4-bit value ``r_tags[i]`` to the output
+   buffer ``r_sizes[i]`` times.
+
+
+Why these numbers?
+------------------
+
+To be able to reconstruct N tag ranges from the compressed data, we need to
+store ``largest_idx``, ``r_tags[N]``, and ``r_sizes[N-1]``. Knowing that the
+sizes do not exceed ``MTE_PAGE_TAG_STORAGE``, those can be packed into
+``S = ilog2(MTE_PAGE_TAG_STORAGE)`` bits, whereas a single tag occupies
+4 bits, and ``largest_idx`` cannot take more than
+``Lmax = ilog2(MTE_GRANULES_PER_PAGE)`` bits.
+
+Now, for each ``B``-byte size class it is possible to find the maximal number
+``M`` such that ``Lmax + 4 * M + S * (M - 1) <= 8 * B``,
+i.e. ``M = (8 * B - 1) / 11``::
+
+ 4K pages: S = 7
+ +-------------+----+--------------+
+ | Buffer size |  M | Storage bits |
+ +-------------+----+--------------+
+ |          8  |  5 |          56  |
+ |         16  | 11 |         122  |
+ |         32  | 23 |         254  |
+ |         64  | 46 |         507  |
+ +-------------+----+--------------+
+
+We can notice that ``M`` (and therefore ``largest_idx``) actually always fits
+into 6 bits. For the inline case it is even guaranteed to fit into 3 bits, which
+lets us squeeze an extra range into an 8-byte buffer. Because the inline case
+requires bit 63 of the handle to be zero, we add that bit to ``largest_idx``,
+knowing it will not be used.
+
+For the revised ``largest_idx`` sizes, we now pick the maximal number ``N``
+such that ``L + 4 * N + S * (N - 1) <= 8 * B``, where ``L = 4`` in the inline
+case and ``L = 6`` otherwise.
+In other words, ``N = (8 * B + S - L) / (4 + S)``, therefore::
+
+  4K pages: S = 7, L_i = 4, L_o = 6
+  +-------------+----+--------------+
+  | Buffer size |  N | Storage bits |
+  +-------------+----+--------------+
+  |          8  |  6 |          63  |
+  |         16  | 11 |         120  |
+  |         32  | 23 |         252  |
+  |         64  | 46 |         505  |
+  +-------------+----+--------------+
+
+Similarly, for other page sizes::
+
+  16K pages: S = 9, L_i = 4, L_o = 8
+  +-------------+-----+--------------+
+  | Buffer size |  N  | Storage bits |
+  +-------------+-----+--------------+
+  |          8  |   5 |          60  |
+  |         16  |   9 |         116  |
+  |         32  |  19 |         246  |
+  |         64  |  39 |         506  |
+  |        128  |  78 |        1013  |
+  |        256  | 157 |        2040  |
+  +-------------+-----+--------------+
+
+  64K pages: S = 11, L_i = 4, L_o = 10
+  +-------------+-----+--------------+
+  | Buffer size |  N  | Storage bits |
+  +-------------+-----+--------------+
+  |          8  |   4 |          53  |
+  |         16  |   8 |         119  |
+  |         32  |  17 |         254  |
+  |         64  |  34 |         509  |
+  |        128  |  68 |        1019  |
+  |        256  | 136 |        2039  |
+  |        512  | 273 |        4094  |
+  |       1024  | 546 |        8189  |
+  +-------------+-----+--------------+
+
+
+Note
+----
+
+Tag compression and decompression implicitly rely on the fixed MTE tag size
+(4 bits) and number of tags per page. Should these values change, the algorithm
+may need to be revised.
+
+
+Programming Interface
+=====================
+
+ .. kernel-doc:: arch/arm64/mm/mtecomp.c
+   :export:
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index a2511b30d0f67..d4fb3b8d11d77 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -2093,6 +2093,17 @@ config ARM64_EPAN
 	  if the cpu does not implement the feature.
 endmenu # "ARMv8.7 architectural features"
 
+config ARM64_MTE_COMP
+	bool "Tag compression for ARM64 Memory Tagging Extension"
+	default y
+	depends on ARM64_MTE
+	help
+	  Enable tag compression support for ARM64 Memory Tagging Extension.
+
+	  Tag buffers corresponding to swapped RAM pages are compressed using
+	  RLE to conserve heap memory. In the common case compressed tags
+	  occupy 2.5x less memory.
+
 config ARM64_SVE
 	bool "ARM Scalable Vector Extension support"
 	default y
diff --git a/arch/arm64/include/asm/mtecomp.h b/arch/arm64/include/asm/mtecomp.h
new file mode 100644
index 0000000000000..71552bc429882
--- /dev/null
+++ b/arch/arm64/include/asm/mtecomp.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef __ASM_MTECOMP_H
+#define __ASM_MTECOMP_H
+
+#include <linux/types.h>
+
+unsigned long mte_compress(u8 *tags);
+bool mte_decompress(unsigned long handle, u8 *tags);
+void mte_release_handle(unsigned long handle);
+size_t mte_storage_size(unsigned long handle);
+
+#endif // __ASM_MTECOMP_H
diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
index dbd1bc95967d0..46778f6dd83c2 100644
--- a/arch/arm64/mm/Makefile
+++ b/arch/arm64/mm/Makefile
@@ -10,6 +10,7 @@ obj-$(CONFIG_TRANS_TABLE)	+= trans_pgd.o
 obj-$(CONFIG_TRANS_TABLE)	+= trans_pgd-asm.o
 obj-$(CONFIG_DEBUG_VIRTUAL)	+= physaddr.o
 obj-$(CONFIG_ARM64_MTE)		+= mteswap.o
+obj-$(CONFIG_ARM64_MTE_COMP)	+= mtecomp.o
 KASAN_SANITIZE_physaddr.o	+= n
 
 obj-$(CONFIG_KASAN)		+= kasan_init.o
diff --git a/arch/arm64/mm/mtecomp.c b/arch/arm64/mm/mtecomp.c
new file mode 100644
index 0000000000000..73e319f1f30da
--- /dev/null
+++ b/arch/arm64/mm/mtecomp.c
@@ -0,0 +1,524 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+/*
+ * MTE tag compression algorithm.
+ * See Documentation/arch/arm64/mte-tag-compression.rst for more details.
+ */
+
+#include <linux/bits.h>
+#include <linux/bitmap.h>
+#include <linux/bitops.h>
+#include <linux/debugfs.h>
+#include <linux/export.h>
+#include <linux/gfp.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/string.h>
+#include <linux/types.h>
+
+#include <asm/mtecomp.h>
+
+#include "mtecomp.h"
+
+/* The handle must fit into an Xarray value. */
+#define MTE_HANDLE_MASK GENMASK_ULL(62, 0)
+
+/* Out-of-line handles have 0b111 in bits 62..60. */
+#define MTE_NOINLINE_MASK GENMASK_ULL(62, 60)
+
+/* Cache index is stored in the lowest pointer bits. */
+#define MTE_CACHE_ID_MASK GENMASK_ULL(2, 0)
+
+/* Caches start at mte-tags-16 and go up to mte-tags-MTE_PAGE_TAG_STORAGE. */
+#define MTECOMP_NUM_CACHES ilog2(MTE_PAGE_TAG_STORAGE / 8)
+static struct kmem_cache *mtecomp_caches[MTECOMP_NUM_CACHES];
+/*
+ * [0] - store the numbers of created/released inline handles;
+ * [1..MTECOMP_NUM_CACHES] - store the number of allocations/deallocations from
+ *                           mtecomp_caches.
+ */
+static atomic_long_t alloc_counters[MTECOMP_NUM_CACHES + 1];
+static atomic_long_t dealloc_counters[MTECOMP_NUM_CACHES + 1];
+
+/*
+ * Largest number of ranges, for which compressed data fits into 63 bits, can
+ * be encoded with 4 bits.
+ */
+#define MTE_BITS_PER_LARGEST_IDX_INLINE 4
+/*
+ * In the worst case every tag is different, then largest index can be up to
+ * MTE_GRANULES_PER_PAGE.
+ */
+#define MTE_BITS_PER_LARGEST_IDX ilog2(MTE_GRANULES_PER_PAGE)
+/* Range size cannot exceed MTE_GRANULES_PER_PAGE / 2. */
+#define MTE_BITS_PER_SIZE (MTE_BITS_PER_LARGEST_IDX - 1)
+
+/* Translate allocation size into mtecomp_caches[] index. */
+static unsigned int mte_size_to_cache_id(size_t len)
+{
+	return fls(len) - 5;
+}
+
+/* Translate mtecomp_caches[] index into allocation size. */
+static size_t mte_cache_id_to_size(unsigned int id)
+{
+	return 16 << id;
+}
+
+/**
+ * mte_tags_to_ranges() - break @tags into arrays of tag ranges.
+ * @tags: MTE_GRANULES_PER_PAGE-byte array containing MTE tags.
+ * @out_tags: u8 array to store the tag of every range.
+ * @out_sizes: unsigned short array to store the size of every range.
+ * @out_len: length of @out_tags and @out_sizes (output parameter, initially
+ *           equal to lengths of out_tags[] and out_sizes[]).
+ */
+void mte_tags_to_ranges(u8 *tags, u8 *out_tags, unsigned short *out_sizes,
+			size_t *out_len)
+{
+	u8 prev_tag = tags[0] / 16; /* First tag in the array. */
+	unsigned int cur_idx = 0, i, j;
+	u8 cur_tag;
+
+	memset(out_tags, 0, array_size(*out_len, sizeof(*out_tags)));
+	memset(out_sizes, 0, array_size(*out_len, sizeof(*out_sizes)));
+
+	out_tags[cur_idx] = prev_tag;
+	for (i = 0; i < MTE_GRANULES_PER_PAGE; i++) {
+		j = i % 2;
+		cur_tag = j ? (tags[i / 2] % 16) : (tags[i / 2] / 16);
+		if (cur_tag == prev_tag) {
+			out_sizes[cur_idx]++;
+		} else {
+			cur_idx++;
+			prev_tag = cur_tag;
+			out_tags[cur_idx] = prev_tag;
+			out_sizes[cur_idx] = 1;
+		}
+	}
+	*out_len = cur_idx + 1;
+}
+EXPORT_SYMBOL_NS(mte_tags_to_ranges, MTECOMP);
+
+/**
+ * mte_ranges_to_tags() - fill @tags using given tag ranges.
+ * @r_tags: u8[] containing the tag of every range.
+ * @r_sizes: unsigned short[] containing the size of every range.
+ * @r_len: length of @r_tags and @r_sizes.
+ * @tags: MTE_GRANULES_PER_PAGE-byte array to write the tags to.
+ */
+void mte_ranges_to_tags(u8 *r_tags, unsigned short *r_sizes, size_t r_len,
+			u8 *tags)
+{
+	unsigned int i, j, pos = 0;
+	u8 prev;
+
+	for (i = 0; i < r_len; i++) {
+		for (j = 0; j < r_sizes[i]; j++) {
+			if (pos % 2)
+				tags[pos / 2] = (prev << 4) | r_tags[i];
+			else
+				prev = r_tags[i];
+			pos++;
+		}
+	}
+}
+EXPORT_SYMBOL_NS(mte_ranges_to_tags, MTECOMP);
+
+/*
+ * Translate allocation size into maximum number of ranges that it can hold.
+ *
+ * It is the biggest number N such as:
+ *   MTE_BITS_PER_LARGEST_IDX_INLINE + MTE_TAG_SIZE * N
+ *                                   + MTE_BITS_PER_SIZE * (N-1) <= 63 bits,
+ * for the inline case, or
+ *   MTE_BITS_PER_LARGEST_IDX
+ *     + MTE_TAG_SIZE * N + MTE_BITS_PER_SIZE * (N-1) <= tag array size in bits,
+ * for the out-of line case.
+ */
+static size_t mte_size_to_ranges(size_t size)
+{
+	size_t largest_bits;
+
+	largest_bits = (size == 8) ? MTE_BITS_PER_LARGEST_IDX_INLINE :
+				     MTE_BITS_PER_LARGEST_IDX;
+	return (size * 8 + MTE_BITS_PER_SIZE - largest_bits) /
+	       (MTE_TAG_SIZE + MTE_BITS_PER_SIZE);
+}
+
+/* Translate @num_ranges into the allocation size needed to hold them. */
+static size_t mte_alloc_size(unsigned int num_ranges)
+{
+	size_t size = 8;
+
+	while (size < (1 << MTE_BITS_PER_SIZE)) {
+		if (num_ranges <= mte_size_to_ranges(size))
+			return size;
+		size <<= 1;
+	}
+	return size;
+}
+
+/* Is the data stored inline in the handle itself? */
+static bool mte_is_inline(unsigned long handle)
+{
+	return (handle & MTE_NOINLINE_MASK) != MTE_NOINLINE_MASK;
+}
+
+/**
+ * mte_storage_size() - calculate the memory occupied by compressed tags.
+ * @handle: storage handle returned by mte_compress.
+ *
+ * Returns: size of the storage used for @handle.
+ */
+size_t mte_storage_size(unsigned long handle)
+{
+	if (mte_is_inline(handle))
+		return 8;
+	return mte_cache_id_to_size(handle & MTE_CACHE_ID_MASK);
+}
+EXPORT_SYMBOL_NS(mte_storage_size, MTECOMP);
+
+static void mte_bitmap_write(unsigned long *bitmap, unsigned long value,
+			     unsigned long *pos, unsigned long bits)
+{
+	bitmap_write(bitmap, value, *pos, bits);
+	*pos += bits;
+}
+
+static inline unsigned long mte_largest_idx_bits(size_t size)
+{
+	if (size == 8)
+		return MTE_BITS_PER_LARGEST_IDX_INLINE;
+	return MTE_BITS_PER_LARGEST_IDX;
+}
+
+/* Compress ranges into the buffer that can accommodate up to max_ranges. */
+static void mte_compress_to_buf(size_t len, u8 *tags, unsigned short *sizes,
+				unsigned long *bitmap, size_t size)
+{
+	unsigned long bit_pos = 0, l_bits;
+	unsigned int largest_idx, i;
+	unsigned short largest = 0;
+	size_t max_ranges;
+
+	for (i = 0; i < len; i++) {
+		if (sizes[i] > largest) {
+			largest = sizes[i];
+			largest_idx = i;
+		}
+	}
+	l_bits = mte_largest_idx_bits(size);
+	max_ranges = mte_size_to_ranges(size);
+	mte_bitmap_write(bitmap, largest_idx, &bit_pos, l_bits);
+	for (i = 0; i < len; i++)
+		mte_bitmap_write(bitmap, tags[i], &bit_pos, MTE_TAG_SIZE);
+	if (len == 1) {
+		/*
+		 * We are compressing MTE_GRANULES_PER_PAGE of identical tags.
+		 * Split it into two ranges containing
+		 * MTE_GRANULES_PER_PAGE / 2 tags, so that it falls into the
+		 * special case described below.
+		 */
+		mte_bitmap_write(bitmap, tags[0], &bit_pos, MTE_TAG_SIZE);
+		i = 2;
+	} else {
+		i = len;
+	}
+	for (; i < max_ranges; i++)
+		mte_bitmap_write(bitmap, 0, &bit_pos, MTE_TAG_SIZE);
+	/*
+	 * Most of the time sizes[i] fits into MTE_BITS_PER_SIZE, apart from a
+	 * special case when:
+	 *   len = 2;
+	 *   sizes = { MTE_GRANULES_PER_PAGE / 2, MTE_GRANULES_PER_PAGE / 2};
+	 * In this case largest_idx will be set to 0, and the size written to
+	 * the bitmap will be also 0.
+	 */
+	for (i = 0; i < len; i++) {
+		if (i != largest_idx)
+			mte_bitmap_write(bitmap, sizes[i], &bit_pos,
+					 MTE_BITS_PER_SIZE);
+	}
+	for (i = len; i < max_ranges; i++)
+		mte_bitmap_write(bitmap, 0, &bit_pos, MTE_BITS_PER_SIZE);
+}
+
+/*
+ * When storing tags inline in the 64-bit pointer, we want to be able to
+ * distinguish the resulting values from pointers to out-of-line heap buffers.
+ * Assuming the latter are canonical kernel pointers, we want largest_idx to be
+ * stored in the top bits of the handle, whereas it is easier for
+ * mte_compress_to_buf() to store them at the bottom.
+ *
+ * mte_inline_bits_to_handle() rearranges the bits in a 64-bit inline storage so
+ * that it can never be confused with a canonical pointer.
+ * mte_handle_to_inline_bits() performs the reverse transformation.
+ * In the future these can be extended to account for tagged kernel pointers.
+ */
+static unsigned long mte_inline_bits_to_handle(unsigned long bits)
+{
+	return ror64(bits, MTE_BITS_PER_LARGEST_IDX_INLINE);
+}
+
+static unsigned long mte_handle_to_inline_bits(unsigned long handle)
+{
+	return rol64(handle, MTE_BITS_PER_LARGEST_IDX_INLINE);
+}
+
+/**
+ * mte_compress() - compress the given tag array.
+ * @tags: MTE_GRANULES_PER_PAGE-byte array to read the tags from.
+ *
+ * Compresses the tags and returns a 64-bit opaque handle pointing to the
+ * tag storage. May allocate memory, which is freed by mte_release_handle().
+ *
+ * Returns: 64-bit tag storage handle.
+ */
+unsigned long mte_compress(u8 *tags)
+{
+	struct kmem_cache *cache;
+	unsigned short *r_sizes;
+	unsigned long *storage;
+	unsigned int cache_id;
+	size_t alloc_size;
+	u8 *r_tags;
+	size_t r_len;
+	/*
+	 * mte_compress_to_buf() only initializes the bits that mte_decompress()
+	 * will read. But when the tags are stored in the handle itself, it must
+	 * have all its bits initialized.
+	 */
+	unsigned long result = 0;
+
+	r_sizes = kmalloc_array(MTE_GRANULES_PER_PAGE, sizeof(unsigned short),
+				GFP_KERNEL);
+	r_tags = kmalloc(MTE_GRANULES_PER_PAGE, GFP_KERNEL);
+	if (!r_sizes || !r_tags)
+		goto ret;
+	r_len = MTE_GRANULES_PER_PAGE;
+	mte_tags_to_ranges(tags, r_tags, r_sizes, &r_len);
+	alloc_size = mte_alloc_size(r_len);
+	if (alloc_size == 8) {
+		mte_compress_to_buf(r_len, r_tags, r_sizes, &result,
+				    alloc_size);
+		result = mte_inline_bits_to_handle(result);
+		atomic_long_inc(&alloc_counters[0]);
+		goto ret;
+	}
+	cache_id = mte_size_to_cache_id(alloc_size);
+	cache = mtecomp_caches[cache_id];
+	storage = kmem_cache_zalloc(cache, GFP_KERNEL);
+	atomic_long_inc(&alloc_counters[cache_id + 1]);
+	if (!storage) {
+		result = 0;
+		goto ret;
+	}
+	if (alloc_size < MTE_PAGE_TAG_STORAGE) {
+		/* alloc_size is always a multiple of sizeof(unsigned long). */
+		mte_compress_to_buf(r_len, r_tags, r_sizes, storage,
+				    alloc_size);
+		result = ((unsigned long)storage | cache_id) & MTE_HANDLE_MASK;
+		goto ret;
+	}
+	memcpy(storage, tags, alloc_size);
+	result = ((unsigned long)storage | cache_id) & MTE_HANDLE_MASK;
+ret:
+	kfree(r_tags);
+	kfree(r_sizes);
+	return result;
+}
+EXPORT_SYMBOL_NS(mte_compress, MTECOMP);
+
+static unsigned long mte_bitmap_read(const unsigned long *bitmap,
+				     unsigned long *pos, unsigned long bits)
+{
+	unsigned long start = *pos;
+
+	*pos += bits;
+	return bitmap_read(bitmap, start, bits);
+}
+
+/* Decompress the contents of the given buffer into @tags. */
+static bool mte_decompress_from_buf(const unsigned long *bitmap, size_t size,
+				    u8 *tags)
+{
+	unsigned long bit_pos = 0, l_bits;
+	unsigned short *r_sizes, sum;
+	unsigned int largest_idx, i;
+	bool result = true;
+	size_t max_ranges;
+	u8 *r_tags;
+
+	max_ranges = mte_size_to_ranges(size);
+	l_bits = mte_largest_idx_bits(size);
+	largest_idx = mte_bitmap_read(bitmap, &bit_pos, l_bits);
+	if (largest_idx >= max_ranges)
+		return false;
+	r_sizes = kmalloc_array(max_ranges, sizeof(unsigned short), GFP_KERNEL);
+	r_tags = kmalloc(max_ranges, GFP_KERNEL);
+	if (!r_sizes || !r_tags) {
+		result = false;
+		goto ret;
+	}
+
+	for (i = 0; i < max_ranges; i++)
+		r_tags[i] = mte_bitmap_read(bitmap, &bit_pos, MTE_TAG_SIZE);
+	for (i = 0, sum = 0; i < max_ranges; i++) {
+		if (i == largest_idx)
+			continue;
+		r_sizes[i] =
+			mte_bitmap_read(bitmap, &bit_pos, MTE_BITS_PER_SIZE);
+		/*
+		 * Special case: tag array consists of two ranges of
+		 * `MTE_GRANULES_PER_PAGE / 2` tags.
+		 */
+		if ((largest_idx == 0) && (i == 1) && (r_sizes[i] == 0))
+			r_sizes[i] = MTE_GRANULES_PER_PAGE / 2;
+		if (!r_sizes[i]) {
+			max_ranges = i;
+			break;
+		}
+		sum += r_sizes[i];
+	}
+	if (sum >= MTE_GRANULES_PER_PAGE) {
+		result = false;
+		goto ret;
+	}
+	r_sizes[largest_idx] = MTE_GRANULES_PER_PAGE - sum;
+	mte_ranges_to_tags(r_tags, r_sizes, max_ranges, tags);
+	result = true;
+ret:
+	kfree(r_sizes);
+	kfree(r_tags);
+	return result;
+}
+
+/* Get pointer to the out-of-line storage from a handle. */
+static void *mte_storage(unsigned long handle)
+{
+	if (mte_is_inline(handle))
+		return NULL;
+	return (void *)((handle & (~MTE_CACHE_ID_MASK)) | BIT_ULL(63));
+}
+
+/**
+ * mte_decompress() - decompress the tag array addressed by the handle.
+ * @handle: handle returned by mte_compress().
+ * @tags: MTE_GRANULES_PER_PAGE-byte array to write the tags to.
+ *
+ * Reads the compressed data and writes it into the user-supplied tag array.
+ *
+ * Returns: true on success, false on error.
+ */
+bool mte_decompress(unsigned long handle, u8 *tags)
+{
+	unsigned long *storage;
+	size_t size;
+
+	if ((handle & MTE_HANDLE_MASK) != handle)
+		return false;
+
+	size = mte_storage_size(handle);
+	storage = mte_storage(handle);
+	switch (size) {
+	case 8:
+		handle = mte_handle_to_inline_bits(handle);
+		return mte_decompress_from_buf(&handle, size, tags);
+	case MTE_PAGE_TAG_STORAGE:
+		memcpy(tags, storage, size);
+		return true;
+	default:
+		return mte_decompress_from_buf(storage, size, tags);
+	}
+}
+EXPORT_SYMBOL_NS(mte_decompress, MTECOMP);
+
+/**
+ * mte_release_handle() - release the handle returned by mte_compress().
+ * @handle: handle returned by mte_compress().
+ */
+void mte_release_handle(unsigned long handle)
+{
+	unsigned int cache_id;
+	struct kmem_cache *c;
+	void *storage;
+	size_t size;
+
+	storage = mte_storage(handle);
+	if (!storage) {
+		atomic_long_inc(&dealloc_counters[0]);
+		return;
+	}
+
+	size = mte_storage_size(handle);
+	cache_id = mte_size_to_cache_id(size);
+	c = mtecomp_caches[cache_id];
+	kmem_cache_free(c, storage);
+	atomic_long_inc(&dealloc_counters[cache_id + 1]);
+}
+EXPORT_SYMBOL_NS(mte_release_handle, MTECOMP);
+
+/* DebugFS interface. */
+static int stats_show(struct seq_file *seq, void *v)
+{
+	unsigned long total_mem_alloc = 0, total_mem_dealloc = 0;
+	unsigned long total_num_alloc = 0, total_num_dealloc = 0;
+	unsigned long size = 8;
+	long alloc, dealloc;
+	int i;
+
+	for (i = 0; i <= MTECOMP_NUM_CACHES; i++) {
+		alloc = atomic_long_read(&alloc_counters[i]);
+		dealloc = atomic_long_read(&dealloc_counters[i]);
+		total_num_alloc += alloc;
+		total_num_dealloc += dealloc;
+		/*
+		 * Do not count 8-byte buffers towards compressed tag storage
+		 * size.
+		 */
+		if (i) {
+			total_mem_alloc += (size * alloc);
+			total_mem_dealloc += (size * dealloc);
+		}
+		seq_printf(seq,
+			   "%lu bytes: %lu allocations, %lu deallocations\n",
+			   size, alloc, dealloc);
+		size <<= 1;
+	}
+	seq_printf(seq, "uncompressed tag storage size: %lu\n",
+		   (total_num_alloc - total_num_dealloc) *
+			   MTE_PAGE_TAG_STORAGE);
+	seq_printf(seq, "compressed tag storage size: %lu\n",
+		   total_mem_alloc - total_mem_dealloc);
+	return 0;
+}
+DEFINE_SHOW_ATTRIBUTE(stats);
+
+static int mtecomp_debugfs_init(void)
+{
+	struct dentry *mtecomp_dir;
+
+	mtecomp_dir = debugfs_create_dir("mtecomp", NULL);
+	debugfs_create_file("stats", 0444, mtecomp_dir, NULL, &stats_fops);
+	return 0;
+}
+
+/* Set up mtecomp_caches[]. */
+static int mtecomp_init(void)
+{
+	unsigned int i;
+	char name[16];
+	size_t size;
+
+	static_assert(MTECOMP_NUM_CACHES <= (MTE_CACHE_ID_MASK + 1));
+	for (i = 0; i < MTECOMP_NUM_CACHES; i++) {
+		size = mte_cache_id_to_size(i);
+		snprintf(name, sizeof(name), "mte-tags-%zu", size);
+		mtecomp_caches[i] =
+			kmem_cache_create(name, size, size, 0, NULL);
+	}
+	mtecomp_debugfs_init();
+	return 0;
+}
+module_init(mtecomp_init);
diff --git a/arch/arm64/mm/mtecomp.h b/arch/arm64/mm/mtecomp.h
new file mode 100644
index 0000000000000..b94cf0384f2af
--- /dev/null
+++ b/arch/arm64/mm/mtecomp.h
@@ -0,0 +1,12 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef ARCH_ARM64_MM_MTECOMP_H_
+#define ARCH_ARM64_MM_MTECOMP_H_
+
+/* Functions exported from mtecomp.c for test_mtecomp.c. */
+void mte_tags_to_ranges(u8 *tags, u8 *out_tags, unsigned short *out_sizes,
+			size_t *out_len);
+void mte_ranges_to_tags(u8 *r_tags, unsigned short *r_sizes, size_t r_len,
+			u8 *tags);
+
+#endif  // ARCH_ARM64_MM_MTECOMP_H_
-- 
2.42.0.609.gbb76f46606-goog


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v6 3/5] arm64: mte: implement CONFIG_ARM64_MTE_COMP
@ 2023-10-06 13:45   ` Alexander Potapenko
  0 siblings, 0 replies; 28+ messages in thread
From: Alexander Potapenko @ 2023-10-06 13:45 UTC (permalink / raw)
  To: glider, catalin.marinas, will, pcc, andreyknvl,
	andriy.shevchenko, aleksander.lobakin, linux, yury.norov
  Cc: linux-kernel, linux-arm-kernel, eugenis, syednwaris, william.gray

The config implements the algorithm compressing memory tags for ARM MTE
during swapping.

The algorithm is based on RLE and specifically targets buffers of tags
corresponding to a single page. In the common case a buffer can be
compressed into 63 bits, making it possible to store it without
additional memory allocation.

Suggested-by: Evgenii Stepanov <eugenis@google.com>
Signed-off-by: Alexander Potapenko <glider@google.com>

---
 v6:
  - shuffle bits in inline handles so that they can't be confused with
    canonical pointers;
  - use kmem_cache_zalloc() to allocate compressed storage
  - correctly handle range size overflow
  - minor documentation fixes, clarify the special cases

 v5:
  - make code PAGE_SIZE-agnostic, remove hardcoded constants, updated
    the docs
  - implement debugfs interface
  - Address comments by Andy Shevchenko:
    - update description of mtecomp.c
    - remove redundant assignments, simplify mte_tags_to_ranges()
    - various code simplifications
    - introduce mtecomp.h
    - add :export: to Documentation/arch/arm64/mte-tag-compression.rst

 v4:
  - Addressed comments by Andy Shevchenko:
    - expanded "MTE" to "Memory Tagging Extension" in Kconfig
    - fixed kernel-doc comments, moved them to C source
    - changed variables to unsigned where applicable
    - some code simplifications, fewer unnecessary assignments
    - added the mte_largest_idx_bits() helper
    - added namespace prefixes to all functions
    - added missing headers (but removed bits.h)
  - Addressed comments by Yury Norov:
    - removed test-only functions from mtecomp.h
    - dropped the algorithm name (all functions are now prefixed with
      "mte")
    - added more comments
    - got rid of MTE_RANGES_INLINE
    - renamed bitmap_{get,set}_value() to bitmap_{read,write}()
    - moved the big comment explaining the algorithm to
      Documentation/arch/arm64/mte-tag-compression.rst, expanded it,
      add a link to it from Documentation/arch/arm64/index.rst
    - removed hardcoded ranges from mte_alloc_size()/mte_size_to_ranges()

 v3:
  - Addressed comments by Andy Shevchenko:
   - use bitmap_{set,get}_value() written by Syed Nayyar Waris
   - switched to unsigned long everywhere (fewer casts)
   - simplified the code, removed redundant checks
   - dropped ea0_compress_inline()
 - added bit size constants and helpers to access the bitmap
 - explicitly initialize all compressed sizes in ea0_compress_to_buf()
 - initialize all handle bits

 v2:
  - as suggested by Yury Norov, switched from struct bitq (which is
    not needed anymore) to <linux/bitmap.h>
  - add missing symbol exports
---
 Documentation/arch/arm64/index.rst            |   1 +
 .../arch/arm64/mte-tag-compression.rst        | 266 +++++++++
 arch/arm64/Kconfig                            |  11 +
 arch/arm64/include/asm/mtecomp.h              |  13 +
 arch/arm64/mm/Makefile                        |   1 +
 arch/arm64/mm/mtecomp.c                       | 524 ++++++++++++++++++
 arch/arm64/mm/mtecomp.h                       |  12 +
 7 files changed, 828 insertions(+)
 create mode 100644 Documentation/arch/arm64/mte-tag-compression.rst
 create mode 100644 arch/arm64/include/asm/mtecomp.h
 create mode 100644 arch/arm64/mm/mtecomp.c
 create mode 100644 arch/arm64/mm/mtecomp.h

diff --git a/Documentation/arch/arm64/index.rst b/Documentation/arch/arm64/index.rst
index d08e924204bf1..bf6c1583233a9 100644
--- a/Documentation/arch/arm64/index.rst
+++ b/Documentation/arch/arm64/index.rst
@@ -19,6 +19,7 @@ ARM64 Architecture
     legacy_instructions
     memory
     memory-tagging-extension
+    mte-tag-compression
     perf
     pointer-authentication
     ptdump
diff --git a/Documentation/arch/arm64/mte-tag-compression.rst b/Documentation/arch/arm64/mte-tag-compression.rst
new file mode 100644
index 0000000000000..05022c3f67d74
--- /dev/null
+++ b/Documentation/arch/arm64/mte-tag-compression.rst
@@ -0,0 +1,266 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================================================
+Tag Compression for Memory Tagging Extension (MTE)
+==================================================
+
+This document describes the algorithm used to compress memory tags for the
+ARM Memory Tagging Extension (MTE).
+
+Introduction
+============
+
+MTE assigns tags to memory pages: for 4K pages those tags occupy 128 bytes
+(256 4-bit tags each corresponding to a 16-byte MTE granule), for 16K pages -
+512 bytes, for 64K pages - 2048 bytes. By default, MTE carves out 3.125% (1/32)
+of the available physical memory to store the tags.
+
+When MTE pages are saved to swap, their tags need to be stored in the kernel
+memory. If the system swap is used heavily, these tags may take a substantial
+portion of the physical memory. To reduce memory waste,
+``CONFIG_ARM64_MTE_COMP`` allows the kernel to store the tags in compressed
+form.
+
+Implementation details
+======================
+
+The algorithm attempts to compress an array of ``MTE_PAGE_TAG_STORAGE``
+tag bytes into a byte sequence that can be stored in one of the smaller size
+class allocations (for 4K pages those are 16-, 32-, or 64-byte allocations).
+A special case is storing the tags inline in an 8-byte pointer.
+
+Tag manipulation and storage
+----------------------------
+
+Tags for swapped pages are stored in an XArray that maps swap entries to 63-bit
+values (see ``arch/arm64/mm/mteswap.c``). In the case when
+``CONFIG_ARM64_MTE_COMP=n``, these values contain pointers to uncompressed
+buffers allocated with kmalloc(). Otherwise, they are 63-bit handles used by the
+functions declared in ``arch/arm64/include/asm/mtecomp.h``:
+
+- mte_compress() compresses the given ``MTE_PAGE_TAG_STORAGE``-byte ``tags``
+  buffer, allocates storage for it, and returns an opaque handle addressing
+  that storage;
+- mte_decompress() decompresses the tags addressed by ``handle``
+  and fills the ``MTE_PAGE_TAG_STORAGE``-byte ``tags`` buffer;
+- mte_release_handle() releases the storage handle returned by
+  mte_compress() (so that this handle cannot be used anymore);
+- mte_storage_size() calculates the size occupied by the tags addressed
+  by ``handle``.
+
+Depending on the size of compressed data, mte_compress() stores it in one of
+the size classes backed by kmem caches: ``mte-tags-{16,32,64,128}`` for the
+4K-page case (``mte-tags-128`` being used for the data that cannot be compressed
+into 64 bytes and is stored uncompressed).
+A practical common case allows the tags to be compressed into 8 bytes - then
+they are stored in the handle itself.
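+
+A minimal calling sequence might look as follows (an illustration only: error
+handling and the swap bookkeeping performed by the real consumer,
+``arch/arm64/mm/mteswap.c``, are omitted)::
+
+  u8 tags[MTE_PAGE_TAG_STORAGE];
+  unsigned long handle;
+  bool ok;
+
+  /* tags[] was filled from the page being swapped out. */
+  handle = mte_compress(tags);
+  /* ... the handle is kept (e.g. in an XArray) while the page is out ... */
+  ok = mte_decompress(handle, tags);
+  /* On success tags[] again holds the original MTE_PAGE_TAG_STORAGE bytes. */
+  mte_release_handle(handle);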
+
+Handle format
+-------------
+
+The handle returned by mte_compress() is an ``unsigned long`` that has its
+bit 63 set to 0 (XArray entries must not exceed ``LONG_MAX``)::
+
+   63  62    60  ...   2         0
+  +---+--------+-----+------------+
+  | 0 | INLINE | ... |  SIZE_LOG  |
+  +---+--------+-----+------------+
+
+Bits ``62..60`` are the inline/out-of-line marker: if they are all set to 1, the
+data is stored out-of-line in the buffer pointed to by
+``(handle | BIT(63)) & ~7UL``. Otherwise, the data is stored inline in the
+handle itself.
+
+Bits ``2..0`` denote the size for out-of-line allocations::
+
+  size = 16 << (handle & 0b111)
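+
+Put together, decoding a handle (a sketch mirroring mte_storage_size() and the
+internal helpers in ``arch/arm64/mm/mtecomp.c``) boils down to::
+
+  is_inline = (handle & GENMASK_ULL(62, 60)) != GENMASK_ULL(62, 60);
+  size      = is_inline ? 8 : 16 << (handle & GENMASK_ULL(2, 0));
+  storage   = is_inline ? NULL
+                        : (void *)((handle & ~GENMASK_ULL(2, 0)) | BIT_ULL(63));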
+
+
+Tag compression
+---------------
+
+The compression algorithm is a variation of RLE (run-length encoding) and works
+as follows (we'll be considering 4K pages and 128-byte tag buffers, but the same
+approach scales to 16- and 64K pages):
+
+1. The input array of 128 (``MTE_PAGE_TAG_STORAGE``) bytes is transformed into
+   tag ranges (two arrays: ``r_tags[]`` containing tag values and ``r_sizes[]``
+   containing range lengths) by mte_tags_to_ranges(). Note that
+   ``r_sizes[]`` sums up to 256 (``MTE_GRANULES_PER_PAGE``).
+
+   If ``r_sizes[]`` consists of a single element (``MTE_GRANULES_PER_PAGE``),
+   the corresponding range is split into two halves, i.e.::
+
+     r_sizes_new[2] = { MTE_GRANULES_PER_PAGE / 2, MTE_GRANULES_PER_PAGE / 2 };
+     r_tags_new[2] = { r_tags[0], r_tags[0] };
+
+2. The index of the largest element of ``r_sizes[]`` is stored in
+   ``largest_idx``. The element itself is dropped from ``r_sizes[]``, because
+   it can be reconstructed from the sum of the remaining elements (see the
+   worked example after this list). Note that now none of the remaining
+   ``r_sizes[]`` elements exceeds ``MTE_GRANULES_PER_PAGE / 2``.
+
+3. Depending on the number ``N`` of ranges, a storage class is picked::
+
+           N <= 6:  8 bytes (inline case, no allocation required);
+      6 < N <= 11: 16 bytes
+     11 < N <= 23: 32 bytes
+     23 < N <= 46: 64 bytes
+     46 < N:       128 bytes (no compression will be performed)
+
+   (See `Why these numbers?`_ below).
+
+4. For the inline case, the following values are stored packed in the 8-byte
+   handle (``i<size>`` means a ``<size>``-bit unsigned integer) treated as a
+   bitmap (see ``include/linux/bitmap.h``)::
+
+      largest_idx : i4
+     r_tags[0..5] : i4 x 6
+    r_sizes[0..4] : i7 x 5
+
+   (if N is less than 6, ``r_tags`` and ``r_sizes`` are padded up with zero
+   values)
+
+   Bitmaps are written from lower bits to higher ones, so the order of the
+   records in the actual 8-byte handle is reversed. Therefore an additional
+   transformation is applied to bring largest_idx to bits ``63..60``.
+   Because ``largest_idx`` is <= 5, bit 63 of the handle is now always 0 (so the
+   handle can be stored in an Xarray), and bits ``62..60`` cannot all be 1 (so
+   the handle can be distinguished from a kernel pointer).
+
+5. For the out-of-line case, the storage is allocated from one of the
+   ``mte-tags-{16,32,64,128}`` kmem caches. The resulting pointer is aligned
+   on 8 bytes, so its bits 2..0 can be used to store the size class (see above).
+
+   Bit 63 of the pointer is zeroed out, so that it can be stored in XArray.
+
+6. The data layout in the allocated storage is as follows::
+
+        largest_idx : i6
+       r_tags[0..N] : i4 x N
+    r_sizes[0..N-1] : i7 x (N-1)
+
+   Range size of ``MTE_GRANULES_PER_PAGE / 2`` (at most one) does not fit into
+   i7 and will be written as 0. This case is handled separately by the
+   decompressing procedure.
+
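+As a worked example of steps 1-4, consider a 4K page whose first three tag
+bytes are ``0xa0``, ``0x0a``, ``0xab``, followed by zeroes. Step 1 produces
+five ranges::
+
+  r_tags[]  = { 0xa, 0x0, 0xa, 0xb, 0x0 }
+  r_sizes[] = { 1,   2,   2,   1,   250 }
+
+The largest range is the last one, so ``largest_idx = 4`` and its size (250) is
+dropped. With ``N = 5`` ranges the payload (one 4-bit ``largest_idx``, five
+4-bit tags and four 7-bit sizes, 52 bits in total) fits into the 8-byte inline
+storage class.
+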
+Tag decompression
+-----------------
+
+The decompression algorithm performs the steps below.
+
+1. Decide if data is stored inline (bits ``62..60`` of the handle ``!= 0b111``)
+   or out-of-line.
+
+2. For the inline case:
+
+   1. Move the top four bits back to the least significant bits of the handle.
+   2. Treat the handle itself as the input buffer.
+
+3. For the out-of-line case, look at bits ``2..0`` of the handle to understand
+   the input buffer length. To obtain the pointer to the input buffer, unset
+   bits ``2..0`` of the handle and set bit ``63``.
+
+4. If the input buffer is 128 bytes long, copy its contents to the output
+   buffer.
+
+5. Otherwise, read ``largest_idx``, ``r_tags[]`` and ``r_sizes[]`` from the
+   input buffer.
+
+   If ``largest_idx`` is zero, and all ``r_sizes[]`` are zero, set
+   ``r_sizes[0] = MTE_GRANULES_PER_PAGE / 2``.
+
+   Calculate the removed largest element of ``r_sizes[]`` as
+   ``largest = 256 - sum(r_sizes)`` and insert it into ``r_sizes`` at
+   position ``largest_idx``.
+
+6. For each ``r_sizes[i] > 0``, add a 4-bit value ``r_tags[i]`` to the output
+   buffer ``r_sizes[i]`` times.
+
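+For example, a page tagged entirely with ``0xa`` is compressed as two ranges of
+``MTE_GRANULES_PER_PAGE / 2`` tags (see the splitting rule in step 1 of the
+compression algorithm). On decompression, ``largest_idx`` reads back as 0 and
+the first stored size as 0, so that size is set to 128; the omitted largest
+range is reconstructed as ``256 - 128 = 128``, and 256 granules of tag ``0xa``
+are written to the output buffer.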
+
+Why these numbers?
+------------------
+
+To be able to reconstruct N tag ranges from the compressed data, we need to
+store ``largest_idx``, ``r_tags[N]``, and ``r_sizes[N-1]``. Knowing that the
+sizes do not exceed ``MTE_PAGE_TAG_STORAGE``, those can be packed into
+``S = ilog2(MTE_PAGE_TAG_STORAGE)`` bits, whereas a single tag occupies
+4 bits, and ``largest_idx`` cannot take more than
+``Lmax = ilog2(MTE_GRANULES_PER_PAGE)`` bits.
+
+Now, for each ``B``-byte size class it is possible to find the maximal number
+``M`` such that ``Lmax + 4 * M + S * (M - 1) <= 8 * B``,
+i.e. ``M = (8 * B - 1) / 11``::
+
+ 4K pages: S = 7
+ +-------------+----+--------------+
+ | Buffer size |  M | Storage bits |
+ +-------------+----+--------------+
+ |          8  |  5 |          56  |
+ |         16  | 11 |         122  |
+ |         32  | 23 |         254  |
+ |         64  | 46 |         507  |
+ +-------------+----+--------------+
+
+Notice that ``M`` (and therefore ``largest_idx``) actually always fits into
+6 bits. For the inline case it is even guaranteed to fit into 3 bits, which
+lets us squeeze an extra range into an 8-byte buffer. Because the inline case
+requires bit 63 of the handle to be zero, we widen ``largest_idx`` by that one
+bit, knowing it will never be set.
+
+For the revised ``largest_idx`` sizes, we now pick the maximal number ``N``
+such that ``L + 4 * N + S * (N - 1) <= 8 * B``, where ``L = 4`` in the inline
+case and ``L = 6`` otherwise.
+In other words, ``N = (8 * B + S - L) / (4 + S)``, therefore::
+
+  4K pages: S = 7, L_i = 4, L_o = 6
+  +-------------+----+--------------+
+  | Buffer size |  N | Storage bits |
+  +-------------+----+--------------+
+  |          8  |  6 |          63  |
+  |         16  | 11 |         120  |
+  |         32  | 23 |         252  |
+  |         64  | 46 |         505  |
+  +-------------+----+--------------+
+
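+For instance, the 16-byte row follows from ``N = (8 * 16 + 7 - 6) / (4 + 7) =
+129 / 11 = 11``, and such a buffer uses ``6 + 11 * 4 + 10 * 7 = 120`` bits of
+storage.
+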
+Similarly, for other page sizes::
+
+  16K pages: S = 9, L_i = 4, L_o = 8
+  +-------------+-----+--------------+
+  | Buffer size |  N  | Storage bits |
+  +-------------+-----+--------------+
+  |          8  |   5 |          60  |
+  |         16  |   9 |         116  |
+  |         32  |  19 |         246  |
+  |         64  |  39 |         506  |
+  |        128  |  78 |        1013  |
+  |        256  | 157 |        2040  |
+  +-------------+-----+--------------+
+
+  64K pages: S = 11, L_i = 4, L_o = 10
+  +-------------+-----+--------------+
+  | Buffer size |  N  | Storage bits |
+  +-------------+-----+--------------+
+  |          8  |   4 |          53  |
+  |         16  |   8 |         119  |
+  |         32  |  17 |         254  |
+  |         64  |  34 |         509  |
+  |        128  |  68 |        1019  |
+  |        256  | 136 |        2039  |
+  |        512  | 273 |        4094  |
+  |       1024  | 546 |        8189  |
+  +-------------+-----+--------------+
+
+
+Note
+----
+
+Tag compression and decompression implicitly rely on the fixed MTE tag size
+(4 bits) and number of tags per page. Should these values change, the algorithm
+may need to be revised.
+
+
+Programming Interface
+=====================
+
+.. kernel-doc:: arch/arm64/mm/mtecomp.c
+   :export:
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index a2511b30d0f67..d4fb3b8d11d77 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -2093,6 +2093,17 @@ config ARM64_EPAN
 	  if the cpu does not implement the feature.
 endmenu # "ARMv8.7 architectural features"
 
+config ARM64_MTE_COMP
+	bool "Tag compression for ARM64 Memory Tagging Extension"
+	default y
+	depends on ARM64_MTE
+	help
+	  Enable tag compression support for ARM64 Memory Tagging Extension.
+
+	  Tag buffers corresponding to swapped RAM pages are compressed using
+	  RLE to conserve heap memory. In the common case compressed tags
+	  occupy 2.5x less memory.
+
 config ARM64_SVE
 	bool "ARM Scalable Vector Extension support"
 	default y
diff --git a/arch/arm64/include/asm/mtecomp.h b/arch/arm64/include/asm/mtecomp.h
new file mode 100644
index 0000000000000..71552bc429882
--- /dev/null
+++ b/arch/arm64/include/asm/mtecomp.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef __ASM_MTECOMP_H
+#define __ASM_MTECOMP_H
+
+#include <linux/types.h>
+
+unsigned long mte_compress(u8 *tags);
+bool mte_decompress(unsigned long handle, u8 *tags);
+void mte_release_handle(unsigned long handle);
+size_t mte_storage_size(unsigned long handle);
+
+#endif // __ASM_MTECOMP_H
diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
index dbd1bc95967d0..46778f6dd83c2 100644
--- a/arch/arm64/mm/Makefile
+++ b/arch/arm64/mm/Makefile
@@ -10,6 +10,7 @@ obj-$(CONFIG_TRANS_TABLE)	+= trans_pgd.o
 obj-$(CONFIG_TRANS_TABLE)	+= trans_pgd-asm.o
 obj-$(CONFIG_DEBUG_VIRTUAL)	+= physaddr.o
 obj-$(CONFIG_ARM64_MTE)		+= mteswap.o
+obj-$(CONFIG_ARM64_MTE_COMP)	+= mtecomp.o
 KASAN_SANITIZE_physaddr.o	+= n
 
 obj-$(CONFIG_KASAN)		+= kasan_init.o
diff --git a/arch/arm64/mm/mtecomp.c b/arch/arm64/mm/mtecomp.c
new file mode 100644
index 0000000000000..73e319f1f30da
--- /dev/null
+++ b/arch/arm64/mm/mtecomp.c
@@ -0,0 +1,524 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+/*
+ * MTE tag compression algorithm.
+ * See Documentation/arch/arm64/mte-tag-compression.rst for more details.
+ */
+
+#include <linux/bits.h>
+#include <linux/bitmap.h>
+#include <linux/bitops.h>
+#include <linux/debugfs.h>
+#include <linux/export.h>
+#include <linux/gfp.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/string.h>
+#include <linux/types.h>
+
+#include <asm/mtecomp.h>
+
+#include "mtecomp.h"
+
+/* The handle must fit into an Xarray value. */
+#define MTE_HANDLE_MASK GENMASK_ULL(62, 0)
+
+/* Out-of-line handles have 0b111 in bits 62..60. */
+#define MTE_NOINLINE_MASK GENMASK_ULL(62, 60)
+
+/* Cache index is stored in the lowest pointer bits. */
+#define MTE_CACHE_ID_MASK GENMASK_ULL(2, 0)
+
+/* Caches start at mte-tags-16 and go up to mte-tags-MTE_PAGE_TAG_STORAGE. */
+#define MTECOMP_NUM_CACHES ilog2(MTE_PAGE_TAG_STORAGE / 8)
+static struct kmem_cache *mtecomp_caches[MTECOMP_NUM_CACHES];
+/*
+ * [0] - store the numbers of created/released inline handles;
+ * [1..MTECOMP_NUM_CACHES] - store the number of allocations/deallocations from
+ *                           mtecomp_caches.
+ */
+static atomic_long_t alloc_counters[MTECOMP_NUM_CACHES + 1];
+static atomic_long_t dealloc_counters[MTECOMP_NUM_CACHES + 1];
+
+/*
+ * The largest number of ranges for which the compressed data fits into
+ * 63 bits (the inline case) can be encoded with 4 bits.
+ */
+#define MTE_BITS_PER_LARGEST_IDX_INLINE 4
+/*
+ * In the worst case every tag is different, so the largest index can be up to
+ * MTE_GRANULES_PER_PAGE.
+ */
+#define MTE_BITS_PER_LARGEST_IDX ilog2(MTE_GRANULES_PER_PAGE)
+/* Range size cannot exceed MTE_GRANULES_PER_PAGE / 2. */
+#define MTE_BITS_PER_SIZE (MTE_BITS_PER_LARGEST_IDX - 1)
+
+/* Translate allocation size into mtecomp_caches[] index. */
+static unsigned int mte_size_to_cache_id(size_t len)
+{
+	return fls(len) - 5;
+}
+
+/* Translate mtecomp_caches[] index into allocation size. */
+static size_t mte_cache_id_to_size(unsigned int id)
+{
+	return 16 << id;
+}
+
+/**
+ * mte_tags_to_ranges() - break @tags into arrays of tag ranges.
+ * @tags: MTE_GRANULES_PER_PAGE-byte array containing MTE tags.
+ * @out_tags: u8 array to store the tag of every range.
+ * @out_sizes: unsigned short array to store the size of every range.
+ * @out_len: input/output parameter: on input, the length of @out_tags and
+ *           @out_sizes; on output, the number of ranges written to them.
+ */
+void mte_tags_to_ranges(u8 *tags, u8 *out_tags, unsigned short *out_sizes,
+			size_t *out_len)
+{
+	u8 prev_tag = tags[0] / 16; /* First tag in the array. */
+	unsigned int cur_idx = 0, i, j;
+	u8 cur_tag;
+
+	memset(out_tags, 0, array_size(*out_len, sizeof(*out_tags)));
+	memset(out_sizes, 0, array_size(*out_len, sizeof(*out_sizes)));
+
+	out_tags[cur_idx] = prev_tag;
+	for (i = 0; i < MTE_GRANULES_PER_PAGE; i++) {
+		j = i % 2;
+		cur_tag = j ? (tags[i / 2] % 16) : (tags[i / 2] / 16);
+		if (cur_tag == prev_tag) {
+			out_sizes[cur_idx]++;
+		} else {
+			cur_idx++;
+			prev_tag = cur_tag;
+			out_tags[cur_idx] = prev_tag;
+			out_sizes[cur_idx] = 1;
+		}
+	}
+	*out_len = cur_idx + 1;
+}
+EXPORT_SYMBOL_NS(mte_tags_to_ranges, MTECOMP);
+
+/**
+ * mte_ranges_to_tags() - fill @tags using given tag ranges.
+ * @r_tags: u8[] containing the tag of every range.
+ * @r_sizes: unsigned short[] containing the size of every range.
+ * @r_len: length of @r_tags and @r_sizes.
+ * @tags: MTE_GRANULES_PER_PAGE-byte array to write the tags to.
+ */
+void mte_ranges_to_tags(u8 *r_tags, unsigned short *r_sizes, size_t r_len,
+			u8 *tags)
+{
+	unsigned int i, j, pos = 0;
+	u8 prev;
+
+	for (i = 0; i < r_len; i++) {
+		for (j = 0; j < r_sizes[i]; j++) {
+			if (pos % 2)
+				tags[pos / 2] = (prev << 4) | r_tags[i];
+			else
+				prev = r_tags[i];
+			pos++;
+		}
+	}
+}
+EXPORT_SYMBOL_NS(mte_ranges_to_tags, MTECOMP);
+
+/*
+ * Translate allocation size into maximum number of ranges that it can hold.
+ *
+ * It is the biggest number N such that:
+ *   MTE_BITS_PER_LARGEST_IDX_INLINE + MTE_TAG_SIZE * N
+ *                                   + MTE_BITS_PER_SIZE * (N-1) <= 63 bits,
+ * for the inline case, or
+ *   MTE_BITS_PER_LARGEST_IDX
+ *     + MTE_TAG_SIZE * N + MTE_BITS_PER_SIZE * (N-1) <= tag array size in bits,
+ * for the out-of-line case.
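+ *
+ * E.g. for 4K pages (MTE_TAG_SIZE == 4, MTE_BITS_PER_SIZE == 7) this yields
+ * 6 ranges for the 8-byte inline buffer and 11, 23 and 46 ranges for the
+ * 16-, 32- and 64-byte buffers respectively.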
+ */
+static size_t mte_size_to_ranges(size_t size)
+{
+	size_t largest_bits;
+
+	largest_bits = (size == 8) ? MTE_BITS_PER_LARGEST_IDX_INLINE :
+				     MTE_BITS_PER_LARGEST_IDX;
+	return (size * 8 + MTE_BITS_PER_SIZE - largest_bits) /
+	       (MTE_TAG_SIZE + MTE_BITS_PER_SIZE);
+}
+
+/* Translate @num_ranges into the allocation size needed to hold them. */
+static size_t mte_alloc_size(unsigned int num_ranges)
+{
+	size_t size = 8;
+
+	while (size < (1 << MTE_BITS_PER_SIZE)) {
+		if (num_ranges <= mte_size_to_ranges(size))
+			return size;
+		size <<= 1;
+	}
+	return size;
+}
+
+/* Is the data stored inline in the handle itself? */
+static bool mte_is_inline(unsigned long handle)
+{
+	return (handle & MTE_NOINLINE_MASK) != MTE_NOINLINE_MASK;
+}
+
+/**
+ * mte_storage_size() - calculate the memory occupied by compressed tags.
+ * @handle: storage handle returned by mte_compress.
+ *
+ * Returns: size of the storage used for @handle.
+ */
+size_t mte_storage_size(unsigned long handle)
+{
+	if (mte_is_inline(handle))
+		return 8;
+	return mte_cache_id_to_size(handle & MTE_CACHE_ID_MASK);
+}
+EXPORT_SYMBOL_NS(mte_storage_size, MTECOMP);
+
+static void mte_bitmap_write(unsigned long *bitmap, unsigned long value,
+			     unsigned long *pos, unsigned long bits)
+{
+	bitmap_write(bitmap, value, *pos, bits);
+	*pos += bits;
+}
+
+static inline unsigned long mte_largest_idx_bits(size_t size)
+{
+	if (size == 8)
+		return MTE_BITS_PER_LARGEST_IDX_INLINE;
+	return MTE_BITS_PER_LARGEST_IDX;
+}
+
+/* Compress ranges into the buffer that can accommodate up to max_ranges. */
+static void mte_compress_to_buf(size_t len, u8 *tags, unsigned short *sizes,
+				unsigned long *bitmap, size_t size)
+{
+	unsigned long bit_pos = 0, l_bits;
+	unsigned int largest_idx, i;
+	unsigned short largest = 0;
+	size_t max_ranges;
+
+	for (i = 0; i < len; i++) {
+		if (sizes[i] > largest) {
+			largest = sizes[i];
+			largest_idx = i;
+		}
+	}
+	l_bits = mte_largest_idx_bits(size);
+	max_ranges = mte_size_to_ranges(size);
+	mte_bitmap_write(bitmap, largest_idx, &bit_pos, l_bits);
+	for (i = 0; i < len; i++)
+		mte_bitmap_write(bitmap, tags[i], &bit_pos, MTE_TAG_SIZE);
+	if (len == 1) {
+		/*
+		 * We are compressing MTE_GRANULES_PER_PAGE of identical tags.
+		 * Split it into two ranges containing
+		 * MTE_GRANULES_PER_PAGE / 2 tags, so that it falls into the
+		 * special case described below.
+		 */
+		mte_bitmap_write(bitmap, tags[0], &bit_pos, MTE_TAG_SIZE);
+		i = 2;
+	} else {
+		i = len;
+	}
+	for (; i < max_ranges; i++)
+		mte_bitmap_write(bitmap, 0, &bit_pos, MTE_TAG_SIZE);
+	/*
+	 * Most of the time sizes[i] fits into MTE_BITS_PER_SIZE, apart from a
+	 * special case when:
+	 *   len = 2;
+	 *   sizes = { MTE_GRANULES_PER_PAGE / 2, MTE_GRANULES_PER_PAGE / 2};
+	 * In this case largest_idx will be set to 0, and the size written to
+	 * the bitmap will be also 0.
+	 */
+	for (i = 0; i < len; i++) {
+		if (i != largest_idx)
+			mte_bitmap_write(bitmap, sizes[i], &bit_pos,
+					 MTE_BITS_PER_SIZE);
+	}
+	for (i = len; i < max_ranges; i++)
+		mte_bitmap_write(bitmap, 0, &bit_pos, MTE_BITS_PER_SIZE);
+}
+
+/*
+ * When storing tags inline in the 64-bit pointer, we want to be able to
+ * distinguish the resulting values from pointers to out-of-line heap buffers.
+ * Assuming the latter are canonical kernel pointers, we want largest_idx to be
+ * stored in the top bits of the handle, whereas it is easier for
+ * mte_compress_to_buf() to store them at the bottom.
+ *
+ * mte_inline_bits_to_handle() rearranges the bits in a 64-bit inline storage so
+ * that it can never be confused with a canonical pointer.
+ * mte_handle_to_inline_bits() performs the reverse transformation.
+ * In the future these can be extended to account for tagged kernel pointers.
+ */
+static unsigned long mte_inline_bits_to_handle(unsigned long bits)
+{
+	return ror64(bits, MTE_BITS_PER_LARGEST_IDX_INLINE);
+}
+
+static unsigned long mte_handle_to_inline_bits(unsigned long handle)
+{
+	return rol64(handle, MTE_BITS_PER_LARGEST_IDX_INLINE);
+}
+
+/**
+ * mte_compress() - compress the given tag array.
+ * @tags: MTE_GRANULES_PER_PAGE-byte array to read the tags from.
+ *
+ * Compresses the tags and returns a 64-bit opaque handle pointing to the
+ * tag storage. May allocate memory, which is freed by mte_release_handle().
+ *
+ * Returns: 64-bit tag storage handle.
+ */
+unsigned long mte_compress(u8 *tags)
+{
+	struct kmem_cache *cache;
+	unsigned short *r_sizes;
+	unsigned long *storage;
+	unsigned int cache_id;
+	size_t alloc_size;
+	u8 *r_tags;
+	size_t r_len;
+	/*
+	 * mte_compress_to_buf() only initializes the bits that mte_decompress()
+	 * will read. But when the tags are stored in the handle itself, it must
+	 * have all its bits initialized.
+	 */
+	unsigned long result = 0;
+
+	r_sizes = kmalloc_array(MTE_GRANULES_PER_PAGE, sizeof(unsigned short),
+				GFP_KERNEL);
+	r_tags = kmalloc(MTE_GRANULES_PER_PAGE, GFP_KERNEL);
+	if (!r_sizes || !r_tags)
+		goto ret;
+	r_len = MTE_GRANULES_PER_PAGE;
+	mte_tags_to_ranges(tags, r_tags, r_sizes, &r_len);
+	alloc_size = mte_alloc_size(r_len);
+	if (alloc_size == 8) {
+		mte_compress_to_buf(r_len, r_tags, r_sizes, &result,
+				    alloc_size);
+		result = mte_inline_bits_to_handle(result);
+		atomic_long_inc(&alloc_counters[0]);
+		goto ret;
+	}
+	cache_id = mte_size_to_cache_id(alloc_size);
+	cache = mtecomp_caches[cache_id];
+	storage = kmem_cache_zalloc(cache, GFP_KERNEL);
+	atomic_long_inc(&alloc_counters[cache_id + 1]);
+	if (!storage) {
+		result = 0;
+		goto ret;
+	}
+	if (alloc_size < MTE_PAGE_TAG_STORAGE) {
+		/* alloc_size is always a multiple of sizeof(unsigned long). */
+		mte_compress_to_buf(r_len, r_tags, r_sizes, storage,
+				    alloc_size);
+		result = ((unsigned long)storage | cache_id) & MTE_HANDLE_MASK;
+		goto ret;
+	}
+	memcpy(storage, tags, alloc_size);
+	result = ((unsigned long)storage | cache_id) & MTE_HANDLE_MASK;
+ret:
+	kfree(r_tags);
+	kfree(r_sizes);
+	return result;
+}
+EXPORT_SYMBOL_NS(mte_compress, MTECOMP);
+
+static unsigned long mte_bitmap_read(const unsigned long *bitmap,
+				     unsigned long *pos, unsigned long bits)
+{
+	unsigned long start = *pos;
+
+	*pos += bits;
+	return bitmap_read(bitmap, start, bits);
+}
+
+/* Decompress the contents of the given buffer into @tags. */
+static bool mte_decompress_from_buf(const unsigned long *bitmap, size_t size,
+				    u8 *tags)
+{
+	unsigned long bit_pos = 0, l_bits;
+	unsigned short *r_sizes, sum;
+	unsigned int largest_idx, i;
+	bool result = true;
+	size_t max_ranges;
+	u8 *r_tags;
+
+	max_ranges = mte_size_to_ranges(size);
+	l_bits = mte_largest_idx_bits(size);
+	largest_idx = mte_bitmap_read(bitmap, &bit_pos, l_bits);
+	if (largest_idx >= max_ranges)
+		return false;
+	r_sizes = kmalloc_array(max_ranges, sizeof(unsigned short), GFP_KERNEL);
+	r_tags = kmalloc(max_ranges, GFP_KERNEL);
+	if (!r_sizes || !r_tags) {
+		result = false;
+		goto ret;
+	}
+
+	for (i = 0; i < max_ranges; i++)
+		r_tags[i] = mte_bitmap_read(bitmap, &bit_pos, MTE_TAG_SIZE);
+	for (i = 0, sum = 0; i < max_ranges; i++) {
+		if (i == largest_idx)
+			continue;
+		r_sizes[i] =
+			mte_bitmap_read(bitmap, &bit_pos, MTE_BITS_PER_SIZE);
+		/*
+		 * Special case: tag array consists of two ranges of
+		 * `MTE_GRANULES_PER_PAGE / 2` tags.
+		 */
+		if ((largest_idx == 0) && (i == 1) && (r_sizes[i] == 0))
+			r_sizes[i] = MTE_GRANULES_PER_PAGE / 2;
+		if (!r_sizes[i]) {
+			max_ranges = i;
+			break;
+		}
+		sum += r_sizes[i];
+	}
+	if (sum >= MTE_GRANULES_PER_PAGE) {
+		result = false;
+		goto ret;
+	}
+	r_sizes[largest_idx] = MTE_GRANULES_PER_PAGE - sum;
+	mte_ranges_to_tags(r_tags, r_sizes, max_ranges, tags);
+	result = true;
+ret:
+	kfree(r_sizes);
+	kfree(r_tags);
+	return result;
+}
+
+/* Get pointer to the out-of-line storage from a handle. */
+static void *mte_storage(unsigned long handle)
+{
+	if (mte_is_inline(handle))
+		return NULL;
+	return (void *)((handle & (~MTE_CACHE_ID_MASK)) | BIT_ULL(63));
+}
+
+/**
+ * mte_decompress() - decompress the tag array addressed by the handle.
+ * @handle: handle returned by mte_compress().
+ * @tags: MTE_GRANULES_PER_PAGE-byte array to write the tags to.
+ *
+ * Reads the compressed data and writes it into the user-supplied tag array.
+ *
+ * Returns: true on success, false on error.
+ */
+bool mte_decompress(unsigned long handle, u8 *tags)
+{
+	unsigned long *storage;
+	size_t size;
+
+	if ((handle & MTE_HANDLE_MASK) != handle)
+		return false;
+
+	size = mte_storage_size(handle);
+	storage = mte_storage(handle);
+	switch (size) {
+	case 8:
+		handle = mte_handle_to_inline_bits(handle);
+		return mte_decompress_from_buf(&handle, size, tags);
+	case MTE_PAGE_TAG_STORAGE:
+		memcpy(tags, storage, size);
+		return true;
+	default:
+		return mte_decompress_from_buf(storage, size, tags);
+	}
+}
+EXPORT_SYMBOL_NS(mte_decompress, MTECOMP);
+
+/**
+ * mte_release_handle() - release the handle returned by mte_compress().
+ * @handle: handle returned by mte_compress().
+ */
+void mte_release_handle(unsigned long handle)
+{
+	unsigned int cache_id;
+	struct kmem_cache *c;
+	void *storage;
+	size_t size;
+
+	storage = mte_storage(handle);
+	if (!storage) {
+		atomic_long_inc(&dealloc_counters[0]);
+		return;
+	}
+
+	size = mte_storage_size(handle);
+	cache_id = mte_size_to_cache_id(size);
+	c = mtecomp_caches[cache_id];
+	kmem_cache_free(c, storage);
+	atomic_long_inc(&dealloc_counters[cache_id + 1]);
+}
+EXPORT_SYMBOL_NS(mte_release_handle, MTECOMP);
+
+/* DebugFS interface. */
+static int stats_show(struct seq_file *seq, void *v)
+{
+	unsigned long total_mem_alloc = 0, total_mem_dealloc = 0;
+	unsigned long total_num_alloc = 0, total_num_dealloc = 0;
+	unsigned long size = 8;
+	long alloc, dealloc;
+	int i;
+
+	for (i = 0; i <= MTECOMP_NUM_CACHES; i++) {
+		alloc = atomic_long_read(&alloc_counters[i]);
+		dealloc = atomic_long_read(&dealloc_counters[i]);
+		total_num_alloc += alloc;
+		total_num_dealloc += dealloc;
+		/*
+		 * Do not count 8-byte buffers towards compressed tag storage
+		 * size.
+		 */
+		if (i) {
+			total_mem_alloc += (size * alloc);
+			total_mem_dealloc += (size * dealloc);
+		}
+		seq_printf(seq,
+			   "%lu bytes: %lu allocations, %lu deallocations\n",
+			   size, alloc, dealloc);
+		size <<= 1;
+	}
+	seq_printf(seq, "uncompressed tag storage size: %lu\n",
+		   (total_num_alloc - total_num_dealloc) *
+			   MTE_PAGE_TAG_STORAGE);
+	seq_printf(seq, "compressed tag storage size: %lu\n",
+		   total_mem_alloc - total_mem_dealloc);
+	return 0;
+}
+DEFINE_SHOW_ATTRIBUTE(stats);
+
+static int mtecomp_debugfs_init(void)
+{
+	struct dentry *mtecomp_dir;
+
+	mtecomp_dir = debugfs_create_dir("mtecomp", NULL);
+	debugfs_create_file("stats", 0444, mtecomp_dir, NULL, &stats_fops);
+	return 0;
+}
+
+/* Set up mtecomp_caches[]. */
+static int mtecomp_init(void)
+{
+	unsigned int i;
+	char name[16];
+	size_t size;
+
+	static_assert(MTECOMP_NUM_CACHES <= (MTE_CACHE_ID_MASK + 1));
+	for (i = 0; i < MTECOMP_NUM_CACHES; i++) {
+		size = mte_cache_id_to_size(i);
+		snprintf(name, sizeof(name), "mte-tags-%zu", size);
+		mtecomp_caches[i] =
+			kmem_cache_create(name, size, size, 0, NULL);
+	}
+	mtecomp_debugfs_init();
+	return 0;
+}
+module_init(mtecomp_init);
diff --git a/arch/arm64/mm/mtecomp.h b/arch/arm64/mm/mtecomp.h
new file mode 100644
index 0000000000000..b94cf0384f2af
--- /dev/null
+++ b/arch/arm64/mm/mtecomp.h
@@ -0,0 +1,12 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef ARCH_ARM64_MM_MTECOMP_H_
+#define ARCH_ARM64_MM_MTECOMP_H_
+
+/* Functions exported from mtecomp.c for test_mtecomp.c. */
+void mte_tags_to_ranges(u8 *tags, u8 *out_tags, unsigned short *out_sizes,
+			size_t *out_len);
+void mte_ranges_to_tags(u8 *r_tags, unsigned short *r_sizes, size_t r_len,
+			u8 *tags);
+
+#endif  // ARCH_ARM64_MM_MTECOMP_H_
-- 
2.42.0.609.gbb76f46606-goog


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v6 4/5] arm64: mte: add a test for MTE tags compression
  2023-10-06 13:45 ` Alexander Potapenko
@ 2023-10-06 13:45   ` Alexander Potapenko
  -1 siblings, 0 replies; 28+ messages in thread
From: Alexander Potapenko @ 2023-10-06 13:45 UTC (permalink / raw)
  To: glider, catalin.marinas, will, pcc, andreyknvl,
	andriy.shevchenko, aleksander.lobakin, linux, yury.norov
  Cc: linux-kernel, linux-arm-kernel, eugenis, syednwaris, william.gray

Ensure that tag sequences containing alternating values are compressed
to buffers of expected size and correctly decompressed afterwards.

Signed-off-by: Alexander Potapenko <glider@google.com>

---
 v6:
  - add test_decompress_invalid() to ensure invalid handles are ignored;
  - add test_upper_bits(), which is a regression test for a case where
    an inline handle looked like an out-of-line one;
  - add test_compress_nonzero() to ensure a full nonzero tag array is
    compressed correctly;
  - add test_two_ranges() to test cases when the input buffer is divided
    into two ranges.

 v5:
  - remove hardcoded constants, added test setup/teardown;
  - support 16- and 64K pages;
  - replace nested if-clauses with expected_size_from_ranges();
  - call mte_release_handle() after tests that perform
    compression/decompression;
  - address comments by Andy Shevchenko:
    - fix include order;
    - use mtecomp.h instead of function prototypes.

 v4:
  - addressed comments by Andy Shevchenko:
    - expanded MTE to "Memory Tagging Extension" in Kconfig
    - changed signed variables to unsigned where applicable
    - added missing header dependencies

  - addressed comments by Yury Norov:
    - moved test-only declarations from mtecomp.h into this test
    - switched to the new "mte"-prefixed function names, dropped the
      mentions of "EA0"
    - added test_tag_to_ranges_n()

 v3:
  - addressed comments by Andy Shevchenko in another patch:
   - switched from u64 to unsigned long
   - added MODULE_IMPORT_NS(MTECOMP)
   - fixed includes order
---
 arch/arm64/Kconfig           |  10 +
 arch/arm64/mm/Makefile       |   1 +
 arch/arm64/mm/test_mtecomp.c | 377 +++++++++++++++++++++++++++++++++++
 3 files changed, 388 insertions(+)
 create mode 100644 arch/arm64/mm/test_mtecomp.c

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index d4fb3b8d11d77..ffe3bec89df82 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -2104,6 +2104,16 @@ config ARM64_MTE_COMP
 	  RLE to conserve heap memory. In the common case compressed tags
 	  occupy 2.5x less memory.
 
+config ARM64_MTE_COMP_KUNIT_TEST
+	tristate "Test tag compression for ARM64 Memory Tagging Extension" if !KUNIT_ALL_TESTS
+	default KUNIT_ALL_TESTS
+	depends on KUNIT && ARM64_MTE_COMP
+	help
+	  Test MTE compression algorithm enabled by CONFIG_ARM64_MTE_COMP.
+
+	  Ensure that tag sequences containing alternating values are compressed
+	  to buffers of expected size and correctly decompressed afterwards.
+
 config ARM64_SVE
 	bool "ARM Scalable Vector Extension support"
 	default y
diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
index 46778f6dd83c2..170dc62b010b9 100644
--- a/arch/arm64/mm/Makefile
+++ b/arch/arm64/mm/Makefile
@@ -11,6 +11,7 @@ obj-$(CONFIG_TRANS_TABLE)	+= trans_pgd-asm.o
 obj-$(CONFIG_DEBUG_VIRTUAL)	+= physaddr.o
 obj-$(CONFIG_ARM64_MTE)		+= mteswap.o
 obj-$(CONFIG_ARM64_MTE_COMP)	+= mtecomp.o
+obj-$(CONFIG_ARM64_MTE_COMP_KUNIT_TEST) += test_mtecomp.o
 KASAN_SANITIZE_physaddr.o	+= n
 
 obj-$(CONFIG_KASAN)		+= kasan_init.o
diff --git a/arch/arm64/mm/test_mtecomp.c b/arch/arm64/mm/test_mtecomp.c
new file mode 100644
index 0000000000000..8fe50a214b38c
--- /dev/null
+++ b/arch/arm64/mm/test_mtecomp.c
@@ -0,0 +1,377 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Test cases for MTE tags compression algorithm.
+ */
+
+#include <linux/bits.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/string.h>
+#include <linux/types.h>
+
+#include <kunit/test.h>
+
+#include <asm/mtecomp.h>
+
+#include "mtecomp.h"
+
+/* Per-test storage allocated in mtecomp_test_init(). */
+struct test_data {
+	u8 *tags, *dtags;
+	unsigned short *r_sizes;
+	size_t r_len;
+	u8 *r_tags;
+};
+
+/*
+ * Split td->tags to ranges stored in td->r_tags, td->r_sizes, td->r_len,
+ * then convert those ranges back to tags stored in td->dtags.
+ */
+static void tags_to_ranges_to_tags_helper(struct kunit *test)
+{
+	struct test_data *td = test->priv;
+
+	mte_tags_to_ranges(td->tags, td->r_tags, td->r_sizes, &td->r_len);
+	mte_ranges_to_tags(td->r_tags, td->r_sizes, td->r_len, td->dtags);
+	KUNIT_EXPECT_EQ(test, memcmp(td->tags, td->dtags, MTE_PAGE_TAG_STORAGE),
+			0);
+}
+
+/*
+ * Test that mte_tags_to_ranges() produces a single range for a zero-filled tag
+ * buffer.
+ */
+static void test_tags_to_ranges_zero(struct kunit *test)
+{
+	struct test_data *td = test->priv;
+
+	memset(td->tags, 0, MTE_PAGE_TAG_STORAGE);
+	tags_to_ranges_to_tags_helper(test);
+
+	KUNIT_EXPECT_EQ(test, td->r_len, 1);
+	KUNIT_EXPECT_EQ(test, td->r_tags[0], 0);
+	KUNIT_EXPECT_EQ(test, td->r_sizes[0], MTE_GRANULES_PER_PAGE);
+}
+
+/*
+ * Test that a small number of different tags is correctly transformed into
+ * ranges.
+ */
+static void test_tags_to_ranges_simple(struct kunit *test)
+{
+	struct test_data *td = test->priv;
+	const u8 ex_tags[] = { 0xa, 0x0, 0xa, 0xb, 0x0 };
+	const unsigned short ex_sizes[] = { 1, 2, 2, 1,
+					    MTE_GRANULES_PER_PAGE - 6 };
+
+	memset(td->tags, 0, MTE_PAGE_TAG_STORAGE);
+	td->tags[0] = 0xa0;
+	td->tags[1] = 0x0a;
+	td->tags[2] = 0xab;
+	tags_to_ranges_to_tags_helper(test);
+
+	KUNIT_EXPECT_EQ(test, td->r_len, 5);
+	KUNIT_EXPECT_EQ(test, memcmp(td->r_tags, ex_tags, sizeof(ex_tags)), 0);
+	KUNIT_EXPECT_EQ(test, memcmp(td->r_sizes, ex_sizes, sizeof(ex_sizes)),
+			0);
+}
+
+/*
+ * Test that a buffer filled with the byte 0xa0 produces MTE_GRANULES_PER_PAGE
+ * ranges of length 1.
+ */
+static void test_tags_to_ranges_repeated(struct kunit *test)
+{
+	struct test_data *td = test->priv;
+
+	memset(td->tags, 0xa0, MTE_PAGE_TAG_STORAGE);
+	tags_to_ranges_to_tags_helper(test);
+
+	KUNIT_EXPECT_EQ(test, td->r_len, MTE_GRANULES_PER_PAGE);
+}
+
+/* Generate a buffer that will contain @nranges of tag ranges. */
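+/*
+ * E.g. nranges == 4 yields tag bytes { 0xab, 0xa0, 0x00, ... }, i.e. granule
+ * tags { 0xa, 0xb, 0xa, 0x0, 0x0, ... }: exactly four tag ranges.
+ */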
+static void gen_tag_range_helper(u8 *tags, int nranges)
+{
+	unsigned int i;
+
+	memset(tags, 0, MTE_PAGE_TAG_STORAGE);
+	if (nranges > 1) {
+		nranges--;
+		for (i = 0; i < nranges / 2; i++)
+			tags[i] = 0xab;
+		if (nranges % 2)
+			tags[nranges / 2] = 0xa0;
+	}
+}
+
+/*
+ * Test that mte_tags_to_ranges()/mte_ranges_to_tags() work for various
+ * r_len values.
+ */
+static void test_tag_to_ranges_n(struct kunit *test)
+{
+	struct test_data *td = test->priv;
+	unsigned int i, j, sum;
+
+	for (i = 1; i <= MTE_GRANULES_PER_PAGE; i++) {
+		gen_tag_range_helper(td->tags, i);
+		tags_to_ranges_to_tags_helper(test);
+		sum = 0;
+		for (j = 0; j < td->r_len; j++)
+			sum += td->r_sizes[j];
+		KUNIT_EXPECT_EQ(test, sum, MTE_GRANULES_PER_PAGE);
+	}
+}
+
+/*
+ * Check that the tag buffer in test->priv can be compressed and decompressed
+ * without changes.
+ */
+static unsigned long compress_decompress_helper(struct kunit *test)
+{
+	struct test_data *td = test->priv;
+	unsigned long handle;
+
+	handle = mte_compress(td->tags);
+	KUNIT_EXPECT_EQ(test, handle & BIT_ULL(63), 0);
+	KUNIT_EXPECT_TRUE(test, mte_decompress(handle, td->dtags));
+	KUNIT_EXPECT_EQ(test, memcmp(td->tags, td->dtags, MTE_PAGE_TAG_STORAGE),
+			0);
+	return handle;
+}
+
+/* Test that a zero-filled array is compressed into inline storage. */
+static void test_compress_zero(struct kunit *test)
+{
+	struct test_data *td = test->priv;
+	unsigned long handle;
+
+	memset(td->tags, 0, MTE_PAGE_TAG_STORAGE);
+	handle = compress_decompress_helper(test);
+	/* Tags are stored inline. */
+	KUNIT_EXPECT_EQ(test, mte_storage_size(handle), 8);
+	mte_release_handle(handle);
+}
+
+/* Test that a 0xaa-filled array is compressed into inline storage. */
+static void test_compress_nonzero(struct kunit *test)
+{
+	struct test_data *td = test->priv;
+	unsigned long handle;
+
+	memset(td->tags, 0xaa, MTE_PAGE_TAG_STORAGE);
+	handle = compress_decompress_helper(test);
+	/* Tags are stored inline. */
+	KUNIT_EXPECT_EQ(test, mte_storage_size(handle), 8);
+	mte_release_handle(handle);
+}
+
+/*
+ * Test that two tag ranges are compressed into inline storage.
+ *
+ * This also covers a special case where both ranges contain
+ * `MTE_GRANULES_PER_PAGE / 2` tags and overflow the designated range size.
+ */
+static void test_two_ranges(struct kunit *test)
+{
+	struct test_data *td = test->priv;
+	unsigned long handle;
+	unsigned int i;
+	size_t r_len = 2;
+	unsigned char r_tags[2] = { 0xe, 0x0 };
+	unsigned short r_sizes[2];
+
+	for (i = 1; i < MTE_GRANULES_PER_PAGE; i++) {
+		r_sizes[0] = i;
+		r_sizes[1] = MTE_GRANULES_PER_PAGE - i;
+		mte_ranges_to_tags(r_tags, r_sizes, r_len, td->tags);
+		handle = compress_decompress_helper(test);
+		KUNIT_EXPECT_EQ(test, mte_storage_size(handle), 8);
+	}
+}
+
+/*
+ * Test that a very small number of tag ranges ends up compressed into 8 bytes.
+ */
+static void test_compress_simple(struct kunit *test)
+{
+	struct test_data *td = test->priv;
+	unsigned long handle;
+
+	memset(td->tags, 0, MTE_PAGE_TAG_STORAGE);
+	td->tags[0] = 0xa0;
+	td->tags[1] = 0x0a;
+
+	handle = compress_decompress_helper(test);
+	/* Tags are stored inline. */
+	KUNIT_EXPECT_EQ(test, mte_storage_size(handle), 8);
+	mte_release_handle(handle);
+}
+
+/*
+ * Test that a buffer containing @nranges ranges compresses into @exp_size
+ * bytes and decompresses into the original tag sequence.
+ */
+static void compress_range_helper(struct kunit *test, int nranges,
+				  size_t exp_size)
+{
+	struct test_data *td = test->priv;
+	unsigned long handle;
+
+	gen_tag_range_helper(td->tags, nranges);
+	handle = compress_decompress_helper(test);
+	KUNIT_EXPECT_EQ(test, mte_storage_size(handle), exp_size);
+	mte_release_handle(handle);
+}
+
+static size_t expected_size_from_ranges(unsigned int ranges)
+{
+#if defined CONFIG_ARM64_4K_PAGES
+	unsigned int range_exp[4] = { 6, 11, 23, 46 };
+#elif defined(CONFIG_ARM64_16K_PAGES)
+	unsigned int range_exp[6] = { 5, 9, 19, 39, 78, 157 };
+#elif defined(CONFIG_ARM64_64K_PAGES)
+	unsigned int range_exp[8] = { 4, 8, 17, 34, 68, 136, 273, 546 };
+#endif
+	unsigned int i;
+	size_t size = 8;
+
+	for (i = 0; i < ARRAY_SIZE(range_exp); i++) {
+		if (ranges <= range_exp[i])
+			return size;
+		size <<= 1;
+	}
+	return size;
+}
+
+/*
+ * Test that every number of tag ranges is correctly compressed and
+ * decompressed.
+ */
+static void test_compress_ranges(struct kunit *test)
+{
+	size_t exp_size;
+	unsigned int i;
+
+	for (i = 1; i <= MTE_GRANULES_PER_PAGE; i++) {
+		exp_size = expected_size_from_ranges(i);
+		compress_range_helper(test, i, exp_size);
+	}
+}
+
+/*
+ * Test that invalid handles are ignored by mte_decompress().
+ */
+static void test_decompress_invalid(struct kunit *test)
+{
+	unsigned long handle1 = 0xeb0b0b010080402f;
+	unsigned long handle2 = 0x6b0b0b010080402f;
+	struct test_data *td = test->priv;
+
+	/* handle1 has bit 63 set to 1. */
+	KUNIT_EXPECT_FALSE(test, mte_decompress(handle1, td->dtags));
+	/*
+	 * handle2 is an inline handle, but its largest_idx (bits 60..62)
+	 * is out of bounds for the inline storage.
+	 */
+	KUNIT_EXPECT_FALSE(test, mte_decompress(handle2, td->dtags));
+}
+
+/*
+ * Test that compressed inline tags cannot be confused with out-of-line
+ * pointers.
+ *
+ * Compressed values are written from bit 0 to bit 63, so the size of the last
+ * tag range initially ends up in the upper bits of the inline representation.
+ * Make sure mte_compress() rearranges the bits so that the resulting handle does
+ * not have 0b0111 as the upper four bits.
+ */
+static void test_upper_bits(struct kunit *test)
+{
+	struct test_data *td = test->priv;
+	unsigned long handle;
+	unsigned char r_tags[6] = { 7, 0, 7, 0, 7, 0 };
+	unsigned short r_sizes[6] = { 1, 1, 1, 1, 1, 1 };
+	size_t r_len;
+
+	/* Maximum number of ranges that can be encoded inline. */
+	r_len = IS_ENABLED(CONFIG_ARM64_4K_PAGES)  ? 6 :
+		IS_ENABLED(CONFIG_ARM64_16K_PAGES) ? 5 :
+						     4;
+	/* Maximum range size possible, will be omitted. */
+	r_sizes[0] = MTE_GRANULES_PER_PAGE / 2 - 1;
+	/* A number close to r_sizes[0] that has most of its bits set. */
+	r_sizes[r_len - 1] = MTE_GRANULES_PER_PAGE - r_sizes[0] - r_len + 2;
+
+	mte_ranges_to_tags(r_tags, r_sizes, r_len, td->tags);
+	handle = compress_decompress_helper(test);
+	KUNIT_EXPECT_EQ(test, mte_storage_size(handle), 8);
+}
+
+static void mtecomp_dealloc_testdata(struct test_data *td)
+{
+	kfree(td->tags);
+	kfree(td->dtags);
+	kfree(td->r_sizes);
+	kfree(td->r_tags);
+}
+
+static int mtecomp_test_init(struct kunit *test)
+{
+	struct test_data *td;
+
+	/* Zero-init so the error path below never frees uninitialized pointers. */
+	td = kzalloc(sizeof(struct test_data), GFP_KERNEL);
+	if (!td)
+		return 1;
+	td->tags = kmalloc(MTE_PAGE_TAG_STORAGE, GFP_KERNEL);
+	if (!td->tags)
+		goto error;
+	td->dtags = kmalloc(MTE_PAGE_TAG_STORAGE, GFP_KERNEL);
+	if (!td->dtags)
+		goto error;
+	td->r_len = MTE_GRANULES_PER_PAGE;
+	td->r_sizes = kmalloc_array(MTE_GRANULES_PER_PAGE,
+				    sizeof(unsigned short), GFP_KERNEL);
+	if (!td->r_sizes)
+		goto error;
+	td->r_tags = kmalloc(MTE_GRANULES_PER_PAGE, GFP_KERNEL);
+	if (!td->r_tags)
+		goto error;
+	test->priv = (void *)td;
+	return 0;
+error:
+	mtecomp_dealloc_testdata(td);
+	return -ENOMEM;
+}
+
+static void mtecomp_test_exit(struct kunit *test)
+{
+	struct test_data *td = test->priv;
+
+	mtecomp_dealloc_testdata(td);
+}
+
+static struct kunit_case mtecomp_test_cases[] = {
+	KUNIT_CASE(test_tags_to_ranges_zero),
+	KUNIT_CASE(test_tags_to_ranges_simple),
+	KUNIT_CASE(test_tags_to_ranges_repeated),
+	KUNIT_CASE(test_tag_to_ranges_n),
+	KUNIT_CASE(test_compress_zero),
+	KUNIT_CASE(test_compress_nonzero),
+	KUNIT_CASE(test_two_ranges),
+	KUNIT_CASE(test_compress_simple),
+	KUNIT_CASE(test_compress_ranges),
+	KUNIT_CASE(test_decompress_invalid),
+	KUNIT_CASE(test_upper_bits),
+	{}
+};
+
+static struct kunit_suite mtecomp_test_suite = {
+	.name = "mtecomp",
+	.init = mtecomp_test_init,
+	.exit = mtecomp_test_exit,
+	.test_cases = mtecomp_test_cases,
+};
+kunit_test_suites(&mtecomp_test_suite);
+
+MODULE_IMPORT_NS(MTECOMP);
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Alexander Potapenko <glider@google.com>");
-- 
2.42.0.609.gbb76f46606-goog


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v6 5/5] arm64: mte: add compression support to mteswap.c
  2023-10-06 13:45 ` Alexander Potapenko
@ 2023-10-06 13:45   ` Alexander Potapenko
  -1 siblings, 0 replies; 28+ messages in thread
From: Alexander Potapenko @ 2023-10-06 13:45 UTC (permalink / raw)
  To: glider, catalin.marinas, will, pcc, andreyknvl,
	andriy.shevchenko, aleksander.lobakin, linux, yury.norov
  Cc: linux-kernel, linux-arm-kernel, eugenis, syednwaris, william.gray

Define the internal mteswap.h interface:
 - _mte_alloc_and_save_tags()
 - _mte_free_saved_tags()
 - _mte_restore_tags()

These functions encapsulate saving tags for a struct page (together with
memory allocation), restoring tags, and deleting the storage allocated
for them.

These functions accept opaque pointers, which may point to tag buffers
of size MTE_PAGE_TAG_STORAGE, as well as smaller buffers containing
compressed tags, or may have compressed tags stored directly in the
pointers.

The existing code from mteswap.c operating with uncompressed tags is split
away into mteswap_nocomp.c, and the newly introduced mteswap_comp.c
provides compression support. The latter implementation
is picked if CONFIG_ARM64_MTE_COMP=y.

Soon after booting Android, tag compression reduces the memory spent by
mteswap.c on tag allocations by ~2.5x. As uptime grows, the savings can
reach 4x or more. On a moderately loaded device with ~20% of pages using
tagging, the compressed tags for swapped pages occupied only 2.9 MB
instead of 16.5 MB:

  8 bytes: 118277 allocations, 16089 deallocations
  16 bytes: 10762 allocations, 6738 deallocations
  32 bytes: 10748 allocations, 6823 deallocations
  64 bytes: 10510 allocations, 6644 deallocations
  128 bytes: 68375 allocations, 47378 deallocations
  uncompressed tag storage size: 17280000
  compressed tag storage size: 3125024

Signed-off-by: Alexander Potapenko <glider@google.com>

---
 v5:
  - drop a dead variable from _mte_free_saved_tags() in mteswap_comp.c
  - ensure MTE compression works with arbitrary page sizes
  - update patch description

 v4:
  - minor code simplifications suggested by Andy Shevchenko, added
    missing header dependencies
  - changed compression API names to reflect modifications made to
    memcomp.h (as suggested by Yury Norov)

 v3:
  - Addressed comments by Andy Shevchenko in another patch:
   - fixed includes order
   - replaced u64 with unsigned long
   - added MODULE_IMPORT_NS(MTECOMP)
---
 arch/arm64/mm/Makefile         |  5 +++
 arch/arm64/mm/mteswap.c        | 20 +++++-------
 arch/arm64/mm/mteswap.h        | 12 +++++++
 arch/arm64/mm/mteswap_comp.c   | 60 ++++++++++++++++++++++++++++++++++
 arch/arm64/mm/mteswap_nocomp.c | 38 +++++++++++++++++++++
 5 files changed, 124 insertions(+), 11 deletions(-)
 create mode 100644 arch/arm64/mm/mteswap.h
 create mode 100644 arch/arm64/mm/mteswap_comp.c
 create mode 100644 arch/arm64/mm/mteswap_nocomp.c

diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
index 170dc62b010b9..46a798e2b67cb 100644
--- a/arch/arm64/mm/Makefile
+++ b/arch/arm64/mm/Makefile
@@ -11,6 +11,11 @@ obj-$(CONFIG_TRANS_TABLE)	+= trans_pgd-asm.o
 obj-$(CONFIG_DEBUG_VIRTUAL)	+= physaddr.o
 obj-$(CONFIG_ARM64_MTE)		+= mteswap.o
 obj-$(CONFIG_ARM64_MTE_COMP)	+= mtecomp.o
+ifdef CONFIG_ARM64_MTE_COMP
+obj-$(CONFIG_ARM64_MTE)		+= mteswap_comp.o
+else
+obj-$(CONFIG_ARM64_MTE)		+= mteswap_nocomp.o
+endif
 obj-$(CONFIG_ARM64_MTE_COMP_KUNIT_TEST) += test_mtecomp.o
 KASAN_SANITIZE_physaddr.o	+= n
 
diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c
index cd508ba80ab1b..9d8f87fd191a2 100644
--- a/arch/arm64/mm/mteswap.c
+++ b/arch/arm64/mm/mteswap.c
@@ -5,8 +5,11 @@
 #include <linux/slab.h>
 #include <linux/swap.h>
 #include <linux/swapops.h>
+
 #include <asm/mte.h>
 
+#include "mteswap.h"
+
 static DEFINE_XARRAY(mte_pages);
 
 void *mte_allocate_tag_storage(void)
@@ -27,20 +30,18 @@ int mte_save_tags(struct page *page)
 	if (!page_mte_tagged(page))
 		return 0;
 
-	tag_storage = mte_allocate_tag_storage();
+	tag_storage = _mte_alloc_and_save_tags(page);
 	if (!tag_storage)
 		return -ENOMEM;
 
-	mte_save_page_tags(page_address(page), tag_storage);
-
 	/* page_private contains the swap entry.val set in do_swap_page */
 	ret = xa_store(&mte_pages, page_private(page), tag_storage, GFP_KERNEL);
 	if (WARN(xa_is_err(ret), "Failed to store MTE tags")) {
-		mte_free_tag_storage(tag_storage);
+		_mte_free_saved_tags(tag_storage);
 		return xa_err(ret);
 	} else if (ret) {
 		/* Entry is being replaced, free the old entry */
-		mte_free_tag_storage(ret);
+		_mte_free_saved_tags(ret);
 	}
 
 	return 0;
@@ -53,10 +54,7 @@ void mte_restore_tags(swp_entry_t entry, struct page *page)
 	if (!tags)
 		return;
 
-	if (try_page_mte_tagging(page)) {
-		mte_restore_page_tags(page_address(page), tags);
-		set_page_mte_tagged(page);
-	}
+	_mte_restore_tags(tags, page);
 }
 
 void mte_invalidate_tags(int type, pgoff_t offset)
@@ -64,7 +62,7 @@ void mte_invalidate_tags(int type, pgoff_t offset)
 	swp_entry_t entry = swp_entry(type, offset);
 	void *tags = xa_erase(&mte_pages, entry.val);
 
-	mte_free_tag_storage(tags);
+	_mte_free_saved_tags(tags);
 }
 
 void mte_invalidate_tags_area(int type)
@@ -78,7 +76,7 @@ void mte_invalidate_tags_area(int type)
 	xa_lock(&mte_pages);
 	xas_for_each(&xa_state, tags, last_entry.val - 1) {
 		__xa_erase(&mte_pages, xa_state.xa_index);
-		mte_free_tag_storage(tags);
+		_mte_free_saved_tags(tags);
 	}
 	xa_unlock(&mte_pages);
 }
diff --git a/arch/arm64/mm/mteswap.h b/arch/arm64/mm/mteswap.h
new file mode 100644
index 0000000000000..4c576b76785d1
--- /dev/null
+++ b/arch/arm64/mm/mteswap.h
@@ -0,0 +1,12 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef ARCH_ARM64_MM_MTESWAP_H_
+#define ARCH_ARM64_MM_MTESWAP_H_
+
+struct page;
+
+void *_mte_alloc_and_save_tags(struct page *page);
+void _mte_free_saved_tags(void *tags);
+void _mte_restore_tags(void *tags, struct page *page);
+
+#endif // ARCH_ARM64_MM_MTESWAP_H_
diff --git a/arch/arm64/mm/mteswap_comp.c b/arch/arm64/mm/mteswap_comp.c
new file mode 100644
index 0000000000000..4c628405822ce
--- /dev/null
+++ b/arch/arm64/mm/mteswap_comp.c
@@ -0,0 +1,60 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/* MTE tag storage management with compression. */
+
+#include <linux/module.h>
+#include <linux/pagemap.h>
+#include <linux/slab.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
+#include <linux/xarray.h>
+
+#include <asm/mte.h>
+#include <asm/mtecomp.h>
+
+#include "mteswap.h"
+
+void *_mte_alloc_and_save_tags(struct page *page)
+{
+	unsigned long handle;
+	u8 *tags;
+
+	tags = mte_allocate_tag_storage();
+	if (!tags)
+		return xa_mk_value(0);
+	mte_save_page_tags(page_address(page), tags);
+	handle = mte_compress(tags);
+	mte_free_tag_storage(tags);
+	return xa_mk_value(handle);
+}
+
+void _mte_free_saved_tags(void *storage)
+{
+	unsigned long handle;
+
+	handle = xa_to_value(storage);
+	if (!handle)
+		return;
+	mte_release_handle(handle);
+}
+
+void _mte_restore_tags(void *tags, struct page *page)
+{
+	unsigned long handle;
+	u8 *tags_decomp;
+
+	handle = xa_to_value(tags);
+	if (!handle)
+		return;
+	if (!try_page_mte_tagging(page))
+		return;
+	tags_decomp = mte_allocate_tag_storage();
+	if (!tags_decomp)
+		return;
+	if (mte_decompress(handle, tags_decomp)) {
+		mte_restore_page_tags(page_address(page), tags_decomp);
+		set_page_mte_tagged(page);
+	}
+	mte_free_tag_storage(tags_decomp);
+}
+MODULE_IMPORT_NS(MTECOMP);
diff --git a/arch/arm64/mm/mteswap_nocomp.c b/arch/arm64/mm/mteswap_nocomp.c
new file mode 100644
index 0000000000000..1e665a4b5f940
--- /dev/null
+++ b/arch/arm64/mm/mteswap_nocomp.c
@@ -0,0 +1,38 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/* MTE tag storage management without compression support. */
+
+#include <linux/pagemap.h>
+#include <linux/slab.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
+#include <linux/xarray.h>
+
+#include <asm/mte.h>
+
+#include "mteswap.h"
+
+void *_mte_alloc_and_save_tags(struct page *page)
+{
+	void *storage;
+
+	storage = mte_allocate_tag_storage();
+	if (!storage)
+		return NULL;
+
+	mte_save_page_tags(page_address(page), storage);
+	return storage;
+}
+
+void _mte_free_saved_tags(void *storage)
+{
+	mte_free_tag_storage(storage);
+}
+
+void _mte_restore_tags(void *tags, struct page *page)
+{
+	if (!try_page_mte_tagging(page))
+		return;
+	mte_restore_page_tags(page_address(page), tags);
+	set_page_mte_tagged(page);
+}
-- 
2.42.0.609.gbb76f46606-goog


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH v6 1/5] lib/bitmap: add bitmap_{read,write}()
  2023-10-06 13:45   ` Alexander Potapenko
@ 2023-10-06 14:47     ` Andy Shevchenko
  -1 siblings, 0 replies; 28+ messages in thread
From: Andy Shevchenko @ 2023-10-06 14:47 UTC (permalink / raw)
  To: Alexander Potapenko
  Cc: catalin.marinas, will, pcc, andreyknvl, aleksander.lobakin,
	linux, yury.norov, linux-kernel, linux-arm-kernel, eugenis,
	syednwaris, william.gray, Arnd Bergmann

On Fri, Oct 06, 2023 at 03:45:25PM +0200, Alexander Potapenko wrote:
> From: Syed Nayyar Waris <syednwaris@gmail.com>
> 
> The two new functions allow reading/writing values of length up to
> BITS_PER_LONG bits at arbitrary position in the bitmap.
> 
> The code was taken from "bitops: Introduce the for_each_set_clump macro"
> by Syed Nayyar Waris with a number of changes and simplifications:
>  - instead of using roundup(), which adds an unnecessary dependency
>    on <linux/math.h>, we calculate space as BITS_PER_LONG-offset;
>  - indentation is reduced by not using else-clauses (suggested by
>    checkpatch for bitmap_get_value());
>  - bitmap_get_value()/bitmap_set_value() are renamed to bitmap_read()
>    and bitmap_write();
>  - some redundant computations are omitted.

...

> v6:
>  - As suggested by Yury Norov, do not require bitmap_read(..., 0) to
>    return 0.

Hmm... See below.

...

>   *  bitmap_to_arr32(buf, src, nbits)            Copy nbits from buf to u32[] dst
>   *  bitmap_to_arr64(buf, src, nbits)            Copy nbits from buf to u64[] dst

With the grouping as below I would add a blank line here. But was the
intention to group the _arrXX() helpers into these groups?

>   *  bitmap_get_value8(map, start)               Get 8bit value from map at start
> + *  bitmap_read(map, start, nbits)              Read an nbits-sized value from
> + *                                              map at start
>   *  bitmap_set_value8(map, value, start)        Set 8bit value to map at start
> + *  bitmap_write(map, value, start, nbits)      Write an nbits-sized value to
> + *                                              map at start

...

> +static inline unsigned long bitmap_read(const unsigned long *map,
> +					unsigned long start,
> +					unsigned long nbits)
> +{
> +	size_t index = BIT_WORD(start);
> +	unsigned long offset = start % BITS_PER_LONG;
> +	unsigned long space = BITS_PER_LONG - offset;
> +	unsigned long value_low, value_high;

> +	if (unlikely(!nbits))
> +		return 0;

Hmm... I didn't get it: was the comment asking to add or to remove these checks?

> +	if (space >= nbits)
> +		return (map[index] >> offset) & GENMASK(nbits - 1, 0);

And don't you want to replace this GENMASK() as well?

> +	value_low = map[index] & BITMAP_FIRST_WORD_MASK(start);
> +	value_high = map[index + 1] & BITMAP_LAST_WORD_MASK(start + nbits);
> +	return (value_low >> offset) | (value_high << space);
> +}

-- 
With Best Regards,
Andy Shevchenko



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v6 1/5] lib/bitmap: add bitmap_{read,write}()
  2023-10-06 14:47     ` Andy Shevchenko
@ 2023-10-06 16:53       ` Yury Norov
  -1 siblings, 0 replies; 28+ messages in thread
From: Yury Norov @ 2023-10-06 16:53 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: Alexander Potapenko, catalin.marinas, will, pcc, andreyknvl,
	aleksander.lobakin, linux, linux-kernel, linux-arm-kernel,
	eugenis, syednwaris, william.gray, Arnd Bergmann

On Fri, Oct 06, 2023 at 05:47:49PM +0300, Andy Shevchenko wrote:
> On Fri, Oct 06, 2023 at 03:45:25PM +0200, Alexander Potapenko wrote:
> > From: Syed Nayyar Waris <syednwaris@gmail.com>
> > 
> > The two new functions allow reading/writing values of length up to
> > BITS_PER_LONG bits at arbitrary position in the bitmap.
> > 
> > The code was taken from "bitops: Introduce the for_each_set_clump macro"
> > by Syed Nayyar Waris with a number of changes and simplifications:
> >  - instead of using roundup(), which adds an unnecessary dependency
> >    on <linux/math.h>, we calculate space as BITS_PER_LONG-offset;
> >  - indentation is reduced by not using else-clauses (suggested by
> >    checkpatch for bitmap_get_value());
> >  - bitmap_get_value()/bitmap_set_value() are renamed to bitmap_read()
> >    and bitmap_write();
> >  - some redundant computations are omitted.
> 
> ...
> 
> > v6:
> >  - As suggested by Yury Norov, do not require bitmap_read(..., 0) to
> >    return 0.
> 
> Hmm... See below.

 [...]
 
> > +static inline unsigned long bitmap_read(const unsigned long *map,
> > +					unsigned long start,
> > +					unsigned long nbits)
> > +{
> > +	size_t index = BIT_WORD(start);
> > +	unsigned long offset = start % BITS_PER_LONG;
> > +	unsigned long space = BITS_PER_LONG - offset;
> > +	unsigned long value_low, value_high;
> 
> > +	if (unlikely(!nbits))
> > +		return 0;
> 
> Hmm... I didn't get was the comment to add or to remove these checks?

The sentence relates to the test, and the comment that confused you
should go to the 2nd patch.

I.e., bitmap_read(..., 0) is not required to return 0, and it's purely
an implementation detail.

> 
> > +	if (space >= nbits)
> > +		return (map[index] >> offset) & GENMASK(nbits - 1, 0);
> 
> And don't you want to replace this GENMASK() as well?

+1

> > +	value_low = map[index] & BITMAP_FIRST_WORD_MASK(start);
> > +	value_high = map[index + 1] & BITMAP_LAST_WORD_MASK(start + nbits);
> > +	return (value_low >> offset) | (value_high << space);
> > +}
> 
> -- 
> With Best Regards,
> Andy Shevchenko
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v6 1/5] lib/bitmap: add bitmap_{read,write}()
  2023-10-06 13:45   ` Alexander Potapenko
@ 2023-10-06 22:35     ` Yury Norov
  -1 siblings, 0 replies; 28+ messages in thread
From: Yury Norov @ 2023-10-06 22:35 UTC (permalink / raw)
  To: Alexander Potapenko
  Cc: catalin.marinas, will, pcc, andreyknvl, andriy.shevchenko,
	aleksander.lobakin, linux, linux-kernel, linux-arm-kernel,
	eugenis, syednwaris, william.gray, Arnd Bergmann

On Fri, Oct 06, 2023 at 03:45:25PM +0200, Alexander Potapenko wrote:
> From: Syed Nayyar Waris <syednwaris@gmail.com>
> 
> The two new functions allow reading/writing values of length up to
> BITS_PER_LONG bits at arbitrary position in the bitmap.
> 
> The code was taken from "bitops: Introduce the for_each_set_clump macro"
> by Syed Nayyar Waris with a number of changes and simplifications:
>  - instead of using roundup(), which adds an unnecessary dependency
>    on <linux/math.h>, we calculate space as BITS_PER_LONG-offset;
>  - indentation is reduced by not using else-clauses (suggested by
>    checkpatch for bitmap_get_value());
>  - bitmap_get_value()/bitmap_set_value() are renamed to bitmap_read()
>    and bitmap_write();
>  - some redundant computations are omitted.
> 
> Cc: Arnd Bergmann <arnd@arndb.de>
> Signed-off-by: Syed Nayyar Waris <syednwaris@gmail.com>
> Signed-off-by: William Breathitt Gray <william.gray@linaro.org>
> Link: https://lore.kernel.org/lkml/fe12eedf3666f4af5138de0e70b67a07c7f40338.1592224129.git.syednwaris@gmail.com/
> Suggested-by: Yury Norov <yury.norov@gmail.com>
> Co-developed-by: Alexander Potapenko <glider@google.com>
> Signed-off-by: Alexander Potapenko <glider@google.com>
> 
> ---
> This patch was previously called "lib/bitmap: add
> bitmap_{set,get}_value()"
> (https://lore.kernel.org/lkml/20230720173956.3674987-2-glider@google.com/)
> 
> v6:
>  - As suggested by Yury Norov, do not require bitmap_read(..., 0) to
>    return 0.
> 
> v5:
>  - Address comments by Yury Norov:
>    - updated code comments and patch title/description
>    - replace GENMASK(nbits - 1, 0) with BITMAP_LAST_WORD_MASK(nbits)
>    - more compact bitmap_write() implementation
> 
> v4:
>  - Address comments by Andy Shevchenko and Yury Norov:
>    - prevent passing values >= 64 to GENMASK()
>    - fix commit authorship
>    - change comments
>    - check for unlikely(nbits==0)
>    - drop unnecessary const declarations
>    - fix kernel-doc comments
>    - rename bitmap_{get,set}_value() to bitmap_{read,write}()
> ---
>  include/linux/bitmap.h | 68 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 68 insertions(+)
> 
> diff --git a/include/linux/bitmap.h b/include/linux/bitmap.h
> index 03644237e1efb..e72c054d21d48 100644
> --- a/include/linux/bitmap.h
> +++ b/include/linux/bitmap.h
> @@ -76,7 +76,11 @@ struct device;
>   *  bitmap_to_arr32(buf, src, nbits)            Copy nbits from buf to u32[] dst
>   *  bitmap_to_arr64(buf, src, nbits)            Copy nbits from buf to u64[] dst
>   *  bitmap_get_value8(map, start)               Get 8bit value from map at start
> + *  bitmap_read(map, start, nbits)              Read an nbits-sized value from
> + *                                              map at start
>   *  bitmap_set_value8(map, value, start)        Set 8bit value to map at start
> + *  bitmap_write(map, value, start, nbits)      Write an nbits-sized value to
> + *                                              map at start
>   *
>   * Note, bitmap_zero() and bitmap_fill() operate over the region of
>   * unsigned longs, that is, bits behind bitmap till the unsigned long
> @@ -583,6 +587,33 @@ static inline unsigned long bitmap_get_value8(const unsigned long *map,
>  	return (map[index] >> offset) & 0xFF;
>  }
>  
> +/**
> + * bitmap_read - read a value of n-bits from the memory region
> + * @map: address to the bitmap memory region
> + * @start: bit offset of the n-bit value
> + * @nbits: size of value in bits, nonzero, up to BITS_PER_LONG
> + *
> + * Returns: value of nbits located at the @start bit offset within the @map
> + * memory region.
> + */
> +static inline unsigned long bitmap_read(const unsigned long *map,
> +					unsigned long start,
> +					unsigned long nbits)
> +{
> +	size_t index = BIT_WORD(start);
> +	unsigned long offset = start % BITS_PER_LONG;
> +	unsigned long space = BITS_PER_LONG - offset;
> +	unsigned long value_low, value_high;
> +
> +	if (unlikely(!nbits))
> +		return 0;
> +	if (space >= nbits)
> +		return (map[index] >> offset) & GENMASK(nbits - 1, 0);
> +	value_low = map[index] & BITMAP_FIRST_WORD_MASK(start);
> +	value_high = map[index + 1] & BITMAP_LAST_WORD_MASK(start + nbits);
> +	return (value_low >> offset) | (value_high << space);
> +}
> +
>  /**
>   * bitmap_set_value8 - set an 8-bit value within a memory region
>   * @map: address to the bitmap memory region
> @@ -599,6 +630,43 @@ static inline void bitmap_set_value8(unsigned long *map, unsigned long value,
>  	map[index] |= value << offset;
>  }
>  
> +/**
> + * bitmap_write - write n-bit value within a memory region
> + * @map: address to the bitmap memory region
> + * @value: value to write, clamped to nbits
> + * @start: bit offset of the n-bit value
> + * @nbits: size of value in bits, nonzero, up to BITS_PER_LONG.
> + *
> + * bitmap_write() behaves similarly to @nbits calls of assign_bit(), i.e. bits
> + * beyond @nbits are ignored:
> + *
> + *   for (bit = 0; bit < nbits; bit++)
> + *           assign_bit(start + bit, bitmap, val & BIT(bit));

__assign_bit()

> + */

'behaves similarly' sounds like an understatement. I think it behaves
much faster because it can assign up to 64 bits at once, not to mention
the reduced pressure on cache line traffic.

How much faster - that's a good question. I'd be really pleased if you add
a performance test for bitmap_write/read. Or I can do it myself later.
You can find examples in the same lib/test_bitmap.c.
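
Something along these lines would be enough (an untested sketch, to be
hooked up next to the existing cases; the function name is made up and
it may need an extra #include <linux/ktime.h>):

static void __init test_bitmap_read_write_perf(void)
{
	DECLARE_BITMAP(bitmap, 1024);
	unsigned int cnt, nbits, i;
	unsigned long val;
	ktime_t time;

	bitmap_zero(bitmap, 1024);
	time = ktime_get();
	for (cnt = 0; cnt < 5; cnt++) {
		for (nbits = 1; nbits <= BITS_PER_LONG; nbits++) {
			for (i = 0; i < 1024 - nbits; i++) {
				/* Write nbits ones, then read them back. */
				bitmap_write(bitmap, ~0UL, i, nbits);
				val = bitmap_read(bitmap, i, nbits);
				WARN_ON(val != BITMAP_LAST_WORD_MASK(nbits));
			}
		}
	}
	time = ktime_get() - time;
	pr_info("%s: %llu ns\n", __func__, (unsigned long long)time);
}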

> +static inline void bitmap_write(unsigned long *map,
> +				unsigned long value,
> +				unsigned long start, unsigned long nbits)
> +{
> +	size_t index = BIT_WORD(start);
> +	unsigned long offset = start % BITS_PER_LONG;
> +	unsigned long space = BITS_PER_LONG - offset;
> +	unsigned long mask;
> +
> +	if (unlikely(!nbits))
> +		return;

can you please add more empty lines to separate blocks visually?

> +	mask = BITMAP_LAST_WORD_MASK(nbits);
> +	value &= mask;
> +	if (space >= nbits) {
> +		map[index] &= ~(mask << offset);
> +		map[index] |= value << offset;
> +		return;
> +	}
> +	map[index] &= ~BITMAP_FIRST_WORD_MASK(start);
> +	map[index] |= value << offset;
> +	map[index + 1] &= ~BITMAP_LAST_WORD_MASK(start + nbits);
> +	map[index + 1] |= (value >> space);
> +}

I compiled the fix below on a sparc64 BE machine:

--- a/include/linux/bitmap.h
+++ b/include/linux/bitmap.h
@@ -608,7 +608,7 @@ static inline unsigned long bitmap_read(const unsigned long *map,
        if (unlikely(!nbits))
                return 0;
        if (space >= nbits)
-               return (map[index] >> offset) & GENMASK(nbits - 1, 0);
+               return (map[index] >> offset) & BITMAP_LAST_WORD_MASK(nbits);
        value_low = map[index] & BITMAP_FIRST_WORD_MASK(start);
        value_high = map[index + 1] & BITMAP_LAST_WORD_MASK(start + nbits);
        return (value_low >> offset) | (value_high << space);
@@ -661,9 +661,9 @@ static inline void bitmap_write(unsigned long *map,
                map[index] |= value << offset;
                return;
        }
-       map[index] &= ~BITMAP_FIRST_WORD_MASK(start);
+       map[index] &= BITMAP_LAST_WORD_MASK(start);
        map[index] |= value << offset;
-       map[index + 1] &= ~BITMAP_LAST_WORD_MASK(start + nbits);
+       map[index + 1] &= BITMAP_FIRST_WORD_MASK(start + nbits);
        map[index + 1] |= (value >> space);
 }

All the tests pass just as before, and there's no difference
reported by bloat-o-meter. Can you please use non-negation versions as
they are more straightforward?

> +
>  #endif /* __ASSEMBLY__ */
>  
>  #endif /* __LINUX_BITMAP_H */
> -- 
> 2.42.0.609.gbb76f46606-goog

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v6 1/5] lib/bitmap: add bitmap_{read,write}()
@ 2023-10-06 22:35     ` Yury Norov
  0 siblings, 0 replies; 28+ messages in thread
From: Yury Norov @ 2023-10-06 22:35 UTC (permalink / raw)
  To: Alexander Potapenko
  Cc: catalin.marinas, will, pcc, andreyknvl, andriy.shevchenko,
	aleksander.lobakin, linux, linux-kernel, linux-arm-kernel,
	eugenis, syednwaris, william.gray, Arnd Bergmann

On Fri, Oct 06, 2023 at 03:45:25PM +0200, Alexander Potapenko wrote:
> From: Syed Nayyar Waris <syednwaris@gmail.com>
> 
> The two new functions allow reading/writing values of length up to
> BITS_PER_LONG bits at arbitrary position in the bitmap.
> 
> The code was taken from "bitops: Introduce the for_each_set_clump macro"
> by Syed Nayyar Waris with a number of changes and simplifications:
>  - instead of using roundup(), which adds an unnecessary dependency
>    on <linux/math.h>, we calculate space as BITS_PER_LONG-offset;
>  - indentation is reduced by not using else-clauses (suggested by
>    checkpatch for bitmap_get_value());
>  - bitmap_get_value()/bitmap_set_value() are renamed to bitmap_read()
>    and bitmap_write();
>  - some redundant computations are omitted.
> 
> Cc: Arnd Bergmann <arnd@arndb.de>
> Signed-off-by: Syed Nayyar Waris <syednwaris@gmail.com>
> Signed-off-by: William Breathitt Gray <william.gray@linaro.org>
> Link: https://lore.kernel.org/lkml/fe12eedf3666f4af5138de0e70b67a07c7f40338.1592224129.git.syednwaris@gmail.com/
> Suggested-by: Yury Norov <yury.norov@gmail.com>
> Co-developed-by: Alexander Potapenko <glider@google.com>
> Signed-off-by: Alexander Potapenko <glider@google.com>
> 
> ---
> This patch was previously called "lib/bitmap: add
> bitmap_{set,get}_value()"
> (https://lore.kernel.org/lkml/20230720173956.3674987-2-glider@google.com/)
> 
> v6:
>  - As suggested by Yury Norov, do not require bitmap_read(..., 0) to
>    return 0.
> 
> v5:
>  - Address comments by Yury Norov:
>    - updated code comments and patch title/description
>    - replace GENMASK(nbits - 1, 0) with BITMAP_LAST_WORD_MASK(nbits)
>    - more compact bitmap_write() implementation
> 
> v4:
>  - Address comments by Andy Shevchenko and Yury Norov:
>    - prevent passing values >= 64 to GENMASK()
>    - fix commit authorship
>    - change comments
>    - check for unlikely(nbits==0)
>    - drop unnecessary const declarations
>    - fix kernel-doc comments
>    - rename bitmap_{get,set}_value() to bitmap_{read,write}()
> ---
>  include/linux/bitmap.h | 68 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 68 insertions(+)
> 
> diff --git a/include/linux/bitmap.h b/include/linux/bitmap.h
> index 03644237e1efb..e72c054d21d48 100644
> --- a/include/linux/bitmap.h
> +++ b/include/linux/bitmap.h
> @@ -76,7 +76,11 @@ struct device;
>   *  bitmap_to_arr32(buf, src, nbits)            Copy nbits from buf to u32[] dst
>   *  bitmap_to_arr64(buf, src, nbits)            Copy nbits from buf to u64[] dst
>   *  bitmap_get_value8(map, start)               Get 8bit value from map at start
> + *  bitmap_read(map, start, nbits)              Read an nbits-sized value from
> + *                                              map at start
>   *  bitmap_set_value8(map, value, start)        Set 8bit value to map at start
> + *  bitmap_write(map, value, start, nbits)      Write an nbits-sized value to
> + *                                              map at start
>   *
>   * Note, bitmap_zero() and bitmap_fill() operate over the region of
>   * unsigned longs, that is, bits behind bitmap till the unsigned long
> @@ -583,6 +587,33 @@ static inline unsigned long bitmap_get_value8(const unsigned long *map,
>  	return (map[index] >> offset) & 0xFF;
>  }
>  
> +/**
> + * bitmap_read - read a value of n-bits from the memory region
> + * @map: address to the bitmap memory region
> + * @start: bit offset of the n-bit value
> + * @nbits: size of value in bits, nonzero, up to BITS_PER_LONG
> + *
> + * Returns: value of nbits located at the @start bit offset within the @map
> + * memory region.
> + */
> +static inline unsigned long bitmap_read(const unsigned long *map,
> +					unsigned long start,
> +					unsigned long nbits)
> +{
> +	size_t index = BIT_WORD(start);
> +	unsigned long offset = start % BITS_PER_LONG;
> +	unsigned long space = BITS_PER_LONG - offset;
> +	unsigned long value_low, value_high;
> +
> +	if (unlikely(!nbits))
> +		return 0;
> +	if (space >= nbits)
> +		return (map[index] >> offset) & GENMASK(nbits - 1, 0);
> +	value_low = map[index] & BITMAP_FIRST_WORD_MASK(start);
> +	value_high = map[index + 1] & BITMAP_LAST_WORD_MASK(start + nbits);
> +	return (value_low >> offset) | (value_high << space);
> +}
> +
>  /**
>   * bitmap_set_value8 - set an 8-bit value within a memory region
>   * @map: address to the bitmap memory region
> @@ -599,6 +630,43 @@ static inline void bitmap_set_value8(unsigned long *map, unsigned long value,
>  	map[index] |= value << offset;
>  }
>  
> +/**
> + * bitmap_write - write n-bit value within a memory region
> + * @map: address to the bitmap memory region
> + * @value: value to write, clamped to nbits
> + * @start: bit offset of the n-bit value
> + * @nbits: size of value in bits, nonzero, up to BITS_PER_LONG.
> + *
> + * bitmap_write() behaves similarly to @nbits calls of assign_bit(), i.e. bits
> + * beyond @nbits are ignored:
> + *
> + *   for (bit = 0; bit < nbits; bit++)
> + *           assign_bit(start + bit, bitmap, val & BIT(bit));

__assign_bit()

> + */

'behaves similarly' sounds like an understatement. I think it behaves
much faster because it can assign up to 64 bits at once, not to mention
the reduced pressure on cache-line traffic.

How much faster - that's a good question. I'd be really pleased if you
added a performance test for bitmap_write/read, or I can do it myself
later. You can find examples in the same lib/test_bitmap.c.

> +static inline void bitmap_write(unsigned long *map,
> +				unsigned long value,
> +				unsigned long start, unsigned long nbits)
> +{
> +	size_t index = BIT_WORD(start);
> +	unsigned long offset = start % BITS_PER_LONG;
> +	unsigned long space = BITS_PER_LONG - offset;
> +	unsigned long mask;
> +
> +	if (unlikely(!nbits))
> +		return;

can you please add more empty lines to separate blocks visually?

> +	mask = BITMAP_LAST_WORD_MASK(nbits);
> +	value &= mask;
> +	if (space >= nbits) {
> +		map[index] &= ~(mask << offset);
> +		map[index] |= value << offset;
> +		return;
> +	}
> +	map[index] &= ~BITMAP_FIRST_WORD_MASK(start);
> +	map[index] |= value << offset;
> +	map[index + 1] &= ~BITMAP_LAST_WORD_MASK(start + nbits);
> +	map[index + 1] |= (value >> space);
> +}

I compiled the below fix on a sparc64 BE machine:

--- a/include/linux/bitmap.h
+++ b/include/linux/bitmap.h
@@ -608,7 +608,7 @@ static inline unsigned long bitmap_read(const unsigned long *map,
        if (unlikely(!nbits))
                return 0;
        if (space >= nbits)
-               return (map[index] >> offset) & GENMASK(nbits - 1, 0);
+               return (map[index] >> offset) & BITMAP_LAST_WORD_MASK(nbits);
        value_low = map[index] & BITMAP_FIRST_WORD_MASK(start);
        value_high = map[index + 1] & BITMAP_LAST_WORD_MASK(start + nbits);
        return (value_low >> offset) | (value_high << space);
@@ -661,9 +661,9 @@ static inline void bitmap_write(unsigned long *map,
                map[index] |= value << offset;
                return;
        }
-       map[index] &= ~BITMAP_FIRST_WORD_MASK(start);
+       map[index] &= BITMAP_LAST_WORD_MASK(start);
        map[index] |= value << offset;
-       map[index + 1] &= ~BITMAP_LAST_WORD_MASK(start + nbits);
+       map[index + 1] &= BITMAP_FIRST_WORD_MASK(start + nbits);
        map[index + 1] |= (value >> space);
 }

All the tests pass just as before, and there is no difference
reported by bloat-o-meter. Can you please use the non-negated versions,
as they are more straightforward?

> +
>  #endif /* __ASSEMBLY__ */
>  
>  #endif /* __LINUX_BITMAP_H */
> -- 
> 2.42.0.609.gbb76f46606-goog

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v6 1/5] lib/bitmap: add bitmap_{read,write}()
  2023-10-06 14:47     ` Andy Shevchenko
@ 2023-10-10  8:16       ` Alexander Potapenko
  -1 siblings, 0 replies; 28+ messages in thread
From: Alexander Potapenko @ 2023-10-10  8:16 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: catalin.marinas, will, pcc, andreyknvl, aleksander.lobakin,
	linux, yury.norov, linux-kernel, linux-arm-kernel, eugenis,
	syednwaris, william.gray, Arnd Bergmann

On Fri, Oct 6, 2023 at 4:48 PM Andy Shevchenko
<andriy.shevchenko@linux.intel.com> wrote:
>
> On Fri, Oct 06, 2023 at 03:45:25PM +0200, Alexander Potapenko wrote:
> > From: Syed Nayyar Waris <syednwaris@gmail.com>
> >
> > The two new functions allow reading/writing values of length up to
> > BITS_PER_LONG bits at arbitrary position in the bitmap.
> >
> > The code was taken from "bitops: Introduce the for_each_set_clump macro"
> > by Syed Nayyar Waris with a number of changes and simplifications:
> >  - instead of using roundup(), which adds an unnecessary dependency
> >    on <linux/math.h>, we calculate space as BITS_PER_LONG-offset;
> >  - indentation is reduced by not using else-clauses (suggested by
> >    checkpatch for bitmap_get_value());
> >  - bitmap_get_value()/bitmap_set_value() are renamed to bitmap_read()
> >    and bitmap_write();
> >  - some redundant computations are omitted.
>
> ...
>
> > v6:
> >  - As suggested by Yury Norov, do not require bitmap_read(..., 0) to
> >    return 0.
>
> Hmm... See below.
>
> ...
>
> >   *  bitmap_to_arr32(buf, src, nbits)            Copy nbits from buf to u32[] dst
> >   *  bitmap_to_arr64(buf, src, nbits)            Copy nbits from buf to u64[] dst
>
> With the grouping as below I would add a blank line here. But was the intention
> to group _arrXX() to these groups?

Note that there isn't a single blank line in this long list.
If I swap bitmap_read() with bitmap_set_value8(), would the
grouping become more logical?

I.e.
*  bitmap_get_value8(map, start)               Get 8bit value from map at start
*  bitmap_set_value8(map, value, start)        Set 8bit value to map at start
*  bitmap_read(map, start, nbits)              Read an nbits-sized value from
*                                              map at start
*  bitmap_write(map, value, start, nbits)      Write an nbits-sized value to
*                                              map at start

> > +     if (unlikely(!nbits))
> > +             return 0;
>
> Hmm... I didn't get was the comment to add or to remove these checks?

As Yury said, we should not require the return value to be 0, so I
added "nonzero" to the descriptions of the @nbits parameter.
The check stays in place, but callers relying on it are now making a
mistake.
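
To illustrate (a sketch, not code from this series): a caller that may
legitimately end up with a zero width is now expected to check before
calling, e.g.:

        /*
         * Hypothetical caller: nbits may be 0 here, and the return value
         * of bitmap_read(map, start, 0) is unspecified, so guard explicitly.
         */
        if (nbits)
                val = bitmap_read(map, start, nbits);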

>
> > +     if (space >= nbits)
> > +             return (map[index] >> offset) & GENMASK(nbits - 1, 0);
>
> And don't you want to replace this GENMASK() as well?

See my next reply to Yury, tl;dr this is a stale code version :(

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v6 1/5] lib/bitmap: add bitmap_{read,write}()
  2023-10-06 22:35     ` Yury Norov
@ 2023-10-10  9:17       ` Alexander Potapenko
  -1 siblings, 0 replies; 28+ messages in thread
From: Alexander Potapenko @ 2023-10-10  9:17 UTC (permalink / raw)
  To: Yury Norov
  Cc: catalin.marinas, will, pcc, andreyknvl, andriy.shevchenko,
	aleksander.lobakin, linux, linux-kernel, linux-arm-kernel,
	eugenis, syednwaris, william.gray, Arnd Bergmann

> > + *
> > + *   for (bit = 0; bit < nbits; bit++)
> > + *           assign_bit(start + bit, bitmap, val & BIT(bit));
>
> __assign_bit()

Ack

>
> > + */
>
> 'behaves similarly' sounds like an understatement. I think it behaves
> much faster because it can assign up to 64 bits at once, not to mention
> the reduced pressure on cache-line traffic.

My intent was to describe the visible behavior; of course the
generated code is better, and the number of memory accesses is lower.

How about the following description:

 * The result of bitmap_write() is similar to @nbits calls of assign_bit(), i.e.
 * bits beyond @nbits are ignored:
 *
 *   for (bit = 0; bit < nbits; bit++)
 *           assign_bit(start + bit, bitmap, val & BIT(bit));

?

>
> How much faster - that's a good question. I'd be really pleased if you
> added a performance test for bitmap_write/read, or I can do it myself
> later. You can find examples in the same lib/test_bitmap.c.

I can add two separate functions doing some bitmap_read and
bitmap_write calls in a loop to measure their performance
independently - along the lines of what you did here:
https://lore.kernel.org/lkml/ZL9X0TZb%2FQhCbEiC@yury-ThinkPad/
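
Roughly along these lines (an illustrative sketch only; the function
name, sizes and iteration counts are made up, and the two measurements
are folded into one function for brevity):

static void __init test_bitmap_rw_perf(void)
{
        DECLARE_BITMAP(bitmap, 1024);
        unsigned long sink = 0;
        ktime_t time;
        int i, n;

        bitmap_zero(bitmap, 1024);

        time = ktime_get();
        for (i = 0; i < 10000; i++)
                for (n = 1; n <= BITS_PER_LONG; n++)
                        bitmap_write(bitmap, i + n, i % 512, n);
        time = ktime_get() - time;
        pr_info("bitmap_write: %llu ns\n", (unsigned long long)time);

        time = ktime_get();
        for (i = 0; i < 10000; i++)
                for (n = 1; n <= BITS_PER_LONG; n++)
                        sink += bitmap_read(bitmap, i % 512, n);
        time = ktime_get() - time;
        pr_info("bitmap_read: %llu ns (sink %lx)\n",
                (unsigned long long)time, sink);
}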


> > +     if (unlikely(!nbits))
> > +             return;
>
> can you please add more empty lines to separate blocks visually?

Sure, will do.

>
> > +     mask = BITMAP_LAST_WORD_MASK(nbits);
> > +     value &= mask;
> > +     if (space >= nbits) {
> > +             map[index] &= ~(mask << offset);
> > +             map[index] |= value << offset;
> > +             return;
> > +     }
> > +     map[index] &= ~BITMAP_FIRST_WORD_MASK(start);
> > +     map[index] |= value << offset;
> > +     map[index + 1] &= ~BITMAP_LAST_WORD_MASK(start + nbits);
> > +     map[index + 1] |= (value >> space);
> > +}
>
> I compiled the below fix on a sparc64 BE machine:
>
> --- a/include/linux/bitmap.h
> +++ b/include/linux/bitmap.h
> @@ -608,7 +608,7 @@ static inline unsigned long bitmap_read(const unsigned long *map,
>         if (unlikely(!nbits))
>                 return 0;
>         if (space >= nbits)
> -               return (map[index] >> offset) & GENMASK(nbits - 1, 0);
> +               return (map[index] >> offset) & BITMAP_LAST_WORD_MASK(nbits);
>         value_low = map[index] & BITMAP_FIRST_WORD_MASK(start);
>         value_high = map[index + 1] & BITMAP_LAST_WORD_MASK(start + nbits);
>         return (value_low >> offset) | (value_high << space);
> @@ -661,9 +661,9 @@ static inline void bitmap_write(unsigned long *map,
>                 map[index] |= value << offset;
>                 return;
>         }
> -       map[index] &= ~BITMAP_FIRST_WORD_MASK(start);
> +       map[index] &= BITMAP_LAST_WORD_MASK(start);
>         map[index] |= value << offset;
> -       map[index + 1] &= ~BITMAP_LAST_WORD_MASK(start + nbits);
> +       map[index + 1] &= BITMAP_FIRST_WORD_MASK(start + nbits);
>         map[index + 1] |= (value >> space);
>  }
>
> All the tests pass just as before, and there is no difference
> reported by bloat-o-meter. Can you please use the non-negated versions,
> as they are more straightforward?

I am terribly sorry for this, but this code version is completely
different from what we've discussed here:
https://lore.kernel.org/lkml/CAG_fn=VrPJj6YowHNki5RGAAs8qvwZpUVN4K9qw=cf4aW7Qw9A@mail.gmail.com/

The correct version roughly looks as follows:

void bitmap_write(unsigned long *map, unsigned long value,
                      unsigned long start, unsigned long nbits)
{
        unsigned long offset, space, mask;
        size_t index;
        bool fit;

        if (unlikely(!nbits))
                return;

        mask = BITMAP_LAST_WORD_MASK(nbits);
        value &= mask;
        offset = start % BITS_PER_LONG;
        space = BITS_PER_LONG - offset;
        index = BIT_WORD(start);
        fit = space >= nbits;

        map[index] &= (fit ? (~(mask << offset)) :
                             ~BITMAP_FIRST_WORD_MASK(start));
        map[index] |= value << offset;
        if (fit)
                return;

        map[index + 1] &= BITMAP_FIRST_WORD_MASK(start + nbits);
        map[index + 1] |= (value >> space);
}

Note that here replacing ~BITMAP_FIRST_WORD_MASK() with
BITMAP_LAST_WORD_MASK() costs us 25 bytes (116 bytes with
~BITMAP_FIRST_WORD_MASK() vs. 141 bytes with BITMAP_LAST_WORD_MASK()).
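
For completeness, a tiny usage sketch (illustration only, not code from
this series) of the round trip these helpers provide, including the
word-crossing case handled above:

        DECLARE_BITMAP(map, 128);
        unsigned long v;

        bitmap_zero(map, 128);

        /* Fits in a single word: bits 3..10. */
        bitmap_write(map, 0xab, 3, 8);
        v = bitmap_read(map, 3, 8);     /* v == 0xab */

        /* Crosses a word boundary on 64-bit: bits 60..70. */
        bitmap_write(map, 0x5a5, 60, 11);
        v = bitmap_read(map, 60, 11);   /* v == 0x5a5 */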

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v6 1/5] lib/bitmap: add bitmap_{read,write}()
  2023-10-10  8:16       ` Alexander Potapenko
@ 2023-10-10  9:43         ` Alexander Potapenko
  -1 siblings, 0 replies; 28+ messages in thread
From: Alexander Potapenko @ 2023-10-10  9:43 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: catalin.marinas, will, pcc, andreyknvl, aleksander.lobakin,
	linux, yury.norov, linux-kernel, linux-arm-kernel, eugenis,
	syednwaris, william.gray, Arnd Bergmann

> >
> > > +     if (space >= nbits)
> > > +             return (map[index] >> offset) & GENMASK(nbits - 1, 0);
> >
> > And don't you want to replace this GENMASK() as well?
>
> See my next reply to Yury, tl;dr this is a stale code version :(

Actually, no, we haven't discussed improvements to bitmap_read(), and
this is the correct version.
I'll fix this one.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v6 1/5] lib/bitmap: add bitmap_{read,write}()
  2023-10-10  9:17       ` Alexander Potapenko
@ 2023-10-10 11:03         ` Rasmus Villemoes
  -1 siblings, 0 replies; 28+ messages in thread
From: Rasmus Villemoes @ 2023-10-10 11:03 UTC (permalink / raw)
  To: Alexander Potapenko, Yury Norov
  Cc: catalin.marinas, will, pcc, andreyknvl, andriy.shevchenko,
	aleksander.lobakin, linux-kernel, linux-arm-kernel, eugenis,
	syednwaris, william.gray, Arnd Bergmann

On 10/10/2023 11.17, Alexander Potapenko wrote:

>> 'behaves similarly' sounds like an understatement. I think it behaves
>> much faster because it can assign up to 64 bits at once, not to mention
>> the reduced pressure on cache-line traffic.
> 
> My intent was to describe the visible behavior, of course the
> generated code is better, and the number of memory accesses lower.
> 
> How about the following description:
> 
>  * The result of bitmap_write() is similar to @nbits calls of assign_bit(), i.e.
>  * bits beyond @nbits are ignored:
>  *
>  *   for (bit = 0; bit < nbits; bit++)
>  *           assign_bit(start + bit, bitmap, val & BIT(bit));
> 
> ?

C programmers should know the meaning of the term "as-if". Perhaps use
that. Something like "bitmap_write() behaves as-if implemented as @nbits
calls of __assign_bit()".

Rasmus


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v6 1/5] lib/bitmap: add bitmap_{read,write}()
  2023-10-10 11:03         ` Rasmus Villemoes
@ 2023-10-10 12:14           ` Alexander Potapenko
  -1 siblings, 0 replies; 28+ messages in thread
From: Alexander Potapenko @ 2023-10-10 12:14 UTC (permalink / raw)
  To: Rasmus Villemoes
  Cc: Yury Norov, catalin.marinas, will, pcc, andreyknvl,
	andriy.shevchenko, aleksander.lobakin, linux-kernel,
	linux-arm-kernel, eugenis, syednwaris, william.gray,
	Arnd Bergmann

On Tue, Oct 10, 2023 at 1:03 PM Rasmus Villemoes
<linux@rasmusvillemoes.dk> wrote:
>
> On 10/10/2023 11.17, Alexander Potapenko wrote:
>
> >> 'behaves similarly' sounds like an understatement. I think it behaves
> >> much faster because it can assign up to 64 bits at once, not to mention
> >> the reduced pressure on cache-line traffic.
> >
> > My intent was to describe the visible behavior, of course the
> > generated code is better, and the number of memory accesses lower.
> >
> > How about the following description:
> >
> >  * The result of bitmap_write() is similar to @nbits calls of assign_bit(), i.e.
> >  * bits beyond @nbits are ignored:
> >  *
> >  *   for (bit = 0; bit < nbits; bit++)
> >  *           assign_bit(start + bit, bitmap, val & BIT(bit));
> >
> > ?
>
> C programmers should know the meaning of the term "as-if". Perhaps use
> that. Something like "bitmap_write() behaves as-if implemented as @nbits
> calls of __assign_bit()".

Good idea, I'll go with this one.
Thank you!
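
For reference, the reworded kernel-doc paragraph could then read
roughly like this (a sketch combining Rasmus's "as-if" phrasing with
the __assign_bit() fix requested earlier in the thread):

 * bitmap_write() behaves as-if implemented as @nbits calls of __assign_bit(),
 * i.e. bits beyond @nbits are ignored:
 *
 *   for (bit = 0; bit < nbits; bit++)
 *           __assign_bit(start + bit, bitmap, val & BIT(bit));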

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2023-10-10 12:16 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-10-06 13:45 [PATCH v6 0/5] Implement MTE tag compression for swapped pages Alexander Potapenko
2023-10-06 13:45 ` Alexander Potapenko
2023-10-06 13:45 ` [PATCH v6 1/5] lib/bitmap: add bitmap_{read,write}() Alexander Potapenko
2023-10-06 13:45   ` Alexander Potapenko
2023-10-06 14:47   ` Andy Shevchenko
2023-10-06 14:47     ` Andy Shevchenko
2023-10-06 16:53     ` Yury Norov
2023-10-06 16:53       ` Yury Norov
2023-10-10  8:16     ` Alexander Potapenko
2023-10-10  8:16       ` Alexander Potapenko
2023-10-10  9:43       ` Alexander Potapenko
2023-10-10  9:43         ` Alexander Potapenko
2023-10-06 22:35   ` Yury Norov
2023-10-06 22:35     ` Yury Norov
2023-10-10  9:17     ` Alexander Potapenko
2023-10-10  9:17       ` Alexander Potapenko
2023-10-10 11:03       ` Rasmus Villemoes
2023-10-10 11:03         ` Rasmus Villemoes
2023-10-10 12:14         ` Alexander Potapenko
2023-10-10 12:14           ` Alexander Potapenko
2023-10-06 13:45 ` [PATCH v6 2/5] lib/test_bitmap: add tests for bitmap_{read,write}() Alexander Potapenko
2023-10-06 13:45   ` Alexander Potapenko
2023-10-06 13:45 ` [PATCH v6 3/5] arm64: mte: implement CONFIG_ARM64_MTE_COMP Alexander Potapenko
2023-10-06 13:45   ` Alexander Potapenko
2023-10-06 13:45 ` [PATCH v6 4/5] arm64: mte: add a test for MTE tags compression Alexander Potapenko
2023-10-06 13:45   ` Alexander Potapenko
2023-10-06 13:45 ` [PATCH v6 5/5] arm64: mte: add compression support to mteswap.c Alexander Potapenko
2023-10-06 13:45   ` Alexander Potapenko
