* [RFC PATCH 0/2] sbitmap: NUMA node spreading
@ 2022-05-10 11:14 John Garry
  2022-05-10 11:14 ` [RFC PATCH 1/2] sbitmap: Make sbitmap.map a double pointer John Garry
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: John Garry @ 2022-05-10 11:14 UTC (permalink / raw)
  To: axboe, linux-block; +Cc: linux-kernel, linux-scsi, John Garry

Hi Jens, guys,

I am sending this as an RFC to see if there is any future in it or ideas
on how to make it better. I also need to improve some items (as mentioned
in the 2/2 commit message) and test a lot more.

The general idea is that we change from allocating a single array of
sbitmap words to allocating a sub-array per NUMA node. And then each CPU
in that node is hinted to use that sub-array.
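
A rough sketch of the hinting idea, for illustration only (not the patch
code itself):

	unsigned int words_per_node, bits_per_node;
	int cpu;

	words_per_node = max(1U, sb->map_nr / num_online_nodes());
	bits_per_node = words_per_node << sb->shift;

	for_each_possible_cpu(cpu) {
		unsigned int base = bits_per_node * cpu_to_node(cpu);

		*per_cpu_ptr(sb->alloc_hint, cpu) =
			base + (prandom_u32() % bits_per_node);
	}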

Initial performance looks decent.

Some figures:
System: 4 nodes (with memory on all nodes), 128 CPUs

null_blk config:
20 devs, submit_queues=NR_CPUS, shared_tags, shared_tag_bitmap,
hw_queue_depth=256

fio config:
bs=4096, iodepth=128, numjobs=10, cpus_allowed_policy=split, rw=read,
ioscheduler=none

Before:
7130K

After:
7630K

So a +7% IOPS gain.

Any comments welcome, thanks!

Based on v5.18-rc6.

John Garry (2):
  sbitmap: Make sbitmap.map a double pointer
  sbitmap: Spread sbitmap word allocation over NUMA nodes

 include/linux/sbitmap.h | 16 +++++---
 lib/sbitmap.c           | 83 +++++++++++++++++++++++++++++++++--------
 2 files changed, 79 insertions(+), 20 deletions(-)

-- 
2.26.2



* [RFC PATCH 1/2] sbitmap: Make sbitmap.map a double pointer
  2022-05-10 11:14 [RFC PATCH 0/2] sbitmap: NUMA node spreading John Garry
@ 2022-05-10 11:14 ` John Garry
  2022-05-10 11:14 ` [RFC PATCH 2/2] sbitmap: Spread sbitmap word allocation over NUMA nodes John Garry
  2022-05-10 12:50 ` [RFC PATCH 0/2] sbitmap: NUMA node spreading Jens Axboe
  2 siblings, 0 replies; 10+ messages in thread
From: John Garry @ 2022-05-10 11:14 UTC (permalink / raw)
  To: axboe, linux-block; +Cc: linux-kernel, linux-scsi, John Garry

Currently sbitmap.map points to a single array of sbitmap words.

To allow each word to be allocated individually, make it a double pointer.

This means that accessing a word will require an extra load.

Signed-off-by: John Garry <john.garry@huawei.com>
---
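To illustrate the extra cost (a sketch, not part of the change itself):

	/* before: a single load from one contiguous array */
	word = sb->map[index].word;

	/* after: two dependent loads - the per-word pointer first, then
	 * the word itself, which may sit in a different cacheline
	 */
	word = sb->map[index]->word;
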
 include/linux/sbitmap.h | 11 ++++++-----
 lib/sbitmap.c           | 34 ++++++++++++++++++++++++----------
 2 files changed, 30 insertions(+), 15 deletions(-)

diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
index 8f5a86e210b9..46268f391e32 100644
--- a/include/linux/sbitmap.h
+++ b/include/linux/sbitmap.h
@@ -68,7 +68,7 @@ struct sbitmap {
 	/**
 	 * @map: Allocated bitmap.
 	 */
-	struct sbitmap_word *map;
+	struct sbitmap_word **map;
 
 	/*
 	 * @alloc_hint: Cache of last successfully allocated or freed bit.
@@ -256,9 +256,10 @@ static inline void __sbitmap_for_each_set(struct sbitmap *sb,
 		unsigned int depth = min_t(unsigned int,
 					   __map_depth(sb, index) - nr,
 					   sb->depth - scanned);
-
+		struct sbitmap_word *map = sb->map[index];
 		scanned += depth;
-		word = sb->map[index].word & ~sb->map[index].cleared;
+
+		word = map->word & ~map->cleared;
 		if (!word)
 			goto next;
 
@@ -299,7 +300,7 @@ static inline void sbitmap_for_each_set(struct sbitmap *sb, sb_for_each_fn fn,
 static inline unsigned long *__sbitmap_word(struct sbitmap *sb,
 					    unsigned int bitnr)
 {
-	return &sb->map[SB_NR_TO_INDEX(sb, bitnr)].word;
+	return &sb->map[SB_NR_TO_INDEX(sb, bitnr)]->word;
 }
 
 /* Helpers equivalent to the operations in asm/bitops.h and linux/bitmap.h */
@@ -322,7 +323,7 @@ static inline void sbitmap_clear_bit(struct sbitmap *sb, unsigned int bitnr)
  */
 static inline void sbitmap_deferred_clear_bit(struct sbitmap *sb, unsigned int bitnr)
 {
-	unsigned long *addr = &sb->map[SB_NR_TO_INDEX(sb, bitnr)].cleared;
+	unsigned long *addr = &sb->map[SB_NR_TO_INDEX(sb, bitnr)]->cleared;
 
 	set_bit(SB_NR_TO_BIT(sb, bitnr), addr);
 }
diff --git a/lib/sbitmap.c b/lib/sbitmap.c
index ae4fd4de9ebe..64fb9800ed8c 100644
--- a/lib/sbitmap.c
+++ b/lib/sbitmap.c
@@ -85,6 +85,8 @@ int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift,
 		      bool alloc_hint)
 {
 	unsigned int bits_per_word;
+	struct sbitmap_word *map;
+	int index;
 
 	if (shift < 0)
 		shift = sbitmap_calculate_shift(depth);
@@ -116,6 +118,17 @@ int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift,
 		return -ENOMEM;
 	}
 
+	map = kvzalloc_node(sb->map_nr * sizeof(**sb->map), flags, node);
+	if (!map)
+		return -ENOMEM;
+
+	for (index = 0; index < sb->map_nr; index++, map++) {
+		struct sbitmap_word **_map;
+
+		_map = &sb->map[index];
+		*_map = map;
+	}
+
 	return 0;
 }
 EXPORT_SYMBOL_GPL(sbitmap_init_node);
@@ -126,7 +139,7 @@ void sbitmap_resize(struct sbitmap *sb, unsigned int depth)
 	unsigned int i;
 
 	for (i = 0; i < sb->map_nr; i++)
-		sbitmap_deferred_clear(&sb->map[i]);
+		sbitmap_deferred_clear(sb->map[i]);
 
 	sb->depth = depth;
 	sb->map_nr = DIV_ROUND_UP(sb->depth, bits_per_word);
@@ -170,7 +183,7 @@ static int __sbitmap_get_word(unsigned long *word, unsigned long depth,
 static int sbitmap_find_bit_in_index(struct sbitmap *sb, int index,
 				     unsigned int alloc_hint)
 {
-	struct sbitmap_word *map = &sb->map[index];
+	struct sbitmap_word *map = sb->map[index];
 	int nr;
 
 	do {
@@ -246,7 +259,7 @@ static int __sbitmap_get_shallow(struct sbitmap *sb,
 
 	for (i = 0; i < sb->map_nr; i++) {
 again:
-		nr = __sbitmap_get_word(&sb->map[index].word,
+		nr = __sbitmap_get_word(&sb->map[index]->word,
 					min_t(unsigned int,
 					      __map_depth(sb, index),
 					      shallow_depth),
@@ -256,7 +269,7 @@ static int __sbitmap_get_shallow(struct sbitmap *sb,
 			break;
 		}
 
-		if (sbitmap_deferred_clear(&sb->map[index]))
+		if (sbitmap_deferred_clear(sb->map[index]))
 			goto again;
 
 		/* Jump to next index. */
@@ -294,7 +307,7 @@ bool sbitmap_any_bit_set(const struct sbitmap *sb)
 	unsigned int i;
 
 	for (i = 0; i < sb->map_nr; i++) {
-		if (sb->map[i].word & ~sb->map[i].cleared)
+		if (sb->map[i]->word & ~sb->map[i]->cleared)
 			return true;
 	}
 	return false;
@@ -306,7 +319,7 @@ static unsigned int __sbitmap_weight(const struct sbitmap *sb, bool set)
 	unsigned int i, weight = 0;
 
 	for (i = 0; i < sb->map_nr; i++) {
-		const struct sbitmap_word *word = &sb->map[i];
+		const struct sbitmap_word *word = sb->map[i];
 		unsigned int word_depth = __map_depth(sb, i);
 
 		if (set)
@@ -358,8 +371,9 @@ void sbitmap_bitmap_show(struct sbitmap *sb, struct seq_file *m)
 	int i;
 
 	for (i = 0; i < sb->map_nr; i++) {
-		unsigned long word = READ_ONCE(sb->map[i].word);
-		unsigned long cleared = READ_ONCE(sb->map[i].cleared);
+		struct sbitmap_word *map = READ_ONCE(sb->map[i]);
+		unsigned long word = READ_ONCE(map->word);
+		unsigned long cleared = READ_ONCE(map->cleared);
 		unsigned int word_bits = __map_depth(sb, i);
 
 		word &= ~cleared;
@@ -522,7 +536,7 @@ unsigned long __sbitmap_queue_get_batch(struct sbitmap_queue *sbq, int nr_tags,
 	index = SB_NR_TO_INDEX(sb, hint);
 
 	for (i = 0; i < sb->map_nr; i++) {
-		struct sbitmap_word *map = &sb->map[index];
+		struct sbitmap_word *map = sb->map[index];
 		unsigned long get_mask;
 		unsigned int map_depth = __map_depth(sb, index);
 
@@ -665,7 +679,7 @@ void sbitmap_queue_clear_batch(struct sbitmap_queue *sbq, int offset,
 		unsigned long *this_addr;
 
 		/* since we're clearing a batch, skip the deferred map */
-		this_addr = &sb->map[SB_NR_TO_INDEX(sb, tag)].word;
+		this_addr = &sb->map[SB_NR_TO_INDEX(sb, tag)]->word;
 		if (!addr) {
 			addr = this_addr;
 		} else if (addr != this_addr) {
-- 
2.26.2



* [RFC PATCH 2/2] sbitmap: Spread sbitmap word allocation over NUMA nodes
  2022-05-10 11:14 [RFC PATCH 0/2] sbitmap: NUMA node spreading John Garry
  2022-05-10 11:14 ` [RFC PATCH 1/2] sbitmap: Make sbitmap.map a double pointer John Garry
@ 2022-05-10 11:14 ` John Garry
  2022-05-10 19:31   ` kernel test robot
  2022-05-10 12:50 ` [RFC PATCH 0/2] sbitmap: NUMA node spreading Jens Axboe
  2 siblings, 1 reply; 10+ messages in thread
From: John Garry @ 2022-05-10 11:14 UTC (permalink / raw)
  To: axboe, linux-block; +Cc: linux-kernel, linux-scsi, John Garry

Currently sbitmap words are allocated in a single array. That array is
flagged for allocating at the NUMA node passed in sbitmap_init_node().

However, the sbitmap will often be accessed by all the CPUs in the system -
for example, when BLK_MQ_F_TAG_HCTX_SHARED is set for the blk-mq tagset.

This can lead to performance issues where all CPUs in the system are doing
cross-NUMA node accesses to read/set/unset sbitmap tags.

Improve this by spreading the word allocations across all NUMA nodes as
evenly as possible. We set the per-CPU hint to fall within the range of
words allocated for the NUMA node to which that CPU belongs.
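
As a rough worked example (assuming the default 64-bit words, i.e.
shift = 6): a 256-deep shared tag bitmap gives map_nr = 256 / 64 = 4
words, so on a 4-node system map_nr_per_node = 1 and each node holds a
single word; CPUs on node N are then hinted into bits [N * 64, N * 64 + 63].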

Known issues:
- sbitmap resize does not work well for this scheme
- Improve updating hint for sbitmap_get() failure and sbitmap_put() when
  hint is outside range of CPU's NUMA node words.
- Add intelligence for sub-arrays to be allocated at a single node, e.g.
  when numa != NUMA_NO_NODE in sbitmap_init_node()

Signed-off-by: John Garry <john.garry@huawei.com>
---
 include/linux/sbitmap.h |  5 ++++
 lib/sbitmap.c           | 63 +++++++++++++++++++++++++++++++++--------
 2 files changed, 56 insertions(+), 12 deletions(-)

diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
index 46268f391e32..6d897032dbc6 100644
--- a/include/linux/sbitmap.h
+++ b/include/linux/sbitmap.h
@@ -60,6 +60,11 @@ struct sbitmap {
 	 */
 	unsigned int map_nr;
 
+	/**
+	 * @map_nr_per_node: Number of words being used per NUMA node.
+	 */
+	unsigned int map_nr_per_node;
+
 	/**
 	 * @round_robin: Allocate bits in strict round-robin order.
 	 */
diff --git a/lib/sbitmap.c b/lib/sbitmap.c
index 64fb9800ed8c..99c87fbfa1a1 100644
--- a/lib/sbitmap.c
+++ b/lib/sbitmap.c
@@ -8,6 +8,17 @@
 #include <linux/random.h>
 #include <linux/sbitmap.h>
 #include <linux/seq_file.h>
+#include <linux/mm.h>
+
+static unsigned int sbitmap_get_new_hint(struct sbitmap *sb, int cpu)
+{
+	unsigned int shift = sb->shift;
+	unsigned int map_nr_per_node = sb->map_nr_per_node;
+	unsigned int bit_per_node = map_nr_per_node << shift;
+	unsigned int hint_base = bit_per_node * cpu_to_node(cpu);
+
+	return hint_base + (prandom_u32() % bit_per_node);
+}
 
 static int init_alloc_hint(struct sbitmap *sb, gfp_t flags)
 {
@@ -20,8 +31,10 @@ static int init_alloc_hint(struct sbitmap *sb, gfp_t flags)
 	if (depth && !sb->round_robin) {
 		int i;
 
-		for_each_possible_cpu(i)
-			*per_cpu_ptr(sb->alloc_hint, i) = prandom_u32() % depth;
+		for_each_possible_cpu(i) {
+			*per_cpu_ptr(sb->alloc_hint, i) =
+				sbitmap_get_new_hint(sb, i);
+		}
 	}
 	return 0;
 }
@@ -86,7 +99,8 @@ int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift,
 {
 	unsigned int bits_per_word;
 	struct sbitmap_word *map;
-	int index;
+	int index, num_nodes = num_online_nodes();
+	int nid, map_nr_cnt;
 
 	if (shift < 0)
 		shift = sbitmap_calculate_shift(depth);
@@ -105,6 +119,11 @@ int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift,
 		return 0;
 	}
 
+	if (sb->map_nr < num_nodes) {
+		sb->map_nr_per_node = 1;
+	} else {
+		sb->map_nr_per_node = sb->map_nr / num_nodes;
+	}
 	if (alloc_hint) {
 		if (init_alloc_hint(sb, flags))
 			return -ENOMEM;
@@ -113,23 +132,43 @@ int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift,
 	}
 
 	sb->map = kvzalloc_node(sb->map_nr * sizeof(*sb->map), flags, node);
-	if (!sb->map) {
-		free_percpu(sb->alloc_hint);
-		return -ENOMEM;
-	}
-
-	map = kvzalloc_node(sb->map_nr * sizeof(**sb->map), flags, node);
-	if (!map)
-		return -ENOMEM;
+	if (!sb->map)
+		goto err_map;
 
-	for (index = 0; index < sb->map_nr; index++, map++) {
+	for (index = 0, nid = 0; index < sb->map_nr; index++, map++, map_nr_cnt++) {
 		struct sbitmap_word **_map;
 
+		if ((index % sb->map_nr_per_node) == 0) {
+			int cnt;
+
+			if (index == 0) {
+				cnt = sb->map_nr_per_node +
+					(sb->map_nr % sb->map_nr_per_node);
+			} else {
+				cnt = sb->map_nr_per_node;
+			}
+
+			map = kvzalloc_node(cnt * sizeof(**sb->map), flags, nid);
+			if (!map)
+				goto err_map_numa;
+			nid++;
+		}
+
 		_map = &sb->map[index];
 		*_map = map;
 	}
 
 	return 0;
+err_map_numa:
+	for (index = 0; index < sb->map_nr; index++, map++) {
+		if ((index % sb->map_nr_per_node) == 0) {
+			kfree(map);
+		}
+	}
+err_map:
+	free_percpu(sb->alloc_hint);
+
+	return -ENOMEM;
 }
 EXPORT_SYMBOL_GPL(sbitmap_init_node);
 
-- 
2.26.2



* Re: [RFC PATCH 0/2] sbitmap: NUMA node spreading
  2022-05-10 11:14 [RFC PATCH 0/2] sbitmap: NUMA node spreading John Garry
  2022-05-10 11:14 ` [RFC PATCH 1/2] sbitmap: Make sbitmap.map a double pointer John Garry
  2022-05-10 11:14 ` [RFC PATCH 2/2] sbitmap: Spread sbitmap word allocation over NUMA nodes John Garry
@ 2022-05-10 12:50 ` Jens Axboe
  2022-05-10 13:44   ` John Garry
  2 siblings, 1 reply; 10+ messages in thread
From: Jens Axboe @ 2022-05-10 12:50 UTC (permalink / raw)
  To: John Garry, linux-block; +Cc: linux-kernel, linux-scsi

On 5/10/22 5:14 AM, John Garry wrote:
> Hi Jens, guys,
> 
> I am sending this as an RFC to see if there is any future in it or ideas
> on how to make it better. I also need to improve some items (as mentioned
> in the 2/2 commit message) and test a lot more.
> 
> The general idea is that we change from allocating a single array of
> sbitmap words to allocating a sub-array per NUMA node. And then each CPU
> in that node is hinted to use that sub-array.
> 
> Initial performance looks decent.
> 
> Some figures:
> System: 4 nodes (with memory on all nodes), 128 CPUs
> 
> null_blk config:
> 20 devs, submit_queues=NR_CPUS, shared_tags, shared_tag_bitmap,
> hw_queue_depth=256
> 
> fio config:
> bs=4096, iodepth=128, numjobs=10, cpus_allowed_policy=split, rw=read,
> ioscheduler=none
> 
> Before:
> 7130K
> 
> After:
> 7630K
> 
> So a +7% IOPS gain.

What does the comparison run on a non-NUMA non-shared queue look like?
Because I bet it'd be slower.

To be honest, I don't like this approach at all. It makes the normal
case quite a bit slower by having an extra layer of indirection for the
word, that's quite a bit of extra cost. It doesn't seem like a good
approach for the issue, as it pessimizes the normal fast case.

Spreading the memory out does probably make sense, but we need to retain
the fast normal case. Making sbitmap support both, selected at init
time, would be far more likely to be acceptable imho.

-- 
Jens Axboe



* Re: [RFC PATCH 0/2] sbitmap: NUMA node spreading
  2022-05-10 12:50 ` [RFC PATCH 0/2] sbitmap: NUMA node spreading Jens Axboe
@ 2022-05-10 13:44   ` John Garry
  2022-05-10 14:34     ` Jens Axboe
  2022-05-11  2:07     ` Ming Lei
  0 siblings, 2 replies; 10+ messages in thread
From: John Garry @ 2022-05-10 13:44 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: linux-kernel, linux-scsi

On 10/05/2022 13:50, Jens Axboe wrote:
>> fio config:
>> bs=4096, iodepth=128, numjobs=10, cpus_allowed_policy=split, rw=read,
>> ioscheduler=none
>>
>> Before:
>> 7130K
>>
>> After:
>> 7630K
>>
>> So a +7% IOPS gain.

Thanks for having a look.

> What does the comparison run on a non-NUMA non-shared queue look like?
> Because I bet it'd be slower.

I could test more to get a solid result for that.

> 
> To be honest, I don't like this approach at all. It makes the normal
> case quite a bit slower by having an extra layer of indirection for the
> word, that's quite a bit of extra cost.

Yes, there is the extra load. I would hope that there would be a low 
cost, but I agree that we still want to avoid it. So prob no point in 
testing this more.

> It doesn't seem like a good
> approach for the issue, as it pessimizes the normal fast case.
> 
> Spreading the memory out does probably make sense, but we need to retain
> the fast normal case. Making sbitmap support both, selected at init
> time, would be far more likely to be acceptable imho.

I wanted to keep the code changes minimal for an initial RFC to test the 
water.

My original approach did not introduce the extra load for normal path 
and had some init time selection for a normal word map vs numa word map, 
but the code grew and became somewhat unmanageable. I'll revisit it to 
see how to improve that.

Cheers,
john


* Re: [RFC PATCH 0/2] sbitmap: NUMA node spreading
  2022-05-10 13:44   ` John Garry
@ 2022-05-10 14:34     ` Jens Axboe
  2022-05-10 15:03       ` John Garry
  2022-05-11  2:07     ` Ming Lei
  1 sibling, 1 reply; 10+ messages in thread
From: Jens Axboe @ 2022-05-10 14:34 UTC (permalink / raw)
  To: John Garry, linux-block; +Cc: linux-kernel, linux-scsi

On 5/10/22 7:44 AM, John Garry wrote:
> On 10/05/2022 13:50, Jens Axboe wrote:
>>> fio config:
>>> bs=4096, iodepth=128, numjobs=10, cpus_allowed_policy=split, rw=read,
>>> ioscheduler=none
>>>
>>> Before:
>>> 7130K
>>>
>>> After:
>>> 7630K
>>>
>>> So a +7% IOPS gain.
> 
> Thanks for having a look.
> 
>> What does the comparison run on a non-NUMA non-shared queue look like?
>> Because I bet it'd be slower.
> 
> I could test more to get a solid result for that.
> 
>>
>> To be honest, I don't like this approach at all. It makes the normal
>> case quite a bit slower by having an extra layer of indirection for the
>> word, that's quite a bit of extra cost.
> 
> Yes, there is the extra load. I would hope that there would be a low
> cost, but I agree that we still want to avoid it. So prob no point in
> testing this more.

I don't think that's low cost at all. It's the very hot path, and you're
now not only doing an extra load, it's a dependent load - you need to
load both to make any progress. On top of that, it's not like it's two
loads from the same cacheline or even page. The most important thing for
performance these days is having good cache utilization, the patch as it
stands very much makes that a lot worse.

Besides, for any kind of performance work like that, it's customary to
showcase both the situation that is supposedly fixed or improved with
the change, but also to test that it didn't regress the existing
common/fast case.

>> It doesn't seem like a good
>> approach for the issue, as it pessimizes the normal fast case.
>>
>> Spreading the memory out does probably make sense, but we need to retain
>> the fast normal case. Making sbitmap support both, selected at init
>> time, would be far more likely to be acceptable imho.
> 
> I wanted to keep the code changes minimal for an initial RFC to test
> the water.
>
> My original approach did not introduce the extra load for normal path
> and had some init time selection for a normal word map vs numa word
> map, but the code grew and became somewhat unmanageable. I'll revisit
> it to see how to improve that.

Probably just needs some clean refactoring first, so that the actual
change can be pretty small.

-- 
Jens Axboe



* Re: [RFC PATCH 0/2] sbitmap: NUMA node spreading
  2022-05-10 14:34     ` Jens Axboe
@ 2022-05-10 15:03       ` John Garry
  0 siblings, 0 replies; 10+ messages in thread
From: John Garry @ 2022-05-10 15:03 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: linux-kernel, linux-scsi

On 10/05/2022 15:34, Jens Axboe wrote:
>> Yes, there is the extra load. I would hope that there would be a low
>> cost, but I agree that we still want to avoid it. So prob no point in
>> testing this more.
> I don't think that's low cost at all. It's the very hot path, and you're
> now not only doing an extra load, it's a dependent load - you need to
> load both to make any progress. On top of that, it's not like it's two
> loads from the same cacheline or even page. The most important thing for
> performance these days is having good cache utilization, the patch as it
> stands very much makes that a lot worse.

Understood. Essentially patch #1/2 points in the wrong direction.

I have to admit that I was a bit blinkered by seeing how much I could 
improve the NUMA case.

> 
> Besides, for any kind of performance work like that, it's customary to
> showcase both the situation that is supposedly fixed or improved with
> the change, but also to test that it didn't regress the existing
> common/fast case.

Right, I should have done that.

> 
>>> It doesn't seem like a good
>>> approach for the issue, as it pessimizes the normal fast case.
>>>
>>> Spreading the memory out does probably make sense, but we need to retain
>>> the fast normal case. Making sbitmap support both, selected at init
>>> time, would be far more likely to be acceptable imho.
>> I wanted to keep the code changes minimal for an initial RFC to test
>> the water.
>>
>> My original approach did not introduce the extra load for normal path
>> and had some init time selection for a normal word map vs numa word
>> map, but the code grew and became somewhat unmanageable. I'll revisit
>> it to see how to improve that.
> Probably just needs some clean refactoring first, so that the actual
> change can be pretty small.

I think that it may just be a case of separating out the handling of 
evaluating the sbitmap_word ptr, as that is the common struct which we 
deal with.
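
Purely as an illustrative sketch of that refactoring (the numa_spread
flag and numa_map double pointer below are hypothetical, not from this
series):

	static inline struct sbitmap_word *sb_word(struct sbitmap *sb,
						   unsigned int index)
	{
		/* flat array in the normal case, per-node sub-arrays
		 * only when NUMA spreading was selected at init time
		 */
		if (!sb->numa_spread)
			return &sb->map[index];
		return sb->numa_map[index];
	}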

Thanks,
John


* Re: [RFC PATCH 2/2] sbitmap: Spread sbitmap word allocation over NUMA nodes
  2022-05-10 11:14 ` [RFC PATCH 2/2] sbitmap: Spread sbitmap word allocation over NUMA nodes John Garry
@ 2022-05-10 19:31   ` kernel test robot
  0 siblings, 0 replies; 10+ messages in thread
From: kernel test robot @ 2022-05-10 19:31 UTC (permalink / raw)
  To: John Garry; +Cc: llvm, kbuild-all

Hi John,

[FYI, it's a private test report for your RFC patch.]
[auto build test WARNING on axboe-block/for-next]
[also build test WARNING on linux/master linus/master v5.18-rc6 next-20220510]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/intel-lab-lkp/linux/commits/John-Garry/sbitmap-NUMA-node-spreading/20220510-192122
base:   https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git for-next
config: x86_64-randconfig-a003-20220509 (https://download.01.org/0day-ci/archive/20220511/202205110332.EbWRHZ10-lkp@intel.com/config)
compiler: clang version 15.0.0 (https://github.com/llvm/llvm-project 18dd123c56754edf62c7042dcf23185c3727610f)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/9cb4cd7060790dcb3d7f9405c11a0da884a0c616
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review John-Garry/sbitmap-NUMA-node-spreading/20220510-192122
        git checkout 9cb4cd7060790dcb3d7f9405c11a0da884a0c616
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> lib/sbitmap.c:138:63: warning: variable 'map_nr_cnt' is uninitialized when used here [-Wuninitialized]
           for (index = 0, nid = 0; index < sb->map_nr; index++, map++, map_nr_cnt++) {
                                                                        ^~~~~~~~~~
   lib/sbitmap.c:103:21: note: initialize the variable 'map_nr_cnt' to silence this warning
           int nid, map_nr_cnt;
                              ^
                               = 0
   1 warning generated.


vim +/map_nr_cnt +138 lib/sbitmap.c

    95	
    96	int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift,
    97			      gfp_t flags, int node, bool round_robin,
    98			      bool alloc_hint)
    99	{
   100		unsigned int bits_per_word;
   101		struct sbitmap_word *map;
   102		int index, num_nodes = num_online_nodes();
   103		int nid, map_nr_cnt;
   104	
   105		if (shift < 0)
   106			shift = sbitmap_calculate_shift(depth);
   107	
   108		bits_per_word = 1U << shift;
   109		if (bits_per_word > BITS_PER_LONG)
   110			return -EINVAL;
   111	
   112		sb->shift = shift;
   113		sb->depth = depth;
   114		sb->map_nr = DIV_ROUND_UP(sb->depth, bits_per_word);
   115		sb->round_robin = round_robin;
   116	
   117		if (depth == 0) {
   118			sb->map = NULL;
   119			return 0;
   120		}
   121	
   122		if (sb->map_nr < num_nodes) {
   123			sb->map_nr_per_node = 1;
   124		} else {
   125			sb->map_nr_per_node = sb->map_nr / num_nodes;
   126		}
   127		if (alloc_hint) {
   128			if (init_alloc_hint(sb, flags))
   129				return -ENOMEM;
   130		} else {
   131			sb->alloc_hint = NULL;
   132		}
   133	
   134		sb->map = kvzalloc_node(sb->map_nr * sizeof(*sb->map), flags, node);
   135		if (!sb->map)
   136			goto err_map;
   137	
 > 138		for (index = 0, nid = 0; index < sb->map_nr; index++, map++, map_nr_cnt++) {
   139			struct sbitmap_word **_map;
   140	
   141			if ((index % sb->map_nr_per_node) == 0) {
   142				int cnt;
   143	
   144				if (index == 0) {
   145					cnt = sb->map_nr_per_node +
   146						(sb->map_nr % sb->map_nr_per_node);
   147				} else {
   148					cnt = sb->map_nr_per_node;
   149				}
   150	
   151				map = kvzalloc_node(cnt * sizeof(**sb->map), flags, nid);
   152				if (!map)
   153					goto err_map_numa;
   154				nid++;
   155			}
   156	
   157			_map = &sb->map[index];
   158			*_map = map;
   159		}
   160	
   161		return 0;
   162	err_map_numa:
   163		for (index = 0; index < sb->map_nr; index++, map++) {
   164			if ((index % sb->map_nr_per_node) == 0) {
   165				kfree(map);
   166			}
   167		}
   168	err_map:
   169		free_percpu(sb->alloc_hint);
   170	
   171		return -ENOMEM;
   172	}
   173	EXPORT_SYMBOL_GPL(sbitmap_init_node);
   174	

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp


* Re: [RFC PATCH 0/2] sbitmap: NUMA node spreading
  2022-05-10 13:44   ` John Garry
  2022-05-10 14:34     ` Jens Axboe
@ 2022-05-11  2:07     ` Ming Lei
  2022-05-11  9:57       ` John Garry
  1 sibling, 1 reply; 10+ messages in thread
From: Ming Lei @ 2022-05-11  2:07 UTC (permalink / raw)
  To: John Garry; +Cc: Jens Axboe, linux-block, linux-kernel, linux-scsi

On Tue, May 10, 2022 at 02:44:50PM +0100, John Garry wrote:
> On 10/05/2022 13:50, Jens Axboe wrote:
> > > fio config:
> > > bs=4096, iodepth=128, numjobs=10, cpus_allowed_policy=split, rw=read,
> > > ioscheduler=none
> > > 
> > > Before:
> > > 7130K
> > > 
> > > After:
> > > 7630K
> > > 
> > > So a +7% IOPS gain.
> 
> Thanks for having a look.
> 
> > What does the comparison run on a non-NUMA non-shared queue look like?
> > Because I bet it'd be slower.
> 
> I could test more to get a solid result for that.
> 
> > 
> > To be honest, I don't like this approach at all. It makes the normal
> > case quite a bit slower by having an extra layer of indirection for the
> > word, that's quite a bit of extra cost.
> 
> Yes, there is the extra load. I would hope that there would be a low cost,
> but I agree that we still want to avoid it. So prob no point in testing this
> more.
> 
> > It doesn't seem like a good
> > approach for the issue, as it pessimizes the normal fast case.
> > 
> > Spreading the memory out does probably make sense, but we need to retain
> > the fast normal case. Making sbitmap support both, selected at init
> > time, would be far more likely to be acceptable imho.
> 
> I wanted to keep the code changes minimal for an initial RFC to test the
> water.
> 
> My original approach did not introduce the extra load for normal path and
> had some init time selection for a normal word map vs numa word map, but the
> code grew and became somewhat unmanageable. I'll revisit it to see how to
> improve that.

I understand this approach just splits the shared sbitmap into per-NUMA-node
parts, but what if all IOs come from CPUs in the same NUMA node? Doesn't
this cause tag starvation and waste?


Thanks,
Ming



* Re: [RFC PATCH 0/2] sbitmap: NUMA node spreading
  2022-05-11  2:07     ` Ming Lei
@ 2022-05-11  9:57       ` John Garry
  0 siblings, 0 replies; 10+ messages in thread
From: John Garry @ 2022-05-11  9:57 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jens Axboe, linux-block, linux-kernel, linux-scsi

On 11/05/2022 03:07, Ming Lei wrote:

Hi Ming,

>>> Spreading the memory out does probably make sense, but we need to retain
>>> the fast normal case. Making sbitmap support both, selected at init
>>> time, would be far more likely to be acceptable imho.
>> I wanted to keep the code changes minimal for an initial RFC to test the
>> water.
>>
>> My original approach did not introduce the extra load for normal path and
>> had some init time selection for a normal word map vs numa word map, but the
>> code grew and became somewhat unmanageable. I'll revisit it to see how to
>> improve that.
> I understand this approach just splits the shared sbitmap into per-NUMA-node
> parts, but what if all IOs come from CPUs in the same NUMA node? Doesn't
> this cause tag starvation and waste?
> 

We would not do this. If we can't find a free bit in one node then we 
need to check the others before giving up. This is some of the added 
complexity which I hinted at. And things like batch get or RR support 
become more complex.
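
A rough sketch of that kind of cross-node fallback (illustrative only;
__sbitmap_get_from_node() is a hypothetical helper, not code from this
series):

	int i, nr = -1;
	int start = cpu_to_node(raw_smp_processor_id());

	for (i = 0; i < num_online_nodes(); i++) {
		int node = (start + i) % num_online_nodes();

		nr = __sbitmap_get_from_node(sb, node);
		if (nr >= 0)
			break;	/* only give up once every node was tried */
	}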

Alternatively we could have the double pointer for numa spreading only, 
which would make things simpler. I need to check which is overall 
better. Adding the complexity for dealing with numa node sub-arrays may 
affect performance also.

Thanks,
John

