* [RFC PATCH 0/2] sbitmap: NUMA node spreading
@ 2022-05-10 11:14 John Garry
2022-05-10 11:14 ` [RFC PATCH 1/2] sbitmap: Make sbitmap.map a double pointer John Garry
` (2 more replies)
0 siblings, 3 replies; 10+ messages in thread
From: John Garry @ 2022-05-10 11:14 UTC (permalink / raw)
To: axboe, linux-block; +Cc: linux-kernel, linux-scsi, John Garry
Hi Jens, guys,
I am sending this as an RFC to see if there is any future in it or ideas
on how to make it better. I also need to improve some items (as mentioned
in the 2/2 commit message) and test a lot more.
The general idea is that we change from allocating a single array of
sbitmap words to allocating a sub-array per NUMA node. Each CPU in that
node is then hinted to use that sub-array.
Initial performance looks decent.
Some figures:
System: 4-nodes (with memory on all nodes), 128 CPUs
null blk config block:
20 devs, submit_queues=NR_CPUS, shared_tags, shared_tag_bitmap,
hw_queue_depth=256
fio config:
bs=4096, iodepth=128, numjobs=10, cpus_allowed_policy=split, rw=read,
ioscheduler=none
Before:
7130K
After:
7630K
So a +7% IOPS gain.
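For reference, the null_blk and fio configuration above might map to an invocation along these lines (the ioengine, runtime, and device path are assumptions, not stated in this thread):

```shell
# Load null_blk with 20 devices sharing one hostwide tag bitmap,
# 256-deep hardware queues, one submit queue per CPU.
modprobe null_blk nr_devices=20 submit_queues="$(nproc)" \
        shared_tags=1 shared_tag_bitmap=1 hw_queue_depth=256

# 4k reads, iodepth 128, 10 jobs split across the allowed CPUs,
# no I/O scheduler (set none via sysfs or the fio option below).
fio --name=sbitmap-numa --filename=/dev/nullb0 --direct=1 \
    --bs=4096 --iodepth=128 --numjobs=10 --rw=read \
    --cpus_allowed_policy=split --ioengine=io_uring \
    --runtime=30 --time_based --group_reporting
```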
Any comments welcome, thanks!
Based on v5.18-rc6.
John Garry (2):
sbitmap: Make sbitmap.map a double pointer
sbitmap: Spread sbitmap word allocation over NUMA nodes
include/linux/sbitmap.h | 16 +++++---
lib/sbitmap.c | 83 +++++++++++++++++++++++++++++++++--------
2 files changed, 79 insertions(+), 20 deletions(-)
--
2.26.2
^ permalink raw reply [flat|nested] 10+ messages in thread
* [RFC PATCH 1/2] sbitmap: Make sbitmap.map a double pointer
2022-05-10 11:14 [RFC PATCH 0/2] sbitmap: NUMA node spreading John Garry
@ 2022-05-10 11:14 ` John Garry
2022-05-10 11:14 ` [RFC PATCH 2/2] sbitmap: Spread sbitmap word allocation over NUMA nodes John Garry
2022-05-10 12:50 ` [RFC PATCH 0/2] sbitmap: NUMA node spreading Jens Axboe
2 siblings, 0 replies; 10+ messages in thread
From: John Garry @ 2022-05-10 11:14 UTC (permalink / raw)
To: axboe, linux-block; +Cc: linux-kernel, linux-scsi, John Garry
Currently sbitmap.map points to a single array of sbitmap words.
To allow each word to be allocated individually, make it a double pointer.
This means that accessing a word will require an extra load.
Signed-off-by: John Garry <john.garry@huawei.com>
---
include/linux/sbitmap.h | 11 ++++++-----
lib/sbitmap.c | 34 ++++++++++++++++++++++++----------
2 files changed, 30 insertions(+), 15 deletions(-)
diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
index 8f5a86e210b9..46268f391e32 100644
--- a/include/linux/sbitmap.h
+++ b/include/linux/sbitmap.h
@@ -68,7 +68,7 @@ struct sbitmap {
/**
* @map: Allocated bitmap.
*/
- struct sbitmap_word *map;
+ struct sbitmap_word **map;
/*
* @alloc_hint: Cache of last successfully allocated or freed bit.
@@ -256,9 +256,10 @@ static inline void __sbitmap_for_each_set(struct sbitmap *sb,
unsigned int depth = min_t(unsigned int,
__map_depth(sb, index) - nr,
sb->depth - scanned);
-
+ struct sbitmap_word *map = sb->map[index];
scanned += depth;
- word = sb->map[index].word & ~sb->map[index].cleared;
+
+ word = map->word & ~map->cleared;
if (!word)
goto next;
@@ -299,7 +300,7 @@ static inline void sbitmap_for_each_set(struct sbitmap *sb, sb_for_each_fn fn,
static inline unsigned long *__sbitmap_word(struct sbitmap *sb,
unsigned int bitnr)
{
- return &sb->map[SB_NR_TO_INDEX(sb, bitnr)].word;
+ return &sb->map[SB_NR_TO_INDEX(sb, bitnr)]->word;
}
/* Helpers equivalent to the operations in asm/bitops.h and linux/bitmap.h */
@@ -322,7 +323,7 @@ static inline void sbitmap_clear_bit(struct sbitmap *sb, unsigned int bitnr)
*/
static inline void sbitmap_deferred_clear_bit(struct sbitmap *sb, unsigned int bitnr)
{
- unsigned long *addr = &sb->map[SB_NR_TO_INDEX(sb, bitnr)].cleared;
+ unsigned long *addr = &sb->map[SB_NR_TO_INDEX(sb, bitnr)]->cleared;
set_bit(SB_NR_TO_BIT(sb, bitnr), addr);
}
diff --git a/lib/sbitmap.c b/lib/sbitmap.c
index ae4fd4de9ebe..64fb9800ed8c 100644
--- a/lib/sbitmap.c
+++ b/lib/sbitmap.c
@@ -85,6 +85,8 @@ int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift,
bool alloc_hint)
{
unsigned int bits_per_word;
+ struct sbitmap_word *map;
+ int index;
if (shift < 0)
shift = sbitmap_calculate_shift(depth);
@@ -116,6 +118,17 @@ int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift,
return -ENOMEM;
}
+ map = kvzalloc_node(sb->map_nr * sizeof(**sb->map), flags, node);
+ if (!map)
+ return -ENOMEM;
+
+ for (index = 0; index < sb->map_nr; index++, map++) {
+ struct sbitmap_word **_map;
+
+ _map = &sb->map[index];
+ *_map = map;
+ }
+
return 0;
}
EXPORT_SYMBOL_GPL(sbitmap_init_node);
@@ -126,7 +139,7 @@ void sbitmap_resize(struct sbitmap *sb, unsigned int depth)
unsigned int i;
for (i = 0; i < sb->map_nr; i++)
- sbitmap_deferred_clear(&sb->map[i]);
+ sbitmap_deferred_clear(sb->map[i]);
sb->depth = depth;
sb->map_nr = DIV_ROUND_UP(sb->depth, bits_per_word);
@@ -170,7 +183,7 @@ static int __sbitmap_get_word(unsigned long *word, unsigned long depth,
static int sbitmap_find_bit_in_index(struct sbitmap *sb, int index,
unsigned int alloc_hint)
{
- struct sbitmap_word *map = &sb->map[index];
+ struct sbitmap_word *map = sb->map[index];
int nr;
do {
@@ -246,7 +259,7 @@ static int __sbitmap_get_shallow(struct sbitmap *sb,
for (i = 0; i < sb->map_nr; i++) {
again:
- nr = __sbitmap_get_word(&sb->map[index].word,
+ nr = __sbitmap_get_word(&sb->map[index]->word,
min_t(unsigned int,
__map_depth(sb, index),
shallow_depth),
@@ -256,7 +269,7 @@ static int __sbitmap_get_shallow(struct sbitmap *sb,
break;
}
- if (sbitmap_deferred_clear(&sb->map[index]))
+ if (sbitmap_deferred_clear(sb->map[index]))
goto again;
/* Jump to next index. */
@@ -294,7 +307,7 @@ bool sbitmap_any_bit_set(const struct sbitmap *sb)
unsigned int i;
for (i = 0; i < sb->map_nr; i++) {
- if (sb->map[i].word & ~sb->map[i].cleared)
+ if (sb->map[i]->word & ~sb->map[i]->cleared)
return true;
}
return false;
@@ -306,7 +319,7 @@ static unsigned int __sbitmap_weight(const struct sbitmap *sb, bool set)
unsigned int i, weight = 0;
for (i = 0; i < sb->map_nr; i++) {
- const struct sbitmap_word *word = &sb->map[i];
+ const struct sbitmap_word *word = sb->map[i];
unsigned int word_depth = __map_depth(sb, i);
if (set)
@@ -358,8 +371,9 @@ void sbitmap_bitmap_show(struct sbitmap *sb, struct seq_file *m)
int i;
for (i = 0; i < sb->map_nr; i++) {
- unsigned long word = READ_ONCE(sb->map[i].word);
- unsigned long cleared = READ_ONCE(sb->map[i].cleared);
+ struct sbitmap_word *map = READ_ONCE(sb->map[i]);
+ unsigned long word = READ_ONCE(map->word);
+ unsigned long cleared = READ_ONCE(map->cleared);
unsigned int word_bits = __map_depth(sb, i);
word &= ~cleared;
@@ -522,7 +536,7 @@ unsigned long __sbitmap_queue_get_batch(struct sbitmap_queue *sbq, int nr_tags,
index = SB_NR_TO_INDEX(sb, hint);
for (i = 0; i < sb->map_nr; i++) {
- struct sbitmap_word *map = &sb->map[index];
+ struct sbitmap_word *map = sb->map[index];
unsigned long get_mask;
unsigned int map_depth = __map_depth(sb, index);
@@ -665,7 +679,7 @@ void sbitmap_queue_clear_batch(struct sbitmap_queue *sbq, int offset,
unsigned long *this_addr;
/* since we're clearing a batch, skip the deferred map */
- this_addr = &sb->map[SB_NR_TO_INDEX(sb, tag)].word;
+ this_addr = &sb->map[SB_NR_TO_INDEX(sb, tag)]->word;
if (!addr) {
addr = this_addr;
} else if (addr != this_addr) {
--
2.26.2
* [RFC PATCH 2/2] sbitmap: Spread sbitmap word allocation over NUMA nodes
2022-05-10 11:14 [RFC PATCH 0/2] sbitmap: NUMA node spreading John Garry
2022-05-10 11:14 ` [RFC PATCH 1/2] sbitmap: Make sbitmap.map a double pointer John Garry
@ 2022-05-10 11:14 ` John Garry
2022-05-10 19:31 ` kernel test robot
2022-05-10 12:50 ` [RFC PATCH 0/2] sbitmap: NUMA node spreading Jens Axboe
2 siblings, 1 reply; 10+ messages in thread
From: John Garry @ 2022-05-10 11:14 UTC (permalink / raw)
To: axboe, linux-block; +Cc: linux-kernel, linux-scsi, John Garry
Currently sbitmap words are allocated in a single array. That array is
flagged for allocating at the NUMA node passed in sbitmap_init_node().
However, the sbitmap will often be accessed by all the CPUs in the
system - for example, when BLK_MQ_F_TAG_HCTX_SHARED is set for the
blk-mq tagset.
This can lead to performance issues where all CPUs in the system are doing
cross-NUMA node accesses to read/set/unset sbitmap tags.
Improve this by spreading the word allocations across all NUMA nodes as
evenly as possible. We set the per-CPU hint to fall within the range of
words allocated for the NUMA node to which that CPU belongs.
Known issues:
- sbitmap resize does not work well for this scheme
- Improve updating hint for sbitmap_get() failure and sbitmap_put() when
hint is outside range of CPU's NUMA node words.
- Add intelligence for sub-arrays to be allocated at a single node, e.g.
when numa != NUMA_NO_NODE in sbitmap_init_node()
Signed-off-by: John Garry <john.garry@huawei.com>
---
include/linux/sbitmap.h | 5 ++++
lib/sbitmap.c | 63 +++++++++++++++++++++++++++++++++--------
2 files changed, 56 insertions(+), 12 deletions(-)
diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
index 46268f391e32..6d897032dbc6 100644
--- a/include/linux/sbitmap.h
+++ b/include/linux/sbitmap.h
@@ -60,6 +60,11 @@ struct sbitmap {
*/
unsigned int map_nr;
+ /**
+ * @map_nr_per_node: Number of words being used per NUMA node.
+ */
+ unsigned int map_nr_per_node;
+
/**
* @round_robin: Allocate bits in strict round-robin order.
*/
diff --git a/lib/sbitmap.c b/lib/sbitmap.c
index 64fb9800ed8c..99c87fbfa1a1 100644
--- a/lib/sbitmap.c
+++ b/lib/sbitmap.c
@@ -8,6 +8,17 @@
#include <linux/random.h>
#include <linux/sbitmap.h>
#include <linux/seq_file.h>
+#include <linux/mm.h>
+
+static unsigned int sbitmap_get_new_hint(struct sbitmap *sb, int cpu)
+{
+ unsigned int shift = sb->shift;
+ unsigned int map_nr_per_node = sb->map_nr_per_node;
+ unsigned int bit_per_node = map_nr_per_node << shift;
+ unsigned int hint_base = bit_per_node * cpu_to_node(cpu);
+
+ return hint_base + (prandom_u32() % bit_per_node);
+}
static int init_alloc_hint(struct sbitmap *sb, gfp_t flags)
{
@@ -20,8 +31,10 @@ static int init_alloc_hint(struct sbitmap *sb, gfp_t flags)
if (depth && !sb->round_robin) {
int i;
- for_each_possible_cpu(i)
- *per_cpu_ptr(sb->alloc_hint, i) = prandom_u32() % depth;
+ for_each_possible_cpu(i) {
+ *per_cpu_ptr(sb->alloc_hint, i) =
+ sbitmap_get_new_hint(sb, i);
+ }
}
return 0;
}
@@ -86,7 +99,8 @@ int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift,
{
unsigned int bits_per_word;
struct sbitmap_word *map;
- int index;
+ int index, num_nodes = num_online_nodes();
+ int nid, map_nr_cnt;
if (shift < 0)
shift = sbitmap_calculate_shift(depth);
@@ -105,6 +119,11 @@ int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift,
return 0;
}
+ if (sb->map_nr < num_nodes) {
+ sb->map_nr_per_node = 1;
+ } else {
+ sb->map_nr_per_node = sb->map_nr / num_nodes;
+ }
if (alloc_hint) {
if (init_alloc_hint(sb, flags))
return -ENOMEM;
@@ -113,23 +132,43 @@ int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift,
}
sb->map = kvzalloc_node(sb->map_nr * sizeof(*sb->map), flags, node);
- if (!sb->map) {
- free_percpu(sb->alloc_hint);
- return -ENOMEM;
- }
-
- map = kvzalloc_node(sb->map_nr * sizeof(**sb->map), flags, node);
- if (!map)
- return -ENOMEM;
+ if (!sb->map)
+ goto err_map;
- for (index = 0; index < sb->map_nr; index++, map++) {
+ for (index = 0, nid = 0; index < sb->map_nr; index++, map++, map_nr_cnt++) {
struct sbitmap_word **_map;
+ if ((index % sb->map_nr_per_node) == 0) {
+ int cnt;
+
+ if (index == 0) {
+ cnt = sb->map_nr_per_node +
+ (sb->map_nr % sb->map_nr_per_node);
+ } else {
+ cnt = sb->map_nr_per_node;
+ }
+
+ map = kvzalloc_node(cnt * sizeof(**sb->map), flags, nid);
+ if (!map)
+ goto err_map_numa;
+ nid++;
+ }
+
_map = &sb->map[index];
*_map = map;
}
return 0;
+err_map_numa:
+ for (index = 0; index < sb->map_nr; index++, map++) {
+ if ((index % sb->map_nr_per_node) == 0) {
+ kfree(map);
+ }
+ }
+err_map:
+ free_percpu(sb->alloc_hint);
+
+ return -ENOMEM;
}
EXPORT_SYMBOL_GPL(sbitmap_init_node);
--
2.26.2
* Re: [RFC PATCH 0/2] sbitmap: NUMA node spreading
2022-05-10 11:14 [RFC PATCH 0/2] sbitmap: NUMA node spreading John Garry
2022-05-10 11:14 ` [RFC PATCH 1/2] sbitmap: Make sbitmap.map a double pointer John Garry
2022-05-10 11:14 ` [RFC PATCH 2/2] sbitmap: Spread sbitmap word allocation over NUMA nodes John Garry
@ 2022-05-10 12:50 ` Jens Axboe
2022-05-10 13:44 ` John Garry
2 siblings, 1 reply; 10+ messages in thread
From: Jens Axboe @ 2022-05-10 12:50 UTC (permalink / raw)
To: John Garry, linux-block; +Cc: linux-kernel, linux-scsi
On 5/10/22 5:14 AM, John Garry wrote:
> Hi Jens, guys,
>
> I am sending this as an RFC to see if there is any future in it or ideas
> on how to make better. I also need to improve some items (as mentioned in
> 2/2 commit message) and test a lot more.
>
> The general idea is that we change from allocating a single array of
> sbitmap words to allocating an sub-array per NUMA node. And then each CPU
> in that node is hinted to use that sub-array
>
> Initial performance looks decent.
>
> Some figures:
> System: 4-nodes (with memory on all nodes), 128 CPUs
>
> null blk config block:
> 20 devs, submit_queues=NR_CPUS, shared_tags, shared_tag_bitmap,
> hw_queue_depth=256
>
> fio config:
> bs=4096, iodepth=128, numjobs=10, cpus_allowed_policy=split, rw=read,
> ioscheduler=none
>
> Before:
> 7130K
>
> After:
> 7630K
>
> So a +7% IOPS gain.
What does the comparison run on a non-NUMA non-shared queue look like?
Because I bet it'd be slower.
To be honest, I don't like this approach at all. It makes the normal
case quite a bit slower by having an extra layer of indirection for the
word, that's quite a bit of extra cost. It doesn't seem like a good
approach for the issue, as it pessimizes the normal fast case.
Spreading the memory out does probably make sense, but we need to retain
the fast normal case. Making sbitmap support both, selected at init
time, would be far more likely to be acceptable imho.
--
Jens Axboe
* Re: [RFC PATCH 0/2] sbitmap: NUMA node spreading
2022-05-10 12:50 ` [RFC PATCH 0/2] sbitmap: NUMA node spreading Jens Axboe
@ 2022-05-10 13:44 ` John Garry
2022-05-10 14:34 ` Jens Axboe
2022-05-11 2:07 ` Ming Lei
0 siblings, 2 replies; 10+ messages in thread
From: John Garry @ 2022-05-10 13:44 UTC (permalink / raw)
To: Jens Axboe, linux-block; +Cc: linux-kernel, linux-scsi
On 10/05/2022 13:50, Jens Axboe wrote:
>> fio config:
>> bs=4096, iodepth=128, numjobs=10, cpus_allowed_policy=split, rw=read,
>> ioscheduler=none
>>
>> Before:
>> 7130K
>>
>> After:
>> 7630K
>>
>> So a +7% IOPS gain.
Thanks for having a look.
> What does the comparison run on a non-NUMA non-shared queue look like?
> Because I bet it'd be slower.
I could test more to get a solid result for that.
>
> To be honest, I don't like this approach at all. It makes the normal
> case quite a bit slower by having an extra layer of indirection for the
> word, that's quite a bit of extra cost.
Yes, there is the extra load. I would hope that there would be a low
cost, but I agree that we still want to avoid it. So prob no point in
testing this more.
> It doesn't seem like a good
> approach for the issue, as it pessimizes the normal fast case.
>
> Spreading the memory out does probably make sense, but we need to retain
> the fast normal case. Making sbitmap support both, selected at init
> time, would be far more likely to be acceptable imho.
I wanted to keep the code changes minimal for an initial RFC to test the
water.
My original approach did not introduce the extra load for normal path
and had some init time selection for a normal word map vs numa word map,
but the code grew and became somewhat unmanageable. I'll revisit it to
see how to improve that.
Cheers,
john
* Re: [RFC PATCH 0/2] sbitmap: NUMA node spreading
2022-05-10 13:44 ` John Garry
@ 2022-05-10 14:34 ` Jens Axboe
2022-05-10 15:03 ` John Garry
2022-05-11 2:07 ` Ming Lei
1 sibling, 1 reply; 10+ messages in thread
From: Jens Axboe @ 2022-05-10 14:34 UTC (permalink / raw)
To: John Garry, linux-block; +Cc: linux-kernel, linux-scsi
On 5/10/22 7:44 AM, John Garry wrote:
> On 10/05/2022 13:50, Jens Axboe wrote:
>>> fio config:
>>> bs=4096, iodepth=128, numjobs=10, cpus_allowed_policy=split, rw=read,
>>> ioscheduler=none
>>>
>>> Before:
>>> 7130K
>>>
>>> After:
>>> 7630K
>>>
>>> So a +7% IOPS gain.
>
> Thanks for having a look.
>
>> What does the comparison run on a non-NUMA non-shared queue look like?
>> Because I bet it'd be slower.
>
> I could test more to get a solid result for that.
>
>>
>> To be honest, I don't like this approach at all. It makes the normal
>> case quite a bit slower by having an extra layer of indirection for the
>> word, that's quite a bit of extra cost.
>
> Yes, there is the extra load. I would hope that there would be a low
> cost, but I agree that we still want to avoid it. So prob no point in
> testing this more.
I don't think that's low cost at all. It's the very hot path, and you're
now not only doing an extra load, it's a dependent load - you need to
load both to make any progress. On top of that, it's not like it's two
loads from the same cacheline or even page. The most important thing for
performance these days is having good cache utilization, the patch as it
stands very much makes that a lot worse.
Besides, for any kind of performance work like that, it's customary to
showcase both the situation that is supposedly fixed or improved with
the change, but also to test that it didn't regress the existing
common/fast case.
>> It doesn't seem like a good
>> approach for the issue, as it pessimizes the normal fast case.
>>
>> Spreading the memory out does probably make sense, but we need to retain
>> the fast normal case. Making sbitmap support both, selected at init
>> time, would be far more likely to be acceptable imho.
>
> I wanted to keep the code changes minimal for an initial RFC to test
> the water.
>
> My original approach did not introduce the extra load for normal path
> and had some init time selection for a normal word map vs numa word
> map, but the code grew and became somewhat unmanageable. I'll revisit
> it to see how to improve that.
Probably just needs some clean refactoring first, so that the actual
change can be pretty small.
--
Jens Axboe
* Re: [RFC PATCH 0/2] sbitmap: NUMA node spreading
2022-05-10 14:34 ` Jens Axboe
@ 2022-05-10 15:03 ` John Garry
0 siblings, 0 replies; 10+ messages in thread
From: John Garry @ 2022-05-10 15:03 UTC (permalink / raw)
To: Jens Axboe, linux-block; +Cc: linux-kernel, linux-scsi
On 10/05/2022 15:34, Jens Axboe wrote:
>> Yes, there is the extra load. I would hope that there would be a low
>> cost, but I agree that we still want to avoid it. So prob no point in
>> testing this more.
> I don't think that's low cost at all. It's the very hot path, and you're
> now not only doing an extra load, it's a dependent load - you need to
> load both to make any progress. On top of that, it's not like it's two
> loads from the same cacheline or even page. The most important thing for
> performance these days is having good cache utilization, the patch as it
> stands very much makes that a lot worse.
Understood. Essentially patch #1/2 points in the wrong direction.
I have to admit that I was a bit blinkered by seeing how much I could
improve the NUMA case.
>
> Besides, for any kind of performance work like that, it's customary to
> showcase both the situation that is supposedly fixed or improved with
> the change, but also to test that it didn't regress the existing
> common/fast case.
Right, I should have done that.
>
>>> It doesn't seem like a good
>>> approach for the issue, as it pessimizes the normal fast case.
>>>
>>> Spreading the memory out does probably make sense, but we need to retain
>>> the fast normal case. Making sbitmap support both, selected at init
>>> time, would be far more likely to be acceptable imho.
>> I wanted to keep the code changes minimal for an initial RFC to test
>> the water.
>>
>> My original approach did not introduce the extra load for normal path
>> and had some init time selection for a normal word map vs numa word
>> map, but the code grew and became somewhat unmanageable. I'll revisit
>> it to see how to improve that.
> Probably just needs some clean refactoring first, so that the actual
> change can be pretty small.
I think that it may just be a case of separating out the handling of
evaluating the sbitmap_word pointer, as that is the common struct which
we deal with.
Thanks,
John
* Re: [RFC PATCH 2/2] sbitmap: Spread sbitmap word allocation over NUMA nodes
2022-05-10 11:14 ` [RFC PATCH 2/2] sbitmap: Spread sbitmap word allocation over NUMA nodes John Garry
@ 2022-05-10 19:31 ` kernel test robot
0 siblings, 0 replies; 10+ messages in thread
From: kernel test robot @ 2022-05-10 19:31 UTC (permalink / raw)
To: John Garry; +Cc: llvm, kbuild-all
Hi John,
[FYI, it's a private test report for your RFC patch.]
[auto build test WARNING on axboe-block/for-next]
[also build test WARNING on linux/master linus/master v5.18-rc6 next-20220510]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]
url: https://github.com/intel-lab-lkp/linux/commits/John-Garry/sbitmap-NUMA-node-spreading/20220510-192122
base: https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git for-next
config: x86_64-randconfig-a003-20220509 (https://download.01.org/0day-ci/archive/20220511/202205110332.EbWRHZ10-lkp@intel.com/config)
compiler: clang version 15.0.0 (https://github.com/llvm/llvm-project 18dd123c56754edf62c7042dcf23185c3727610f)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/intel-lab-lkp/linux/commit/9cb4cd7060790dcb3d7f9405c11a0da884a0c616
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review John-Garry/sbitmap-NUMA-node-spreading/20220510-192122
git checkout 9cb4cd7060790dcb3d7f9405c11a0da884a0c616
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash
If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>
All warnings (new ones prefixed by >>):
>> lib/sbitmap.c:138:63: warning: variable 'map_nr_cnt' is uninitialized when used here [-Wuninitialized]
for (index = 0, nid = 0; index < sb->map_nr; index++, map++, map_nr_cnt++) {
^~~~~~~~~~
lib/sbitmap.c:103:21: note: initialize the variable 'map_nr_cnt' to silence this warning
int nid, map_nr_cnt;
^
= 0
1 warning generated.
vim +/map_nr_cnt +138 lib/sbitmap.c
95
96 int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift,
97 gfp_t flags, int node, bool round_robin,
98 bool alloc_hint)
99 {
100 unsigned int bits_per_word;
101 struct sbitmap_word *map;
102 int index, num_nodes = num_online_nodes();
103 int nid, map_nr_cnt;
104
105 if (shift < 0)
106 shift = sbitmap_calculate_shift(depth);
107
108 bits_per_word = 1U << shift;
109 if (bits_per_word > BITS_PER_LONG)
110 return -EINVAL;
111
112 sb->shift = shift;
113 sb->depth = depth;
114 sb->map_nr = DIV_ROUND_UP(sb->depth, bits_per_word);
115 sb->round_robin = round_robin;
116
117 if (depth == 0) {
118 sb->map = NULL;
119 return 0;
120 }
121
122 if (sb->map_nr < num_nodes) {
123 sb->map_nr_per_node = 1;
124 } else {
125 sb->map_nr_per_node = sb->map_nr / num_nodes;
126 }
127 if (alloc_hint) {
128 if (init_alloc_hint(sb, flags))
129 return -ENOMEM;
130 } else {
131 sb->alloc_hint = NULL;
132 }
133
134 sb->map = kvzalloc_node(sb->map_nr * sizeof(*sb->map), flags, node);
135 if (!sb->map)
136 goto err_map;
137
> 138 for (index = 0, nid = 0; index < sb->map_nr; index++, map++, map_nr_cnt++) {
139 struct sbitmap_word **_map;
140
141 if ((index % sb->map_nr_per_node) == 0) {
142 int cnt;
143
144 if (index == 0) {
145 cnt = sb->map_nr_per_node +
146 (sb->map_nr % sb->map_nr_per_node);
147 } else {
148 cnt = sb->map_nr_per_node;
149 }
150
151 map = kvzalloc_node(cnt * sizeof(**sb->map), flags, nid);
152 if (!map)
153 goto err_map_numa;
154 nid++;
155 }
156
157 _map = &sb->map[index];
158 *_map = map;
159 }
160
161 return 0;
162 err_map_numa:
163 for (index = 0; index < sb->map_nr; index++, map++) {
164 if ((index % sb->map_nr_per_node) == 0) {
165 kfree(map);
166 }
167 }
168 err_map:
169 free_percpu(sb->alloc_hint);
170
171 return -ENOMEM;
172 }
173 EXPORT_SYMBOL_GPL(sbitmap_init_node);
174
--
0-DAY CI Kernel Test Service
https://01.org/lkp
* Re: [RFC PATCH 0/2] sbitmap: NUMA node spreading
2022-05-10 13:44 ` John Garry
2022-05-10 14:34 ` Jens Axboe
@ 2022-05-11 2:07 ` Ming Lei
2022-05-11 9:57 ` John Garry
1 sibling, 1 reply; 10+ messages in thread
From: Ming Lei @ 2022-05-11 2:07 UTC (permalink / raw)
To: John Garry; +Cc: Jens Axboe, linux-block, linux-kernel, linux-scsi
On Tue, May 10, 2022 at 02:44:50PM +0100, John Garry wrote:
> On 10/05/2022 13:50, Jens Axboe wrote:
> > > fio config:
> > > bs=4096, iodepth=128, numjobs=10, cpus_allowed_policy=split, rw=read,
> > > ioscheduler=none
> > >
> > > Before:
> > > 7130K
> > >
> > > After:
> > > 7630K
> > >
> > > So a +7% IOPS gain.
>
> Thanks for having a look.
>
> > What does the comparison run on a non-NUMA non-shared queue look like?
> > Because I bet it'd be slower.
>
> I could test more to get a solid result for that.
>
> >
> > To be honest, I don't like this approach at all. It makes the normal
> > case quite a bit slower by having an extra layer of indirection for the
> > word, that's quite a bit of extra cost.
>
> Yes, there is the extra load. I would hope that there would be a low cost,
> but I agree that we still want to avoid it. So prob no point in testing this
> more.
>
> > It doesn't seem like a good
> > approach for the issue, as it pessimizes the normal fast case.
> >
> > Spreading the memory out does probably make sense, but we need to retain
> > the fast normal case. Making sbitmap support both, selected at init
> > time, would be far more likely to be acceptable imho.
>
> I wanted to keep the code changes minimal for an initial RFC to test the
> water.
>
> My original approach did not introduce the extra load for normal path and
> had some init time selection for a normal word map vs numa word map, but the
> code grew and became somewhat unmanageable. I'll revisit it to see how to
> improve that.
I understand this approach just splits the shared sbitmap into
per-NUMA-node parts, but what if all IOs come from CPUs in the same NUMA
node? Wouldn't that cause tag starvation and waste?
Thanks,
Ming
* Re: [RFC PATCH 0/2] sbitmap: NUMA node spreading
2022-05-11 2:07 ` Ming Lei
@ 2022-05-11 9:57 ` John Garry
0 siblings, 0 replies; 10+ messages in thread
From: John Garry @ 2022-05-11 9:57 UTC (permalink / raw)
To: Ming Lei; +Cc: Jens Axboe, linux-block, linux-kernel, linux-scsi
On 11/05/2022 03:07, Ming Lei wrote:
Hi Ming,
>>> Spreading the memory out does probably make sense, but we need to retain
>>> the fast normal case. Making sbitmap support both, selected at init
>>> time, would be far more likely to be acceptable imho.
>> I wanted to keep the code changes minimal for an initial RFC to test the
>> water.
>>
>> My original approach did not introduce the extra load for normal path and
>> had some init time selection for a normal word map vs numa word map, but the
>> code grew and became somewhat unmanageable. I'll revisit it to see how to
>> improve that.
> I understand this approach just splits shared sbitmap into per-numa-node
> part, but what if all IOs are just from CPUs in one same numa node? Doesn't
> this way cause tag starvation and waste?
>
We would not do this. If we can't find a free bit in one node then we
need to check the others before giving up. This is some of the added
complexity which I hinted at. And things like batch get or RR support
become more complex.
Alternatively we could have the double pointer for numa spreading only,
which would make things simpler. I need to check which is overall
better. Adding the complexity for dealing with numa node sub-arrays may
affect performance also.
Thanks,
John
end of thread, other threads:[~2022-05-11 9:57 UTC | newest]
Thread overview: 10+ messages
2022-05-10 11:14 [RFC PATCH 0/2] sbitmap: NUMA node spreading John Garry
2022-05-10 11:14 ` [RFC PATCH 1/2] sbitmap: Make sbitmap.map a double pointer John Garry
2022-05-10 11:14 ` [RFC PATCH 2/2] sbitmap: Spread sbitmap word allocation over NUMA nodes John Garry
2022-05-10 19:31 ` kernel test robot
2022-05-10 12:50 ` [RFC PATCH 0/2] sbitmap: NUMA node spreading Jens Axboe
2022-05-10 13:44 ` John Garry
2022-05-10 14:34 ` Jens Axboe
2022-05-10 15:03 ` John Garry
2022-05-11 2:07 ` Ming Lei
2022-05-11 9:57 ` John Garry