* [PATCHSET 0/2] Improve batched tag allocation
@ 2021-10-12 17:15 Jens Axboe
2021-10-12 17:15 ` [PATCH 1/2] sbitmap: add __sbitmap_queue_get_batch() Jens Axboe
2021-10-12 17:15 ` [PATCH 2/2] block: improve batched tag allocation Jens Axboe
From: Jens Axboe @ 2021-10-12 17:15 UTC
To: linux-block
Hi,
We can do better than implementing batch tag allocs as just repeated
calls into sbitmap. Add a sbitmap helper to grab a batch all at once,
and use that instead.
Testing with instrumentation added, we get very close to the full batch
count. For NVMe, if I run with batched submits of 32, the actual batch
size achieved is ~31 on average. This is close to ideal, as one hw queue
will have a 63-tag depth and hence we get 31 of the 32 requested tags
once in every 32 allocations. This could be improved, but wasting the
extra cycles in sbitmap to skip to the next index for that case doesn't
seem worth it.
--
Jens Axboe
* [PATCH 1/2] sbitmap: add __sbitmap_queue_get_batch()
2021-10-12 17:15 [PATCHSET 0/2] Improve batched tag allocation Jens Axboe
@ 2021-10-12 17:15 ` Jens Axboe
2021-10-12 17:15 ` [PATCH 2/2] block: improve batched tag allocation Jens Axboe
From: Jens Axboe @ 2021-10-12 17:15 UTC
To: linux-block; +Cc: Jens Axboe
The block layer tag allocation batching still calls into sbitmap to get
each tag, but we can improve on that. Add __sbitmap_queue_get_batch(),
which returns a mask of tags all at once, along with an offset for
those tags.
An example return would be 0xff, where bits 0..7 are set, with
tag_offset == 128. The valid tags in this case would be 128..135.
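For illustration, a caller could decode the returned mask along these
lines (just a sketch, not part of this patch; __ffs() is the usual
helper from <linux/bitops.h>):

	unsigned int offset;
	unsigned long mask;

	mask = __sbitmap_queue_get_batch(sbq, 16, &offset);
	while (mask) {
		unsigned int tag = offset + __ffs(mask);

		mask &= mask - 1;	/* clear lowest set bit */
		/* ... init/use 'tag' ... */
	}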
A batch is specific to an individual sbitmap word (struct sbitmap_word),
hence it cannot be larger than that. The requested number of tags is
automatically reduced to the max that can be satisfied with a single
word.
On failure, 0 is returned. The caller should fall back to single tag
allocation at that point.
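To see the claim-by-cmpxchg idea at the heart of the implementation in
isolation, here is a minimal userspace sketch (illustrative only, using
C11 atomics in place of the kernel's atomic_long_cmpxchg()):

	#include <stdatomic.h>

	/*
	 * Try to set the bits in 'want' within '*word'. Returns the
	 * subset of 'want' that was actually free, which the caller
	 * now owns; bits that were already set belong to someone else.
	 */
	static unsigned long claim_bits(_Atomic unsigned long *word,
					unsigned long want)
	{
		unsigned long val = atomic_load(word);

		/* Retry until we install 'want' over an unchanged word. */
		while (!atomic_compare_exchange_weak(word, &val, val | want))
			;
		return want & ~val;
	}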
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
include/linux/sbitmap.h | 13 +++++++++++
lib/sbitmap.c | 51 +++++++++++++++++++++++++++++++++++++++++
2 files changed, 64 insertions(+)
diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
index 2713e689ad66..e30b56023ead 100644
--- a/include/linux/sbitmap.h
+++ b/include/linux/sbitmap.h
@@ -426,6 +426,19 @@ void sbitmap_queue_resize(struct sbitmap_queue *sbq, unsigned int depth);
*/
int __sbitmap_queue_get(struct sbitmap_queue *sbq);
+/**
+ * __sbitmap_queue_get_batch() - Try to allocate a batch of free bits
+ * @sbq: Bitmap queue to allocate from.
+ * @nr_tags: number of tags requested
+ * @offset: offset to add to returned bits
+ *
+ * Return: Mask of allocated tags, 0 if none are found. Each tag allocated is
+ * a bit in the mask returned, and the caller must add @offset to the value to
+ * get the absolute tag value.
+ */
+unsigned long __sbitmap_queue_get_batch(struct sbitmap_queue *sbq, int nr_tags,
+ unsigned int *offset);
+
/**
* __sbitmap_queue_get_shallow() - Try to allocate a free bit from a &struct
* sbitmap_queue, limiting the depth used from each word, with preemption
diff --git a/lib/sbitmap.c b/lib/sbitmap.c
index b25db9be938a..f398e0ae548e 100644
--- a/lib/sbitmap.c
+++ b/lib/sbitmap.c
@@ -489,6 +489,57 @@ int __sbitmap_queue_get(struct sbitmap_queue *sbq)
}
EXPORT_SYMBOL_GPL(__sbitmap_queue_get);
+unsigned long __sbitmap_queue_get_batch(struct sbitmap_queue *sbq, int nr_tags,
+ unsigned int *offset)
+{
+ struct sbitmap *sb = &sbq->sb;
+ unsigned int hint, depth;
+ unsigned long index, nr;
+ int i;
+
+ if (unlikely(sb->round_robin))
+ return 0;
+
+ depth = READ_ONCE(sb->depth);
+ hint = update_alloc_hint_before_get(sb, depth);
+
+ index = SB_NR_TO_INDEX(sb, hint);
+
+ for (i = 0; i < sb->map_nr; i++) {
+ struct sbitmap_word *map = &sb->map[index];
+ unsigned long get_mask;
+
+ sbitmap_deferred_clear(map);
+ if (map->word == (1UL << (map->depth - 1)) - 1)
+ continue;
+
+ nr = find_first_zero_bit(&map->word, map->depth);
+ if (nr + nr_tags <= map->depth) {
+ atomic_long_t *ptr = (atomic_long_t *) &map->word;
+ int map_tags = min_t(int, nr_tags, map->depth);
+ unsigned long val, ret;
+
+ get_mask = ((1UL << map_tags) - 1) << nr;
+ do {
+ val = READ_ONCE(map->word);
+ ret = atomic_long_cmpxchg(ptr, val, get_mask | val);
+ } while (ret != val);
+ get_mask = (get_mask & ~ret) >> nr;
+ if (get_mask) {
+ *offset = nr + (index << sb->shift);
+ update_alloc_hint_after_get(sb, depth, hint,
+ *offset + map_tags - 1);
+ return get_mask;
+ }
+ }
+ /* Jump to next index. */
+ if (++index >= sb->map_nr)
+ index = 0;
+ }
+
+ return 0;
+}
+
int __sbitmap_queue_get_shallow(struct sbitmap_queue *sbq,
unsigned int shallow_depth)
{
--
2.33.0
* [PATCH 2/2] block: improve batched tag allocation
2021-10-12 17:15 [PATCHSET 0/2] Improve batched tag allocation Jens Axboe
2021-10-12 17:15 ` [PATCH 1/2] sbitmap: add __sbitmap_queue_get_batch() Jens Axboe
@ 2021-10-12 17:15 ` Jens Axboe
2021-10-14 5:17 ` Christoph Hellwig
From: Jens Axboe @ 2021-10-12 17:15 UTC
To: linux-block; +Cc: Jens Axboe
Add a blk_mq_get_tags() helper, which uses the new sbitmap API for
allocating a batch of tags all at once. This both simplifies the block
code for batched allocation and is more efficient than doing repeated
calls into __sbitmap_queue_get().
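As a rough sketch of the intended calling convention (the real call
site is in the __blk_mq_alloc_requests() hunk below):

	unsigned int tag_offset;
	unsigned long tags;

	tags = blk_mq_get_tags(data, data->nr_tags, &tag_offset);
	if (!tags) {
		/* fall back to single-tag blk_mq_get_tag() */
	}
	/* each set bit b in 'tags' is the absolute tag 'tag_offset + b' */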
This reduces the sbitmap overhead in peak runs from ~3% to ~1%, and
increases performance from 6.6M IOPS to 6.8M IOPS for a single CPU
core.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
block/blk-mq-tag.c | 15 ++++++++++++
block/blk-mq-tag.h | 2 ++
block/blk-mq.c | 57 ++++++++++++++++++++++++++++++----------------
3 files changed, 54 insertions(+), 20 deletions(-)
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 72a2724a4eee..c43b97201161 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -86,6 +86,21 @@ static int __blk_mq_get_tag(struct blk_mq_alloc_data *data,
return __sbitmap_queue_get(bt);
}
+unsigned long blk_mq_get_tags(struct blk_mq_alloc_data *data, int nr_tags,
+ unsigned int *offset)
+{
+ struct blk_mq_tags *tags = blk_mq_tags_from_data(data);
+ struct sbitmap_queue *bt = &tags->bitmap_tags;
+ unsigned long ret;
+
+ if (data->shallow_depth ||data->flags & BLK_MQ_REQ_RESERVED ||
+ data->hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
+ return 0;
+ ret = __sbitmap_queue_get_batch(bt, nr_tags, offset);
+ *offset += tags->nr_reserved_tags;
+ return ret;
+}
+
unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
{
struct blk_mq_tags *tags = blk_mq_tags_from_data(data);
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index d8ce89fa1686..71c2f7d8e9b7 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -38,6 +38,8 @@ extern int blk_mq_init_bitmaps(struct sbitmap_queue *bitmap_tags,
int node, int alloc_policy);
extern unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data);
+unsigned long blk_mq_get_tags(struct blk_mq_alloc_data *data, int nr_tags,
+ unsigned int *offset);
extern void blk_mq_put_tag(struct blk_mq_tags *tags, struct blk_mq_ctx *ctx,
unsigned int tag);
extern int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
diff --git a/block/blk-mq.c b/block/blk-mq.c
index f42cf615c527..7027a25c5271 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -408,15 +408,39 @@ static struct request *__blk_mq_alloc_requests(struct blk_mq_alloc_data *data)
blk_mq_tag_busy(data->hctx);
/*
- * Waiting allocations only fail because of an inactive hctx. In that
- * case just retry the hctx assignment and tag allocation as CPU hotplug
- * should have migrated us to an online CPU by now.
+ * Try batched alloc if we want more than 1 tag.
*/
- do {
+ if (data->nr_tags > 1) {
+ unsigned long tags;
+ unsigned int tag_offset;
+ int i, nr = 0;
+
+ tags = blk_mq_get_tags(data, data->nr_tags, &tag_offset);
+ if (unlikely(!tags)) {
+ data->nr_tags = 1;
+ goto retry;
+ }
+ for (i = 0; tags; i++) {
+ if (!(tags & (1UL << i)))
+ continue;
+ tag = tag_offset + i;
+ tags &= ~(1UL << i);
+ rq = blk_mq_rq_ctx_init(data, tag, alloc_time_ns);
+ rq->rq_next = *data->cached_rq;
+ *data->cached_rq = rq;
+ }
+ data->nr_tags -= nr;
+ } else {
+ /*
+ * Waiting allocations only fail because of an inactive hctx.
+ * In that case just retry the hctx assignment and tag
+ * allocation as CPU hotplug should have migrated us to an
+ * online CPU by now.
+ */
tag = blk_mq_get_tag(data);
if (tag == BLK_MQ_NO_TAG) {
if (data->flags & BLK_MQ_REQ_NOWAIT)
- break;
+ return NULL;
/*
* Give up the CPU and sleep for a random short time to
* ensure that thread using a realtime scheduling class
@@ -427,23 +451,16 @@ static struct request *__blk_mq_alloc_requests(struct blk_mq_alloc_data *data)
goto retry;
}
- rq = blk_mq_rq_ctx_init(data, tag, alloc_time_ns);
- if (!--data->nr_tags || e ||
- (data->hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
- return rq;
-
- /* link into the cached list */
- rq->rq_next = *data->cached_rq;
- *data->cached_rq = rq;
- data->flags |= BLK_MQ_REQ_NOWAIT;
- } while (1);
+ return blk_mq_rq_ctx_init(data, tag, alloc_time_ns);
+ }
- if (!data->cached_rq)
- return NULL;
+ if (data->cached_rq) {
+ rq = *data->cached_rq;
+ *data->cached_rq = rq->rq_next;
+ return rq;
+ }
- rq = *data->cached_rq;
- *data->cached_rq = rq->rq_next;
- return rq;
+ return NULL;
}
struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
--
2.33.0
* Re: [PATCH 2/2] block: improve batched tag allocation
2021-10-12 17:15 ` [PATCH 2/2] block: improve batched tag allocation Jens Axboe
@ 2021-10-14 5:17 ` Christoph Hellwig
2021-10-14 13:03 ` Jens Axboe
From: Christoph Hellwig @ 2021-10-14 5:17 UTC
To: Jens Axboe; +Cc: linux-block
On Tue, Oct 12, 2021 at 11:15:25AM -0600, Jens Axboe wrote:
>
> +unsigned long blk_mq_get_tags(struct blk_mq_alloc_data *data, int nr_tags,
> + unsigned int *offset)
> +{
> + struct blk_mq_tags *tags = blk_mq_tags_from_data(data);
> + struct sbitmap_queue *bt = &tags->bitmap_tags;
> + unsigned long ret;
> +
> + if (data->shallow_depth ||data->flags & BLK_MQ_REQ_RESERVED ||
Missing whitespace after the first ||.
> + if (data->nr_tags > 1) {
> + unsigned long tags;
> + unsigned int tag_offset;
> + int i, nr = 0;
> +
> + tags = blk_mq_get_tags(data, data->nr_tags, &tag_offset);
> + if (unlikely(!tags)) {
> + data->nr_tags = 1;
> + goto retry;
Unless I'm missing something, we don't want the retry case here, which
remaps the contexts and calls blk_mq_tag_busy again.
> + }
> + for (i = 0; tags; i++) {
> + if (!(tags & (1UL << i)))
> + continue;
> + tag = tag_offset + i;
> + tags &= ~(1UL << i);
> + rq = blk_mq_rq_ctx_init(data, tag, alloc_time_ns);
> + rq->rq_next = *data->cached_rq;
> + *data->cached_rq = rq;
> + }
> + data->nr_tags -= nr;
And keeping all this batch logic in a helper (even if inlined) would
simplify the code a lot. Something like this untested patch:
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 5ac94149fb4be..608b270a7f6b8 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -373,6 +373,38 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
return rq;
}
+static inline struct request *
+__blk_mq_alloc_requests_batch(struct blk_mq_alloc_data *data,
+ u64 alloc_time_ns)
+{
+ unsigned int tag, tag_offset;
+ struct request *rq;
+ unsigned long tags;
+ int i, nr = 0;
+
+ tags = blk_mq_get_tags(data, data->nr_tags, &tag_offset);
+ if (unlikely(!tags))
+ return NULL;
+
+ for (i = 0; tags; i++) {
+ if (!(tags & (1UL << i)))
+ continue;
+ tag = tag_offset + i;
+ tags &= ~(1UL << i);
+ rq = blk_mq_rq_ctx_init(data, tag, alloc_time_ns);
+ rq->rq_next = *data->cached_rq;
+ *data->cached_rq = rq;
+ }
+ data->nr_tags -= nr;
+
+ if (!data->cached_rq)
+ return NULL;
+
+ rq = *data->cached_rq;
+ *data->cached_rq = rq->rq_next;
+ return rq;
+}
+
static struct request *__blk_mq_alloc_requests(struct blk_mq_alloc_data *data)
{
struct request_queue *q = data->q;
@@ -411,56 +443,32 @@ static struct request *__blk_mq_alloc_requests(struct blk_mq_alloc_data *data)
* Try batched alloc if we want more than 1 tag.
*/
if (data->nr_tags > 1) {
- unsigned long tags;
- unsigned int tag_offset;
- int i, nr = 0;
-
- tags = blk_mq_get_tags(data, data->nr_tags, &tag_offset);
- if (unlikely(!tags)) {
- data->nr_tags = 1;
- goto retry;
- }
- for (i = 0; tags; i++) {
- if (!(tags & (1UL << i)))
- continue;
- tag = tag_offset + i;
- tags &= ~(1UL << i);
- rq = blk_mq_rq_ctx_init(data, tag, alloc_time_ns);
- rq->rq_next = *data->cached_rq;
- *data->cached_rq = rq;
- }
- data->nr_tags -= nr;
- } else {
- /*
- * Waiting allocations only fail because of an inactive hctx.
- * In that case just retry the hctx assignment and tag
- * allocation as CPU hotplug should have migrated us to an
- * online CPU by now.
- */
- tag = blk_mq_get_tag(data);
- if (tag == BLK_MQ_NO_TAG) {
- if (data->flags & BLK_MQ_REQ_NOWAIT)
- return NULL;
- /*
- * Give up the CPU and sleep for a random short time to
- * ensure that thread using a realtime scheduling class
- * are migrated off the CPU, and thus off the hctx that
- * is going away.
- */
- msleep(3);
- goto retry;
- }
-
- return blk_mq_rq_ctx_init(data, tag, alloc_time_ns);
+ rq = __blk_mq_alloc_requests_batch(data, alloc_time_ns);
+ if (rq)
+ return rq;
+ data->nr_tags = 1;
}
- if (data->cached_rq) {
- rq = *data->cached_rq;
- *data->cached_rq = rq->rq_next;
- return rq;
+ /*
+ * Waiting allocations only fail because of an inactive hctx. In that
+ * case just retry the hctx assignment and tag allocation as CPU hotplug
+ * should have migrated us to an online CPU by now.
+ */
+ tag = blk_mq_get_tag(data);
+ if (tag == BLK_MQ_NO_TAG) {
+ if (data->flags & BLK_MQ_REQ_NOWAIT)
+ return NULL;
+ /*
+ * Give up the CPU and sleep for a random short time to
+ * ensure that thread using a realtime scheduling class
+ * are migrated off the CPU, and thus off the hctx that
+ * is going away.
+ */
+ msleep(3);
+ goto retry;
}
- return NULL;
+ return blk_mq_rq_ctx_init(data, tag, alloc_time_ns);
}
struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
* Re: [PATCH 2/2] block: improve batched tag allocation
2021-10-14 5:17 ` Christoph Hellwig
@ 2021-10-14 13:03 ` Jens Axboe
From: Jens Axboe @ 2021-10-14 13:03 UTC
To: Christoph Hellwig; +Cc: linux-block
On 10/13/21 11:17 PM, Christoph Hellwig wrote:
> On Tue, Oct 12, 2021 at 11:15:25AM -0600, Jens Axboe wrote:
>>
>> +unsigned long blk_mq_get_tags(struct blk_mq_alloc_data *data, int nr_tags,
>> + unsigned int *offset)
>> +{
>> + struct blk_mq_tags *tags = blk_mq_tags_from_data(data);
>> + struct sbitmap_queue *bt = &tags->bitmap_tags;
>> + unsigned long ret;
>> +
>> + if (data->shallow_depth ||data->flags & BLK_MQ_REQ_RESERVED ||
>
> Missing whitespace after the first ||.
Fixed.
>> + if (data->nr_tags > 1) {
>> + unsigned long tags;
>> + unsigned int tag_offset;
>> + int i, nr = 0;
>> +
>> + tags = blk_mq_get_tags(data, data->nr_tags, &tag_offset);
>> + if (unlikely(!tags)) {
>> + data->nr_tags = 1;
>> + goto retry;
>
> Unless I'm missing something we don't want the retry case that maps
> the contexts against and calls blk_mq_tag_busy again.
Yes and no: we could've been preempted and now be on a different CPU.
It's not a huge deal, nor a likely outcome, and it'll only hurt
efficiency a bit if it happens. Can be skipped.
>> + }
>> + for (i = 0; tags; i++) {
>> + if (!(tags & (1UL << i)))
>> + continue;
>> + tag = tag_offset + i;
>> + tags &= ~(1UL << i);
>> + rq = blk_mq_rq_ctx_init(data, tag, alloc_time_ns);
>> + rq->rq_next = *data->cached_rq;
>> + *data->cached_rq = rq;
>> + }
>> + data->nr_tags -= nr;
>
> And keeping all this batch logic in a helper (even if inlined) would
> simplify the code a lot. Something like this untested patch:
Yeah, I did go back and forth on that, it's small enough that I didn't
care too much about it, but an inline helper is fine too. I can fold
this in.
--
Jens Axboe