* [PATCHSET 0/2] Improve batched tag allocation
@ 2021-10-12 17:15 Jens Axboe
  2021-10-12 17:15 ` [PATCH 1/2] sbitmap: add __sbitmap_queue_get_batch() Jens Axboe
  2021-10-12 17:15 ` [PATCH 2/2] block: improve batched tag allocation Jens Axboe
  0 siblings, 2 replies; 5+ messages in thread
From: Jens Axboe @ 2021-10-12 17:15 UTC (permalink / raw)
  To: linux-block

Hi,

We can do better than implementing batched tag allocation as just repeated
calls into sbitmap. Add an sbitmap helper to grab a batch of tags all at
once, and use that instead.

Testing with instrumentation added, we get very close to the full batch
count. For NVMe, if I run with 32 batch submits, the actual successful
batch size is ~31 on average. This is close to ideal, as one hw queue
will have a 63 tag size and hence we only get 31 of 32 tags in roughly 1
of every 32 allocations. This could be improved, but wasting the extra
cycles in sbitmap to skip to the next index for that case doesn't seem
worth it.

-- 
Jens Axboe




* [PATCH 1/2] sbitmap: add __sbitmap_queue_get_batch()
  2021-10-12 17:15 [PATCHSET 0/2] Improve batched tag allocation Jens Axboe
@ 2021-10-12 17:15 ` Jens Axboe
  2021-10-12 17:15 ` [PATCH 2/2] block: improve batched tag allocation Jens Axboe
  1 sibling, 0 replies; 5+ messages in thread
From: Jens Axboe @ 2021-10-12 17:15 UTC (permalink / raw)
  To: linux-block; +Cc: Jens Axboe

The block layer tag allocation batching still calls into sbitmap to get
each tag, but we can improve on that. Add __sbitmap_queue_get_batch(),
which returns a mask of tags all at once, along with an offset for
those tags.

An example return would be 0xff, where bits 0..7 are set, with
tag_offset == 128. The valid tags in this case would be 128..135.

A batch is specific to an individual sbitmap word (a single entry in the
sbitmap's map[] array), hence it cannot be larger than that. The requested
number of tags is automatically reduced to the maximum that can be
satisfied by a single word.

On failure, 0 is returned. The caller should fall back to single tag
allocation at that point.
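
For illustration only (not part of this patch), a caller might consume the
returned mask/offset pair like this, where use_tag() is a made-up stand-in
for whatever the caller does with each tag:

	static void get_tag_batch(struct sbitmap_queue *sbq, int nr_tags)
	{
		unsigned int offset;
		unsigned long mask;

		mask = __sbitmap_queue_get_batch(sbq, nr_tags, &offset);
		if (!mask) {
			/* no batch available: fall back to single tag allocation */
			int tag = __sbitmap_queue_get(sbq);

			if (tag >= 0)
				use_tag(tag);
			return;
		}

		/* with the 0xff / tag_offset == 128 example, this yields 128..135 */
		while (mask) {
			unsigned int bit = __ffs(mask);

			mask &= ~(1UL << bit);
			use_tag(offset + bit);
		}
	}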

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/sbitmap.h | 13 +++++++++++
 lib/sbitmap.c           | 51 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 64 insertions(+)

diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
index 2713e689ad66..e30b56023ead 100644
--- a/include/linux/sbitmap.h
+++ b/include/linux/sbitmap.h
@@ -426,6 +426,19 @@ void sbitmap_queue_resize(struct sbitmap_queue *sbq, unsigned int depth);
  */
 int __sbitmap_queue_get(struct sbitmap_queue *sbq);
 
+/**
+ * __sbitmap_queue_get_batch() - Try to allocate a batch of free bits
+ * @sbq: Bitmap queue to allocate from.
+ * @nr_tags: number of tags requested
+ * @offset: offset to add to returned bits
+ *
+ * Return: Mask of allocated tags, 0 if none are found. Each tag allocated is
+ * a bit in the mask returned, and the caller must add @offset to the value to
+ * get the absolute tag value.
+ */
+unsigned long __sbitmap_queue_get_batch(struct sbitmap_queue *sbq, int nr_tags,
+					unsigned int *offset);
+
 /**
  * __sbitmap_queue_get_shallow() - Try to allocate a free bit from a &struct
  * sbitmap_queue, limiting the depth used from each word, with preemption
diff --git a/lib/sbitmap.c b/lib/sbitmap.c
index b25db9be938a..f398e0ae548e 100644
--- a/lib/sbitmap.c
+++ b/lib/sbitmap.c
@@ -489,6 +489,57 @@ int __sbitmap_queue_get(struct sbitmap_queue *sbq)
 }
 EXPORT_SYMBOL_GPL(__sbitmap_queue_get);
 
+unsigned long __sbitmap_queue_get_batch(struct sbitmap_queue *sbq, int nr_tags,
+					unsigned int *offset)
+{
+	struct sbitmap *sb = &sbq->sb;
+	unsigned int hint, depth;
+	unsigned long index, nr;
+	int i;
+
+	if (unlikely(sb->round_robin))
+		return 0;
+
+	depth = READ_ONCE(sb->depth);
+	hint = update_alloc_hint_before_get(sb, depth);
+
+	index = SB_NR_TO_INDEX(sb, hint);
+
+	for (i = 0; i < sb->map_nr; i++) {
+		struct sbitmap_word *map = &sb->map[index];
+		unsigned long get_mask;
+
+		sbitmap_deferred_clear(map);
+		if (map->word == (1UL << (map->depth - 1)) - 1)
+			continue;
+
+		nr = find_first_zero_bit(&map->word, map->depth);
+		if (nr + nr_tags <= map->depth) {
+			atomic_long_t *ptr = (atomic_long_t *) &map->word;
+			int map_tags = min_t(int, nr_tags, map->depth);
+			unsigned long val, ret;
+
+			get_mask = ((1UL << map_tags) - 1) << nr;
+			do {
+				val = READ_ONCE(map->word);
+				ret = atomic_long_cmpxchg(ptr, val, get_mask | val);
+			} while (ret != val);
+			get_mask = (get_mask & ~ret) >> nr;
+			if (get_mask) {
+				*offset = nr + (index << sb->shift);
+				update_alloc_hint_after_get(sb, depth, hint,
+							*offset + map_tags - 1);
+				return get_mask;
+			}
+		}
+		/* Jump to next index. */
+		if (++index >= sb->map_nr)
+			index = 0;
+	}
+
+	return 0;
+}
+
 int __sbitmap_queue_get_shallow(struct sbitmap_queue *sbq,
 				unsigned int shallow_depth)
 {
-- 
2.33.0



* [PATCH 2/2] block: improve batched tag allocation
  2021-10-12 17:15 [PATCHSET 0/2] Improve batched tag allocation Jens Axboe
  2021-10-12 17:15 ` [PATCH 1/2] sbitmap: add __sbitmap_queue_get_batch() Jens Axboe
@ 2021-10-12 17:15 ` Jens Axboe
  2021-10-14  5:17   ` Christoph Hellwig
  1 sibling, 1 reply; 5+ messages in thread
From: Jens Axboe @ 2021-10-12 17:15 UTC (permalink / raw)
  To: linux-block; +Cc: Jens Axboe

Add a blk_mq_get_tags() helper, which uses the new sbitmap API for
allocating a batch of tags all at once. This both simplifies the block
code for batched allocation and is more efficient than doing repeated
calls into __sbitmap_queue_get().

This reduces the sbitmap overhead in peak runs from ~3% to ~1%, and
yields a performance increase from 6.6M IOPS to 6.8M IOPS on a single
CPU core.
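
In rough outline, and condensed from the hunks below rather than being a
drop-in snippet, the allocation path in __blk_mq_alloc_requests() becomes:

	if (data->nr_tags > 1) {
		tags = blk_mq_get_tags(data, data->nr_tags, &tag_offset);
		if (unlikely(!tags)) {
			/* no batch: retry as a single-tag allocation */
			data->nr_tags = 1;
			goto retry;
		}
		/* turn each set bit into a request on the cached list */
		for (i = 0; tags; i++) {
			if (!(tags & (1UL << i)))
				continue;
			tags &= ~(1UL << i);
			rq = blk_mq_rq_ctx_init(data, tag_offset + i, alloc_time_ns);
			rq->rq_next = *data->cached_rq;
			*data->cached_rq = rq;
		}
		/* return the first cached request to the caller */
		rq = *data->cached_rq;
		*data->cached_rq = rq->rq_next;
		return rq;
	}
	/* data->nr_tags == 1: existing blk_mq_get_tag() slow path */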

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-mq-tag.c | 15 ++++++++++++
 block/blk-mq-tag.h |  2 ++
 block/blk-mq.c     | 57 ++++++++++++++++++++++++++++++----------------
 3 files changed, 54 insertions(+), 20 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 72a2724a4eee..c43b97201161 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -86,6 +86,21 @@ static int __blk_mq_get_tag(struct blk_mq_alloc_data *data,
 		return __sbitmap_queue_get(bt);
 }
 
+unsigned long blk_mq_get_tags(struct blk_mq_alloc_data *data, int nr_tags,
+			      unsigned int *offset)
+{
+	struct blk_mq_tags *tags = blk_mq_tags_from_data(data);
+	struct sbitmap_queue *bt = &tags->bitmap_tags;
+	unsigned long ret;
+
+	if (data->shallow_depth ||data->flags & BLK_MQ_REQ_RESERVED ||
+	    data->hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
+		return 0;
+	ret = __sbitmap_queue_get_batch(bt, nr_tags, offset);
+	*offset += tags->nr_reserved_tags;
+	return ret;
+}
+
 unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 {
 	struct blk_mq_tags *tags = blk_mq_tags_from_data(data);
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index d8ce89fa1686..71c2f7d8e9b7 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -38,6 +38,8 @@ extern int blk_mq_init_bitmaps(struct sbitmap_queue *bitmap_tags,
 			       int node, int alloc_policy);
 
 extern unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data);
+unsigned long blk_mq_get_tags(struct blk_mq_alloc_data *data, int nr_tags,
+			      unsigned int *offset);
 extern void blk_mq_put_tag(struct blk_mq_tags *tags, struct blk_mq_ctx *ctx,
 			   unsigned int tag);
 extern int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
diff --git a/block/blk-mq.c b/block/blk-mq.c
index f42cf615c527..7027a25c5271 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -408,15 +408,39 @@ static struct request *__blk_mq_alloc_requests(struct blk_mq_alloc_data *data)
 		blk_mq_tag_busy(data->hctx);
 
 	/*
-	 * Waiting allocations only fail because of an inactive hctx.  In that
-	 * case just retry the hctx assignment and tag allocation as CPU hotplug
-	 * should have migrated us to an online CPU by now.
+	 * Try batched alloc if we want more than 1 tag.
 	 */
-	do {
+	if (data->nr_tags > 1) {
+		unsigned long tags;
+		unsigned int tag_offset;
+		int i, nr = 0;
+
+		tags = blk_mq_get_tags(data, data->nr_tags, &tag_offset);
+		if (unlikely(!tags)) {
+			data->nr_tags = 1;
+			goto retry;
+		}
+		for (i = 0; tags; i++) {
+			if (!(tags & (1UL << i)))
+				continue;
+			tag = tag_offset + i;
+			tags &= ~(1UL << i);
+			rq = blk_mq_rq_ctx_init(data, tag, alloc_time_ns);
+			rq->rq_next = *data->cached_rq;
+			*data->cached_rq = rq;
+		}
+		data->nr_tags -= nr;
+	} else {
+		/*
+		 * Waiting allocations only fail because of an inactive hctx.
+		 * In that case just retry the hctx assignment and tag
+		 * allocation as CPU hotplug should have migrated us to an
+		 * online CPU by now.
+		 */
 		tag = blk_mq_get_tag(data);
 		if (tag == BLK_MQ_NO_TAG) {
 			if (data->flags & BLK_MQ_REQ_NOWAIT)
-				break;
+				return NULL;
 			/*
 			 * Give up the CPU and sleep for a random short time to
 			 * ensure that thread using a realtime scheduling class
@@ -427,23 +451,16 @@ static struct request *__blk_mq_alloc_requests(struct blk_mq_alloc_data *data)
 			goto retry;
 		}
 
-		rq = blk_mq_rq_ctx_init(data, tag, alloc_time_ns);
-		if (!--data->nr_tags || e ||
-		    (data->hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
-			return rq;
-
-		/* link into the cached list */
-		rq->rq_next = *data->cached_rq;
-		*data->cached_rq = rq;
-		data->flags |= BLK_MQ_REQ_NOWAIT;
-	} while (1);
+		return blk_mq_rq_ctx_init(data, tag, alloc_time_ns);
+	}
 
-	if (!data->cached_rq)
-		return NULL;
+	if (data->cached_rq) {
+		rq = *data->cached_rq;
+		*data->cached_rq = rq->rq_next;
+		return rq;
+	}
 
-	rq = *data->cached_rq;
-	*data->cached_rq = rq->rq_next;
-	return rq;
+	return NULL;
 }
 
 struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
-- 
2.33.0



* Re: [PATCH 2/2] block: improve batched tag allocation
  2021-10-12 17:15 ` [PATCH 2/2] block: improve batched tag allocation Jens Axboe
@ 2021-10-14  5:17   ` Christoph Hellwig
  2021-10-14 13:03     ` Jens Axboe
  0 siblings, 1 reply; 5+ messages in thread
From: Christoph Hellwig @ 2021-10-14  5:17 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block

On Tue, Oct 12, 2021 at 11:15:25AM -0600, Jens Axboe wrote:
>  
> +unsigned long blk_mq_get_tags(struct blk_mq_alloc_data *data, int nr_tags,
> +			      unsigned int *offset)
> +{
> +	struct blk_mq_tags *tags = blk_mq_tags_from_data(data);
> +	struct sbitmap_queue *bt = &tags->bitmap_tags;
> +	unsigned long ret;
> +
> +	if (data->shallow_depth ||data->flags & BLK_MQ_REQ_RESERVED ||

Missing whitespace after the first ||.

> +	if (data->nr_tags > 1) {
> +		unsigned long tags;
> +		unsigned int tag_offset;
> +		int i, nr = 0;
> +
> +		tags = blk_mq_get_tags(data, data->nr_tags, &tag_offset);
> +		if (unlikely(!tags)) {
> +			data->nr_tags = 1;
> +			goto retry;

Unless I'm missing something we don't want the retry case that maps
the contexts again and calls blk_mq_tag_busy again.

> +		}
> +		for (i = 0; tags; i++) {
> +			if (!(tags & (1UL << i)))
> +				continue;
> +			tag = tag_offset + i;
> +			tags &= ~(1UL << i);
> +			rq = blk_mq_rq_ctx_init(data, tag, alloc_time_ns);
> +			rq->rq_next = *data->cached_rq;
> +			*data->cached_rq = rq;
> +		}
> +		data->nr_tags -= nr;

And keeping all this batch logic in a helper (even if inlined) would
simplify the code a lot.  Something like this untested patch:

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 5ac94149fb4be..608b270a7f6b8 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -373,6 +373,38 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 	return rq;
 }
 
+static inline struct request *
+__blk_mq_alloc_requests_batch(struct blk_mq_alloc_data *data,
+		u64 alloc_time_ns)
+{
+	unsigned int tag, tag_offset;
+	struct request *rq;
+	unsigned long tags;
+	int i, nr = 0;
+
+	tags = blk_mq_get_tags(data, data->nr_tags, &tag_offset);
+	if (unlikely(!tags))
+		return NULL;
+
+	for (i = 0; tags; i++) {
+		if (!(tags & (1UL << i)))
+			continue;
+		tag = tag_offset + i;
+		tags &= ~(1UL << i);
+		rq = blk_mq_rq_ctx_init(data, tag, alloc_time_ns);
+		rq->rq_next = *data->cached_rq;
+		*data->cached_rq = rq;
+	}
+	data->nr_tags -= nr;
+
+	if (!data->cached_rq)
+		return NULL;
+
+	rq = *data->cached_rq;
+	*data->cached_rq = rq->rq_next;
+	return rq;
+}
+
 static struct request *__blk_mq_alloc_requests(struct blk_mq_alloc_data *data)
 {
 	struct request_queue *q = data->q;
@@ -411,56 +443,32 @@ static struct request *__blk_mq_alloc_requests(struct blk_mq_alloc_data *data)
 	 * Try batched alloc if we want more than 1 tag.
 	 */
 	if (data->nr_tags > 1) {
-		unsigned long tags;
-		unsigned int tag_offset;
-		int i, nr = 0;
-
-		tags = blk_mq_get_tags(data, data->nr_tags, &tag_offset);
-		if (unlikely(!tags)) {
-			data->nr_tags = 1;
-			goto retry;
-		}
-		for (i = 0; tags; i++) {
-			if (!(tags & (1UL << i)))
-				continue;
-			tag = tag_offset + i;
-			tags &= ~(1UL << i);
-			rq = blk_mq_rq_ctx_init(data, tag, alloc_time_ns);
-			rq->rq_next = *data->cached_rq;
-			*data->cached_rq = rq;
-		}
-		data->nr_tags -= nr;
-	} else {
-		/*
-		 * Waiting allocations only fail because of an inactive hctx.
-		 * In that case just retry the hctx assignment and tag
-		 * allocation as CPU hotplug should have migrated us to an
-		 * online CPU by now.
-		 */
-		tag = blk_mq_get_tag(data);
-		if (tag == BLK_MQ_NO_TAG) {
-			if (data->flags & BLK_MQ_REQ_NOWAIT)
-				return NULL;
-			/*
-			 * Give up the CPU and sleep for a random short time to
-			 * ensure that thread using a realtime scheduling class
-			 * are migrated off the CPU, and thus off the hctx that
-			 * is going away.
-			 */
-			msleep(3);
-			goto retry;
-		}
-
-		return blk_mq_rq_ctx_init(data, tag, alloc_time_ns);
+		rq = __blk_mq_alloc_requests_batch(data, alloc_time_ns);
+		if (rq)
+			return rq;
+		data->nr_tags = 1;
 	}
 
-	if (data->cached_rq) {
-		rq = *data->cached_rq;
-		*data->cached_rq = rq->rq_next;
-		return rq;
+	/*
+	 * Waiting allocations only fail because of an inactive hctx.  In that
+	 * case just retry the hctx assignment and tag allocation as CPU hotplug
+	 * should have migrated us to an online CPU by now.
+	 */
+	tag = blk_mq_get_tag(data);
+	if (tag == BLK_MQ_NO_TAG) {
+		if (data->flags & BLK_MQ_REQ_NOWAIT)
+			return NULL;
+		/*
+		 * Give up the CPU and sleep for a random short time to
+		 * ensure that thread using a realtime scheduling class
+		 * are migrated off the CPU, and thus off the hctx that
+		 * is going away.
+		 */
+		msleep(3);
+		goto retry;
 	}
 
-	return NULL;
+	return blk_mq_rq_ctx_init(data, tag, alloc_time_ns);
 }
 
 struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,


* Re: [PATCH 2/2] block: improve batched tag allocation
  2021-10-14  5:17   ` Christoph Hellwig
@ 2021-10-14 13:03     ` Jens Axboe
  0 siblings, 0 replies; 5+ messages in thread
From: Jens Axboe @ 2021-10-14 13:03 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-block

On 10/13/21 11:17 PM, Christoph Hellwig wrote:
> On Tue, Oct 12, 2021 at 11:15:25AM -0600, Jens Axboe wrote:
>>  
>> +unsigned long blk_mq_get_tags(struct blk_mq_alloc_data *data, int nr_tags,
>> +			      unsigned int *offset)
>> +{
>> +	struct blk_mq_tags *tags = blk_mq_tags_from_data(data);
>> +	struct sbitmap_queue *bt = &tags->bitmap_tags;
>> +	unsigned long ret;
>> +
>> +	if (data->shallow_depth ||data->flags & BLK_MQ_REQ_RESERVED ||
> 
> Missing whitespace after the first ||.

Fixed.

>> +	if (data->nr_tags > 1) {
>> +		unsigned long tags;
>> +		unsigned int tag_offset;
>> +		int i, nr = 0;
>> +
>> +		tags = blk_mq_get_tags(data, data->nr_tags, &tag_offset);
>> +		if (unlikely(!tags)) {
>> +			data->nr_tags = 1;
>> +			goto retry;
> 
> Unless I'm missing something we don't want the retry case that maps
> the contexts against and calls blk_mq_tag_busy again.

Yes and no, we could've been preempted and we're now on a different CPU.
It's not a huge deal or likely outcome, and it'll only hurt efficiency a
bit if that's the case. Can be skipped.

>> +		}
>> +		for (i = 0; tags; i++) {
>> +			if (!(tags & (1UL << i)))
>> +				continue;
>> +			tag = tag_offset + i;
>> +			tags &= ~(1UL << i);
>> +			rq = blk_mq_rq_ctx_init(data, tag, alloc_time_ns);
>> +			rq->rq_next = *data->cached_rq;
>> +			*data->cached_rq = rq;
>> +		}
>> +		data->nr_tags -= nr;
> 
> And keeping all this batch logic in a helper (even if inlined) would
> simplify the code a lot.  Something like this untested patch:

Yeah, I did go back and forth on that; it's small enough that I didn't
care too much about it, but an inline helper is fine too. I can fold
this in.

-- 
Jens Axboe


