linux-block.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 1/3] sbitmap: ensure that sbitmap maps are properly aligned
       [not found] <20181130011234.32674-1-axboe@kernel.dk>
@ 2018-11-30  1:12 ` Jens Axboe
  2018-11-30  1:12 ` [PATCH 2/3] sbitmap: ammortize cost of clearing bits Jens Axboe
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 6+ messages in thread
From: Jens Axboe @ 2018-11-30  1:12 UTC (permalink / raw)
  To: linux-block, osandov; +Cc: Jens Axboe

We try to be careful with alignment for cache purposes, but all of
that is worthless if we don't actually align the maps themselves.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/sbitmap.h | 11 ++++++++---
 lib/sbitmap.c           |  7 +++++--
 2 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
index 804a50983ec5..5cb1755d32da 100644
--- a/include/linux/sbitmap.h
+++ b/include/linux/sbitmap.h
@@ -63,9 +63,14 @@ struct sbitmap {
 	unsigned int map_nr;
 
 	/**
-	 * @map: Allocated bitmap.
+	 * @map: Aligned allocated bitmap.
 	 */
 	struct sbitmap_word *map;
+
+	/**
+	 * @map_ptr: Originally allocated map pointer
+	 */
+	void *map_ptr;
 };
 
 #define SBQ_WAIT_QUEUES 8
@@ -157,8 +162,8 @@ int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift,
  */
 static inline void sbitmap_free(struct sbitmap *sb)
 {
-	kfree(sb->map);
-	sb->map = NULL;
+	kfree(sb->map_ptr);
+	sb->map_ptr = sb->map = NULL;
 }
 
 /**
diff --git a/lib/sbitmap.c b/lib/sbitmap.c
index 45cab6bbc1c7..21e776e3128d 100644
--- a/lib/sbitmap.c
+++ b/lib/sbitmap.c
@@ -25,6 +25,7 @@ int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift,
 {
 	unsigned int bits_per_word;
 	unsigned int i;
+	size_t size;
 
 	if (shift < 0) {
 		shift = ilog2(BITS_PER_LONG);
@@ -52,9 +53,11 @@ int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift,
 		return 0;
 	}
 
-	sb->map = kcalloc_node(sb->map_nr, sizeof(*sb->map), flags, node);
-	if (!sb->map)
+	size = sb->map_nr * sizeof(*sb->map) + L1_CACHE_BYTES - 1;
+	sb->map_ptr = kzalloc_node(size, flags, node);
+	if (!sb->map_ptr)
 		return -ENOMEM;
+	sb->map = PTR_ALIGN(sb->map_ptr, L1_CACHE_BYTES);
 
 	for (i = 0; i < sb->map_nr; i++) {
 		sb->map[i].depth = min(depth, bits_per_word);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH 2/3] sbitmap: ammortize cost of clearing bits
       [not found] <20181130011234.32674-1-axboe@kernel.dk>
  2018-11-30  1:12 ` [PATCH 1/3] sbitmap: ensure that sbitmap maps are properly aligned Jens Axboe
@ 2018-11-30  1:12 ` Jens Axboe
  2018-12-09  5:51   ` Guenter Roeck
  2018-11-30  1:12 ` [PATCH 3/3] sbitmap: optimize wakeup check Jens Axboe
  2018-11-30  2:09 ` Jens Axboe
  3 siblings, 1 reply; 6+ messages in thread
From: Jens Axboe @ 2018-11-30  1:12 UTC (permalink / raw)
  To: linux-block, osandov; +Cc: Jens Axboe

sbitmap maintains a set of words that we use to set and clear bits, with
each bit representing a tag for blk-mq. Even though we spread the bits
out and maintain a hint cache, one particular bit allocated will end up
being cleared in the exact same spot.

This introduces batched clearing of bits. Instead of clearing a given
bit, the same bit is set in a cleared/free mask instead. If we fail
allocating a bit from a given word, then we check the free mask, and
batch move those cleared bits at that time. This trades 64 atomic bitops
for 2 cmpxchg().

In a threaded poll test case, half the overhead of getting and clearing
tags is removed with this change. On another poll test case with a
single thread, performance is unchanged.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/sbitmap.h | 26 +++++++++++++++---
 lib/sbitmap.c           | 60 ++++++++++++++++++++++++++++++++++++++---
 2 files changed, 78 insertions(+), 8 deletions(-)

diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
index 5cb1755d32da..13eb8973bd10 100644
--- a/include/linux/sbitmap.h
+++ b/include/linux/sbitmap.h
@@ -30,14 +30,19 @@ struct seq_file;
  */
 struct sbitmap_word {
 	/**
-	 * @word: The bitmap word itself.
+	 * @depth: Number of bits being used in @word/@cleared
 	 */
-	unsigned long word;
+	unsigned long depth;
 
 	/**
-	 * @depth: Number of bits being used in @word.
+	 * @word: word holding free bits
 	 */
-	unsigned long depth;
+	unsigned long word ____cacheline_aligned_in_smp;
+
+	/**
+	 * @cleared: word holding cleared bits
+	 */
+	unsigned long cleared ____cacheline_aligned_in_smp;
 } ____cacheline_aligned_in_smp;
 
 /**
@@ -315,6 +320,19 @@ static inline void sbitmap_clear_bit(struct sbitmap *sb, unsigned int bitnr)
 	clear_bit(SB_NR_TO_BIT(sb, bitnr), __sbitmap_word(sb, bitnr));
 }
 
+/*
+ * This one is special, since it doesn't actually clear the bit, rather it
+ * sets the corresponding bit in the ->cleared mask instead. Paired with
+ * the caller doing sbitmap_batch_clear() if a given index is full, which
+ * will clear the previously freed entries in the corresponding ->word.
+ */
+static inline void sbitmap_deferred_clear_bit(struct sbitmap *sb, unsigned int bitnr)
+{
+	unsigned long *addr = &sb->map[SB_NR_TO_INDEX(sb, bitnr)].cleared;
+
+	set_bit(SB_NR_TO_BIT(sb, bitnr), addr);
+}
+
 static inline void sbitmap_clear_bit_unlock(struct sbitmap *sb,
 					    unsigned int bitnr)
 {
diff --git a/lib/sbitmap.c b/lib/sbitmap.c
index 21e776e3128d..04db31f4dfda 100644
--- a/lib/sbitmap.c
+++ b/lib/sbitmap.c
@@ -114,6 +114,58 @@ static int __sbitmap_get_word(unsigned long *word, unsigned long depth,
 	return nr;
 }
 
+/*
+ * See if we have deferred clears that we can batch move
+ */
+static inline bool sbitmap_deferred_clear(struct sbitmap *sb, int index)
+{
+	unsigned long mask, val;
+
+	if (!sb->map[index].cleared)
+		return false;
+
+	/*
+	 * First get a stable cleared mask, setting the old mask to 0.
+	 */
+	do {
+		mask = sb->map[index].cleared;
+	} while (cmpxchg(&sb->map[index].cleared, mask, 0) != mask);
+
+	/*
+	 * Now clear the masked bits in our free word
+	 */
+	do {
+		val = sb->map[index].word;
+	} while (cmpxchg(&sb->map[index].word, val, val & ~mask) != val);
+
+	/*
+	 * If someone found ->cleared == 0 before we wrote ->word, then
+	 * they could have failed when they should not have. Check for
+	 * waiters.
+	 */
+	smp_mb__after_atomic();
+	sbitmap_queue_wake_up(container_of(sb, struct sbitmap_queue, sb));
+	return true;
+}
+
+static int sbitmap_find_bit_in_index(struct sbitmap *sb, int index,
+				     unsigned int alloc_hint, bool round_robin)
+{
+	int nr;
+
+	do {
+		nr = __sbitmap_get_word(&sb->map[index].word,
+					sb->map[index].depth, alloc_hint,
+					!round_robin);
+		if (nr != -1)
+			break;
+		if (!sbitmap_deferred_clear(sb, index))
+			break;
+	} while (1);
+
+	return nr;
+}
+
 int sbitmap_get(struct sbitmap *sb, unsigned int alloc_hint, bool round_robin)
 {
 	unsigned int i, index;
@@ -132,9 +184,8 @@ int sbitmap_get(struct sbitmap *sb, unsigned int alloc_hint, bool round_robin)
 		alloc_hint = 0;
 
 	for (i = 0; i < sb->map_nr; i++) {
-		nr = __sbitmap_get_word(&sb->map[index].word,
-					sb->map[index].depth, alloc_hint,
-					!round_robin);
+		nr = sbitmap_find_bit_in_index(sb, index, alloc_hint,
+						round_robin);
 		if (nr != -1) {
 			nr += index << sb->shift;
 			break;
@@ -517,7 +568,8 @@ EXPORT_SYMBOL_GPL(sbitmap_queue_wake_up);
 void sbitmap_queue_clear(struct sbitmap_queue *sbq, unsigned int nr,
 			 unsigned int cpu)
 {
-	sbitmap_clear_bit_unlock(&sbq->sb, nr);
+	sbitmap_deferred_clear_bit(&sbq->sb, nr);
+
 	/*
 	 * Pairs with the memory barrier in set_current_state() to ensure the
 	 * proper ordering of clear_bit_unlock()/waitqueue_active() in the waker
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH 3/3] sbitmap: optimize wakeup check
       [not found] <20181130011234.32674-1-axboe@kernel.dk>
  2018-11-30  1:12 ` [PATCH 1/3] sbitmap: ensure that sbitmap maps are properly aligned Jens Axboe
  2018-11-30  1:12 ` [PATCH 2/3] sbitmap: ammortize cost of clearing bits Jens Axboe
@ 2018-11-30  1:12 ` Jens Axboe
  2018-11-30  2:09 ` Jens Axboe
  3 siblings, 0 replies; 6+ messages in thread
From: Jens Axboe @ 2018-11-30  1:12 UTC (permalink / raw)
  To: linux-block, osandov; +Cc: Jens Axboe

Even if we have no waiters on any of the sbitmap_queue wait states, we
still have to loop every entry to check. We do this for every IO, so
the cost adds up.

Shift a bit of the cost to the slow path, when we actually have waiters.
Wrap prepare_to_wait_exclusive() and finish_wait(), so we can maintain
an internal count of how many are currently active. Then we can simply
check this count in sbq_wake_ptr() and not have to loop if we don't
have any sleepers.

Convert the two users of sbitmap with waiting, blk-mq-tag and iSCSI.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-mq-tag.c                       |  7 +++----
 drivers/target/iscsi/iscsi_target_util.c |  8 +++++---
 include/linux/sbitmap.h                  | 19 +++++++++++++++++++
 lib/sbitmap.c                            | 21 +++++++++++++++++++++
 4 files changed, 48 insertions(+), 7 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 87bc5df72d48..66c3a1c887ed 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -154,8 +154,7 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 		if (tag != -1)
 			break;
 
-		prepare_to_wait_exclusive(&ws->wait, &wait,
-						TASK_UNINTERRUPTIBLE);
+		sbitmap_prepare_to_wait(bt, ws, &wait, TASK_UNINTERRUPTIBLE);
 
 		tag = __blk_mq_get_tag(data, bt);
 		if (tag != -1)
@@ -167,6 +166,8 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 		bt_prev = bt;
 		io_schedule();
 
+		sbitmap_finish_wait(bt, ws, &wait);
+
 		data->ctx = blk_mq_get_ctx(data->q);
 		data->hctx = blk_mq_map_queue(data->q, data->cmd_flags,
 						data->ctx->cpu);
@@ -176,8 +177,6 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 		else
 			bt = &tags->bitmap_tags;
 
-		finish_wait(&ws->wait, &wait);
-
 		/*
 		 * If destination hw queue is changed, fake wake up on
 		 * previous queue for compensating the wake up miss, so
diff --git a/drivers/target/iscsi/iscsi_target_util.c b/drivers/target/iscsi/iscsi_target_util.c
index 36b742932c72..d7d03d601732 100644
--- a/drivers/target/iscsi/iscsi_target_util.c
+++ b/drivers/target/iscsi/iscsi_target_util.c
@@ -152,13 +152,15 @@ static int iscsit_wait_for_tag(struct se_session *se_sess, int state, int *cpup)
 	int tag = -1;
 	DEFINE_WAIT(wait);
 	struct sbq_wait_state *ws;
+	struct sbitmap_queue *sbq;
 
 	if (state == TASK_RUNNING)
 		return tag;
 
-	ws = &se_sess->sess_tag_pool.ws[0];
+	sbq = &se_sess->sess_tag_pool;
+	ws = &sbq->ws[0];
 	for (;;) {
-		prepare_to_wait_exclusive(&ws->wait, &wait, state);
+		sbitmap_prepare_to_wait(sbq, ws, &wait, state);
 		if (signal_pending_state(state, current))
 			break;
 		tag = sbitmap_queue_get(&se_sess->sess_tag_pool, cpup);
@@ -167,7 +169,7 @@ static int iscsit_wait_for_tag(struct se_session *se_sess, int state, int *cpup)
 		schedule();
 	}
 
-	finish_wait(&ws->wait, &wait);
+	sbitmap_finish_wait(sbq, ws, &wait);
 	return tag;
 }
 
diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
index 13eb8973bd10..dbfbac0c4daa 100644
--- a/include/linux/sbitmap.h
+++ b/include/linux/sbitmap.h
@@ -135,6 +135,11 @@ struct sbitmap_queue {
 	 */
 	struct sbq_wait_state *ws;
 
+	/*
+	 * @ws_active: count of currently active ws waitqueues
+	 */
+	atomic_t ws_active;
+
 	/**
 	 * @round_robin: Allocate bits in strict round-robin order.
 	 */
@@ -554,4 +559,18 @@ void sbitmap_queue_wake_up(struct sbitmap_queue *sbq);
  */
 void sbitmap_queue_show(struct sbitmap_queue *sbq, struct seq_file *m);
 
+/*
+ * Wrapper around prepare_to_wait_exclusive(), which maintains some extra
+ * internal state.
+ */
+void sbitmap_prepare_to_wait(struct sbitmap_queue *sbq,
+				struct sbq_wait_state *ws,
+				struct wait_queue_entry *wait, int state);
+
+/*
+ * Must be paired with sbitmap_prepare_to_wait().
+ */
+void sbitmap_finish_wait(struct sbitmap_queue *sbq, struct sbq_wait_state *ws,
+				struct wait_queue_entry *wait);
+
 #endif /* __LINUX_SCALE_BITMAP_H */
diff --git a/lib/sbitmap.c b/lib/sbitmap.c
index 04db31f4dfda..1cc21f916276 100644
--- a/lib/sbitmap.c
+++ b/lib/sbitmap.c
@@ -384,6 +384,7 @@ int sbitmap_queue_init_node(struct sbitmap_queue *sbq, unsigned int depth,
 	sbq->min_shallow_depth = UINT_MAX;
 	sbq->wake_batch = sbq_calc_wake_batch(sbq, depth);
 	atomic_set(&sbq->wake_index, 0);
+	atomic_set(&sbq->ws_active, 0);
 
 	sbq->ws = kzalloc_node(SBQ_WAIT_QUEUES * sizeof(*sbq->ws), flags, node);
 	if (!sbq->ws) {
@@ -499,6 +500,9 @@ static struct sbq_wait_state *sbq_wake_ptr(struct sbitmap_queue *sbq)
 {
 	int i, wake_index;
 
+	if (!atomic_read(&sbq->ws_active))
+		return NULL;
+
 	wake_index = atomic_read(&sbq->wake_index);
 	for (i = 0; i < SBQ_WAIT_QUEUES; i++) {
 		struct sbq_wait_state *ws = &sbq->ws[wake_index];
@@ -639,3 +643,20 @@ void sbitmap_queue_show(struct sbitmap_queue *sbq, struct seq_file *m)
 	seq_printf(m, "min_shallow_depth=%u\n", sbq->min_shallow_depth);
 }
 EXPORT_SYMBOL_GPL(sbitmap_queue_show);
+
+void sbitmap_prepare_to_wait(struct sbitmap_queue *sbq,
+			     struct sbq_wait_state *ws,
+			     struct wait_queue_entry *wait, int state)
+{
+	atomic_inc(&sbq->ws_active);
+	prepare_to_wait_exclusive(&ws->wait, wait, state);
+}
+EXPORT_SYMBOL_GPL(sbitmap_prepare_to_wait);
+
+void sbitmap_finish_wait(struct sbitmap_queue *sbq, struct sbq_wait_state *ws,
+			 struct wait_queue_entry *wait)
+{
+	finish_wait(&ws->wait, wait);
+	atomic_dec(&sbq->ws_active);
+}
+EXPORT_SYMBOL_GPL(sbitmap_finish_wait);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re:
       [not found] <20181130011234.32674-1-axboe@kernel.dk>
                   ` (2 preceding siblings ...)
  2018-11-30  1:12 ` [PATCH 3/3] sbitmap: optimize wakeup check Jens Axboe
@ 2018-11-30  2:09 ` Jens Axboe
  3 siblings, 0 replies; 6+ messages in thread
From: Jens Axboe @ 2018-11-30  2:09 UTC (permalink / raw)
  To: linux-block, osandov

On 11/29/18 6:12 PM, Jens Axboe wrote:
> Three patches here:
> 
> 1) Ensure that we align ->map properly
> 
> 2) v2 of the sbitmap clear cost ammortization. Updated to do a wakeup
>    check AFTER we're done swapping free/cleared masks. Kept the
>    separate alignment for ->word, as it is faster in testing.
> 
> 3) Cost reduction of having to do wait queue checks.

Ignore this one, see v3 posted.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH 2/3] sbitmap: ammortize cost of clearing bits
  2018-11-30  1:12 ` [PATCH 2/3] sbitmap: ammortize cost of clearing bits Jens Axboe
@ 2018-12-09  5:51   ` Guenter Roeck
  2018-12-09  6:19     ` Jens Axboe
  0 siblings, 1 reply; 6+ messages in thread
From: Guenter Roeck @ 2018-12-09  5:51 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, osandov, linux-kernel

Hi,

On Thu, Nov 29, 2018 at 06:12:33PM -0700, Jens Axboe wrote:
> sbitmap maintains a set of words that we use to set and clear bits, with
> each bit representing a tag for blk-mq. Even though we spread the bits
> out and maintain a hint cache, one particular bit allocated will end up
> being cleared in the exact same spot.
> 
> This introduces batched clearing of bits. Instead of clearing a given
> bit, the same bit is set in a cleared/free mask instead. If we fail
> allocating a bit from a given word, then we check the free mask, and
> batch move those cleared bits at that time. This trades 64 atomic bitops
> for 2 cmpxchg().
> 
> In a threaded poll test case, half the overhead of getting and clearing
> tags is removed with this change. On another poll test case with a
> single thread, performance is unchanged.
> 
> Signed-off-by: Jens Axboe <axboe@kernel.dk>

This patch results in irq lock inversion warnings when trying to boot
from usb drives. This was observed with qemu boots of aarch64, x86, and
x86_64 images.

Log output as well as bisect results are attached below. I ran the bisect
twice, once with aarch64 and once with x86_64, with the same results.

Unfortunately, reverting the patch results in a compile error, so I was
unable to test if reverting the patch fixes the problem.

Sample qemu command line (x86_64):

qemu-system-x86_64 \
	-kernel arch/x86/boot/bzImage -M q35 -cpu SandyBridge \
	-no-reboot -snapshot -smp 4 -m 1G \
	-usb -device usb-storage,drive=d0 \
	-drive file=rootfs.ext2,if=none,id=d0,format=raw \
	--append 'root=/dev/sda rw rootwait panic=-1 console=ttyS0 console=tty' \
	-nographic

Guenter

---
...
[   19.035262] Waiting for root device /dev/sda...
[   19.139135] scsi 6:0:0:0: Direct-Access     QEMU     QEMU HARDDISK    2.5+ PQ: 0 ANSI: 5
[   19.161632] sd 6:0:0:0: Attached scsi generic sg1 type 0
[   19.171661] sd 6:0:0:0: Power-on or device reset occurred
[   19.186664] sd 6:0:0:0: [sda] 20480 512-byte logical blocks: (10.5 MB/10.0 MiB)
[   19.197035] sd 6:0:0:0: [sda] Write Protect is off
[   19.209327] sd 6:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[   19.330020] sd 6:0:0:0: [sda] Attached SCSI disk
[   19.363210] EXT4-fs (sda): mounting ext2 file system using the ext4 subsystem
[   19.392150] 
[   19.394410] ========================================================
[   19.397051] WARNING: possible irq lock inversion dependency detected
[   19.399744] 4.20.0-rc5-next-20181207 #1 Not tainted
[   19.402294] --------------------------------------------------------
[   19.404981] swapper/3/0 just changed the state of lock:
[   19.407185] (____ptrval____) (&sbq->ws[i].wait){..-.}, at: __wake_up_common_lock+0x5e/0xb0
[   19.409500] but this lock took another, SOFTIRQ-unsafe lock in the past:
[   19.411650]  (&(&sb->map[i].swap_lock)->rlock){+.+.}
[   19.411663] 
[   19.411663] 
[   19.411663] and interrupts could create inverse lock ordering between them.
[   19.411663] 
[   19.421461] 
[   19.421461] other info that might help us debug this:
[   19.425187] Chain exists of:
[   19.425187]   &sbq->ws[i].wait --> &(&hctx->dispatch_wait_lock)->rlock --> &(&sb->map[i].swap_lock)->rlock
[   19.425187] 
[   19.432974]  Possible interrupt unsafe locking scenario:
[   19.432974] 
[   19.436954]        CPU0                    CPU1
[   19.439002]        ----                    ----
[   19.440897]   lock(&(&sb->map[i].swap_lock)->rlock);
[   19.442828]                                local_irq_disable();
[   19.444767]                                lock(&sbq->ws[i].wait);
[   19.446767]                                lock(&(&hctx->dispatch_wait_lock)->rlock);
[   19.448716]   <Interrupt>
[   19.450574]     lock(&sbq->ws[i].wait);
[   19.452330] 
[   19.452330]  *** DEADLOCK ***
[   19.452330] 
[   19.456951] no locks held by swapper/3/0.
[   19.458648] 
[   19.458648] the shortest dependencies between 2nd lock and 1st lock:
[   19.461928]   -> (&(&sb->map[i].swap_lock)->rlock){+.+.} ops: 38 {
[   19.463648]      HARDIRQ-ON-W at:
[   19.465233]                         _raw_spin_lock+0x27/0x40
[   19.466837]                         sbitmap_get+0x8b/0x160
[   19.468490]                         __sbitmap_queue_get+0x24/0x90
[   19.469988]                         blk_mq_get_tag+0x28a/0x2c0
[   19.471455]                         blk_mq_get_driver_tag+0x7a/0x100
[   19.473051]                         blk_mq_dispatch_rq_list+0x95/0x550
[   19.474653]                         blk_mq_do_dispatch_sched+0x71/0x110
[   19.476137]                         blk_mq_sched_dispatch_requests+0x114/0x160
[   19.477712]                         __blk_mq_run_hw_queue+0x67/0xf0
[   19.479282]                         __blk_mq_delay_run_hw_queue+0x150/0x170
[   19.480830]                         blk_mq_run_hw_queue+0x83/0xe0
[   19.482302]                         blk_mq_sched_insert_request+0x14a/0x1c0
[   19.483868]                         blk_execute_rq+0x64/0x90
[   19.485396]                         __scsi_execute+0x104/0x250
[   19.486779]                         sd_revalidate_disk+0xe7/0x1cf0
[   19.488184]                         sd_probe_async+0xae/0x194
[   19.489598]                         async_run_entry_fn+0x34/0x100
[   19.491025]                         process_one_work+0x237/0x5d0
[   19.493611]                         worker_thread+0x37/0x380
[   19.496588]                         kthread+0x118/0x130
[   19.498871]                         ret_from_fork+0x3a/0x50
[   19.501009]      SOFTIRQ-ON-W at:
[   19.502991]                         _raw_spin_lock+0x27/0x40
[   19.505302]                         sbitmap_get+0x8b/0x160
[   19.507375]                         __sbitmap_queue_get+0x24/0x90
[   19.509440]                         blk_mq_get_tag+0x28a/0x2c0
[   19.510859]                         blk_mq_get_driver_tag+0x7a/0x100
[   19.512334]                         blk_mq_dispatch_rq_list+0x95/0x550
[   19.513960]                         blk_mq_do_dispatch_sched+0x71/0x110
[   19.515388]                         blk_mq_sched_dispatch_requests+0x114/0x160
[   19.516909]                         __blk_mq_run_hw_queue+0x67/0xf0
[   19.518433]                         __blk_mq_delay_run_hw_queue+0x150/0x170
[   19.519831]                         blk_mq_run_hw_queue+0x83/0xe0
[   19.521232]                         blk_mq_sched_insert_request+0x14a/0x1c0
[   19.522638]                         blk_execute_rq+0x64/0x90
[   19.524015]                         __scsi_execute+0x104/0x250
[   19.525393]                         sd_revalidate_disk+0xe7/0x1cf0
[   19.526795]                         sd_probe_async+0xae/0x194
[   19.528213]                         async_run_entry_fn+0x34/0x100
[   19.529810]                         process_one_work+0x237/0x5d0
[   19.531178]                         worker_thread+0x37/0x380
[   19.532548]                         kthread+0x118/0x130
[   19.533892]                         ret_from_fork+0x3a/0x50
[   19.535232]      INITIAL USE at:
[   19.536504]                        _raw_spin_lock+0x27/0x40
[   19.537939]                        sbitmap_get+0x8b/0x160
[   19.539411]                        __sbitmap_queue_get+0x24/0x90
[   19.540782]                        blk_mq_get_tag+0x28a/0x2c0
[   19.542134]                        blk_mq_get_driver_tag+0x7a/0x100
[   19.543495]                        blk_mq_dispatch_rq_list+0x95/0x550
[   19.544925]                        blk_mq_do_dispatch_sched+0x71/0x110
[   19.546312]                        blk_mq_sched_dispatch_requests+0x114/0x160
[   19.547712]                        __blk_mq_run_hw_queue+0x67/0xf0
[   19.549097]                        __blk_mq_delay_run_hw_queue+0x150/0x170
[   19.550692]                        blk_mq_run_hw_queue+0x83/0xe0
[   19.552086]                        blk_mq_sched_insert_request+0x14a/0x1c0
[   19.553494]                        blk_execute_rq+0x64/0x90
[   19.554862]                        __scsi_execute+0x104/0x250
[   19.556233]                        sd_revalidate_disk+0xe7/0x1cf0
[   19.557621]                        sd_probe_async+0xae/0x194
[   19.558993]                        async_run_entry_fn+0x34/0x100
[   19.560383]                        process_one_work+0x237/0x5d0
[   19.561751]                        worker_thread+0x37/0x380
[   19.563267]                        kthread+0x118/0x130
[   19.564715]                        ret_from_fork+0x3a/0x50
[   19.566050]    }
[   19.567296]    ... key      at: [<ffffffffa2f27468>] __key.23877+0x0/0x8
[   19.568691]    ... acquired at:
[   19.569986]    sbitmap_get+0x8b/0x160
[   19.571279]    __sbitmap_queue_get+0x24/0x90
[   19.572619]    blk_mq_get_tag+0x28a/0x2c0
[   19.573957]    blk_mq_get_driver_tag+0xda/0x100
[   19.575308]    blk_mq_dispatch_rq_list+0x2f6/0x550
[   19.576643]    blk_mq_do_dispatch_sched+0x71/0x110
[   19.577968]    blk_mq_sched_dispatch_requests+0x114/0x160
[   19.579328]    __blk_mq_run_hw_queue+0x67/0xf0
[   19.580671]    __blk_mq_delay_run_hw_queue+0x150/0x170
[   19.582047]    blk_mq_run_hw_queue+0x83/0xe0
[   19.583517]    blk_mq_sched_insert_request+0x14a/0x1c0
[   19.584932]    blk_execute_rq+0x64/0x90
[   19.586256]    __scsi_execute+0x104/0x250
[   19.587617]    scsi_probe_and_add_lun+0x1b5/0xc00
[   19.589020]    __scsi_scan_target+0x1f9/0x530
[   19.590367]    scsi_scan_channel.part.10+0x51/0x70
[   19.591742]    scsi_scan_host_selected+0xc9/0x140
[   19.593200]    scsi_scan_host+0x18f/0x1d0
[   19.594621]    usb_stor_scan_dwork+0x1a/0x80
[   19.595943]    process_one_work+0x237/0x5d0
[   19.597271]    worker_thread+0x37/0x380
[   19.598566]    kthread+0x118/0x130
[   19.599845]    ret_from_fork+0x3a/0x50
[   19.601128] 
[   19.602384]  -> (&(&hctx->dispatch_wait_lock)->rlock){....} ops: 1 {
[   19.603964]     INITIAL USE at:
[   19.605271]                      _raw_spin_lock+0x27/0x40
[   19.606670]                      blk_mq_dispatch_rq_list+0x270/0x550
[   19.608277]                      blk_mq_do_dispatch_sched+0x71/0x110
[   19.609808]                      blk_mq_sched_dispatch_requests+0x114/0x160
[   19.611239]                      __blk_mq_run_hw_queue+0x67/0xf0
[   19.612650]                      __blk_mq_delay_run_hw_queue+0x150/0x170
[   19.614080]                      blk_mq_run_hw_queue+0x83/0xe0
[   19.615497]                      blk_mq_sched_insert_request+0x14a/0x1c0
[   19.616949]                      blk_execute_rq+0x64/0x90
[   19.618374]                      __scsi_execute+0x104/0x250
[   19.619787]                      scsi_probe_and_add_lun+0x1b5/0xc00
[   19.621226]                      __scsi_scan_target+0x1f9/0x530
[   19.622666]                      scsi_scan_channel.part.10+0x51/0x70
[   19.624117]                      scsi_scan_host_selected+0xc9/0x140
[   19.625557]                      scsi_scan_host+0x18f/0x1d0
[   19.626992]                      usb_stor_scan_dwork+0x1a/0x80
[   19.628441]                      process_one_work+0x237/0x5d0
[   19.629832]                      worker_thread+0x37/0x380
[   19.631235]                      kthread+0x118/0x130
[   19.632724]                      ret_from_fork+0x3a/0x50
[   19.634252]   }
[   19.635499]   ... key      at: [<ffffffffa2f256a0>] __key.49100+0x0/0x8
[   19.636925]   ... acquired at:
[   19.638255]    blk_mq_dispatch_rq_list+0x270/0x550
[   19.639655]    blk_mq_do_dispatch_sched+0x71/0x110
[   19.641076]    blk_mq_sched_dispatch_requests+0x114/0x160
[   19.642517]    __blk_mq_run_hw_queue+0x67/0xf0
[   19.643961]    __blk_mq_delay_run_hw_queue+0x150/0x170
[   19.645424]    blk_mq_run_hw_queue+0x83/0xe0
[   19.646831]    blk_mq_sched_insert_request+0x14a/0x1c0
[   19.648265]    blk_execute_rq+0x64/0x90
[   19.649634]    __scsi_execute+0x104/0x250
[   19.650983]    scsi_probe_and_add_lun+0x1b5/0xc00
[   19.652360]    __scsi_scan_target+0x1f9/0x530
[   19.653808]    scsi_scan_channel.part.10+0x51/0x70
[   19.655364]    scsi_scan_host_selected+0xc9/0x140
[   19.656751]    scsi_scan_host+0x18f/0x1d0
[   19.658086]    usb_stor_scan_dwork+0x1a/0x80
[   19.659416]    process_one_work+0x237/0x5d0
[   19.660750]    worker_thread+0x37/0x380
[   19.662057]    kthread+0x118/0x130
[   19.663353]    ret_from_fork+0x3a/0x50
[   19.664649] 
[   19.665860] -> (&sbq->ws[i].wait){..-.} ops: 3 {
[   19.667186]    IN-SOFTIRQ-W at:
[   19.668474]                     _raw_spin_lock_irqsave+0x30/0x40
[   19.669878]                     __wake_up_common_lock+0x5e/0xb0
[   19.671278]                     __sbq_wake_up+0x8d/0xa0
[   19.672650]                     sbitmap_queue_clear+0x3b/0x60
[   19.674036]                     __blk_mq_free_request+0x64/0x90
[   19.675438]                     scsi_end_request+0x109/0x350
[   19.676911]                     scsi_io_completion+0x72/0x600
[   19.678477]                     blk_done_softirq+0x9a/0xd0
[   19.679847]                     __do_softirq+0xc8/0x455
[   19.681212]                     irq_exit+0xcc/0xe0
[   19.682543]                     call_function_single_interrupt+0xf/0x20
[   19.683941]                     default_idle+0x19/0x160
[   19.685311]                     do_idle+0x1d3/0x250
[   19.686643]                     cpu_startup_entry+0x14/0x20
[   19.688000]                     start_secondary+0x18f/0x1c0
[   19.689443]                     secondary_startup_64+0xa4/0xb0
[   19.690983]    INITIAL USE at:
[   19.693258]                    _raw_spin_lock_irq+0x2d/0x40
[   19.696185]                    blk_mq_dispatch_rq_list+0x25c/0x550
[   19.699217]                    blk_mq_do_dispatch_sched+0x71/0x110
[   19.702076]                    blk_mq_sched_dispatch_requests+0x114/0x160
[   19.704737]                    __blk_mq_run_hw_queue+0x67/0xf0
[   19.707182]                    __blk_mq_delay_run_hw_queue+0x150/0x170
[   19.709412]                    blk_mq_run_hw_queue+0x83/0xe0
[   19.710880]                    blk_mq_sched_insert_request+0x14a/0x1c0
[   19.712307]                    blk_execute_rq+0x64/0x90
[   19.713697]                    __scsi_execute+0x104/0x250
[   19.715083]                    scsi_probe_and_add_lun+0x1b5/0xc00
[   19.716574]                    __scsi_scan_target+0x1f9/0x530
[   19.718206]                    scsi_scan_channel.part.10+0x51/0x70
[   19.719641]                    scsi_scan_host_selected+0xc9/0x140
[   19.721184]                    scsi_scan_host+0x18f/0x1d0
[   19.722697]                    usb_stor_scan_dwork+0x1a/0x80
[   19.724077]                    process_one_work+0x237/0x5d0
[   19.725460]                    worker_thread+0x37/0x380
[   19.726820]                    kthread+0x118/0x130
[   19.728167]                    ret_from_fork+0x3a/0x50
[   19.729508]  }
[   19.730752]  ... key      at: [<ffffffffa2f27460>] __key.24146+0x0/0x8
[   19.732169]  ... acquired at:
[   19.733509]    __lock_acquire+0x8c6/0x1880
[   19.734865]    lock_acquire+0xaa/0x180
[   19.736210]    _raw_spin_lock_irqsave+0x30/0x40
[   19.737564]    __wake_up_common_lock+0x5e/0xb0
[   19.738927]    __sbq_wake_up+0x8d/0xa0
[   19.740276]    sbitmap_queue_clear+0x3b/0x60
[   19.741635]    __blk_mq_free_request+0x64/0x90
[   19.743008]    scsi_end_request+0x109/0x350
[   19.744361]    scsi_io_completion+0x72/0x600
[   19.745690]    blk_done_softirq+0x9a/0xd0
[   19.747017]    __do_softirq+0xc8/0x455
[   19.748342]    irq_exit+0xcc/0xe0
[   19.749637]    call_function_single_interrupt+0xf/0x20
[   19.750979]    default_idle+0x19/0x160
[   19.752318]    do_idle+0x1d3/0x250
[   19.753758]    cpu_startup_entry+0x14/0x20
[   19.755129]    start_secondary+0x18f/0x1c0
[   19.756470]    secondary_startup_64+0xa4/0xb0
[   19.757954] 
[   19.759225] 
[   19.759225] stack backtrace:
[   19.761899] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.20.0-rc5-next-20181207 #1
[   19.763359] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[   19.765024] Call Trace:
[   19.766589]  <IRQ>
[   19.768074]  dump_stack+0x67/0x90
[   19.769580]  check_usage_forwards.cold.54+0x20/0x29
[   19.771137]  mark_lock+0x2ca/0x620
[   19.772705]  ? check_usage_backwards+0x110/0x110
[   19.774300]  __lock_acquire+0x8c6/0x1880
[   19.775676]  ? select_task_rq_fair+0x1ca/0x1110
[   19.777062]  ? __lock_acquire+0x38a/0x1880
[   19.778424]  ? __lock_acquire+0x38a/0x1880
[   19.779751]  lock_acquire+0xaa/0x180
[   19.781070]  ? __wake_up_common_lock+0x5e/0xb0
[   19.782422]  _raw_spin_lock_irqsave+0x30/0x40
[   19.783766]  ? __wake_up_common_lock+0x5e/0xb0
[   19.785129]  __wake_up_common_lock+0x5e/0xb0
[   19.786483]  __sbq_wake_up+0x8d/0xa0
[   19.787809]  sbitmap_queue_clear+0x3b/0x60
[   19.789143]  __blk_mq_free_request+0x64/0x90
[   19.790631]  scsi_end_request+0x109/0x350
[   19.792150]  scsi_io_completion+0x72/0x600
[   19.793671]  blk_done_softirq+0x9a/0xd0
[   19.795121]  __do_softirq+0xc8/0x455
[   19.796441]  irq_exit+0xcc/0xe0
[   19.797730]  call_function_single_interrupt+0xf/0x20
[   19.799153]  </IRQ>
[   19.800515] RIP: 0010:default_idle+0x19/0x160
[   19.801932] Code: ff 48 89 ea 48 89 df 31 f6 5b 5d e9 31 73 91 ff 90 41 55 41 54 55 53 65 8b 2d a3 dd e4 5e 66 66 66 66 90 e8 b9 1f 59 ff fb f4 <65> 8b 2d 90 dd e4 5e 66 66 66 66 90 5b 5d 41 5c 41 5d c3 65 8b 05
[   19.806839] RSP: 0000:ffffb6f74020bec0 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff04
[   19.808531] RAX: ffff97563f2bc080 RBX: 0000000000000003 RCX: 0000000000000000
[   19.810312] RDX: 0000000000000046 RSI: 0000000000000001 RDI: ffff97563f2bc080
[   19.811925] RBP: 0000000000000003 R08: 0000000000000001 R09: 0000000000000000
[   19.813555] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[   19.815169] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[   19.817002]  do_idle+0x1d3/0x250
[   19.818604]  cpu_startup_entry+0x14/0x20
[   19.820125]  start_secondary+0x18f/0x1c0
[   19.821635]  secondary_startup_64+0xa4/0xb0
[   19.856824] EXT4-fs (sda): mounted filesystem without journal. Opts: (null)
[   19.860450] VFS: Mounted root (ext2 filesystem) on device 8:0.
[   19.870456] devtmpfs: mounted
[   19.975778] Freeing unused kernel image memory: 1348K
[   19.987597] Write protecting the kernel read-only data: 22528k
[   19.999208] Freeing unused kernel image memory: 2008K
[   20.007277] Freeing unused kernel image memory: 1824K
[   20.010030] rodata_test: all tests were successful
[   20.012948] Run /sbin/init as init process
[   20.640528] EXT4-fs (sda): re-mounted. Opts: block_validity,barrier,user_xattr,acl
[   21.570798] random: dd: uninitialized urandom read (512 bytes read)
[   21.621870] dd (1296) used greatest stack depth: 12728 bytes left

---
# bad: [74c4a24df7cac1f9213a811d79558ecde23be9a2] Add linux-next specific files for 20181207
# good: [2595646791c319cadfdbf271563aac97d0843dc7] Linux 4.20-rc5
git bisect start 'HEAD' 'v4.20-rc5'
# good: [8d285caa38663cc22f7166f7cfb9ceb9d06d3c7e] Merge remote-tracking branch 'ipsec-next/master'
git bisect good 8d285caa38663cc22f7166f7cfb9ceb9d06d3c7e
# bad: [a8147fd1777cecc0a0cc57fcd742a2b06660bd12] Merge remote-tracking branch 'backlight/for-backlight-next'
git bisect bad a8147fd1777cecc0a0cc57fcd742a2b06660bd12
# good: [467e8a516dcf922d1ea343cebb0e751f81f0dca3] Merge tag 'drm-intel-next-2018-12-04' of git://anongit.freedesktop.org/drm/drm-intel into drm-next
git bisect good 467e8a516dcf922d1ea343cebb0e751f81f0dca3
# good: [25f68bda094fd6a808d1284abc23b194b0668a47] Merge remote-tracking branch 'sound/for-next'
git bisect good 25f68bda094fd6a808d1284abc23b194b0668a47
# bad: [81b6aa39f339c7ea9fe318580c168e55c5a7f593] Merge branch 'for-4.21/block' into for-next
git bisect bad 81b6aa39f339c7ea9fe318580c168e55c5a7f593
# good: [b6676f653f13f83582985bc713525a48d735b2a3] block: remove a few unused exports
git bisect good b6676f653f13f83582985bc713525a48d735b2a3
# good: [80ff2040ac3dd6d9d68d59159b0a6dffd8adc5ff] ataflop: implement mq_ops->commit_rqs() hook
git bisect good 80ff2040ac3dd6d9d68d59159b0a6dffd8adc5ff
# bad: [79eb7650f7f33f4e63a2380f00fc2fac651da7c1] Merge branch 'for-4.21/block' into for-next
git bisect bad 79eb7650f7f33f4e63a2380f00fc2fac651da7c1
# bad: [6322307809649cba6f545640563f95d686ecf404] nvme-pci: cleanup SQ allocation a bit
git bisect bad 6322307809649cba6f545640563f95d686ecf404
# bad: [2149da0748fc236b9916c53e26b3b0c9ab20a5dd] block: add cmd_flags to print_req_error
git bisect bad 2149da0748fc236b9916c53e26b3b0c9ab20a5dd
# good: [27fae429acee1e9418059e7fa545438075af5256] sbitmap: don't loop for find_next_zero_bit() for !round_robin
git bisect good 27fae429acee1e9418059e7fa545438075af5256
# bad: [ea86ea2cdced20057da4d2c32965c1219c238197] sbitmap: ammortize cost of clearing bits
git bisect bad ea86ea2cdced20057da4d2c32965c1219c238197
# good: [531724abc3bfb556c1dd68086cf9cb51f76464e3] block: avoid extra bio reference for async O_DIRECT
git bisect good 531724abc3bfb556c1dd68086cf9cb51f76464e3
# first bad commit: [ea86ea2cdced20057da4d2c32965c1219c238197] sbitmap: ammortize cost of clearing bits

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH 2/3] sbitmap: ammortize cost of clearing bits
  2018-12-09  5:51   ` Guenter Roeck
@ 2018-12-09  6:19     ` Jens Axboe
  0 siblings, 0 replies; 6+ messages in thread
From: Jens Axboe @ 2018-12-09  6:19 UTC (permalink / raw)
  To: Guenter Roeck; +Cc: linux-block, osandov, linux-kernel

On Dec 8, 2018, at 10:51 PM, Guenter Roeck <linux@roeck-us.net> wrote:
> 
> Hi,
> 
>> On Thu, Nov 29, 2018 at 06:12:33PM -0700, Jens Axboe wrote:
>> sbitmap maintains a set of words that we use to set and clear bits, with
>> each bit representing a tag for blk-mq. Even though we spread the bits
>> out and maintain a hint cache, one particular bit allocated will end up
>> being cleared in the exact same spot.
>> 
>> This introduces batched clearing of bits. Instead of clearing a given
>> bit, the same bit is set in a cleared/free mask instead. If we fail
>> allocating a bit from a given word, then we check the free mask, and
>> batch move those cleared bits at that time. This trades 64 atomic bitops
>> for 2 cmpxchg().
>> 
>> In a threaded poll test case, half the overhead of getting and clearing
>> tags is removed with this change. On another poll test case with a
>> single thread, performance is unchanged.
>> 
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> 
> This patch results in irq lock inversion warnings when trying to boot
> from usb drives. This was observed with qemu boots of aarch64, x86, and
> x86_64 images.

This one is a false positive, was already reported last week. Just need to add some lockdep annotation for it. 



^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2018-12-09  6:20 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20181130011234.32674-1-axboe@kernel.dk>
2018-11-30  1:12 ` [PATCH 1/3] sbitmap: ensure that sbitmap maps are properly aligned Jens Axboe
2018-11-30  1:12 ` [PATCH 2/3] sbitmap: ammortize cost of clearing bits Jens Axboe
2018-12-09  5:51   ` Guenter Roeck
2018-12-09  6:19     ` Jens Axboe
2018-11-30  1:12 ` [PATCH 3/3] sbitmap: optimize wakeup check Jens Axboe
2018-11-30  2:09 ` Jens Axboe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).