* [PATCH 1/3] sbitmap: ensure that sbitmap maps are properly aligned
[not found] <20181130011234.32674-1-axboe@kernel.dk>
@ 2018-11-30 1:12 ` Jens Axboe
2018-11-30 1:12 ` [PATCH 2/3] sbitmap: ammortize cost of clearing bits Jens Axboe
` (2 subsequent siblings)
3 siblings, 0 replies; 6+ messages in thread
From: Jens Axboe @ 2018-11-30 1:12 UTC (permalink / raw)
To: linux-block, osandov; +Cc: Jens Axboe
We try to be careful with alignment for cache purposes, but all of
that is worthless if we don't actually align the maps themselves.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
include/linux/sbitmap.h | 11 ++++++++---
lib/sbitmap.c | 7 +++++--
2 files changed, 13 insertions(+), 5 deletions(-)
diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
index 804a50983ec5..5cb1755d32da 100644
--- a/include/linux/sbitmap.h
+++ b/include/linux/sbitmap.h
@@ -63,9 +63,14 @@ struct sbitmap {
unsigned int map_nr;
/**
- * @map: Allocated bitmap.
+ * @map: Aligned allocated bitmap.
*/
struct sbitmap_word *map;
+
+ /**
+ * @map_ptr: Originally allocated map pointer
+ */
+ void *map_ptr;
};
#define SBQ_WAIT_QUEUES 8
@@ -157,8 +162,8 @@ int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift,
*/
static inline void sbitmap_free(struct sbitmap *sb)
{
- kfree(sb->map);
- sb->map = NULL;
+ kfree(sb->map_ptr);
+ sb->map_ptr = sb->map = NULL;
}
/**
diff --git a/lib/sbitmap.c b/lib/sbitmap.c
index 45cab6bbc1c7..21e776e3128d 100644
--- a/lib/sbitmap.c
+++ b/lib/sbitmap.c
@@ -25,6 +25,7 @@ int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift,
{
unsigned int bits_per_word;
unsigned int i;
+ size_t size;
if (shift < 0) {
shift = ilog2(BITS_PER_LONG);
@@ -52,9 +53,11 @@ int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift,
return 0;
}
- sb->map = kcalloc_node(sb->map_nr, sizeof(*sb->map), flags, node);
- if (!sb->map)
+ size = sb->map_nr * sizeof(*sb->map) + L1_CACHE_BYTES - 1;
+ sb->map_ptr = kzalloc_node(size, flags, node);
+ if (!sb->map_ptr)
return -ENOMEM;
+ sb->map = PTR_ALIGN(sb->map_ptr, L1_CACHE_BYTES);
for (i = 0; i < sb->map_nr; i++) {
sb->map[i].depth = min(depth, bits_per_word);
--
2.17.1
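The over-allocate-and-round-up trick in this patch can be sketched in plain userspace C; this is an illustrative model only (the kernel uses kzalloc_node() and PTR_ALIGN(), and L1_CACHE_BYTES is arch-specific — 64 here is an assumption):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_BYTES 64  /* stand-in for L1_CACHE_BYTES; power of two */

/* Over-allocate by CACHE_BYTES - 1, keep the raw pointer so it can be
 * passed to free() later (mirroring ->map_ptr), and round the working
 * pointer up to the next cache-line boundary (mirroring PTR_ALIGN()). */
static void *alloc_aligned(size_t size, void **raw)
{
	*raw = calloc(1, size + CACHE_BYTES - 1);
	if (!*raw)
		return NULL;
	return (void *)(((uintptr_t)*raw + CACHE_BYTES - 1) &
			~(uintptr_t)(CACHE_BYTES - 1));
}
```

The aligned pointer is what the hot path uses; only the raw pointer is ever freed, which is why the patch stores both.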
^ permalink raw reply related [flat|nested] 6+ messages in thread
* [PATCH 2/3] sbitmap: ammortize cost of clearing bits
[not found] <20181130011234.32674-1-axboe@kernel.dk>
2018-11-30 1:12 ` [PATCH 1/3] sbitmap: ensure that sbitmap maps are properly aligned Jens Axboe
@ 2018-11-30 1:12 ` Jens Axboe
2018-12-09 5:51 ` Guenter Roeck
2018-11-30 1:12 ` [PATCH 3/3] sbitmap: optimize wakeup check Jens Axboe
2018-11-30 2:09 ` Jens Axboe
3 siblings, 1 reply; 6+ messages in thread
From: Jens Axboe @ 2018-11-30 1:12 UTC (permalink / raw)
To: linux-block, osandov; +Cc: Jens Axboe
sbitmap maintains a set of words that we use to set and clear bits, with
each bit representing a tag for blk-mq. Even though we spread the bits
out and maintain a hint cache, one particular bit allocated will end up
being cleared in the exact same spot.
This introduces batched clearing of bits. Instead of clearing a given
bit, the same bit is set in a cleared/free mask instead. If we fail
allocating a bit from a given word, then we check the free mask, and
batch move those cleared bits at that time. This trades 64 atomic bitops
for 2 cmpxchg().
In a threaded poll test case, half the overhead of getting and clearing
tags is removed with this change. On another poll test case with a
single thread, performance is unchanged.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
include/linux/sbitmap.h | 26 +++++++++++++++---
lib/sbitmap.c | 60 ++++++++++++++++++++++++++++++++++++++---
2 files changed, 78 insertions(+), 8 deletions(-)
diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
index 5cb1755d32da..13eb8973bd10 100644
--- a/include/linux/sbitmap.h
+++ b/include/linux/sbitmap.h
@@ -30,14 +30,19 @@ struct seq_file;
*/
struct sbitmap_word {
/**
- * @word: The bitmap word itself.
+ * @depth: Number of bits being used in @word/@cleared
*/
- unsigned long word;
+ unsigned long depth;
/**
- * @depth: Number of bits being used in @word.
+ * @word: word holding free bits
*/
- unsigned long depth;
+ unsigned long word ____cacheline_aligned_in_smp;
+
+ /**
+ * @cleared: word holding cleared bits
+ */
+ unsigned long cleared ____cacheline_aligned_in_smp;
} ____cacheline_aligned_in_smp;
/**
@@ -315,6 +320,19 @@ static inline void sbitmap_clear_bit(struct sbitmap *sb, unsigned int bitnr)
clear_bit(SB_NR_TO_BIT(sb, bitnr), __sbitmap_word(sb, bitnr));
}
+/*
+ * This one is special, since it doesn't actually clear the bit, rather it
+ * sets the corresponding bit in the ->cleared mask instead. Paired with
+ * the caller doing sbitmap_batch_clear() if a given index is full, which
+ * will clear the previously freed entries in the corresponding ->word.
+ */
+static inline void sbitmap_deferred_clear_bit(struct sbitmap *sb, unsigned int bitnr)
+{
+ unsigned long *addr = &sb->map[SB_NR_TO_INDEX(sb, bitnr)].cleared;
+
+ set_bit(SB_NR_TO_BIT(sb, bitnr), addr);
+}
+
static inline void sbitmap_clear_bit_unlock(struct sbitmap *sb,
unsigned int bitnr)
{
diff --git a/lib/sbitmap.c b/lib/sbitmap.c
index 21e776e3128d..04db31f4dfda 100644
--- a/lib/sbitmap.c
+++ b/lib/sbitmap.c
@@ -114,6 +114,58 @@ static int __sbitmap_get_word(unsigned long *word, unsigned long depth,
return nr;
}
+/*
+ * See if we have deferred clears that we can batch move
+ */
+static inline bool sbitmap_deferred_clear(struct sbitmap *sb, int index)
+{
+ unsigned long mask, val;
+
+ if (!sb->map[index].cleared)
+ return false;
+
+ /*
+ * First get a stable cleared mask, setting the old mask to 0.
+ */
+ do {
+ mask = sb->map[index].cleared;
+ } while (cmpxchg(&sb->map[index].cleared, mask, 0) != mask);
+
+ /*
+ * Now clear the masked bits in our free word
+ */
+ do {
+ val = sb->map[index].word;
+ } while (cmpxchg(&sb->map[index].word, val, val & ~mask) != val);
+
+ /*
+ * If someone found ->cleared == 0 before we wrote ->word, then
+ * they could have failed when they should not have. Check for
+ * waiters.
+ */
+ smp_mb__after_atomic();
+ sbitmap_queue_wake_up(container_of(sb, struct sbitmap_queue, sb));
+ return true;
+}
+
+static int sbitmap_find_bit_in_index(struct sbitmap *sb, int index,
+ unsigned int alloc_hint, bool round_robin)
+{
+ int nr;
+
+ do {
+ nr = __sbitmap_get_word(&sb->map[index].word,
+ sb->map[index].depth, alloc_hint,
+ !round_robin);
+ if (nr != -1)
+ break;
+ if (!sbitmap_deferred_clear(sb, index))
+ break;
+ } while (1);
+
+ return nr;
+}
+
int sbitmap_get(struct sbitmap *sb, unsigned int alloc_hint, bool round_robin)
{
unsigned int i, index;
@@ -132,9 +184,8 @@ int sbitmap_get(struct sbitmap *sb, unsigned int alloc_hint, bool round_robin)
alloc_hint = 0;
for (i = 0; i < sb->map_nr; i++) {
- nr = __sbitmap_get_word(&sb->map[index].word,
- sb->map[index].depth, alloc_hint,
- !round_robin);
+ nr = sbitmap_find_bit_in_index(sb, index, alloc_hint,
+ round_robin);
if (nr != -1) {
nr += index << sb->shift;
break;
@@ -517,7 +568,8 @@ EXPORT_SYMBOL_GPL(sbitmap_queue_wake_up);
void sbitmap_queue_clear(struct sbitmap_queue *sbq, unsigned int nr,
unsigned int cpu)
{
- sbitmap_clear_bit_unlock(&sbq->sb, nr);
+ sbitmap_deferred_clear_bit(&sbq->sb, nr);
+
/*
* Pairs with the memory barrier in set_current_state() to ensure the
* proper ordering of clear_bit_unlock()/waitqueue_active() in the waker
--
2.17.1
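The deferred-clear scheme above can be modelled outside the kernel with C11 atomics standing in for cmpxchg(); the field and function names mirror the patch, but this is an illustrative sketch, not the kernel code:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct word_pair {
	_Atomic unsigned long word;     /* allocated bits */
	_Atomic unsigned long cleared;  /* freed bits not yet moved back */
};

/* Freeing a bit just sets it in ->cleared: one atomic OR, no touching
 * of the contended ->word. */
static void deferred_clear_bit(struct word_pair *w, unsigned int bit)
{
	atomic_fetch_or(&w->cleared, 1UL << bit);
}

/* On allocation failure, swap the whole cleared mask out and apply it
 * to ->word: two CAS loops instead of one atomic op per freed bit. */
static bool batch_clear(struct word_pair *w)
{
	unsigned long mask, val;

	if (atomic_load(&w->cleared) == 0)
		return false;

	/* First get a stable cleared mask, setting the old mask to 0. */
	do {
		mask = atomic_load(&w->cleared);
	} while (!atomic_compare_exchange_weak(&w->cleared, &mask, 0));

	/* Now drop the masked bits from the allocation word. */
	do {
		val = atomic_load(&w->word);
	} while (!atomic_compare_exchange_weak(&w->word, &val, val & ~mask));

	return true;
}
```

This is where the "64 atomic bitops for 2 cmpxchg()" trade in the commit message comes from: a full word of frees costs one OR each, and the transfer back is two CAS loops regardless of how many bits moved.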
* [PATCH 3/3] sbitmap: optimize wakeup check
[not found] <20181130011234.32674-1-axboe@kernel.dk>
2018-11-30 1:12 ` [PATCH 1/3] sbitmap: ensure that sbitmap maps are properly aligned Jens Axboe
2018-11-30 1:12 ` [PATCH 2/3] sbitmap: ammortize cost of clearing bits Jens Axboe
@ 2018-11-30 1:12 ` Jens Axboe
2018-11-30 2:09 ` Jens Axboe
3 siblings, 0 replies; 6+ messages in thread
From: Jens Axboe @ 2018-11-30 1:12 UTC (permalink / raw)
To: linux-block, osandov; +Cc: Jens Axboe
Even if we have no waiters on any of the sbitmap_queue wait states, we
still have to loop every entry to check. We do this for every IO, so
the cost adds up.
Shift a bit of the cost to the slow path, when we actually have waiters.
Wrap prepare_to_wait_exclusive() and finish_wait(), so we can maintain
an internal count of how many are currently active. Then we can simply
check this count in sbq_wake_ptr() and not have to loop if we don't
have any sleepers.
Convert the two sbitmap users that wait for tags, blk-mq-tag and iSCSI,
to these wrappers.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
block/blk-mq-tag.c | 7 +++----
drivers/target/iscsi/iscsi_target_util.c | 8 +++++---
include/linux/sbitmap.h | 19 +++++++++++++++++++
lib/sbitmap.c | 21 +++++++++++++++++++++
4 files changed, 48 insertions(+), 7 deletions(-)
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 87bc5df72d48..66c3a1c887ed 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -154,8 +154,7 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
if (tag != -1)
break;
- prepare_to_wait_exclusive(&ws->wait, &wait,
- TASK_UNINTERRUPTIBLE);
+ sbitmap_prepare_to_wait(bt, ws, &wait, TASK_UNINTERRUPTIBLE);
tag = __blk_mq_get_tag(data, bt);
if (tag != -1)
@@ -167,6 +166,8 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
bt_prev = bt;
io_schedule();
+ sbitmap_finish_wait(bt, ws, &wait);
+
data->ctx = blk_mq_get_ctx(data->q);
data->hctx = blk_mq_map_queue(data->q, data->cmd_flags,
data->ctx->cpu);
@@ -176,8 +177,6 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
else
bt = &tags->bitmap_tags;
- finish_wait(&ws->wait, &wait);
-
/*
* If destination hw queue is changed, fake wake up on
* previous queue for compensating the wake up miss, so
diff --git a/drivers/target/iscsi/iscsi_target_util.c b/drivers/target/iscsi/iscsi_target_util.c
index 36b742932c72..d7d03d601732 100644
--- a/drivers/target/iscsi/iscsi_target_util.c
+++ b/drivers/target/iscsi/iscsi_target_util.c
@@ -152,13 +152,15 @@ static int iscsit_wait_for_tag(struct se_session *se_sess, int state, int *cpup)
int tag = -1;
DEFINE_WAIT(wait);
struct sbq_wait_state *ws;
+ struct sbitmap_queue *sbq;
if (state == TASK_RUNNING)
return tag;
- ws = &se_sess->sess_tag_pool.ws[0];
+ sbq = &se_sess->sess_tag_pool;
+ ws = &sbq->ws[0];
for (;;) {
- prepare_to_wait_exclusive(&ws->wait, &wait, state);
+ sbitmap_prepare_to_wait(sbq, ws, &wait, state);
if (signal_pending_state(state, current))
break;
tag = sbitmap_queue_get(&se_sess->sess_tag_pool, cpup);
@@ -167,7 +169,7 @@ static int iscsit_wait_for_tag(struct se_session *se_sess, int state, int *cpup)
schedule();
}
- finish_wait(&ws->wait, &wait);
+ sbitmap_finish_wait(sbq, ws, &wait);
return tag;
}
diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
index 13eb8973bd10..dbfbac0c4daa 100644
--- a/include/linux/sbitmap.h
+++ b/include/linux/sbitmap.h
@@ -135,6 +135,11 @@ struct sbitmap_queue {
*/
struct sbq_wait_state *ws;
+ /*
+ * @ws_active: count of currently active ws waitqueues
+ */
+ atomic_t ws_active;
+
/**
* @round_robin: Allocate bits in strict round-robin order.
*/
@@ -554,4 +559,18 @@ void sbitmap_queue_wake_up(struct sbitmap_queue *sbq);
*/
void sbitmap_queue_show(struct sbitmap_queue *sbq, struct seq_file *m);
+/*
+ * Wrapper around prepare_to_wait_exclusive(), which maintains some extra
+ * internal state.
+ */
+void sbitmap_prepare_to_wait(struct sbitmap_queue *sbq,
+ struct sbq_wait_state *ws,
+ struct wait_queue_entry *wait, int state);
+
+/*
+ * Must be paired with sbitmap_prepare_to_wait().
+ */
+void sbitmap_finish_wait(struct sbitmap_queue *sbq, struct sbq_wait_state *ws,
+ struct wait_queue_entry *wait);
+
#endif /* __LINUX_SCALE_BITMAP_H */
diff --git a/lib/sbitmap.c b/lib/sbitmap.c
index 04db31f4dfda..1cc21f916276 100644
--- a/lib/sbitmap.c
+++ b/lib/sbitmap.c
@@ -384,6 +384,7 @@ int sbitmap_queue_init_node(struct sbitmap_queue *sbq, unsigned int depth,
sbq->min_shallow_depth = UINT_MAX;
sbq->wake_batch = sbq_calc_wake_batch(sbq, depth);
atomic_set(&sbq->wake_index, 0);
+ atomic_set(&sbq->ws_active, 0);
sbq->ws = kzalloc_node(SBQ_WAIT_QUEUES * sizeof(*sbq->ws), flags, node);
if (!sbq->ws) {
@@ -499,6 +500,9 @@ static struct sbq_wait_state *sbq_wake_ptr(struct sbitmap_queue *sbq)
{
int i, wake_index;
+ if (!atomic_read(&sbq->ws_active))
+ return NULL;
+
wake_index = atomic_read(&sbq->wake_index);
for (i = 0; i < SBQ_WAIT_QUEUES; i++) {
struct sbq_wait_state *ws = &sbq->ws[wake_index];
@@ -639,3 +643,20 @@ void sbitmap_queue_show(struct sbitmap_queue *sbq, struct seq_file *m)
seq_printf(m, "min_shallow_depth=%u\n", sbq->min_shallow_depth);
}
EXPORT_SYMBOL_GPL(sbitmap_queue_show);
+
+void sbitmap_prepare_to_wait(struct sbitmap_queue *sbq,
+ struct sbq_wait_state *ws,
+ struct wait_queue_entry *wait, int state)
+{
+ atomic_inc(&sbq->ws_active);
+ prepare_to_wait_exclusive(&ws->wait, wait, state);
+}
+EXPORT_SYMBOL_GPL(sbitmap_prepare_to_wait);
+
+void sbitmap_finish_wait(struct sbitmap_queue *sbq, struct sbq_wait_state *ws,
+ struct wait_queue_entry *wait)
+{
+ finish_wait(&ws->wait, wait);
+ atomic_dec(&sbq->ws_active);
+}
+EXPORT_SYMBOL_GPL(sbitmap_finish_wait);
--
2.17.1
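The fast-path saving can be modelled with a plain counter gating the wakeup scan; this is an illustrative userspace sketch, not the kernel wait-queue machinery:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SBQ_WAIT_QUEUES 8

struct fake_sbq {
	_Atomic int ws_active;  /* how many waiters are registered */
};

/* The waker: with no registered waiters we bail immediately instead of
 * scanning all SBQ_WAIT_QUEUES entries on every completion, which is
 * exactly the ws_active check added to sbq_wake_ptr(). */
static bool wake_scan(struct fake_sbq *sbq, int *scanned)
{
	*scanned = 0;
	if (!atomic_load(&sbq->ws_active))
		return false;
	for (int i = 0; i < SBQ_WAIT_QUEUES; i++)
		(*scanned)++;	/* stand-in for checking each waitqueue */
	return true;
}

/* The sleeper brackets its wait with these, like the patch's
 * sbitmap_prepare_to_wait()/sbitmap_finish_wait() wrappers. */
static void fake_prepare_to_wait(struct fake_sbq *sbq)
{
	atomic_fetch_add(&sbq->ws_active, 1);
}

static void fake_finish_wait(struct fake_sbq *sbq)
{
	atomic_fetch_sub(&sbq->ws_active, 1);
}
```

The cost moves to the slow path: sleepers pay one extra atomic increment and decrement each, while the per-IO waker avoids the eight-entry loop whenever nobody is waiting.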
* Re:
[not found] <20181130011234.32674-1-axboe@kernel.dk>
` (2 preceding siblings ...)
2018-11-30 1:12 ` [PATCH 3/3] sbitmap: optimize wakeup check Jens Axboe
@ 2018-11-30 2:09 ` Jens Axboe
3 siblings, 0 replies; 6+ messages in thread
From: Jens Axboe @ 2018-11-30 2:09 UTC (permalink / raw)
To: linux-block, osandov
On 11/29/18 6:12 PM, Jens Axboe wrote:
> Three patches here:
>
> 1) Ensure that we align ->map properly
>
> 2) v2 of the sbitmap clear cost ammortization. Updated to do a wakeup
> check AFTER we're done swapping free/cleared masks. Kept the
> separate alignment for ->word, as it is faster in testing.
>
> 3) Cost reduction of having to do wait queue checks.
Ignore this one, see v3 posted.
--
Jens Axboe
* Re: [PATCH 2/3] sbitmap: ammortize cost of clearing bits
2018-11-30 1:12 ` [PATCH 2/3] sbitmap: ammortize cost of clearing bits Jens Axboe
@ 2018-12-09 5:51 ` Guenter Roeck
2018-12-09 6:19 ` Jens Axboe
0 siblings, 1 reply; 6+ messages in thread
From: Guenter Roeck @ 2018-12-09 5:51 UTC (permalink / raw)
To: Jens Axboe; +Cc: linux-block, osandov, linux-kernel
Hi,
On Thu, Nov 29, 2018 at 06:12:33PM -0700, Jens Axboe wrote:
> sbitmap maintains a set of words that we use to set and clear bits, with
> each bit representing a tag for blk-mq. Even though we spread the bits
> out and maintain a hint cache, one particular bit allocated will end up
> being cleared in the exact same spot.
>
> This introduces batched clearing of bits. Instead of clearing a given
> bit, the same bit is set in a cleared/free mask instead. If we fail
> allocating a bit from a given word, then we check the free mask, and
> batch move those cleared bits at that time. This trades 64 atomic bitops
> for 2 cmpxchg().
>
> In a threaded poll test case, half the overhead of getting and clearing
> tags is removed with this change. On another poll test case with a
> single thread, performance is unchanged.
>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch results in irq lock inversion warnings when trying to boot
from usb drives. This was observed with qemu boots of aarch64, x86, and
x86_64 images.
Log output as well as bisect results are attached below. I ran the bisect
twice, once with aarch64 and once with x86_64, with the same results.
Unfortunately, reverting the patch results in a compile error, so I was
unable to verify that reverting it fixes the problem.
Sample qemu command line (x86_64):
qemu-system-x86_64 \
-kernel arch/x86/boot/bzImage -M q35 -cpu SandyBridge \
-no-reboot -snapshot -smp 4 -m 1G \
-usb -device usb-storage,drive=d0 \
-drive file=rootfs.ext2,if=none,id=d0,format=raw \
--append 'root=/dev/sda rw rootwait panic=-1 console=ttyS0 console=tty' \
-nographic
Guenter
---
...
[ 19.035262] Waiting for root device /dev/sda...
[ 19.139135] scsi 6:0:0:0: Direct-Access QEMU QEMU HARDDISK 2.5+ PQ: 0 ANSI: 5
[ 19.161632] sd 6:0:0:0: Attached scsi generic sg1 type 0
[ 19.171661] sd 6:0:0:0: Power-on or device reset occurred
[ 19.186664] sd 6:0:0:0: [sda] 20480 512-byte logical blocks: (10.5 MB/10.0 MiB)
[ 19.197035] sd 6:0:0:0: [sda] Write Protect is off
[ 19.209327] sd 6:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 19.330020] sd 6:0:0:0: [sda] Attached SCSI disk
[ 19.363210] EXT4-fs (sda): mounting ext2 file system using the ext4 subsystem
[ 19.392150]
[ 19.394410] ========================================================
[ 19.397051] WARNING: possible irq lock inversion dependency detected
[ 19.399744] 4.20.0-rc5-next-20181207 #1 Not tainted
[ 19.402294] --------------------------------------------------------
[ 19.404981] swapper/3/0 just changed the state of lock:
[ 19.407185] (____ptrval____) (&sbq->ws[i].wait){..-.}, at: __wake_up_common_lock+0x5e/0xb0
[ 19.409500] but this lock took another, SOFTIRQ-unsafe lock in the past:
[ 19.411650] (&(&sb->map[i].swap_lock)->rlock){+.+.}
[ 19.411663]
[ 19.411663]
[ 19.411663] and interrupts could create inverse lock ordering between them.
[ 19.411663]
[ 19.421461]
[ 19.421461] other info that might help us debug this:
[ 19.425187] Chain exists of:
[ 19.425187] &sbq->ws[i].wait --> &(&hctx->dispatch_wait_lock)->rlock --> &(&sb->map[i].swap_lock)->rlock
[ 19.425187]
[ 19.432974] Possible interrupt unsafe locking scenario:
[ 19.432974]
[ 19.436954] CPU0 CPU1
[ 19.439002] ---- ----
[ 19.440897] lock(&(&sb->map[i].swap_lock)->rlock);
[ 19.442828] local_irq_disable();
[ 19.444767] lock(&sbq->ws[i].wait);
[ 19.446767] lock(&(&hctx->dispatch_wait_lock)->rlock);
[ 19.448716] <Interrupt>
[ 19.450574] lock(&sbq->ws[i].wait);
[ 19.452330]
[ 19.452330] *** DEADLOCK ***
[ 19.452330]
[ 19.456951] no locks held by swapper/3/0.
[ 19.458648]
[ 19.458648] the shortest dependencies between 2nd lock and 1st lock:
[ 19.461928] -> (&(&sb->map[i].swap_lock)->rlock){+.+.} ops: 38 {
[ 19.463648] HARDIRQ-ON-W at:
[ 19.465233] _raw_spin_lock+0x27/0x40
[ 19.466837] sbitmap_get+0x8b/0x160
[ 19.468490] __sbitmap_queue_get+0x24/0x90
[ 19.469988] blk_mq_get_tag+0x28a/0x2c0
[ 19.471455] blk_mq_get_driver_tag+0x7a/0x100
[ 19.473051] blk_mq_dispatch_rq_list+0x95/0x550
[ 19.474653] blk_mq_do_dispatch_sched+0x71/0x110
[ 19.476137] blk_mq_sched_dispatch_requests+0x114/0x160
[ 19.477712] __blk_mq_run_hw_queue+0x67/0xf0
[ 19.479282] __blk_mq_delay_run_hw_queue+0x150/0x170
[ 19.480830] blk_mq_run_hw_queue+0x83/0xe0
[ 19.482302] blk_mq_sched_insert_request+0x14a/0x1c0
[ 19.483868] blk_execute_rq+0x64/0x90
[ 19.485396] __scsi_execute+0x104/0x250
[ 19.486779] sd_revalidate_disk+0xe7/0x1cf0
[ 19.488184] sd_probe_async+0xae/0x194
[ 19.489598] async_run_entry_fn+0x34/0x100
[ 19.491025] process_one_work+0x237/0x5d0
[ 19.493611] worker_thread+0x37/0x380
[ 19.496588] kthread+0x118/0x130
[ 19.498871] ret_from_fork+0x3a/0x50
[ 19.501009] SOFTIRQ-ON-W at:
[ 19.502991] _raw_spin_lock+0x27/0x40
[ 19.505302] sbitmap_get+0x8b/0x160
[ 19.507375] __sbitmap_queue_get+0x24/0x90
[ 19.509440] blk_mq_get_tag+0x28a/0x2c0
[ 19.510859] blk_mq_get_driver_tag+0x7a/0x100
[ 19.512334] blk_mq_dispatch_rq_list+0x95/0x550
[ 19.513960] blk_mq_do_dispatch_sched+0x71/0x110
[ 19.515388] blk_mq_sched_dispatch_requests+0x114/0x160
[ 19.516909] __blk_mq_run_hw_queue+0x67/0xf0
[ 19.518433] __blk_mq_delay_run_hw_queue+0x150/0x170
[ 19.519831] blk_mq_run_hw_queue+0x83/0xe0
[ 19.521232] blk_mq_sched_insert_request+0x14a/0x1c0
[ 19.522638] blk_execute_rq+0x64/0x90
[ 19.524015] __scsi_execute+0x104/0x250
[ 19.525393] sd_revalidate_disk+0xe7/0x1cf0
[ 19.526795] sd_probe_async+0xae/0x194
[ 19.528213] async_run_entry_fn+0x34/0x100
[ 19.529810] process_one_work+0x237/0x5d0
[ 19.531178] worker_thread+0x37/0x380
[ 19.532548] kthread+0x118/0x130
[ 19.533892] ret_from_fork+0x3a/0x50
[ 19.535232] INITIAL USE at:
[ 19.536504] _raw_spin_lock+0x27/0x40
[ 19.537939] sbitmap_get+0x8b/0x160
[ 19.539411] __sbitmap_queue_get+0x24/0x90
[ 19.540782] blk_mq_get_tag+0x28a/0x2c0
[ 19.542134] blk_mq_get_driver_tag+0x7a/0x100
[ 19.543495] blk_mq_dispatch_rq_list+0x95/0x550
[ 19.544925] blk_mq_do_dispatch_sched+0x71/0x110
[ 19.546312] blk_mq_sched_dispatch_requests+0x114/0x160
[ 19.547712] __blk_mq_run_hw_queue+0x67/0xf0
[ 19.549097] __blk_mq_delay_run_hw_queue+0x150/0x170
[ 19.550692] blk_mq_run_hw_queue+0x83/0xe0
[ 19.552086] blk_mq_sched_insert_request+0x14a/0x1c0
[ 19.553494] blk_execute_rq+0x64/0x90
[ 19.554862] __scsi_execute+0x104/0x250
[ 19.556233] sd_revalidate_disk+0xe7/0x1cf0
[ 19.557621] sd_probe_async+0xae/0x194
[ 19.558993] async_run_entry_fn+0x34/0x100
[ 19.560383] process_one_work+0x237/0x5d0
[ 19.561751] worker_thread+0x37/0x380
[ 19.563267] kthread+0x118/0x130
[ 19.564715] ret_from_fork+0x3a/0x50
[ 19.566050] }
[ 19.567296] ... key at: [<ffffffffa2f27468>] __key.23877+0x0/0x8
[ 19.568691] ... acquired at:
[ 19.569986] sbitmap_get+0x8b/0x160
[ 19.571279] __sbitmap_queue_get+0x24/0x90
[ 19.572619] blk_mq_get_tag+0x28a/0x2c0
[ 19.573957] blk_mq_get_driver_tag+0xda/0x100
[ 19.575308] blk_mq_dispatch_rq_list+0x2f6/0x550
[ 19.576643] blk_mq_do_dispatch_sched+0x71/0x110
[ 19.577968] blk_mq_sched_dispatch_requests+0x114/0x160
[ 19.579328] __blk_mq_run_hw_queue+0x67/0xf0
[ 19.580671] __blk_mq_delay_run_hw_queue+0x150/0x170
[ 19.582047] blk_mq_run_hw_queue+0x83/0xe0
[ 19.583517] blk_mq_sched_insert_request+0x14a/0x1c0
[ 19.584932] blk_execute_rq+0x64/0x90
[ 19.586256] __scsi_execute+0x104/0x250
[ 19.587617] scsi_probe_and_add_lun+0x1b5/0xc00
[ 19.589020] __scsi_scan_target+0x1f9/0x530
[ 19.590367] scsi_scan_channel.part.10+0x51/0x70
[ 19.591742] scsi_scan_host_selected+0xc9/0x140
[ 19.593200] scsi_scan_host+0x18f/0x1d0
[ 19.594621] usb_stor_scan_dwork+0x1a/0x80
[ 19.595943] process_one_work+0x237/0x5d0
[ 19.597271] worker_thread+0x37/0x380
[ 19.598566] kthread+0x118/0x130
[ 19.599845] ret_from_fork+0x3a/0x50
[ 19.601128]
[ 19.602384] -> (&(&hctx->dispatch_wait_lock)->rlock){....} ops: 1 {
[ 19.603964] INITIAL USE at:
[ 19.605271] _raw_spin_lock+0x27/0x40
[ 19.606670] blk_mq_dispatch_rq_list+0x270/0x550
[ 19.608277] blk_mq_do_dispatch_sched+0x71/0x110
[ 19.609808] blk_mq_sched_dispatch_requests+0x114/0x160
[ 19.611239] __blk_mq_run_hw_queue+0x67/0xf0
[ 19.612650] __blk_mq_delay_run_hw_queue+0x150/0x170
[ 19.614080] blk_mq_run_hw_queue+0x83/0xe0
[ 19.615497] blk_mq_sched_insert_request+0x14a/0x1c0
[ 19.616949] blk_execute_rq+0x64/0x90
[ 19.618374] __scsi_execute+0x104/0x250
[ 19.619787] scsi_probe_and_add_lun+0x1b5/0xc00
[ 19.621226] __scsi_scan_target+0x1f9/0x530
[ 19.622666] scsi_scan_channel.part.10+0x51/0x70
[ 19.624117] scsi_scan_host_selected+0xc9/0x140
[ 19.625557] scsi_scan_host+0x18f/0x1d0
[ 19.626992] usb_stor_scan_dwork+0x1a/0x80
[ 19.628441] process_one_work+0x237/0x5d0
[ 19.629832] worker_thread+0x37/0x380
[ 19.631235] kthread+0x118/0x130
[ 19.632724] ret_from_fork+0x3a/0x50
[ 19.634252] }
[ 19.635499] ... key at: [<ffffffffa2f256a0>] __key.49100+0x0/0x8
[ 19.636925] ... acquired at:
[ 19.638255] blk_mq_dispatch_rq_list+0x270/0x550
[ 19.639655] blk_mq_do_dispatch_sched+0x71/0x110
[ 19.641076] blk_mq_sched_dispatch_requests+0x114/0x160
[ 19.642517] __blk_mq_run_hw_queue+0x67/0xf0
[ 19.643961] __blk_mq_delay_run_hw_queue+0x150/0x170
[ 19.645424] blk_mq_run_hw_queue+0x83/0xe0
[ 19.646831] blk_mq_sched_insert_request+0x14a/0x1c0
[ 19.648265] blk_execute_rq+0x64/0x90
[ 19.649634] __scsi_execute+0x104/0x250
[ 19.650983] scsi_probe_and_add_lun+0x1b5/0xc00
[ 19.652360] __scsi_scan_target+0x1f9/0x530
[ 19.653808] scsi_scan_channel.part.10+0x51/0x70
[ 19.655364] scsi_scan_host_selected+0xc9/0x140
[ 19.656751] scsi_scan_host+0x18f/0x1d0
[ 19.658086] usb_stor_scan_dwork+0x1a/0x80
[ 19.659416] process_one_work+0x237/0x5d0
[ 19.660750] worker_thread+0x37/0x380
[ 19.662057] kthread+0x118/0x130
[ 19.663353] ret_from_fork+0x3a/0x50
[ 19.664649]
[ 19.665860] -> (&sbq->ws[i].wait){..-.} ops: 3 {
[ 19.667186] IN-SOFTIRQ-W at:
[ 19.668474] _raw_spin_lock_irqsave+0x30/0x40
[ 19.669878] __wake_up_common_lock+0x5e/0xb0
[ 19.671278] __sbq_wake_up+0x8d/0xa0
[ 19.672650] sbitmap_queue_clear+0x3b/0x60
[ 19.674036] __blk_mq_free_request+0x64/0x90
[ 19.675438] scsi_end_request+0x109/0x350
[ 19.676911] scsi_io_completion+0x72/0x600
[ 19.678477] blk_done_softirq+0x9a/0xd0
[ 19.679847] __do_softirq+0xc8/0x455
[ 19.681212] irq_exit+0xcc/0xe0
[ 19.682543] call_function_single_interrupt+0xf/0x20
[ 19.683941] default_idle+0x19/0x160
[ 19.685311] do_idle+0x1d3/0x250
[ 19.686643] cpu_startup_entry+0x14/0x20
[ 19.688000] start_secondary+0x18f/0x1c0
[ 19.689443] secondary_startup_64+0xa4/0xb0
[ 19.690983] INITIAL USE at:
[ 19.693258] _raw_spin_lock_irq+0x2d/0x40
[ 19.696185] blk_mq_dispatch_rq_list+0x25c/0x550
[ 19.699217] blk_mq_do_dispatch_sched+0x71/0x110
[ 19.702076] blk_mq_sched_dispatch_requests+0x114/0x160
[ 19.704737] __blk_mq_run_hw_queue+0x67/0xf0
[ 19.707182] __blk_mq_delay_run_hw_queue+0x150/0x170
[ 19.709412] blk_mq_run_hw_queue+0x83/0xe0
[ 19.710880] blk_mq_sched_insert_request+0x14a/0x1c0
[ 19.712307] blk_execute_rq+0x64/0x90
[ 19.713697] __scsi_execute+0x104/0x250
[ 19.715083] scsi_probe_and_add_lun+0x1b5/0xc00
[ 19.716574] __scsi_scan_target+0x1f9/0x530
[ 19.718206] scsi_scan_channel.part.10+0x51/0x70
[ 19.719641] scsi_scan_host_selected+0xc9/0x140
[ 19.721184] scsi_scan_host+0x18f/0x1d0
[ 19.722697] usb_stor_scan_dwork+0x1a/0x80
[ 19.724077] process_one_work+0x237/0x5d0
[ 19.725460] worker_thread+0x37/0x380
[ 19.726820] kthread+0x118/0x130
[ 19.728167] ret_from_fork+0x3a/0x50
[ 19.729508] }
[ 19.730752] ... key at: [<ffffffffa2f27460>] __key.24146+0x0/0x8
[ 19.732169] ... acquired at:
[ 19.733509] __lock_acquire+0x8c6/0x1880
[ 19.734865] lock_acquire+0xaa/0x180
[ 19.736210] _raw_spin_lock_irqsave+0x30/0x40
[ 19.737564] __wake_up_common_lock+0x5e/0xb0
[ 19.738927] __sbq_wake_up+0x8d/0xa0
[ 19.740276] sbitmap_queue_clear+0x3b/0x60
[ 19.741635] __blk_mq_free_request+0x64/0x90
[ 19.743008] scsi_end_request+0x109/0x350
[ 19.744361] scsi_io_completion+0x72/0x600
[ 19.745690] blk_done_softirq+0x9a/0xd0
[ 19.747017] __do_softirq+0xc8/0x455
[ 19.748342] irq_exit+0xcc/0xe0
[ 19.749637] call_function_single_interrupt+0xf/0x20
[ 19.750979] default_idle+0x19/0x160
[ 19.752318] do_idle+0x1d3/0x250
[ 19.753758] cpu_startup_entry+0x14/0x20
[ 19.755129] start_secondary+0x18f/0x1c0
[ 19.756470] secondary_startup_64+0xa4/0xb0
[ 19.757954]
[ 19.759225]
[ 19.759225] stack backtrace:
[ 19.761899] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.20.0-rc5-next-20181207 #1
[ 19.763359] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 19.765024] Call Trace:
[ 19.766589] <IRQ>
[ 19.768074] dump_stack+0x67/0x90
[ 19.769580] check_usage_forwards.cold.54+0x20/0x29
[ 19.771137] mark_lock+0x2ca/0x620
[ 19.772705] ? check_usage_backwards+0x110/0x110
[ 19.774300] __lock_acquire+0x8c6/0x1880
[ 19.775676] ? select_task_rq_fair+0x1ca/0x1110
[ 19.777062] ? __lock_acquire+0x38a/0x1880
[ 19.778424] ? __lock_acquire+0x38a/0x1880
[ 19.779751] lock_acquire+0xaa/0x180
[ 19.781070] ? __wake_up_common_lock+0x5e/0xb0
[ 19.782422] _raw_spin_lock_irqsave+0x30/0x40
[ 19.783766] ? __wake_up_common_lock+0x5e/0xb0
[ 19.785129] __wake_up_common_lock+0x5e/0xb0
[ 19.786483] __sbq_wake_up+0x8d/0xa0
[ 19.787809] sbitmap_queue_clear+0x3b/0x60
[ 19.789143] __blk_mq_free_request+0x64/0x90
[ 19.790631] scsi_end_request+0x109/0x350
[ 19.792150] scsi_io_completion+0x72/0x600
[ 19.793671] blk_done_softirq+0x9a/0xd0
[ 19.795121] __do_softirq+0xc8/0x455
[ 19.796441] irq_exit+0xcc/0xe0
[ 19.797730] call_function_single_interrupt+0xf/0x20
[ 19.799153] </IRQ>
[ 19.800515] RIP: 0010:default_idle+0x19/0x160
[ 19.801932] Code: ff 48 89 ea 48 89 df 31 f6 5b 5d e9 31 73 91 ff 90 41 55 41 54 55 53 65 8b 2d a3 dd e4 5e 66 66 66 66 90 e8 b9 1f 59 ff fb f4 <65> 8b 2d 90 dd e4 5e 66 66 66 66 90 5b 5d 41 5c 41 5d c3 65 8b 05
[ 19.806839] RSP: 0000:ffffb6f74020bec0 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff04
[ 19.808531] RAX: ffff97563f2bc080 RBX: 0000000000000003 RCX: 0000000000000000
[ 19.810312] RDX: 0000000000000046 RSI: 0000000000000001 RDI: ffff97563f2bc080
[ 19.811925] RBP: 0000000000000003 R08: 0000000000000001 R09: 0000000000000000
[ 19.813555] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 19.815169] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 19.817002] do_idle+0x1d3/0x250
[ 19.818604] cpu_startup_entry+0x14/0x20
[ 19.820125] start_secondary+0x18f/0x1c0
[ 19.821635] secondary_startup_64+0xa4/0xb0
[ 19.856824] EXT4-fs (sda): mounted filesystem without journal. Opts: (null)
[ 19.860450] VFS: Mounted root (ext2 filesystem) on device 8:0.
[ 19.870456] devtmpfs: mounted
[ 19.975778] Freeing unused kernel image memory: 1348K
[ 19.987597] Write protecting the kernel read-only data: 22528k
[ 19.999208] Freeing unused kernel image memory: 2008K
[ 20.007277] Freeing unused kernel image memory: 1824K
[ 20.010030] rodata_test: all tests were successful
[ 20.012948] Run /sbin/init as init process
[ 20.640528] EXT4-fs (sda): re-mounted. Opts: block_validity,barrier,user_xattr,acl
[ 21.570798] random: dd: uninitialized urandom read (512 bytes read)
[ 21.621870] dd (1296) used greatest stack depth: 12728 bytes left
---
# bad: [74c4a24df7cac1f9213a811d79558ecde23be9a2] Add linux-next specific files for 20181207
# good: [2595646791c319cadfdbf271563aac97d0843dc7] Linux 4.20-rc5
git bisect start 'HEAD' 'v4.20-rc5'
# good: [8d285caa38663cc22f7166f7cfb9ceb9d06d3c7e] Merge remote-tracking branch 'ipsec-next/master'
git bisect good 8d285caa38663cc22f7166f7cfb9ceb9d06d3c7e
# bad: [a8147fd1777cecc0a0cc57fcd742a2b06660bd12] Merge remote-tracking branch 'backlight/for-backlight-next'
git bisect bad a8147fd1777cecc0a0cc57fcd742a2b06660bd12
# good: [467e8a516dcf922d1ea343cebb0e751f81f0dca3] Merge tag 'drm-intel-next-2018-12-04' of git://anongit.freedesktop.org/drm/drm-intel into drm-next
git bisect good 467e8a516dcf922d1ea343cebb0e751f81f0dca3
# good: [25f68bda094fd6a808d1284abc23b194b0668a47] Merge remote-tracking branch 'sound/for-next'
git bisect good 25f68bda094fd6a808d1284abc23b194b0668a47
# bad: [81b6aa39f339c7ea9fe318580c168e55c5a7f593] Merge branch 'for-4.21/block' into for-next
git bisect bad 81b6aa39f339c7ea9fe318580c168e55c5a7f593
# good: [b6676f653f13f83582985bc713525a48d735b2a3] block: remove a few unused exports
git bisect good b6676f653f13f83582985bc713525a48d735b2a3
# good: [80ff2040ac3dd6d9d68d59159b0a6dffd8adc5ff] ataflop: implement mq_ops->commit_rqs() hook
git bisect good 80ff2040ac3dd6d9d68d59159b0a6dffd8adc5ff
# bad: [79eb7650f7f33f4e63a2380f00fc2fac651da7c1] Merge branch 'for-4.21/block' into for-next
git bisect bad 79eb7650f7f33f4e63a2380f00fc2fac651da7c1
# bad: [6322307809649cba6f545640563f95d686ecf404] nvme-pci: cleanup SQ allocation a bit
git bisect bad 6322307809649cba6f545640563f95d686ecf404
# bad: [2149da0748fc236b9916c53e26b3b0c9ab20a5dd] block: add cmd_flags to print_req_error
git bisect bad 2149da0748fc236b9916c53e26b3b0c9ab20a5dd
# good: [27fae429acee1e9418059e7fa545438075af5256] sbitmap: don't loop for find_next_zero_bit() for !round_robin
git bisect good 27fae429acee1e9418059e7fa545438075af5256
# bad: [ea86ea2cdced20057da4d2c32965c1219c238197] sbitmap: ammortize cost of clearing bits
git bisect bad ea86ea2cdced20057da4d2c32965c1219c238197
# good: [531724abc3bfb556c1dd68086cf9cb51f76464e3] block: avoid extra bio reference for async O_DIRECT
git bisect good 531724abc3bfb556c1dd68086cf9cb51f76464e3
# first bad commit: [ea86ea2cdced20057da4d2c32965c1219c238197] sbitmap: ammortize cost of clearing bits
* Re: [PATCH 2/3] sbitmap: ammortize cost of clearing bits
2018-12-09 5:51 ` Guenter Roeck
@ 2018-12-09 6:19 ` Jens Axboe
0 siblings, 0 replies; 6+ messages in thread
From: Jens Axboe @ 2018-12-09 6:19 UTC (permalink / raw)
To: Guenter Roeck; +Cc: linux-block, osandov, linux-kernel
On Dec 8, 2018, at 10:51 PM, Guenter Roeck <linux@roeck-us.net> wrote:
>
> Hi,
>
>> On Thu, Nov 29, 2018 at 06:12:33PM -0700, Jens Axboe wrote:
>> sbitmap maintains a set of words that we use to set and clear bits, with
>> each bit representing a tag for blk-mq. Even though we spread the bits
>> out and maintain a hint cache, one particular bit allocated will end up
>> being cleared in the exact same spot.
>>
>> This introduces batched clearing of bits. Instead of clearing a given
>> bit, the same bit is set in a cleared/free mask instead. If we fail
>> allocating a bit from a given word, then we check the free mask, and
>> batch move those cleared bits at that time. This trades 64 atomic bitops
>> for 2 cmpxchg().
>>
>> In a threaded poll test case, half the overhead of getting and clearing
>> tags is removed with this change. On another poll test case with a
>> single thread, performance is unchanged.
>>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>
> This patch results in irq lock inversion warnings when trying to boot
> from usb drives. This was observed with qemu boots of aarch64, x86, and
> x86_64 images.
This one is a false positive, was already reported last week. Just need to add some lockdep annotation for it.