[bug report] worker watchdog timeout in dispatch loop for null_blk

* [bug report] worker watchdog timeout in dispatch loop for null_blk
@ 2022-03-10  9:16 Shinichiro Kawasaki
  2022-03-10 10:00 ` Ming Lei
  0 siblings, 1 reply; 11+ messages in thread
From: Shinichiro Kawasaki @ 2022-03-10  9:16 UTC (permalink / raw)
  To: linux-block; +Cc: Jens Axboe, Damien Le Moal

This issue does not look critical, but let me share it to ask comments for fix.

When fio command with 40 jobs [1] is run for a null_blk device with memory
backing and mq-deadline scheduler, kernel reports a BUG message [2]. The
workqueue watchdog reports that kblockd blk_mq_run_work_fn keeps on running
more than 30 seconds and other work can not run. The 40 fio jobs keep on
creating many read requests to a single null_blk device, then the every time
the mq_run task calls __blk_mq_do_dispatch_sched(), it returns ret == 1 which
means more than one request was dispatched. Hence, the while loop in
blk_mq_do_dispatch_sched() does not break.

static int blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
{
        int ret;

        do {
               ret = __blk_mq_do_dispatch_sched(hctx);
        } while (ret == 1);

        return ret;
}

The BUG message was observed when I ran blktests block/005 with various
conditions on a system with 40 CPUs. It was observed with kernel version
v5.16-rc1 through v5.17-rc7. The trigger commit was 0a593fbbc245 ("null_blk:
poll queue support"). This commit added blk_mq_ops.map_queues callback. I
guess it changed dispatch behavior for null_blk devices and triggered the
BUG message.

I'm not so sure if we really need to fix this issue. It does not seem the real
world problem since it is observed only with null_blk. The real block devices
have slower IO operation then the dispatch should stop sooner when the hardware
queue gets full. Also the 40 jobs for single device is not realistic workload.

Having said that, it does not feel right that other works are pended during
dispatch for null_blk devices. To avoid the BUG message, I can think of two
fix approaches. First one is to break the while loop in blk_mq_do_dispatch_sched
using a loop counter [3] (or jiffies timeout check). But it may not be good to
add such a check in the loop, assuming it is the hot path. The other fix idea is
to count number of requests within null_blk so that null_queue_rq() sometimes
returns BLK_STS_RESOURCE [4]. It will make the request requeued and the
dispatch loop have a break. This approach looks better for me since it touches
only null_blk.

Comments on solution will be appreciated. Is there better fix idea?  Now I'm
tempted to the fix in null_blk.


[1]

# fio --bs=4k --rw=randread --norandommap --numjobs=40 --name=reads --direct=1 \
    --filename=/dev/nullb0 --size=1g --group_reporting

[2]

[  609.691437] BUG: workqueue lockup - pool cpus=10 node=1 flags=0x0 nice=-20 stuck for 35s!
[  609.701820] Showing busy workqueues and worker pools:
[  609.707915] workqueue events: flags=0x0
[  609.712615]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[  609.712626]     pending: drm_fb_helper_damage_work [drm_kms_helper]
[  609.712687] workqueue events_freezable: flags=0x4
[  609.732943]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[  609.732952]     pending: pci_pme_list_scan
[  609.732968] workqueue events_power_efficient: flags=0x80
[  609.751947]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[  609.751955]     pending: neigh_managed_work
[  609.752018] workqueue kblockd: flags=0x18
[  609.769480]   pwq 21: cpus=10 node=1 flags=0x0 nice=-20 active=3/256 refcnt=4
[  609.769488]     in-flight: 1020:blk_mq_run_work_fn
[  609.769498]     pending: blk_mq_timeout_work, blk_mq_run_work_fn
[  609.769744] pool 21: cpus=10 node=1 flags=0x0 nice=-20 hung=35s workers=2 idle: 67
[  639.899730] BUG: workqueue lockup - pool cpus=10 node=1 flags=0x0 nice=-20 stuck for 66s!
[  639.909513] Showing busy workqueues and worker pools:
[  639.915404] workqueue events: flags=0x0
[  639.920197]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[  639.920215]     pending: drm_fb_helper_damage_work [drm_kms_helper]
[  639.920365] workqueue kblockd: flags=0x18
[  639.939932]   pwq 21: cpus=10 node=1 flags=0x0 nice=-20 active=3/256 refcnt=4
[  639.939942]     in-flight: 1020:blk_mq_run_work_fn
[  639.939955]     pending: blk_mq_timeout_work, blk_mq_run_work_fn
[  639.940212] pool 21: cpus=10 node=1 flags=0x0 nice=-20 hung=66s workers=2 idle: 67

[3]

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 55488ba978232..746c75cb6aafb 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -178,13 +178,16 @@ static int __blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
 	return !!dispatched;
 }
 
+#define DISPATCH_BREAK_COUNT 0x10000
+
 static int blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
 {
 	int ret;
+	unsigned int count = DISPATCH_BREAK_COUNT;
 
 	do {
 		ret = __blk_mq_do_dispatch_sched(hctx);
-	} while (ret == 1);
+	} while (ret == 1 && count--);
 
 	return ret;
 }


[4]

diff --git a/drivers/block/null_blk/main.c b/drivers/block/null_blk/main.c
index 13004beb48cab..3ef3a18fa8f3f 100644
--- a/drivers/block/null_blk/main.c
+++ b/drivers/block/null_blk/main.c
@@ -22,6 +22,8 @@ static DECLARE_FAULT_ATTR(null_requeue_attr);
 static DECLARE_FAULT_ATTR(null_init_hctx_attr);
 #endif
 
+#define REQUEST_BREAK_COUNT 0x10000
+
 static inline u64 mb_per_tick(int mbps)
 {
 	return (1 << 20) / TICKS_PER_SEC * ((u64) mbps);
@@ -1493,13 +1495,13 @@ static bool should_timeout_request(struct request *rq)
 	return false;
 }
 
-static bool should_requeue_request(struct request *rq)
+static bool should_requeue_request(struct request *rq, struct nullb_queue *nq)
 {
 #ifdef CONFIG_BLK_DEV_NULL_BLK_FAULT_INJECTION
 	if (g_requeue_str[0])
 		return should_fail(&null_requeue_attr, 1);
 #endif
-	return false;
+	return !(nq->request_count % REQUEST_BREAK_COUNT);
 }
 
 static int null_map_queues(struct blk_mq_tag_set *set)
@@ -1632,8 +1634,9 @@ static blk_status_t null_queue_rq(struct blk_mq_hw_ctx *hctx,
 	cmd->fake_timeout = should_timeout_request(bd->rq);
 
 	blk_mq_start_request(bd->rq);
+	nq->request_count++;
 
-	if (should_requeue_request(bd->rq)) {
+	if (should_requeue_request(bd->rq, nq)) {
 		/*
 		 * Alternate between hitting the core BUSY path, and the
 		 * driver driven requeue path
@@ -1690,6 +1693,7 @@ static void null_init_queue(struct nullb *nullb, struct nullb_queue *nq)
 	nq->dev = nullb->dev;
 	INIT_LIST_HEAD(&nq->poll_list);
 	spin_lock_init(&nq->poll_lock);
+	nq->request_count = 0;
 }
 
 static int null_init_hctx(struct blk_mq_hw_ctx *hctx, void *driver_data,
diff --git a/drivers/block/null_blk/null_blk.h b/drivers/block/null_blk/null_blk.h
index 78eb56b0ca55f..165669ce97ad3 100644
--- a/drivers/block/null_blk/null_blk.h
+++ b/drivers/block/null_blk/null_blk.h
@@ -36,6 +36,7 @@ struct nullb_queue {
 	spinlock_t poll_lock;
 
 	struct nullb_cmd *cmds;
+	unsigned int request_count;
 };
 
 struct nullb_zone {

-- 
Best Regards,
Shin'ichiro Kawasaki

^ permalink raw reply related	[flat|nested] 11+ messages in thread