* [PATCH v2 0/2] blk-mq: Fix two causes of IO stalls found in reboot testing
@ 2020-04-02 15:51 Douglas Anderson
  2020-04-02 15:51 ` [PATCH v2 1/2] blk-mq: In blk_mq_dispatch_rq_list() "no budget" is a reason to kick Douglas Anderson
  2020-04-02 15:51 ` [PATCH v2 2/2] blk-mq: Rerun dispatching in the case of budget contention Douglas Anderson
  0 siblings, 2 replies; 15+ messages in thread
From: Douglas Anderson @ 2020-04-02 15:51 UTC (permalink / raw)
To: axboe, jejb, martin.petersen
Cc: paolo.valente, sqazi, linux-block, linux-scsi, Ming Lei, groeck,
	Douglas Anderson, Ajay Joshi, Arnd Bergmann, Bart Van Assche,
	Chaitanya Kulkarni, Damien Le Moal, Hou Tao, Pavel Begunkov,
	Tejun Heo, linux-kernel

While doing reboot testing, I found that occasionally my device would
trigger the hung task detector.  Many tasks were stuck waiting for a
blkdev mutex, but at least one task in the system was always sitting
waiting for IO to complete (and holding the blkdev mutex).  One example
of a task that was just waiting for its IO to complete on one reboot:

 udevd           D    0  2177    306 0x00400209
 Call trace:
  __switch_to+0x15c/0x17c
  __schedule+0x6e0/0x928
  schedule+0x8c/0xbc
  schedule_timeout+0x9c/0xfc
  io_schedule_timeout+0x24/0x48
  do_wait_for_common+0xd0/0x160
  wait_for_completion_io_timeout+0x54/0x74
  blk_execute_rq+0x9c/0xd8
  __scsi_execute+0x104/0x198
  scsi_test_unit_ready+0xa0/0x154
  sd_check_events+0xb4/0x164
  disk_check_events+0x58/0x154
  disk_clear_events+0x74/0x110
  check_disk_change+0x28/0x6c
  sd_open+0x5c/0x130
  __blkdev_get+0x20c/0x3d4
  blkdev_get+0x74/0x170
  blkdev_open+0x94/0xa8
  do_dentry_open+0x268/0x3a0
  vfs_open+0x34/0x40
  path_openat+0x39c/0xdf4
  do_filp_open+0x90/0x10c
  do_sys_open+0x150/0x3c8
  ...

I've reproduced this on two systems: one boots from an internal UFS
disk and one from eMMC.  Each has a card reader attached via USB with
an SD card plugged in.  On the USB-attached SD card is a disk with 12
partitions (a Chrome OS test image), if it matters.
The system doesn't do much with the USB disk other than probe it (it's
plugged into my system to help me recover).

From digging, I believe that there are two separate but related issues.
Both issues relate to the SCSI code saying that there is no budget.  I
have done testing with only one or the other of the two patches in this
series and found that I could still encounter hung tasks if only one of
the two patches was applied.

This deserves a bit of explanation.  To me, it's fairly obvious that
the first fix wouldn't fix the problems talked about in the second
patch.  However, it's less obvious why the second patch doesn't fix the
problems in blk_mq_dispatch_rq_list().  It turns out that it _almost_
does (the problems become much rarer), but I did manage to get a single
trace where the "kick" scheduled by the second patch happened really
quickly.  The scheduled kick then ran and found nothing to do.  This
happened in parallel with a task running in blk_mq_dispatch_rq_list()
which hadn't yet gotten around to splicing the list back into
hctx->dispatch.  This is why we need both fixes, or a heavier hammer
where we always kick whenever two threads request budget at the same
time.

Most of my testing has been atop Chrome OS 5.4's kernel tree, which
currently has v5.4.28 merged in.  The Chrome OS 5.4 tree also has a
patch by Salman Qazi, namely ("block: Limit number of items taken from
the I/O scheduler in one go").  Reverting that patch didn't make the
hung tasks go away, so I kept it in for most of my testing.  I have
also done some testing on mainline Linux (git describe says I'm on
v5.6-rc7-227-gf3e69428b5e2), even without Salman's patch.  I found that
I could reproduce the problems there and that traces looked about the
same as I saw on the downstream branch.  These patches were also
confirmed to fix the problems on mainline.

Chrome OS is currently set up to use the BFQ scheduler and I found that
I couldn't reproduce the problems without BFQ.
As discussed in the second patch, this is believed to be because BFQ
sometimes returns "true" from has_work() but then NULL from
dispatch_request().

I'll insert my usual caveat that I'm sending patches to code that I
know very little about.  If I'm making a total bozo patch here, please
help me figure out how I should fix the problems I found in a better
way.

If you want to see a totally ridiculous amount of chatter where I
stumbled around a whole bunch trying to figure out what was wrong and
how to fix it, feel free to read <https://crbug.com/1061950>.  I
promise it will make your eyes glaze over right away if this cover
letter didn't already do that.  Specifically comment 79 in that bug
includes a link to my ugly prototype of making BFQ's has_work() more
exact (I only managed it by actually defining _both_ an exact and
inexact function to avoid circular locking problems when it was called
directly from blk_mq_hctx_has_pending()).  Comment 79 also has more
thoughts about alternatives considered.

I don't know whether the problems these patches fix are a regression of
some sort or have always been there.  As per above, I could only
reproduce with BFQ enabled, which makes it nearly impossible to go too
far back with this.  I haven't listed any "Fixes" tags here, but if
someone felt it was appropriate to backport this to some stable trees
that seems like it'd be nice.  Presumably at least 5.4 stable would
make sense.

Thanks to Salman Qazi, Paolo Valente, and Guenter Roeck who spent a
bunch of time helping me trawl through some of this code and reviewing
early versions of this patch.
Changes in v2:
- Replace ("scsi: core: Fix stall...") w/ ("blk-mq: Rerun dispatch...")

Douglas Anderson (2):
  blk-mq: In blk_mq_dispatch_rq_list() "no budget" is a reason to kick
  blk-mq: Rerun dispatching in the case of budget contention

 block/blk-mq-sched.c   | 26 ++++++++++++++++++++++++--
 block/blk-mq.c         | 14 +++++++++++---
 include/linux/blkdev.h |  2 ++
 3 files changed, 37 insertions(+), 5 deletions(-)

-- 
2.26.0.rc2.310.g2932bb562d-goog

^ permalink raw reply	[flat|nested] 15+ messages in thread
* [PATCH v2 1/2] blk-mq: In blk_mq_dispatch_rq_list() "no budget" is a reason to kick
  2020-04-02 15:51 [PATCH v2 0/2] blk-mq: Fix two causes of IO stalls found in reboot testing Douglas Anderson
@ 2020-04-02 15:51 ` Douglas Anderson
  2020-04-03  1:55   ` Ming Lei
  2020-04-02 15:51 ` [PATCH v2 2/2] blk-mq: Rerun dispatching in the case of budget contention Douglas Anderson
  1 sibling, 1 reply; 15+ messages in thread
From: Douglas Anderson @ 2020-04-02 15:51 UTC (permalink / raw)
To: axboe, jejb, martin.petersen
Cc: paolo.valente, sqazi, linux-block, linux-scsi, Ming Lei, groeck,
	Douglas Anderson, linux-kernel

In blk_mq_dispatch_rq_list(), if blk_mq_sched_needs_restart() returns
true and the driver returns BLK_STS_RESOURCE then we'll kick the
queue.  However, there's another case where we might need to kick it.
If we were unable to get budget we can be in much the same state as
when the driver returns BLK_STS_RESOURCE, so we should treat it the
same.

Signed-off-by: Douglas Anderson <dianders@chromium.org>
---

Changes in v2: None

 block/blk-mq.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index d92088dec6c3..2cd8d2b49ff4 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1189,6 +1189,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 	bool no_tag = false;
 	int errors, queued;
 	blk_status_t ret = BLK_STS_OK;
+	bool no_budget_avail = false;
 
 	if (list_empty(list))
 		return false;
@@ -1205,8 +1206,10 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 		rq = list_first_entry(list, struct request, queuelist);
 		hctx = rq->mq_hctx;
-		if (!got_budget && !blk_mq_get_dispatch_budget(hctx))
+		if (!got_budget && !blk_mq_get_dispatch_budget(hctx)) {
+			no_budget_avail = true;
 			break;
+		}
 
 		if (!blk_mq_get_driver_tag(rq)) {
 			/*
@@ -1311,13 +1314,15 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 		 *
 		 * If driver returns BLK_STS_RESOURCE and SCHED_RESTART
 		 * bit is set, run queue after a delay to avoid IO stalls
-		 * that could otherwise occur if the queue is idle.
+		 * that could otherwise occur if the queue is idle.  We'll do
+		 * similar if we couldn't get budget and SCHED_RESTART is set.
 		 */
 		needs_restart = blk_mq_sched_needs_restart(hctx);
 		if (!needs_restart ||
 		    (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
			blk_mq_run_hw_queue(hctx, true);
-		else if (needs_restart && (ret == BLK_STS_RESOURCE))
+		else if (needs_restart && (ret == BLK_STS_RESOURCE ||
+					   no_budget_avail))
			blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY);
 
		blk_mq_update_dispatch_busy(hctx, true);
-- 
2.26.0.rc2.310.g2932bb562d-goog

^ permalink raw reply related	[flat|nested] 15+ messages in thread
* Re: [PATCH v2 1/2] blk-mq: In blk_mq_dispatch_rq_list() "no budget" is a reason to kick
  2020-04-02 15:51 ` [PATCH v2 1/2] blk-mq: In blk_mq_dispatch_rq_list() "no budget" is a reason to kick Douglas Anderson
@ 2020-04-03  1:55   ` Ming Lei
  0 siblings, 0 replies; 15+ messages in thread
From: Ming Lei @ 2020-04-03 1:55 UTC (permalink / raw)
To: Douglas Anderson
Cc: axboe, jejb, martin.petersen, paolo.valente, sqazi, linux-block,
	linux-scsi, groeck, linux-kernel

On Thu, Apr 02, 2020 at 08:51:29AM -0700, Douglas Anderson wrote:
> In blk_mq_dispatch_rq_list(), if blk_mq_sched_needs_restart() returns
> true and the driver returns BLK_STS_RESOURCE then we'll kick the
> queue.  However, there's another case where we might need to kick it.
> If we were unable to get budget we can be in much the same state as
> when the driver returns BLK_STS_RESOURCE, so we should treat it the
> same.

The situation is really similar; in particular, the restart handler may
not see the request, since it may not have been added to hctx->dispatch
yet, so

Reviewed-by: Ming Lei <ming.lei@redhat.com>

Thanks,
Ming

^ permalink raw reply	[flat|nested] 15+ messages in thread
* [PATCH v2 2/2] blk-mq: Rerun dispatching in the case of budget contention
  2020-04-02 15:51 [PATCH v2 0/2] blk-mq: Fix two causes of IO stalls found in reboot testing Douglas Anderson
  2020-04-02 15:51 ` [PATCH v2 1/2] blk-mq: In blk_mq_dispatch_rq_list() "no budget" is a reason to kick Douglas Anderson
@ 2020-04-02 15:51 ` Douglas Anderson
  2020-04-03  1:33   ` Ming Lei
  1 sibling, 1 reply; 15+ messages in thread
From: Douglas Anderson @ 2020-04-02 15:51 UTC (permalink / raw)
To: axboe, jejb, martin.petersen
Cc: paolo.valente, sqazi, linux-block, linux-scsi, Ming Lei, groeck,
	Douglas Anderson, Ajay Joshi, Arnd Bergmann, Bart Van Assche,
	Chaitanya Kulkarni, Damien Le Moal, Hou Tao, Pavel Begunkov,
	Tejun Heo, linux-kernel

It is possible for two threads to be running
blk_mq_do_dispatch_sched() at the same time with the same "hctx".
This is because there can be more than one caller to
__blk_mq_run_hw_queue() with the same "hctx" and hctx_lock() doesn't
prevent more than one thread from entering.

If more than one thread is running blk_mq_do_dispatch_sched() at the
same time with the same "hctx", they may have contention acquiring
budget.  The blk_mq_get_dispatch_budget() can eventually translate
into scsi_mq_get_budget().  If the device's "queue_depth" is 1 (not
uncommon) then only one of the two threads will be the one to
increment "device_busy" to 1 and get the budget.

The losing thread will break out of blk_mq_do_dispatch_sched() and
will stop dispatching requests.  The assumption is that when more
budget is available later (when existing transactions finish) the
queue will be kicked again, perhaps in scsi_end_request().

The winning thread now has budget and can go on to call
dispatch_request().  If dispatch_request() returns NULL here then we
have a potential problem.  Specifically we'll now call
blk_mq_put_dispatch_budget() which translates into
scsi_mq_put_budget().  That will mark the device as no longer busy but
doesn't do anything to kick the queue.  This violates the assumption
that the queue would be kicked when more budget was available.

Pictorially:

Thread A                               Thread B
=================================      ==================================
blk_mq_get_dispatch_budget() => 1
dispatch_request() => NULL
                                       blk_mq_get_dispatch_budget() => 0
                                       // because Thread A marked
                                       // "device_busy" in scsi_device
blk_mq_put_dispatch_budget()

The above case was observed in reboot tests and caused a task to hang
forever waiting for IO to complete.  Traces showed that in fact two
tasks were running blk_mq_do_dispatch_sched() at the same time with
the same "hctx".  The task that got the budget did in fact see
dispatch_request() return NULL.  Both tasks returned and the system
went on for several minutes (until the hung task delay kicked in)
without the given "hctx" showing up again in traces.

Let's attempt to fix this problem by detecting if there was contention
for the budget in the case where we put the budget without dispatching
anything.  If we saw contention we kick all hctx's associated with the
queue where there was contention.  We do this without any locking by
adding a double-check for budget and accepting a small amount of faux
contention if the 2nd check gives us budget but then we don't dispatch
anything (we'll look like we contended with ourselves).

A few extra notes:

- This whole thing is only a problem due to the inexact nature of
  has_work().  Specifically if has_work() always guaranteed that a
  "true" return meant that dispatch_request() would return non-NULL
  then we wouldn't have this problem.  That's because we only grab the
  budget if has_work() returned true.  If we had the non-NULL
  guarantee then at least one of the threads would actually dispatch
  work and when the work was done then queues would be kicked
  normally.

- One specific I/O scheduler that trips this problem quite a bit is
  BFQ which definitely returns "true" for has_work() in cases where it
  wouldn't dispatch.  Making BFQ's has_work() more exact requires that
  has_work() becomes a much heavier function, including figuring out
  how to acquire spinlocks in has_work() without tripping circular
  lock dependencies.  This is prototyped but it's unclear if it's
  really the way to go when the problem can be solved with a
  relatively lightweight contention detection mechanism.

- Because this problem only trips with inexact has_work() it's
  believed that blk_mq_do_dispatch_ctx() does not need a similar
  change.

Signed-off-by: Douglas Anderson <dianders@chromium.org>
---

Changes in v2:
- Replace ("scsi: core: Fix stall...") w/ ("blk-mq: Rerun dispatch...")

 block/blk-mq-sched.c   | 26 ++++++++++++++++++++++++--
 block/blk-mq.c         |  3 +++
 include/linux/blkdev.h |  2 ++
 3 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 74cedea56034..0195d75f5f96 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -97,12 +97,34 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
 		if (e->type->ops.has_work && !e->type->ops.has_work(hctx))
 			break;
-		if (!blk_mq_get_dispatch_budget(hctx))
-			break;
+
+		if (!blk_mq_get_dispatch_budget(hctx)) {
+			/*
+			 * We didn't get budget so set contention.  If
+			 * someone else had the budget but didn't dispatch
+			 * they'll kick everything.  NOTE: we check one
+			 * extra time _after_ setting contention to fully
+			 * close the race.  If we don't actually dispatch
+			 * we'll detect faux contention (with ourselves)
+			 * but that should be rare.
+			 */
+			atomic_set(&q->budget_contention, 1);
+
+			if (!blk_mq_get_dispatch_budget(hctx))
+				break;
+		}
 
 		rq = e->type->ops.dispatch_request(hctx);
 		if (!rq) {
 			blk_mq_put_dispatch_budget(hctx);
+
+			/*
+			 * We've released the budget but us holding it might
+			 * have prevented someone else from dispatching.
+			 * Detect that case and run all queues again.
+			 */
+			if (atomic_read(&q->budget_contention))
+				blk_mq_run_hw_queues(q, true);
 			break;
 		}
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 2cd8d2b49ff4..6163c43ceca5 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1528,6 +1528,9 @@ void blk_mq_run_hw_queues(struct request_queue *q, bool async)
 	struct blk_mq_hw_ctx *hctx;
 	int i;
 
+	/* We're running the queues, so clear the contention detector */
+	atomic_set(&q->budget_contention, 0);
+
 	queue_for_each_hw_ctx(q, hctx, i) {
 		if (blk_mq_hctx_stopped(hctx))
 			continue;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index f629d40c645c..07f21e45d993 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -583,6 +583,8 @@ struct request_queue {
 
 #define BLK_MAX_WRITE_HINTS	5
 	u64 write_hints[BLK_MAX_WRITE_HINTS];
+
+	atomic_t budget_contention;
 };
 
 #define QUEUE_FLAG_STOPPED	0	/* queue is stopped */
-- 
2.26.0.rc2.310.g2932bb562d-goog

^ permalink raw reply related	[flat|nested] 15+ messages in thread
* Re: [PATCH v2 2/2] blk-mq: Rerun dispatching in the case of budget contention
  2020-04-02 15:51 ` [PATCH v2 2/2] blk-mq: Rerun dispatching in the case of budget contention Douglas Anderson
@ 2020-04-03  1:33   ` Ming Lei
  2020-04-03 15:10     ` Doug Anderson
  0 siblings, 1 reply; 15+ messages in thread
From: Ming Lei @ 2020-04-03 1:33 UTC (permalink / raw)
To: Douglas Anderson
Cc: axboe, jejb, martin.petersen, paolo.valente, sqazi, linux-block,
	linux-scsi, groeck, Ajay Joshi, Arnd Bergmann, Bart Van Assche,
	Chaitanya Kulkarni, Damien Le Moal, Hou Tao, Pavel Begunkov,
	Tejun Heo, linux-kernel

On Thu, Apr 02, 2020 at 08:51:30AM -0700, Douglas Anderson wrote:
> It is possible for two threads to be running
> blk_mq_do_dispatch_sched() at the same time with the same "hctx".
> This is because there can be more than one caller to
> __blk_mq_run_hw_queue() with the same "hctx" and hctx_lock() doesn't
> prevent more than one thread from entering.
>
> If more than one thread is running blk_mq_do_dispatch_sched() at the
> same time with the same "hctx", they may have contention acquiring
> budget.  The blk_mq_get_dispatch_budget() can eventually translate
> into scsi_mq_get_budget().  If the device's "queue_depth" is 1 (not
> uncommon) then only one of the two threads will be the one to
> increment "device_busy" to 1 and get the budget.
>
> The losing thread will break out of blk_mq_do_dispatch_sched() and
> will stop dispatching requests.  The assumption is that when more
> budget is available later (when existing transactions finish) the
> queue will be kicked again, perhaps in scsi_end_request().
>
> The winning thread now has budget and can go on to call
> dispatch_request().  If dispatch_request() returns NULL here then we
> have a potential problem.  Specifically we'll now call

As I mentioned before, it is a BFQ-specific issue: BFQ tells blk-mq
that it has work to do, and even though the budget is assigned to the
only winning thread, dispatch_request() still returns NULL.

> blk_mq_put_dispatch_budget() which translates into
> scsi_mq_put_budget().  That will mark the device as no longer busy but
> doesn't do anything to kick the queue.  This violates the assumption
> that the queue would be kicked when more budget was available.

The queue is still kicked by BFQ in its idle timer; however, that timer
doesn't make forward progress.  Without this idle timer, it is simply a
BFQ issue.

>
> Pictorially:
>
> Thread A                               Thread B
> =================================      ==================================
> blk_mq_get_dispatch_budget() => 1
> dispatch_request() => NULL
>                                        blk_mq_get_dispatch_budget() => 0
>                                        // because Thread A marked
>                                        // "device_busy" in scsi_device
> blk_mq_put_dispatch_budget()

What if there is only thread A?  You need to mention that thread B is
from BFQ.

>
> The above case was observed in reboot tests and caused a task to hang
> forever waiting for IO to complete.  Traces showed that in fact two
> tasks were running blk_mq_do_dispatch_sched() at the same time with
> the same "hctx".  The task that got the budget did in fact see
> dispatch_request() return NULL.  Both tasks returned and the system
> went on for several minutes (until the hung task delay kicked in)
> without the given "hctx" showing up again in traces.
>
> Let's attempt to fix this problem by detecting if there was contention
> for the budget in the case where we put the budget without dispatching
> anything.  If we saw contention we kick all hctx's associated with the
> queue where there was contention.  We do this without any locking by
> adding a double-check for budget and accepting a small amount of faux
> contention if the 2nd check gives us budget but then we don't dispatch
> anything (we'll look like we contended with ourselves).
>
> A few extra notes:
>
> - This whole thing is only a problem due to the inexact nature of
>   has_work().  Specifically if has_work() always guaranteed that a
>   "true" return meant that dispatch_request() would return non-NULL
>   then we wouldn't have this problem.  That's because we only grab the
>   budget if has_work() returned true.  If we had the non-NULL
>   guarantee then at least one of the threads would actually dispatch
>   work and when the work was done then queues would be kicked
>   normally.
>
> - One specific I/O scheduler that trips this problem quite a bit is
>   BFQ which definitely returns "true" for has_work() in cases where it
>   wouldn't dispatch.  Making BFQ's has_work() more exact requires that
>   has_work() becomes a much heavier function, including figuring out
>   how to acquire spinlocks in has_work() without tripping circular
>   lock dependencies.  This is prototyped but it's unclear if it's
>   really the way to go when the problem can be solved with a
>   relatively lightweight contention detection mechanism.
>
> - Because this problem only trips with inexact has_work() it's
>   believed that blk_mq_do_dispatch_ctx() does not need a similar
>   change.

Right, I prefer to fix it in BFQ, given it isn't a generic issue; it
isn't worth a generic solution.

>
> Signed-off-by: Douglas Anderson <dianders@chromium.org>
> ---
>
> Changes in v2:
> - Replace ("scsi: core: Fix stall...") w/ ("blk-mq: Rerun dispatch...")
>
>  block/blk-mq-sched.c   | 26 ++++++++++++++++++++++++--
>  block/blk-mq.c         |  3 +++
>  include/linux/blkdev.h |  2 ++
>  3 files changed, 29 insertions(+), 2 deletions(-)
>
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index 74cedea56034..0195d75f5f96 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -97,12 +97,34 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
>  		if (e->type->ops.has_work && !e->type->ops.has_work(hctx))
>  			break;
> -		if (!blk_mq_get_dispatch_budget(hctx))
> -			break;
> +
> +		if (!blk_mq_get_dispatch_budget(hctx)) {
> +			/*
> +			 * We didn't get budget so set contention.  If
> +			 * someone else had the budget but didn't dispatch
> +			 * they'll kick everything.  NOTE: we check one
> +			 * extra time _after_ setting contention to fully
> +			 * close the race.  If we don't actually dispatch
> +			 * we'll detect faux contention (with ourselves)
> +			 * but that should be rare.
> +			 */
> +			atomic_set(&q->budget_contention, 1);
> +
> +			if (!blk_mq_get_dispatch_budget(hctx))

scsi_mq_get_budget() implies a smp_mb(), so the barrier can order
between blk_mq_get_dispatch_budget() and atomic_set(&q->budget_contention, 0|1).

> +				break;
> +		}
>
>  		rq = e->type->ops.dispatch_request(hctx);
>  		if (!rq) {
>  			blk_mq_put_dispatch_budget(hctx);
> +
> +			/*
> +			 * We've released the budget but us holding it might
> +			 * have prevented someone else from dispatching.
> +			 * Detect that case and run all queues again.
> +			 */
> +			if (atomic_read(&q->budget_contention))

scsi_mq_put_budget() doesn't imply smp_mb(), so one smp_mb__after_atomic()
is needed between the above two ops.

> +				blk_mq_run_hw_queues(q, true);
>  			break;
>  		}
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 2cd8d2b49ff4..6163c43ceca5 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1528,6 +1528,9 @@ void blk_mq_run_hw_queues(struct request_queue *q, bool async)
>  	struct blk_mq_hw_ctx *hctx;
>  	int i;
>
> +	/* We're running the queues, so clear the contention detector */
> +	atomic_set(&q->budget_contention, 0);
> +

You add extra cost in the fast path.

>  	queue_for_each_hw_ctx(q, hctx, i) {
>  		if (blk_mq_hctx_stopped(hctx))
>  			continue;
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index f629d40c645c..07f21e45d993 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -583,6 +583,8 @@ struct request_queue {
>
>  #define BLK_MAX_WRITE_HINTS	5
>  	u64 write_hints[BLK_MAX_WRITE_HINTS];
> +
> +	atomic_t budget_contention;

It needn't be an atomic variable; a simple 'unsigned int' should be
fine.  What matters is the ordering between reading/writing this flag
and getting/putting budget.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [PATCH v2 2/2] blk-mq: Rerun dispatching in the case of budget contention
  2020-04-03  1:33   ` Ming Lei
@ 2020-04-03 15:10     ` Doug Anderson
  2020-04-03 15:49       ` Doug Anderson
  0 siblings, 1 reply; 15+ messages in thread
From: Doug Anderson @ 2020-04-03 15:10 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, James E.J. Bottomley, Martin K. Petersen, Paolo Valente,
	Salman Qazi, linux-block, linux-scsi, Guenter Roeck, Ajay Joshi,
	Arnd Bergmann, Bart Van Assche, Chaitanya Kulkarni,
	Damien Le Moal, Hou Tao, Pavel Begunkov, Tejun Heo, LKML

Hi,

On Thu, Apr 2, 2020 at 6:34 PM Ming Lei <ming.lei@redhat.com> wrote:
>
> On Thu, Apr 02, 2020 at 08:51:30AM -0700, Douglas Anderson wrote:
> > It is possible for two threads to be running
> > blk_mq_do_dispatch_sched() at the same time with the same "hctx".
> > This is because there can be more than one caller to
> > __blk_mq_run_hw_queue() with the same "hctx" and hctx_lock() doesn't
> > prevent more than one thread from entering.
> >
> > If more than one thread is running blk_mq_do_dispatch_sched() at the
> > same time with the same "hctx", they may have contention acquiring
> > budget.  The blk_mq_get_dispatch_budget() can eventually translate
> > into scsi_mq_get_budget().  If the device's "queue_depth" is 1 (not
> > uncommon) then only one of the two threads will be the one to
> > increment "device_busy" to 1 and get the budget.
> >
> > The losing thread will break out of blk_mq_do_dispatch_sched() and
> > will stop dispatching requests.  The assumption is that when more
> > budget is available later (when existing transactions finish) the
> > queue will be kicked again, perhaps in scsi_end_request().
> >
> > The winning thread now has budget and can go on to call
> > dispatch_request().  If dispatch_request() returns NULL here then we
> > have a potential problem.  Specifically we'll now call
>
> As I mentioned before, it is a BFQ specific issue, it tells blk-mq
> that it has work to do, and now the budget is assigned to the only
> winning thread, however, dispatch_request() still returns NULL.

Correct that it only happens with BFQ, but whether it's a BFQ bug or
not just depends on how you define the has_work() API.  If has_work()
is allowed to be inexact then it's either a blk-mq bug or a SCSI bug,
depending on how you cut it.  If has_work() must be exact then it is
certainly a BFQ bug.  If has_work() doesn't need to be exact then it's
not a BFQ bug.  I believe that a sane API could be defined either way.
Either has_work() can be defined as a lightweight hint to trigger
heavier code or it can be defined as something exact.  It's really up
to blk-mq to say how they define it.

From his response on the SCSI patch [1], it sounded like Jens was OK
with has_work() being a lightweight hint as long as BFQ ensures that
the queues run later.  ...but, as my investigation found, I believe
that BFQ _does_ try to ensure that the queue is run at a later time by
calling blk_mq_run_hw_queues().  The irony is that due to the race
we're talking about here, blk_mq_run_hw_queues() isn't guaranteed to
be reliable if has_work() is inexact.  :(  One way to address this is
to make blk_mq_run_hw_queues() reliable even if has_work() is inexact.

...so Jens: care to clarify how you'd like has_work() to be defined?

> > - Because this problem only trips with inexact has_work() it's
> >   believed that blk_mq_do_dispatch_ctx() does not need a similar
> >   change.
>
> Right, I prefer to fix it in BFQ, given it isn't a generic issue,
> not worth of generic solution.

Just to confirm: it sounds like you are saying that BFQ is not a
first-class citizen here because not everyone uses BFQ.  Is that really
the best policy?  Couldn't I say that SCSI shouldn't be a first-class
citizen because not everyone uses SCSI?  In such a case I could argue
that we should speed up the blk-mq layer by removing all the budget
code since SCSI is the only user.  I'm not actually suggesting it, only
pointing out that sometimes we need to make allowances in the code.

> > @@ -97,12 +97,34 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> >  		if (e->type->ops.has_work && !e->type->ops.has_work(hctx))
> >  			break;
> > -		if (!blk_mq_get_dispatch_budget(hctx))
> > -			break;
> > +
> > +		if (!blk_mq_get_dispatch_budget(hctx)) {
> > +			/*
> > +			 * We didn't get budget so set contention.  If
> > +			 * someone else had the budget but didn't dispatch
> > +			 * they'll kick everything.  NOTE: we check one
> > +			 * extra time _after_ setting contention to fully
> > +			 * close the race.  If we don't actually dispatch
> > +			 * we'll detect faux contention (with ourselves)
> > +			 * but that should be rare.
> > +			 */
> > +			atomic_set(&q->budget_contention, 1);
> > +
> > +			if (!blk_mq_get_dispatch_budget(hctx))
>
> scsi_mq_get_budget() implies a smp_mb(), so the barrier can order
> between blk_mq_get_dispatch_budget() and atomic_set(&q->budget_contention, 0|1).

I always struggle when working with memory barriers.  I think you are
saying that this section of code is OK, though, presumably because the
SCSI code does "atomic_inc_return" which implies this barrier.  Do you
happen to know if it's documented that anyone who implements
get_budget() for "struct blk_mq_ops" will have a memory barrier here?
I know SCSI is the only existing user, but we'd want to keep it
generic.

> > +				break;
> > +		}
> >
> >  		rq = e->type->ops.dispatch_request(hctx);
> >  		if (!rq) {
> >  			blk_mq_put_dispatch_budget(hctx);
> > +
> > +			/*
> > +			 * We've released the budget but us holding it might
> > +			 * have prevented someone else from dispatching.
> > +			 * Detect that case and run all queues again.
> > +			 */
> > +			if (atomic_read(&q->budget_contention))
>
> scsi_mq_put_budget() doesn't imply smp_mb(), so one smp_mb__after_atomic()
> is needed between the above two ops.

OK, thanks.  I will add that if we decide to move forward with this
patch.  Again I'd wonder if this type of thing should be documented.

> > +				blk_mq_run_hw_queues(q, true);
> >  			break;
> >  		}
> >
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 2cd8d2b49ff4..6163c43ceca5 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -1528,6 +1528,9 @@ void blk_mq_run_hw_queues(struct request_queue *q, bool async)
> >  	struct blk_mq_hw_ctx *hctx;
> >  	int i;
> >
> > +	/* We're running the queues, so clear the contention detector */
> > +	atomic_set(&q->budget_contention, 0);
> > +
>
> You add extra cost in fast path.

Yes, but it is a fairly minor cost added.  It's called once in a place
where we're unlikely to be looping and where we're about to do much
heavier operations (perhaps in a loop).  This doesn't feel like a
dealbreaker.

> >  	queue_for_each_hw_ctx(q, hctx, i) {
> >  		if (blk_mq_hctx_stopped(hctx))
> >  			continue;
> > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> > index f629d40c645c..07f21e45d993 100644
> > --- a/include/linux/blkdev.h
> > +++ b/include/linux/blkdev.h
> > @@ -583,6 +583,8 @@ struct request_queue {
> >
> >  #define BLK_MAX_WRITE_HINTS	5
> >  	u64 write_hints[BLK_MAX_WRITE_HINTS];
> > +
> > +	atomic_t budget_contention;
>
> It needn't to be a atomic variable, and simple 'unsigned'
> int should be fine, what matters is that the order between
> R/W this flag and get/put budget.

Apparently I'm the only one who feels atomic_t is more self-documenting
here.  I will cede to the will of the majority and change to 'unsigned
int' if we decide to move forward with this patch.

-Doug

[1] https://lore.kernel.org/r/d6af2344-11f7-5862-daed-e21cbd496d92@kernel.dk

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [PATCH v2 2/2] blk-mq: Rerun dispatching in the case of budget contention 2020-04-03 15:10 ` Doug Anderson @ 2020-04-03 15:49 ` Doug Anderson 2020-04-05 9:14 ` Ming Lei 0 siblings, 1 reply; 15+ messages in thread From: Doug Anderson @ 2020-04-03 15:49 UTC (permalink / raw) To: Ming Lei Cc: Jens Axboe, James E.J. Bottomley, Martin K. Petersen, Paolo Valente, Salman Qazi, linux-block, linux-scsi, Guenter Roeck, Ajay Joshi, Arnd Bergmann, Bart Van Assche, Chaitanya Kulkarni, Damien Le Moal, Hou Tao, Pavel Begunkov, Tejun Heo, LKML Hi, On Fri, Apr 3, 2020 at 8:10 AM Doug Anderson <dianders@chromium.org> wrote: > > Correct that it only happens with BFQ, but whether it's a BFQ bug or > not just depends on how you define the has_work() API. If has_work() > is allowed to be in-exact then it's either a blk-mq bug or a SCSI bug > depending on how you cut it. If has_work() must be exact then it is > certainly a BFQ bug. If has_work() doesn't need to be exact then it's > not a BFQ bug. I believe that a sane API could be defined either way. > Either has_work() can be defined as a lightweight hint to trigger > heavier code or it can be defined as something exact. It's really up > to blk-mq to say how they define it. > > From his response on the SCSI patch [1], it sounded like Jens was OK > with has_work() being a lightweight hint as long as BFQ ensures that > the queues run later. ...but, as my investigation found, I believe > that BFQ _does_ try to ensure that the queue is run at a later time by > calling blk_mq_run_hw_queues(). The irony is that due to the race > we're talking about here, blk_mq_run_hw_queues() isn't guaranteed to > be reliable if has_work() is inexact. :( One way to address this is > to make blk_mq_run_hw_queues() reliable even if has_work() is inexact. > > ...so Jens: care to clarify how you'd like has_work() to be defined? 
Sorry to reply so quickly after my prior reply, but I might have found an extreme corner case where we can still run into the same race even if has_work() is exact. This is all theoretical from code analysis. Maybe you can poke a hole in my scenario or tell me it's so implausible that we don't care, but it seems like it's theoretically possible. For this example I'll assume a budget of 1 (AKA only one thread can get budget for a given queue): * Threads A and B both run has_work() at the same time with the same "hctx". has_work() is exact but there's no lock, so it's OK if Thread A and B both get back true. * Thread B gets interrupted for a long time right after it decides that there is work. Maybe its CPU gets an interrupt and the interrupt handler is slow. * Thread A runs, gets budget, dispatches work. * Thread A's work finishes and budget is released. * Thread B finally runs again and gets budget. * Since Thread A already took care of the work and no new work has come in, Thread B will get NULL from dispatch_request(). I believe this is specifically why dispatch_request() is allowed to return NULL in the first place if has_work() must be exact. * Thread B will now be holding the budget and is about to call put_budget(), but hasn't called it yet. * Thread B gets interrupted for a long time (again). Dang interrupts. * Now Thread C (with a different "hctx" but the same queue) comes along and runs blk_mq_do_dispatch_sched(). * Thread C won't do anything because it can't get budget. * Finally Thread B will run again and put the budget without kicking any queues. Now we have a potential I/O stall because nobody will ever kick the queues. I think the above example could happen even on non-BFQ systems and I think it would also be fixed by an approach like the one in my patch. -Doug ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v2 2/2] blk-mq: Rerun dispatching in the case of budget contention 2020-04-03 15:49 ` Doug Anderson @ 2020-04-05 9:14 ` Ming Lei 2020-04-05 14:00 ` Doug Anderson 2020-04-05 16:26 ` Doug Anderson 0 siblings, 2 replies; 15+ messages in thread From: Ming Lei @ 2020-04-05 9:14 UTC (permalink / raw) To: Doug Anderson Cc: Jens Axboe, James E.J. Bottomley, Martin K. Petersen, Paolo Valente, Salman Qazi, linux-block, linux-scsi, Guenter Roeck, Ajay Joshi, Arnd Bergmann, Bart Van Assche, Chaitanya Kulkarni, Damien Le Moal, Hou Tao, Pavel Begunkov, Tejun Heo, LKML On Fri, Apr 03, 2020 at 08:49:54AM -0700, Doug Anderson wrote: > Hi, > > On Fri, Apr 3, 2020 at 8:10 AM Doug Anderson <dianders@chromium.org> wrote: > > > > Correct that it only happens with BFQ, but whether it's a BFQ bug or > > not just depends on how you define the has_work() API. If has_work() > > is allowed to be in-exact then it's either a blk-mq bug or a SCSI bug > > depending on how you cut it. If has_work() must be exact then it is > > certainly a BFQ bug. If has_work() doesn't need to be exact then it's > > not a BFQ bug. I believe that a sane API could be defined either way. > > Either has_work() can be defined as a lightweight hint to trigger > > heavier code or it can be defined as something exact. It's really up > > to blk-mq to say how they define it. > > > > From his response on the SCSI patch [1], it sounded like Jens was OK > > with has_work() being a lightweight hint as long as BFQ ensures that > > the queues run later. ...but, as my investigation found, I believe > > that BFQ _does_ try to ensure that the queue is run at a later time by > > calling blk_mq_run_hw_queues(). The irony is that due to the race > > we're talking about here, blk_mq_run_hw_queues() isn't guaranteed to > > be reliable if has_work() is inexact. :( One way to address this is > > to make blk_mq_run_hw_queues() reliable even if has_work() is inexact. 
> > > > ...so Jens: care to clarify how you'd like has_work() to be defined? > > Sorry to reply so quickly after my prior reply, but I might have found > an extreme corner case where we can still run into the same race even > if has_work() is exact. This is all theoretical from code analysis. > Maybe you can poke a hole in my scenario or tell me it's so > implausible that we don't care, but it seems like it's theoretically > possible. For this example I'll assume a budget of 1 (AKA only one > thread can get budget for a given queue): > > * Threads A and B both run has_work() at the same time with the same > "hctx". has_work() is exact but there's no lock, so it's OK if > Thread A and B both get back true. > > * Thread B gets interrupted for a long time right after it decides > that there is work. Maybe its CPU gets an interrupt and the > interrupt handler is slow. > > * Thread A runs, get budget, dispatches work. > > * Thread A's work finishes and budget is released. > > * Thread B finally runs again and gets budget. > > * Since Thread A already took care of the work and no new work has > come in, Thread B will get NULL from dispatch_request(). I believe > this is specifically why dispatch_request() is allowed to return > NULL in the first place if has_work() must be exact. > > * Thread B will now be holding the budget and is about to call > put_budget(), but hasn't called it yet. > > * Thread B gets interrupted for a long time (again). Dang interrupts. > > * Now Thread C (with a different "hctx" but the same queue) comes > along and runs blk_mq_do_dispatch_sched(). > > * Thread C won't do anything because it can't get budget. > > * Finally Thread B will run again and put the budget without kicking > any queues. > > Now we have a potential I/O stall because nobody will ever kick the > queues. > > > I think the above example could happen even on non-BFQ systems and I > think it would also be fixed by an approach like the one in my patch. 
OK, it looks like this isn't specific to BFQ any more. Here is another candidate approach for this issue: given that it is so hard to trigger, we can make things more reliable by rerunning the queue when has_work() returns true after ops->dispatch_request() returns NULL. diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c index 74cedea56034..4408e5d4fcd8 100644 --- a/block/blk-mq-sched.c +++ b/block/blk-mq-sched.c @@ -80,6 +80,7 @@ void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx) blk_mq_run_hw_queue(hctx, true); } +#define BLK_MQ_BUDGET_DELAY 3 /* ms units */ /* * Only SCSI implements .get_budget and .put_budget, and SCSI restarts * its queue by itself in its completion handler, so we don't need to @@ -103,6 +104,9 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx) rq = e->type->ops.dispatch_request(hctx); if (!rq) { blk_mq_put_dispatch_budget(hctx); + + if (e->type->ops.has_work && e->type->ops.has_work(hctx)) + blk_mq_delay_run_hw_queue(hctx, BLK_MQ_BUDGET_DELAY); break; } Thanks, Ming ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH v2 2/2] blk-mq: Rerun dispatching in the case of budget contention 2020-04-05 9:14 ` Ming Lei @ 2020-04-05 14:00 ` Doug Anderson 2020-04-05 14:57 ` Paolo Valente 2020-04-05 16:26 ` Doug Anderson 1 sibling, 1 reply; 15+ messages in thread From: Doug Anderson @ 2020-04-05 14:00 UTC (permalink / raw) To: Ming Lei Cc: Jens Axboe, James E.J. Bottomley, Martin K. Petersen, Paolo Valente, Salman Qazi, linux-block, linux-scsi, Guenter Roeck, Ajay Joshi, Arnd Bergmann, Bart Van Assche, Chaitanya Kulkarni, Damien Le Moal, Hou Tao, Pavel Begunkov, Tejun Heo, LKML Hi, On Sun, Apr 5, 2020 at 2:15 AM Ming Lei <ming.lei@redhat.com> wrote: > > OK, looks it isn't specific on BFQ any more. > > Follows another candidate approach for this issue, given it is so hard > to trigger, we can make it more reliable by rerun queue when has_work() > returns true after ops->dispath_request() returns NULL. > > diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c > index 74cedea56034..4408e5d4fcd8 100644 > --- a/block/blk-mq-sched.c > +++ b/block/blk-mq-sched.c > @@ -80,6 +80,7 @@ void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx) > blk_mq_run_hw_queue(hctx, true); > } > > +#define BLK_MQ_BUDGET_DELAY 3 /* ms units */ > /* > * Only SCSI implements .get_budget and .put_budget, and SCSI restarts > * its queue by itself in its completion handler, so we don't need to > @@ -103,6 +104,9 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx) > rq = e->type->ops.dispatch_request(hctx); > if (!rq) { > blk_mq_put_dispatch_budget(hctx); > + > + if (e->type->ops.has_work && e->type->ops.has_work(hctx)) > + blk_mq_delay_run_hw_queue(hctx, BLK_MQ_BUDGET_DELAY); I agree that your patch should solve the race. With the current BFQ's has_work() it's a bit of a disaster though. It will essentially put blk-mq into a busy-wait loop (with a 3 ms delay between each poll) while BFQ's has_work() says "true" but BFQ doesn't dispatch anything. 
...so I guess the question that still needs to be answered: does has_work() need to be exact? If so then we need the patch you propose plus one to BFQ. If not, we should continue along the lines of my patch. -Doug ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v2 2/2] blk-mq: Rerun dispatching in the case of budget contention 2020-04-05 14:00 ` Doug Anderson @ 2020-04-05 14:57 ` Paolo Valente 2020-04-05 16:16 ` Doug Anderson 0 siblings, 1 reply; 15+ messages in thread From: Paolo Valente @ 2020-04-05 14:57 UTC (permalink / raw) To: Doug Anderson Cc: Ming Lei, Jens Axboe, James E.J. Bottomley, Martin K. Petersen, Salman Qazi, linux-block, linux-scsi, Guenter Roeck, Ajay Joshi, Arnd Bergmann, Bart Van Assche, Chaitanya Kulkarni, Damien Le Moal, Hou Tao, Pavel Begunkov, Tejun Heo, LKML > Il giorno 5 apr 2020, alle ore 16:00, Doug Anderson <dianders@chromium.org> ha scritto: > > Hi, > > On Sun, Apr 5, 2020 at 2:15 AM Ming Lei <ming.lei@redhat.com> wrote: >> >> OK, looks it isn't specific on BFQ any more. >> >> Follows another candidate approach for this issue, given it is so hard >> to trigger, we can make it more reliable by rerun queue when has_work() >> returns true after ops->dispath_request() returns NULL. >> >> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c >> index 74cedea56034..4408e5d4fcd8 100644 >> --- a/block/blk-mq-sched.c >> +++ b/block/blk-mq-sched.c >> @@ -80,6 +80,7 @@ void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx) >> blk_mq_run_hw_queue(hctx, true); >> } >> >> +#define BLK_MQ_BUDGET_DELAY 3 /* ms units */ >> /* >> * Only SCSI implements .get_budget and .put_budget, and SCSI restarts >> * its queue by itself in its completion handler, so we don't need to >> @@ -103,6 +104,9 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx) >> rq = e->type->ops.dispatch_request(hctx); >> if (!rq) { >> blk_mq_put_dispatch_budget(hctx); >> + >> + if (e->type->ops.has_work && e->type->ops.has_work(hctx)) >> + blk_mq_delay_run_hw_queue(hctx, BLK_MQ_BUDGET_DELAY); > > I agree that your patch should solve the race. With the current BFQ's > has_work() it's a bit of a disaster though. 
It will essentially put > blk-mq into a busy-wait loop (with a 3 ms delay between each poll) > while BFQ's has_work() says "true" but BFQ doesn't dispatch anything. > > ...so I guess the question that still needs to be answered: does > has_work() need to be exact? If so then we need the patch you propose > plus one to BFQ. If not, we should continue along the lines of my > patch. > Some more comments. BFQ's I/O plugging lasts 9 ms by default. So, with this last Ming's patch, BFQ may happen to be polled every 3ms, for at most three times. On the opposite end, making bfq_has_work plugging aware costs more complexity, and possibly one more lock. While avoiding the above occasional polling, this may imply a lot of overhead or CPU stalls on every dispatch. Paolo > -Doug ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v2 2/2] blk-mq: Rerun dispatching in the case of budget contention 2020-04-05 14:57 ` Paolo Valente @ 2020-04-05 16:16 ` Doug Anderson 2020-04-07 10:43 ` Paolo Valente 0 siblings, 1 reply; 15+ messages in thread From: Doug Anderson @ 2020-04-05 16:16 UTC (permalink / raw) To: Paolo Valente Cc: Ming Lei, Jens Axboe, James E.J. Bottomley, Martin K. Petersen, Salman Qazi, linux-block, linux-scsi, Guenter Roeck, Ajay Joshi, Arnd Bergmann, Bart Van Assche, Chaitanya Kulkarni, Damien Le Moal, Hou Tao, Pavel Begunkov, Tejun Heo, LKML Hi, On Sun, Apr 5, 2020 at 7:55 AM Paolo Valente <paolo.valente@linaro.org> wrote: > > > Il giorno 5 apr 2020, alle ore 16:00, Doug Anderson <dianders@chromium.org> ha scritto: > > > > Hi, > > > > On Sun, Apr 5, 2020 at 2:15 AM Ming Lei <ming.lei@redhat.com> wrote: > >> > >> OK, looks it isn't specific on BFQ any more. > >> > >> Follows another candidate approach for this issue, given it is so hard > >> to trigger, we can make it more reliable by rerun queue when has_work() > >> returns true after ops->dispath_request() returns NULL. > >> > >> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c > >> index 74cedea56034..4408e5d4fcd8 100644 > >> --- a/block/blk-mq-sched.c > >> +++ b/block/blk-mq-sched.c > >> @@ -80,6 +80,7 @@ void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx) > >> blk_mq_run_hw_queue(hctx, true); > >> } > >> > >> +#define BLK_MQ_BUDGET_DELAY 3 /* ms units */ > >> /* > >> * Only SCSI implements .get_budget and .put_budget, and SCSI restarts > >> * its queue by itself in its completion handler, so we don't need to > >> @@ -103,6 +104,9 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx) > >> rq = e->type->ops.dispatch_request(hctx); > >> if (!rq) { > >> blk_mq_put_dispatch_budget(hctx); > >> + > >> + if (e->type->ops.has_work && e->type->ops.has_work(hctx)) > >> + blk_mq_delay_run_hw_queue(hctx, BLK_MQ_BUDGET_DELAY); > > > > I agree that your patch should solve the race. 
With the current BFQ's > > has_work() it's a bit of a disaster though. It will essentially put > > blk-mq into a busy-wait loop (with a 3 ms delay between each poll) > > while BFQ's has_work() says "true" but BFQ doesn't dispatch anything. > > > > ...so I guess the question that still needs to be answered: does > > has_work() need to be exact? If so then we need the patch you propose > > plus one to BFQ. If not, we should continue along the lines of my > > patch. > > > > Some more comments. BFQ's I/O plugging lasts 9 ms by default. So, > with this last Ming's patch, BFQ may happen to be polled every 3ms, > for at most three times. Ah! I did not know this. OK, then Ming's patch seems like it should work. If nothing else it should fix the problem. If this ends up making BFQ chew up too much CPU time then presumably someone will notice and BFQ's has_work() can be improved. Ming: how do you want to proceed? Do you want to formally post the patch? Do you want me to post a v3 of my series where I replace patch #2 with your patch? Do you want authorship (which implies adding your Signed-off-by)? > On the opposite end, making bfq_has_work plugging aware costs more > complexity, and possibly one more lock. While avoiding the above > occasional polling, this may imply a lot of overhead or CPU stalls on > every dispatch. I still think it would be interesting to run performance tests with my proof-of-concept solution for has_work(). Even if it's not ideal, knowing whether performance increased, decreased, or stayed the same would give information about how much more effort should be put into this. -Doug ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v2 2/2] blk-mq: Rerun dispatching in the case of budget contention 2020-04-05 16:16 ` Doug Anderson @ 2020-04-07 10:43 ` Paolo Valente 0 siblings, 0 replies; 15+ messages in thread From: Paolo Valente @ 2020-04-07 10:43 UTC (permalink / raw) To: Doug Anderson Cc: Ming Lei, Jens Axboe, James E.J. Bottomley, Martin K. Petersen, Salman Qazi, linux-block, linux-scsi, Guenter Roeck, Ajay Joshi, Arnd Bergmann, Bart Van Assche, Chaitanya Kulkarni, Damien Le Moal, Hou Tao, Pavel Begunkov, Tejun Heo, LKML > Il giorno 5 apr 2020, alle ore 18:16, Doug Anderson <dianders@chromium.org> ha scritto: > > Hi, > > On Sun, Apr 5, 2020 at 7:55 AM Paolo Valente <paolo.valente@linaro.org> wrote: >> >>> Il giorno 5 apr 2020, alle ore 16:00, Doug Anderson <dianders@chromium.org> ha scritto: >>> >>> Hi, >>> >>> On Sun, Apr 5, 2020 at 2:15 AM Ming Lei <ming.lei@redhat.com> wrote: >>>> >>>> OK, looks it isn't specific on BFQ any more. >>>> >>>> Follows another candidate approach for this issue, given it is so hard >>>> to trigger, we can make it more reliable by rerun queue when has_work() >>>> returns true after ops->dispath_request() returns NULL. 
>>>> >>>> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c >>>> index 74cedea56034..4408e5d4fcd8 100644 >>>> --- a/block/blk-mq-sched.c >>>> +++ b/block/blk-mq-sched.c >>>> @@ -80,6 +80,7 @@ void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx) >>>> blk_mq_run_hw_queue(hctx, true); >>>> } >>>> >>>> +#define BLK_MQ_BUDGET_DELAY 3 /* ms units */ >>>> /* >>>> * Only SCSI implements .get_budget and .put_budget, and SCSI restarts >>>> * its queue by itself in its completion handler, so we don't need to >>>> @@ -103,6 +104,9 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx) >>>> rq = e->type->ops.dispatch_request(hctx); >>>> if (!rq) { >>>> blk_mq_put_dispatch_budget(hctx); >>>> + >>>> + if (e->type->ops.has_work && e->type->ops.has_work(hctx)) >>>> + blk_mq_delay_run_hw_queue(hctx, BLK_MQ_BUDGET_DELAY); >>> >>> I agree that your patch should solve the race. With the current BFQ's >>> has_work() it's a bit of a disaster though. It will essentially put >>> blk-mq into a busy-wait loop (with a 3 ms delay between each poll) >>> while BFQ's has_work() says "true" but BFQ doesn't dispatch anything. >>> >>> ...so I guess the question that still needs to be answered: does >>> has_work() need to be exact? If so then we need the patch you propose >>> plus one to BFQ. If not, we should continue along the lines of my >>> patch. >>> >> >> Some more comments. BFQ's I/O plugging lasts 9 ms by default. So, >> with this last Ming's patch, BFQ may happen to be polled every 3ms, >> for at most three times. > > Ah! I did not know this. OK, then Ming's patch seems like it should > work. If nothing else it should fix the problem. If this ends up > making BFQ chew up too much CPU time then presumably someone will > notice and BFQ's has_work() can be improved. > > Ming: how do you want to proceed? Do you want to formally post the > patch? Do you want me to post a v3 of my series where I place patch > #2 with your patch? 
Do you want authorship (which implies adding your > Signed-off-by)? > > >> On the opposite end, making bfq_has_work plugging aware costs more >> complexity, and possibly one more lock. While avoiding the above >> occasional polling, this may imply a lot of overhead or CPU stalls on >> every dispatch. > > I still think it would be interesting to run performance tests with my > proof-of-concept solution for has_work(). Even if it's not ideal, > knowing whether performance increased, decreased, or stayed the same > would give information about how much more effort should be put into > this. > Why not? It is however hard to hope that we add only negligible overhead and CPU stalls if we move from one lock-protected section per I/O-request dispatch, to two or more lock-protected sections per request (has_work may be invoked several times per request). At any rate, if useful, one of the scripts in my S benchmark suite can also measure max IOPS (when limited only by I/O processing) [1]. The script is for Linux distros; I don't know whether it works in your environments of interest, Doug. Paolo [1] https://github.com/Algodev-github/S/blob/master/throughput-sync/throughput-sync.sh > -Doug ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v2 2/2] blk-mq: Rerun dispatching in the case of budget contention 2020-04-05 9:14 ` Ming Lei 2020-04-05 14:00 ` Doug Anderson @ 2020-04-05 16:26 ` Doug Anderson 2020-04-07 2:14 ` Ming Lei 1 sibling, 1 reply; 15+ messages in thread From: Doug Anderson @ 2020-04-05 16:26 UTC (permalink / raw) To: Ming Lei Cc: Jens Axboe, James E.J. Bottomley, Martin K. Petersen, Paolo Valente, Salman Qazi, linux-block, linux-scsi, Guenter Roeck, Ajay Joshi, Arnd Bergmann, Bart Van Assche, Chaitanya Kulkarni, Damien Le Moal, Hou Tao, Pavel Begunkov, Tejun Heo, LKML Hi, On Sun, Apr 5, 2020 at 2:15 AM Ming Lei <ming.lei@redhat.com> wrote: > > @@ -103,6 +104,9 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx) > rq = e->type->ops.dispatch_request(hctx); > if (!rq) { > blk_mq_put_dispatch_budget(hctx); > + > + if (e->type->ops.has_work && e->type->ops.has_work(hctx)) > + blk_mq_delay_run_hw_queue(hctx, BLK_MQ_BUDGET_DELAY); To really close the race, don't we need to run all the queues associated with the hctx? I haven't traced it through, but I've been assuming that the multiple "hctx"s associated with the same queue will have the same budget associated with them and thus they can block each other out. -Doug ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v2 2/2] blk-mq: Rerun dispatching in the case of budget contention 2020-04-05 16:26 ` Doug Anderson @ 2020-04-07 2:14 ` Ming Lei 2020-04-07 22:04 ` Doug Anderson 0 siblings, 1 reply; 15+ messages in thread From: Ming Lei @ 2020-04-07 2:14 UTC (permalink / raw) To: Doug Anderson Cc: Jens Axboe, James E.J. Bottomley, Martin K. Petersen, Paolo Valente, Salman Qazi, linux-block, linux-scsi, Guenter Roeck, Ajay Joshi, Arnd Bergmann, Bart Van Assche, Chaitanya Kulkarni, Damien Le Moal, Hou Tao, Pavel Begunkov, Tejun Heo, LKML On Sun, Apr 05, 2020 at 09:26:39AM -0700, Doug Anderson wrote: > Hi, > > On Sun, Apr 5, 2020 at 2:15 AM Ming Lei <ming.lei@redhat.com> wrote: > > > > @@ -103,6 +104,9 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx) > > rq = e->type->ops.dispatch_request(hctx); > > if (!rq) { > > blk_mq_put_dispatch_budget(hctx); > > + > > + if (e->type->ops.has_work && e->type->ops.has_work(hctx)) > > + blk_mq_delay_run_hw_queue(hctx, BLK_MQ_BUDGET_DELAY); > > To really close the race, don't we need to run all the queues > associated with the hctx? I haven't traced it through, but I've been > assuming that the multiple "hctx"s associated with the same queue will > have the same budget associated with them and thus they can block each > other out. Yeah, we should run all hctxs which share the same budget space. Also, in theory, we don't have to add the delay; however, BFQ may plug the dispatch for 9 ms, so it looks like a delayed queue run is required. thanks, Ming ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v2 2/2] blk-mq: Rerun dispatching in the case of budget contention 2020-04-07 2:14 ` Ming Lei @ 2020-04-07 22:04 ` Doug Anderson 0 siblings, 0 replies; 15+ messages in thread From: Doug Anderson @ 2020-04-07 22:04 UTC (permalink / raw) To: Ming Lei Cc: Jens Axboe, James E.J. Bottomley, Martin K. Petersen, Paolo Valente, Salman Qazi, linux-block, linux-scsi, Guenter Roeck, Ajay Joshi, Arnd Bergmann, Bart Van Assche, Chaitanya Kulkarni, Damien Le Moal, Hou Tao, Pavel Begunkov, Tejun Heo, LKML Hi, On Mon, Apr 6, 2020 at 7:14 PM Ming Lei <ming.lei@redhat.com> wrote: > > On Sun, Apr 05, 2020 at 09:26:39AM -0700, Doug Anderson wrote: > > Hi, > > > > On Sun, Apr 5, 2020 at 2:15 AM Ming Lei <ming.lei@redhat.com> wrote: > > > > > > @@ -103,6 +104,9 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx) > > > rq = e->type->ops.dispatch_request(hctx); > > > if (!rq) { > > > blk_mq_put_dispatch_budget(hctx); > > > + > > > + if (e->type->ops.has_work && e->type->ops.has_work(hctx)) > > > + blk_mq_delay_run_hw_queue(hctx, BLK_MQ_BUDGET_DELAY); > > > > To really close the race, don't we need to run all the queues > > associated with the hctx? I haven't traced it through, but I've been > > assuming that the multiple "hctx"s associated with the same queue will > > have the same budget associated with them and thus they can block each > > other out. > > Yeah, we should run all hctxs which share the same budget space. > > Also, in theory, we don't have to add the delay, however BFQ may plug the > dispatch for 9 ms, so looks delay run queue is required. I have posted up v3. A few notes: * Since we should run all "hctxs" I took out the check for has_work() before kicking the queue. * As far as I can tell the theoretical race happens for _anyone_ who puts budget, so I added it to blk_mq_put_dispatch_budget(). Feel free to shout at me in response to v3 if you believe I'm wrong about that. Thanks for all your reviews and suggestions so far! 
-Doug ^ permalink raw reply [flat|nested] 15+ messages in thread