* [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle @ 2018-01-18 2:41 Ming Lei 2018-01-18 16:50 ` Bart Van Assche 0 siblings, 1 reply; 80+ messages in thread From: Ming Lei @ 2018-01-18 2:41 UTC (permalink / raw) To: Jens Axboe, linux-block, Mike Snitzer, dm-devel Cc: Christoph Hellwig, Bart Van Assche, linux-kernel, Omar Sandoval, Ming Lei BLK_STS_RESOURCE can be returned from a driver when it runs out of any resource, and that resource may be unrelated to tags, such as a failed kmalloc(GFP_ATOMIC). When the queue is idle under this kind of BLK_STS_RESOURCE, the RESTART mechanism can no longer work, and an IO hang may be caused. Most drivers call kmalloc(GFP_ATOMIC) in the IO path, and almost all return BLK_STS_RESOURCE in this situation. For dm-mpath it may be triggered a bit more easily, since the request pool of the underlying queue is consumed much more easily. But in reality it is still not easy to trigger: I ran all kinds of tests on dm-mpath/scsi-debug with all kinds of scsi_debug parameters and couldn't trigger this issue at all. But it is finally triggered by Bart's SRP test, which seems made by a genius, :-) This patch deals with this situation by running the queue again when the queue is found idle in the timeout handler. Signed-off-by: Ming Lei <ming.lei@redhat.com> --- Another approach is to do the check after BLK_STS_RESOURCE is returned from .queue_rq() and BLK_MQ_S_SCHED_RESTART is set; that approach may introduce a bit of cost in the hot path, and it was V1 of this patch actually, please see it in the following link: https://github.com/ming1/linux/commit/68a66900f3647ea6751aab2848b1e5eef508feaa Or other better ways?
 block/blk-mq.c | 83 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 82 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6e3f77829dcc..4d4af8d712da 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -896,6 +896,85 @@ static void blk_mq_terminate_expired(struct blk_mq_hw_ctx *hctx,
 		blk_mq_rq_timed_out(rq, reserved);
 }
 
+struct hctx_busy_data {
+	struct blk_mq_hw_ctx *hctx;
+	bool reserved;
+	bool busy;
+};
+
+static bool check_busy_hctx(struct sbitmap *sb, unsigned int bitnr, void *data)
+{
+	struct hctx_busy_data *busy_data = data;
+	struct blk_mq_hw_ctx *hctx = busy_data->hctx;
+	struct request *rq;
+
+	if (busy_data->reserved)
+		bitnr += hctx->tags->nr_reserved_tags;
+
+	rq = hctx->tags->static_rqs[bitnr];
+	if (blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT) {
+		busy_data->busy = true;
+		return false;
+	}
+	return true;
+}
+
+/* Check if there is any in-flight request */
+static bool blk_mq_hctx_is_busy(struct blk_mq_hw_ctx *hctx)
+{
+	struct hctx_busy_data data = {
+		.hctx = hctx,
+		.busy = false,
+		.reserved = true,
+	};
+
+	sbitmap_for_each_set(&hctx->tags->breserved_tags.sb,
+			check_busy_hctx, &data);
+	if (data.busy)
+		return true;
+
+	data.reserved = false;
+	sbitmap_for_each_set(&hctx->tags->bitmap_tags.sb,
+			check_busy_hctx, &data);
+	if (data.busy)
+		return true;
+
+	return false;
+}
+
+static void blk_mq_fixup_restart(struct blk_mq_hw_ctx *hctx)
+{
+	if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state)) {
+		bool busy;
+
+		/*
+		 * If this hctx is still marked as RESTART, and there
+		 * isn't any in-flight requests, we have to run queue
+		 * here to prevent IO from hanging.
+		 *
+		 * BLK_STS_RESOURCE can be returned from driver when any
+		 * resource is running out of. And the resource may not
+		 * be related with tags, such as kmalloc(GFP_ATOMIC), when
+		 * queue is idle under this kind of BLK_STS_RESOURCE, restart
+		 * can't work any more, then IO hang may be caused.
+		 *
+		 * The counter-pair of the following barrier is the one
+		 * in blk_mq_put_driver_tag() after returning BLK_STS_RESOURCE
+		 * from ->queue_rq().
+		 */
+		smp_mb();
+
+		busy = blk_mq_hctx_is_busy(hctx);
+		if (!busy) {
+			printk(KERN_WARNING "blk-mq: fixup RESTART\n");
+			printk(KERN_WARNING "\t If this message is shown"
+				" a bit often, please report the issue to"
+				" linux-block@vger.kernel.org\n");
+			blk_mq_run_hw_queue(hctx, true);
+		}
+	}
+}
+
 static void blk_mq_timeout_work(struct work_struct *work)
 {
 	struct request_queue *q =
@@ -966,8 +1045,10 @@ static void blk_mq_timeout_work(struct work_struct *work)
 	 */
 	queue_for_each_hw_ctx(q, hctx, i) {
 		/* the hctx may be unmapped, so check it here */
-		if (blk_mq_hw_queue_mapped(hctx))
+		if (blk_mq_hw_queue_mapped(hctx)) {
 			blk_mq_tag_idle(hctx);
+			blk_mq_fixup_restart(hctx);
+		}
 		}
 	}
 	blk_queue_exit(q);
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 80+ messages in thread
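The core of the patch is the pairing of two conditions: the hctx still has BLK_MQ_S_SCHED_RESTART set, yet no request is in flight, so no future completion will ever rerun the queue. A minimal userspace model of that check (the struct and function names below are invented for illustration; this is not the kernel implementation):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's per-hctx state. */
enum rq_state { MQ_RQ_IDLE, MQ_RQ_IN_FLIGHT };

#define NR_TAGS 4

struct hctx_model {
	bool restart_set;           /* BLK_MQ_S_SCHED_RESTART analogue */
	enum rq_state rqs[NR_TAGS]; /* per-tag request state */
	int runs;                   /* times the queue was rerun */
};

/* Analogue of blk_mq_hctx_is_busy(): scan all tags for an in-flight request. */
static bool hctx_is_busy(const struct hctx_model *h)
{
	for (size_t i = 0; i < NR_TAGS; i++)
		if (h->rqs[i] == MQ_RQ_IN_FLIGHT)
			return true;
	return false;
}

/*
 * Analogue of blk_mq_fixup_restart(): RESTART set with nothing in flight
 * means no completion will ever rerun the queue, so the timeout handler
 * has to do it itself.
 */
static void fixup_restart(struct hctx_model *h)
{
	if (h->restart_set && !hctx_is_busy(h))
		h->runs++;
}
```

With one request still in flight, fixup_restart() does nothing, since that request's completion will restart the queue; with the queue fully idle it reruns the queue, which is exactly the hang case the patch targets.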
* Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle 2018-01-18 2:41 [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle Ming Lei @ 2018-01-18 16:50 ` Bart Van Assche 2018-01-18 17:03 ` Mike Snitzer 0 siblings, 1 reply; 80+ messages in thread From: Bart Van Assche @ 2018-01-18 16:50 UTC (permalink / raw) To: Ming Lei, Jens Axboe, linux-block, Mike Snitzer, dm-devel Cc: Christoph Hellwig, Bart Van Assche, linux-kernel, Omar Sandoval On 01/17/18 18:41, Ming Lei wrote: > BLK_STS_RESOURCE can be returned from driver when any resource > is running out of. And the resource may not be related with tags, > such as kmalloc(GFP_ATOMIC), when queue is idle under this kind of > BLK_STS_RESOURCE, restart can't work any more, then IO hang may > be caused. > > Most of drivers may call kmalloc(GFP_ATOMIC) in IO path, and almost > all returns BLK_STS_RESOURCE under this situation. But for dm-mpath, > it may be triggered a bit easier since the request pool of underlying > queue may be consumed up much easier. But in reality, it is still not > easy to trigger it. I run all kinds of test on dm-mpath/scsi-debug > with all kinds of scsi_debug parameters, can't trigger this issue > at all. But finally it is triggered in Bart's SRP test, which seems > made by genius, :-) > > [ ... ] > > static void blk_mq_timeout_work(struct work_struct *work) > { > struct request_queue *q = > @@ -966,8 +1045,10 @@ static void blk_mq_timeout_work(struct work_struct *work) > */ > queue_for_each_hw_ctx(q, hctx, i) { > /* the hctx may be unmapped, so check it here */ > - if (blk_mq_hw_queue_mapped(hctx)) > + if (blk_mq_hw_queue_mapped(hctx)) { > blk_mq_tag_idle(hctx); > + blk_mq_fixup_restart(hctx); > + } > } > } > blk_queue_exit(q); Hello Ming, My comments about the above are as follows: - It can take up to q->rq_timeout jiffies after a .queue_rq() implementation returned BLK_STS_RESOURCE before blk_mq_timeout_work() gets called. 
However, it can happen that only a few milliseconds after .queue_rq() returned BLK_STS_RESOURCE, the condition that caused it to return BLK_STS_RESOURCE gets cleared. So the above approach can result in long delays during which it will seem like the queue got stuck. Additionally, I think that the block driver should decide how long it takes before a queue is rerun, not the block layer core. - The lockup that I reported only occurs with the dm driver but not with any other block driver. So why modify the block layer core when this can be fixed by modifying the dm driver? - A much simpler fix that is known to work exists, namely inserting a blk_mq_delay_run_hw_queue() call in the dm driver. Bart. ^ permalink raw reply [flat|nested] 80+ messages in thread
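Bart's alternative amounts to a timer-based fallback: when dispatch fails for a reason that no completion will ever signal, dm arms a delayed queue run and retries later. A toy userspace model of that behavior (all names invented; the delay constant stands in for the fixed delay passed to blk_mq_delay_run_hw_queue()), which also shows the latency cost the other side of the thread objects to:

```c
#include <assert.h>
#include <stdbool.h>

#define RERUN_DELAY_TICKS 100 /* stand-in for a fixed rerun delay */

struct queue_model {
	bool resource_available;
	int rerun_at;   /* tick at which a delayed rerun fires, -1 if none */
	int dispatched;
};

/* Try to issue one request; on failure, arm an unconditional delayed rerun. */
static void dispatch(struct queue_model *q, int now)
{
	if (q->resource_available) {
		q->dispatched++;
		q->rerun_at = -1;
	} else {
		/* BLK_STS_RESOURCE for a non-tag resource: retry on a timer. */
		q->rerun_at = now + RERUN_DELAY_TICKS;
	}
}

/* Clock tick: fire the delayed rerun when its deadline arrives. */
static void tick(struct queue_model *q, int now)
{
	if (q->rerun_at == now)
		dispatch(q, now);
}
```

Even if the shortage clears one tick after the failed dispatch, the request still waits out the full delay before being issued, which is the bounded but avoidable latency this approach trades for simplicity.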
* Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle 2018-01-18 16:50 ` Bart Van Assche @ 2018-01-18 17:03 ` Mike Snitzer 0 siblings, 0 replies; 80+ messages in thread From: Mike Snitzer @ 2018-01-18 17:03 UTC (permalink / raw) To: Bart Van Assche Cc: Ming Lei, Jens Axboe, linux-block, dm-devel, Christoph Hellwig, Bart Van Assche, linux-kernel, Omar Sandoval On Thu, Jan 18 2018 at 11:50am -0500, Bart Van Assche <bart.vanassche@wdc.com> wrote: > On 01/17/18 18:41, Ming Lei wrote: > >BLK_STS_RESOURCE can be returned from driver when any resource > >is running out of. And the resource may not be related with tags, > >such as kmalloc(GFP_ATOMIC), when queue is idle under this kind of > >BLK_STS_RESOURCE, restart can't work any more, then IO hang may > >be caused. > > > >Most of drivers may call kmalloc(GFP_ATOMIC) in IO path, and almost > >all returns BLK_STS_RESOURCE under this situation. But for dm-mpath, > >it may be triggered a bit easier since the request pool of underlying > >queue may be consumed up much easier. But in reality, it is still not > >easy to trigger it. I run all kinds of test on dm-mpath/scsi-debug > >with all kinds of scsi_debug parameters, can't trigger this issue > >at all. But finally it is triggered in Bart's SRP test, which seems > >made by genius, :-) > > > >[ ... 
] > > > > static void blk_mq_timeout_work(struct work_struct *work) > > { > > struct request_queue *q = > >@@ -966,8 +1045,10 @@ static void blk_mq_timeout_work(struct work_struct *work) > > */ > > queue_for_each_hw_ctx(q, hctx, i) { > > /* the hctx may be unmapped, so check it here */ > >- if (blk_mq_hw_queue_mapped(hctx)) > >+ if (blk_mq_hw_queue_mapped(hctx)) { > > blk_mq_tag_idle(hctx); > >+ blk_mq_fixup_restart(hctx); > >+ } > > } > > } > > blk_queue_exit(q); > > Hello Ming, > > My comments about the above are as follows: > - It can take up to q->rq_timeout jiffies after a .queue_rq() > implementation returned BLK_STS_RESOURCE before blk_mq_timeout_work() > gets called. However, it can happen that only a few milliseconds after > .queue_rq() returned BLK_STS_RESOURCE that the condition that caused > it to return BLK_STS_RESOURCE gets cleared. So the above approach can > result in long delays during which it will seem like the queue got > stuck. Additionally, I think that the block driver should decide how > long it takes before a queue is rerun and not the block layer core. So configure q->rq_timeout to be shorter? Which is configurable through blk_mq_tag_set's 'timeout' member. It apparently defaults to 30 * HZ. That is the problem with timeouts, there is generally no one size fits all. > - The lockup that I reported only occurs with the dm driver but not any > other block driver. So why to modify the block layer core since this > can be fixed by modifying the dm driver? Hard to know it is only DM's blk-mq that is impacted. That is the only blk-mq driver that you're testing like this (that is also able to handle faults, etc). > - A much simpler fix and a fix that is known to work exists, namely > inserting a blk_mq_delay_run_hw_queue() call in the dm driver.
Because your "much simpler" fix actively hurts performance, as is detailed in this header: https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16&id=ec3eaf9a673106f66606896aed6ddd20180b02ec I'm not going to take your bandaid fix given it very much seems to be papering over a real blk-mq issue. Mike ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle @ 2018-01-18 17:20 ` Bart Van Assche 0 siblings, 0 replies; 80+ messages in thread From: Bart Van Assche @ 2018-01-18 17:20 UTC (permalink / raw) To: snitzer Cc: dm-devel, hch, linux-kernel, linux-block, osandov, ming.lei, axboe On Thu, 2018-01-18 at 12:03 -0500, Mike Snitzer wrote: > On Thu, Jan 18 2018 at 11:50am -0500, > Bart Van Assche <bart.vanassche@wdc.com> wrote: > > My comments about the above are as follows: > > - It can take up to q->rq_timeout jiffies after a .queue_rq() > > implementation returned BLK_STS_RESOURCE before blk_mq_timeout_work() > > gets called. However, it can happen that only a few milliseconds after > > .queue_rq() returned BLK_STS_RESOURCE that the condition that caused > > it to return BLK_STS_RESOURCE gets cleared. So the above approach can > > result in long delays during which it will seem like the queue got > > stuck. Additionally, I think that the block driver should decide how > > long it takes before a queue is rerun and not the block layer core. > > So configure q->rq_timeout to be shorter? Which is configurable though > blk_mq_tag_set's 'timeout' member. It apparently defaults to 30 * HZ. > > That is the problem with timeouts, there is generally no one size fits > all. Sorry but I think that would be wrong. The delay after which a queue is rerun should not be coupled to the request timeout. These two should be independent. > > - The lockup that I reported only occurs with the dm driver but not any > > other block driver. So why to modify the block layer core since this > > can be fixed by modifying the dm driver? > > Hard to know it is only DM's blk-mq that is impacted. That is the only > blk-mq driver that you're testing like this (that is also able to handle > faults, etc). That's not correct. I'm also testing the SCSI core, which is one of the most complicated block drivers. 
> > - A much simpler fix and a fix that is known to work exists, namely > > inserting a blk_mq_delay_run_hw_queue() call in the dm driver. > > Because your "much simpler" fix actively hurts performance, as is > detailed in this header: > https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16&id=ec3eaf9a673106f66606896aed6ddd20180b02ec We are close to the start of the merge window so I think it's better to fall back to an old approach that is known to work than to keep a new approach that is known not to work. Additionally, the performance issue you referred to only affects IOPS and bandwidth more than 1% with the lpfc driver and that is because the queue depth it supports is much lower than for other SCSI HBAs, namely 3 instead of 64. Thanks, Bart. ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle 2018-01-18 17:20 ` Bart Van Assche (?) @ 2018-01-18 18:30 ` Mike Snitzer 2018-01-18 18:47 ` Bart Van Assche -1 siblings, 1 reply; 80+ messages in thread From: Mike Snitzer @ 2018-01-18 18:30 UTC (permalink / raw) To: Bart Van Assche Cc: dm-devel, hch, linux-kernel, linux-block, osandov, ming.lei, axboe On Thu, Jan 18 2018 at 12:20pm -0500, Bart Van Assche <Bart.VanAssche@wdc.com> wrote: > On Thu, 2018-01-18 at 12:03 -0500, Mike Snitzer wrote: > > On Thu, Jan 18 2018 at 11:50am -0500, > > Bart Van Assche <bart.vanassche@wdc.com> wrote: > > > My comments about the above are as follows: > > > - It can take up to q->rq_timeout jiffies after a .queue_rq() > > > implementation returned BLK_STS_RESOURCE before blk_mq_timeout_work() > > > gets called. However, it can happen that only a few milliseconds after > > > .queue_rq() returned BLK_STS_RESOURCE that the condition that caused > > > it to return BLK_STS_RESOURCE gets cleared. So the above approach can > > > result in long delays during which it will seem like the queue got > > > stuck. Additionally, I think that the block driver should decide how > > > long it takes before a queue is rerun and not the block layer core. > > > > So configure q->rq_timeout to be shorter? Which is configurable though > > blk_mq_tag_set's 'timeout' member. It apparently defaults to 30 * HZ. > > > > That is the problem with timeouts, there is generally no one size fits > > all. > > Sorry but I think that would be wrong. The delay after which a queue is rerun > should not be coupled to the request timeout. These two should be independent. That's fair. Not saying I think that is a fix anyway. > > > - The lockup that I reported only occurs with the dm driver but not any > > > other block driver. So why to modify the block layer core since this > > > can be fixed by modifying the dm driver? > > > > Hard to know it is only DM's blk-mq that is impacted. 
That is the only > > blk-mq driver that you're testing like this (that is also able to handle > > faults, etc). > > That's not correct. I'm also testing the SCSI core, which is one of the most > complicated block drivers. OK, but SCSI mq is part of the problem here. It is a snowflake that has more exotic reasons for returning BLK_STS_RESOURCE. > > > - A much simpler fix and a fix that is known to work exists, namely > > > inserting a blk_mq_delay_run_hw_queue() call in the dm driver. > > > > Because your "much simpler" fix actively hurts performance, as is > > detailed in this header: > > https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16&id=ec3eaf9a673106f66606896aed6ddd20180b02ec > > We are close to the start of the merge window so I think it's better to fall > back to an old approach that is known to work than to keep a new approach > that is known not to work. Additionally, the performance issue you referred > to only affects IOPS and bandwidth more than 1% with the lpfc driver and that > is because the queue depth it supports is much lower than for other SCSI HBAs, > namely 3 instead of 64. 1%!? Where are you getting that number? Ming has detailed more significant performance gains than 1%.. and not just on lpfc (though you keep seizing on lpfc because of the low queue_depth of 3). This is all very tiresome. I'm _really_ not interested in this debate any more. The specific case that causes the stall needs to be identified and a real fix needs to be developed. Ming is doing a lot of that hard work. Please contribute or at least stop pleading for your hack to be reintroduced. If at the end of the 4.16 release we still don't have a handle on the stall you're seeing I'll revisit this and likely revert to blindly kicking the queue after an arbitrary delay. But I'm willing to let this issue get more time without papering over it. Mike ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle @ 2018-01-18 18:47 ` Bart Van Assche 0 siblings, 0 replies; 80+ messages in thread From: Bart Van Assche @ 2018-01-18 18:47 UTC (permalink / raw) To: snitzer Cc: dm-devel, hch, linux-kernel, linux-block, osandov, ming.lei, axboe On Thu, 2018-01-18 at 13:30 -0500, Mike Snitzer wrote: > 1%!? Where are you getting that number? Ming has detailed more > significant performance gains than 1%.. and not just on lpfc (though you > keep seizing on lpfc because of the low queue_depth of 3). That's what I derived from the numbers you posted for null_blk. If Ming has posted performance results for other drivers than lpfc, please let me know where I can find these. I have not yet seen these numbers. > This is all very tiresome. Yes, this is tiresome. It is very annoying to me that others keep introducing so many regressions in such important parts of the kernel. It is also annoying to me that I get blamed if I report a regression instead of seeing that the regression gets fixed. Bart. ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle 2018-01-18 18:47 ` Bart Van Assche @ 2018-01-18 20:11 ` Jens Axboe -1 siblings, 0 replies; 80+ messages in thread From: Jens Axboe @ 2018-01-18 20:11 UTC (permalink / raw) To: Bart Van Assche, snitzer Cc: dm-devel, hch, linux-kernel, linux-block, osandov, ming.lei On 1/18/18 11:47 AM, Bart Van Assche wrote: >> This is all very tiresome. > > Yes, this is tiresome. It is very annoying to me that others keep > introducing so many regressions in such important parts of the kernel. > It is also annoying to me that I get blamed if I report a regression > instead of seeing that the regression gets fixed. I agree, it sucks that any change there introduces the regression. I'm fine with doing the delay insert again until a new patch is proven to be better. From the original topic of this email, we have conditions that can cause the driver to not be able to submit an IO. A set of those conditions can only happen if IO is in flight, and those cases we have covered just fine. Another set can potentially trigger without IO being in flight. These are cases where a non-device resource is unavailable at the time of submission. This might be iommu running out of space, for instance, or it might be a memory allocation of some sort. For these cases, we don't get any notification when the shortage clears. All we can do is ensure that we restart operations at some point in the future. We're SOL at that point, but we have to ensure that we make forward progress. That last set of conditions better not be a common occurrence, since performance is down the toilet at that point. I don't want to introduce hot path code to rectify it. Have the driver return if that happens in a way that is DIFFERENT from needing a normal restart. The driver knows if this is a resource that will become available when IO completes on this device or not. If we get that return, we have a generic run-again delay.
This basically becomes the same as doing the delay queue thing from DM, but just in a generic fashion. -- Jens Axboe ^ permalink raw reply [flat|nested] 80+ messages in thread
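Jens' proposed split, where the driver tells the core whether the shortage is tied to this device (so a completion is guaranteed to rerun the queue) or is a non-device resource (so nothing will), can be sketched as a tiny dispatch-status handler. The names here are illustrative; a split along these lines later landed upstream as BLK_STS_DEV_RESOURCE, but this sketch is not that implementation:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical return codes: OK, device-tied shortage, global shortage. */
enum sts { STS_OK, STS_DEV_RESOURCE, STS_RESOURCE };

struct outcome {
	bool rely_on_completion; /* RESTART-bit path: a completion will rerun */
	bool delayed_rerun;      /* generic run-again delay in the core */
};

/* What the core would do with each return from ->queue_rq(). */
static struct outcome handle_queue_rq(enum sts ret)
{
	struct outcome o = { false, false };

	switch (ret) {
	case STS_OK:
		break;
	case STS_DEV_RESOURCE:
		/* e.g. out of driver tags: an in-flight completion on this
		 * device is guaranteed, so the RESTART mechanism suffices. */
		o.rely_on_completion = true;
		break;
	case STS_RESOURCE:
		/* e.g. kmalloc(GFP_ATOMIC) or iommu space: no notification
		 * when it clears, so the core must rerun after a delay. */
		o.delayed_rerun = true;
		break;
	}
	return o;
}
```

The design point is that only the driver knows which kind of shortage it hit, so encoding that in the return value keeps the timed-rerun fallback off the hot path for the common, device-tied case.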
* Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle 2018-01-18 20:11 ` Jens Axboe (?) @ 2018-01-18 20:48 ` Mike Snitzer 2018-01-18 20:58 ` Bart Van Assche -1 siblings, 1 reply; 80+ messages in thread From: Mike Snitzer @ 2018-01-18 20:48 UTC (permalink / raw) To: Jens Axboe Cc: Bart Van Assche, dm-devel, hch, linux-kernel, linux-block, osandov, ming.lei On Thu, Jan 18 2018 at 3:11pm -0500, Jens Axboe <axboe@kernel.dk> wrote: > On 1/18/18 11:47 AM, Bart Van Assche wrote: > >> This is all very tiresome. > > > > Yes, this is tiresome. It is very annoying to me that others keep > > introducing so many regressions in such important parts of the kernel. > > It is also annoying to me that I get blamed if I report a regression > > instead of seeing that the regression gets fixed. > > I agree, it sucks that any change there introduces the regression. I'm > fine with doing the delay insert again until a new patch is proven to be > better. > > From the original topic of this email, we have conditions that can cause > the driver to not be able to submit an IO. A set of those conditions can > only happen if IO is in flight, and those cases we have covered just > fine. Another set can potentially trigger without IO being in flight. > These are cases where a non-device resource is unavailable at the time > of submission. This might be iommu running out of space, for instance, > or it might be a memory allocation of some sort. For these cases, we > don't get any notification when the shortage clears. All we can do is > ensure that we restart operations at some point in the future. We're SOL > at that point, but we have to ensure that we make forward progress. > > That last set of conditions better not be a a common occurence, since > performance is down the toilet at that point. I don't want to introduce > hot path code to rectify it. Have the driver return if that happens in a > way that is DIFFERENT from needing a normal restart. 
The driver knows if > this is a resource that will become available when IO completes on this > device or not. If we get that return, we have a generic run-again delay. > > This basically becomes the same as doing the delay queue thing from DM, > but just in a generic fashion. This is a bit confusing for me (as I see it we have 2 blk-mq drivers trying to collaborate, so your referring to "driver" lacks precision; but I could just be missing something)... For Bart's test the underlying scsi-mq driver is what is regularly hitting this case in __blk_mq_try_issue_directly(): if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q)) It certainly better not be the norm (Bart's test hammering on this aside). For starters, it'd be very useful to know if Bart is hitting the blk_mq_hctx_stopped() or blk_queue_quiesced() for this case that is triggering the use of blk_mq_sched_insert_request() -- I'd wager it is due to blk_queue_quiesced() but Bart _please_ try to figure it out. Anyway, in response to this case Bart would like the upper layer dm-mq driver to blk_mq_delay_run_hw_queue(). Certainly is quite the hammer. But that hammer aside, in general for this case, I'm concerned about: is it really correct to allow an already stopped/quiesced underlying queue to retain responsibility for processing the request? Or would the upper-layer dm-mq benefit from being able to retry the request on its terms (via a "DIFFERENT" return from blk-mq core)? Like this? The (ab)use of BLK_STS_DM_REQUEUE certainly seems fitting in this case but... 
(Bart please note that this patch applies on linux-dm.git's 'for-next';
which is just a merge of Jens' 4.16 tree and dm-4.16)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 74a4f237ba91..371a1b97bf56 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1781,16 +1781,11 @@ static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 	struct request_queue *q = rq->q;
 	bool run_queue = true;
 
-	/*
-	 * RCU or SRCU read lock is needed before checking quiesced flag.
-	 *
-	 * When queue is stopped or quiesced, ignore 'bypass_insert' from
-	 * blk_mq_request_direct_issue(), and return BLK_STS_OK to caller,
-	 * and avoid driver to try to dispatch again.
-	 */
+	/* RCU or SRCU read lock is needed before checking quiesced flag */
 	if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q)) {
 		run_queue = false;
-		bypass_insert = false;
+		if (bypass_insert)
+			return BLK_STS_DM_REQUEUE;
 		goto insert;
 	}
 
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index d8519ddd7e1a..2f554ea485c3 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -408,7 +408,7 @@ static blk_status_t dm_dispatch_clone_request(struct request *clone, struct requ
 	clone->start_time = jiffies;
 	r = blk_insert_cloned_request(clone->q, clone);
-	if (r != BLK_STS_OK && r != BLK_STS_RESOURCE)
+	if (r != BLK_STS_OK && r != BLK_STS_RESOURCE && r != BLK_STS_DM_REQUEUE)
 		/* must complete clone in terms of original request */
 		dm_complete_request(rq, r);
 	return r;
@@ -472,6 +472,7 @@ static void init_tio(struct dm_rq_target_io *tio, struct request *rq,
  * Returns:
  * DM_MAPIO_*             : the request has been processed as indicated
  * DM_MAPIO_REQUEUE       : the original request needs to be immediately requeued
+ * DM_MAPIO_DELAY_REQUEUE : the original request needs to be requeued after delay
  * < 0                    : the request was completed due to failure
  */
 static int map_request(struct dm_rq_target_io *tio)
@@ -500,11 +501,11 @@ static int map_request(struct dm_rq_target_io *tio)
 		trace_block_rq_remap(clone->q, clone, disk_devt(dm_disk(md)),
 				     blk_rq_pos(rq));
 		ret = dm_dispatch_clone_request(clone, rq);
-		if (ret == BLK_STS_RESOURCE) {
+		if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DM_REQUEUE) {
 			blk_rq_unprep_clone(clone);
 			tio->ti->type->release_clone_rq(clone);
 			tio->clone = NULL;
-			if (!rq->q->mq_ops)
+			if (ret == BLK_STS_DM_REQUEUE || !rq->q->mq_ops)
 				r = DM_MAPIO_DELAY_REQUEUE;
 			else
 				r = DM_MAPIO_REQUEUE;
@@ -741,6 +742,7 @@ static int dm_mq_init_request(struct blk_mq_tag_set *set, struct request *rq,
 static blk_status_t dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
 			  const struct blk_mq_queue_data *bd)
 {
+	int r;
 	struct request *rq = bd->rq;
 	struct dm_rq_target_io *tio = blk_mq_rq_to_pdu(rq);
 	struct mapped_device *md = tio->md;
@@ -768,10 +770,13 @@ static blk_status_t dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
 	tio->ti = ti;
 
 	/* Direct call is fine since .queue_rq allows allocations */
-	if (map_request(tio) == DM_MAPIO_REQUEUE) {
+	r = map_request(tio);
+	if (r == DM_MAPIO_REQUEUE || r == DM_MAPIO_DELAY_REQUEUE) {
 		/* Undo dm_start_request() before requeuing */
 		rq_end_stats(md, rq);
 		rq_completed(md, rq_data_dir(rq), false);
+		if (r == DM_MAPIO_DELAY_REQUEUE)
+			blk_mq_delay_run_hw_queue(hctx, 100/*ms*/);
 		return BLK_STS_RESOURCE;
 	}

^ permalink raw reply related [flat|nested] 80+ messages in thread
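Stripped of the kernel plumbing, the requeue policy Mike's patch encodes can be reduced to a small user-space sketch. The enum values and the helper name here are stand-ins for illustration, not the kernel's definitions:

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for the kernel's blk_status_t / DM_MAPIO_* values. */
enum blk_sts { STS_OK, STS_RESOURCE, STS_DM_REQUEUE };
enum mapio   { MAPIO_OK, MAPIO_REQUEUE, MAPIO_DELAY_REQUEUE };

/*
 * Mirrors the map_request() hunk: a plain resource shortage from an mq
 * lower queue is requeued immediately (an IO completion will restart the
 * queue), while a stopped/quiesced lower queue (STS_DM_REQUEUE) or a
 * legacy (!mq) lower queue gets a delayed requeue, because no completion
 * is guaranteed to rerun the queue.
 */
static enum mapio classify_dispatch(enum blk_sts ret, bool lower_q_is_mq)
{
	if (ret == STS_OK)
		return MAPIO_OK;
	if (ret == STS_DM_REQUEUE || !lower_q_is_mq)
		return MAPIO_DELAY_REQUEUE;
	return MAPIO_REQUEUE;
}
```

In other words, the distinct return tells dm-mq that the lower queue was stopped or quiesced, so an immediate requeue would just spin; only a delayed run can make forward progress.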
* Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle @ 2018-01-18 20:58 ` Bart Van Assche 0 siblings, 0 replies; 80+ messages in thread From: Bart Van Assche @ 2018-01-18 20:58 UTC (permalink / raw) To: snitzer, axboe Cc: dm-devel, hch, linux-kernel, linux-block, osandov, ming.lei On Thu, 2018-01-18 at 15:48 -0500, Mike Snitzer wrote: > For Bart's test the underlying scsi-mq driver is what is regularly > hitting this case in __blk_mq_try_issue_directly(): > > if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q)) Hello Mike, That code path is not the code path that triggered the lockups that I reported during the past days. These lockups were all triggered by incorrect handling of .queue_rq() returning BLK_STS_RESOURCE. Bart. ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle 2018-01-18 20:58 ` Bart Van Assche @ 2018-01-18 21:23 ` Mike Snitzer -1 siblings, 0 replies; 80+ messages in thread From: Mike Snitzer @ 2018-01-18 21:23 UTC (permalink / raw) To: Bart Van Assche Cc: axboe, dm-devel, hch, linux-kernel, linux-block, osandov, ming.lei On Thu, Jan 18 2018 at 3:58P -0500, Bart Van Assche <Bart.VanAssche@wdc.com> wrote: > On Thu, 2018-01-18 at 15:48 -0500, Mike Snitzer wrote: > > For Bart's test the underlying scsi-mq driver is what is regularly > > hitting this case in __blk_mq_try_issue_directly(): > > > > if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q)) > > Hello Mike, > > That code path is not the code path that triggered the lockups that I reported > during the past days. If you're hitting blk_mq_sched_insert_request() then you most certainly are hitting that code path. If you aren't then what was your earlier email going on about? https://www.redhat.com/archives/dm-devel/2018-January/msg00372.html If you were just focusing on that as one possible reason, that isn't very helpful. By this point you really should _know_ what is triggering the stall based on the code paths taken. Please use ftrace's function_graph tracer if need be. > These lockups were all triggered by incorrect handling of > .queue_rq() returning BLK_STS_RESOURCE. Please be precise, dm_mq_queue_rq()'s return of BLK_STS_RESOURCE? "Incorrect" because it no longer runs blk_mq_delay_run_hw_queue()? Please try to do more work analyzing the test case that only you can easily run (due to srp_test being a PITA). And less time lobbying for a change that you don't understand to _really_ be correct. We have time to get this right, please stop hyperventilating about "regressions". Thanks, Mike ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle 2018-01-18 21:23 ` Mike Snitzer (?) @ 2018-01-18 21:37 ` Laurence Oberman -1 siblings, 0 replies; 80+ messages in thread From: Laurence Oberman @ 2018-01-18 21:37 UTC (permalink / raw) To: Mike Snitzer, Bart Van Assche Cc: axboe, dm-devel, hch, linux-kernel, linux-block, osandov, ming.lei On Thu, 2018-01-18 at 16:23 -0500, Mike Snitzer wrote: > On Thu, Jan 18 2018 at 3:58P -0500, > Bart Van Assche <Bart.VanAssche@wdc.com> wrote: > > > On Thu, 2018-01-18 at 15:48 -0500, Mike Snitzer wrote: > > > For Bart's test the underlying scsi-mq driver is what is > > > regularly > > > hitting this case in __blk_mq_try_issue_directly(): > > > > > > if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q)) > > > > Hello Mike, > > > > That code path is not the code path that triggered the lockups that > > I reported > > during the past days. > > If you're hitting blk_mq_sched_insert_request() then you most > certainly > are hitting that code path. > > If you aren't then what was your earlier email going on about? > https://www.redhat.com/archives/dm-devel/2018-January/msg00372.html > > If you were just focusing on that as one possible reason, that isn't > very helpful. By this point you really should _know_ what is > triggering > the stall based on the code paths taken. Please use ftrace's > function_graph tracer if need be. > > > These lockups were all triggered by incorrect handling of > > .queue_rq() returning BLK_STS_RESOURCE. > > Please be precise, dm_mq_queue_rq()'s return of BLK_STS_RESOURCE? > "Incorrect" because it no longer runs blk_mq_delay_run_hw_queue()? > > Please try to do more work analyzing the test case that only you can > easily run (due to srp_test being a PITA). And less time lobbying > for > a change that you don't understand to _really_ be correct. > > We have time to get this right, please stop hyperventilating about > "regressions". 
> > Thanks, > Mike Hello Bart I have run a good few loops of 02-mq and it's stable for me on your tree. I am not running the entire disconnect/re-connect loops and un-mounts etc. for good reason. I have 35 LUNs, so it's very impactful to lose them and have them come back all the time. Anyway, I am very happy to try to reproduce this in-house so Mike and Ming can focus on it, but I need to know if all I need to do is loop over 02-mq over and over. Also please let me know what debugfs and sysfs data to capture, and I am happy to try to help move this along. Regards Laurence ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [dm-devel] [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle @ 2018-01-18 21:39 ` Bart Van Assche 0 siblings, 0 replies; 80+ messages in thread From: Bart Van Assche @ 2018-01-18 21:39 UTC (permalink / raw) To: snitzer Cc: dm-devel, linux-kernel, hch, linux-block, osandov, axboe, ming.lei On Thu, 2018-01-18 at 16:23 -0500, Mike Snitzer wrote: > On Thu, Jan 18 2018 at 3:58P -0500, > Bart Van Assche <Bart.VanAssche@wdc.com> wrote: > > > On Thu, 2018-01-18 at 15:48 -0500, Mike Snitzer wrote: > > > For Bart's test the underlying scsi-mq driver is what is regularly > > > hitting this case in __blk_mq_try_issue_directly(): > > > > > > if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q)) > > > > These lockups were all triggered by incorrect handling of > > .queue_rq() returning BLK_STS_RESOURCE. > > Please be precise, dm_mq_queue_rq()'s return of BLK_STS_RESOURCE? > "Incorrect" because it no longer runs blk_mq_delay_run_hw_queue()? In what I wrote I was referring to both dm_mq_queue_rq() and scsi_queue_rq(). With "incorrect" I meant that queue lockups are introduced that make user space processes unkillable. That's a severe bug. > Please try to do more work analyzing the test case that only you can > easily run (due to srp_test being a PITA). It is not correct that I'm the only one who is able to run that software. Anyone who is willing to merge the latest SRP initiator and target driver patches in his or her tree can run that software in any VM. I'm working hard on getting the patches upstream that make it possible to run the srp-test software on a setup that is not equipped with InfiniBand hardware. > We have time to get this right, please stop hyperventilating about > "regressions". Sorry Mike but that's something I consider as an unfair comment. If Ming and you work on patches together, it's your job to make sure that no regressions are introduced. 
Instead of blaming me because I report these regressions you should be grateful that I take the time and effort to report these regressions early. And since you are employed by a large organization that sells Linux support services, your employer should invest in developing test cases that reach a higher coverage of the dm, SCSI and block layer code. I don't think that it's normal that my tests discovered several issues that were not discovered by Red Hat's internal test suite. That's something Red Hat has to address. Bart. ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [dm-devel] [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle 2018-01-18 21:39 ` Bart Van Assche @ 2018-01-18 21:45 ` Laurence Oberman -1 siblings, 0 replies; 80+ messages in thread From: Laurence Oberman @ 2018-01-18 21:45 UTC (permalink / raw) To: Bart Van Assche Cc: snitzer, dm-devel, linux-kernel, hch, linux-block, osandov, axboe, ming.lei [-- Attachment #1: Type: text/plain, Size: 2988 bytes --] Hello Bart Firstly let me start with: You have always been kind, patient and helpful to me, and myself the same to you, so I am not keen to get in the middle of this. But it's not true about Red Hat, because I work very hard on this and I very often find bugs you are not seeing, so Red Hat is adding value here. I emailed you a number of times asking if you can provide me the exact steps, but not via your srp-test suite. I have a setup that is not conducive to running your loop disconnects etc., and if you are seeing a stall on multiple loops of 02-mq I should be able to reproduce it without having to run your test suite. Please let me know how I can help Laurence On Thu, Jan 18, 2018 at 4:39 PM, Bart Van Assche <Bart.VanAssche@wdc.com> wrote: > On Thu, 2018-01-18 at 16:23 -0500, Mike Snitzer wrote: > > On Thu, Jan 18 2018 at 3:58P -0500, > > Bart Van Assche <Bart.VanAssche@wdc.com> wrote: > > > > > On Thu, 2018-01-18 at 15:48 -0500, Mike Snitzer wrote: > > > > For Bart's test the underlying scsi-mq driver is what is regularly > > > > hitting this case in __blk_mq_try_issue_directly(): > > > > > > > > if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q)) > > > > > > These lockups were all triggered by incorrect handling of > > > .queue_rq() returning BLK_STS_RESOURCE. > > > > Please be precise, dm_mq_queue_rq()'s return of BLK_STS_RESOURCE? > > "Incorrect" because it no longer runs blk_mq_delay_run_hw_queue()? > > In what I wrote I was referring to both dm_mq_queue_rq() and > scsi_queue_rq(). 
> With "incorrect" I meant that queue lockups are introduced that make user > space processes unkillable. That's a severe bug. > > > Please try to do more work analyzing the test case that only you can > > easily run (due to srp_test being a PITA). > > It is not correct that I'm the only one who is able to run that software. > Anyone who is willing to merge the latest SRP initiator and target driver > patches in his or her tree can run that software in > any VM. I'm working hard > on getting the patches upstream that make it possible to run the srp-test > software on a setup that is not equipped with InfiniBand hardware. > > > We have time to get this right, please stop hyperventilating about > > "regressions". > > Sorry Mike but that's something I consider as an unfair comment. If Ming > and > you work on patches together, it's your job to make sure that no > regressions > are introduced. Instead of blaming me because I report these regressions > you > should be grateful that I take the time and effort to report these > regressions > early. And since you are employed by a large organization that sells Linux > support services, your employer should invest in developing test cases that > reach a higher coverage of the dm, SCSI and block layer code. I don't think > that it's normal that my tests discovered several issues that were not > discovered by Red Hat's internal test suite. That's something Red Hat has > to > address. > > Bart. [-- Attachment #2: Type: text/html, Size: 3929 bytes --] ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle 2018-01-18 21:39 ` Bart Van Assche (?) (?) @ 2018-01-18 22:01 ` Mike Snitzer 2018-01-18 22:18 ` Laurence Oberman 2018-01-18 22:20 ` Bart Van Assche -1 siblings, 2 replies; 80+ messages in thread From: Mike Snitzer @ 2018-01-18 22:01 UTC (permalink / raw) To: Bart Van Assche Cc: dm-devel, linux-kernel, hch, linux-block, osandov, axboe, ming.lei On Thu, Jan 18 2018 at 4:39pm -0500, Bart Van Assche <Bart.VanAssche@wdc.com> wrote: > On Thu, 2018-01-18 at 16:23 -0500, Mike Snitzer wrote: > > On Thu, Jan 18 2018 at 3:58P -0500, > > Bart Van Assche <Bart.VanAssche@wdc.com> wrote: > > > > > On Thu, 2018-01-18 at 15:48 -0500, Mike Snitzer wrote: > > > > For Bart's test the underlying scsi-mq driver is what is regularly > > > > hitting this case in __blk_mq_try_issue_directly(): > > > > > > > > if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q)) > > > > > > These lockups were all triggered by incorrect handling of > > > .queue_rq() returning BLK_STS_RESOURCE. > > > > Please be precise, dm_mq_queue_rq()'s return of BLK_STS_RESOURCE? > > "Incorrect" because it no longer runs blk_mq_delay_run_hw_queue()? > > In what I wrote I was referring to both dm_mq_queue_rq() and scsi_queue_rq(). > With "incorrect" I meant that queue lockups are introduced that make user > space processes unkillable. That's a severe bug. And yet Laurence cannot reproduce any such lockups with your test... Are you absolutely certain this patch doesn't help you? https://patchwork.kernel.org/patch/10174037/ If it doesn't then that is actually very useful to know. > > We have time to get this right, please stop hyperventilating about > > "regressions". > > Sorry Mike but that's something I consider as an unfair comment. If Ming and > you work on patches together, it's your job to make sure that no regressions > are introduced. 
Instead of blaming me because I report these regressions you > should be grateful that I take the time and effort to report these regressions > early. And since you are employed by a large organization that sells Linux > support services, your employer should invest in developing test cases that > reach a higher coverage of the dm, SCSI and block layer code. I don't think > that it's normal that my tests discovered several issues that were not > discovered by Red Hat's internal test suite. That's something Red Hat has to > address. You have no self-awareness of just how myopic you're being about this. I'm not ignoring or blaming you for your test no longer passing. Far from it. I very much want to fix this. But I want it fixed in a way that doesn't paper over the real bug(s) while also introducing blind queue runs that compromise the blk-mq RESTART code's ability to work as intended. I'd have thought you could appreciate this. We need a root cause on this, not hand-waving justifications on why problematic delayed queue runs are correct. Please just focus on helping Laurence get his very capable testbed to reproduce this issue. Once we can reproduce these "unkillable" "stalls" in-house it'll be _much_ easier to analyze and fix. Thanks, Mike ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle 2018-01-18 22:01 ` Mike Snitzer @ 2018-01-18 22:18 ` Laurence Oberman 2018-01-18 22:20 ` Laurence Oberman 2018-01-18 22:24 ` Bart Van Assche 2018-01-18 22:20 ` Bart Van Assche 1 sibling, 2 replies; 80+ messages in thread From: Laurence Oberman @ 2018-01-18 22:18 UTC (permalink / raw) To: Mike Snitzer, Bart Van Assche Cc: dm-devel, linux-kernel, hch, linux-block, osandov, axboe, ming.lei On Thu, 2018-01-18 at 17:01 -0500, Mike Snitzer wrote: > On Thu, Jan 18 2018 at 4:39pm -0500, > Bart Van Assche <Bart.VanAssche@wdc.com> wrote: > > > On Thu, 2018-01-18 at 16:23 -0500, Mike Snitzer wrote: > > > On Thu, Jan 18 2018 at 3:58P -0500, > > > Bart Van Assche <Bart.VanAssche@wdc.com> wrote: > > > > > > > On Thu, 2018-01-18 at 15:48 -0500, Mike Snitzer wrote: > > > > > For Bart's test the underlying scsi-mq driver is what is > > > > > regularly > > > > > hitting this case in __blk_mq_try_issue_directly(): > > > > > > > > > > if (blk_mq_hctx_stopped(hctx) || > > > > > blk_queue_quiesced(q)) > > > > > > > > These lockups were all triggered by incorrect handling of > > > > .queue_rq() returning BLK_STS_RESOURCE. > > > > > > Please be precise, dm_mq_queue_rq()'s return of BLK_STS_RESOURCE? > > > "Incorrect" because it no longer runs > > > blk_mq_delay_run_hw_queue()? > > > > In what I wrote I was referring to both dm_mq_queue_rq() and > > scsi_queue_rq(). > > With "incorrect" I meant that queue lockups are introduced that > > make user > > space processes unkillable. That's a severe bug. > > And yet Laurence cannot reproduce any such lockups with your test... > > Are you absolutely certain this patch doesn't help you? > https://patchwork.kernel.org/patch/10174037/ > > If it doesn't then that is actually very useful to know. > > > > We have time to get this right, please stop hyperventilating > > > about > > > "regressions". > > > > Sorry Mike but that's something I consider as an unfair comment. 
If > > Ming and > > you work on patches together, it's your job to make sure that no > > regressions > > are introduced. Instead of blaming me because I report these > > regressions you > > should be grateful that I take the time and effort to report these > > regressions > > early. And since you are employed by a large organization that > > sells Linux > > support services, your employer should invest in developing test > > cases that > > reach a higher coverage of the dm, SCSI and block layer code. I > > don't think > > that it's normal that my tests discovered several issues that were > > not > > discovered by Red Hat's internal test suite. That's something Red > > Hat has to > > address. > > You have no self-awareness of just how mypoic you're being about > this. > > I'm not ignoring or blaming you for your test no longer passing. Far > from it. I very much want to fix this. But I want it fixed in a way > that doesn't paper over the real bug(s) while also introducing blind > queue runs that compromise the blk-mq RESTART code's ability to work > as > intended. > > I'd have thought you could appreciate this. We need a root cause on > this, not hand-waving justifications on why problematic delayed queue > runs are correct. > > Please just focus on helping Laurence get his very capable testbed to > reproduce this issue. Once we can reproduce these "unkillable" > "stalls" > in-house it'll be _much_ easier to analyze and fix. > > Thanks, > Mike Hello Bart OK, I ran 5 at once of 5 separate mount points. I am using 4k block sizes Its solid consistent for me. No stalls no gaps. [global] name=02-mq filename=fio-output-02-mq.txt rw=randwrite verify=md5 ;rwmixread=60 ;rwmixwrite=40 bs=4K ;direct=1 ;numjobs=4 ;time_based=1 runtime=120 [file1] size=3G ioengine=libaio iodepth=16 I watch I/O and I see it ramp up but fio still runs and it kind of shuts down. 
#!/bin/bash for i in 1 2 3 4 5 do cd /data-$i fio /root/srp-test.lobe/fio_tests/02-mq 1>/data-$i.out 2>&1 & done #Time cpu sys inter ctxsw KBRead Reads KBWrit Writes 17:16:09 34 12 34431 10393 0 0 249872 36094 17:16:10 31 10 13001 2679 0 0 57652 7980 17:16:11 32 11 19473 4416 0 0 143248 17362 17:16:12 31 9 8068 1606 0 0 20088 2026 17:16:13 31 9 7718 1518 0 0 15908 1354 17:16:14 36 14 41132 9710 0 0 686216 46661 17:16:15 39 18 63622 21033 0 0 1246K 75108 17:16:16 38 16 76114 9013 0 0 1146K 82338 17:16:17 33 11 31941 2735 0 0 237340 25034 17:16:18 36 14 23900 4982 0 1 1567K 43244 17:16:19 37 15 55884 4974 0 4 1003K 67253 17:16:20 28 12 7542 4975 0 0 1630K 3266 17:16:21 8 6 14650 34721 0 0 2535K 21206 17:16:22 2 2 6655 15839 0 0 2319K 9897 17:16:23 11 11 37305 134030 0 0 1275K 87010 17:16:24 23 22 59266 119350 6560 1640 1102K 78034 17:16:25 21 17 65699 144462 148052 37013 159900 22120 17:16:26 14 9 80700 164034 216588 54147 4 1 # <----CPU[HYPER]-----><----------Disks-----------> #Time cpu sys inter ctxsw KBRead Reads KBWrit Writes 17:16:27 14 9 78699 162095 214412 53603 0 0 17:16:28 14 9 75895 155352 204932 51233 0 0 17:16:29 14 9 75377 161871 214136 53534 0 0 17:16:30 16 11 79298 155967 206164 51541 0 0 17:16:31 14 8 82281 165266 218652 54662 0 0 17:16:32 14 9 88635 163302 215876 53970 0 0 17:16:33 14 8 88755 168958 223568 55892 0 0 17:16:34 15 9 82073 154321 203936 50984 0 0 17:16:35 17 10 87004 161467 213308 53327 0 0 17:16:36 15 9 86198 161247 212940 53235 4 1 17:16:37 15 9 83429 162890 215276 53819 0 0 17:16:38 16 10 83272 160611 212264 53066 20 2 17:16:39 16 10 93850 168179 222376 55594 0 0 17:16:40 15 9 86801 167507 221372 55343 0 0 17:16:41 14 8 82688 171249 226560 56640 0 0 17:16:42 14 8 86202 159171 210344 52586 0 0 17:16:43 16 10 86063 156009 206328 51582 0 0 17:16:44 16 10 90927 153961 203584 50896 0 0 17:16:45 15 9 95084 156370 206844 51710 132 13 17:16:46 14 8 85572 158469 209300 52326 4 1 17:16:47 14 8 92075 158469 209600 52400 0 0 17:16:48 16 10 95469 
154772 204176 51044 0 0 # <----CPU[HYPER]-----><----------Disks-----------> #Time cpu sys inter ctxsw KBRead Reads KBWrit Writes 17:16:49 15 10 93856 154182 203568 50897 5 5 17:16:50 15 10 92308 151964 200680 50170 0 0 17:16:51 16 10 94035 155076 204868 51217 0 0 17:16:52 15 9 93949 164969 217980 54495 0 0 17:16:53 14 9 85744 149553 197428 49357 0 0 17:16:54 14 9 83064 155788 205684 51421 0 0 17:16:55 16 10 83579 166084 218480 54620 0 0 17:16:56 15 9 74766 158728 208556 52139 4 1 17:16:57 14 9 82944 162720 215116 53779 0 0 17:16:58 12 7 88529 163814 216440 54110 0 0 17:16:59 14 7 85936 166723 220372 55093 4 1 17:17:00 13 8 86701 153728 203052 50763 0 0 17:17:01 14 9 92859 155963 205972 51493 0 0 17:17:02 13 8 85624 143746 189756 47439 0 0 17:17:03 13 8 85029 142540 188044 47010 0 0 17:17:04 14 8 87291 144790 191148 47788 0 0 17:17:05 14 8 81922 144342 190544 47636 0 0 17:17:06 14 8 81587 161043 212576 53144 8 2 17:17:07 15 8 85080 142285 187688 46922 0 0 17:17:08 13 8 86980 150246 197852 49463 0 0 17:17:09 12 8 81388 144335 190312 47578 4 1 17:17:10 13 8 88436 146957 194040 48510 0 0 # <----CPU[HYPER]-----><----------Disks-----------> #Time cpu sys inter ctxsw KBRead Reads KBWrit Writes 17:17:11 11 7 78997 137847 181828 45457 0 0 17:17:12 12 8 85709 148847 196708 49177 0 0 17:17:13 17 10 97228 170646 225528 56382 0 0 17:17:14 14 9 96032 173474 229400 57350 0 0 17:17:15 15 9 95340 167951 221848 55462 0 0 17:17:16 16 10 87392 157489 207748 51937 38 5 17:17:17 16 10 84756 152753 201716 50429 0 0 17:17:18 17 10 85109 153000 202080 50520 0 0 17:17:19 16 10 92109 164514 217396 54349 80 5 17:17:20 15 9 91342 155674 205804 51451 0 0 17:17:21 12 7 73921 126577 167060 41765 0 0 17:17:22 13 8 76916 126701 167176 41794 0 0 17:17:23 13 8 85187 151074 199768 49942 0 0 17:17:24 15 9 85737 153664 203168 50792 0 0 17:17:25 15 9 87113 147636 195084 48771 0 0 17:17:26 14 8 89440 157433 208116 52029 4 1 17:17:27 14 9 95290 172224 227036 56759 0 0 17:17:28 15 10 93918 167611 220164 
55041 0 0 17:17:29 13 8 82404 149274 197312 49328 0 0 17:17:30 10 7 75961 138276 182668 45667 0 0 17:17:31 12 8 84899 151829 200764 50191 0 0 17:17:32 13 8 81386 141091 186444 46611 0 0 17:17:33 15 9 100K 167489 221640 55410 0 0 17:17:34 14 9 89601 158765 210032 52508 0 0 17:17:35 14 9 92440 155287 205216 51304 0 0 17:17:36 15 10 90042 155934 206168 51542 4 1 17:17:37 14 9 91670 164456 217684 54421 0 0 17:17:38 14 9 78612 167368 221348 55337 0 0 17:17:39 15 9 91325 162599 214616 53654 0 0 17:17:40 14 9 79112 151595 199984 50000 21 8 17:17:41 9 3 32565 60936 79832 19964 32 12 ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle 2018-01-18 22:18 ` Laurence Oberman @ 2018-01-18 22:20 ` Laurence Oberman 2018-01-18 22:24 ` Bart Van Assche 1 sibling, 0 replies; 80+ messages in thread From: Laurence Oberman @ 2018-01-18 22:20 UTC (permalink / raw) To: Mike Snitzer, Bart Van Assche Cc: dm-devel, linux-kernel, hch, linux-block, osandov, axboe, ming.lei [-- Attachment #1: Type: text/plain, Size: 10072 bytes --] That should have said fio still runs and it only shuts down when done I simply dont see the stalls On Thu, Jan 18, 2018 at 5:18 PM, Laurence Oberman <loberman@redhat.com> wrote: > On Thu, 2018-01-18 at 17:01 -0500, Mike Snitzer wrote: > > On Thu, Jan 18 2018 at 4:39pm -0500, > > Bart Van Assche <Bart.VanAssche@wdc.com> wrote: > > > > > On Thu, 2018-01-18 at 16:23 -0500, Mike Snitzer wrote: > > > > On Thu, Jan 18 2018 at 3:58P -0500, > > > > Bart Van Assche <Bart.VanAssche@wdc.com> wrote: > > > > > > > > > On Thu, 2018-01-18 at 15:48 -0500, Mike Snitzer wrote: > > > > > > For Bart's test the underlying scsi-mq driver is what is > > > > > > regularly > > > > > > hitting this case in __blk_mq_try_issue_directly(): > > > > > > > > > > > > if (blk_mq_hctx_stopped(hctx) || > > > > > > blk_queue_quiesced(q)) > > > > > > > > > > These lockups were all triggered by incorrect handling of > > > > > .queue_rq() returning BLK_STS_RESOURCE. > > > > > > > > Please be precise, dm_mq_queue_rq()'s return of BLK_STS_RESOURCE? > > > > "Incorrect" because it no longer runs > > > > blk_mq_delay_run_hw_queue()? > > > > > > In what I wrote I was referring to both dm_mq_queue_rq() and > > > scsi_queue_rq(). > > > With "incorrect" I meant that queue lockups are introduced that > > > make user > > > space processes unkillable. That's a severe bug. > > > > And yet Laurence cannot reproduce any such lockups with your test... > > > > Are you absolutely certain this patch doesn't help you? 
> > https://patchwork.kernel.org/patch/10174037/ > > > > If it doesn't then that is actually very useful to know. > > > > > > We have time to get this right, please stop hyperventilating > > > > about > > > > "regressions". > > > > > > Sorry Mike but that's something I consider as an unfair comment. If > > > Ming and > > > you work on patches together, it's your job to make sure that no > > > regressions > > > are introduced. Instead of blaming me because I report these > > > regressions you > > > should be grateful that I take the time and effort to report these > > > regressions > > > early. And since you are employed by a large organization that > > > sells Linux > > > support services, your employer should invest in developing test > > > cases that > > > reach a higher coverage of the dm, SCSI and block layer code. I > > > don't think > > > that it's normal that my tests discovered several issues that were > > > not > > > discovered by Red Hat's internal test suite. That's something Red > > > Hat has to > > > address. > > > > You have no self-awareness of just how myopic you're being about > > this. > > > > I'm not ignoring or blaming you for your test no longer passing. Far > > from it. I very much want to fix this. But I want it fixed in a way > > that doesn't paper over the real bug(s) while also introducing blind > > queue runs that compromise the blk-mq RESTART code's ability to work > > as > > intended. > > > > I'd have thought you could appreciate this. We need a root cause on > > this, not hand-waving justifications on why problematic delayed queue > > runs are correct. > > > > Please just focus on helping Laurence get his very capable testbed to > > reproduce this issue. Once we can reproduce these "unkillable" > > "stalls" > > in-house it'll be _much_ easier to analyze and fix. > > > > Thanks, > > Mike > > Hello Bart > > OK, I ran 5 at once of 5 separate mount points. > I am using 4k block sizes > Its solid consistent for me. No stalls no gaps. 
> > [global] > name=02-mq > filename=fio-output-02-mq.txt > rw=randwrite > verify=md5 > ;rwmixread=60 > ;rwmixwrite=40 > bs=4K > ;direct=1 > ;numjobs=4 > ;time_based=1 > runtime=120 > > [file1] > size=3G > ioengine=libaio > iodepth=16 > > I watch I/O and I see it ramp up but fio still runs and it kind of > shuts down. > > #!/bin/bash > for i in 1 2 3 4 5 > do > cd /data-$i > fio /root/srp-test.lobe/fio_tests/02-mq 1>/data-$i.out 2>&1 & > done > > > #Time cpu sys inter ctxsw KBRead Reads KBWrit Writes > > 17:16:09 34 12 34431 10393 0 0 249872 36094 > 17:16:10 31 10 13001 2679 0 0 57652 7980 > 17:16:11 32 11 19473 4416 0 0 143248 17362 > 17:16:12 31 9 8068 1606 0 0 20088 2026 > 17:16:13 31 9 7718 1518 0 0 15908 1354 > 17:16:14 36 14 41132 9710 0 0 686216 46661 > 17:16:15 39 18 63622 21033 0 0 1246K 75108 > 17:16:16 38 16 76114 9013 0 0 1146K 82338 > 17:16:17 33 11 31941 2735 0 0 237340 25034 > 17:16:18 36 14 23900 4982 0 1 1567K 43244 > 17:16:19 37 15 55884 4974 0 4 1003K 67253 > 17:16:20 28 12 7542 4975 0 0 1630K 3266 > 17:16:21 8 6 14650 34721 0 0 2535K 21206 > 17:16:22 2 2 6655 15839 0 0 2319K 9897 > 17:16:23 11 11 37305 134030 0 0 1275K 87010 > 17:16:24 23 22 59266 119350 6560 1640 1102K 78034 > 17:16:25 21 17 65699 144462 148052 37013 159900 22120 > 17:16:26 14 9 80700 164034 216588 54147 4 1 > # <----CPU[HYPER]-----><----------Disks-----------> > #Time cpu sys inter ctxsw KBRead Reads KBWrit Writes > 17:16:27 14 9 78699 162095 214412 53603 0 0 > 17:16:28 14 9 75895 155352 204932 51233 0 0 > 17:16:29 14 9 75377 161871 214136 53534 0 0 > 17:16:30 16 11 79298 155967 206164 51541 0 0 > 17:16:31 14 8 82281 165266 218652 54662 0 0 > 17:16:32 14 9 88635 163302 215876 53970 0 0 > 17:16:33 14 8 88755 168958 223568 55892 0 0 > 17:16:34 15 9 82073 154321 203936 50984 0 0 > 17:16:35 17 10 87004 161467 213308 53327 0 0 > 17:16:36 15 9 86198 161247 212940 53235 4 1 > 17:16:37 15 9 83429 162890 215276 53819 0 0 > 17:16:38 16 10 83272 160611 212264 53066 20 2 > 17:16:39 16 10 
93850 168179 222376 55594 0 0 > 17:16:40 15 9 86801 167507 221372 55343 0 0 > 17:16:41 14 8 82688 171249 226560 56640 0 0 > 17:16:42 14 8 86202 159171 210344 52586 0 0 > 17:16:43 16 10 86063 156009 206328 51582 0 0 > 17:16:44 16 10 90927 153961 203584 50896 0 0 > 17:16:45 15 9 95084 156370 206844 51710 132 13 > 17:16:46 14 8 85572 158469 209300 52326 4 1 > 17:16:47 14 8 92075 158469 209600 52400 0 0 > 17:16:48 16 10 95469 154772 204176 51044 0 0 > # <----CPU[HYPER]-----><----------Disks-----------> > #Time cpu sys inter ctxsw KBRead Reads KBWrit Writes > 17:16:49 15 10 93856 154182 203568 50897 5 5 > 17:16:50 15 10 92308 151964 200680 50170 0 0 > 17:16:51 16 10 94035 155076 204868 51217 0 0 > 17:16:52 15 9 93949 164969 217980 54495 0 0 > 17:16:53 14 9 85744 149553 197428 49357 0 0 > 17:16:54 14 9 83064 155788 205684 51421 0 0 > 17:16:55 16 10 83579 166084 218480 54620 0 0 > 17:16:56 15 9 74766 158728 208556 52139 4 1 > 17:16:57 14 9 82944 162720 215116 53779 0 0 > 17:16:58 12 7 88529 163814 216440 54110 0 0 > 17:16:59 14 7 85936 166723 220372 55093 4 1 > 17:17:00 13 8 86701 153728 203052 50763 0 0 > 17:17:01 14 9 92859 155963 205972 51493 0 0 > 17:17:02 13 8 85624 143746 189756 47439 0 0 > 17:17:03 13 8 85029 142540 188044 47010 0 0 > 17:17:04 14 8 87291 144790 191148 47788 0 0 > 17:17:05 14 8 81922 144342 190544 47636 0 0 > 17:17:06 14 8 81587 161043 212576 53144 8 2 > 17:17:07 15 8 85080 142285 187688 46922 0 0 > 17:17:08 13 8 86980 150246 197852 49463 0 0 > 17:17:09 12 8 81388 144335 190312 47578 4 1 > 17:17:10 13 8 88436 146957 194040 48510 0 0 > # <----CPU[HYPER]-----><----------Disks-----------> > #Time cpu sys inter ctxsw KBRead Reads KBWrit Writes > 17:17:11 11 7 78997 137847 181828 45457 0 0 > 17:17:12 12 8 85709 148847 196708 49177 0 0 > 17:17:13 17 10 97228 170646 225528 56382 0 0 > 17:17:14 14 9 96032 173474 229400 57350 0 0 > 17:17:15 15 9 95340 167951 221848 55462 0 0 > 17:17:16 16 10 87392 157489 207748 51937 38 5 > 17:17:17 16 10 84756 152753 201716 
50429 0 0 > 17:17:18 17 10 85109 153000 202080 50520 0 0 > 17:17:19 16 10 92109 164514 217396 54349 80 5 > 17:17:20 15 9 91342 155674 205804 51451 0 0 > 17:17:21 12 7 73921 126577 167060 41765 0 0 > 17:17:22 13 8 76916 126701 167176 41794 0 0 > 17:17:23 13 8 85187 151074 199768 49942 0 0 > 17:17:24 15 9 85737 153664 203168 50792 0 0 > 17:17:25 15 9 87113 147636 195084 48771 0 0 > 17:17:26 14 8 89440 157433 208116 52029 4 1 > 17:17:27 14 9 95290 172224 227036 56759 0 0 > 17:17:28 15 10 93918 167611 220164 55041 0 0 > 17:17:29 13 8 82404 149274 197312 49328 0 0 > 17:17:30 10 7 75961 138276 182668 45667 0 0 > 17:17:31 12 8 84899 151829 200764 50191 0 0 > 17:17:32 13 8 81386 141091 186444 46611 0 0 > 17:17:33 15 9 100K 167489 221640 55410 0 0 > 17:17:34 14 9 89601 158765 210032 52508 0 0 > 17:17:35 14 9 92440 155287 205216 51304 0 0 > 17:17:36 15 10 90042 155934 206168 51542 4 1 > 17:17:37 14 9 91670 164456 217684 54421 0 0 > 17:17:38 14 9 78612 167368 221348 55337 0 0 > 17:17:39 15 9 91325 162599 214616 53654 0 0 > 17:17:40 14 9 79112 151595 199984 50000 21 8 > 17:17:41 9 3 32565 60936 79832 19964 32 12 > > > [-- Attachment #2: Type: text/html, Size: 14214 bytes --] ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle 2018-01-18 22:18 ` Laurence Oberman @ 2018-01-18 22:24 ` Bart Van Assche 2018-01-18 22:24 ` Bart Van Assche 1 sibling, 0 replies; 80+ messages in thread From: Bart Van Assche @ 2018-01-18 22:24 UTC (permalink / raw) To: loberman, snitzer Cc: dm-devel, linux-kernel, hch, linux-block, osandov, axboe, ming.lei

On Thu, 2018-01-18 at 17:18 -0500, Laurence Oberman wrote:
> OK, I ran 5 at once of 5 separate mount points.
> I am using 4k block sizes
> Its solid consistent for me. No stalls no gaps.

Hi Laurence,

That's great news and thank you for having shared this information but I think
it should be mentioned that you have been running my tree in which some recent
block and dm patches were reverted
(https://github.com/bvanassche/linux/tree/block-scsi-for-next)

>
> [global]
> name=02-mq
> filename=fio-output-02-mq.txt
> rw=randwrite
> verify=md5
> ;rwmixread=60
> ;rwmixwrite=40
> bs=4K
> ;direct=1
> ;numjobs=4
> ;time_based=1
> runtime=120
>
> [file1]
> size=3G
> ioengine=libaio
> iodepth=16
>
> I watch I/O and I see it ramp up but fio still runs and it kind of
> shuts down.

In test "file1" I see an I/O size of 3G. Does that mean that the patch that
should fix the sgl_alloc() issue is working?

Thanks,

Bart. ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle @ 2018-01-18 22:24 ` Bart Van Assche 0 siblings, 0 replies; 80+ messages in thread From: Bart Van Assche @ 2018-01-18 22:24 UTC (permalink / raw) To: loberman, snitzer Cc: dm-devel, linux-kernel, hch, linux-block, osandov, axboe, ming.lei On Thu, 2018-01-18 at 17:18 -0500, Laurence Oberman wrote: > OK, I ran 5 at once of 5 separate mount points. > I am using 4k block sizes > Its solid consistent for me. No stalls no gaps. Hi Laurence, That's great news and thank you for having shared this information but I think it should be mentioned that you have been running my tree in which some recent block and dm patches were reverted (https://github.com/bvanassche/linux/tree/block-scsi-for-next) > > [global] > name=02-mq > filename=fio-output-02-mq.txt > rw=randwrite > verify=md5 > ;rwmixread=60 > ;rwmixwrite=40 > bs=4K > ;direct=1 > ;numjobs=4 > ;time_based=1 > runtime=120 > > [file1] > size=3G > ioengine=libaio > iodepth=16 > > I watch I/O and I see it ramp up but fio still runs and it kind of > shuts down. In test "file1" I see an I/O size of 3G. Does that mean that the patch that should fix the sgl_alloc() issue is working? Thanks, Bart. ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle 2018-01-18 22:24 ` Bart Van Assche (?) @ 2018-01-18 22:35 ` Laurence Oberman 2018-01-18 22:39 ` Jens Axboe -1 siblings, 1 reply; 80+ messages in thread From: Laurence Oberman @ 2018-01-18 22:35 UTC (permalink / raw) To: Bart Van Assche, snitzer Cc: dm-devel, linux-kernel, hch, linux-block, osandov, axboe, ming.lei On Thu, 2018-01-18 at 22:24 +0000, Bart Van Assche wrote: > On Thu, 2018-01-18 at 17:18 -0500, Laurence Oberman wrote: > > OK, I ran 5 at once of 5 separate mount points. > > I am using 4k block sizes > > Its solid consistent for me. No stalls no gaps. > > Hi Laurence, > > That's great news and thank you for having shared this information > but I think > it should be mentioned that you have been running my tree in which > some recent > block and dm patches were reverted > (https://github.com/bvanassche/linux/tree/block-scsi-for-next) > > > > > [global] > > name=02-mq > > filename=fio-output-02-mq.txt > > rw=randwrite > > verify=md5 > > ;rwmixread=60 > > ;rwmixwrite=40 > > bs=4K > > ;direct=1 > > ;numjobs=4 > > ;time_based=1 > > runtime=120 > > > > [file1] > > size=3G > > ioengine=libaio > > iodepth=16 > > > > I watch I/O and I see it ramp up but fio still runs and it kind of > > shuts down. > > In test "file1" I see an I/O size of 3G. Does that mean that the > patch that > should fix the sgl_alloc() issue is working? > > Thanks, > > Bart. Hello Bart Thank you. OK, so booting into Mike's tree now and then hopefully I get the lockups. Can you give me some idea of what to look for? I assume I/O just stops. I want to get this happening in-house so we have an avenue to fix this. Following getting the stall I will attend to the sgl patch test for the SRPT side. Please note I am not running latest SRPT right now as you know. Regards Back later with results ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle 2018-01-18 22:35 ` Laurence Oberman @ 2018-01-18 22:39 ` Jens Axboe 2018-01-18 22:55 ` Bart Van Assche 0 siblings, 1 reply; 80+ messages in thread From: Jens Axboe @ 2018-01-18 22:39 UTC (permalink / raw) To: Laurence Oberman, Bart Van Assche, snitzer Cc: dm-devel, linux-kernel, hch, linux-block, osandov, ming.lei On 1/18/18 3:35 PM, Laurence Oberman wrote: > On Thu, 2018-01-18 at 22:24 +0000, Bart Van Assche wrote: >> On Thu, 2018-01-18 at 17:18 -0500, Laurence Oberman wrote: >>> OK, I ran 5 at once of 5 separate mount points. >>> I am using 4k block sizes >>> Its solid consistent for me. No stalls no gaps. >> >> Hi Laurence, >> >> That's great news and thank you for having shared this information >> but I think >> it should be mentioned that you have been running my tree in which >> some recent >> block and dm patches were reverted >> (https://github.com/bvanassche/linux/tree/block-scsi-for-next) >> >>> >>> [global] >>> name=02-mq >>> filename=fio-output-02-mq.txt >>> rw=randwrite >>> verify=md5 >>> ;rwmixread=60 >>> ;rwmixwrite=40 >>> bs=4K >>> ;direct=1 >>> ;numjobs=4 >>> ;time_based=1 >>> runtime=120 >>> >>> [file1] >>> size=3G >>> ioengine=libaio >>> iodepth=16 >>> >>> I watch I/O and I see it ramp up but fio still runs and it kind of >>> shuts down. >> >> In test "file1" I see an I/O size of 3G. Does that mean that the >> patch that >> should fix the sgl_alloc() issue is working? >> >> Thanks, >> >> Bart. > > Hello Bart Thank you. > > OK, so booting into Mike tree now and then hopefully I get the lockups. > Can you give me some idea of what to look for. > I assume I/O just stops. > > I want to get this happening in-house so we have an avenue to fix this. > Following getting the stall I will attend to the slg patch test for the > SRPT side. > > Please note I am not running latest SRPT right now as you know. When you do have a solid test case, please please submit a blktests test case for it! 
This needs to be something we can run regularly in testing. -- Jens Axboe ^ permalink raw reply [flat|nested] 80+ messages in thread
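A test of the kind Jens asks for might look roughly like the sketch below. This is a hypothetical skeleton only loosely modeled on blktests conventions (DESCRIPTION, test_device(), TEST_DEV); the real framework's helpers and layout differ, and the actual stall-detection logic is left as a stub. A real version would also need a reliable way to provoke BLK_STS_RESOURCE/BLK_STS_BUSY, for instance by constraining the target queue depth the way Bart describes modifying the SRP initiator:

```shell
#!/bin/bash
# Hypothetical skeleton in the style of a blktests test (tests/block/NNN).
# Variable and function names here are illustrative, not the real blktests API.

DESCRIPTION="regression: I/O must not stall when queue_rq returns BLK_STS_RESOURCE"
TIMED=1

test_device() {
	echo "Running ${TEST_NAME:-queue-stall-check}"
	local dev=${TEST_DEV:-/dev/null}
	# A real test would drive fio against $TEST_DEV and fail if throughput
	# drops to zero (an unkillable stall) before the run completes.
	if command -v fio >/dev/null 2>&1 && [ -b "$dev" ]; then
		fio --name=stall-check --filename="$dev" --rw=randwrite \
		    --bs=4k --ioengine=libaio --iodepth=16 --runtime=30 \
		    --time_based=1 --direct=1 >/dev/null || return 1
	else
		echo "fio or test device unavailable; skipping I/O phase"
	fi
	echo "Test complete"
}

test_device
```

The skeleton degrades gracefully when no block device is configured, which keeps it runnable outside a storage testbed while preserving the shape a real regression test would take.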
* Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle 2018-01-18 22:39 ` Jens Axboe @ 2018-01-18 22:55 ` Bart Van Assche 0 siblings, 0 replies; 80+ messages in thread From: Bart Van Assche @ 2018-01-18 22:55 UTC (permalink / raw) To: snitzer, axboe, loberman Cc: dm-devel, linux-kernel, hch, linux-block, osandov, ming.lei

On Thu, 2018-01-18 at 15:39 -0700, Jens Axboe wrote:
> When you do have a solid test case, please please submit a blktests
> test case for it! This needs to be something we can run regularly in
> testing.

Hello Jens,

That sounds like a good idea to me. BTW, I think the reason why so far I
can reproduce these queue stalls easier than others is because I modified the
SRP initiator to make it easy to cause the .get_budget() call to succeed and
the scsi_queue_rq() to return BLK_STS_BUSY. A possible next step is to apply
a similar change to the scsi_debug driver. The change I made is as follows:

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index 0c887ebfbc64..7f3c4a197425 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -3135,6 +3135,16 @@ static int srp_reset_host(struct scsi_cmnd *scmnd)
 	return srp_reconnect_rport(target->rport) == 0 ? SUCCESS : FAILED;
 }
 
+static int srp_target_alloc(struct scsi_target *starget)
+{
+	struct Scsi_Host *shost = dev_to_shost(starget->dev.parent);
+	struct srp_target_port *target = host_to_target(shost);
+
+	if (target->target_can_queue)
+		starget->can_queue = target->target_can_queue;
+	return 0;
+}
+
 static int srp_slave_alloc(struct scsi_device *sdev)
 {
 	struct Scsi_Host *shost = sdev->host;
@@ -3348,6 +3358,7 @@ static struct scsi_host_template srp_template = {
 	.module				= THIS_MODULE,
 	.name				= "InfiniBand SRP initiator",
 	.proc_name			= DRV_NAME,
+	.target_alloc			= srp_target_alloc,
 	.slave_alloc			= srp_slave_alloc,
 	.slave_configure		= srp_slave_configure,
 	.info				= srp_target_info,
@@ -3515,6 +3526,7 @@ enum {
 	SRP_OPT_QUEUE_SIZE	= 1 << 14,
 	SRP_OPT_IP_SRC		= 1 << 15,
 	SRP_OPT_IP_DEST		= 1 << 16,
+	SRP_OPT_TARGET_CAN_QUEUE= 1 << 17,
 };
 
 static unsigned int srp_opt_mandatory[] = {
@@ -3536,6 +3548,7 @@ static const match_table_t srp_opt_tokens = {
 	{ SRP_OPT_SERVICE_ID,		"service_id=%s"		},
 	{ SRP_OPT_MAX_SECT,		"max_sect=%d" 		},
 	{ SRP_OPT_MAX_CMD_PER_LUN,	"max_cmd_per_lun=%d" 	},
+	{ SRP_OPT_TARGET_CAN_QUEUE,	"target_can_queue=%d"	},
 	{ SRP_OPT_IO_CLASS,		"io_class=%x"		},
 	{ SRP_OPT_INITIATOR_EXT,	"initiator_ext=%s"	},
 	{ SRP_OPT_CMD_SG_ENTRIES,	"cmd_sg_entries=%u"	},
@@ -3724,6 +3737,15 @@ static int srp_parse_options(struct net *net, const char *buf,
 			target->scsi_host->cmd_per_lun = token;
 			break;
 
+		case SRP_OPT_TARGET_CAN_QUEUE:
+			if (match_int(args, &token) || token < 1) {
+				pr_warn("bad max target_can_queue parameter '%s'\n",
+					p);
+				goto out;
+			}
+			target->target_can_queue = token;
+			break;
+
 		case SRP_OPT_IO_CLASS:
 			if (match_hex(args, &token)) {
 				pr_warn("bad IO class parameter '%s'\n", p);
diff --git a/drivers/infiniband/ulp/srp/ib_srp.h b/drivers/infiniband/ulp/srp/ib_srp.h
index d66c9057d5ea..70334fa3de8e 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.h
+++ b/drivers/infiniband/ulp/srp/ib_srp.h
@@ -216,6 +216,7 @@ struct srp_target_port {
 	char			target_name[32];
 	unsigned int		scsi_id;
 	unsigned int		sg_tablesize;
+	unsigned int		target_can_queue;
 	int			mr_pool_size;
 	int			mr_per_cmd;
 	int			queue_size;

Bart. ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle @ 2018-01-18 22:55 ` Bart Van Assche 0 siblings, 0 replies; 80+ messages in thread From: Bart Van Assche @ 2018-01-18 22:55 UTC (permalink / raw) To: snitzer, axboe, loberman Cc: dm-devel, linux-kernel, hch, linux-block, osandov, ming.lei

On Thu, 2018-01-18 at 15:39 -0700, Jens Axboe wrote:
> When you do have a solid test case, please please submit a blktests
> test case for it! This needs to be something we can run regularly in
> testing.

Hello Jens,

That sounds like a good idea to me. BTW, I think the reason why so far I can reproduce these queue stalls more easily than others is that I modified the SRP initiator to make it easy to cause the .get_budget() call to succeed and scsi_queue_rq() to return BLK_STS_BUSY. A possible next step is to apply a similar change to the scsi_debug driver. The change I made is as follows:

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index 0c887ebfbc64..7f3c4a197425 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -3135,6 +3135,16 @@ static int srp_reset_host(struct scsi_cmnd *scmnd)
 	return srp_reconnect_rport(target->rport) == 0 ? SUCCESS : FAILED;
 }
 
+static int srp_target_alloc(struct scsi_target *starget)
+{
+	struct Scsi_Host *shost = dev_to_shost(starget->dev.parent);
+	struct srp_target_port *target = host_to_target(shost);
+
+	if (target->target_can_queue)
+		starget->can_queue = target->target_can_queue;
+	return 0;
+}
+
 static int srp_slave_alloc(struct scsi_device *sdev)
 {
 	struct Scsi_Host *shost = sdev->host;
@@ -3348,6 +3358,7 @@ static struct scsi_host_template srp_template = {
 	.module				= THIS_MODULE,
 	.name				= "InfiniBand SRP initiator",
 	.proc_name			= DRV_NAME,
+	.target_alloc			= srp_target_alloc,
 	.slave_alloc			= srp_slave_alloc,
 	.slave_configure		= srp_slave_configure,
 	.info				= srp_target_info,
@@ -3515,6 +3526,7 @@ enum {
 	SRP_OPT_QUEUE_SIZE	= 1 << 14,
 	SRP_OPT_IP_SRC		= 1 << 15,
 	SRP_OPT_IP_DEST		= 1 << 16,
+	SRP_OPT_TARGET_CAN_QUEUE= 1 << 17,
 };
 
 static unsigned int srp_opt_mandatory[] = {
@@ -3536,6 +3548,7 @@ static const match_table_t srp_opt_tokens = {
 	{ SRP_OPT_SERVICE_ID,		"service_id=%s"		},
 	{ SRP_OPT_MAX_SECT,		"max_sect=%d"		},
 	{ SRP_OPT_MAX_CMD_PER_LUN,	"max_cmd_per_lun=%d"	},
+	{ SRP_OPT_TARGET_CAN_QUEUE,	"target_can_queue=%d"	},
 	{ SRP_OPT_IO_CLASS,		"io_class=%x"		},
 	{ SRP_OPT_INITIATOR_EXT,	"initiator_ext=%s"	},
 	{ SRP_OPT_CMD_SG_ENTRIES,	"cmd_sg_entries=%u"	},
@@ -3724,6 +3737,15 @@ static int srp_parse_options(struct net *net, const char *buf,
 			target->scsi_host->cmd_per_lun = token;
 			break;
 
+		case SRP_OPT_TARGET_CAN_QUEUE:
+			if (match_int(args, &token) || token < 1) {
+				pr_warn("bad max target_can_queue parameter '%s'\n",
+					p);
+				goto out;
+			}
+			target->target_can_queue = token;
+			break;
+
 		case SRP_OPT_IO_CLASS:
 			if (match_hex(args, &token)) {
 				pr_warn("bad IO class parameter '%s'\n", p);
diff --git a/drivers/infiniband/ulp/srp/ib_srp.h b/drivers/infiniband/ulp/srp/ib_srp.h
index d66c9057d5ea..70334fa3de8e 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.h
+++ b/drivers/infiniband/ulp/srp/ib_srp.h
@@ -216,6 +216,7 @@ struct srp_target_port {
 	char			target_name[32];
 	unsigned int		scsi_id;
 	unsigned int		sg_tablesize;
+	unsigned int		target_can_queue;
 	int			mr_pool_size;
 	int			mr_per_cmd;
 	int			queue_size;

Bart.

^ permalink raw reply related [flat|nested] 80+ messages in thread
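For reference, a login string carrying Bart's new option might be composed as follows. This is only a sketch: the helper function and every identifier value below are made up for illustration, and the sysfs add_target path depends on the local HCA and port (the parameters other than target_can_queue are the standard ib_srp login options):

```shell
# Compose an ib_srp add_target login string that includes the proposed
# target_can_queue option. All identifier values here are placeholders;
# a real login would write this string to something like
# /sys/class/infiniband_srp/srp-<hca>-<port>/add_target.
srp_login_string() {
    local id_ext=$1 ioc_guid=$2 dgid=$3 service_id=$4 can_queue=$5
    printf 'id_ext=%s,ioc_guid=%s,dgid=%s,pkey=ffff,service_id=%s,target_can_queue=%s' \
        "$id_ext" "$ioc_guid" "$dgid" "$service_id" "$can_queue"
}

# A depth of 1 should make it easy to hit BLK_STS_BUSY in scsi_queue_rq().
srp_login_string 0x1234 0x5678 fe80::1 0x9abc 1
```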
* Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle @ 2018-01-18 22:20 ` Bart Van Assche 0 siblings, 0 replies; 80+ messages in thread From: Bart Van Assche @ 2018-01-18 22:20 UTC (permalink / raw) To: snitzer Cc: dm-devel, linux-kernel, hch, linux-block, osandov, axboe, ming.lei

On Thu, 2018-01-18 at 17:01 -0500, Mike Snitzer wrote:
> And yet Laurence cannot reproduce any such lockups with your test...

Hmm ... maybe I misunderstood Laurence but I don't think that Laurence has already succeeded at running an unmodified version of my tests. In one of the e-mails Laurence sent me this morning I read that he modified these scripts to get past a kernel module unload failure that was reported while starting these tests. So the next step is to check which changes were made to the test scripts and also whether the test results are still valid.

> Are you absolutely certain this patch doesn't help you?
> https://patchwork.kernel.org/patch/10174037/
>
> If it doesn't then that is actually very useful to know.

The first thing I tried this morning was to run the srp-test software against a merge of Jens' for-next branch and your dm-4.16 branch. Since I noticed that the dm queue locked up I reinserted a blk_mq_delay_run_hw_queue() call in the dm code. Since even that was not sufficient I tried to kick the queues via debugfs (for s in /sys/kernel/debug/block/*/state; do echo kick >$s; done). Since that was not sufficient to resolve the queue stall I reverted the following three patches that are in Jens' tree:
* "blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback"
* "blk-mq-sched: remove unused 'can_block' arg from blk_mq_sched_insert_request"
* "blk-mq: don't dispatch request in blk_mq_request_direct_issue if queue is busy"

Only after I had done this did the srp-test software run again without triggering dm queue lockups. Sorry but I have not yet had the time to test the patch "[RFC] blk-mq: fixup RESTART when queue becomes idle".

> Please just focus on helping Laurence get his very capable testbed to
> reproduce this issue. Once we can reproduce these "unkillable" "stalls"
> in-house it'll be _much_ easier to analyze and fix.

OK, I will work with Laurence on this. Maybe Laurence and I should work on this before analyzing further the lockup that was mentioned above?

Bart.

^ permalink raw reply [flat|nested] 80+ messages in thread
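The debugfs kick that Bart describes can be wrapped in a small helper. This is a sketch that assumes a kernel with blk-mq debugfs support enabled and debugfs mounted at the usual location; the base directory is a parameter only so that it can be pointed elsewhere:

```shell
# Write "kick" to the debugfs state file of every block queue under the
# given base directory (default /sys/kernel/debug/block), mirroring the
# one-liner from the thread. Requires root on a real system.
kick_all_queues() {
    local base=${1:-/sys/kernel/debug/block}
    local state
    for state in "$base"/*/state; do
        [ -e "$state" ] || continue   # glob matched nothing; skip
        echo kick > "$state"
    done
}

# Usage on a live system (as root):
#   kick_all_queues
```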
* [PATCH] block: neutralize blk_insert_cloned_request IO stall regression (was: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle) 2018-01-18 22:20 ` Bart Van Assche (?) @ 2018-01-23 9:22 ` Mike Snitzer 2018-01-23 10:53 ` Ming Lei 2018-01-23 16:43 ` Bart Van Assche -1 siblings, 2 replies; 80+ messages in thread From: Mike Snitzer @ 2018-01-23 9:22 UTC (permalink / raw) To: Bart Van Assche, axboe, ming.lei Cc: dm-devel, linux-kernel, hch, linux-block, osandov On Thu, Jan 18 2018 at 5:20pm -0500, Bart Van Assche <Bart.VanAssche@wdc.com> wrote: > On Thu, 2018-01-18 at 17:01 -0500, Mike Snitzer wrote: > > And yet Laurence cannot reproduce any such lockups with your test... > > Hmm ... maybe I misunderstood Laurence but I don't think that Laurence has > already succeeded at running an unmodified version of my tests. In one of the > e-mails Laurence sent me this morning I read that he modified these scripts > to get past a kernel module unload failure that was reported while starting > these tests. So the next step is to check which changes were made to the test > scripts and also whether the test results are still valid. > > > Are you absolutely certain this patch doesn't help you? > > https://patchwork.kernel.org/patch/10174037/ > > > > If it doesn't then that is actually very useful to know. > > The first thing I tried this morning was to run the srp-test software against a merge > of Jens' for-next branch and your dm-4.16 branch. Since I noticed that the dm > queue locked up I reinserted a blk_mq_delay_run_hw_queue() call in the dm code. > Since even that was not sufficient I tried to kick the queues via debugfs (for > s in /sys/kernel/debug/block/*/state; do echo kick >$s; done). 
Since that was > not sufficient to resolve the queue stall I reverted the following three patches > that are in Jens' tree: > * "blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback" > * "blk-mq-sched: remove unused 'can_block' arg from blk_mq_sched_insert_request" > * "blk-mq: don't dispatch request in blk_mq_request_direct_issue if queue is busy" > > Only after I had done this the srp-test software ran again without triggering > dm queue lockups. Given that Ming's notifier-based patchset needs more development time I think we're unfortunately past the point where we can comfortably wait for that to be ready. So we need to explore alternatives to fixing this IO stall regression. Rather than attempt the above block reverts (which is an incomplete listing given newer changes): might we develop a more targeted code change to neutralize commit 396eaf21ee ("blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback")? -- which, given Bart's findings above, seems to be the most problematic block commit. To that end, assuming I drop this commit from dm-4.16: https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16&id=316a795ad388e0c3ca613454851a28079d917a92 Here is my proposal for putting this regression behind us for 4.16 (Ming's line of development would continue and hopefully be included in 4.17): From: Mike Snitzer <snitzer@redhat.com> Date: Tue, 23 Jan 2018 09:40:22 +0100 Subject: [PATCH] block: neutralize blk_insert_cloned_request IO stall regression This series of blk-mq changes was intended to improve sequential IO performance (through improved merging, with dm-mpath blk-mq stacked on an underlying blk-mq device). Unfortunately these changes have caused dm-mpath blk-mq IO stalls when blk_mq_request_issue_directly()'s call to q->mq_ops->queue_rq() fails (due to device-specific resource unavailability). 
Fix this by reverting back to how blk_insert_cloned_request() functioned prior to commit 396eaf21ee -- by using blk_mq_request_bypass_insert() instead of blk_mq_request_issue_directly().

In the future, this commit should be reverted as the first change in a followup series of changes that implements a comprehensive solution to allowing an underlying blk-mq queue's resource limitation to trigger the upper blk-mq queue to run once that underlying limited resource is replenished.

Fixes: 396eaf21ee ("blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback")
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 block/blk-core.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index cdae69be68e9..a224f282b4a6 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2520,7 +2520,8 @@ blk_status_t blk_insert_cloned_request(struct request_queue *q, struct request *
 		 * bypass a potential scheduler on the bottom device for
 		 * insert.
 		 */
-		return blk_mq_request_issue_directly(rq);
+		blk_mq_request_bypass_insert(rq, true);
+		return BLK_STS_OK;
 	}
 
 	spin_lock_irqsave(q->queue_lock, flags);
-- 
2.15.0

^ permalink raw reply related [flat|nested] 80+ messages in thread
* Re: [PATCH] block: neutralize blk_insert_cloned_request IO stall regression (was: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle) 2018-01-23 9:22 ` [PATCH] block: neutralize blk_insert_cloned_request IO stall regression (was: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle) Mike Snitzer @ 2018-01-23 10:53 ` Ming Lei 2018-01-23 12:15 ` Mike Snitzer 2018-01-23 16:43 ` Bart Van Assche 1 sibling, 1 reply; 80+ messages in thread From: Ming Lei @ 2018-01-23 10:53 UTC (permalink / raw) To: Mike Snitzer Cc: Bart Van Assche, axboe, dm-devel, linux-kernel, hch, linux-block, osandov Hi Mike, On Tue, Jan 23, 2018 at 10:22:04AM +0100, Mike Snitzer wrote: > On Thu, Jan 18 2018 at 5:20pm -0500, > Bart Van Assche <Bart.VanAssche@wdc.com> wrote: > > > On Thu, 2018-01-18 at 17:01 -0500, Mike Snitzer wrote: > > > And yet Laurence cannot reproduce any such lockups with your test... > > > > Hmm ... maybe I misunderstood Laurence but I don't think that Laurence has > > already succeeded at running an unmodified version of my tests. In one of the > > e-mails Laurence sent me this morning I read that he modified these scripts > > to get past a kernel module unload failure that was reported while starting > > these tests. So the next step is to check which changes were made to the test > > scripts and also whether the test results are still valid. > > > > > Are you absolutely certain this patch doesn't help you? > > > https://patchwork.kernel.org/patch/10174037/ > > > > > > If it doesn't then that is actually very useful to know. > > > > The first I tried this morning is to run the srp-test software against a merge > > of Jens' for-next branch and your dm-4.16 branch. Since I noticed that the dm > > queue locked up I reinserted a blk_mq_delay_run_hw_queue() call in the dm code. > > Since even that was not sufficient I tried to kick the queues via debugfs (for > > s in /sys/kernel/debug/block/*/state; do echo kick >$s; done). 
Since that was > > not sufficient to resolve the queue stall I reverted the following three patches > > that are in Jens' tree: > > * "blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback" > > * "blk-mq-sched: remove unused 'can_block' arg from blk_mq_sched_insert_request" > > * "blk-mq: don't dispatch request in blk_mq_request_direct_issue if queue is busy" > > > > Only after I had done this the srp-test software ran again without triggering > > dm queue lockups. > > Given that Ming's notifier-based patchset needs more development time I > > think we're unfortunately past the point where we can comfortably wait > > for that to be ready. > > So we need to explore alternatives to fixing this IO stall regression. The fix for the IO stall doesn't need the notifier-based patchset, and only the 1st patch is enough for fixing the IO stall. And it is a generic issue, which needs a generic solution; that is the conclusion made by Jens and me. https://marc.info/?l=linux-kernel&m=151638176727612&w=2 And the notifier-based patchset is for solving the performance issue reported by Jens: - run IO on dm-mpath - run background IO on low depth underlying queue - then IO performance on dm-mpath is extremely slow I will send out the 1st patch of 'blk-mq: introduce BLK_STS_DEV_RESOURCE' soon, but the notifier-based patchset shouldn't be very urgent, since the above test case isn't common in reality. > Rather than attempt the above block reverts (which is an incomplete > listing given newer changes): might we develop a more targeted code > change to neutralize commit 396eaf21ee ("blk-mq: improve DM's blk-mq IO > merging via blk_insert_cloned_request feedback")? -- which, given Bart's > findings above, seems to be the most problematic block commit. The stall isn't related to commit 396eaf21ee either. 
> > To that end, assuming I drop this commit from dm-4.16: > > https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16&id=316a795ad388e0c3ca613454851a28079d917a92 > > > > Here is my proposal for putting this regression behind us for 4.16 > > (Ming's line of development would continue and hopefully be included in > > 4.17): Actually the notifier-based approach is ready, and even the cache for clones is ready too, but the test result isn't good enough for random IO in Jens's above case, while sequential IO is much better with both the clone cache and notifier-based allocation (much better than non-mq). Here is the tree, if anyone is interested: https://github.com/ming1/linux/commits/v4.15-rc-block-dm-mpath Now it looks like there is still one issue: the notifier can come early, just before the request is added to hctx->dispatch_list, and performance still gets hurt, especially for random IO in Jens's case. But the queue won't stall, :-) > > > > From: Mike Snitzer <snitzer@redhat.com> > > Date: Tue, 23 Jan 2018 09:40:22 +0100 > > Subject: [PATCH] block: neutralize blk_insert_cloned_request IO stall regression > > > > The series of blk-mq changes intended to improve sequential IO > > performance (through improved merging with dm-mpath blk-mq stacked on > > underlying blk-mq device). Unfortunately these changes have caused > > dm-mpath blk-mq IO stalls when blk_mq_request_issue_directly()'s call to > > q->mq_ops->queue_rq() fails (due to device-specific resource > > unavailability). > > > > Fix this by reverting back to how blk_insert_cloned_request() functioned > > prior to commit 396eaf21ee -- by using blk_mq_request_bypass_insert() > > instead of blk_mq_request_issue_directly(). 
> > Fixes: 396eaf21ee ("blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback") > > Signed-off-by: Mike Snitzer <snitzer@redhat.com> > > --- > > block/blk-core.c | 3 ++- > > 1 file changed, 2 insertions(+), 1 deletion(-) > > > > diff --git a/block/blk-core.c b/block/blk-core.c > > index cdae69be68e9..a224f282b4a6 100644 > > --- a/block/blk-core.c > > +++ b/block/blk-core.c > > @@ -2520,7 +2520,8 @@ blk_status_t blk_insert_cloned_request(struct request_queue *q, struct request * > > * bypass a potential scheduler on the bottom device for > > * insert. > > */ > > - return blk_mq_request_issue_directly(rq); > > + blk_mq_request_bypass_insert(rq, true); > > + return BLK_STS_OK; > > } If this patch is meant to fix the IO stall on DM, it isn't needed, and it actually can't fix the IO stall issue. -- Ming ^ permalink raw reply [flat|nested] 80+ messages in thread
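Jens's performance scenario described above (foreground IO on dm-mpath while background IO saturates a low-depth underlying queue) can be approximated with an fio job file along these lines; the device paths, depths, and block sizes are guesses for illustration, not values taken from the original test:

```ini
; Sketch of the thread's scenario: background IO keeps one shallow
; underlying path busy while foreground IO runs against the dm-mpath
; device stacked on top of it. Device names are placeholders for an
; actual multipath setup.
[global]
direct=1
runtime=60
time_based
bs=4k
rw=randread

[background-underlying]
filename=/dev/sdc
iodepth=32

[foreground-mpath]
filename=/dev/mapper/mpatha
iodepth=64
```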
* Re: block: neutralize blk_insert_cloned_request IO stall regression (was: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle) 2018-01-23 10:53 ` Ming Lei @ 2018-01-23 12:15 ` Mike Snitzer 2018-01-23 12:17 ` Ming Lei 0 siblings, 1 reply; 80+ messages in thread From: Mike Snitzer @ 2018-01-23 12:15 UTC (permalink / raw) To: Ming Lei Cc: Bart Van Assche, axboe, dm-devel, linux-kernel, hch, linux-block, osandov On Tue, Jan 23 2018 at 5:53am -0500, Ming Lei <ming.lei@redhat.com> wrote: > Hi Mike, > > On Tue, Jan 23, 2018 at 10:22:04AM +0100, Mike Snitzer wrote: > > On Thu, Jan 18 2018 at 5:20pm -0500, > > Bart Van Assche <Bart.VanAssche@wdc.com> wrote: > > > > > On Thu, 2018-01-18 at 17:01 -0500, Mike Snitzer wrote: > > > > And yet Laurence cannot reproduce any such lockups with your test... > > > > > > Hmm ... maybe I misunderstood Laurence but I don't think that Laurence has > > > already succeeded at running an unmodified version of my tests. In one of the > > > e-mails Laurence sent me this morning I read that he modified these scripts > > > to get past a kernel module unload failure that was reported while starting > > > these tests. So the next step is to check which changes were made to the test > > > scripts and also whether the test results are still valid. > > > > > > > Are you absolutely certain this patch doesn't help you? > > > > https://patchwork.kernel.org/patch/10174037/ > > > > > > > > If it doesn't then that is actually very useful to know. > > > > > > The first I tried this morning is to run the srp-test software against a merge > > > of Jens' for-next branch and your dm-4.16 branch. Since I noticed that the dm > > > queue locked up I reinserted a blk_mq_delay_run_hw_queue() call in the dm code. > > > Since even that was not sufficient I tried to kick the queues via debugfs (for > > > s in /sys/kernel/debug/block/*/state; do echo kick >$s; done). 
Since that was > > > not sufficient to resolve the queue stall I reverted the following tree patches > > > that are in Jens' tree: > > > * "blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback" > > > * "blk-mq-sched: remove unused 'can_block' arg from blk_mq_sched_insert_request" > > > * "blk-mq: don't dispatch request in blk_mq_request_direct_issue if queue is busy" > > > > > > Only after I had done this the srp-test software ran again without triggering > > > dm queue lockups. > > > > Given that Ming's notifier-based patchset needs more development time I > > think we're unfortunately past the point where we can comfortably wait > > for that to be ready. > > > > So we need to explore alternatives to fixing this IO stall regression. > > The fix for IO stall doesn't need the notifier-based patchset, and only > the 1st patch is enough for fixing the IO stall. And it is a generic > issue, which need generic solution, that is the conclusion made by > Jens and me. > > https://marc.info/?l=linux-kernel&m=151638176727612&w=2 That's fine if Bart verifies it. > And the notifier-based patchset is for solving the performance issue > reported by Jens: > > - run IO on dm-mpath > - run background IO on low depth underlying queue > - then IO performance on dm-mpath is extremely slow > > I will send out the 1st patch of 'blk-mq: introduce BLK_STS_DEV_RESOURCE' > soon, but the notifier-based patchset shouldn't be very urgent, since > the above test case isn't usual in reality. > > > Rather than attempt the above block reverts (which is an incomplete > > listing given newer changes): might we develop a more targeted code > > change to neutralize commit 396eaf21ee ("blk-mq: improve DM's blk-mq IO > > merging via blk_insert_cloned_request feedback")? -- which, given Bart's > > findings above, seems to be the most problematic block commit. > > The stall isn't related with commit 396eaf21ee too. 
> > > > > To that end, assuming I drop this commit from dm-4.16: > > https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16&id=316a795ad388e0c3ca613454851a28079d917a92 > > > > Here is my proposal for putting this regression behind us for 4.16 > > (Ming's line of development would continue and hopefully be included in > > 4.17): > > Actually notifier based approach is ready, even cache for clone is ready > too, but the test result isn't good enough on random IO on Jens's above > case, and sequential IO is much better with both cache clone and > notifier based allocation(much better than non-mq). And follows the tree > if anyone is interested: > > https://github.com/ming1/linux/commits/v4.15-rc-block-dm-mpath > > Now looks there is still one issue: the notifier can come early, just > before the request is added to hctx->dispatch_list, and performance > still gets hurt, especially on random IO in Jens's case. But queue > won't stall, :-) > > > > > From: Mike Snitzer <snitzer@redhat.com> > > Date: Tue, 23 Jan 2018 09:40:22 +0100 > > Subject: [PATCH] block: neutralize blk_insert_cloned_request IO stall regression > > > > The series of blk-mq changes intended to improve sequential IO > > performace (through improved merging with dm-mapth blk-mq stacked on > > underlying blk-mq device). Unfortunately these changes have caused > > dm-mpath blk-mq IO stalls when blk_mq_request_issue_directly()'s call to > > q->mq_ops->queue_rq() fails (due to device-specific resource > > unavailability). > > > > Fix this by reverting back to how blk_insert_cloned_request() functioned > > prior to commit 396eaf21ee -- by using blk_mq_request_bypass_insert() > > instead of blk_mq_request_issue_directly(). 
> > > > In the future, this commit should be reverted as the first change in a > > followup series of changes that implements a comprehensive solution to > > allowing an underlying blk-mq queue's resource limitation to trigger the > > upper blk-mq queue to run once that underlying limited resource is > > replenished. > > > > Fixes: 396eaf21ee ("blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback") > > Signed-off-by: Mike Snitzer <snitzer@redhat.com> > > --- > > block/blk-core.c | 3 ++- > > 1 file changed, 2 insertions(+), 1 deletion(-) > > > > diff --git a/block/blk-core.c b/block/blk-core.c > > index cdae69be68e9..a224f282b4a6 100644 > > --- a/block/blk-core.c > > +++ b/block/blk-core.c > > @@ -2520,7 +2520,8 @@ blk_status_t blk_insert_cloned_request(struct request_queue *q, struct request * > > * bypass a potential scheduler on the bottom device for > > * insert. > > */ > > - return blk_mq_request_issue_directly(rq); > > + blk_mq_request_bypass_insert(rq, true); > > + return BLK_STS_OK; > > } > > If this patch is for fixing IO stall on DM, it isn't needed, and actually > it can't fix the IO stall issue. It should "fix it" in conjunction with this: https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16&id=87b7e73546b55f4a3a37d4726daedd9a17a20509 (Bart already said as much, I just effectively enabled the equivalent of his reverts) ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: block: neutralize blk_insert_cloned_request IO stall regression (was: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle) 2018-01-23 12:15 ` Mike Snitzer @ 2018-01-23 12:17 ` Ming Lei 2018-01-23 12:43 ` Mike Snitzer 0 siblings, 1 reply; 80+ messages in thread From: Ming Lei @ 2018-01-23 12:17 UTC (permalink / raw) To: Mike Snitzer Cc: Ming Lei, Bart Van Assche, Jens Axboe, dm-devel, linux-kernel, hch, linux-block, osandov On Tue, Jan 23, 2018 at 8:15 PM, Mike Snitzer <snitzer@redhat.com> wrote: > On Tue, Jan 23 2018 at 5:53am -0500, > Ming Lei <ming.lei@redhat.com> wrote: > >> Hi Mike, >> >> On Tue, Jan 23, 2018 at 10:22:04AM +0100, Mike Snitzer wrote: >> > On Thu, Jan 18 2018 at 5:20pm -0500, >> > Bart Van Assche <Bart.VanAssche@wdc.com> wrote: >> > >> > > On Thu, 2018-01-18 at 17:01 -0500, Mike Snitzer wrote: >> > > > And yet Laurence cannot reproduce any such lockups with your test... >> > > >> > > Hmm ... maybe I misunderstood Laurence but I don't think that Laurence has >> > > already succeeded at running an unmodified version of my tests. In one of the >> > > e-mails Laurence sent me this morning I read that he modified these scripts >> > > to get past a kernel module unload failure that was reported while starting >> > > these tests. So the next step is to check which changes were made to the test >> > > scripts and also whether the test results are still valid. >> > > >> > > > Are you absolutely certain this patch doesn't help you? >> > > > https://patchwork.kernel.org/patch/10174037/ >> > > > >> > > > If it doesn't then that is actually very useful to know. >> > > >> > > The first I tried this morning is to run the srp-test software against a merge >> > > of Jens' for-next branch and your dm-4.16 branch. Since I noticed that the dm >> > > queue locked up I reinserted a blk_mq_delay_run_hw_queue() call in the dm code. 
>> > > Since even that was not sufficient I tried to kick the queues via debugfs (for >> > > s in /sys/kernel/debug/block/*/state; do echo kick >$s; done). Since that was >> > > not sufficient to resolve the queue stall I reverted the following tree patches >> > > that are in Jens' tree: >> > > * "blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback" >> > > * "blk-mq-sched: remove unused 'can_block' arg from blk_mq_sched_insert_request" >> > > * "blk-mq: don't dispatch request in blk_mq_request_direct_issue if queue is busy" >> > > >> > > Only after I had done this the srp-test software ran again without triggering >> > > dm queue lockups. >> > >> > Given that Ming's notifier-based patchset needs more development time I >> > think we're unfortunately past the point where we can comfortably wait >> > for that to be ready. >> > >> > So we need to explore alternatives to fixing this IO stall regression. >> >> The fix for IO stall doesn't need the notifier-based patchset, and only >> the 1st patch is enough for fixing the IO stall. And it is a generic >> issue, which need generic solution, that is the conclusion made by >> Jens and me. >> >> https://marc.info/?l=linux-kernel&m=151638176727612&w=2 > > That's fine if Bart verifies it. > >> And the notifier-based patchset is for solving the performance issue >> reported by Jens: >> >> - run IO on dm-mpath >> - run background IO on low depth underlying queue >> - then IO performance on dm-mpath is extremely slow >> >> I will send out the 1st patch of 'blk-mq: introduce BLK_STS_DEV_RESOURCE' >> soon, but the notifier-based patchset shouldn't be very urgent, since >> the above test case isn't usual in reality. >> >> > Rather than attempt the above block reverts (which is an incomplete >> > listing given newer changes): might we develop a more targeted code >> > change to neutralize commit 396eaf21ee ("blk-mq: improve DM's blk-mq IO >> > merging via blk_insert_cloned_request feedback")? 
-- which, given Bart's >> > findings above, seems to be the most problematic block commit. >> >> The stall isn't related with commit 396eaf21ee too. >> >> > >> > To that end, assuming I drop this commit from dm-4.16: >> > https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16&id=316a795ad388e0c3ca613454851a28079d917a92 >> > >> > Here is my proposal for putting this regression behind us for 4.16 >> > (Ming's line of development would continue and hopefully be included in >> > 4.17): >> >> Actually notifier based approach is ready, even cache for clone is ready >> too, but the test result isn't good enough on random IO on Jens's above >> case, and sequential IO is much better with both cache clone and >> notifier based allocation (much better than non-mq). And follows the tree >> if anyone is interested: >> >> https://github.com/ming1/linux/commits/v4.15-rc-block-dm-mpath >> >> Now looks there is still one issue: the notifier can come early, just >> before the request is added to hctx->dispatch_list, and performance >> still gets hurt, especially on random IO in Jens's case. But queue >> won't stall, :-) >> >> > >> > From: Mike Snitzer <snitzer@redhat.com> >> > Date: Tue, 23 Jan 2018 09:40:22 +0100 >> > Subject: [PATCH] block: neutralize blk_insert_cloned_request IO stall regression >> > >> > The series of blk-mq changes was intended to improve sequential IO >> > performance (through improved merging with dm-mpath blk-mq stacked on an >> > underlying blk-mq device). Unfortunately these changes have caused >> > dm-mpath blk-mq IO stalls when blk_mq_request_issue_directly()'s call to >> > q->mq_ops->queue_rq() fails (due to device-specific resource >> > unavailability). >> > >> > Fix this by reverting back to how blk_insert_cloned_request() functioned >> > prior to commit 396eaf21ee -- by using blk_mq_request_bypass_insert() >> > instead of blk_mq_request_issue_directly(). 
>> > >> > In the future, this commit should be reverted as the first change in a >> > followup series of changes that implements a comprehensive solution to >> > allowing an underlying blk-mq queue's resource limitation to trigger the >> > upper blk-mq queue to run once that underlying limited resource is >> > replenished. >> > >> > Fixes: 396eaf21ee ("blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback") >> > Signed-off-by: Mike Snitzer <snitzer@redhat.com> >> > --- >> > block/blk-core.c | 3 ++- >> > 1 file changed, 2 insertions(+), 1 deletion(-) >> > >> > diff --git a/block/blk-core.c b/block/blk-core.c >> > index cdae69be68e9..a224f282b4a6 100644 >> > --- a/block/blk-core.c >> > +++ b/block/blk-core.c >> > @@ -2520,7 +2520,8 @@ blk_status_t blk_insert_cloned_request(struct request_queue *q, struct request * >> > * bypass a potential scheduler on the bottom device for >> > * insert. >> > */ >> > - return blk_mq_request_issue_directly(rq); >> > + blk_mq_request_bypass_insert(rq, true); >> > + return BLK_STS_OK; >> > } >> >> If this patch is for fixing IO stall on DM, it isn't needed, and actually >> it can't fix the IO stall issue. > > It should "fix it" in conjunction with this: > > https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16&id=87b7e73546b55f4a3a37d4726daedd9a17a20509 > > (Bart already said as much, I just effectively enabled the equivalent of > his reverts) If the generic solution is taken, Bart's revert isn't needed. -- Ming Lei ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: block: neutralize blk_insert_cloned_request IO stall regression (was: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle) 2018-01-23 12:17 ` Ming Lei @ 2018-01-23 12:43 ` Mike Snitzer 0 siblings, 0 replies; 80+ messages in thread From: Mike Snitzer @ 2018-01-23 12:43 UTC (permalink / raw) To: Ming Lei, Bart Van Assche Cc: Ming Lei, Jens Axboe, dm-devel, linux-kernel, hch, linux-block, osandov On Tue, Jan 23 2018 at 7:17am -0500, Ming Lei <tom.leiming@gmail.com> wrote: > On Tue, Jan 23, 2018 at 8:15 PM, Mike Snitzer <snitzer@redhat.com> wrote: > > On Tue, Jan 23 2018 at 5:53am -0500, > > Ming Lei <ming.lei@redhat.com> wrote: > > > >> Hi Mike, > >> > >> On Tue, Jan 23, 2018 at 10:22:04AM +0100, Mike Snitzer wrote: > >> > > >> > From: Mike Snitzer <snitzer@redhat.com> > >> > Date: Tue, 23 Jan 2018 09:40:22 +0100 > >> > Subject: [PATCH] block: neutralize blk_insert_cloned_request IO stall regression > >> > > >> > The series of blk-mq changes was intended to improve sequential IO > >> > performance (through improved merging with dm-mpath blk-mq stacked on an > >> > underlying blk-mq device). Unfortunately these changes have caused > >> > dm-mpath blk-mq IO stalls when blk_mq_request_issue_directly()'s call to > >> > q->mq_ops->queue_rq() fails (due to device-specific resource > >> > unavailability). > >> > > >> > Fix this by reverting back to how blk_insert_cloned_request() functioned > >> > prior to commit 396eaf21ee -- by using blk_mq_request_bypass_insert() > >> > instead of blk_mq_request_issue_directly(). 
> >> > > >> > Fixes: 396eaf21ee ("blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback") > >> > Signed-off-by: Mike Snitzer <snitzer@redhat.com> > >> > --- > >> > block/blk-core.c | 3 ++- > >> > 1 file changed, 2 insertions(+), 1 deletion(-) > >> > > >> > diff --git a/block/blk-core.c b/block/blk-core.c > >> > index cdae69be68e9..a224f282b4a6 100644 > >> > --- a/block/blk-core.c > >> > +++ b/block/blk-core.c > >> > @@ -2520,7 +2520,8 @@ blk_status_t blk_insert_cloned_request(struct request_queue *q, struct request * > >> > * bypass a potential scheduler on the bottom device for > >> > * insert. > >> > */ > >> > - return blk_mq_request_issue_directly(rq); > >> > + blk_mq_request_bypass_insert(rq, true); > >> > + return BLK_STS_OK; > >> > } > >> > >> If this patch is for fixing IO stall on DM, it isn't needed, and actually > >> it can't fix the IO stall issue. > > > > It should "fix it" in conjunction with this: > > > > https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16&id=87b7e73546b55f4a3a37d4726daedd9a17a20509 > > > > (Bart already said as much, I just effectively enabled the equivalent of > > his reverts) > > If the generic solution is take, Bart's revert isn't needed. Yes, we need Bart to verify your v2 patch: https://patchwork.kernel.org/patch/10179945/ Bart please test this tree: https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=for-next Please report back with your results (and include details on any changes you make to the tree, hopefully no changes are needed). Thanks, Mike ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [PATCH] block: neutralize blk_insert_cloned_request IO stall regression (was: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle) @ 2018-01-23 16:43 ` Bart Van Assche 0 siblings, 0 replies; 80+ messages in thread From: Bart Van Assche @ 2018-01-23 16:43 UTC (permalink / raw) To: ming.lei, snitzer, axboe Cc: dm-devel, linux-kernel, hch, linux-block, osandov On Tue, 2018-01-23 at 10:22 +0100, Mike Snitzer wrote: > On Thu, Jan 18 2018 at 5:20pm -0500, > Bart Van Assche <Bart.VanAssche@wdc.com> wrote: > > > On Thu, 2018-01-18 at 17:01 -0500, Mike Snitzer wrote: > > > And yet Laurence cannot reproduce any such lockups with your test... > > > > Hmm ... maybe I misunderstood Laurence but I don't think that Laurence has > > already succeeded at running an unmodified version of my tests. In one of the > > e-mails Laurence sent me this morning I read that he modified these scripts > > to get past a kernel module unload failure that was reported while starting > > these tests. So the next step is to check which changes were made to the test > > scripts and also whether the test results are still valid. > > > > > Are you absolutely certain this patch doesn't help you? > > > https://patchwork.kernel.org/patch/10174037/ > > > > > > If it doesn't then that is actually very useful to know. > > > > The first I tried this morning is to run the srp-test software against a merge > > of Jens' for-next branch and your dm-4.16 branch. Since I noticed that the dm > > queue locked up I reinserted a blk_mq_delay_run_hw_queue() call in the dm code. > > Since even that was not sufficient I tried to kick the queues via debugfs (for > > s in /sys/kernel/debug/block/*/state; do echo kick >$s; done). 
Since that was > > not sufficient to resolve the queue stall I reverted the following tree patches > > that are in Jens' tree: > > * "blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback" > > * "blk-mq-sched: remove unused 'can_block' arg from blk_mq_sched_insert_request" > > * "blk-mq: don't dispatch request in blk_mq_request_direct_issue if queue is busy" > > > > Only after I had done this the srp-test software ran again without triggering > > dm queue lockups. Hello Mike, I want to take back what I wrote about the reverts. I have not yet tried to analyze exactly which change made blk_insert_cloned_request() work reliably for me but with Jens' latest for-next branch and your for-4.16 branch merged all I need to avoid queue stalls is the following: diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c index f16096af879a..c59c59cfd2a5 100644 --- a/drivers/md/dm-rq.c +++ b/drivers/md/dm-rq.c @@ -761,6 +761,7 @@ static blk_status_t dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx, /* Undo dm_start_request() before requeuing */ rq_end_stats(md, rq); rq_completed(md, rq_data_dir(rq), false); + blk_mq_delay_run_hw_queue(hctx, 100/*ms*/); return BLK_STS_RESOURCE; } Bart. ^ permalink raw reply related [flat|nested] 80+ messages in thread
* Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle 2018-01-18 20:11 ` Jens Axboe (?) (?) @ 2018-01-19 2:32 ` Ming Lei 2018-01-19 4:02 ` Jens Axboe 2018-01-19 5:09 ` Bart Van Assche -1 siblings, 2 replies; 80+ messages in thread From: Ming Lei @ 2018-01-19 2:32 UTC (permalink / raw) To: Jens Axboe Cc: Bart Van Assche, snitzer, dm-devel, hch, linux-kernel, linux-block, osandov On Thu, Jan 18, 2018 at 01:11:01PM -0700, Jens Axboe wrote: > On 1/18/18 11:47 AM, Bart Van Assche wrote: > >> This is all very tiresome. > > > > Yes, this is tiresome. It is very annoying to me that others keep > > introducing so many regressions in such important parts of the kernel. > > It is also annoying to me that I get blamed if I report a regression > > instead of seeing that the regression gets fixed. > > I agree, it sucks that any change there introduces the regression. I'm > fine with doing the delay insert again until a new patch is proven to be > better. That way is still buggy, as I explained, since rerunning the queue before adding the request to hctx->dispatch_list isn't correct. Who can make sure the request is visible when __blk_mq_run_hw_queue() is called? Not to mention that this way will cause a performance regression again. > > From the original topic of this email, we have conditions that can cause > the driver to not be able to submit an IO. A set of those conditions can > only happen if IO is in flight, and those cases we have covered just > fine. Another set can potentially trigger without IO being in flight. > These are cases where a non-device resource is unavailable at the time > of submission. This might be iommu running out of space, for instance, > or it might be a memory allocation of some sort. For these cases, we > don't get any notification when the shortage clears. All we can do is > ensure that we restart operations at some point in the future. We're SOL > at that point, but we have to ensure that we make forward progress. 
Right, it is a generic issue, not a DM-specific one; almost all drivers call kmalloc(GFP_ATOMIC) in the IO path. IMO, there is enough time to figure out a generic solution before the 4.16 release. > > That last set of conditions better not be a a common occurence, since > performance is down the toilet at that point. I don't want to introduce > hot path code to rectify it. Have the driver return if that happens in a > way that is DIFFERENT from needing a normal restart. The driver knows if > this is a resource that will become available when IO completes on this > device or not. If we get that return, we have a generic run-again delay. Now, most of the time both NVMe and SCSI won't return BLK_STS_RESOURCE, and it should be only DM that returns STS_RESOURCE so often. > > This basically becomes the same as doing the delay queue thing from DM, > but just in a generic fashion. Yeah, that is right. -- Ming ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle 2018-01-19 2:32 ` [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle Ming Lei @ 2018-01-19 4:02 ` Jens Axboe 2018-01-19 7:26 ` Ming Lei 2018-01-19 5:09 ` Bart Van Assche 1 sibling, 1 reply; 80+ messages in thread From: Jens Axboe @ 2018-01-19 4:02 UTC (permalink / raw) To: Ming Lei Cc: Bart Van Assche, snitzer, dm-devel, hch, linux-kernel, linux-block, osandov On 1/18/18 7:32 PM, Ming Lei wrote: > On Thu, Jan 18, 2018 at 01:11:01PM -0700, Jens Axboe wrote: >> On 1/18/18 11:47 AM, Bart Van Assche wrote: >>>> This is all very tiresome. >>> >>> Yes, this is tiresome. It is very annoying to me that others keep >>> introducing so many regressions in such important parts of the kernel. >>> It is also annoying to me that I get blamed if I report a regression >>> instead of seeing that the regression gets fixed. >> >> I agree, it sucks that any change there introduces the regression. I'm >> fine with doing the delay insert again until a new patch is proven to be >> better. > > That way is still buggy as I explained, since rerun queue before adding > request to hctx->dispatch_list isn't correct. Who can make sure the request > is visible when __blk_mq_run_hw_queue() is called? That race basically doesn't exist for a 10ms gap. > Not mention this way will cause performance regression again. How so? It's _exactly_ the same as what you are proposing, except mine will potentially run the queue when it need not do so. But given that these are random 10ms queue kicks because we are screwed, it should not matter. The key point is that it only should be if we have NO better options. If it's a frequently occurring event that we have to return BLK_STS_RESOURCE, then we need to get a way to register an event for when that condition clears. That event will then kick the necessary queue(s). >> From the original topic of this email, we have conditions that can cause >> the driver to not be able to submit an IO. 
A set of those conditions can >> only happen if IO is in flight, and those cases we have covered just >> fine. Another set can potentially trigger without IO being in flight. >> These are cases where a non-device resource is unavailable at the time >> of submission. This might be iommu running out of space, for instance, >> or it might be a memory allocation of some sort. For these cases, we >> don't get any notification when the shortage clears. All we can do is >> ensure that we restart operations at some point in the future. We're SOL >> at that point, but we have to ensure that we make forward progress. > > Right, it is a generic issue, not DM-specific one, almost all drivers > call kmalloc(GFP_ATOMIC) in IO path. GFP_ATOMIC basically never fails, unless we are out of memory. The exception is higher order allocations. If a driver has a higher order atomic allocation in its IO path, the device driver writer needs to be taken out behind the barn and shot. Simple as that. It will NEVER work well in a production environment. Witness the disaster that so many NIC driver writers have learned. This is NOT the case we care about here. It's resources that are more readily depleted because other devices are using them. If it's a high frequency or generally occurring event, then we simply must have a callback to restart the queue from that. The condition then becomes identical to device private starvation, the only difference being from where we restart the queue. > IMO, there is enough time for figuring out a generic solution before > 4.16 release. I would hope so, but the proposed solutions have not filled me with a lot of confidence in the end result so far. >> That last set of conditions better not be a a common occurence, since >> performance is down the toilet at that point. I don't want to introduce >> hot path code to rectify it. Have the driver return if that happens in a >> way that is DIFFERENT from needing a normal restart. 
The driver knows if >> this is a resource that will become available when IO completes on this >> device or not. If we get that return, we have a generic run-again delay. > > Now most of times both NVMe and SCSI won't return BLK_STS_RESOURCE, and > it should be DM-only which returns STS_RESOURCE so often. Where does the dm STS_RESOURCE error usually come from - what exact resource are we running out of? -- Jens Axboe ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle 2018-01-19 4:02 ` Jens Axboe @ 2018-01-19 7:26 ` Ming Lei 2018-01-19 15:20 ` Bart Van Assche 2018-01-19 15:24 ` Jens Axboe 0 siblings, 2 replies; 80+ messages in thread From: Ming Lei @ 2018-01-19 7:26 UTC (permalink / raw) To: Jens Axboe Cc: Bart Van Assche, snitzer, dm-devel, hch, linux-kernel, linux-block, osandov On Thu, Jan 18, 2018 at 09:02:45PM -0700, Jens Axboe wrote: > On 1/18/18 7:32 PM, Ming Lei wrote: > > On Thu, Jan 18, 2018 at 01:11:01PM -0700, Jens Axboe wrote: > >> On 1/18/18 11:47 AM, Bart Van Assche wrote: > >>>> This is all very tiresome. > >>> > >>> Yes, this is tiresome. It is very annoying to me that others keep > >>> introducing so many regressions in such important parts of the kernel. > >>> It is also annoying to me that I get blamed if I report a regression > >>> instead of seeing that the regression gets fixed. > >> > >> I agree, it sucks that any change there introduces the regression. I'm > >> fine with doing the delay insert again until a new patch is proven to be > >> better. > > > > That way is still buggy as I explained, since rerun queue before adding > > request to hctx->dispatch_list isn't correct. Who can make sure the request > > is visible when __blk_mq_run_hw_queue() is called? > > That race basically doesn't exist for a 10ms gap. > > > Not mention this way will cause performance regression again. > > How so? It's _exactly_ the same as what you are proposing, except mine > will potentially run the queue when it need not do so. But given that > these are random 10ms queue kicks because we are screwed, it should not > matter. The key point is that it only should be if we have NO better > options. If it's a frequently occurring event that we have to return > BLK_STS_RESOURCE, then we need to get a way to register an event for > when that condition clears. That event will then kick the necessary > queue(s). 
Please see queue_delayed_work_on(): hctx->run_work is shared by all scheduling, and once blk_mq_delay_run_hw_queue(100ms) returns, no new scheduling can make progress during the 100ms. > > >> From the original topic of this email, we have conditions that can cause > >> the driver to not be able to submit an IO. A set of those conditions can > >> only happen if IO is in flight, and those cases we have covered just > >> fine. Another set can potentially trigger without IO being in flight. > >> These are cases where a non-device resource is unavailable at the time > >> of submission. This might be iommu running out of space, for instance, > >> or it might be a memory allocation of some sort. For these cases, we > >> don't get any notification when the shortage clears. All we can do is > >> ensure that we restart operations at some point in the future. We're SOL > >> at that point, but we have to ensure that we make forward progress. > > > > Right, it is a generic issue, not DM-specific one, almost all drivers > > call kmalloc(GFP_ATOMIC) in IO path. > > GFP_ATOMIC basically never fails, unless we are out of memory. The I guess GFP_KERNEL may never fail, but a GFP_ATOMIC failure might be possible, and it is mentioned[1] that there is such code in the mm allocation path; OOM can happen too. if (some randomly generated condition) && (request is atomic) return NULL; [1] https://lwn.net/Articles/276731/ > exception is higher order allocations. If a driver has a higher order > atomic allocation in its IO path, the device driver writer needs to be > taken out behind the barn and shot. Simple as that. It will NEVER work > well in a production environment. Witness the disaster that so many NIC > driver writers have learned. > > This is NOT the case we care about here. It's resources that are more > readily depleted because other devices are using them. If it's a high > frequency or generally occurring event, then we simply must have a > callback to restart the queue from that. 
The condition then becomes > identical to device private starvation, the only difference being from > where we restart the queue. > > > IMO, there is enough time for figuring out a generic solution before > > 4.16 release. > > I would hope so, but the proposed solutions have not filled me with > a lot of confidence in the end result so far. > > >> That last set of conditions better not be a a common occurence, since > >> performance is down the toilet at that point. I don't want to introduce > >> hot path code to rectify it. Have the driver return if that happens in a > >> way that is DIFFERENT from needing a normal restart. The driver knows if > >> this is a resource that will become available when IO completes on this > >> device or not. If we get that return, we have a generic run-again delay. > > > > Now most of times both NVMe and SCSI won't return BLK_STS_RESOURCE, and > > it should be DM-only which returns STS_RESOURCE so often. > > Where does the dm STS_RESOURCE error usually come from - what's exact > resource are we running out of? It is from blk_get_request(underlying queue), see multipath_clone_and_map(). Thanks, Ming ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle @ 2018-01-19 15:20 ` Bart Van Assche 0 siblings, 0 replies; 80+ messages in thread From: Bart Van Assche @ 2018-01-19 15:20 UTC (permalink / raw) To: ming.lei, axboe Cc: dm-devel, hch, linux-kernel, linux-block, osandov, snitzer On Fri, 2018-01-19 at 15:26 +0800, Ming Lei wrote: > Please see queue_delayed_work_on(), hctx->run_work is shared by all > scheduling, once blk_mq_delay_run_hw_queue(100ms) returns, no new > scheduling can make progress during the 100ms. How about addressing that as follows: diff --git a/block/blk-mq.c b/block/blk-mq.c index f7515dd95a36..57f8379a476d 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -1403,9 +1403,9 @@ static void __blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async, put_cpu(); } - kblockd_schedule_delayed_work_on(blk_mq_hctx_next_cpu(hctx), - &hctx->run_work, - msecs_to_jiffies(msecs)); + kblockd_mod_delayed_work_on(blk_mq_hctx_next_cpu(hctx), + &hctx->run_work, + msecs_to_jiffies(msecs)); } void blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs) Bart. ^ permalink raw reply related [flat|nested] 80+ messages in thread
* Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle 2018-01-19 15:20 ` Bart Van Assche (?) @ 2018-01-19 15:25 ` Jens Axboe -1 siblings, 0 replies; 80+ messages in thread From: Jens Axboe @ 2018-01-19 15:25 UTC (permalink / raw) To: Bart Van Assche, ming.lei Cc: dm-devel, hch, linux-kernel, linux-block, osandov, snitzer On 1/19/18 8:20 AM, Bart Van Assche wrote: > On Fri, 2018-01-19 at 15:26 +0800, Ming Lei wrote: >> Please see queue_delayed_work_on(), hctx->run_work is shared by all >> scheduling, once blk_mq_delay_run_hw_queue(100ms) returns, no new >> scheduling can make progress during the 100ms. > > How about addressing that as follows: > > diff --git a/block/blk-mq.c b/block/blk-mq.c > index f7515dd95a36..57f8379a476d 100644 > --- a/block/blk-mq.c > +++ b/block/blk-mq.c > @@ -1403,9 +1403,9 @@ static void __blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async, > put_cpu(); > } > > - kblockd_schedule_delayed_work_on(blk_mq_hctx_next_cpu(hctx), > - &hctx->run_work, > - msecs_to_jiffies(msecs)); > + kblockd_mod_delayed_work_on(blk_mq_hctx_next_cpu(hctx), > + &hctx->run_work, > + msecs_to_jiffies(msecs)); > } Exactly. That's why I said it was just a bug in my previous email, not honoring a newer run is just stupid. Only other thing you have to be careful with here is the STOPPED bit. -- Jens Axboe ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle 2018-01-19 15:20 ` Bart Van Assche (?) (?) @ 2018-01-19 15:33 ` Ming Lei 2018-01-19 16:06 ` Bart Van Assche -1 siblings, 1 reply; 80+ messages in thread From: Ming Lei @ 2018-01-19 15:33 UTC (permalink / raw) To: Bart Van Assche Cc: axboe, dm-devel, hch, linux-kernel, linux-block, osandov, snitzer On Fri, Jan 19, 2018 at 03:20:13PM +0000, Bart Van Assche wrote: > On Fri, 2018-01-19 at 15:26 +0800, Ming Lei wrote: > > Please see queue_delayed_work_on(), hctx->run_work is shared by all > > scheduling, once blk_mq_delay_run_hw_queue(100ms) returns, no new > > scheduling can make progress during the 100ms. > > How about addressing that as follows: > > diff --git a/block/blk-mq.c b/block/blk-mq.c > index f7515dd95a36..57f8379a476d 100644 > --- a/block/blk-mq.c > +++ b/block/blk-mq.c > @@ -1403,9 +1403,9 @@ static void __blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async, > put_cpu(); > } > > - kblockd_schedule_delayed_work_on(blk_mq_hctx_next_cpu(hctx), > - &hctx->run_work, > - msecs_to_jiffies(msecs)); > + kblockd_mod_delayed_work_on(blk_mq_hctx_next_cpu(hctx), > + &hctx->run_work, > + msecs_to_jiffies(msecs)); > } > > void blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs) > > Bart. Yes, this one together with Jens' suggestion of returning BLK_STS_NO_DEV_RESOURCE should fix this issue. Could you cook a fix for this issue? Otherwise I am happy to do that. -- Ming ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle @ 2018-01-19 16:06 ` Bart Van Assche 0 siblings, 0 replies; 80+ messages in thread From: Bart Van Assche @ 2018-01-19 16:06 UTC (permalink / raw) To: ming.lei Cc: dm-devel, hch, linux-kernel, linux-block, osandov, axboe, snitzer On Fri, 2018-01-19 at 23:33 +0800, Ming Lei wrote: > On Fri, Jan 19, 2018 at 03:20:13PM +0000, Bart Van Assche wrote: > > On Fri, 2018-01-19 at 15:26 +0800, Ming Lei wrote: > > > Please see queue_delayed_work_on(), hctx->run_work is shared by all > > > scheduling, once blk_mq_delay_run_hw_queue(100ms) returns, no new > > > scheduling can make progress during the 100ms. > > > > How about addressing that as follows: > > > > diff --git a/block/blk-mq.c b/block/blk-mq.c > > index f7515dd95a36..57f8379a476d 100644 > > --- a/block/blk-mq.c > > +++ b/block/blk-mq.c > > @@ -1403,9 +1403,9 @@ static void __blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async, > > put_cpu(); > > } > > > > - kblockd_schedule_delayed_work_on(blk_mq_hctx_next_cpu(hctx), > > - &hctx->run_work, > > - msecs_to_jiffies(msecs)); > > + kblockd_mod_delayed_work_on(blk_mq_hctx_next_cpu(hctx), > > + &hctx->run_work, > > + msecs_to_jiffies(msecs)); > > } > > > > void blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs) > > > > Bart. > > Yes, this one together with Jen's suggestion with returning > BLK_STS_NO_DEV_RESOURCE should fix this issue. > > Could you cook a fix for this issue? Otherwise I am happy to do > that. Hello Ming, I will look further into this. Bart. ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle 2018-01-19 7:26 ` Ming Lei 2018-01-19 15:20 ` Bart Van Assche @ 2018-01-19 15:24 ` Jens Axboe 2018-01-19 15:40 ` Ming Lei 2018-01-29 22:37 ` Bart Van Assche 1 sibling, 2 replies; 80+ messages in thread From: Jens Axboe @ 2018-01-19 15:24 UTC (permalink / raw) To: Ming Lei Cc: Bart Van Assche, snitzer, dm-devel, hch, linux-kernel, linux-block, osandov On 1/19/18 12:26 AM, Ming Lei wrote: > On Thu, Jan 18, 2018 at 09:02:45PM -0700, Jens Axboe wrote: >> On 1/18/18 7:32 PM, Ming Lei wrote: >>> On Thu, Jan 18, 2018 at 01:11:01PM -0700, Jens Axboe wrote: >>>> On 1/18/18 11:47 AM, Bart Van Assche wrote: >>>>>> This is all very tiresome. >>>>> >>>>> Yes, this is tiresome. It is very annoying to me that others keep >>>>> introducing so many regressions in such important parts of the kernel. >>>>> It is also annoying to me that I get blamed if I report a regression >>>>> instead of seeing that the regression gets fixed. >>>> >>>> I agree, it sucks that any change there introduces the regression. I'm >>>> fine with doing the delay insert again until a new patch is proven to be >>>> better. >>> >>> That way is still buggy as I explained, since rerun queue before adding >>> request to hctx->dispatch_list isn't correct. Who can make sure the request >>> is visible when __blk_mq_run_hw_queue() is called? >> >> That race basically doesn't exist for a 10ms gap. >> >>> Not mention this way will cause performance regression again. >> >> How so? It's _exactly_ the same as what you are proposing, except mine >> will potentially run the queue when it need not do so. But given that >> these are random 10ms queue kicks because we are screwed, it should not >> matter. The key point is that it only should be if we have NO better >> options. If it's a frequently occurring event that we have to return >> BLK_STS_RESOURCE, then we need to get a way to register an event for >> when that condition clears. 
That event will then kick the necessary >> queue(s). > > Please see queue_delayed_work_on(), hctx->run_work is shared by all > scheduling, once blk_mq_delay_run_hw_queue(100ms) returns, no new > scheduling can make progress during the 100ms. That's a bug, plain and simple. If someone does "run this queue in 100ms" and someone else comes in and says "run this queue now", the correct outcome is running this queue now. >>>> From the original topic of this email, we have conditions that can cause >>>> the driver to not be able to submit an IO. A set of those conditions can >>>> only happen if IO is in flight, and those cases we have covered just >>>> fine. Another set can potentially trigger without IO being in flight. >>>> These are cases where a non-device resource is unavailable at the time >>>> of submission. This might be iommu running out of space, for instance, >>>> or it might be a memory allocation of some sort. For these cases, we >>>> don't get any notification when the shortage clears. All we can do is >>>> ensure that we restart operations at some point in the future. We're SOL >>>> at that point, but we have to ensure that we make forward progress. >>> >>> Right, it is a generic issue, not DM-specific one, almost all drivers >>> call kmalloc(GFP_ATOMIC) in IO path. >> >> GFP_ATOMIC basically never fails, unless we are out of memory. The > > I guess GFP_KERNEL may never fail, but GFP_ATOMIC failure might be > possible, and it is mentioned[1] there is such code in mm allocation > path, also OOM can happen too. > > if (some randomly generated condition) && (request is atomic) > return NULL; > > [1] https://lwn.net/Articles/276731/ That article is 10 years old. Once you run large scale production, you see what the real failures are. Fact is, for zero order allocation, if the atomic alloc fails the shit has really hit the fan. In that case, a delay of 10ms is not your main issue. 
It's a total red herring when you compare to the frequency of what Bart is seeing. It's noise, and irrelevant here. For an atomic zero order allocation failure, doing a short random sleep is perfectly fine. >> exception is higher order allocations. If a driver has a higher order >> atomic allocation in its IO path, the device driver writer needs to be >> taken out behind the barn and shot. Simple as that. It will NEVER work >> well in a production environment. Witness the disaster that so many NIC >> driver writers have learned. >> >> This is NOT the case we care about here. It's resources that are more >> readily depleted because other devices are using them. If it's a high >> frequency or generally occurring event, then we simply must have a >> callback to restart the queue from that. The condition then becomes >> identical to device private starvation, the only difference being from >> where we restart the queue. >> >>> IMO, there is enough time for figuring out a generic solution before >>> 4.16 release. >> >> I would hope so, but the proposed solutions have not filled me with >> a lot of confidence in the end result so far. >> >>>> That last set of conditions better not be a a common occurence, since >>>> performance is down the toilet at that point. I don't want to introduce >>>> hot path code to rectify it. Have the driver return if that happens in a >>>> way that is DIFFERENT from needing a normal restart. The driver knows if >>>> this is a resource that will become available when IO completes on this >>>> device or not. If we get that return, we have a generic run-again delay. >>> >>> Now most of times both NVMe and SCSI won't return BLK_STS_RESOURCE, and >>> it should be DM-only which returns STS_RESOURCE so often. >> >> Where does the dm STS_RESOURCE error usually come from - what's exact >> resource are we running out of? > > It is from blk_get_request(underlying queue), see > multipath_clone_and_map(). That's what I thought. 
So for a low queue depth underlying queue, it's quite possible that this situation can happen. Two potential solutions I see: 1) As described earlier in this thread, having a mechanism for being notified when the scarce resource becomes available. It would not be hard to tap into the existing sbitmap wait queue for that. 2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource allocation. I haven't read the dm code to know if this is a possibility or not. I'd probably prefer #1. It's a classic case of trying to get the request, and if it fails, add ourselves to the sbitmap tag wait queue head, retry, and bail if that also fails. Connecting the scarce resource and the consumer is the only way to really fix this, without bogus arbitrary delays. -- Jens Axboe ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle 2018-01-19 15:24 ` Jens Axboe @ 2018-01-19 15:40 ` Ming Lei 2018-01-29 22:37 ` Bart Van Assche 1 sibling, 0 replies; 80+ messages in thread From: Ming Lei @ 2018-01-19 15:40 UTC (permalink / raw) To: Jens Axboe Cc: Bart Van Assche, snitzer, dm-devel, hch, linux-kernel, linux-block, osandov On Fri, Jan 19, 2018 at 08:24:06AM -0700, Jens Axboe wrote: > On 1/19/18 12:26 AM, Ming Lei wrote: > > On Thu, Jan 18, 2018 at 09:02:45PM -0700, Jens Axboe wrote: > >> On 1/18/18 7:32 PM, Ming Lei wrote: > >>> On Thu, Jan 18, 2018 at 01:11:01PM -0700, Jens Axboe wrote: > >>>> On 1/18/18 11:47 AM, Bart Van Assche wrote: > >>>>>> This is all very tiresome. > >>>>> > >>>>> Yes, this is tiresome. It is very annoying to me that others keep > >>>>> introducing so many regressions in such important parts of the kernel. > >>>>> It is also annoying to me that I get blamed if I report a regression > >>>>> instead of seeing that the regression gets fixed. > >>>> > >>>> I agree, it sucks that any change there introduces the regression. I'm > >>>> fine with doing the delay insert again until a new patch is proven to be > >>>> better. > >>> > >>> That way is still buggy as I explained, since rerun queue before adding > >>> request to hctx->dispatch_list isn't correct. Who can make sure the request > >>> is visible when __blk_mq_run_hw_queue() is called? > >> > >> That race basically doesn't exist for a 10ms gap. > >> > >>> Not mention this way will cause performance regression again. > >> > >> How so? It's _exactly_ the same as what you are proposing, except mine > >> will potentially run the queue when it need not do so. But given that > >> these are random 10ms queue kicks because we are screwed, it should not > >> matter. The key point is that it only should be if we have NO better > >> options. 
If it's a frequently occurring event that we have to return > >> BLK_STS_RESOURCE, then we need to get a way to register an event for > >> when that condition clears. That event will then kick the necessary > >> queue(s). > > > > Please see queue_delayed_work_on(), hctx->run_work is shared by all > > scheduling, once blk_mq_delay_run_hw_queue(100ms) returns, no new > > scheduling can make progress during the 100ms. > > That's a bug, plain and simple. If someone does "run this queue in > 100ms" and someone else comes in and says "run this queue now", the > correct outcome is running this queue now. > > >>>> From the original topic of this email, we have conditions that can cause > >>>> the driver to not be able to submit an IO. A set of those conditions can > >>>> only happen if IO is in flight, and those cases we have covered just > >>>> fine. Another set can potentially trigger without IO being in flight. > >>>> These are cases where a non-device resource is unavailable at the time > >>>> of submission. This might be iommu running out of space, for instance, > >>>> or it might be a memory allocation of some sort. For these cases, we > >>>> don't get any notification when the shortage clears. All we can do is > >>>> ensure that we restart operations at some point in the future. We're SOL > >>>> at that point, but we have to ensure that we make forward progress. > >>> > >>> Right, it is a generic issue, not DM-specific one, almost all drivers > >>> call kmalloc(GFP_ATOMIC) in IO path. > >> > >> GFP_ATOMIC basically never fails, unless we are out of memory. The > > > > I guess GFP_KERNEL may never fail, but GFP_ATOMIC failure might be > > possible, and it is mentioned[1] there is such code in mm allocation > > path, also OOM can happen too. > > > > if (some randomly generated condition) && (request is atomic) > > return NULL; > > > > [1] https://lwn.net/Articles/276731/ > > That article is 10 years old. 
Once you run large scale production, you > see what the real failures are. Fact is, for zero order allocation, if > the atomic alloc fails the shit has really hit the fan. In that case, a > delay of 10ms is not your main issue. It's a total red herring when you > compare to the frequency of what Bart is seeing. It's noise, and > irrelevant here. For an atomic zero order allocation failure, doing a > short random sleep is perfectly fine. > > >> exception is higher order allocations. If a driver has a higher order > >> atomic allocation in its IO path, the device driver writer needs to be > >> taken out behind the barn and shot. Simple as that. It will NEVER work > >> well in a production environment. Witness the disaster that so many NIC > >> driver writers have learned. > >> > >> This is NOT the case we care about here. It's resources that are more > >> readily depleted because other devices are using them. If it's a high > >> frequency or generally occurring event, then we simply must have a > >> callback to restart the queue from that. The condition then becomes > >> identical to device private starvation, the only difference being from > >> where we restart the queue. > >> > >>> IMO, there is enough time for figuring out a generic solution before > >>> 4.16 release. > >> > >> I would hope so, but the proposed solutions have not filled me with > >> a lot of confidence in the end result so far. > >> > >>>> That last set of conditions better not be a a common occurence, since > >>>> performance is down the toilet at that point. I don't want to introduce > >>>> hot path code to rectify it. Have the driver return if that happens in a > >>>> way that is DIFFERENT from needing a normal restart. The driver knows if > >>>> this is a resource that will become available when IO completes on this > >>>> device or not. If we get that return, we have a generic run-again delay. 
> >>> > >>> Now most of times both NVMe and SCSI won't return BLK_STS_RESOURCE, and > >>> it should be DM-only which returns STS_RESOURCE so often. > >> > >> Where does the dm STS_RESOURCE error usually come from - what's exact > >> resource are we running out of? > > > > It is from blk_get_request(underlying queue), see > > multipath_clone_and_map(). > > That's what I thought. So for a low queue depth underlying queue, it's > quite possible that this situation can happen. Two potential solutions > I see: > > 1) As described earlier in this thread, having a mechanism for being > notified when the scarce resource becomes available. It would not > be hard to tap into the existing sbitmap wait queue for that. > > 2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource > allocation. I haven't read the dm code to know if this is a > possibility or not. > > I'd probably prefer #1. It's a classic case of trying to get the > request, and if it fails, add ourselves to the sbitmap tag wait > queue head, retry, and bail if that also fails. Connecting the > scarce resource and the consumer is the only way to really fix > this, without bogus arbitrary delays. Right, as I have replied to Bart, using mod_delayed_work_on() and returning BLK_STS_NO_DEV_RESOURCE (or some such name) for the scarce resource should fix this issue. -- Ming ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle 2018-01-19 15:40 ` Ming Lei (?) @ 2018-01-19 15:48 ` Jens Axboe 2018-01-19 16:05 ` Ming Lei 2018-01-19 16:13 ` Mike Snitzer -1 siblings, 2 replies; 80+ messages in thread From: Jens Axboe @ 2018-01-19 15:48 UTC (permalink / raw) To: Ming Lei Cc: Bart Van Assche, snitzer, dm-devel, hch, linux-kernel, linux-block, osandov On 1/19/18 8:40 AM, Ming Lei wrote: >>>> Where does the dm STS_RESOURCE error usually come from - what's exact >>>> resource are we running out of? >>> >>> It is from blk_get_request(underlying queue), see >>> multipath_clone_and_map(). >> >> That's what I thought. So for a low queue depth underlying queue, it's >> quite possible that this situation can happen. Two potential solutions >> I see: >> >> 1) As described earlier in this thread, having a mechanism for being >> notified when the scarce resource becomes available. It would not >> be hard to tap into the existing sbitmap wait queue for that. >> >> 2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource >> allocation. I haven't read the dm code to know if this is a >> possibility or not. >> >> I'd probably prefer #1. It's a classic case of trying to get the >> request, and if it fails, add ourselves to the sbitmap tag wait >> queue head, retry, and bail if that also fails. Connecting the >> scarce resource and the consumer is the only way to really fix >> this, without bogus arbitrary delays. > > Right, as I have replied to Bart, using mod_delayed_work_on() with > returning BLK_STS_NO_DEV_RESOURCE(or sort of name) for the scarce > resource should fix this issue. It'll fix the forever stall, but it won't really fix it, as we'll slow down the dm device by some random amount. A simple test case would be to have a null_blk device with a queue depth of one, and dm on top of that. Start a fio job that runs two jobs: one that does IO to the underlying device, and one that does IO to the dm device. 
If the job on the dm device runs substantially slower than the one to the underlying device, then the problem isn't really fixed. That said, I'm fine with ensuring that we make forward progress always first, and then we can come up with a proper solution to the issue. The forward progress guarantee will be needed for the more rare failure cases, like allocation failures. nvme needs that too, for instance, for the discard range struct allocation. -- Jens Axboe ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle 2018-01-19 15:48 ` Jens Axboe @ 2018-01-19 16:05 ` Ming Lei 2018-01-19 16:19 ` Jens Axboe 2018-01-19 16:13 ` Mike Snitzer 1 sibling, 1 reply; 80+ messages in thread From: Ming Lei @ 2018-01-19 16:05 UTC (permalink / raw) To: Jens Axboe Cc: Bart Van Assche, snitzer, dm-devel, hch, linux-kernel, linux-block, osandov On Fri, Jan 19, 2018 at 08:48:55AM -0700, Jens Axboe wrote: > On 1/19/18 8:40 AM, Ming Lei wrote: > >>>> Where does the dm STS_RESOURCE error usually come from - what's exact > >>>> resource are we running out of? > >>> > >>> It is from blk_get_request(underlying queue), see > >>> multipath_clone_and_map(). > >> > >> That's what I thought. So for a low queue depth underlying queue, it's > >> quite possible that this situation can happen. Two potential solutions > >> I see: > >> > >> 1) As described earlier in this thread, having a mechanism for being > >> notified when the scarce resource becomes available. It would not > >> be hard to tap into the existing sbitmap wait queue for that. > >> > >> 2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource > >> allocation. I haven't read the dm code to know if this is a > >> possibility or not. > >> > >> I'd probably prefer #1. It's a classic case of trying to get the > >> request, and if it fails, add ourselves to the sbitmap tag wait > >> queue head, retry, and bail if that also fails. Connecting the > >> scarce resource and the consumer is the only way to really fix > >> this, without bogus arbitrary delays. > > > > Right, as I have replied to Bart, using mod_delayed_work_on() with > > returning BLK_STS_NO_DEV_RESOURCE(or sort of name) for the scarce > > resource should fix this issue. > > It'll fix the forever stall, but it won't really fix it, as we'll slow > down the dm device by some random amount. > > A simple test case would be to have a null_blk device with a queue depth > of one, and dm on top of that. 
Start a fio job that runs two jobs: one > that does IO to the underlying device, and one that does IO to the dm > device. If the job on the dm device runs substantially slower than the > one to the underlying device, then the problem isn't really fixed. I remember trying this test on scsi-debug & dm-mpath over scsi-debug and did not observe this issue. Could you explain a bit why IO over dm-mpath may be slower? Both IO contexts call the same get_request(), and in theory dm-mpath should be a bit quicker, since it uses direct issue for the underlying queue without an io scheduler involved. -- Ming ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle 2018-01-19 16:05 ` Ming Lei @ 2018-01-19 16:19 ` Jens Axboe 2018-01-19 16:26 ` Ming Lei 0 siblings, 1 reply; 80+ messages in thread From: Jens Axboe @ 2018-01-19 16:19 UTC (permalink / raw) To: Ming Lei Cc: Bart Van Assche, snitzer, dm-devel, hch, linux-kernel, linux-block, osandov On 1/19/18 9:05 AM, Ming Lei wrote: > On Fri, Jan 19, 2018 at 08:48:55AM -0700, Jens Axboe wrote: >> On 1/19/18 8:40 AM, Ming Lei wrote: >>>>>> Where does the dm STS_RESOURCE error usually come from - what's exact >>>>>> resource are we running out of? >>>>> >>>>> It is from blk_get_request(underlying queue), see >>>>> multipath_clone_and_map(). >>>> >>>> That's what I thought. So for a low queue depth underlying queue, it's >>>> quite possible that this situation can happen. Two potential solutions >>>> I see: >>>> >>>> 1) As described earlier in this thread, having a mechanism for being >>>> notified when the scarce resource becomes available. It would not >>>> be hard to tap into the existing sbitmap wait queue for that. >>>> >>>> 2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource >>>> allocation. I haven't read the dm code to know if this is a >>>> possibility or not. >>>> >>>> I'd probably prefer #1. It's a classic case of trying to get the >>>> request, and if it fails, add ourselves to the sbitmap tag wait >>>> queue head, retry, and bail if that also fails. Connecting the >>>> scarce resource and the consumer is the only way to really fix >>>> this, without bogus arbitrary delays. >>> >>> Right, as I have replied to Bart, using mod_delayed_work_on() with >>> returning BLK_STS_NO_DEV_RESOURCE(or sort of name) for the scarce >>> resource should fix this issue. >> >> It'll fix the forever stall, but it won't really fix it, as we'll slow >> down the dm device by some random amount. >> >> A simple test case would be to have a null_blk device with a queue depth >> of one, and dm on top of that. 
Start a fio job that runs two jobs: one >> that does IO to the underlying device, and one that does IO to the dm >> device. If the job on the dm device runs substantially slower than the >> one to the underlying device, then the problem isn't really fixed. > > I remembered that I tried this test on scsi-debug & dm-mpath over scsi-debug, > seems not observed this issue, could you explain a bit why IO over dm-mpath > may be slower? Because both two IO contexts call same get_request(), and > in theory dm-mpath should be a bit quicker since it uses direct issue for > underlying queue, without io scheduler involved. Because if you lose the race for getting the request, you'll have some arbitrary delay before trying again, potentially. Compared to the direct user of the underlying device, who will simply sleep on the resource and get woken the instant it's available. -- Jens Axboe ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle 2018-01-19 16:19 ` Jens Axboe @ 2018-01-19 16:26 ` Ming Lei 2018-01-19 16:27 ` Jens Axboe 0 siblings, 1 reply; 80+ messages in thread From: Ming Lei @ 2018-01-19 16:26 UTC (permalink / raw) To: Jens Axboe Cc: Bart Van Assche, snitzer, dm-devel, hch, linux-kernel, linux-block, osandov On Fri, Jan 19, 2018 at 09:19:24AM -0700, Jens Axboe wrote: > On 1/19/18 9:05 AM, Ming Lei wrote: > > On Fri, Jan 19, 2018 at 08:48:55AM -0700, Jens Axboe wrote: > >> On 1/19/18 8:40 AM, Ming Lei wrote: > >>>>>> Where does the dm STS_RESOURCE error usually come from - what's exact > >>>>>> resource are we running out of? > >>>>> > >>>>> It is from blk_get_request(underlying queue), see > >>>>> multipath_clone_and_map(). > >>>> > >>>> That's what I thought. So for a low queue depth underlying queue, it's > >>>> quite possible that this situation can happen. Two potential solutions > >>>> I see: > >>>> > >>>> 1) As described earlier in this thread, having a mechanism for being > >>>> notified when the scarce resource becomes available. It would not > >>>> be hard to tap into the existing sbitmap wait queue for that. > >>>> > >>>> 2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource > >>>> allocation. I haven't read the dm code to know if this is a > >>>> possibility or not. > >>>> > >>>> I'd probably prefer #1. It's a classic case of trying to get the > >>>> request, and if it fails, add ourselves to the sbitmap tag wait > >>>> queue head, retry, and bail if that also fails. Connecting the > >>>> scarce resource and the consumer is the only way to really fix > >>>> this, without bogus arbitrary delays. > >>> > >>> Right, as I have replied to Bart, using mod_delayed_work_on() with > >>> returning BLK_STS_NO_DEV_RESOURCE(or sort of name) for the scarce > >>> resource should fix this issue. > >> > >> It'll fix the forever stall, but it won't really fix it, as we'll slow > >> down the dm device by some random amount. 
> >> > >> A simple test case would be to have a null_blk device with a queue depth > >> of one, and dm on top of that. Start a fio job that runs two jobs: one > >> that does IO to the underlying device, and one that does IO to the dm > >> device. If the job on the dm device runs substantially slower than the > >> one to the underlying device, then the problem isn't really fixed. > > > > I remembered that I tried this test on scsi-debug & dm-mpath over scsi-debug, > > seems not observed this issue, could you explain a bit why IO over dm-mpath > > may be slower? Because both two IO contexts call same get_request(), and > > in theory dm-mpath should be a bit quicker since it uses direct issue for > > underlying queue, without io scheduler involved. > > Because if you lose the race for getting the request, you'll have some > arbitrary delay before trying again, potentially. Compared to the direct But the restart still works, one request is completed, then the queue is return immediately because we use mod_delayed_work_on(0), so looks no such issue. -- Ming ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle 2018-01-19 16:26 ` Ming Lei @ 2018-01-19 16:27 ` Jens Axboe 2018-01-19 16:37 ` Ming Lei 0 siblings, 1 reply; 80+ messages in thread From: Jens Axboe @ 2018-01-19 16:27 UTC (permalink / raw) To: Ming Lei Cc: Bart Van Assche, snitzer, dm-devel, hch, linux-kernel, linux-block, osandov On 1/19/18 9:26 AM, Ming Lei wrote: > On Fri, Jan 19, 2018 at 09:19:24AM -0700, Jens Axboe wrote: >> On 1/19/18 9:05 AM, Ming Lei wrote: >>> On Fri, Jan 19, 2018 at 08:48:55AM -0700, Jens Axboe wrote: >>>> On 1/19/18 8:40 AM, Ming Lei wrote: >>>>>>>> Where does the dm STS_RESOURCE error usually come from - what's exact >>>>>>>> resource are we running out of? >>>>>>> >>>>>>> It is from blk_get_request(underlying queue), see >>>>>>> multipath_clone_and_map(). >>>>>> >>>>>> That's what I thought. So for a low queue depth underlying queue, it's >>>>>> quite possible that this situation can happen. Two potential solutions >>>>>> I see: >>>>>> >>>>>> 1) As described earlier in this thread, having a mechanism for being >>>>>> notified when the scarce resource becomes available. It would not >>>>>> be hard to tap into the existing sbitmap wait queue for that. >>>>>> >>>>>> 2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource >>>>>> allocation. I haven't read the dm code to know if this is a >>>>>> possibility or not. >>>>>> >>>>>> I'd probably prefer #1. It's a classic case of trying to get the >>>>>> request, and if it fails, add ourselves to the sbitmap tag wait >>>>>> queue head, retry, and bail if that also fails. Connecting the >>>>>> scarce resource and the consumer is the only way to really fix >>>>>> this, without bogus arbitrary delays. >>>>> >>>>> Right, as I have replied to Bart, using mod_delayed_work_on() with >>>>> returning BLK_STS_NO_DEV_RESOURCE(or sort of name) for the scarce >>>>> resource should fix this issue. 
>>>> >>>> It'll fix the forever stall, but it won't really fix it, as we'll slow >>>> down the dm device by some random amount. >>>> >>>> A simple test case would be to have a null_blk device with a queue depth >>>> of one, and dm on top of that. Start a fio job that runs two jobs: one >>>> that does IO to the underlying device, and one that does IO to the dm >>>> device. If the job on the dm device runs substantially slower than the >>>> one to the underlying device, then the problem isn't really fixed. >>> >>> I remembered that I tried this test on scsi-debug & dm-mpath over scsi-debug, >>> seems not observed this issue, could you explain a bit why IO over dm-mpath >>> may be slower? Because both two IO contexts call same get_request(), and >>> in theory dm-mpath should be a bit quicker since it uses direct issue for >>> underlying queue, without io scheduler involved. >> >> Because if you lose the race for getting the request, you'll have some >> arbitrary delay before trying again, potentially. Compared to the direct > > But the restart still works, one request is completed, then the queue > is return immediately because we use mod_delayed_work_on(0), so looks > no such issue. There are no pending requests for this case, nothing to restart the queue. When you fail that blk_get_request(), you are idle, nothing is pending. -- Jens Axboe ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle 2018-01-19 16:27 ` Jens Axboe @ 2018-01-19 16:37 ` Ming Lei 2018-01-19 16:41 ` Jens Axboe 2018-01-19 17:38 ` Jens Axboe 0 siblings, 2 replies; 80+ messages in thread From: Ming Lei @ 2018-01-19 16:37 UTC (permalink / raw) To: Jens Axboe Cc: Bart Van Assche, snitzer, dm-devel, hch, linux-kernel, linux-block, osandov On Fri, Jan 19, 2018 at 09:27:46AM -0700, Jens Axboe wrote: > On 1/19/18 9:26 AM, Ming Lei wrote: > > On Fri, Jan 19, 2018 at 09:19:24AM -0700, Jens Axboe wrote: > >> On 1/19/18 9:05 AM, Ming Lei wrote: > >>> On Fri, Jan 19, 2018 at 08:48:55AM -0700, Jens Axboe wrote: > >>>> On 1/19/18 8:40 AM, Ming Lei wrote: > >>>>>>>> Where does the dm STS_RESOURCE error usually come from - what's exact > >>>>>>>> resource are we running out of? > >>>>>>> > >>>>>>> It is from blk_get_request(underlying queue), see > >>>>>>> multipath_clone_and_map(). > >>>>>> > >>>>>> That's what I thought. So for a low queue depth underlying queue, it's > >>>>>> quite possible that this situation can happen. Two potential solutions > >>>>>> I see: > >>>>>> > >>>>>> 1) As described earlier in this thread, having a mechanism for being > >>>>>> notified when the scarce resource becomes available. It would not > >>>>>> be hard to tap into the existing sbitmap wait queue for that. > >>>>>> > >>>>>> 2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource > >>>>>> allocation. I haven't read the dm code to know if this is a > >>>>>> possibility or not. > >>>>>> > >>>>>> I'd probably prefer #1. It's a classic case of trying to get the > >>>>>> request, and if it fails, add ourselves to the sbitmap tag wait > >>>>>> queue head, retry, and bail if that also fails. Connecting the > >>>>>> scarce resource and the consumer is the only way to really fix > >>>>>> this, without bogus arbitrary delays. 
> >>>>> > >>>>> Right, as I have replied to Bart, using mod_delayed_work_on() with > >>>>> returning BLK_STS_NO_DEV_RESOURCE(or sort of name) for the scarce > >>>>> resource should fix this issue. > >>>> > >>>> It'll fix the forever stall, but it won't really fix it, as we'll slow > >>>> down the dm device by some random amount. > >>>> > >>>> A simple test case would be to have a null_blk device with a queue depth > >>>> of one, and dm on top of that. Start a fio job that runs two jobs: one > >>>> that does IO to the underlying device, and one that does IO to the dm > >>>> device. If the job on the dm device runs substantially slower than the > >>>> one to the underlying device, then the problem isn't really fixed. > >>> > >>> I remembered that I tried this test on scsi-debug & dm-mpath over scsi-debug, > >>> seems not observed this issue, could you explain a bit why IO over dm-mpath > >>> may be slower? Because both two IO contexts call same get_request(), and > >>> in theory dm-mpath should be a bit quicker since it uses direct issue for > >>> underlying queue, without io scheduler involved. > >> > >> Because if you lose the race for getting the request, you'll have some > >> arbitrary delay before trying again, potentially. Compared to the direct > > > > But the restart still works, one request is completed, then the queue > > is return immediately because we use mod_delayed_work_on(0), so looks > > no such issue. > > There are no pending requests for this case, nothing to restart the > queue. When you fail that blk_get_request(), you are idle, nothing > is pending. I think we needn't worry about that, once a device is attached to dm-rq, it can't be mounted any more, and usually user don't use the device directly and by dm-mpath at the same time. -- Ming ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle 2018-01-19 16:37 ` Ming Lei @ 2018-01-19 16:41 ` Jens Axboe 2018-01-19 17:38 ` Jens Axboe 1 sibling, 0 replies; 80+ messages in thread From: Jens Axboe @ 2018-01-19 16:41 UTC (permalink / raw) To: Ming Lei Cc: Bart Van Assche, snitzer, dm-devel, hch, linux-kernel, linux-block, osandov On 1/19/18 9:37 AM, Ming Lei wrote: > On Fri, Jan 19, 2018 at 09:27:46AM -0700, Jens Axboe wrote: >> On 1/19/18 9:26 AM, Ming Lei wrote: >>> On Fri, Jan 19, 2018 at 09:19:24AM -0700, Jens Axboe wrote: >>>> On 1/19/18 9:05 AM, Ming Lei wrote: >>>>> On Fri, Jan 19, 2018 at 08:48:55AM -0700, Jens Axboe wrote: >>>>>> On 1/19/18 8:40 AM, Ming Lei wrote: >>>>>>>>>> Where does the dm STS_RESOURCE error usually come from - what's exact >>>>>>>>>> resource are we running out of? >>>>>>>>> >>>>>>>>> It is from blk_get_request(underlying queue), see >>>>>>>>> multipath_clone_and_map(). >>>>>>>> >>>>>>>> That's what I thought. So for a low queue depth underlying queue, it's >>>>>>>> quite possible that this situation can happen. Two potential solutions >>>>>>>> I see: >>>>>>>> >>>>>>>> 1) As described earlier in this thread, having a mechanism for being >>>>>>>> notified when the scarce resource becomes available. It would not >>>>>>>> be hard to tap into the existing sbitmap wait queue for that. >>>>>>>> >>>>>>>> 2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource >>>>>>>> allocation. I haven't read the dm code to know if this is a >>>>>>>> possibility or not. >>>>>>>> >>>>>>>> I'd probably prefer #1. It's a classic case of trying to get the >>>>>>>> request, and if it fails, add ourselves to the sbitmap tag wait >>>>>>>> queue head, retry, and bail if that also fails. Connecting the >>>>>>>> scarce resource and the consumer is the only way to really fix >>>>>>>> this, without bogus arbitrary delays. 
>>>>>>> >>>>>>> Right, as I have replied to Bart, using mod_delayed_work_on() with >>>>>>> returning BLK_STS_NO_DEV_RESOURCE(or sort of name) for the scarce >>>>>>> resource should fix this issue. >>>>>> >>>>>> It'll fix the forever stall, but it won't really fix it, as we'll slow >>>>>> down the dm device by some random amount. >>>>>> >>>>>> A simple test case would be to have a null_blk device with a queue depth >>>>>> of one, and dm on top of that. Start a fio job that runs two jobs: one >>>>>> that does IO to the underlying device, and one that does IO to the dm >>>>>> device. If the job on the dm device runs substantially slower than the >>>>>> one to the underlying device, then the problem isn't really fixed. >>>>> >>>>> I remembered that I tried this test on scsi-debug & dm-mpath over scsi-debug, >>>>> seems not observed this issue, could you explain a bit why IO over dm-mpath >>>>> may be slower? Because both two IO contexts call same get_request(), and >>>>> in theory dm-mpath should be a bit quicker since it uses direct issue for >>>>> underlying queue, without io scheduler involved. >>>> >>>> Because if you lose the race for getting the request, you'll have some >>>> arbitrary delay before trying again, potentially. Compared to the direct >>> >>> But the restart still works, one request is completed, then the queue >>> is return immediately because we use mod_delayed_work_on(0), so looks >>> no such issue. >> >> There are no pending requests for this case, nothing to restart the >> queue. When you fail that blk_get_request(), you are idle, nothing >> is pending. > > I think we needn't worry about that, once a device is attached to > dm-rq, it can't be mounted any more, and usually user don't use the device > directly and by dm-mpath at the same time. Even if it doesn't happen for a normal dm setup, it is a case that needs to be handled. The request allocation is just one example of a wider scope resource that can be unavailable. 
If the driver returns NO_DEV_RESOURCE (or whatever name), it is possible that the device itself is currently idle. -- Jens Axboe ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle 2018-01-19 16:41 ` Jens Axboe (?) @ 2018-01-19 16:47 ` Mike Snitzer 2018-01-19 16:52 ` Jens Axboe -1 siblings, 1 reply; 80+ messages in thread From: Mike Snitzer @ 2018-01-19 16:47 UTC (permalink / raw) To: Jens Axboe Cc: Ming Lei, Bart Van Assche, dm-devel, hch, linux-kernel, linux-block, osandov On Fri, Jan 19 2018 at 11:41am -0500, Jens Axboe <axboe@kernel.dk> wrote: > On 1/19/18 9:37 AM, Ming Lei wrote: > > On Fri, Jan 19, 2018 at 09:27:46AM -0700, Jens Axboe wrote: > >> On 1/19/18 9:26 AM, Ming Lei wrote: > >>> On Fri, Jan 19, 2018 at 09:19:24AM -0700, Jens Axboe wrote: > >> > >> There are no pending requests for this case, nothing to restart the > >> queue. When you fail that blk_get_request(), you are idle, nothing > >> is pending. > > > > I think we needn't worry about that, once a device is attached to > > dm-rq, it can't be mounted any more, and usually user don't use the device > > directly and by dm-mpath at the same time. > > Even if it doesn't happen for a normal dm setup, it is a case that > needs to be handled. The request allocation is just one example of > a wider scope resource that can be unavailable. If the driver returns > NO_DEV_RESOURCE (or whatever name), it will be a possibility that > the device itself is currently idle. How would a driver's resources be exhausted yet the device is idle (so as not to be able to benefit from RESTART)? ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle 2018-01-19 16:47 ` Mike Snitzer @ 2018-01-19 16:52 ` Jens Axboe 2018-01-19 17:05 ` Ming Lei 0 siblings, 1 reply; 80+ messages in thread From: Jens Axboe @ 2018-01-19 16:52 UTC (permalink / raw) To: Mike Snitzer Cc: Ming Lei, Bart Van Assche, dm-devel, hch, linux-kernel, linux-block, osandov On 1/19/18 9:47 AM, Mike Snitzer wrote: > On Fri, Jan 19 2018 at 11:41am -0500, > Jens Axboe <axboe@kernel.dk> wrote: > >> On 1/19/18 9:37 AM, Ming Lei wrote: >>> On Fri, Jan 19, 2018 at 09:27:46AM -0700, Jens Axboe wrote: >>>> On 1/19/18 9:26 AM, Ming Lei wrote: >>>>> On Fri, Jan 19, 2018 at 09:19:24AM -0700, Jens Axboe wrote: >>>> >>>> There are no pending requests for this case, nothing to restart the >>>> queue. When you fail that blk_get_request(), you are idle, nothing >>>> is pending. >>> >>> I think we needn't worry about that, once a device is attached to >>> dm-rq, it can't be mounted any more, and usually user don't use the device >>> directly and by dm-mpath at the same time. >> >> Even if it doesn't happen for a normal dm setup, it is a case that >> needs to be handled. The request allocation is just one example of >> a wider scope resource that can be unavailable. If the driver returns >> NO_DEV_RESOURCE (or whatever name), it will be a possibility that >> the device itself is currently idle. > > How would a driver's resources be exhausted yet the device is idle (so > as not to be able to benefit from RESTART)? I've outlined a number of these examples already. Another case might be: 1) Device is idle 2) Device gets request 3) Device attempts to DMA map 4) DMA map fails because the IOMMU is out of space (nic is using it all) 5) Device returns STS_RESOURCE 6) Queue is marked as needing a restart All's well, except there is no IO on this device that will notice the restart bit and retry the operation. 
Replace IOMMU failure with any other resource that the driver might need for an IO, which isn't tied to a device specific resource. blk_get_request() on dm is an example, as is any allocation failure occurring in the queue IO path for the driver. -- Jens Axboe ^ permalink raw reply [flat|nested] 80+ messages in thread
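The failure mode Jens enumerates can be reduced to a tiny state machine: the RESTART bit is only ever acted on from the completion path, so setting it while nothing is in flight parks the queue forever. A hypothetical userspace reduction (names are stand-ins, not kernel code):

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal model of a hardware queue's restart state. */
struct hctx_model {
    bool restart;    /* stands in for BLK_MQ_S_SCHED_RESTART */
    int in_flight;   /* requests currently owned by the device */
    int reruns;      /* how many times the queue got re-run */
};

/* Steps 3-6: ->queue_rq() fails (DMA map, kmalloc, ...), bit gets set. */
static void dispatch_fails_sts_resource(struct hctx_model *h)
{
    h->restart = true;
}

/*
 * The only place that observes the bit is request completion.  Returns
 * false when nothing is in flight, i.e. no completion will ever come
 * along to clear the bit and re-run the queue.
 */
static bool complete_one(struct hctx_model *h)
{
    if (h->in_flight == 0)
        return false;
    h->in_flight--;
    if (h->restart) {
        h->restart = false;
        h->reruns++;           /* queue re-run from the completion path */
    }
    return true;
}
```

Running the two scenarios side by side shows the asymmetry: with one request in flight the restart fires on completion; with the device idle the bit is set and then nothing ever consumes it, which is the IO hang the patch is trying to close.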
* Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle 2018-01-19 16:52 ` Jens Axboe @ 2018-01-19 17:05 ` Ming Lei 2018-01-19 17:09 ` Jens Axboe 0 siblings, 1 reply; 80+ messages in thread From: Ming Lei @ 2018-01-19 17:05 UTC (permalink / raw) To: Jens Axboe Cc: Mike Snitzer, Bart Van Assche, dm-devel, hch, linux-kernel, linux-block, osandov On Fri, Jan 19, 2018 at 09:52:32AM -0700, Jens Axboe wrote: > On 1/19/18 9:47 AM, Mike Snitzer wrote: > > On Fri, Jan 19 2018 at 11:41am -0500, > > Jens Axboe <axboe@kernel.dk> wrote: > > > >> On 1/19/18 9:37 AM, Ming Lei wrote: > >>> On Fri, Jan 19, 2018 at 09:27:46AM -0700, Jens Axboe wrote: > >>>> On 1/19/18 9:26 AM, Ming Lei wrote: > >>>>> On Fri, Jan 19, 2018 at 09:19:24AM -0700, Jens Axboe wrote: > >>>> > >>>> There are no pending requests for this case, nothing to restart the > >>>> queue. When you fail that blk_get_request(), you are idle, nothing > >>>> is pending. > >>> > >>> I think we needn't worry about that, once a device is attached to > >>> dm-rq, it can't be mounted any more, and usually user don't use the device > >>> directly and by dm-mpath at the same time. > >> > >> Even if it doesn't happen for a normal dm setup, it is a case that > >> needs to be handled. The request allocation is just one example of > >> a wider scope resource that can be unavailable. If the driver returns > >> NO_DEV_RESOURCE (or whatever name), it will be a possibility that > >> the device itself is currently idle. > > > > How would a driver's resources be exhausted yet the device is idle (so > > as not to be able to benefit from RESTART)? > > I've outlined a number of these examples already. 
Another case might be: > > 1) Device is idle > 2) Device gets request > 3) Device attempts to DMA map > 4) DMA map fails because the IOMMU is out of space (nic is using it all) > 5) Device returns STS_RESOURCE > 6) Queue is marked as needing a restart > > All's well, except there is no IO on this device that will notice the > restart bit and retry the operation. > > Replace IOMMU failure with any other resource that the driver might need > for an IO, which isn't tied to a device specific resource. > blk_get_request() on dm is an example, as is any allocation failure > occurring in the queue IO path for the driver. Yeah, if we decide to take the approach of introducing NO_DEV_RESOURCE, all the current STS_RESOURCE for non-device resource allocation(kmalloc, dma map, get_request, ...) should be converted to NO_DEV_RESOURCE. And it is a generic issue, which need generic solution. Seems running queue after arbitrary in this case is the only way we thought of, or other solutions? If the decision is made, let's make/review the patch, :-) -- Ming ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle 2018-01-19 17:05 ` Ming Lei @ 2018-01-19 17:09 ` Jens Axboe 2018-01-19 17:20 ` Ming Lei 0 siblings, 1 reply; 80+ messages in thread From: Jens Axboe @ 2018-01-19 17:09 UTC (permalink / raw) To: Ming Lei Cc: Mike Snitzer, Bart Van Assche, dm-devel, hch, linux-kernel, linux-block, osandov On 1/19/18 10:05 AM, Ming Lei wrote: > On Fri, Jan 19, 2018 at 09:52:32AM -0700, Jens Axboe wrote: >> On 1/19/18 9:47 AM, Mike Snitzer wrote: >>> On Fri, Jan 19 2018 at 11:41am -0500, >>> Jens Axboe <axboe@kernel.dk> wrote: >>> >>>> On 1/19/18 9:37 AM, Ming Lei wrote: >>>>> On Fri, Jan 19, 2018 at 09:27:46AM -0700, Jens Axboe wrote: >>>>>> On 1/19/18 9:26 AM, Ming Lei wrote: >>>>>>> On Fri, Jan 19, 2018 at 09:19:24AM -0700, Jens Axboe wrote: >>>>>> >>>>>> There are no pending requests for this case, nothing to restart the >>>>>> queue. When you fail that blk_get_request(), you are idle, nothing >>>>>> is pending. >>>>> >>>>> I think we needn't worry about that, once a device is attached to >>>>> dm-rq, it can't be mounted any more, and usually user don't use the device >>>>> directly and by dm-mpath at the same time. >>>> >>>> Even if it doesn't happen for a normal dm setup, it is a case that >>>> needs to be handled. The request allocation is just one example of >>>> a wider scope resource that can be unavailable. If the driver returns >>>> NO_DEV_RESOURCE (or whatever name), it will be a possibility that >>>> the device itself is currently idle. >>> >>> How would a driver's resources be exhausted yet the device is idle (so >>> as not to be able to benefit from RESTART)? >> >> I've outlined a number of these examples already. 
Another case might be: >> >> 1) Device is idle >> 2) Device gets request >> 3) Device attempts to DMA map >> 4) DMA map fails because the IOMMU is out of space (nic is using it all) >> 5) Device returns STS_RESOURCE >> 6) Queue is marked as needing a restart >> >> All's well, except there is no IO on this device that will notice the >> restart bit and retry the operation. >> >> Replace IOMMU failure with any other resource that the driver might need >> for an IO, which isn't tied to a device specific resource. >> blk_get_request() on dm is an example, as is any allocation failure >> occurring in the queue IO path for the driver. > > Yeah, if we decide to take the approach of introducing NO_DEV_RESOURCE, all > the current STS_RESOURCE for non-device resource allocation(kmalloc, dma > map, get_request, ...) should be converted to NO_DEV_RESOURCE. > > And it is a generic issue, which need generic solution. Precisely. > Seems running queue after arbitrary in this case is the only way we > thought of, or other solutions? I think that is the only solution. If it's a frequent enough occurence to cause performance issues, then it's likely down to a specific resource shortage, and we can tackle that independently (we need to, since each of those will need a specialized solution). > If the decision is made, let's make/review the patch, :-) Let 'er rip. -- Jens Axboe ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle 2018-01-19 17:09 ` Jens Axboe @ 2018-01-19 17:20 ` Ming Lei 0 siblings, 0 replies; 80+ messages in thread From: Ming Lei @ 2018-01-19 17:20 UTC (permalink / raw) To: Jens Axboe Cc: Mike Snitzer, Bart Van Assche, dm-devel, hch, linux-kernel, linux-block, osandov On Fri, Jan 19, 2018 at 10:09:11AM -0700, Jens Axboe wrote: > On 1/19/18 10:05 AM, Ming Lei wrote: > > On Fri, Jan 19, 2018 at 09:52:32AM -0700, Jens Axboe wrote: > >> On 1/19/18 9:47 AM, Mike Snitzer wrote: > >>> On Fri, Jan 19 2018 at 11:41am -0500, > >>> Jens Axboe <axboe@kernel.dk> wrote: > >>> > >>>> On 1/19/18 9:37 AM, Ming Lei wrote: > >>>>> On Fri, Jan 19, 2018 at 09:27:46AM -0700, Jens Axboe wrote: > >>>>>> On 1/19/18 9:26 AM, Ming Lei wrote: > >>>>>>> On Fri, Jan 19, 2018 at 09:19:24AM -0700, Jens Axboe wrote: > >>>>>> > >>>>>> There are no pending requests for this case, nothing to restart the > >>>>>> queue. When you fail that blk_get_request(), you are idle, nothing > >>>>>> is pending. > >>>>> > >>>>> I think we needn't worry about that, once a device is attached to > >>>>> dm-rq, it can't be mounted any more, and usually user don't use the device > >>>>> directly and by dm-mpath at the same time. > >>>> > >>>> Even if it doesn't happen for a normal dm setup, it is a case that > >>>> needs to be handled. The request allocation is just one example of > >>>> a wider scope resource that can be unavailable. If the driver returns > >>>> NO_DEV_RESOURCE (or whatever name), it will be a possibility that > >>>> the device itself is currently idle. > >>> > >>> How would a driver's resources be exhausted yet the device is idle (so > >>> as not to be able to benefit from RESTART)? > >> > >> I've outlined a number of these examples already. 
Another case might be: > >> > >> 1) Device is idle > >> 2) Device gets request > >> 3) Device attempts to DMA map > >> 4) DMA map fails because the IOMMU is out of space (nic is using it all) > >> 5) Device returns STS_RESOURCE > >> 6) Queue is marked as needing a restart > >> > >> All's well, except there is no IO on this device that will notice the > >> restart bit and retry the operation. > >> > >> Replace IOMMU failure with any other resource that the driver might need > >> for an IO, which isn't tied to a device specific resource. > >> blk_get_request() on dm is an example, as is any allocation failure > >> occurring in the queue IO path for the driver. > > > > Yeah, if we decide to take the approach of introducing NO_DEV_RESOURCE, all > > the current STS_RESOURCE for non-device resource allocation(kmalloc, dma > > map, get_request, ...) should be converted to NO_DEV_RESOURCE. Or simply introduce BLK_STS_DEV_RESOURCE and convert out of real device resource into it, this way may be much easy to do. -- Ming ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle 2018-01-19 16:37 ` Ming Lei 2018-01-19 16:41 ` Jens Axboe @ 2018-01-19 17:38 ` Jens Axboe 2018-01-19 18:24 ` Ming Lei ` (2 more replies) 1 sibling, 3 replies; 80+ messages in thread From: Jens Axboe @ 2018-01-19 17:38 UTC (permalink / raw) To: Ming Lei Cc: Bart Van Assche, snitzer, dm-devel, hch, linux-kernel, linux-block, osandov On 1/19/18 9:37 AM, Ming Lei wrote: > On Fri, Jan 19, 2018 at 09:27:46AM -0700, Jens Axboe wrote: >> On 1/19/18 9:26 AM, Ming Lei wrote: >>> On Fri, Jan 19, 2018 at 09:19:24AM -0700, Jens Axboe wrote: >>>> On 1/19/18 9:05 AM, Ming Lei wrote: >>>>> On Fri, Jan 19, 2018 at 08:48:55AM -0700, Jens Axboe wrote: >>>>>> On 1/19/18 8:40 AM, Ming Lei wrote: >>>>>>>>>> Where does the dm STS_RESOURCE error usually come from - what's exact >>>>>>>>>> resource are we running out of? >>>>>>>>> >>>>>>>>> It is from blk_get_request(underlying queue), see >>>>>>>>> multipath_clone_and_map(). >>>>>>>> >>>>>>>> That's what I thought. So for a low queue depth underlying queue, it's >>>>>>>> quite possible that this situation can happen. Two potential solutions >>>>>>>> I see: >>>>>>>> >>>>>>>> 1) As described earlier in this thread, having a mechanism for being >>>>>>>> notified when the scarce resource becomes available. It would not >>>>>>>> be hard to tap into the existing sbitmap wait queue for that. >>>>>>>> >>>>>>>> 2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource >>>>>>>> allocation. I haven't read the dm code to know if this is a >>>>>>>> possibility or not. >>>>>>>> >>>>>>>> I'd probably prefer #1. It's a classic case of trying to get the >>>>>>>> request, and if it fails, add ourselves to the sbitmap tag wait >>>>>>>> queue head, retry, and bail if that also fails. Connecting the >>>>>>>> scarce resource and the consumer is the only way to really fix >>>>>>>> this, without bogus arbitrary delays. 
>>>>>>> >>>>>>> Right, as I have replied to Bart, using mod_delayed_work_on() with >>>>>>> returning BLK_STS_NO_DEV_RESOURCE(or sort of name) for the scarce >>>>>>> resource should fix this issue. >>>>>> >>>>>> It'll fix the forever stall, but it won't really fix it, as we'll slow >>>>>> down the dm device by some random amount. >>>>>> >>>>>> A simple test case would be to have a null_blk device with a queue depth >>>>>> of one, and dm on top of that. Start a fio job that runs two jobs: one >>>>>> that does IO to the underlying device, and one that does IO to the dm >>>>>> device. If the job on the dm device runs substantially slower than the >>>>>> one to the underlying device, then the problem isn't really fixed. >>>>> >>>>> I remembered that I tried this test on scsi-debug & dm-mpath over scsi-debug, >>>>> seems not observed this issue, could you explain a bit why IO over dm-mpath >>>>> may be slower? Because both two IO contexts call same get_request(), and >>>>> in theory dm-mpath should be a bit quicker since it uses direct issue for >>>>> underlying queue, without io scheduler involved. >>>> >>>> Because if you lose the race for getting the request, you'll have some >>>> arbitrary delay before trying again, potentially. Compared to the direct >>> >>> But the restart still works, one request is completed, then the queue >>> is return immediately because we use mod_delayed_work_on(0), so looks >>> no such issue. >> >> There are no pending requests for this case, nothing to restart the >> queue. When you fail that blk_get_request(), you are idle, nothing >> is pending. > > I think we needn't worry about that, once a device is attached to > dm-rq, it can't be mounted any more, and usually user don't use the device > directly and by dm-mpath at the same time. Here's an example of that, using my current block tree (merged into master). The setup is dm-mpath on top of null_blk, the latter having just a single request. Both are mq devices. 
Fio direct 4k random reads on dm_mq: ~250K iops Start dd on underlying device (or partition on same device), just doing sequential reads. Fio direct 4k random reads on dm_mq with dd running: 9 iops No schedulers involved. https://i.imgur.com/WTDnnwE.gif -- Jens Axboe ^ permalink raw reply [flat|nested] 80+ messages in thread
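Jens's reproducer can be expressed as a fio job file along these lines (the device paths and null_blk parameters are assumptions for illustration; null_blk would be loaded with a queue depth of one, e.g. `modprobe null_blk queue_mode=2 hw_queue_depth=1`, with dm-mpath stacked on top):

```ini
; Two concurrent jobs: random reads against the dm-mq device and
; sequential reads against the underlying null_blk device.
[global]
direct=1
runtime=30
time_based

[dm-mq-randread]
filename=/dev/mapper/mpath0
rw=randread
bs=4k
ioengine=libaio
iodepth=32

[underlying-seqread]
filename=/dev/nullb0
rw=read
bs=128k
ioengine=libaio
iodepth=1
```

With only one request in the underlying queue, the dm-mq job loses almost every `blk_get_request()` race to the sequential job and then sits in delayed requeue, which is the ~250K iops vs. 9 iops collapse reported above.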
* Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle 2018-01-19 17:38 ` Jens Axboe @ 2018-01-19 18:24 ` Ming Lei 2018-01-19 18:33 ` Mike Snitzer 2018-01-19 23:52 ` Ming Lei 2 siblings, 0 replies; 80+ messages in thread From: Ming Lei @ 2018-01-19 18:24 UTC (permalink / raw) To: Jens Axboe Cc: Bart Van Assche, snitzer, dm-devel, hch, linux-kernel, linux-block, osandov On Fri, Jan 19, 2018 at 10:38:41AM -0700, Jens Axboe wrote: > On 1/19/18 9:37 AM, Ming Lei wrote: > > On Fri, Jan 19, 2018 at 09:27:46AM -0700, Jens Axboe wrote: > >> On 1/19/18 9:26 AM, Ming Lei wrote: > >>> On Fri, Jan 19, 2018 at 09:19:24AM -0700, Jens Axboe wrote: > >>>> On 1/19/18 9:05 AM, Ming Lei wrote: > >>>>> On Fri, Jan 19, 2018 at 08:48:55AM -0700, Jens Axboe wrote: > >>>>>> On 1/19/18 8:40 AM, Ming Lei wrote: > >>>>>>>>>> Where does the dm STS_RESOURCE error usually come from - what's exact > >>>>>>>>>> resource are we running out of? > >>>>>>>>> > >>>>>>>>> It is from blk_get_request(underlying queue), see > >>>>>>>>> multipath_clone_and_map(). > >>>>>>>> > >>>>>>>> That's what I thought. So for a low queue depth underlying queue, it's > >>>>>>>> quite possible that this situation can happen. Two potential solutions > >>>>>>>> I see: > >>>>>>>> > >>>>>>>> 1) As described earlier in this thread, having a mechanism for being > >>>>>>>> notified when the scarce resource becomes available. It would not > >>>>>>>> be hard to tap into the existing sbitmap wait queue for that. > >>>>>>>> > >>>>>>>> 2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource > >>>>>>>> allocation. I haven't read the dm code to know if this is a > >>>>>>>> possibility or not. > >>>>>>>> > >>>>>>>> I'd probably prefer #1. It's a classic case of trying to get the > >>>>>>>> request, and if it fails, add ourselves to the sbitmap tag wait > >>>>>>>> queue head, retry, and bail if that also fails. 
Connecting the > >>>>>>>> scarce resource and the consumer is the only way to really fix > >>>>>>>> this, without bogus arbitrary delays. > >>>>>>> > >>>>>>> Right, as I have replied to Bart, using mod_delayed_work_on() with > >>>>>>> returning BLK_STS_NO_DEV_RESOURCE(or sort of name) for the scarce > >>>>>>> resource should fix this issue. > >>>>>> > >>>>>> It'll fix the forever stall, but it won't really fix it, as we'll slow > >>>>>> down the dm device by some random amount. > >>>>>> > >>>>>> A simple test case would be to have a null_blk device with a queue depth > >>>>>> of one, and dm on top of that. Start a fio job that runs two jobs: one > >>>>>> that does IO to the underlying device, and one that does IO to the dm > >>>>>> device. If the job on the dm device runs substantially slower than the > >>>>>> one to the underlying device, then the problem isn't really fixed. > >>>>> > >>>>> I remembered that I tried this test on scsi-debug & dm-mpath over scsi-debug, > >>>>> seems not observed this issue, could you explain a bit why IO over dm-mpath > >>>>> may be slower? Because both two IO contexts call same get_request(), and > >>>>> in theory dm-mpath should be a bit quicker since it uses direct issue for > >>>>> underlying queue, without io scheduler involved. > >>>> > >>>> Because if you lose the race for getting the request, you'll have some > >>>> arbitrary delay before trying again, potentially. Compared to the direct > >>> > >>> But the restart still works, one request is completed, then the queue > >>> is return immediately because we use mod_delayed_work_on(0), so looks > >>> no such issue. > >> > >> There are no pending requests for this case, nothing to restart the > >> queue. When you fail that blk_get_request(), you are idle, nothing > >> is pending. > > > > I think we needn't worry about that, once a device is attached to > > dm-rq, it can't be mounted any more, and usually user don't use the device > > directly and by dm-mpath at the same time. 
> > Here's an example of that, using my current block tree (merged into > master). The setup is dm-mpath on top of null_blk, the latter having > just a single request. Both are mq devices. > > Fio direct 4k random reads on dm_mq: ~250K iops > > Start dd on underlying device (or partition on same device), just doing > sequential reads. > > Fio direct 4k random reads on dm_mq with dd running: 9 iops > > No schedulers involved. > > https://i.imgur.com/WTDnnwE.gif This DM-specific issue might be addressed by applying a notifier_chain (or similar mechanism) between the two queues; I will think about the details tomorrow. -- Ming ^ permalink raw reply [flat|nested] 80+ messages in thread
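The notifier_chain idea floated here would let the underlying queue tell the stacked dm queue when a request frees up, instead of relying on an arbitrary requeue delay. A minimal userspace model of that idea follows; the kernel's real notifier API (struct notifier_block) differs, and every name below is illustrative:

```c
#include <assert.h>
#include <stddef.h>

/* Model of a notifier chain: when the underlying queue frees a
 * request, it walks a list of registered callbacks so a stacked
 * queue such as dm-mq can rerun itself immediately. */
#define MAX_NOTIFIERS 4

struct free_req_chain {
    void (*cb[MAX_NOTIFIERS])(void *ctx);
    void *ctx[MAX_NOTIFIERS];
    int n;
};

/* Register a callback; returns 0 on success, -1 if the chain is full. */
static int chain_register(struct free_req_chain *c,
                          void (*cb)(void *), void *ctx)
{
    if (c->n >= MAX_NOTIFIERS)
        return -1;
    c->cb[c->n] = cb;
    c->ctx[c->n] = ctx;
    c->n++;
    return 0;
}

/* Called by the underlying queue when a request is freed. */
static void chain_notify(struct free_req_chain *c)
{
    for (int i = 0; i < c->n; i++)
        c->cb[i](c->ctx[i]);
}

/* A stacked queue's callback: mark itself as needing a rerun. */
static void rerun_dm_queue(void *ctx)
{
    *(int *)ctx = 1;    /* stands in for blk_mq_run_hw_queue() */
}
```

The point of the model is the coupling: the producer of the scarce resource (the underlying queue) directly wakes its consumer (dm-mq), which is what the arbitrary-delay requeue path cannot do.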
* Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle 2018-01-19 17:38 ` Jens Axboe 2018-01-19 18:24 ` Ming Lei @ 2018-01-19 18:33 ` Mike Snitzer 2018-01-19 23:52 ` Ming Lei 2 siblings, 0 replies; 80+ messages in thread From: Mike Snitzer @ 2018-01-19 18:33 UTC (permalink / raw) To: Jens Axboe Cc: Ming Lei, Bart Van Assche, dm-devel, hch, linux-kernel, linux-block, osandov On Fri, Jan 19 2018 at 12:38pm -0500, Jens Axboe <axboe@kernel.dk> wrote: > On 1/19/18 9:37 AM, Ming Lei wrote: > > On Fri, Jan 19, 2018 at 09:27:46AM -0700, Jens Axboe wrote: > >> > >> There are no pending requests for this case, nothing to restart the > >> queue. When you fail that blk_get_request(), you are idle, nothing > >> is pending. > > > > I think we needn't worry about that, once a device is attached to > > dm-rq, it can't be mounted any more, and usually user don't use the device > > directly and by dm-mpath at the same time. > > Here's an example of that, using my current block tree (merged into > master). The setup is dm-mpath on top of null_blk, the latter having > just a single request. Both are mq devices. > > Fio direct 4k random reads on dm_mq: ~250K iops > > Start dd on underlying device (or partition on same device), just doing > sequential reads. > > Fio direct 4k random reads on dm_mq with dd running: 9 iops > > No schedulers involved. 
> > https://i.imgur.com/WTDnnwE.gif FYI, your tree doesn't have these dm-4.16 changes (which are needed to make Ming's blk-mq changes that are in your tree, commit 396eaf21e et al, have Ming's desired effect on DM): https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16&id=050af08ffb1b62af69196d61c22a0755f9a3cdbd https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16&id=459b54019cfeb7330ed4863ad40f78489e0ff23d https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16&id=ec3eaf9a673106f66606896aed6ddd20180b02ec The fact that you're seeing such shit results without dm-4.16 commit ec3eaf9a67 (that reverts older commit 6077c2d706) means: yeap, this is really awful, let's fix it! But it is a different flavor of awful, because dm-rq.c:map_request() will handle the DM_MAPIO_DELAY_REQUEUE result from null_blk's blk_get_request() failure using dm_mq_delay_requeue_request() against the dm-mq mpath device:

    blk_mq_requeue_request(rq, false);
    __dm_mq_kick_requeue_list(rq->q, msecs);

So it begs the question: why are we stalling _exactly_? (You may have it all figured out, as your gif implies... but I'm not there yet.) Might be interesting to see how your same test behaves without all of the churn we've staged for 4.16, e.g. against v4.15-rc8. Mike ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle 2018-01-19 17:38 ` Jens Axboe 2018-01-19 18:24 ` Ming Lei 2018-01-19 18:33 ` Mike Snitzer @ 2018-01-19 23:52 ` Ming Lei 2018-01-20 4:27 ` Jens Axboe 2 siblings, 1 reply; 80+ messages in thread From: Ming Lei @ 2018-01-19 23:52 UTC (permalink / raw) To: Jens Axboe Cc: Bart Van Assche, snitzer, dm-devel, hch, linux-kernel, linux-block, osandov On Fri, Jan 19, 2018 at 10:38:41AM -0700, Jens Axboe wrote: > On 1/19/18 9:37 AM, Ming Lei wrote: > > On Fri, Jan 19, 2018 at 09:27:46AM -0700, Jens Axboe wrote: > >> On 1/19/18 9:26 AM, Ming Lei wrote: > >>> On Fri, Jan 19, 2018 at 09:19:24AM -0700, Jens Axboe wrote: > >>>> On 1/19/18 9:05 AM, Ming Lei wrote: > >>>>> On Fri, Jan 19, 2018 at 08:48:55AM -0700, Jens Axboe wrote: > >>>>>> On 1/19/18 8:40 AM, Ming Lei wrote: > >>>>>>>>>> Where does the dm STS_RESOURCE error usually come from - what's exact > >>>>>>>>>> resource are we running out of? > >>>>>>>>> > >>>>>>>>> It is from blk_get_request(underlying queue), see > >>>>>>>>> multipath_clone_and_map(). > >>>>>>>> > >>>>>>>> That's what I thought. So for a low queue depth underlying queue, it's > >>>>>>>> quite possible that this situation can happen. Two potential solutions > >>>>>>>> I see: > >>>>>>>> > >>>>>>>> 1) As described earlier in this thread, having a mechanism for being > >>>>>>>> notified when the scarce resource becomes available. It would not > >>>>>>>> be hard to tap into the existing sbitmap wait queue for that. > >>>>>>>> > >>>>>>>> 2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource > >>>>>>>> allocation. I haven't read the dm code to know if this is a > >>>>>>>> possibility or not. > >>>>>>>> > >>>>>>>> I'd probably prefer #1. It's a classic case of trying to get the > >>>>>>>> request, and if it fails, add ourselves to the sbitmap tag wait > >>>>>>>> queue head, retry, and bail if that also fails. 
Connecting the > >>>>>>>> scarce resource and the consumer is the only way to really fix > >>>>>>>> this, without bogus arbitrary delays. > >>>>>>> > >>>>>>> Right, as I have replied to Bart, using mod_delayed_work_on() with > >>>>>>> returning BLK_STS_NO_DEV_RESOURCE(or sort of name) for the scarce > >>>>>>> resource should fix this issue. > >>>>>> > >>>>>> It'll fix the forever stall, but it won't really fix it, as we'll slow > >>>>>> down the dm device by some random amount. > >>>>>> > >>>>>> A simple test case would be to have a null_blk device with a queue depth > >>>>>> of one, and dm on top of that. Start a fio job that runs two jobs: one > >>>>>> that does IO to the underlying device, and one that does IO to the dm > >>>>>> device. If the job on the dm device runs substantially slower than the > >>>>>> one to the underlying device, then the problem isn't really fixed. > >>>>> > >>>>> I remembered that I tried this test on scsi-debug & dm-mpath over scsi-debug, > >>>>> seems not observed this issue, could you explain a bit why IO over dm-mpath > >>>>> may be slower? Because both two IO contexts call same get_request(), and > >>>>> in theory dm-mpath should be a bit quicker since it uses direct issue for > >>>>> underlying queue, without io scheduler involved. > >>>> > >>>> Because if you lose the race for getting the request, you'll have some > >>>> arbitrary delay before trying again, potentially. Compared to the direct > >>> > >>> But the restart still works, one request is completed, then the queue > >>> is return immediately because we use mod_delayed_work_on(0), so looks > >>> no such issue. > >> > >> There are no pending requests for this case, nothing to restart the > >> queue. When you fail that blk_get_request(), you are idle, nothing > >> is pending. > > > > I think we needn't worry about that, once a device is attached to > > dm-rq, it can't be mounted any more, and usually user don't use the device > > directly and by dm-mpath at the same time. 
> > Here's an example of that, using my current block tree (merged into > master). The setup is dm-mpath on top of null_blk, the latter having > just a single request. Both are mq devices. > > Fio direct 4k random reads on dm_mq: ~250K iops > > Start dd on underlying device (or partition on same device), just doing > sequential reads. > > Fio direct 4k random reads on dm_mq with dd running: 9 iops > > No schedulers involved. > > https://i.imgur.com/WTDnnwE.gif If null_blk's timer mode is used with a small delay introduced, I guess the effect of direct access to the underlying queue shouldn't be so serious. But it still won't be as good as direct access. Another way may be to introduce a variant of blk_get_request(), such as blk_get_request_with_notify(), then pass the current dm-rq's hctx to it, and use the tag's waitqueue to handle that. But the change can be a bit big. -- Ming ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle 2018-01-19 23:52 ` Ming Lei @ 2018-01-20 4:27 ` Jens Axboe 0 siblings, 0 replies; 80+ messages in thread From: Jens Axboe @ 2018-01-20 4:27 UTC (permalink / raw) To: Ming Lei Cc: Bart Van Assche, snitzer, dm-devel, hch, linux-kernel, linux-block, osandov On 1/19/18 4:52 PM, Ming Lei wrote: > On Fri, Jan 19, 2018 at 10:38:41AM -0700, Jens Axboe wrote: >> On 1/19/18 9:37 AM, Ming Lei wrote: >>> On Fri, Jan 19, 2018 at 09:27:46AM -0700, Jens Axboe wrote: >>>> On 1/19/18 9:26 AM, Ming Lei wrote: >>>>> On Fri, Jan 19, 2018 at 09:19:24AM -0700, Jens Axboe wrote: >>>>>> On 1/19/18 9:05 AM, Ming Lei wrote: >>>>>>> On Fri, Jan 19, 2018 at 08:48:55AM -0700, Jens Axboe wrote: >>>>>>>> On 1/19/18 8:40 AM, Ming Lei wrote: >>>>>>>>>>>> Where does the dm STS_RESOURCE error usually come from - what's exact >>>>>>>>>>>> resource are we running out of? >>>>>>>>>>> >>>>>>>>>>> It is from blk_get_request(underlying queue), see >>>>>>>>>>> multipath_clone_and_map(). >>>>>>>>>> >>>>>>>>>> That's what I thought. So for a low queue depth underlying queue, it's >>>>>>>>>> quite possible that this situation can happen. Two potential solutions >>>>>>>>>> I see: >>>>>>>>>> >>>>>>>>>> 1) As described earlier in this thread, having a mechanism for being >>>>>>>>>> notified when the scarce resource becomes available. It would not >>>>>>>>>> be hard to tap into the existing sbitmap wait queue for that. >>>>>>>>>> >>>>>>>>>> 2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource >>>>>>>>>> allocation. I haven't read the dm code to know if this is a >>>>>>>>>> possibility or not. >>>>>>>>>> >>>>>>>>>> I'd probably prefer #1. It's a classic case of trying to get the >>>>>>>>>> request, and if it fails, add ourselves to the sbitmap tag wait >>>>>>>>>> queue head, retry, and bail if that also fails. 
Connecting the >>>>>>>>>> scarce resource and the consumer is the only way to really fix >>>>>>>>>> this, without bogus arbitrary delays. >>>>>>>>> >>>>>>>>> Right, as I have replied to Bart, using mod_delayed_work_on() with >>>>>>>>> returning BLK_STS_NO_DEV_RESOURCE(or sort of name) for the scarce >>>>>>>>> resource should fix this issue. >>>>>>>> >>>>>>>> It'll fix the forever stall, but it won't really fix it, as we'll slow >>>>>>>> down the dm device by some random amount. >>>>>>>> >>>>>>>> A simple test case would be to have a null_blk device with a queue depth >>>>>>>> of one, and dm on top of that. Start a fio job that runs two jobs: one >>>>>>>> that does IO to the underlying device, and one that does IO to the dm >>>>>>>> device. If the job on the dm device runs substantially slower than the >>>>>>>> one to the underlying device, then the problem isn't really fixed. >>>>>>> >>>>>>> I remembered that I tried this test on scsi-debug & dm-mpath over scsi-debug, >>>>>>> seems not observed this issue, could you explain a bit why IO over dm-mpath >>>>>>> may be slower? Because both two IO contexts call same get_request(), and >>>>>>> in theory dm-mpath should be a bit quicker since it uses direct issue for >>>>>>> underlying queue, without io scheduler involved. >>>>>> >>>>>> Because if you lose the race for getting the request, you'll have some >>>>>> arbitrary delay before trying again, potentially. Compared to the direct >>>>> >>>>> But the restart still works, one request is completed, then the queue >>>>> is return immediately because we use mod_delayed_work_on(0), so looks >>>>> no such issue. >>>> >>>> There are no pending requests for this case, nothing to restart the >>>> queue. When you fail that blk_get_request(), you are idle, nothing >>>> is pending. >>> >>> I think we needn't worry about that, once a device is attached to >>> dm-rq, it can't be mounted any more, and usually user don't use the device >>> directly and by dm-mpath at the same time. 
>> >> Here's an example of that, using my current block tree (merged into >> master). The setup is dm-mpath on top of null_blk, the latter having >> just a single request. Both are mq devices. >> >> Fio direct 4k random reads on dm_mq: ~250K iops >> >> Start dd on underlying device (or partition on same device), just doing >> sequential reads. >> >> Fio direct 4k random reads on dm_mq with dd running: 9 iops >> >> No schedulers involved. >> >> https://i.imgur.com/WTDnnwE.gif > > If null_blk's timer mode is used with a bit delay introduced, I guess > the effect from direct access to underlying queue shouldn't be so > serious. But it still won't be good as direct access. Doesn't matter if it's used at the default of 10usec completion latency, or inline complete. Same result, I ran both. > Another way may be to introduce a variants blk_get_request(), such as > blk_get_request_with_notify(), then pass the current dm-rq's hctx to > it, and use the tag's waitqueue to handle that. But the change can be > a bit big. Yes, that's exactly the solution I suggested both yesterday and today. -- Jens Axboe ^ permalink raw reply [flat|nested] 80+ messages in thread
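The try / register-on-waitqueue / retry / bail pattern agreed on here can be modeled in userspace roughly as follows. In the kernel the wait-queue registration would hook the sbitmap wait queue of the underlying queue's tags; all names below are illustrative:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the underlying queue's tag set.  'waiters'
 * stands in for the sbitmap wait queue head. */
struct tag_pool {
    int free_tags;
    int waiters;
};

static bool try_get_tag(struct tag_pool *p)
{
    if (p->free_tags > 0) {
        p->free_tags--;
        return true;
    }
    return false;
}

/* The pattern: try to take a tag; on failure, add ourselves to the
 * wait queue *before* one final retry, so a concurrent tag free
 * cannot slip through unnoticed between the check and the sleep.
 * Returns true if a tag was obtained, false if the caller should
 * bail and rely on being woken when a tag is freed. */
static bool get_tag_or_wait(struct tag_pool *p)
{
    if (try_get_tag(p))
        return true;
    p->waiters++;           /* register on the wait queue head */
    if (try_get_tag(p)) {   /* retry: a free may have raced with us */
        p->waiters--;
        return true;
    }
    return false;           /* bail; a later free wakes a waiter */
}
```

The register-then-retry ordering is the whole point: it closes the window where the resource is freed after the failed attempt but before the consumer is on the wait queue, which is exactly the window that otherwise forces a bogus arbitrary delay.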
* Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle 2018-01-19 15:48 ` Jens Axboe 2018-01-19 16:05 ` Ming Lei @ 2018-01-19 16:13 ` Mike Snitzer 2018-01-19 16:23 ` Jens Axboe 1 sibling, 1 reply; 80+ messages in thread From: Mike Snitzer @ 2018-01-19 16:13 UTC (permalink / raw) To: Jens Axboe Cc: Ming Lei, Bart Van Assche, dm-devel, hch, linux-kernel, linux-block, osandov On Fri, Jan 19 2018 at 10:48am -0500, Jens Axboe <axboe@kernel.dk> wrote: > On 1/19/18 8:40 AM, Ming Lei wrote: > >>>> Where does the dm STS_RESOURCE error usually come from - what's exact > >>>> resource are we running out of? > >>> > >>> It is from blk_get_request(underlying queue), see > >>> multipath_clone_and_map(). > >> > >> That's what I thought. So for a low queue depth underlying queue, it's > >> quite possible that this situation can happen. Two potential solutions > >> I see: > >> > >> 1) As described earlier in this thread, having a mechanism for being > >> notified when the scarce resource becomes available. It would not > >> be hard to tap into the existing sbitmap wait queue for that. > >> > >> 2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource > >> allocation. I haven't read the dm code to know if this is a > >> possibility or not. Right, #2 is _not_ the way forward. Historically request-based DM used its own mempool for requests, this was to be able to have some measure of control and resiliency in the face of low memory conditions that might be affecting the broader system. Then Christoph switched over to adding per-request-data; which ushered in the use of blk_get_request using ATOMIC allocations. I like the result of that line of development. But taking the next step of setting BLK_MQ_F_BLOCKING is highly unfortunate (especially in that this dm-mpath.c code is common to old .request_fn and blk-mq, at least the call to blk_get_request is). 
Ultimately dm-mpath would like to avoid blocking for a request, because for this dm-mpath device we have multiple queues to allocate from if need be (provided we have an active-active storage network topology). > >> I'd probably prefer #1. It's a classic case of trying to get the > >> request, and if it fails, add ourselves to the sbitmap tag wait > >> queue head, retry, and bail if that also fails. Connecting the > >> scarce resource and the consumer is the only way to really fix > >> this, without bogus arbitrary delays. > > > > Right, as I have replied to Bart, using mod_delayed_work_on() with > > returning BLK_STS_NO_DEV_RESOURCE(or sort of name) for the scarce > > resource should fix this issue. > > It'll fix the forever stall, but it won't really fix it, as we'll slow > down the dm device by some random amount. Agreed. > A simple test case would be to have a null_blk device with a queue depth > of one, and dm on top of that. Start a fio job that runs two jobs: one > that does IO to the underlying device, and one that does IO to the dm > device. If the job on the dm device runs substantially slower than the > one to the underlying device, then the problem isn't really fixed. Not sure DM will allow the underlying device to be opened (due to master/slave ownership that is part of loading a DM table)? > That said, I'm fine with ensuring that we make forward progress always > first, and then we can come up with a proper solution to the issue. The > forward progress guarantee will be needed for the more rare failure > cases, like allocation failures. nvme needs that too, for instance, for > the discard range struct allocation. Yeap, I'd be OK with that too. We'd be better off for having revisited this, and we'd then have some time to develop the ultimate robust fix (#1, the callback from above). Mike ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle 2018-01-19 16:13 ` Mike Snitzer @ 2018-01-19 16:23 ` Jens Axboe 2018-01-19 23:57 ` Ming Lei 0 siblings, 1 reply; 80+ messages in thread From: Jens Axboe @ 2018-01-19 16:23 UTC (permalink / raw) To: Mike Snitzer Cc: Ming Lei, Bart Van Assche, dm-devel, hch, linux-kernel, linux-block, osandov On 1/19/18 9:13 AM, Mike Snitzer wrote: > On Fri, Jan 19 2018 at 10:48am -0500, > Jens Axboe <axboe@kernel.dk> wrote: > >> On 1/19/18 8:40 AM, Ming Lei wrote: >>>>>> Where does the dm STS_RESOURCE error usually come from - what's exact >>>>>> resource are we running out of? >>>>> >>>>> It is from blk_get_request(underlying queue), see >>>>> multipath_clone_and_map(). >>>> >>>> That's what I thought. So for a low queue depth underlying queue, it's >>>> quite possible that this situation can happen. Two potential solutions >>>> I see: >>>> >>>> 1) As described earlier in this thread, having a mechanism for being >>>> notified when the scarce resource becomes available. It would not >>>> be hard to tap into the existing sbitmap wait queue for that. >>>> >>>> 2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource >>>> allocation. I haven't read the dm code to know if this is a >>>> possibility or not. > > Right, #2 is _not_ the way forward. Historically request-based DM used > its own mempool for requests, this was to be able to have some measure > of control and resiliency in the face of low memory conditions that > might be affecting the broader system. > > Then Christoph switched over to adding per-request-data; which ushered > in the use of blk_get_request using ATOMIC allocations. I like the > result of that line of development. But taking the next step of setting > BLK_MQ_F_BLOCKING is highly unfortunate (especially in that this > dm-mpath.c code is common to old .request_fn and blk-mq, at least the > call to blk_get_request is). 
Ultimately dm-mpath like to avoid blocking > for a request because for this dm-mpath device we have multiple queues > to allocate from if need be (provided we have an active-active storage > network topology). If you can go to multiple devices, obviously it should not block on a single device. That's only true for the case where you can only go to one device, blocking at that point would probably be fine. Or if all your paths are busy, then blocking would also be OK. But it's a much larger change, and would entail changing more than just the actual call to blk_get_request(). >> A simple test case would be to have a null_blk device with a queue depth >> of one, and dm on top of that. Start a fio job that runs two jobs: one >> that does IO to the underlying device, and one that does IO to the dm >> device. If the job on the dm device runs substantially slower than the >> one to the underlying device, then the problem isn't really fixed. > > Not sure DM will allow the underlying device to be opened (due to > master/slave ownership that is part of loading a DM table)? There are many ways it could be setup - just partition the underlying device then, and have one partition be part of the dm setup and the other used directly. >> That said, I'm fine with ensuring that we make forward progress always >> first, and then we can come up with a proper solution to the issue. The >> forward progress guarantee will be needed for the more rare failure >> cases, like allocation failures. nvme needs that too, for instance, for >> the discard range struct allocation. > > Yeap, I'd be OK with that too. We'd be better for revisted this and > then have some time to develop the ultimate robust fix (#1, callback > from above). Yeah, we need the quick and dirty sooner, which just brings us back to what we had before, essentially. -- Jens Axboe ^ permalink raw reply [flat|nested] 80+ messages in thread
* Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle
  2018-01-19 16:23 ` Jens Axboe
@ 2018-01-19 23:57 ` Ming Lei
  0 siblings, 0 replies; 80+ messages in thread
From: Ming Lei @ 2018-01-19 23:57 UTC (permalink / raw)
To: Jens Axboe
Cc: Mike Snitzer, Bart Van Assche, dm-devel, hch, linux-kernel, linux-block, osandov

On Fri, Jan 19, 2018 at 09:23:35AM -0700, Jens Axboe wrote:
> On 1/19/18 9:13 AM, Mike Snitzer wrote:
> > On Fri, Jan 19 2018 at 10:48am -0500,
> > Jens Axboe <axboe@kernel.dk> wrote:
> >
> >> On 1/19/18 8:40 AM, Ming Lei wrote:
> >>>>>> Where does the dm STS_RESOURCE error usually come from - what's exact
> >>>>>> resource are we running out of?
> >>>>>
> >>>>> It is from blk_get_request(underlying queue), see
> >>>>> multipath_clone_and_map().
> >>>>
> >>>> That's what I thought. So for a low queue depth underlying queue, it's
> >>>> quite possible that this situation can happen. Two potential solutions
> >>>> I see:
> >>>>
> >>>> 1) As described earlier in this thread, having a mechanism for being
> >>>>    notified when the scarce resource becomes available. It would not
> >>>>    be hard to tap into the existing sbitmap wait queue for that.
> >>>>
> >>>> 2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource
> >>>>    allocation. I haven't read the dm code to know if this is a
> >>>>    possibility or not.
> >
> > Right, #2 is _not_ the way forward. Historically request-based DM used
> > its own mempool for requests; this was to be able to have some measure
> > of control and resiliency in the face of low memory conditions that
> > might be affecting the broader system.
> >
> > Then Christoph switched over to adding per-request-data, which ushered
> > in the use of blk_get_request using ATOMIC allocations. I like the
> > result of that line of development. But taking the next step of setting
> > BLK_MQ_F_BLOCKING is highly unfortunate (especially in that this
> > dm-mpath.c code is common to old .request_fn and blk-mq, at least the
> > call to blk_get_request is). Ultimately dm-mpath would like to avoid
> > blocking for a request, because for this dm-mpath device we have
> > multiple queues to allocate from if need be (provided we have an
> > active-active storage network topology).
>
> If you can go to multiple devices, obviously it should not block on a
> single device. That's only true for the case where you can only go to
> one device; blocking at that point would probably be fine. Or if all
> your paths are busy, then blocking would also be OK.

Introducing one extra blocking point will hurt AIO performance, where
there are usually far fewer jobs/processes submitting IO.

-- 
Ming
* Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle
  2018-01-19 15:24 ` Jens Axboe
  2018-01-19 15:40 ` Ming Lei
@ 2018-01-29 22:37 ` Bart Van Assche
  1 sibling, 0 replies; 80+ messages in thread
From: Bart Van Assche @ 2018-01-29 22:37 UTC (permalink / raw)
To: Jens Axboe, Ming Lei
Cc: snitzer, dm-devel, hch, linux-kernel, linux-block, osandov

On 01/19/18 07:24, Jens Axboe wrote:
> That's what I thought. So for a low queue depth underlying queue, it's
> quite possible that this situation can happen. Two potential solutions
> I see:
>
> 1) As described earlier in this thread, having a mechanism for being
>    notified when the scarce resource becomes available. It would not
>    be hard to tap into the existing sbitmap wait queue for that.
>
> 2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource
>    allocation. I haven't read the dm code to know if this is a
>    possibility or not.
>
> I'd probably prefer #1. It's a classic case of trying to get the
> request, and if it fails, add ourselves to the sbitmap tag wait
> queue head, retry, and bail if that also fails. Connecting the
> scarce resource and the consumer is the only way to really fix
> this, without bogus arbitrary delays.

(replying to an e-mail from ten days ago)

Implementing a notification mechanism for all cases in which
blk_insert_cloned_request() returns BLK_STS_RESOURCE today would
require a lot of work. If e.g. a SCSI LLD returns one of the
SCSI_MLQUEUE_*_BUSY return codes from its .queuecommand()
implementation then the SCSI core will translate that return code into
BLK_STS_RESOURCE. From scsi_queue_rq():

	reason = scsi_dispatch_cmd(cmd);
	if (reason) {
		scsi_set_blocked(cmd, reason);
		ret = BLK_STS_RESOURCE;
		goto out_dec_host_busy;
	}

In other words, implementing a notification mechanism for all cases in
which blk_insert_cloned_request() can return BLK_STS_RESOURCE would
require modifying all SCSI LLDs.

Bart.
* Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle
@ 2018-01-19 5:09 ` Bart Van Assche
  0 siblings, 0 replies; 80+ messages in thread
From: Bart Van Assche @ 2018-01-19 5:09 UTC (permalink / raw)
To: ming.lei, axboe
Cc: dm-devel, hch, linux-kernel, linux-block, osandov, snitzer

On Fri, 2018-01-19 at 10:32 +0800, Ming Lei wrote:
> Now most of times both NVMe and SCSI won't return BLK_STS_RESOURCE, and
> it should be DM-only which returns STS_RESOURCE so often.

That's wrong at least for SCSI. See also
https://marc.info/?l=linux-block&m=151578329417076.

Bart.
* Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle
  2018-01-19 5:09 ` Bart Van Assche
@ 2018-01-19 7:34 ` Ming Lei
  2018-01-19 19:47 ` Bart Van Assche
  1 reply; 80+ messages in thread
From: Ming Lei @ 2018-01-19 7:34 UTC (permalink / raw)
To: Bart Van Assche
Cc: axboe, dm-devel, hch, linux-kernel, linux-block, osandov, snitzer

On Fri, Jan 19, 2018 at 05:09:46AM +0000, Bart Van Assche wrote:
> On Fri, 2018-01-19 at 10:32 +0800, Ming Lei wrote:
> > Now most of times both NVMe and SCSI won't return BLK_STS_RESOURCE, and
> > it should be DM-only which returns STS_RESOURCE so often.
>
> That's wrong at least for SCSI. See also
> https://marc.info/?l=linux-block&m=151578329417076.
>
> For other scenario's, e.g. if a SCSI initiator submits a
> SCSI request over a fabric and the SCSI target replies with "BUSY" then the

Could you explain a bit when the SCSI target replies with BUSY very often?

Inside the initiator, we have already limited the max per-LUN requests
and per-host requests before calling .queue_rq().

> SCSI core will end the I/O request with status BLK_STS_RESOURCE after the
> maximum number of retries has been reached (see also scsi_io_completion()).
> In that last case, if a SCSI target sends a "BUSY" reply over the wire back
> to the initiator, there is no other approach for the SCSI initiator to
> figure out whether it can queue another request than to resubmit the
> request. The worst possible strategy is to resubmit a request immediately
> because that will cause a significant fraction of the fabric bandwidth to
> be used just for replying "BUSY" to requests that can't be processed
> immediately.

-- 
Ming
* Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle
@ 2018-01-19 19:47 ` Bart Van Assche
  0 siblings, 0 replies; 80+ messages in thread
From: Bart Van Assche @ 2018-01-19 19:47 UTC (permalink / raw)
To: ming.lei
Cc: dm-devel, hch, linux-kernel, linux-block, osandov, axboe, snitzer

On Fri, 2018-01-19 at 15:34 +0800, Ming Lei wrote:
> Could you explain a bit when SCSI target replies with BUSY very often?
>
> Inside initiator, we have limited the max per-LUN requests and per-host
> requests already before calling .queue_rq().

That's correct. However, when a SCSI initiator and target system are
communicating with each other there is no guarantee that initiator and
target queue depth have been tuned properly. If the initiator queue
depth and the number of requests that can be in flight according to the
network protocol are both larger than the target queue depth and if the
target system uses relatively slow storage (e.g. harddisks) then it can
happen that the target replies with BUSY often.

The Linux iSCSI initiator limits MaxOutstandingR2T (the number of
requests an initiator may sent without having received an answer from
the target) to one so I don't think this can happen when using
iSCSI/TCP.

With the SRP initiator however the maximum requests that can be in
flight between initiator and target depends on the number of credits
that were negotiated during login between initiator and target. Some
time ago I modified the SRP initiator such that it limits the initiator
queue depth to the number of SRP credits minus one (for task
management). That resulted in a performance improvement due to fewer
BUSY conditions at the initiator side (see also commit 7ade400aba9a
("IB/srp: Reduce number of BUSY conditions")). Another patch for the
SCST SRP target driver limited the number of SRP credits to the queue
depth of the block device at the target side. I'm referring to the
following code:

	ch->rq_size = min(MAX_SRPT_RQ_SIZE, scst_get_max_lun_commands(NULL, 0))

(I have not yet had the time to port this change to LIO).

Without such tuning across initiator and target it can happen often
that the target system sends the reply "BUSY" back to the initiator. I
think that's why there is code in the SCSI core to automatically adjust
the initiator queue depth if the "BUSY" condition is encountered
frequently. See also scsi_track_queue_full().

Bart.
end of thread, other threads: [~2018-01-29 22:37 UTC | newest]

Thread overview: 80+ messages
2018-01-18  2:41 [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle Ming Lei
2018-01-18 16:50 ` Bart Van Assche
2018-01-18 17:03 ` Mike Snitzer
2018-01-18 17:03 ` Mike Snitzer
2018-01-18 17:20 ` Bart Van Assche
2018-01-18 17:20 ` Bart Van Assche
2018-01-18 18:30 ` Mike Snitzer
2018-01-18 18:47 ` Bart Van Assche
2018-01-18 18:47 ` Bart Van Assche
2018-01-18 20:11 ` Jens Axboe
2018-01-18 20:11 ` Jens Axboe
2018-01-18 20:48 ` Mike Snitzer
2018-01-18 20:58 ` Bart Van Assche
2018-01-18 20:58 ` Bart Van Assche
2018-01-18 21:23 ` Mike Snitzer
2018-01-18 21:23 ` Mike Snitzer
2018-01-18 21:37 ` Laurence Oberman
2018-01-18 21:39 ` [dm-devel] " Bart Van Assche
2018-01-18 21:39 ` Bart Van Assche
2018-01-18 21:45 ` Laurence Oberman
2018-01-18 21:45 ` Laurence Oberman
2018-01-18 22:01 ` Mike Snitzer
2018-01-18 22:18 ` Laurence Oberman
2018-01-18 22:20 ` Laurence Oberman
2018-01-18 22:20 ` Laurence Oberman
2018-01-18 22:24 ` Bart Van Assche
2018-01-18 22:24 ` Bart Van Assche
2018-01-18 22:35 ` Laurence Oberman
2018-01-18 22:39 ` Jens Axboe
2018-01-18 22:55 ` Bart Van Assche
2018-01-18 22:55 ` Bart Van Assche
2018-01-18 22:20 ` Bart Van Assche
2018-01-18 22:20 ` Bart Van Assche
2018-01-23  9:22 ` [PATCH] block: neutralize blk_insert_cloned_request IO stall regression (was: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle) Mike Snitzer
2018-01-23 10:53 ` Ming Lei
2018-01-23 12:15 ` Mike Snitzer
2018-01-23 12:17 ` Ming Lei
2018-01-23 12:43 ` Mike Snitzer
2018-01-23 16:43 ` [PATCH] " Bart Van Assche
2018-01-23 16:43 ` Bart Van Assche
2018-01-19  2:32 ` [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle Ming Lei
2018-01-19  4:02 ` Jens Axboe
2018-01-19  7:26 ` Ming Lei
2018-01-19 15:20 ` Bart Van Assche
2018-01-19 15:20 ` Bart Van Assche
2018-01-19 15:25 ` Jens Axboe
2018-01-19 15:33 ` Ming Lei
2018-01-19 16:06 ` Bart Van Assche
2018-01-19 16:06 ` Bart Van Assche
2018-01-19 15:24 ` Jens Axboe
2018-01-19 15:40 ` Ming Lei
2018-01-19 15:40 ` Ming Lei
2018-01-19 15:48 ` Jens Axboe
2018-01-19 16:05 ` Ming Lei
2018-01-19 16:19 ` Jens Axboe
2018-01-19 16:26 ` Ming Lei
2018-01-19 16:27 ` Jens Axboe
2018-01-19 16:37 ` Ming Lei
2018-01-19 16:41 ` Jens Axboe
2018-01-19 16:41 ` Jens Axboe
2018-01-19 16:47 ` Mike Snitzer
2018-01-19 16:52 ` Jens Axboe
2018-01-19 17:05 ` Ming Lei
2018-01-19 17:09 ` Jens Axboe
2018-01-19 17:20 ` Ming Lei
2018-01-19 17:38 ` Jens Axboe
2018-01-19 18:24 ` Ming Lei
2018-01-19 18:24 ` Ming Lei
2018-01-19 18:33 ` Mike Snitzer
2018-01-19 23:52 ` Ming Lei
2018-01-20  4:27 ` Jens Axboe
2018-01-19 16:13 ` Mike Snitzer
2018-01-19 16:23 ` Jens Axboe
2018-01-19 23:57 ` Ming Lei
2018-01-29 22:37 ` Bart Van Assche
2018-01-19  5:09 ` Bart Van Assche
2018-01-19  5:09 ` Bart Van Assche
2018-01-19  7:34 ` Ming Lei
2018-01-19 19:47 ` Bart Van Assche
2018-01-19 19:47 ` Bart Van Assche