* [PATCH] blk-mq: avoid to iterate over stale request
@ 2021-09-06 6:50 Ming Lei
2021-09-06 22:44 ` Bart Van Assche
` (2 more replies)
0 siblings, 3 replies; 13+ messages in thread
From: Ming Lei @ 2021-09-06 6:50 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Ming Lei, linux-scsi, Martin K. Petersen, luojiaxing
blk-mq can't run allocating driver tag and updating ->rqs[tag]
atomically, meantime blk-mq doesn't clear ->rqs[tag] after the driver
tag is released.
So there is chance to iterating over one stale request just after the
tag is allocated and before updating ->rqs[tag].
scsi_host_busy_iter() calls scsi_host_check_in_flight() to count scsi
in-flight requests after scsi host is blocked, so no new scsi command can
be marked as SCMD_STATE_INFLIGHT. However, driver tag allocation still can
be run by blk-mq core. One request is marked as SCMD_STATE_INFLIGHT,
but this request may have been kept in another slot of ->rqs[], meantime
the slot can be allocated out but ->rqs[] isn't updated yet. Then this
in-flight request is counted twice as SCMD_STATE_INFLIGHT. This way causes
trouble in handling scsi error.
Fixes the issue by not iterating over stale request.
Cc: linux-scsi@vger.kernel.org
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Reported-by: luojiaxing <luojiaxing@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
block/blk-mq-tag.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 86f87346232a..ff5caeb82542 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -208,7 +208,7 @@ static struct request *blk_mq_find_and_get_req(struct blk_mq_tags *tags,
spin_lock_irqsave(&tags->lock, flags);
rq = tags->rqs[bitnr];
- if (!rq || !refcount_inc_not_zero(&rq->ref))
+ if (!rq || rq->tag != bitnr || !refcount_inc_not_zero(&rq->ref))
rq = NULL;
spin_unlock_irqrestore(&tags->lock, flags);
return rq;
--
2.31.1
^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH] blk-mq: avoid to iterate over stale request
2021-09-06 6:50 [PATCH] blk-mq: avoid to iterate over stale request Ming Lei
@ 2021-09-06 22:44 ` Bart Van Assche
2021-09-07 1:14 ` Ming Lei
2021-09-13 1:26 ` Ming Lei
2021-12-14 18:41 ` Kashyap Desai
2 siblings, 1 reply; 13+ messages in thread
From: Bart Van Assche @ 2021-09-06 22:44 UTC (permalink / raw)
To: Ming Lei, Jens Axboe
Cc: linux-block, linux-scsi, Martin K. Petersen, luojiaxing
On 9/5/21 23:50, Ming Lei wrote:
> - if (!rq || !refcount_inc_not_zero(&rq->ref))
> + if (!rq || rq->tag != bitnr || !refcount_inc_not_zero(&rq->ref))
> rq = NULL;
Shouldn't the rq->tag != bitnr test happen after the refcount has been
incremented since otherwise rq->tag can change after it has been read
and before the refcount is incremented?
Thanks,
Bart.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] blk-mq: avoid to iterate over stale request
2021-09-06 22:44 ` Bart Van Assche
@ 2021-09-07 1:14 ` Ming Lei
0 siblings, 0 replies; 13+ messages in thread
From: Ming Lei @ 2021-09-07 1:14 UTC (permalink / raw)
To: Bart Van Assche
Cc: Jens Axboe, linux-block, linux-scsi, Martin K. Petersen, luojiaxing
On Mon, Sep 06, 2021 at 03:44:08PM -0700, Bart Van Assche wrote:
> On 9/5/21 23:50, Ming Lei wrote:
> > - if (!rq || !refcount_inc_not_zero(&rq->ref))
> > + if (!rq || rq->tag != bitnr || !refcount_inc_not_zero(&rq->ref))
> > rq = NULL;
>
> Shouldn't the rq->tag != bitnr test happen after the refcount has been
> incremented since otherwise rq->tag can change after it has been read and
> before the refcount is incremented?
rq->tag can change too after its refcount is grabbed. If the rq is released
during the iterating, either SCMD_STATE_INFLIGHT is cleared or
refcount_inc_not_zero() fails. So this way works.
The use case for scsi_host_queue_ready() and scsi EH handling is a bit
special. For others, either the iterating needn't to be exact, or queue
is frozen.
Thanks,
Ming
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] blk-mq: avoid to iterate over stale request
2021-09-06 6:50 [PATCH] blk-mq: avoid to iterate over stale request Ming Lei
2021-09-06 22:44 ` Bart Van Assche
@ 2021-09-13 1:26 ` Ming Lei
2021-09-13 1:31 ` Jens Axboe
2021-12-14 18:41 ` Kashyap Desai
2 siblings, 1 reply; 13+ messages in thread
From: Ming Lei @ 2021-09-13 1:26 UTC (permalink / raw)
To: Jens Axboe; +Cc: linux-block, linux-scsi, Martin K. Petersen, luojiaxing
On Mon, Sep 06, 2021 at 02:50:03PM +0800, Ming Lei wrote:
> blk-mq can't run allocating driver tag and updating ->rqs[tag]
> atomically, meantime blk-mq doesn't clear ->rqs[tag] after the driver
> tag is released.
>
> So there is chance to iterating over one stale request just after the
> tag is allocated and before updating ->rqs[tag].
>
> scsi_host_busy_iter() calls scsi_host_check_in_flight() to count scsi
> in-flight requests after scsi host is blocked, so no new scsi command can
> be marked as SCMD_STATE_INFLIGHT. However, driver tag allocation still can
> be run by blk-mq core. One request is marked as SCMD_STATE_INFLIGHT,
> but this request may have been kept in another slot of ->rqs[], meantime
> the slot can be allocated out but ->rqs[] isn't updated yet. Then this
> in-flight request is counted twice as SCMD_STATE_INFLIGHT. This way causes
> trouble in handling scsi error.
>
> Fixes the issue by not iterating over stale request.
>
> Cc: linux-scsi@vger.kernel.org
> Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
> Reported-by: luojiaxing <luojiaxing@huawei.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
Hello Jens,
luojiaxiang has verified that this patch fixes his issue, any chance to
merge it?
Thanks,
Ming
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] blk-mq: avoid to iterate over stale request
2021-09-13 1:26 ` Ming Lei
@ 2021-09-13 1:31 ` Jens Axboe
0 siblings, 0 replies; 13+ messages in thread
From: Jens Axboe @ 2021-09-13 1:31 UTC (permalink / raw)
To: Ming Lei; +Cc: linux-block, linux-scsi, Martin K. Petersen, luojiaxing
On 9/12/21 7:26 PM, Ming Lei wrote:
> On Mon, Sep 06, 2021 at 02:50:03PM +0800, Ming Lei wrote:
>> blk-mq can't run allocating driver tag and updating ->rqs[tag]
>> atomically, meantime blk-mq doesn't clear ->rqs[tag] after the driver
>> tag is released.
>>
>> So there is chance to iterating over one stale request just after the
>> tag is allocated and before updating ->rqs[tag].
>>
>> scsi_host_busy_iter() calls scsi_host_check_in_flight() to count scsi
>> in-flight requests after scsi host is blocked, so no new scsi command can
>> be marked as SCMD_STATE_INFLIGHT. However, driver tag allocation still can
>> be run by blk-mq core. One request is marked as SCMD_STATE_INFLIGHT,
>> but this request may have been kept in another slot of ->rqs[], meantime
>> the slot can be allocated out but ->rqs[] isn't updated yet. Then this
>> in-flight request is counted twice as SCMD_STATE_INFLIGHT. This way causes
>> trouble in handling scsi error.
>>
>> Fixes the issue by not iterating over stale request.
>>
>> Cc: linux-scsi@vger.kernel.org
>> Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
>> Reported-by: luojiaxing <luojiaxing@huawei.com>
>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
>
> Hello Jens,
>
> luojiaxiang has verified that this patch fixes his issue, any chance to
> merge it?
I'll queue it up for 5.15-rc2, thanks.
--
Jens Axboe
^ permalink raw reply [flat|nested] 13+ messages in thread
* RE: [PATCH] blk-mq: avoid to iterate over stale request
2021-09-06 6:50 [PATCH] blk-mq: avoid to iterate over stale request Ming Lei
2021-09-06 22:44 ` Bart Van Assche
2021-09-13 1:26 ` Ming Lei
@ 2021-12-14 18:41 ` Kashyap Desai
2021-12-15 3:36 ` Ming Lei
2 siblings, 1 reply; 13+ messages in thread
From: Kashyap Desai @ 2021-12-14 18:41 UTC (permalink / raw)
To: Ming Lei, luojiaxing
Cc: Jens Axboe, linux-block, linux-scsi, Martin K. Petersen, John Garry
[-- Attachment #1: Type: text/plain, Size: 3139 bytes --]
+ John Garry
> blk-mq can't run allocating driver tag and updating ->rqs[tag]
atomically,
> meantime blk-mq doesn't clear ->rqs[tag] after the driver tag is
released.
>
> So there is chance to iterating over one stale request just after the
tag is
> allocated and before updating ->rqs[tag].
>
> scsi_host_busy_iter() calls scsi_host_check_in_flight() to count scsi
in-flight
> requests after scsi host is blocked, so no new scsi command can be
marked as
> SCMD_STATE_INFLIGHT. However, driver tag allocation still can be run by
blk-
> mq core. One request is marked as SCMD_STATE_INFLIGHT, but this request
> may have been kept in another slot of ->rqs[], meantime the slot can be
> allocated out but ->rqs[] isn't updated yet. Then this in-flight request
is
> counted twice as SCMD_STATE_INFLIGHT. This way causes trouble in
handling
> scsi error.
Hi Ming,
We found similar issue on RHEL8.5 (kernel does not have this patch in
discussion.). Issue reproduced on 5.15 kernel as well.
I understood this commit will fix specific race condition and avoid
reading incorrect host_busy value.
As per commit message - That incorrect host_busy will be just transient.
If we read after some delay, correct host_busy count will be available.
Right ?
In my case (I am using shared host tag enabled driver), it is not race
condition issue but stale rqs[] entries create permanent incorrect count
of host_busy.
Example - There are two pending IOs. This IOs are timed out. Bitmap of
pending IO is tag#5 (actually belongs to hctx0), tag#10 (actually belongs
to hctx1). Note - This is a shared bit map.
If hctx0 has same address of the request at 5th and 10th index, we will
count total 2 inflight commands instead of 1 from hctx0 context + From
hctx1 context, we will count 1 inflight command = Total is 3.
Even though we read after some delay, host_busy will be incorrect. We
expect host_busy = 2 but it will return 3.
This patch fix my issue explained above for shared host-tag case. I am
confused reading the commit message. You may not have intentionally fix
the issue as I explained but indirectly it fixes my issue. Am I correct ?
What was an issue reported by Luojiaxiang ? I am interested to know if
issue reported by Luojiaxiang had shared host tagset enabled ?
Kashyap
>
> Fixes the issue by not iterating over stale request.
>
> Cc: linux-scsi@vger.kernel.org
> Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
> Reported-by: luojiaxing <luojiaxing@huawei.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
> block/blk-mq-tag.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c index
> 86f87346232a..ff5caeb82542 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -208,7 +208,7 @@ static struct request
> *blk_mq_find_and_get_req(struct blk_mq_tags *tags,
>
> spin_lock_irqsave(&tags->lock, flags);
> rq = tags->rqs[bitnr];
> - if (!rq || !refcount_inc_not_zero(&rq->ref))
> + if (!rq || rq->tag != bitnr || !refcount_inc_not_zero(&rq->ref))
> rq = NULL;
> spin_unlock_irqrestore(&tags->lock, flags);
> return rq;
> --
> 2.31.1
[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4212 bytes --]
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] blk-mq: avoid to iterate over stale request
2021-12-14 18:41 ` Kashyap Desai
@ 2021-12-15 3:36 ` Ming Lei
2021-12-15 7:30 ` Kashyap Desai
0 siblings, 1 reply; 13+ messages in thread
From: Ming Lei @ 2021-12-15 3:36 UTC (permalink / raw)
To: Kashyap Desai
Cc: luojiaxing, Jens Axboe, linux-block, linux-scsi,
Martin K. Petersen, John Garry
Hello Kashyap,
On Wed, Dec 15, 2021 at 12:11:13AM +0530, Kashyap Desai wrote:
> + John Garry
>
> > blk-mq can't run allocating driver tag and updating ->rqs[tag]
> atomically,
> > meantime blk-mq doesn't clear ->rqs[tag] after the driver tag is
> released.
> >
> > So there is chance to iterating over one stale request just after the
> tag is
> > allocated and before updating ->rqs[tag].
> >
> > scsi_host_busy_iter() calls scsi_host_check_in_flight() to count scsi
> in-flight
> > requests after scsi host is blocked, so no new scsi command can be
> marked as
> > SCMD_STATE_INFLIGHT. However, driver tag allocation still can be run by
> blk-
> > mq core. One request is marked as SCMD_STATE_INFLIGHT, but this request
> > may have been kept in another slot of ->rqs[], meantime the slot can be
> > allocated out but ->rqs[] isn't updated yet. Then this in-flight request
> is
> > counted twice as SCMD_STATE_INFLIGHT. This way causes trouble in
> handling
> > scsi error.
>
> Hi Ming,
>
> We found similar issue on RHEL8.5 (kernel does not have this patch in
> discussion.). Issue reproduced on 5.15 kernel as well.
> I understood this commit will fix specific race condition and avoid
> reading incorrect host_busy value.
> As per commit message - That incorrect host_busy will be just transient.
> If we read after some delay, correct host_busy count will be available.
> Right ?
Yeah, any counter(include atomic counter) works in this way.
But here it may be 'permanent' because one stale request pointer may stay in
one slot of ->rqs[] for long enough time if this slot isn't reused, meantime
the same request can be reallocated in case of real io scheduler. Maybe the
commit log should be improved a bit for making it explicit.
>
> In my case (I am using shared host tag enabled driver), it is not race
> condition issue but stale rqs[] entries create permanent incorrect count
> of host_busy.
> Example - There are two pending IOs. This IOs are timed out. Bitmap of
> pending IO is tag#5 (actually belongs to hctx0), tag#10 (actually belongs
> to hctx1). Note - This is a shared bit map.
> If hctx0 has same address of the request at 5th and 10th index, we will
It shouldn't be possible, since ->rqs[] is per-tags. If it is shared bit
map, both tag#5 and tag#10 are set, and both shared_tags->rqs[5] &
shared_tags->rqs[10] should point to the updated requests(timed out).
> count total 2 inflight commands instead of 1 from hctx0 context + From
> hctx1 context, we will count 1 inflight command = Total is 3.
> Even though we read after some delay, host_busy will be incorrect. We
> expect host_busy = 2 but it will return 3.
>
> This patch fix my issue explained above for shared host-tag case. I am
> confused reading the commit message. You may not have intentionally fix
> the issue as I explained but indirectly it fixes my issue. Am I correct ?
>
> What was an issue reported by Luojiaxiang ? I am interested to know if
> issue reported by Luojiaxiang had shared host tagset enabled ?
https://lore.kernel.org/linux-scsi/fe5cf6c4-ce5e-4a0f-f4ab-5c10539492cb@huawei.com/
It should be same with yours, and the original report described & explained
the issue in details.
thanks,
Ming
>
> Kashyap
>
> >
> > Fixes the issue by not iterating over stale request.
> >
> > Cc: linux-scsi@vger.kernel.org
> > Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
> > Reported-by: luojiaxing <luojiaxing@huawei.com>
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> > block/blk-mq-tag.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c index
> > 86f87346232a..ff5caeb82542 100644
> > --- a/block/blk-mq-tag.c
> > +++ b/block/blk-mq-tag.c
> > @@ -208,7 +208,7 @@ static struct request
> > *blk_mq_find_and_get_req(struct blk_mq_tags *tags,
> >
> > spin_lock_irqsave(&tags->lock, flags);
> > rq = tags->rqs[bitnr];
> > - if (!rq || !refcount_inc_not_zero(&rq->ref))
> > + if (!rq || rq->tag != bitnr || !refcount_inc_not_zero(&rq->ref))
> > rq = NULL;
> > spin_unlock_irqrestore(&tags->lock, flags);
> > return rq;
> > --
> > 2.31.1
--
Ming
^ permalink raw reply [flat|nested] 13+ messages in thread
* RE: [PATCH] blk-mq: avoid to iterate over stale request
2021-12-15 3:36 ` Ming Lei
@ 2021-12-15 7:30 ` Kashyap Desai
2021-12-15 8:02 ` Ming Lei
0 siblings, 1 reply; 13+ messages in thread
From: Kashyap Desai @ 2021-12-15 7:30 UTC (permalink / raw)
To: Ming Lei
Cc: luojiaxing, Jens Axboe, linux-block, linux-scsi,
Martin K. Petersen, John Garry
[-- Attachment #1: Type: text/plain, Size: 4133 bytes --]
>
> Hello Kashyap,
>
> On Wed, Dec 15, 2021 at 12:11:13AM +0530, Kashyap Desai wrote:
> > + John Garry
> >
> > > blk-mq can't run allocating driver tag and updating ->rqs[tag]
> > atomically,
> > > meantime blk-mq doesn't clear ->rqs[tag] after the driver tag is
> > released.
> > >
> > > So there is chance to iterating over one stale request just after
> > > the
> > tag is
> > > allocated and before updating ->rqs[tag].
> > >
> > > scsi_host_busy_iter() calls scsi_host_check_in_flight() to count
> > > scsi
> > in-flight
> > > requests after scsi host is blocked, so no new scsi command can be
> > marked as
> > > SCMD_STATE_INFLIGHT. However, driver tag allocation still can be run
> > > by
> > blk-
> > > mq core. One request is marked as SCMD_STATE_INFLIGHT, but this
> > > request may have been kept in another slot of ->rqs[], meantime the
> > > slot can be allocated out but ->rqs[] isn't updated yet. Then this
> > > in-flight request
> > is
> > > counted twice as SCMD_STATE_INFLIGHT. This way causes trouble in
> > handling
> > > scsi error.
> >
> > Hi Ming,
> >
> > We found similar issue on RHEL8.5 (kernel does not have this patch in
> > discussion.). Issue reproduced on 5.15 kernel as well.
> > I understood this commit will fix specific race condition and avoid
> > reading incorrect host_busy value.
> > As per commit message - That incorrect host_busy will be just
transient.
> > If we read after some delay, correct host_busy count will be
available.
> > Right ?
>
> Yeah, any counter(include atomic counter) works in this way.
>
> But here it may be 'permanent' because one stale request pointer may
stay in
> one slot of ->rqs[] for long enough time if this slot isn't reused,
meantime the
> same request can be reallocated in case of real io scheduler. Maybe the
> commit log should be improved a bit for making it explicit.
Changing commit log description will help.
>
> >
> > In my case (I am using shared host tag enabled driver), it is not race
> > condition issue but stale rqs[] entries create permanent incorrect
> > count of host_busy.
> > Example - There are two pending IOs. This IOs are timed out. Bitmap of
> > pending IO is tag#5 (actually belongs to hctx0), tag#10 (actually
> > belongs to hctx1). Note - This is a shared bit map.
> > If hctx0 has same address of the request at 5th and 10th index, we
> > will
>
> It shouldn't be possible, since ->rqs[] is per-tags. If it is shared bit
map, both
> tag#5 and tag#10 are set, and both shared_tags->rqs[5] & shared_tags-
> >rqs[10] should point to the updated requests(timed out).
Updated pointers will be there for actual hctx. Below is possible and
that is what causing problem in original issue.
shared_tags->rqs[5] of hctx0 is having scmd = 0xAA (inflight command)
shared_tags->rqs[10] of hctx0 is having scmd = 0xAA (inflight command) <-
This is incorrect. While looping on hctx0 tags[], bitmap = 10 this entry
is also found which is actually outstanding on hctx1.
shared_tags->rqs[10] of hctx1 is having scmd = 0xBB (inflight command)
Issue noticed by me is the exact same issue described @ below -
https://lore.kernel.org/linux-scsi/fe5cf6c4-ce5e-4a0f-f4ab-5c10539492cb@hu
awei.com/
Issue is only exposed to shared host tagset. I got the required
information. Thanks.
Kashyap
>
> > count total 2 inflight commands instead of 1 from hctx0 context + From
> > hctx1 context, we will count 1 inflight command = Total is 3.
> > Even though we read after some delay, host_busy will be incorrect. We
> > expect host_busy = 2 but it will return 3.
> >
> > This patch fix my issue explained above for shared host-tag case. I
> > am confused reading the commit message. You may not have intentionally
> > fix the issue as I explained but indirectly it fixes my issue. Am I
correct ?
> >
> > What was an issue reported by Luojiaxiang ? I am interested to know if
> > issue reported by Luojiaxiang had shared host tagset enabled ?
>
> https://lore.kernel.org/linux-scsi/fe5cf6c4-ce5e-4a0f-f4ab-
> 5c10539492cb@huawei.com/
I check this. It is same issue as what I am seeing on Broadcom controller
only if shared host tagset is enabled.
[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4212 bytes --]
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] blk-mq: avoid to iterate over stale request
2021-12-15 7:30 ` Kashyap Desai
@ 2021-12-15 8:02 ` Ming Lei
2021-12-15 8:45 ` Kashyap Desai
0 siblings, 1 reply; 13+ messages in thread
From: Ming Lei @ 2021-12-15 8:02 UTC (permalink / raw)
To: Kashyap Desai
Cc: luojiaxing, Jens Axboe, linux-block, linux-scsi,
Martin K. Petersen, John Garry
On Wed, Dec 15, 2021 at 01:00:49PM +0530, Kashyap Desai wrote:
> >
> > Hello Kashyap,
> >
> > On Wed, Dec 15, 2021 at 12:11:13AM +0530, Kashyap Desai wrote:
> > > + John Garry
> > >
> > > > blk-mq can't run allocating driver tag and updating ->rqs[tag]
> > > atomically,
> > > > meantime blk-mq doesn't clear ->rqs[tag] after the driver tag is
> > > released.
> > > >
> > > > So there is chance to iterating over one stale request just after
> > > > the
> > > tag is
> > > > allocated and before updating ->rqs[tag].
> > > >
> > > > scsi_host_busy_iter() calls scsi_host_check_in_flight() to count
> > > > scsi
> > > in-flight
> > > > requests after scsi host is blocked, so no new scsi command can be
> > > marked as
> > > > SCMD_STATE_INFLIGHT. However, driver tag allocation still can be run
> > > > by
> > > blk-
> > > > mq core. One request is marked as SCMD_STATE_INFLIGHT, but this
> > > > request may have been kept in another slot of ->rqs[], meantime the
> > > > slot can be allocated out but ->rqs[] isn't updated yet. Then this
> > > > in-flight request
> > > is
> > > > counted twice as SCMD_STATE_INFLIGHT. This way causes trouble in
> > > handling
> > > > scsi error.
> > >
> > > Hi Ming,
> > >
> > > We found similar issue on RHEL8.5 (kernel does not have this patch in
> > > discussion.). Issue reproduced on 5.15 kernel as well.
> > > I understood this commit will fix specific race condition and avoid
> > > reading incorrect host_busy value.
> > > As per commit message - That incorrect host_busy will be just
> transient.
> > > If we read after some delay, correct host_busy count will be
> available.
> > > Right ?
> >
> > Yeah, any counter(include atomic counter) works in this way.
> >
> > But here it may be 'permanent' because one stale request pointer may
> stay in
> > one slot of ->rqs[] for long enough time if this slot isn't reused,
> meantime the
> > same request can be reallocated in case of real io scheduler. Maybe the
> > commit log should be improved a bit for making it explicit.
>
> Changing commit log description will help.
>
> >
> > >
> > > In my case (I am using shared host tag enabled driver), it is not race
> > > condition issue but stale rqs[] entries create permanent incorrect
> > > count of host_busy.
> > > Example - There are two pending IOs. This IOs are timed out. Bitmap of
> > > pending IO is tag#5 (actually belongs to hctx0), tag#10 (actually
> > > belongs to hctx1). Note - This is a shared bit map.
> > > If hctx0 has same address of the request at 5th and 10th index, we
> > > will
> >
> > It shouldn't be possible, since ->rqs[] is per-tags. If it is shared bit
> map, both
> > tag#5 and tag#10 are set, and both shared_tags->rqs[5] & shared_tags-
> > >rqs[10] should point to the updated requests(timed out).
> Updated pointers will be there for actual hctx. Below is possible and
> that is what causing problem in original issue.
>
> shared_tags->rqs[5] of hctx0 is having scmd = 0xAA (inflight command)
> shared_tags->rqs[10] of hctx0 is having scmd = 0xAA (inflight command) <-
> This is incorrect. While looping on hctx0 tags[], bitmap = 10 this entry
> is also found which is actually outstanding on hctx1.
> shared_tags->rqs[10] of hctx1 is having scmd = 0xBB (inflight command)
Sorry, I am a bit confused, please look at the following code and
blk_mq_find_and_get_req()(<-bt_tags_iter).
void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
busy_tag_iter_fn *fn, void *priv)
{
unsigned int flags = tagset->flags;
int i, nr_tags;
nr_tags = blk_mq_is_shared_tags(flags) ? 1 : tagset->nr_hw_queues;
for (i = 0; i < nr_tags; i++) {
if (tagset->tags && tagset->tags[i])
__blk_mq_all_tag_iter(tagset->tags[i], fn, priv,
BT_TAG_ITER_STARTED);
}
}
In case of shared tags, only tagset->tags[0] is iterated over, so both
5 and 10 are checked, and shared_tags->rqs[5] and shared_tags->rqs[10]
shouldn't just point to the two latest requests? How can both 0xAA &
0xBB be retrieved from shared_tags->rqs[10]?
> Issue noticed by me is the exact same issue described @ below -
> https://lore.kernel.org/linux-scsi/fe5cf6c4-ce5e-4a0f-f4ab-5c10539492cb@hu
> awei.com/
>
> Issue is only exposed to shared host tagset. I got the required
> information. Thanks.
>
> Kashyap
> >
> > > count total 2 inflight commands instead of 1 from hctx0 context + From
> > > hctx1 context, we will count 1 inflight command = Total is 3.
> > > Even though we read after some delay, host_busy will be incorrect. We
> > > expect host_busy = 2 but it will return 3.
> > >
> > > This patch fix my issue explained above for shared host-tag case. I
> > > am confused reading the commit message. You may not have intentionally
> > > fix the issue as I explained but indirectly it fixes my issue. Am I
> correct ?
> > >
> > > What was an issue reported by Luojiaxiang ? I am interested to know if
> > > issue reported by Luojiaxiang had shared host tagset enabled ?
> >
> > https://lore.kernel.org/linux-scsi/fe5cf6c4-ce5e-4a0f-f4ab-
> > 5c10539492cb@huawei.com/
>
> I check this. It is same issue as what I am seeing on Broadcom controller
> only if shared host tagset is enabled.
But 67f3b2f822b7 ("blk-mq: avoid to iterate over stale request") isn't
only for shared tags.
Thanks
Ming
^ permalink raw reply [flat|nested] 13+ messages in thread
* RE: [PATCH] blk-mq: avoid to iterate over stale request
2021-12-15 8:02 ` Ming Lei
@ 2021-12-15 8:45 ` Kashyap Desai
2021-12-15 12:56 ` Ming Lei
0 siblings, 1 reply; 13+ messages in thread
From: Kashyap Desai @ 2021-12-15 8:45 UTC (permalink / raw)
To: Ming Lei
Cc: luojiaxing, Jens Axboe, linux-block, linux-scsi,
Martin K. Petersen, John Garry
[-- Attachment #1: Type: text/plain, Size: 3712 bytes --]
> >
> > shared_tags->rqs[5] of hctx0 is having scmd = 0xAA (inflight command)
> > shared_tags->rqs[10] of hctx0 is having scmd = 0xAA (inflight command)
> > <- This is incorrect. While looping on hctx0 tags[], bitmap = 10 this
> > entry is also found which is actually outstanding on hctx1.
> > shared_tags->rqs[10] of hctx1 is having scmd = 0xBB (inflight command)
>
> Sorry, I am a bit confused, please look at the following code and
> blk_mq_find_and_get_req()(<-bt_tags_iter).
My issue is without shared tags. Below patch is certainly a fix for shared
tags (for 5.16-rc).
I have seen scsi eh deadlock issue on 5.15 kernel which does not have
shared tags (but it has shared bitmap.)
>
> void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
> busy_tag_iter_fn *fn, void *priv) {
> unsigned int flags = tagset->flags;
> int i, nr_tags;
>
> nr_tags = blk_mq_is_shared_tags(flags) ? 1 :
tagset->nr_hw_queues;
>
> for (i = 0; i < nr_tags; i++) {
> if (tagset->tags && tagset->tags[i])
> __blk_mq_all_tag_iter(tagset->tags[i], fn, priv,
> BT_TAG_ITER_STARTED);
> }
> }
>
> In case of shared tags, only tagset->tags[0] is iterated over, so both
> 5 and 10 are checked, and shared_tags->rqs[5] and shared_tags->rqs[10]
> shouldn't just point to the two latest requests? How can both 0xAA &
0xBB be
> retrieved from shared_tags->rqs[10]?
Since I am not talking about shared tags, you can remap same condition
now.
In 5.15 kernel (shared bitmap is enabled but not shared tags) -
blk_mq_tagset_busy_iter is called for each tags[] of nr_hw_queues times.
It will call bt_tags_for_each() for hctx0.tags->[], hctx1.tags->[] ..
Since tags->bitmap_tags is shared when it traverse hctx0.tags->[], it can
find hctx0.tags[10].rq which is really stale entry.
If the same request which is stale @ hctx0.tags[10] is really outstanding
at some other tag#, stale entry will be counted in host_busy.
>
> > Issue noticed by me is the exact same issue described @ below -
> > https://lore.kernel.org/linux-scsi/fe5cf6c4-ce5e-4a0f-f4ab-5c10539492c
> > b@hu
> > awei.com/
> >
> > Issue is only exposed to shared host tagset. I got the required
> > information. Thanks.
> >
> > Kashyap
> > >
> > > > count total 2 inflight commands instead of 1 from hctx0 context +
> > > > From
> > > > hctx1 context, we will count 1 inflight command = Total is 3.
> > > > Even though we read after some delay, host_busy will be incorrect.
> > > > We expect host_busy = 2 but it will return 3.
> > > >
> > > > This patch fix my issue explained above for shared host-tag case.
> > > > I am confused reading the commit message. You may not have
> > > > intentionally fix the issue as I explained but indirectly it fixes
> > > > my issue. Am I
> > correct ?
> > > >
> > > > What was an issue reported by Luojiaxiang ? I am interested to
> > > > know if issue reported by Luojiaxiang had shared host tagset
enabled ?
> > >
> > > https://lore.kernel.org/linux-scsi/fe5cf6c4-ce5e-4a0f-f4ab-
> > > 5c10539492cb@huawei.com/
> >
> > I check this. It is same issue as what I am seeing on Broadcom
> > controller only if shared host tagset is enabled.
>
> But 67f3b2f822b7 ("blk-mq: avoid to iterate over stale request") isn't
only for
> shared tags.
I mean, shared bitmap in my whole discussion. Megaraid_sas driver use
shared bitmap, so it is exposed and It is confirmed from this discussion.
Do we still have exposure (if "blk-mq: avoid to iterate over stale
request" is not part of kernel) to mpi3mr type driver which does not use
shared bitmap but has nr_hw_queues > 1. ?
Kashyap
>
>
>
> Thanks
> Ming
[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4212 bytes --]
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] blk-mq: avoid to iterate over stale request
2021-12-15 8:45 ` Kashyap Desai
@ 2021-12-15 12:56 ` Ming Lei
2021-12-16 11:55 ` Kashyap Desai
0 siblings, 1 reply; 13+ messages in thread
From: Ming Lei @ 2021-12-15 12:56 UTC (permalink / raw)
To: Kashyap Desai
Cc: luojiaxing, Jens Axboe, linux-block, linux-scsi,
Martin K. Petersen, John Garry
On Wed, Dec 15, 2021 at 02:15:51PM +0530, Kashyap Desai wrote:
> > >
> > > shared_tags->rqs[5] of hctx0 is having scmd = 0xAA (inflight command)
> > > shared_tags->rqs[10] of hctx0 is having scmd = 0xAA (inflight command)
> > > <- This is incorrect. While looping on hctx0 tags[], bitmap = 10 this
> > > entry is also found which is actually outstanding on hctx1.
> > > shared_tags->rqs[10] of hctx1 is having scmd = 0xBB (inflight command)
> >
> > Sorry, I am a bit confused, please look at the following code and
> > blk_mq_find_and_get_req()(<-bt_tags_iter).
>
>
> My issue is without shared tags. Below patch is certainly a fix for shared
> tags (for 5.16-rc).
> I have seen scsi eh deadlock issue on 5.15 kernel which does not have
> shared tags (but it has shared bitmap.)
OK, if you are talking about non-shared tags, your issue is exactly
addressed by 67f3b2f822b7 blk-mq: avoid to iterate over stale request
>
> >
> > void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
> > busy_tag_iter_fn *fn, void *priv) {
> > unsigned int flags = tagset->flags;
> > int i, nr_tags;
> >
> > nr_tags = blk_mq_is_shared_tags(flags) ? 1 :
> tagset->nr_hw_queues;
> >
> > for (i = 0; i < nr_tags; i++) {
> > if (tagset->tags && tagset->tags[i])
> > __blk_mq_all_tag_iter(tagset->tags[i], fn, priv,
> > BT_TAG_ITER_STARTED);
> > }
> > }
> >
> > In case of shared tags, only tagset->tags[0] is iterated over, so both
> > 5 and 10 are checked, and shared_tags->rqs[5] and shared_tags->rqs[10]
> > shouldn't just point to the two latest requests? How can both 0xAA &
> 0xBB be
> > retrieved from shared_tags->rqs[10]?
>
> Since I am not talking about shared tags, you can remap same condition
> now.
> In 5.15 kernel (shared bitmap is enabled but not shared tags) -
Shared bitmap and shared tags should have be same thing, that means all hw queues
share same tag space.
> blk_mq_tagset_busy_iter is called for each tags[] of nr_hw_queues times.
> It will call bt_tags_for_each() for hctx0.tags->[], hctx1.tags->[] ..
> Since tags->bitmap_tags is shared when it traverse hctx0.tags->[], it can
If ->birtmap_tags is shared, only one tags should be iterated over, so
the following commit was added:
0994c64eb415 blk-mq: Fix blk_mq_tagset_busy_iter() for shared tags
> find hctx0.tags[10].rq which is really stale entry.
> If the same request which is stale @ hctx0.tags[10] is really outstanding
> at some other tag#, stale entry will be counted in host_busy.
>
> >
> > > Issue noticed by me is the exact same issue described @ below -
> > > https://lore.kernel.org/linux-scsi/fe5cf6c4-ce5e-4a0f-f4ab-5c10539492c
> > > b@hu
> > > awei.com/
> > >
> > > Issue is only exposed to shared host tagset. I got the required
> > > information. Thanks.
> > >
> > > Kashyap
> > > >
> > > > > count total 2 inflight commands instead of 1 from hctx0 context +
> > > > > From
> > > > > hctx1 context, we will count 1 inflight command = Total is 3.
> > > > > Even though we read after some delay, host_busy will be incorrect.
> > > > > We expect host_busy = 2 but it will return 3.
> > > > >
> > > > > This patch fix my issue explained above for shared host-tag case.
> > > > > I am confused reading the commit message. You may not have
> > > > > intentionally fix the issue as I explained but indirectly it fixes
> > > > > my issue. Am I
> > > correct ?
> > > > >
> > > > > What was an issue reported by Luojiaxiang ? I am interested to
> > > > > know if issue reported by Luojiaxiang had shared host tagset
> enabled ?
> > > >
> > > > https://lore.kernel.org/linux-scsi/fe5cf6c4-ce5e-4a0f-f4ab-
> > > > 5c10539492cb@huawei.com/
> > >
> > > I check this. It is same issue as what I am seeing on Broadcom
> > > controller only if shared host tagset is enabled.
> >
> > But 67f3b2f822b7 ("blk-mq: avoid to iterate over stale request") isn't
> only for
> > shared tags.
>
> I mean, shared bitmap in my whole discussion. Megaraid_sas driver use
> shared bitmap, so it is exposed and It is confirmed from this discussion.
> Do we still have exposure (if "blk-mq: avoid to iterate over stale
> request" is not part of kernel) to mpi3mr type driver which does not use
> shared bitmap but has nr_hw_queues > 1. ?
Not sure I understand your poing, but patch "blk-mq: avoid to iterate over stale
request" can cover both shared tags or not.
Thanks,
Ming
^ permalink raw reply [flat|nested] 13+ messages in thread
* RE: [PATCH] blk-mq: avoid to iterate over stale request
2021-12-15 12:56 ` Ming Lei
@ 2021-12-16 11:55 ` Kashyap Desai
2021-12-16 12:50 ` Ming Lei
0 siblings, 1 reply; 13+ messages in thread
From: Kashyap Desai @ 2021-12-16 11:55 UTC (permalink / raw)
To: Ming Lei
Cc: luojiaxing, Jens Axboe, linux-block, linux-scsi,
Martin K. Petersen, John Garry
[-- Attachment #1: Type: text/plain, Size: 850 bytes --]
> >
> > I mean, shared bitmap in my whole discussion. Megaraid_sas driver use
> > shared bitmap, so it is exposed and It is confirmed from this
discussion.
> > Do we still have exposure (if "blk-mq: avoid to iterate over stale
> > request" is not part of kernel) to mpi3mr type driver which does not
> > use shared bitmap but has nr_hw_queues > 1. ?
>
> Not sure I understand your poing, but patch "blk-mq: avoid to iterate
over
> stale request" can cover both shared tags or not.
I agree with all above reply.
My query is for mpi3mr driver which is not calling "host->host_tagset =
1;", but nr_hw_queues are more than 1.
Current <mpi3mr> driver is nvme style interface. nr_hw_queues > 1 and
host->host_tagset = 0.
Is this patch "blk-mq: avoid to iterate over stale request" is applicable
for <mpi3mr> driver ?
Kashyap
>
>
>
> Thanks,
> Ming
[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4212 bytes --]
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] blk-mq: avoid to iterate over stale request
2021-12-16 11:55 ` Kashyap Desai
@ 2021-12-16 12:50 ` Ming Lei
0 siblings, 0 replies; 13+ messages in thread
From: Ming Lei @ 2021-12-16 12:50 UTC (permalink / raw)
To: Kashyap Desai
Cc: luojiaxing, Jens Axboe, linux-block, linux-scsi,
Martin K. Petersen, John Garry
On Thu, Dec 16, 2021 at 05:25:41PM +0530, Kashyap Desai wrote:
> > >
> > > I mean, shared bitmap in my whole discussion. Megaraid_sas driver use
> > > shared bitmap, so it is exposed and It is confirmed from this
> discussion.
> > > Do we still have exposure (if "blk-mq: avoid to iterate over stale
> > > request" is not part of kernel) to mpi3mr type driver which does not
> > > use shared bitmap but has nr_hw_queues > 1. ?
> >
> > Not sure I understand your poing, but patch "blk-mq: avoid to iterate
> over
> > stale request" can cover both shared tags or not.
>
> I agree with all above reply.
>
> My query is for mpi3mr driver which is not calling "host->host_tagset =
> 1;", but nr_hw_queues are more than 1.
> Current <mpi3mr> driver is nvme style interface. nr_hw_queues > 1 and
> host->host_tagset = 0.
>
> Is this patch "blk-mq: avoid to iterate over stale request" is applicable
> for <mpi3mr> driver ?
Yeah, this change covers both ->host_tagset and !->host_tagset, same
with the following words:
scsi_host_check_in_flight() is used to counting scsi in-flight requests
after scsi host is blocked, so no new scsi command can be marked as
SCMD_STATE_INFLIGHT. However, driver tag allocation still can be run at
that time by blk-mq core.
The issue is in blk_mq_tagset_busy_iter(). One request is in-flight, but
this request may be kept in another slot of ->rqs[], meantime the slot
can be allocated out but ->rqs[] isn't updated yet. Then the in-flight
request is counted twice as SCMD_STATE_INFLIGHT.
Thanks,
Ming
^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2021-12-16 12:51 UTC | newest]
Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-06 6:50 [PATCH] blk-mq: avoid to iterate over stale request Ming Lei
2021-09-06 22:44 ` Bart Van Assche
2021-09-07 1:14 ` Ming Lei
2021-09-13 1:26 ` Ming Lei
2021-09-13 1:31 ` Jens Axboe
2021-12-14 18:41 ` Kashyap Desai
2021-12-15 3:36 ` Ming Lei
2021-12-15 7:30 ` Kashyap Desai
2021-12-15 8:02 ` Ming Lei
2021-12-15 8:45 ` Kashyap Desai
2021-12-15 12:56 ` Ming Lei
2021-12-16 11:55 ` Kashyap Desai
2021-12-16 12:50 ` Ming Lei
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.