All of lore.kernel.org
 help / color / mirror / Atom feed
From: Hannes Reinecke <hare@suse.de>
To: Mike Snitzer <snitzer@redhat.com>
Cc: emilne@redhat.com, Christoph Hellwig <hch@infradead.org>,
	Michael Ellerman <mpe@ellerman.id.au>,
	Mark Salter <msalter@redhat.com>,
	"James E. J. Bottomley" <JBottomley@odin.com>,
	brking <brking@us.ibm.com>,
	linux-scsi@vger.kernel.org, linux-kernel@vger.kernel.org,
	linuxppc-dev@lists.ozlabs.org, linux-block@vger.kernel.org,
	"Jun'ichi Nomura" <j-nomura@ce.jp.nec.com>,
	Jens Axboe <axboe@fb.com>
Subject: Re: kernel BUG at drivers/scsi/scsi_lib.c:1096!
Date: Wed, 25 Nov 2015 20:01:47 +0100	[thread overview]
Message-ID: <5656059B.9010102@suse.de> (raw)
In-Reply-To: <20151125180133.GA18839@redhat.com>

On 11/25/2015 07:01 PM, Mike Snitzer wrote:
> On Wed, Nov 25 2015 at  4:04am -0500,
> Hannes Reinecke <hare@suse.de> wrote:
>
>> On 11/20/2015 04:28 PM, Ewan Milne wrote:
>>> On Fri, 2015-11-20 at 15:55 +0100, Hannes Reinecke wrote:
>>>> Can't we have a joint effort here?
>>>> I've been spending a _LOT_ of time trying to debug things here, but
>>>> none of the ideas I've come up with have been able to fix anything.
>>>
>>> Yes.  I'm not the one primarily looking at it, and we don't have a
>>> reproducer in-house.  We just have the one dump right now.
>>>
>>>>
>>>> I'm almost tempted to increase the count from scsi_alloc_sgtable()
>>>> by one and be done with ...
>>>>
>>>
>>> That might not fix it if it is a problem with the merge code, though.
>>>
>> And indeed, it doesn't.
>
> How did you arrive at that?  Do you have a reproducer now?
>
Not a reproducer, but several dumps for analysis.

>> Seems I finally found the culprit.
>>
>> What happens is this:
>> We have two paths, with these seg_boundary_masks:
>>
>> path-1:    seg_boundary_mask = 65535,
>> path-2:    seg_boundary_mask = 4294967295,
>>
>> consequently the DM request queue has this:
>>
>> md-1:    seg_boundary_mask = 65535,
>>
>> What happens now is that a request is being formatted, and sent
>> to path 2. During submission req->nr_phys_segments is formatted
>> with the limits of path 2, arriving at a count of 3.
>> Now the request gets retried on path 1, but as the NOMERGE request
>> flag is set req->nr_phys_segments is never updated.
>> But blk_rq_map_sg() ignores all counters, and just uses the
>> bi_vec directly, resulting in a count of 4 -> boom.
>>
>> So the culprit here is the NOMERGE flag,
>
> NOMERGE is always set in __blk_rq_prep_clone() for cloned requests.
>
Yes.

>> which is evaluated via
>> ->dm_dispatch_request()
>>    ->blk_insert_cloned_request()
>>      ->blk_rq_check_limits()
>
> blk_insert_cloned_request() is the only caller of blk_rq_check_limits();
> anyway after reading your mail I'm still left wondering if your proposed
> patch is correct.
>
>> If the above assessment is correct, the following patch should
>> fix it:
>>
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 801ced7..12cccd6 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -1928,7 +1928,7 @@ EXPORT_SYMBOL(submit_bio);
>>    */
>>   int blk_rq_check_limits(struct request_queue *q, struct request *rq)
>>   {
>> -       if (!rq_mergeable(rq))
>> +       if (rq->cmd_type != REQ_TYPE_FS)
>>                  return 0;
>>
>>          if (blk_rq_sectors(rq) > blk_queue_get_max_sectors(q,
>> rq->cmd_flags)) {
>>
>>
>> Mike? Jens?
>> Can you comment on it?
>
> You're not explaining the actual change in the patch very well; I think
> you're correct but you're leaving the justification as an exercise to
> the reviewer:
>
> blk_rq_check_limits() will call blk_recalc_rq_segments() after the
> !rq_mergeable() check but you're saying for this case in question we
> never get there -- due to the cloned request having NOMERGE set.
>
> So in blk_rq_check_limits() you've unrolled rq_mergeable() and
> open-coded the lone remaining check (rq->cmd_type != REQ_TYPE_FS)
>
> I agree that the (rq->cmd_flags & REQ_NOMERGE_FLAGS) check in
> the blk_insert_cloned_request() call-chain (via rq_mergeable()) makes no
> sense for cloned requests that always have NOMERGE set.
>
> So you're saying that by having blk_rq_check_limits() go on to call
> blk_recalc_rq_segments() this bug will be fixed?
>
That is the idea.

I've already established that in all instances I have seen so far
req->nr_phys_segments is _less_ than req->bio->bi_phys_segments.

As it turns out, req->nr_phys_segemnts _would_ have been updated in
blk_rq_check_limits(), but isn't due to the NOMERGE flag being set
for the cloned request.
So each cloned request inherits the values from the original request,
despite the fact that req->nr_phys_segments _has_ to be evaluated in
the final request_queue context, as the queue limits _might_ be 
different from the original (merged) queue limits of the multipath
request queue.

> BTW, I think blk_rq_check_limits()'s export should be removed and the
> function made static and renamed to blk_clone_rq_check_limits(), again:
> blk_insert_cloned_request() is the only caller of blk_rq_check_limits()
>
Actually, seeing Jens' last comment the check for REQ_TYPE_FS is 
pointless, too, so we might as well remove the entire if-clause.

> Seems prudent to make that change now to be clear that this code is only
> used by cloned requests.
>
Yeah, that would make sense. I'll be preparing a patch.
With a more detailed description :-)

Cheers,

Hannes

-- 
Dr. Hannes Reinecke		               zSeries & Storage
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


WARNING: multiple messages have this Message-ID (diff)
From: Hannes Reinecke <hare@suse.de>
To: Mike Snitzer <snitzer@redhat.com>
Cc: emilne@redhat.com, Christoph Hellwig <hch@infradead.org>,
	Michael Ellerman <mpe@ellerman.id.au>,
	Mark Salter <msalter@redhat.com>,
	"James E. J. Bottomley" <JBottomley@odin.com>,
	brking <brking@us.ibm.com>,
	linux-scsi@vger.kernel.org, linux-kernel@vger.kernel.org,
	linuxppc-dev@lists.ozlabs.org, linux-block@vger.kernel.org,
	Jun'ichi Nomura <j-nomura@ce.jp.nec.com>,
	Jens Axboe <axboe@fb.com>
Subject: Re: kernel BUG at drivers/scsi/scsi_lib.c:1096!
Date: Wed, 25 Nov 2015 20:01:47 +0100	[thread overview]
Message-ID: <5656059B.9010102@suse.de> (raw)
In-Reply-To: <20151125180133.GA18839@redhat.com>

On 11/25/2015 07:01 PM, Mike Snitzer wrote:
> On Wed, Nov 25 2015 at  4:04am -0500,
> Hannes Reinecke <hare@suse.de> wrote:
>
>> On 11/20/2015 04:28 PM, Ewan Milne wrote:
>>> On Fri, 2015-11-20 at 15:55 +0100, Hannes Reinecke wrote:
>>>> Can't we have a joint effort here?
>>>> I've been spending a _LOT_ of time trying to debug things here, but
>>>> none of the ideas I've come up with have been able to fix anything.
>>>
>>> Yes.  I'm not the one primarily looking at it, and we don't have a
>>> reproducer in-house.  We just have the one dump right now.
>>>
>>>>
>>>> I'm almost tempted to increase the count from scsi_alloc_sgtable()
>>>> by one and be done with ...
>>>>
>>>
>>> That might not fix it if it is a problem with the merge code, though.
>>>
>> And indeed, it doesn't.
>
> How did you arrive at that?  Do you have a reproducer now?
>
Not a reproducer, but several dumps for analysis.

>> Seems I finally found the culprit.
>>
>> What happens is this:
>> We have two paths, with these seg_boundary_masks:
>>
>> path-1:    seg_boundary_mask = 65535,
>> path-2:    seg_boundary_mask = 4294967295,
>>
>> consequently the DM request queue has this:
>>
>> md-1:    seg_boundary_mask = 65535,
>>
>> What happens now is that a request is being formatted, and sent
>> to path 2. During submission req->nr_phys_segments is formatted
>> with the limits of path 2, arriving at a count of 3.
>> Now the request gets retried on path 1, but as the NOMERGE request
>> flag is set req->nr_phys_segments is never updated.
>> But blk_rq_map_sg() ignores all counters, and just uses the
>> bi_vec directly, resulting in a count of 4 -> boom.
>>
>> So the culprit here is the NOMERGE flag,
>
> NOMERGE is always set in __blk_rq_prep_clone() for cloned requests.
>
Yes.

>> which is evaluated via
>> ->dm_dispatch_request()
>>    ->blk_insert_cloned_request()
>>      ->blk_rq_check_limits()
>
> blk_insert_cloned_request() is the only caller of blk_rq_check_limits();
> anyway after reading your mail I'm still left wondering if your proposed
> patch is correct.
>
>> If the above assessment is correct, the following patch should
>> fix it:
>>
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 801ced7..12cccd6 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -1928,7 +1928,7 @@ EXPORT_SYMBOL(submit_bio);
>>    */
>>   int blk_rq_check_limits(struct request_queue *q, struct request *rq)
>>   {
>> -       if (!rq_mergeable(rq))
>> +       if (rq->cmd_type != REQ_TYPE_FS)
>>                  return 0;
>>
>>          if (blk_rq_sectors(rq) > blk_queue_get_max_sectors(q,
>> rq->cmd_flags)) {
>>
>>
>> Mike? Jens?
>> Can you comment on it?
>
> You're not explaining the actual change in the patch very well; I think
> you're correct but you're leaving the justification as an exercise to
> the reviewer:
>
> blk_rq_check_limits() will call blk_recalc_rq_segments() after the
> !rq_mergeable() check but you're saying for this case in question we
> never get there -- due to the cloned request having NOMERGE set.
>
> So in blk_rq_check_limits() you've unrolled rq_mergeable() and
> open-coded the lone remaining check (rq->cmd_type != REQ_TYPE_FS)
>
> I agree that the (rq->cmd_flags & REQ_NOMERGE_FLAGS) check in
> the blk_insert_cloned_request() call-chain (via rq_mergeable()) makes no
> sense for cloned requests that always have NOMERGE set.
>
> So you're saying that by having blk_rq_check_limits() go on to call
> blk_recalc_rq_segments() this bug will be fixed?
>
That is the idea.

I've already established that in all instances I have seen so far
req->nr_phys_segments is _less_ than req->bio->bi_phys_segments.

As it turns out, req->nr_phys_segemnts _would_ have been updated in
blk_rq_check_limits(), but isn't due to the NOMERGE flag being set
for the cloned request.
So each cloned request inherits the values from the original request,
despite the fact that req->nr_phys_segments _has_ to be evaluated in
the final request_queue context, as the queue limits _might_ be 
different from the original (merged) queue limits of the multipath
request queue.

> BTW, I think blk_rq_check_limits()'s export should be removed and the
> function made static and renamed to blk_clone_rq_check_limits(), again:
> blk_insert_cloned_request() is the only caller of blk_rq_check_limits()
>
Actually, seeing Jens' last comment the check for REQ_TYPE_FS is 
pointless, too, so we might as well remove the entire if-clause.

> Seems prudent to make that change now to be clear that this code is only
> used by cloned requests.
>
Yeah, that would make sense. I'll be preparing a patch.
With a more detailed description :-)

Cheers,

Hannes

-- 
Dr. Hannes Reinecke		               zSeries & Storage
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

  reply	other threads:[~2015-11-25 19:02 UTC|newest]

Thread overview: 58+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-11-18  9:18 kernel BUG at drivers/scsi/scsi_lib.c:1096! Michael Ellerman
2015-11-18 11:06 ` Laurent Dufour
2015-11-18 11:10   ` Michael Ellerman
2015-11-18 11:17     ` Laurent Dufour
2015-11-18 14:03 ` Mark Salter
2015-11-19  1:02   ` Michael Ellerman
2015-11-19  8:23     ` Christoph Hellwig
2015-11-19 15:35       ` Hannes Reinecke
2015-11-19 15:35         ` Hannes Reinecke
2015-11-20 14:38         ` Ewan Milne
2015-11-20 14:55           ` Hannes Reinecke
2015-11-20 14:55             ` Hannes Reinecke
2015-11-20 15:28             ` Ewan Milne
2015-11-23  6:55               ` Hannes Reinecke
2015-11-23  6:55                 ` Hannes Reinecke
2015-11-25  9:04               ` Hannes Reinecke
2015-11-25 17:56                 ` Jens Axboe
2015-11-25 17:56                   ` Jens Axboe
2015-11-25 19:10                   ` Hannes Reinecke
2015-11-25 19:24                     ` Jens Axboe
2015-11-25 19:24                       ` Jens Axboe
2015-11-25 20:23                       ` Mike Snitzer
2015-11-25 21:20                         ` Mike Snitzer
2015-11-25 18:01                 ` Mike Snitzer
2015-11-25 19:01                   ` Hannes Reinecke [this message]
2015-11-25 19:01                     ` Hannes Reinecke
2015-12-04 16:59                     ` Takashi Iwai
2015-12-04 16:59                       ` Takashi Iwai
2015-12-04 17:02                       ` Jens Axboe
2015-12-04 17:02                         ` Jens Axboe
2015-12-04 17:09                         ` Takashi Iwai
2015-12-04 17:09                           ` Takashi Iwai
2015-11-20 12:10       ` Michael Ellerman
2015-11-20 12:10         ` Michael Ellerman
2015-11-20 12:56         ` Laurent Dufour
2015-11-20 13:37           ` Mark Salter
2015-11-21 11:30         ` Laurent Dufour
2015-11-21 11:30           ` Laurent Dufour
2015-11-21 16:56           ` Ming Lei
2015-11-21 16:56             ` Ming Lei
2015-11-22 23:20             ` Mark Salter
2015-11-23  0:36               ` Ming Lei
2015-11-23  1:50                 ` Mark Salter
2015-11-23  1:50                   ` Mark Salter
2015-11-23  2:46                   ` Ming Lei
2015-11-23 15:21                     ` Ming Lei
2015-11-23 15:21                       ` Ming Lei
2015-11-24 18:59                       ` Alan Ott
2015-11-24 18:59                         ` Alan Ott
2015-11-24 18:59                         ` Alan Ott
2015-11-24 18:59                         ` Alan Ott
2015-11-23 13:57               ` Laurent Dufour
2015-11-23 13:57                 ` Laurent Dufour
2015-11-23 15:13                 ` Pratyush Anand
2015-11-23 15:20                   ` Laurent Dufour
2015-11-23 15:27                     ` Ming Lei
2015-11-23 16:24                       ` Laurent Dufour
2015-11-24  1:30                       ` Mark Salter

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5656059B.9010102@suse.de \
    --to=hare@suse.de \
    --cc=JBottomley@odin.com \
    --cc=axboe@fb.com \
    --cc=brking@us.ibm.com \
    --cc=emilne@redhat.com \
    --cc=hch@infradead.org \
    --cc=j-nomura@ce.jp.nec.com \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-scsi@vger.kernel.org \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=mpe@ellerman.id.au \
    --cc=msalter@redhat.com \
    --cc=snitzer@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.