From: Ming Lei <tom.leiming@gmail.com>
To: Keith Busch <keith.busch@linux.intel.com>
Cc: Ming Lei <ming.lei@redhat.com>, Jens Axboe <axboe@kernel.dk>,
	linux-block <linux-block@vger.kernel.org>,
	Sagi Grimberg <sagi@grimberg.me>,
	linux-nvme <linux-nvme@lists.infradead.org>,
	Keith Busch <keith.busch@intel.com>,
	Jianchao Wang <jianchao.w.wang@oracle.com>,
	Christoph Hellwig <hch@lst.de>
Subject: Re: [PATCH 1/2] nvme: pci: simplify timeout handling
Date: Sun, 29 Apr 2018 05:39:52 +0800	[thread overview]
Message-ID: <CACVXFVMfvLLt7h58W-ifJKbC-s_1oDhCJEx4AqD6wJ+w2AjnYg@mail.gmail.com> (raw)
In-Reply-To: <20180428133514.GB5938@localhost.localdomain>

On Sat, Apr 28, 2018 at 9:35 PM, Keith Busch
<keith.busch@linux.intel.com> wrote:
> On Sat, Apr 28, 2018 at 11:50:17AM +0800, Ming Lei wrote:
>> > I understand how the problems are happening a bit better now. It used
>> > to be that blk-mq would lock an expired command one at a time, so when
>> > we had a batch of IO timeouts, the driver was able to complete all of
>> > them inside a single IO timeout handler.
>> >
>> > That's not the case anymore, so the driver is called for every IO
>> > timeout even if it reaped all the commands at once.
>>
>> Actually that wasn't the case before either; even in the legacy path,
>> one .timeout() handles only one request.
>
> That's not quite what I was talking about.
>
> Before, only the command that was about to be sent to the driver's
> .timeout() was marked completed. The driver could (and did) complete
> other timed out commands in a single .timeout(), and the tag would
> clear, so we could handle all timeouts in a single .timeout().
>
> Now, blk-mq marks all timed out commands as aborted prior to calling
> the driver's .timeout(). If the driver completes any of those commands,
> the tag does not clear, so the driver's .timeout() just gets to be called
> again for commands it already reaped.

That won't happen, because the new timeout model marks the timed-out
requests as aborted first, then runs synchronize_rcu() before actually
expiring them, and the RCU read lock is now held in the normal
completion handler (blk_mq_complete_request()).
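
Roughly, the ordering I mean looks like the sketch below. This is not
the actual blk-mq code; the helper names in it are made up for
illustration, only the RCU primitives are real:

/*
 * Illustrative sketch only -- not real blk-mq code. The helpers
 * for_each_timed_out_request(), mark_aborted(), request_aborted(),
 * __complete_request() and expire_request() are invented names; only
 * the RCU primitives are real.
 */

/* normal completion path: runs under the RCU read lock */
void complete_request(struct request *rq)
{
        rcu_read_lock();
        if (!request_aborted(rq))       /* timeout path owns aborted requests */
                __complete_request(rq);
        rcu_read_unlock();
}

/* timeout path */
void handle_timeouts(struct request_queue *q)
{
        struct request *rq;

        /* 1) mark every timed-out request as aborted */
        for_each_timed_out_request(q, rq)
                mark_aborted(rq);

        /* 2) wait for any completion that raced with step 1) to finish */
        synchronize_rcu();

        /* 3) only now hand the still-aborted requests to ->timeout() */
        for_each_timed_out_request(q, rq)
                if (request_aborted(rq))
                        expire_request(rq);     /* ends up in the driver's ->timeout() */
}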

Yes, Bart is working towards that approach, but the same race between
the timeout handler (nvme_dev_disable()) and reset_work() still exists,
and nothing changes with respect to the timeout model:

- reset may take a while to complete because of nvme_wait_freeze(), and
a timeout can fire while the reset is in progress, in which case the
reset may hang forever. Even without nvme_wait_freeze(), a timeout can
in theory still happen during reset work.

Actually, for the non-shutdown case it isn't necessary to freeze the
queues at all; quiescing them is enough to keep the hardware happy for
recovery. That has already been part of my V2 patchset.
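
In other words, for recovery something along the lines of the sketch
below should be enough. It is a simplified illustration, not the exact
V2 code, and recover_controller() is a made-up name:

/*
 * Simplified sketch, not the exact V2 code: quiescing only stops new
 * submissions to the hardware, while freezing additionally waits for
 * all in-flight requests to drain -- and that drain is what can block
 * forever when timeouts keep firing during reset.
 */
static void recover_controller(struct nvme_dev *dev, bool shutdown)
{
        if (shutdown) {
                /* a real shutdown does need everything drained */
                nvme_start_freeze(&dev->ctrl);
                nvme_wait_freeze(&dev->ctrl);
        } else {
                /* recovery only needs the hardware left alone: quiesce, no drain */
                nvme_stop_queues(&dev->ctrl);
        }

        nvme_dev_disable(dev, shutdown);
        /* ... reset / reinit the controller, then unquiesce or unfreeze ... */
}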

But it is really simple and clean to run the recovery (nvme_dev_disable()
and reset_work()) in one single context to avoid this race, and the two
parts should be done together to make our life easier; that is why I was
trying to work in this direction.
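
Concretely, I mean something like the sketch below. It is illustrative
only, not a patch: the eh_work field and nvme_eh_work(),
nvme_reset_dev() and nvme_dev_from_req() are names invented for the
sketch:

/*
 * Illustrative only: one EH context that does both the disable and the
 * reset, so ->timeout() never races with reset_work.
 */
static void nvme_eh_work(struct work_struct *work)
{
        struct nvme_dev *dev = container_of(work, struct nvme_dev, eh_work);

        nvme_dev_disable(dev, false);   /* quiet the hardware first */
        nvme_reset_dev(dev);            /* then bring it back, in the same context */
}

/* ->timeout() just kicks the single EH context and returns */
static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
{
        struct nvme_dev *dev = nvme_dev_from_req(req);  /* made-up helper */

        queue_work(nvme_wq, &dev->eh_work);
        return BLK_EH_RESET_TIMER;      /* the EH context will deal with the request */
}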

Thanks,
Ming

Thread overview: 72+ messages
2018-04-26 12:39 [PATCH 0/2] nvme: pci: fix & improve timeout handling Ming Lei
2018-04-26 12:39 ` [PATCH 1/2] nvme: pci: simplify " Ming Lei
2018-04-26 15:07   ` jianchao.wang
2018-04-26 15:57     ` Ming Lei
2018-04-26 16:16       ` Ming Lei
2018-04-27  1:37       ` jianchao.wang
2018-04-27 14:57         ` Ming Lei
2018-04-28 14:00           ` jianchao.wang
2018-04-28 21:57             ` Ming Lei
2018-04-28 22:27               ` Ming Lei
2018-04-29  1:36                 ` Ming Lei
2018-04-29  2:21                   ` jianchao.wang
2018-04-29 14:13                     ` Ming Lei
2018-04-27 17:51   ` Keith Busch
2018-04-28  3:50     ` Ming Lei
2018-04-28 13:35       ` Keith Busch
2018-04-28 14:31         ` jianchao.wang
2018-04-28 21:39         ` Ming Lei [this message]
2018-04-30 19:52           ` Keith Busch
2018-04-30 23:14             ` Ming Lei
2018-05-08 15:30       ` Keith Busch
2018-05-10 20:52         ` Ming Lei
2018-05-10 21:05           ` Keith Busch
2018-05-10 21:10             ` Ming Lei
2018-05-10 21:18               ` Keith Busch
2018-05-10 21:24                 ` Ming Lei
2018-05-10 21:44                   ` Keith Busch
2018-05-10 21:50                     ` Ming Lei
2018-05-10 21:53                     ` Ming Lei
2018-05-10 22:03                 ` Ming Lei
2018-05-10 22:43                   ` Keith Busch
2018-05-11  0:14                     ` Ming Lei
2018-05-11  2:10             ` Ming Lei
2018-04-26 12:39 ` [PATCH 2/2] nvme: pci: guarantee EH can make progress Ming Lei
2018-04-26 16:24   ` Keith Busch
2018-04-28  3:28     ` Ming Lei