All of lore.kernel.org
 help / color / mirror / Atom feed
From: Martin Steigerwald <martin@lichtvoll.de>
To: Bart Van Assche <Bart.VanAssche@wdc.com>
Cc: "tj@kernel.org" <tj@kernel.org>,
	"linux-block@vger.kernel.org" <linux-block@vger.kernel.org>,
	"israelr@mellanox.com" <israelr@mellanox.com>,
	"sagi@grimberg.me" <sagi@grimberg.me>, "hch@lst.de" <hch@lst.de>,
	"stable@vger.kernel.org" <stable@vger.kernel.org>,
	"axboe@kernel.dk" <axboe@kernel.dk>,
	"ming.lei@redhat.com" <ming.lei@redhat.com>,
	"maxg@mellanox.com" <maxg@mellanox.com>
Subject: Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling
Date: Wed, 11 Apr 2018 20:38:48 +0200	[thread overview]
Message-ID: <11544331.gmjPRolTO7@merkaba> (raw)
In-Reply-To: <ac0d5904dbcd55e190df318a66a9d7d51c56f3ae.camel@wdc.com>

Bart Van Assche - 11.04.18, 14:50:
> On Tue, 2018-04-10 at 14:54 -0700, tj@kernel.org wrote:
> > Ah, yeah, I was moving it out of add_timer but forgot to actully add
> > it to the issue path.  Fixed patch below.
> >=20
> > BTW, no matter what we do w/ the request handover between normal and
> > timeout paths, we'd need something similar.  Otherwise, while we can
> > reduce the window, we can't get rid of it.
>=20
> (+Martin Steigerwald)
[=E2=80=A6]
> Thank you for having shared this patch. It looks interesting to me.
> What I know about the blk-mq timeout handling is as follows:
> * Nobody who has reviewed the blk-mq timeout handling code with this
> patch applied has reported any shortcomings for that code.
> * However, several people have reported kernel crashes that disappear
> when the blk-mq timeout code is reworked. I'm referring to "nvme-rdma
> corrupts memory upon timeout"
> =20
> (http://lists.infradead.org/pipermail/linux-nvme/2018-February/015848
> .html) and also to a "RIP: scsi_times_out+0x17" crash during boot
> (https://bugzilla.kernel.org/show_bug.cgi?id=3D199077).

Yes, with the three patches:

=2D '[PATCH] blk-mq_Directly schedule q->timeout_work when aborting a=20
request.mbox'

=2D '[PATCH v2] block: Change a rcu_read_{lock,unlock}_sched() pair into=20
rcu_read_{lock,unlock}().mbox'

=2D '[PATCH v4] blk-mq_Fix race conditions in request timeout=20
handling.mbox'

the occasional hangs on some boots / resumes from hibernation appear to=20
be gone.

Also it appears that the error loading SMART data issue is gone as well=20
(see my bug report). However it is still to early to say for sure. I=20
think I need at least 2-3 days of additional testing with this kernel to=20
be sufficiently sure about it.

However=E2=80=A6 I could also test another patch, but from reading the rest=
 of=20
this thread so far I have no clear on whether to try one of the new=20
patches and if so which one and whether adding it on top of some of the=20
patches I already applied or using it as a replacement of it.

So while doing a training this and next week I can apply a patch here=20
and then, but I won=C2=B4t have much time to read the complete discussion t=
o=20
figure out what to apply.

Personally as a stable kernel has been released with those issues, I=20
think its good to fix it up soon. On the other hand it may take quite=20
some time til popular distros carry 4.16 for regular users. And I have=20
no idea how frequent the reported issues are, i.e. how many users would=20
be affected.

Thanks,
=2D-=20
Martin

WARNING: multiple messages have this Message-ID (diff)
From: Martin Steigerwald <martin@lichtvoll.de>
To: Bart Van Assche <Bart.VanAssche@wdc.com>
Cc: "tj@kernel.org" <tj@kernel.org>,
	"linux-block@vger.kernel.org" <linux-block@vger.kernel.org>,
	"israelr@mellanox.com" <israelr@mellanox.com>,
	"sagi@grimberg.me" <sagi@grimberg.me>, "hch@lst.de" <hch@lst.de>,
	"stable@vger.kernel.org" <stable@vger.kernel.org>,
	"axboe@kernel.dk" <axboe@kernel.dk>,
	"ming.lei@redhat.com" <ming.lei@redhat.com>,
	"maxg@mellanox.com" <maxg@mellanox.com>
Subject: Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling
Date: Wed, 11 Apr 2018 20:38:48 +0200	[thread overview]
Message-ID: <11544331.gmjPRolTO7@merkaba> (raw)
In-Reply-To: <ac0d5904dbcd55e190df318a66a9d7d51c56f3ae.camel@wdc.com>

Bart Van Assche - 11.04.18, 14:50:
> On Tue, 2018-04-10 at 14:54 -0700, tj@kernel.org wrote:
> > Ah, yeah, I was moving it out of add_timer but forgot to actully add
> > it to the issue path.  Fixed patch below.
> > 
> > BTW, no matter what we do w/ the request handover between normal and
> > timeout paths, we'd need something similar.  Otherwise, while we can
> > reduce the window, we can't get rid of it.
> 
> (+Martin Steigerwald)
[…]
> Thank you for having shared this patch. It looks interesting to me.
> What I know about the blk-mq timeout handling is as follows:
> * Nobody who has reviewed the blk-mq timeout handling code with this
> patch applied has reported any shortcomings for that code.
> * However, several people have reported kernel crashes that disappear
> when the blk-mq timeout code is reworked. I'm referring to "nvme-rdma
> corrupts memory upon timeout"
>  
> (http://lists.infradead.org/pipermail/linux-nvme/2018-February/015848
> .html) and also to a "RIP: scsi_times_out+0x17" crash during boot
> (https://bugzilla.kernel.org/show_bug.cgi?id=199077).

Yes, with the three patches:

- '[PATCH] blk-mq_Directly schedule q->timeout_work when aborting a 
request.mbox'

- '[PATCH v2] block: Change a rcu_read_{lock,unlock}_sched() pair into 
rcu_read_{lock,unlock}().mbox'

- '[PATCH v4] blk-mq_Fix race conditions in request timeout 
handling.mbox'

the occasional hangs on some boots / resumes from hibernation appear to 
be gone.

Also it appears that the error loading SMART data issue is gone as well 
(see my bug report). However it is still to early to say for sure. I 
think I need at least 2-3 days of additional testing with this kernel to 
be sufficiently sure about it.

However… I could also test another patch, but from reading the rest of 
this thread so far I have no clear on whether to try one of the new 
patches and if so which one and whether adding it on top of some of the 
patches I already applied or using it as a replacement of it.

So while doing a training this and next week I can apply a patch here 
and then, but I won´t have much time to read the complete discussion to 
figure out what to apply.

Personally as a stable kernel has been released with those issues, I 
think its good to fix it up soon. On the other hand it may take quite 
some time til popular distros carry 4.16 for regular users. And I have 
no idea how frequent the reported issues are, i.e. how many users would 
be affected.

Thanks,
-- 
Martin

  parent reply	other threads:[~2018-04-11 18:38 UTC|newest]

Thread overview: 47+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-04-10  1:34 [PATCH v4] blk-mq: Fix race conditions in request timeout handling Bart Van Assche
2018-04-10  7:59 ` jianchao.wang
2018-04-10 10:04   ` Ming Lei
2018-04-10 12:04     ` Shan Hai
2018-04-10 13:01   ` Bart Van Assche
2018-04-10 13:01     ` Bart Van Assche
2018-04-10 14:32     ` jianchao.wang
2018-04-10  8:41 ` Ming Lei
2018-04-10 12:58   ` Bart Van Assche
2018-04-10 12:58     ` Bart Van Assche
2018-04-10 13:55     ` Ming Lei
2018-04-10 14:09       ` Bart Van Assche
2018-04-10 14:09         ` Bart Van Assche
2018-04-10 14:30         ` Ming Lei
2018-04-10 15:02           ` Bart Van Assche
2018-04-10 15:02             ` Bart Van Assche
2018-04-10 15:25             ` Ming Lei
2018-04-10 15:30               ` tj
2018-04-10 15:38                 ` Ming Lei
2018-04-10 15:40                   ` tj
2018-04-10 21:33                     ` tj
2018-04-10 21:46                       ` Bart Van Assche
2018-04-10 21:46                         ` Bart Van Assche
2018-04-10 21:54                         ` tj
2018-04-11 12:50                           ` Bart Van Assche
2018-04-11 12:50                             ` Bart Van Assche
2018-04-11 14:16                             ` tj
2018-04-11 18:38                             ` Martin Steigerwald [this message]
2018-04-11 18:38                               ` Martin Steigerwald
2018-04-11 14:24                           ` Sagi Grimberg
2018-04-11 14:43                             ` tj
2018-04-11 16:16                             ` Israel Rukshin
2018-04-11 17:07                               ` tj
2018-04-11 21:31                                 ` tj
2018-04-12  8:59                                   ` Israel Rukshin
2018-04-12 13:35                                     ` tj
2018-04-15 12:28                                       ` Israel Rukshin
2018-04-18 16:34                           ` Bart Van Assche
2018-04-10  9:55 ` Christoph Hellwig
2018-04-10 13:26   ` Bart Van Assche
2018-04-10 13:26     ` Bart Van Assche
2018-04-10 14:50     ` hch
2018-04-10 14:41   ` Jens Axboe
2018-04-10 14:20 ` Tejun Heo
2018-04-10 14:30   ` Bart Van Assche
2018-04-10 14:30     ` Bart Van Assche
2018-04-10 14:33     ` tj

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=11544331.gmjPRolTO7@merkaba \
    --to=martin@lichtvoll.de \
    --cc=Bart.VanAssche@wdc.com \
    --cc=axboe@kernel.dk \
    --cc=hch@lst.de \
    --cc=israelr@mellanox.com \
    --cc=linux-block@vger.kernel.org \
    --cc=maxg@mellanox.com \
    --cc=ming.lei@redhat.com \
    --cc=sagi@grimberg.me \
    --cc=stable@vger.kernel.org \
    --cc=tj@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.