linux-block.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* connect cmd error for nvme-rdma with eventual kernel crash
@ 2017-03-01  0:57 Parav Pandit
  2017-03-01  1:26 ` Jens Axboe
  0 siblings, 1 reply; 5+ messages in thread
From: Parav Pandit @ 2017-03-01  0:57 UTC (permalink / raw)
  To: axboe; +Cc: linux-block, Sagi Grimberg, Christoph Hellwig, Max Gurtovoy

Hi Jens,

With your commit 2af8cbe30531eca73c8f3ba277f155fc0020b01a in linux-block gi=
t tree,
There are two requests tables. Static and dynamic of same size.
However function blk_mq_tag_to_rq() always try to get the tag from the dyna=
mic table which doesn't seem to be always initialized.

I am running nvme-rdma initiator and it fails to find the request for the g=
iven tag when command completes.
Command triggers error recovery with "tag not found" error.
Eventually kernel is crashing in blk_mq_queue_tag_busy_iter() with NULL poi=
nter. Seems to be additional bug in error recovery.

To debug, I added initializing dynamic tags as well.

blk_mq_alloc_rqs() {
			tags->static_rqs[i] =3D rq;
+			tags->rqs[i] =3D rq;

This appears to resolve the issue. But that's not the fix.
It appears to me that nvme stack is broken in certain conditions with recen=
t static and dynamic rq tables change.

Can you please confirm?

Regards,
Parav Pandit

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: connect cmd error for nvme-rdma with eventual kernel crash
  2017-03-01  0:57 connect cmd error for nvme-rdma with eventual kernel crash Parav Pandit
@ 2017-03-01  1:26 ` Jens Axboe
  2017-03-01  4:55   ` Parav Pandit
  0 siblings, 1 reply; 5+ messages in thread
From: Jens Axboe @ 2017-03-01  1:26 UTC (permalink / raw)
  To: Parav Pandit
  Cc: axboe, linux-block, Sagi Grimberg, Christoph Hellwig, Max Gurtovoy


> On Feb 28, 2017, at 5:57 PM, Parav Pandit <parav@mellanox.com> wrote:
>=20
> Hi Jens,
>=20
> With your commit 2af8cbe30531eca73c8f3ba277f155fc0020b01a in linux-block g=
it tree,
> There are two requests tables. Static and dynamic of same size.
> However function blk_mq_tag_to_rq() always try to get the tag from the dyn=
amic table which doesn't seem to be always initialized.
>=20
> I am running nvme-rdma initiator and it fails to find the request for the g=
iven tag when command completes.
> Command triggers error recovery with "tag not found" error.
> Eventually kernel is crashing in blk_mq_queue_tag_busy_iter() with NULL po=
inter. Seems to be additional bug in error recovery.
>=20
> To debug, I added initializing dynamic tags as well.
>=20
> blk_mq_alloc_rqs() {
>            tags->static_rqs[i] =3D rq;
> +            tags->rqs[i] =3D rq;
>=20
> This appears to resolve the issue. But that's not the fix.
> It appears to me that nvme stack is broken in certain conditions with rece=
nt static and dynamic rq tables change.

Can you try my for-linus branch?

^ permalink raw reply	[flat|nested] 5+ messages in thread

* RE: connect cmd error for nvme-rdma with eventual kernel crash
  2017-03-01  1:26 ` Jens Axboe
@ 2017-03-01  4:55   ` Parav Pandit
  2017-03-01  5:50     ` Omar Sandoval
  0 siblings, 1 reply; 5+ messages in thread
From: Parav Pandit @ 2017-03-01  4:55 UTC (permalink / raw)
  To: Jens Axboe
  Cc: axboe, linux-block, Sagi Grimberg, Christoph Hellwig, Max Gurtovoy

Hi Jens,

> -----Original Message-----
> From: Jens Axboe [mailto:axboe@kernel.dk]
> Subject: Re: connect cmd error for nvme-rdma with eventual kernel crash
>=20
> > On Feb 28, 2017, at 5:57 PM, Parav Pandit <parav@mellanox.com> wrote:
> >
> > Hi Jens,
> >
> > With your commit 2af8cbe30531eca73c8f3ba277f155fc0020b01a in
> > linux-block git tree, There are two requests tables. Static and dynamic=
 of
> same size.
> > However function blk_mq_tag_to_rq() always try to get the tag from the
> dynamic table which doesn't seem to be always initialized.
> >
> > I am running nvme-rdma initiator and it fails to find the request for t=
he
> given tag when command completes.
> > Command triggers error recovery with "tag not found" error.
> > Eventually kernel is crashing in blk_mq_queue_tag_busy_iter() with NULL
> pointer. Seems to be additional bug in error recovery.
> >
> > To debug, I added initializing dynamic tags as well.
> >
> > blk_mq_alloc_rqs() {
> >            tags->static_rqs[i] =3D rq;
> > +            tags->rqs[i] =3D rq;
> >
> > This appears to resolve the issue. But that's not the fix.
> > It appears to me that nvme stack is broken in certain conditions with r=
ecent
> static and dynamic rq tables change.
>=20
> Can you try my for-linus branch?

I tried for-linus branch and it works.

Seems like ac6e0c2d633ab0411810fe6b15a40808309041db fixes it.
__blk_mq_alloc_request()=20
data->hctx->tags->rqs[rq->tag] =3D rq;

Commit says no functional difference but it is actually fixing this issue.

Parav

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: connect cmd error for nvme-rdma with eventual kernel crash
  2017-03-01  4:55   ` Parav Pandit
@ 2017-03-01  5:50     ` Omar Sandoval
  2017-03-01 15:06       ` Parav Pandit
  0 siblings, 1 reply; 5+ messages in thread
From: Omar Sandoval @ 2017-03-01  5:50 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jens Axboe, axboe, linux-block, Sagi Grimberg, Christoph Hellwig,
	Max Gurtovoy

On Wed, Mar 01, 2017 at 04:55:23AM +0000, Parav Pandit wrote:
> Hi Jens,
> 
> > -----Original Message-----
> > From: Jens Axboe [mailto:axboe@kernel.dk]
> > Subject: Re: connect cmd error for nvme-rdma with eventual kernel crash
> > 
> > > On Feb 28, 2017, at 5:57 PM, Parav Pandit <parav@mellanox.com> wrote:
> > >
> > > Hi Jens,
> > >
> > > With your commit 2af8cbe30531eca73c8f3ba277f155fc0020b01a in
> > > linux-block git tree, There are two requests tables. Static and dynamic of
> > same size.
> > > However function blk_mq_tag_to_rq() always try to get the tag from the
> > dynamic table which doesn't seem to be always initialized.
> > >
> > > I am running nvme-rdma initiator and it fails to find the request for the
> > given tag when command completes.
> > > Command triggers error recovery with "tag not found" error.
> > > Eventually kernel is crashing in blk_mq_queue_tag_busy_iter() with NULL
> > pointer. Seems to be additional bug in error recovery.
> > >
> > > To debug, I added initializing dynamic tags as well.
> > >
> > > blk_mq_alloc_rqs() {
> > >            tags->static_rqs[i] = rq;
> > > +            tags->rqs[i] = rq;
> > >
> > > This appears to resolve the issue. But that's not the fix.
> > > It appears to me that nvme stack is broken in certain conditions with recent
> > static and dynamic rq tables change.
> > 
> > Can you try my for-linus branch?
> 
> I tried for-linus branch and it works.
> 
> Seems like ac6e0c2d633ab0411810fe6b15a40808309041db fixes it.
> __blk_mq_alloc_request() 
> data->hctx->tags->rqs[rq->tag] = rq;
> 
> Commit says no functional difference but it is actually fixing this issue.
> 
> Parav

The fix is actually this one:

commit f867f4804d55adeef42c68c89edad49cdf3058f7
Author: Omar Sandoval <osandov@fb.com>
Date:   Mon Feb 27 10:28:27 2017 -0800

    blk-mq: make blk_mq_alloc_request_hctx() allocate a scheduler request

    blk_mq_alloc_request_hctx() allocates a driver request directly, unlike
    its blk_mq_alloc_request() counterpart. It also crashes because it
    doesn't update the tags->rqs map.

    Fix it by making it allocate a scheduler request.

    Reported-by: Sagi Grimberg <sagi@grimberg.me>
    Signed-off-by: Omar Sandoval <osandov@fb.com>
    Signed-off-by: Jens Axboe <axboe@fb.com>
    Tested-by: Sagi Grimberg <sagi@grimberg.me>

The commit you pointed out would have also fixed it, but after this
change it's a no-op.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* RE: connect cmd error for nvme-rdma with eventual kernel crash
  2017-03-01  5:50     ` Omar Sandoval
@ 2017-03-01 15:06       ` Parav Pandit
  0 siblings, 0 replies; 5+ messages in thread
From: Parav Pandit @ 2017-03-01 15:06 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: Jens Axboe, axboe, linux-block, Sagi Grimberg, Christoph Hellwig,
	Max Gurtovoy

Thanks Omar.

> -----Original Message-----
> From: Omar Sandoval [mailto:osandov@osandov.com]
> Sent: Tuesday, February 28, 2017 11:51 PM
> To: Parav Pandit <parav@mellanox.com>
> Cc: Jens Axboe <axboe@kernel.dk>; axboe@fb.com; linux-
> block@vger.kernel.org; Sagi Grimberg <sagi@grimberg.me>; Christoph
> Hellwig <hch@lst.de>; Max Gurtovoy <maxg@mellanox.com>
> Subject: Re: connect cmd error for nvme-rdma with eventual kernel crash
>=20
> On Wed, Mar 01, 2017 at 04:55:23AM +0000, Parav Pandit wrote:
> > Hi Jens,
> >
> > > -----Original Message-----
> > > From: Jens Axboe [mailto:axboe@kernel.dk]
> > > Subject: Re: connect cmd error for nvme-rdma with eventual kernel
> > > crash
> > >
> > > > On Feb 28, 2017, at 5:57 PM, Parav Pandit <parav@mellanox.com>
> wrote:
> > > >
> > > > Hi Jens,
> > > >
> > > > With your commit 2af8cbe30531eca73c8f3ba277f155fc0020b01a in
> > > > linux-block git tree, There are two requests tables. Static and
> > > > dynamic of
> > > same size.
> > > > However function blk_mq_tag_to_rq() always try to get the tag from
> > > > the
> > > dynamic table which doesn't seem to be always initialized.
> > > >
> > > > I am running nvme-rdma initiator and it fails to find the request
> > > > for the
> > > given tag when command completes.
> > > > Command triggers error recovery with "tag not found" error.
> > > > Eventually kernel is crashing in blk_mq_queue_tag_busy_iter() with
> > > > NULL
> > > pointer. Seems to be additional bug in error recovery.
> > > >
> > > > To debug, I added initializing dynamic tags as well.
> > > >
> > > > blk_mq_alloc_rqs() {
> > > >            tags->static_rqs[i] =3D rq;
> > > > +            tags->rqs[i] =3D rq;
> > > >
> > > > This appears to resolve the issue. But that's not the fix.
> > > > It appears to me that nvme stack is broken in certain conditions
> > > > with recent
> > > static and dynamic rq tables change.
> > >
> > > Can you try my for-linus branch?
> >
> > I tried for-linus branch and it works.
> >
> > Seems like ac6e0c2d633ab0411810fe6b15a40808309041db fixes it.
> > __blk_mq_alloc_request()
> > data->hctx->tags->rqs[rq->tag] =3D rq;
> >
> > Commit says no functional difference but it is actually fixing this iss=
ue.
> >
> > Parav
>=20
> The fix is actually this one:
>=20
> commit f867f4804d55adeef42c68c89edad49cdf3058f7
> Author: Omar Sandoval <osandov@fb.com>
> Date:   Mon Feb 27 10:28:27 2017 -0800
>=20
>     blk-mq: make blk_mq_alloc_request_hctx() allocate a scheduler request
>=20
>     blk_mq_alloc_request_hctx() allocates a driver request directly, unli=
ke
>     its blk_mq_alloc_request() counterpart. It also crashes because it
>     doesn't update the tags->rqs map.
>=20
>     Fix it by making it allocate a scheduler request.
>=20
>     Reported-by: Sagi Grimberg <sagi@grimberg.me>
>     Signed-off-by: Omar Sandoval <osandov@fb.com>
>     Signed-off-by: Jens Axboe <axboe@fb.com>
>     Tested-by: Sagi Grimberg <sagi@grimberg.me>
>=20
> The commit you pointed out would have also fixed it, but after this chang=
e
> it's a no-op.

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2017-03-01 15:06 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-03-01  0:57 connect cmd error for nvme-rdma with eventual kernel crash Parav Pandit
2017-03-01  1:26 ` Jens Axboe
2017-03-01  4:55   ` Parav Pandit
2017-03-01  5:50     ` Omar Sandoval
2017-03-01 15:06       ` Parav Pandit

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).