* Disconnecting nvmet-rdma
@ 2016-10-18 22:02 Bart Van Assche
2016-10-19 13:00 ` Christoph Hellwig
0 siblings, 1 reply; 5+ messages in thread
From: Bart Van Assche @ 2016-10-18 22:02 UTC (permalink / raw)
Hello Christoph,
Without the patch below I can easily trigger a NULL pointer dereference in
nvmet_rdma_queue_disconnect(). However, I don't think that that patch is
correct. Can you have a look at this?
Thanks,
Bart.
BUG: unable to handle kernel NULL pointer dereference at 0000000000000200
IP: [<ffffffffa0621488>] nvmet_rdma_queue_disconnect+0x18/0x70 [nvmet_rdma]
Workqueue: ib_cm cm_work_handler [ib_cm]
Call Trace:
[<ffffffffa0623219>] nvmet_rdma_cm_handler+0x309/0x592 [nvmet_rdma]
[<ffffffffa0275461>] cma_ib_handler+0xa1/0x1f0 [rdma_cm]
[<ffffffffa025daa0>] cm_process_work+0x20/0x100 [ib_cm]
[<ffffffffa025e223>] cm_work_handler+0x6a3/0x1f68 [ib_cm]
[<ffffffff81081485>] process_one_work+0x1f5/0x690
[<ffffffff81081969>] worker_thread+0x49/0x490
[<ffffffff8108817a>] kthread+0xea/0x100
[<ffffffff8162cd7f>] ret_from_fork+0x1f/0x40
---
drivers/nvme/target/rdma.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index d1aea17..a61e47f 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -1354,9 +1354,12 @@ static int nvmet_rdma_cm_handler(struct rdma_cm_id *cm_id,
break;
case RDMA_CM_EVENT_ADDR_CHANGE:
case RDMA_CM_EVENT_DISCONNECTED:
- case RDMA_CM_EVENT_TIMEWAIT_EXIT:
nvmet_rdma_queue_disconnect(queue);
break;
+ case RDMA_CM_EVENT_TIMEWAIT_EXIT:
+ if (queue)
+ nvmet_rdma_queue_disconnect(queue);
+ break;
case RDMA_CM_EVENT_DEVICE_REMOVAL:
ret = nvmet_rdma_device_removal(cm_id, queue);
break;
^ permalink raw reply related [flat|nested] 5+ messages in thread
* Disconnecting nvmet-rdma
2016-10-18 22:02 Disconnecting nvmet-rdma Bart Van Assche
@ 2016-10-19 13:00 ` Christoph Hellwig
2016-10-20 7:49 ` Sagi Grimberg
2016-10-21 15:21 ` Bart Van Assche
0 siblings, 2 replies; 5+ messages in thread
From: Christoph Hellwig @ 2016-10-19 13:00 UTC (permalink / raw)
On Tue, Oct 18, 2016@03:02:03PM -0700, Bart Van Assche wrote:
> Hello Christoph,
>
> Without the patch below I can easily trigger a NULL pointer dereference in
> nvmet_rdma_queue_disconnect(). However, I don't think that that patch is
> correct. Can you have a look at this?
Hi Bart,
how do you reproduce the timedwait condition? My RDMA test setup is
still being moved, so I can't reproduce it myself, but I'd like to know
for the future.
The only reason why I could see a NULL queue here is if RDMA/CM
also calls the timedwait exit handler for the listener CM ids,
in which case your patch would be correct. Can you check for that
theory by printing the cm_id address in nvmet_rdma_add_port and in
nvmet_rdma_cm_handler?
Also is there any chance you could try your reproducer with the iSER target
as well? It also seems to blindly derference the queue.
^ permalink raw reply [flat|nested] 5+ messages in thread
* Disconnecting nvmet-rdma
2016-10-19 13:00 ` Christoph Hellwig
@ 2016-10-20 7:49 ` Sagi Grimberg
2016-10-21 13:39 ` Christoph Hellwig
2016-10-21 15:21 ` Bart Van Assche
1 sibling, 1 reply; 5+ messages in thread
From: Sagi Grimberg @ 2016-10-20 7:49 UTC (permalink / raw)
> how do you reproduce the timedwait condition? My RDMA test setup is
> still being moved, so I can't reproduce it myself, but I'd like to know
> for the future.
+1 on that, never was able to step on this.
> The only reason why I could see a NULL queue here is if RDMA/CM
> also calls the timedwait exit handler for the listener CM ids,
> in which case your patch would be correct. Can you check for that
> theory by printing the cm_id address in nvmet_rdma_add_port and in
> nvmet_rdma_cm_handler?
Where do you see indication for that in the code? TIMEWAIT doesn't make
sense for listener cm_ids. The CM enters timewait (starts a timer) when:
- sends a disconnect request
- sends a disconnect reply
- sends a connect reject
- received a connect reject
Non of those happen with the listener cm_id. Maybe I'm missing
something?
Note that the cm_id->qp is NULL which means that it was
never created (destroy doesn't nullify it).
From looking at the code there are two flows that can trigger this:
- we failed nvmet_rdma_queue_connect() but didn't destroy
the cm_id -> which triggered this event later (but I can't find
indication of that in the code)
- Something in the CM spaghetti triggered this after we accepted
but the client rejected us for some reason (although I think we should
have seen UNREACHABLE event...
Or something else...
Bart, more information on what happened exactly (and how) would help
here.
> Also is there any chance you could try your reproducer with the iSER target
> as well? It also seems to blindly derference the queue.
It probably will happen too...
^ permalink raw reply [flat|nested] 5+ messages in thread
* Disconnecting nvmet-rdma
2016-10-20 7:49 ` Sagi Grimberg
@ 2016-10-21 13:39 ` Christoph Hellwig
0 siblings, 0 replies; 5+ messages in thread
From: Christoph Hellwig @ 2016-10-21 13:39 UTC (permalink / raw)
On Thu, Oct 20, 2016@10:49:43AM +0300, Sagi Grimberg wrote:
> Where do you see indication for that in the code? TIMEWAIT doesn't make
> sense for listener cm_ids.
Nowhere, this was just a wild guess given the pain we had with these
listener IDs before.
^ permalink raw reply [flat|nested] 5+ messages in thread
* Disconnecting nvmet-rdma
2016-10-19 13:00 ` Christoph Hellwig
2016-10-20 7:49 ` Sagi Grimberg
@ 2016-10-21 15:21 ` Bart Van Assche
1 sibling, 0 replies; 5+ messages in thread
From: Bart Van Assche @ 2016-10-21 15:21 UTC (permalink / raw)
On Wed, 2016-10-19@15:00 +0200, Christoph Hellwig wrote:
> On Tue, Oct 18, 2016@03:02:03PM -0700, Bart Van Assche wrote:
> > Without the patch below I can easily trigger a NULL pointer
> > dereference in
> > nvmet_rdma_queue_disconnect(). However, I don't think that that
> > patch is
> > correct. Can you have a look at this?
>
> how do you reproduce the timedwait condition???My RDMA test setup is
> still being moved, so I can't reproduce it myself, but I'd like to
> know for the future.
Hello Christoph and Sagi,
This issue is something I could reproduce easily about ten days ago.
However, when I removed the "if (queue)" change from my tree I could
not reproduce this issue with the current version of my NVMeOF test
script. I will try to figure out whether I can trigger this by going
back to a previous version of my test script.
Bart.
^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2016-10-21 15:21 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-10-18 22:02 Disconnecting nvmet-rdma Bart Van Assche
2016-10-19 13:00 ` Christoph Hellwig
2016-10-20 7:49 ` Sagi Grimberg
2016-10-21 13:39 ` Christoph Hellwig
2016-10-21 15:21 ` Bart Van Assche
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.