All of lore.kernel.org
 help / color / mirror / Atom feed
* Disconnecting nvmet-rdma
@ 2016-10-18 22:02 Bart Van Assche
  2016-10-19 13:00 ` Christoph Hellwig
  0 siblings, 1 reply; 5+ messages in thread
From: Bart Van Assche @ 2016-10-18 22:02 UTC (permalink / raw)


Hello Christoph,

Without the patch below I can easily trigger a NULL pointer dereference in
nvmet_rdma_queue_disconnect(). However, I don't think that that patch is
correct. Can you have a look at this?

Thanks,

Bart.


BUG: unable to handle kernel NULL pointer dereference at 0000000000000200
IP: [<ffffffffa0621488>] nvmet_rdma_queue_disconnect+0x18/0x70 [nvmet_rdma]
Workqueue: ib_cm cm_work_handler [ib_cm]
Call Trace:
 [<ffffffffa0623219>] nvmet_rdma_cm_handler+0x309/0x592 [nvmet_rdma]
 [<ffffffffa0275461>] cma_ib_handler+0xa1/0x1f0 [rdma_cm]
 [<ffffffffa025daa0>] cm_process_work+0x20/0x100 [ib_cm]
 [<ffffffffa025e223>] cm_work_handler+0x6a3/0x1f68 [ib_cm]
 [<ffffffff81081485>] process_one_work+0x1f5/0x690
 [<ffffffff81081969>] worker_thread+0x49/0x490
 [<ffffffff8108817a>] kthread+0xea/0x100
 [<ffffffff8162cd7f>] ret_from_fork+0x1f/0x40


---
 drivers/nvme/target/rdma.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index d1aea17..a61e47f 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -1354,9 +1354,12 @@ static int nvmet_rdma_cm_handler(struct rdma_cm_id *cm_id,
 		break;
 	case RDMA_CM_EVENT_ADDR_CHANGE:
 	case RDMA_CM_EVENT_DISCONNECTED:
-	case RDMA_CM_EVENT_TIMEWAIT_EXIT:
 		nvmet_rdma_queue_disconnect(queue);
 		break;
+	case RDMA_CM_EVENT_TIMEWAIT_EXIT:
+		if (queue)
+			nvmet_rdma_queue_disconnect(queue);
+		break;
 	case RDMA_CM_EVENT_DEVICE_REMOVAL:
 		ret = nvmet_rdma_device_removal(cm_id, queue);
 		break;

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Disconnecting nvmet-rdma
  2016-10-18 22:02 Disconnecting nvmet-rdma Bart Van Assche
@ 2016-10-19 13:00 ` Christoph Hellwig
  2016-10-20  7:49   ` Sagi Grimberg
  2016-10-21 15:21   ` Bart Van Assche
  0 siblings, 2 replies; 5+ messages in thread
From: Christoph Hellwig @ 2016-10-19 13:00 UTC (permalink / raw)


On Tue, Oct 18, 2016@03:02:03PM -0700, Bart Van Assche wrote:
> Hello Christoph,
> 
> Without the patch below I can easily trigger a NULL pointer dereference in
> nvmet_rdma_queue_disconnect(). However, I don't think that that patch is
> correct. Can you have a look at this?

Hi Bart,

how do you reproduce the timedwait condition?  My RDMA test setup is
still being moved, so I can't reproduce it myself, but I'd like to know
for the future.

The only reason why I could see a NULL queue here is if RDMA/CM
also calls the timedwait exit handler for the listener CM ids,
in which case your patch would be correct.  Can you check for that
theory by printing the cm_id address in nvmet_rdma_add_port and in
nvmet_rdma_cm_handler?

Also is there any chance you could try your reproducer with the iSER target
as well?  It also seems to blindly derference the queue.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Disconnecting nvmet-rdma
  2016-10-19 13:00 ` Christoph Hellwig
@ 2016-10-20  7:49   ` Sagi Grimberg
  2016-10-21 13:39     ` Christoph Hellwig
  2016-10-21 15:21   ` Bart Van Assche
  1 sibling, 1 reply; 5+ messages in thread
From: Sagi Grimberg @ 2016-10-20  7:49 UTC (permalink / raw)



> how do you reproduce the timedwait condition?  My RDMA test setup is
> still being moved, so I can't reproduce it myself, but I'd like to know
> for the future.

+1 on that, never was able to step on this.

> The only reason why I could see a NULL queue here is if RDMA/CM
> also calls the timedwait exit handler for the listener CM ids,
> in which case your patch would be correct.  Can you check for that
> theory by printing the cm_id address in nvmet_rdma_add_port and in
> nvmet_rdma_cm_handler?

Where do you see indication for that in the code? TIMEWAIT doesn't make
sense for listener cm_ids. The CM enters timewait (starts a timer) when:
- sends a disconnect request
- sends a disconnect reply
- sends a connect reject
- received a connect reject

Non of those happen with the listener cm_id. Maybe I'm missing
something?

Note that the cm_id->qp is NULL which means that it was
never created (destroy doesn't nullify it).

 From looking at the code there are two flows that can trigger this:
- we failed nvmet_rdma_queue_connect() but didn't destroy
   the cm_id -> which triggered this event later (but I can't find
   indication of that in the code)

- Something in the CM spaghetti triggered this after we accepted
   but the client rejected us for some reason (although I think we should
   have seen UNREACHABLE event...

Or something else...

Bart, more information on what happened exactly (and how) would help
here.

> Also is there any chance you could try your reproducer with the iSER target
> as well?  It also seems to blindly derference the queue.

It probably will happen too...

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Disconnecting nvmet-rdma
  2016-10-20  7:49   ` Sagi Grimberg
@ 2016-10-21 13:39     ` Christoph Hellwig
  0 siblings, 0 replies; 5+ messages in thread
From: Christoph Hellwig @ 2016-10-21 13:39 UTC (permalink / raw)


On Thu, Oct 20, 2016@10:49:43AM +0300, Sagi Grimberg wrote:
> Where do you see indication for that in the code? TIMEWAIT doesn't make
> sense for listener cm_ids.

Nowhere, this was just a wild guess given the pain we had with these
listener IDs before.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Disconnecting nvmet-rdma
  2016-10-19 13:00 ` Christoph Hellwig
  2016-10-20  7:49   ` Sagi Grimberg
@ 2016-10-21 15:21   ` Bart Van Assche
  1 sibling, 0 replies; 5+ messages in thread
From: Bart Van Assche @ 2016-10-21 15:21 UTC (permalink / raw)


On Wed, 2016-10-19@15:00 +0200, Christoph Hellwig wrote:
> On Tue, Oct 18, 2016@03:02:03PM -0700, Bart Van Assche wrote:
> > Without the patch below I can easily trigger a NULL pointer
> > dereference in
> > nvmet_rdma_queue_disconnect(). However, I don't think that that
> > patch is
> > correct. Can you have a look at this?
> 
> how do you reproduce the timedwait condition???My RDMA test setup is
> still being moved, so I can't reproduce it myself, but I'd like to
> know for the future.

Hello Christoph and Sagi,

This issue is something I could reproduce easily about ten days ago.
However, when I removed the "if (queue)" change from my tree I could
not reproduce this issue with the current version of my NVMeOF test
script. I will try to figure out whether I can trigger this by going
back to a previous version of my test script.

Bart.

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2016-10-21 15:21 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-10-18 22:02 Disconnecting nvmet-rdma Bart Van Assche
2016-10-19 13:00 ` Christoph Hellwig
2016-10-20  7:49   ` Sagi Grimberg
2016-10-21 13:39     ` Christoph Hellwig
2016-10-21 15:21   ` Bart Van Assche

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.