* nvmet: race condition while CQE are getting processed concurrently with the DISCONNECTED event
       [not found] <CY1PR12MB077418C35F6EA2499CBACA45BC2C0@CY1PR12MB0774.namprd12.prod.outlook.com>
@ 2017-03-07 13:33 ` Sagi Grimberg
  2017-03-08 15:46   ` Christoph Hellwig
  0 siblings, 1 reply; 8+ messages in thread
From: Sagi Grimberg @ 2017-03-07 13:33 UTC (permalink / raw)


> Hello Sagi/Christoph,

Hi Raju,

CCing linux-nvme because I think I read a similar bug
report from Yi Zhang.

> Could you please help me with the below issue...
> I'm seeing a race condition when the CQEs are getting processed concurrently with the disconnected event. This ends up with a NULL pointer dereference.
> The issue is seen when the interface is brought down while NVMe I/O is in progress.
>
> Below is the race that is seen between two threads....
>
>       Thread 1:                                          Thread 2:
>
> -> RDMA_CM_EVENT_DISCONNECTED event is received
> -> nvmet_rdma_queue_disconnect() gets called
>                                                          -> nvmet_rdma_recv_done() gets called
>       -> rdma_disconnect() and
>       -> schedule_work(&queue->release_work);
>                                                          -> nvmet_rdma_put_rsp()
> -> nvmet_rdma_release_queue_work() gets called
>       -> nvmet_rdma_free_queue() is called
>             -> nvmet_sq_destroy()
>             -> nvmet_rdma_destroy_queue_ib()
>                   -> ib_drain_qp(queue->cm_id->qp);
>                   -> rdma_destroy_qp(queue->cm_id);
>                   -> ib_free_cq(queue->cq);
>             -> nvmet_rdma_free_rsps(queue);
>                   -> nvmet_rdma_free_rsp(ndev, rsp)
>                                                          -> spin_lock_irqsave(&rsp->queue->rsps_lock, flags)
>                                                             BUG: unable to handle kernel NULL pointer dereference at 00000000000000b8
>                                                             IP: _raw_spin_lock_irqsave+0x18/0x40
>             -> kfree(queue)
> ------
>
> I would be happy to provide any details. I could also share the reproduction steps if you are interested.

Hmm, that's interesting,

So what I don't understand is how we get a successful completion
after all of the below have finished successfully:
1. rdma_disconnect
2. ib_drain_qp
3. rdma_destroy_qp
4. ib_free_cq (which flushes the cq worker)

Only then do we free the responses array and the queue itself.

Looking at the code, there are two possible sources for this that
I can think of:

1. nvmet_sq_destroy is not doing its job of completing all
    its inflight requests, although we do wait for the final
    ref on the nvmet_sq to drop to zero.

from percpu-refcount.h:
/**
  * percpu_ref_tryget_live - try to increment a live percpu refcount
  * @ref: percpu_ref to try-get
  *
  * Increment a percpu refcount unless it has already been killed.  Returns
  * %true on success; %false on failure.
  *
  * Completion of percpu_ref_kill() in itself doesn't guarantee that this
  * function will fail.  For such guarantee, percpu_ref_kill_and_confirm()
  * should be used.  After the confirm_kill callback is invoked, it's
  * guaranteed that no new reference will be given out by
  * percpu_ref_tryget_live().
  *
  * This function is safe to call as long as @ref is between init and exit.
  */

For that perhaps you can try patch [1].
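
To make the suspected window concrete, here's a minimal sketch (not the
actual nvmet code; it just assumes the request path guards itself with
percpu_ref_tryget_live(), as I believe nvmet_req_init() does, and it
borrows the confirm_done/nvmet_confirm_sq additions from patch [1]
below):

static void sq_destroy_racy(struct nvmet_sq *sq)
{
        /*
         * Per the header comment above, percpu_ref_kill() returning does
         * not guarantee that a concurrent percpu_ref_tryget_live() on
         * another CPU will fail; that is exactly the window suspected
         * in (1).
         */
        percpu_ref_kill(&sq->ref);
        wait_for_completion(&sq->free_done);    /* waits for the ref to drop to zero */
        percpu_ref_exit(&sq->ref);
}

static void sq_destroy_confirmed(struct nvmet_sq *sq)
{
        /*
         * With patch [1]: once the confirm callback (nvmet_confirm_sq)
         * has run, percpu_ref_tryget_live() is guaranteed to fail, so no
         * new request can slip into the sq past this point.
         */
        percpu_ref_kill_and_confirm(&sq->ref, nvmet_confirm_sq);
        wait_for_completion(&sq->confirm_done);
        wait_for_completion(&sq->free_done);
        percpu_ref_exit(&sq->ref);
}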

2. ib_free_cq does not really protect against a case where
    the work requeues itself, because it only runs flush_work(). In this
    case, when the work re-executes it polls a cq array that is
    already freed and sees a bogus successful completion. Perhaps
    ib_free_cq should run cancel_work_sync() instead? See [2].
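
For (2), the workqueue polling mode looks roughly like this (a
paraphrased sketch, not the literal drivers/infiniband/core/cq.c code;
cq_needs_more_polling() is a made-up stand-in for the real budget/rearm
check):

static void cq_poll_work_sketch(struct work_struct *work)
{
        struct ib_cq *cq = container_of(work, struct ib_cq, work);

        /* poll up to a budget's worth of completions into cq->wc ... */
        if (cq_needs_more_polling(cq))  /* made-up stand-in condition */
                queue_work(ib_comp_wq, &cq->work);      /* the work requeues itself */
}

flush_work(&cq->work) only waits for the instance that is currently
queued or running; if that instance requeued itself right before
returning, the work can run again after ib_free_cq() has freed cq->wc,
which would explain polling a cq array that is already freed.
cancel_work_sync() is documented to handle self-requeueing work items,
which is what [2] switches to.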

Both of these options are far-fetched... I'd try each individually
to see if one is a possible source of the problem.

Christoph, do you have a better idea?


[1]:
--
diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
index 5267ce20c12d..b29e7483affb 100644
--- a/drivers/nvme/target/core.c
+++ b/drivers/nvme/target/core.c
@@ -423,6 +423,13 @@ void nvmet_sq_setup(struct nvmet_ctrl *ctrl, struct nvmet_sq *sq,
         ctrl->sqs[qid] = sq;
  }

+static void nvmet_confirm_sq(struct percpu_ref *ref)
+{
+       struct nvmet_sq *sq = container_of(ref, struct nvmet_sq, ref);
+
+       complete(&sq->confirm_done);
+}
+
  void nvmet_sq_destroy(struct nvmet_sq *sq)
  {
         /*
@@ -431,7 +438,8 @@ void nvmet_sq_destroy(struct nvmet_sq *sq)
          */
         if (sq->ctrl && sq->ctrl->sqs && sq->ctrl->sqs[0] == sq)
                 nvmet_async_events_free(sq->ctrl);
-       percpu_ref_kill(&sq->ref);
+       percpu_ref_kill_and_confirm(&sq->ref, nvmet_confirm_sq);
+       wait_for_completion(&sq->confirm_done);
         wait_for_completion(&sq->free_done);
         percpu_ref_exit(&sq->ref);

@@ -459,6 +467,7 @@ int nvmet_sq_init(struct nvmet_sq *sq)
                 return ret;
         }
         init_completion(&sq->free_done);
+       init_completion(&sq->confirm_done);

         return 0;
  }
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index 6ea32c49de58..1bbb67bf72d2 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -73,6 +73,7 @@ struct nvmet_sq {
         u16                     qid;
         u16                     size;
         struct completion       free_done;
+       struct completion       confirm_done;
  };

  /**
--

[2]:
--
diff --git a/drivers/infiniband/core/cq.c b/drivers/infiniband/core/cq.c
index e95510117a6d..2746d2eb3d52 100644
--- a/drivers/infiniband/core/cq.c
+++ b/drivers/infiniband/core/cq.c
@@ -196,7 +196,7 @@ void ib_free_cq(struct ib_cq *cq)
                 irq_poll_disable(&cq->iop);
                 break;
         case IB_POLL_WORKQUEUE:
-               flush_work(&cq->work);
+               cancel_work_sync(&cq->work);
                 break;
         default:
                 WARN_ON_ONCE(1);
--


* nvmet: race condition while CQE are getting processed concurrently with the DISCONNECTED event
  2017-03-07 13:33 ` nvmet: race condition while CQE are getting processed concurrently with the DISCONNECTED event Sagi Grimberg
@ 2017-03-08 15:46   ` Christoph Hellwig
  2017-03-08 19:34     ` Sagi Grimberg
  0 siblings, 1 reply; 8+ messages in thread
From: Christoph Hellwig @ 2017-03-08 15:46 UTC (permalink / raw)


On Tue, Mar 07, 2017 at 03:33:27PM +0200, Sagi Grimberg wrote:
> 1. nvmet_sq_destroy is not doing its job of completing all
>    its inflight requests, although we do wait for the final
>    ref on the nvmet_sq to drop to zero.

> For that perhaps you can try patch [1].

Yes, I think we need that.  Did I mention a couple of times that
the percpu refcounter API is a complete trainwreck? :)

> 2. ib_free_cq does not really protect against a case where
>    the work requeues itself, because it only runs flush_work(). In this
>    case, when the work re-executes it polls a cq array that is
>    already freed and sees a bogus successful completion. Perhaps
>    ib_free_cq should run cancel_work_sync() instead? See [2].

Yeah, we'll probably need that as well, independent of whether it
solves the problem reported here.


* nvmet: race condition while CQE are getting processed concurrently with the DISCONNECTED event
  2017-03-08 15:46   ` Christoph Hellwig
@ 2017-03-08 19:34     ` Sagi Grimberg
  0 siblings, 0 replies; 8+ messages in thread
From: Sagi Grimberg @ 2017-03-08 19:34 UTC (permalink / raw)



>> For that perhaps you can try patch [1].
>
> Yes, I think we need that.  Did I mention a couple of times that
> the percpu refcounter API is a complete trainwreck? :)

Heh, you probably did. I wonder what the use case is
for percpu_ref_kill() without the guarantee that a subsequent
percpu_ref_tryget_live() will fail...

>> 2. ib_free_cq does not really protect against a case where
>>    the work requeues itself, because it only runs flush_work(). In this
>>    case, when the work re-executes it polls a cq array that is
>>    already freed and sees a bogus successful completion. Perhaps
>>    ib_free_cq should run cancel_work_sync() instead? See [2].
>
> Yeah, we'll probably need that as well, independent of whether it
> solves the problem reported here.

I'll send proper patches.

It would be nice if Raju or Yi could check whether this helps
at all...


* nvmet: race condition while CQE are getting processed concurrently with the DISCONNECTED event
  2017-03-13  8:05     ` Sagi Grimberg
@ 2017-03-14 13:22       ` Yi Zhang
  0 siblings, 0 replies; 8+ messages in thread
From: Yi Zhang @ 2017-03-14 13:22 UTC (permalink / raw)



On 03/13/2017 04:05 PM, Sagi Grimberg wrote:
>> Hi Sagi
>
> Hi Yi,
>
>> With this patch, issue [1] can no longer be reproduced, but I can still
>> reproduce issue [2]. Thanks.
>>
>> [1]
>>
>> kernel NULL pointer on nvmet with stress
>> rescan_controller/reset_controller test during I/O
>>
>> [2]
>> mlx4_core 0000:07:00.0: swiotlb buffer is full and OOM observed during
>> stress test on reset_controller
>
> So this particular issue seems to have to do with the fact that the
> target is being attacked with reset attempts. The controller teardown
> happens asynchronously and waits for safe termination while
> re-establishments are allowed immediately.
>
> I think we need to somehow figure out what our maximum number of
> allowed active controllers is, and simply refuse controller
> establishment if we exceed this number. The tricky part is
> understanding what this magic number should be.
>
> Question: does the test in [2] complete eventually, or is this a fatal
> error?
Hi Sagi
If I don't stop the reset_controller operation on the client side, it
eventually hits a fatal error, since the operation continuously eats
memory on the target side.

Thanks
Yi


* nvmet: race condition while CQE are getting processed concurrently with the DISCONNECTED event
  2017-03-13  2:22   ` Yi Zhang
@ 2017-03-13  8:05     ` Sagi Grimberg
  2017-03-14 13:22       ` Yi Zhang
  0 siblings, 1 reply; 8+ messages in thread
From: Sagi Grimberg @ 2017-03-13  8:05 UTC (permalink / raw)


> Hi Sagi

Hi Yi,

> With this patch, issue [1] can no longer be reproduced, but I can still
> reproduce issue [2]. Thanks.
>
> [1]
>
> kernel NULL pointer on nvmet with stress
> rescan_controller/reset_controller test during I/O
>
> [2]
> mlx4_core 0000:07:00.0: swiotlb buffer is full and OOM observed during
> stress test on reset_controller

So this particular issue seems to have to do with the fact that the
target is being attacked with reset attempts. The controller teardown
happens asynchronously and waits for safe termination while
re-establishments are allowed immediately.

I think we need to somehow figure out what our maximum number of
allowed active controllers is, and simply refuse controller
establishment if we exceed this number. The tricky part is
understanding what this magic number should be.
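
Purely as a hypothetical sketch (nothing like this exists in nvmet
today, and the names and the default value are made up), the check
could look something like:

static atomic_t nvmet_active_ctrls = ATOMIC_INIT(0);
static int nvmet_max_ctrls = 64;        /* the "magic number" in question */

/* would be called from the connect path before admitting a controller */
static int nvmet_ctrl_slot_get(void)
{
        if (atomic_inc_return(&nvmet_active_ctrls) > nvmet_max_ctrls) {
                atomic_dec(&nvmet_active_ctrls);
                return -EBUSY;  /* refuse establishment while teardowns pile up */
        }
        return 0;
}

/* would be called once a controller (including one in async teardown) is gone */
static void nvmet_ctrl_slot_put(void)
{
        atomic_dec(&nvmet_active_ctrls);
}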

Question: does the test in [2] complete eventually, or is this a fatal
error?


* nvmet: race condition while CQE are getting processed concurrently with the DISCONNECTED event
  2017-03-09 11:45 ` Sagi Grimberg
  2017-03-11  8:29   ` Raju  Rangoju
@ 2017-03-13  2:22   ` Yi Zhang
  2017-03-13  8:05     ` Sagi Grimberg
  1 sibling, 1 reply; 8+ messages in thread
From: Yi Zhang @ 2017-03-13  2:22 UTC (permalink / raw)


Hi Sagi

With this patch, issue [1] can no longer be reproduced, but I can still
reproduce issue [2]. Thanks.

[1]

kernel NULL pointer on nvmet with stress 
rescan_controller/reset_controller test during I/O

[2]
mlx4_core 0000:07:00.0: swiotlb buffer is full and OOM observed during 
stress test on reset_controller


On 03/09/2017 07:45 PM, Sagi Grimberg wrote:
>> Hi Sagi,
>
> Hi Raju (CC'ing Yi)
>
>>
>> I tried each of the below patches individually; the issue is still seen
>> with both patches.
>>
>> With patch #1, from dmesg I see that the NULL pointer dereference is
>> hit before steps 3, 4, and 5 (see below) had finished for that queue.
>>
>> Where :
>> 1. rdma_disconnect
>> 2. nvmet_sq_destroy
>> 3. ib_drain_qp
>> 4. rdma_destroy_qp
>> 5. ib_free_cq (which flushes the cq worker)
>
> I took a deeper look here, and I think that the root cause has nothing
> to do with the 2 (still useful) patches I sent. Actually, the fact that
> patch (1) caused you to get the NULL deref even before 3,4,5 tells me
> that the qp and cq are not freed at all, and for some reason we see them
> as NULL.
>
> So in nvmet_rdma_recv_done() if the queue is not in state
> NVMET_RDMA_Q_LIVE, we simply restore the rsp back to the queue free
> list:
>
> static inline void
> nvmet_rdma_put_rsp(struct nvmet_rdma_rsp *rsp)
> {
>         unsigned long flags;
>
>         spin_lock_irqsave(&rsp->queue->rsps_lock, flags);
>         list_add_tail(&rsp->free_list, &rsp->queue->free_rsps);
>         spin_unlock_irqrestore(&rsp->queue->rsps_lock, flags);
> }
>
> However we only set rsp->queue in nvmet_rdma_handle_command() which
> does not take place because, as I mentioned, we are in disconnect
> state...
>
> I think this patch should make this issue go away:
> -- 
> diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
> index 973b674ab55b..06a8c6114098 100644
> --- a/drivers/nvme/target/rdma.c
> +++ b/drivers/nvme/target/rdma.c
> @@ -180,9 +180,9 @@ nvmet_rdma_put_rsp(struct nvmet_rdma_rsp *rsp)
>  {
>         unsigned long flags;
>
> -       spin_lock_irqsave(&rsp->queue->rsps_lock, flags);
> -       list_add_tail(&rsp->free_list, &rsp->queue->free_rsps);
> -       spin_unlock_irqrestore(&rsp->queue->rsps_lock, flags);
> + spin_lock_irqsave(&rsp->cmd->queue->rsps_lock, flags);
> +       list_add_tail(&rsp->free_list, &rsp->cmd->queue->free_rsps);
> + spin_unlock_irqrestore(&rsp->cmd->queue->rsps_lock, flags);
>  }
>
>  static void nvmet_rdma_free_sgl(struct scatterlist *sgl, unsigned int 
> nents)
> @@ -473,7 +473,7 @@ static void nvmet_rdma_process_wr_wait_list(struct 
> nvmet_rdma_queue *queue)
>
>  static void nvmet_rdma_release_rsp(struct nvmet_rdma_rsp *rsp)
>  {
> -       struct nvmet_rdma_queue *queue = rsp->queue;
> +       struct nvmet_rdma_queue *queue = rsp->cmd->queue;
>
>         atomic_add(1 + rsp->n_rdma, &queue->sq_wr_avail);
>
> @@ -517,7 +517,7 @@ static void nvmet_rdma_send_done(struct ib_cq *cq, 
> struct ib_wc *wc)
>                      wc->status != IB_WC_WR_FLUSH_ERR)) {
>                 pr_err("SEND for CQE 0x%p failed with status %s (%d).\n",
>                         wc->wr_cqe, ib_wc_status_msg(wc->status), 
> wc->status);
> -               nvmet_rdma_error_comp(rsp->queue);
> +               nvmet_rdma_error_comp(rsp->cmd->queue);
>         }
>  }
>
> @@ -525,7 +525,7 @@ static void nvmet_rdma_queue_response(struct 
> nvmet_req *req)
>  {
>         struct nvmet_rdma_rsp *rsp =
>                 container_of(req, struct nvmet_rdma_rsp, req);
> -       struct rdma_cm_id *cm_id = rsp->queue->cm_id;
> +       struct rdma_cm_id *cm_id = rsp->cmd->queue->cm_id;
>         struct ib_send_wr *first_wr, *bad_wr;
>
>         if (rsp->flags & NVMET_RDMA_REQ_INVALIDATE_RKEY) {
> @@ -541,9 +541,9 @@ static void nvmet_rdma_queue_response(struct 
> nvmet_req *req)
>         else
>                 first_wr = &rsp->send_wr;
>
> -       nvmet_rdma_post_recv(rsp->queue->dev, rsp->cmd);
> +       nvmet_rdma_post_recv(rsp->cmd->queue->dev, rsp->cmd);
>
> - ib_dma_sync_single_for_device(rsp->queue->dev->device,
> + ib_dma_sync_single_for_device(rsp->cmd->queue->dev->device,
>                 rsp->send_sge.addr, rsp->send_sge.length,
>                 DMA_TO_DEVICE);
>
> @@ -614,7 +614,7 @@ static u16 nvmet_rdma_map_sgl_inline(struct 
> nvmet_rdma_rsp *rsp)
>  static u16 nvmet_rdma_map_sgl_keyed(struct nvmet_rdma_rsp *rsp,
>                 struct nvme_keyed_sgl_desc *sgl, bool invalidate)
>  {
> -       struct rdma_cm_id *cm_id = rsp->queue->cm_id;
> +       struct rdma_cm_id *cm_id = rsp->cmd->queue->cm_id;
>         u64 addr = le64_to_cpu(sgl->addr);
>         u32 len = get_unaligned_le24(sgl->length);
>         u32 key = get_unaligned_le32(sgl->key);
> @@ -676,7 +676,7 @@ static u16 nvmet_rdma_map_sgl(struct 
> nvmet_rdma_rsp *rsp)
>
>  static bool nvmet_rdma_execute_command(struct nvmet_rdma_rsp *rsp)
>  {
> -       struct nvmet_rdma_queue *queue = rsp->queue;
> +       struct nvmet_rdma_queue *queue = rsp->cmd->queue;
>
>         if (unlikely(atomic_sub_return(1 + rsp->n_rdma,
>                         &queue->sq_wr_avail) < 0)) {
> @@ -703,11 +703,6 @@ static void nvmet_rdma_handle_command(struct 
> nvmet_rdma_queue *queue,
>  {
>         u16 status;
>
> -       cmd->queue = queue;
> -       cmd->n_rdma = 0;
> -       cmd->req.port = queue->port;
> -
> -
>         ib_dma_sync_single_for_cpu(queue->dev->device,
>                 cmd->cmd->sge[0].addr, cmd->cmd->sge[0].length,
>                 DMA_FROM_DEVICE);
> @@ -763,6 +758,8 @@ static void nvmet_rdma_recv_done(struct ib_cq *cq, 
> struct ib_wc *wc)
>         rsp->cmd = cmd;
>         rsp->flags = 0;
>         rsp->req.cmd = cmd->nvme_cmd;
> +       rsp->n_rdma = 0;
> +       rsp->req.port = queue->port;
>
>         if (unlikely(queue->state != NVMET_RDMA_Q_LIVE)) {
>                 unsigned long flags;
> -- 
>


* nvmet: race condition while CQE are getting processed concurrently with the DISCONNECTED event
  2017-03-09 11:45 ` Sagi Grimberg
@ 2017-03-11  8:29   ` Raju  Rangoju
  2017-03-13  2:22   ` Yi Zhang
  1 sibling, 0 replies; 8+ messages in thread
From: Raju  Rangoju @ 2017-03-11  8:29 UTC (permalink / raw)



Hi Sagi,


This patch looks good. I was able to run the test overnight without any issues.
Please see my comments inline.


-Thanks




From: Sagi Grimberg <sagi@grimberg.me>
Sent: Thursday, March 9, 2017 5:15 PM
To: Raju Rangoju; hch at lst.de
Cc: SWise OGC; linux-nvme; Yi Zhang
Subject: Re: nvmet: race condition while CQE are getting processed concurrently with the DISCONNECTED event
> Hi Sagi,

Hi Raju (CC'ing Yi)

>
> I tried each of the below patches individually; the issue is still seen with both patches.
>
> With patch #1, from dmesg I see that the NULL pointer dereference is hit before steps 3, 4, and 5 (see below) had finished for that queue.
>
> Where :
> 1. rdma_disconnect
> 2. nvmet_sq_destroy
> 3. ib_drain_qp
> 4. rdma_destroy_qp
> 5. ib_free_cq (which flushes the cq worker)

I took a deeper look here, and I think that the root cause has nothing
to do with the 2 (still useful) patches I sent.

<RAJU> Yes, the NULL deref happens much before ib_free_cq() is called.



Actually, the fact that
patch (1) caused you to get the NULL deref even before 3,4,5 tells me
that the qp and cq are not freed at all, and for some reason we see them
as NULL.
<RAJU> Yes.

So in nvmet_rdma_recv_done() if the queue is not in state
NVMET_RDMA_Q_LIVE, we simply restore the rsp back to the queue free
list:

static inline void
nvmet_rdma_put_rsp(struct nvmet_rdma_rsp *rsp)
{
        unsigned long flags;

        spin_lock_irqsave(&rsp->queue->rsps_lock, flags);
        list_add_tail(&rsp->free_list, &rsp->queue->free_rsps);
        spin_unlock_irqrestore(&rsp->queue->rsps_lock, flags);
}

However we only set rsp->queue in nvmet_rdma_handle_command() which
does not take place because, as I mentioned, we are in disconnect
state...


* nvmet: race condition while CQE are getting processed concurrently with the DISCONNECTED event
       [not found] <CY1PR12MB07747089AFFCB6115BBCF5ECBC210@CY1PR12MB0774.namprd12.prod.outlook.com>
@ 2017-03-09 11:45 ` Sagi Grimberg
  2017-03-11  8:29   ` Raju  Rangoju
  2017-03-13  2:22   ` Yi Zhang
  0 siblings, 2 replies; 8+ messages in thread
From: Sagi Grimberg @ 2017-03-09 11:45 UTC (permalink / raw)


> Hi Sagi,

Hi Raju (CC'ing Yi)

>
> I tried each of the below patches individually; the issue is still seen with both patches.
>
> With patch #1, from dmesg I see that the NULL pointer dereference is hit before steps 3, 4, and 5 (see below) had finished for that queue.
>
> Where :
> 1. rdma_disconnect
> 2. nvmet_sq_destroy
> 3. ib_drain_qp
> 4. rdma_destroy_qp
> 5. ib_free_cq (which flushes the cq worker)

I took a deeper look here, and I think that the root cause has nothing
to do with the 2 (still useful) patches I sent. Actually, the fact that
patch (1) caused you to get the NULL deref even before 3,4,5 tells me
that the qp and cq are not freed at all, and for some reason we see them
as NULL.

So in nvmet_rdma_recv_done() if the queue is not in state
NVMET_RDMA_Q_LIVE, we simply restore the rsp back to the queue free
list:

static inline void
nvmet_rdma_put_rsp(struct nvmet_rdma_rsp *rsp)
{
         unsigned long flags;

         spin_lock_irqsave(&rsp->queue->rsps_lock, flags);
         list_add_tail(&rsp->free_list, &rsp->queue->free_rsps);
         spin_unlock_irqrestore(&rsp->queue->rsps_lock, flags);
}

However we only set rsp->queue in nvmet_rdma_handle_command() which
does not take place because, as I mentioned, we are in disconnect
state...
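
To spell the failure path out (a simplified, annotated sketch; the flow
mirrors the existing recv_done/put_rsp code but is not a verbatim copy):

static void recv_done_sketch(struct nvmet_rdma_queue *queue,
                struct nvmet_rdma_rsp *rsp)
{
        if (unlikely(queue->state != NVMET_RDMA_Q_LIVE)) {
                /*
                 * nvmet_rdma_handle_command() never ran for this rsp, so
                 * rsp->queue was never assigned and still holds its
                 * initial NULL value.
                 */
                nvmet_rdma_put_rsp(rsp);
                /*
                 * ...which does spin_lock_irqsave(&rsp->queue->rsps_lock),
                 * i.e. dereferences a NULL queue pointer at the rsps_lock
                 * offset, lining up with the reported "NULL pointer
                 * dereference at 00000000000000b8" in _raw_spin_lock_irqsave.
                 */
                return;
        }
        /* live path: nvmet_rdma_handle_command() sets rsp->queue and proceeds */
}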

I think this patch should make this issue go away:
--
diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index 973b674ab55b..06a8c6114098 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -180,9 +180,9 @@ nvmet_rdma_put_rsp(struct nvmet_rdma_rsp *rsp)
  {
         unsigned long flags;

-       spin_lock_irqsave(&rsp->queue->rsps_lock, flags);
-       list_add_tail(&rsp->free_list, &rsp->queue->free_rsps);
-       spin_unlock_irqrestore(&rsp->queue->rsps_lock, flags);
+       spin_lock_irqsave(&rsp->cmd->queue->rsps_lock, flags);
+       list_add_tail(&rsp->free_list, &rsp->cmd->queue->free_rsps);
+       spin_unlock_irqrestore(&rsp->cmd->queue->rsps_lock, flags);
  }

  static void nvmet_rdma_free_sgl(struct scatterlist *sgl, unsigned int nents)
@@ -473,7 +473,7 @@ static void nvmet_rdma_process_wr_wait_list(struct nvmet_rdma_queue *queue)

  static void nvmet_rdma_release_rsp(struct nvmet_rdma_rsp *rsp)
  {
-       struct nvmet_rdma_queue *queue = rsp->queue;
+       struct nvmet_rdma_queue *queue = rsp->cmd->queue;

         atomic_add(1 + rsp->n_rdma, &queue->sq_wr_avail);

@@ -517,7 +517,7 @@ static void nvmet_rdma_send_done(struct ib_cq *cq, struct ib_wc *wc)
                      wc->status != IB_WC_WR_FLUSH_ERR)) {
                 pr_err("SEND for CQE 0x%p failed with status %s (%d).\n",
                         wc->wr_cqe, ib_wc_status_msg(wc->status), wc->status);
-               nvmet_rdma_error_comp(rsp->queue);
+               nvmet_rdma_error_comp(rsp->cmd->queue);
         }
  }

@@ -525,7 +525,7 @@ static void nvmet_rdma_queue_response(struct nvmet_req *req)
  {
         struct nvmet_rdma_rsp *rsp =
                 container_of(req, struct nvmet_rdma_rsp, req);
-       struct rdma_cm_id *cm_id = rsp->queue->cm_id;
+       struct rdma_cm_id *cm_id = rsp->cmd->queue->cm_id;
         struct ib_send_wr *first_wr, *bad_wr;

         if (rsp->flags & NVMET_RDMA_REQ_INVALIDATE_RKEY) {
@@ -541,9 +541,9 @@ static void nvmet_rdma_queue_response(struct nvmet_req *req)
         else
                 first_wr = &rsp->send_wr;

-       nvmet_rdma_post_recv(rsp->queue->dev, rsp->cmd);
+       nvmet_rdma_post_recv(rsp->cmd->queue->dev, rsp->cmd);

-       ib_dma_sync_single_for_device(rsp->queue->dev->device,
+       ib_dma_sync_single_for_device(rsp->cmd->queue->dev->device,
                 rsp->send_sge.addr, rsp->send_sge.length,
                 DMA_TO_DEVICE);

@@ -614,7 +614,7 @@ static u16 nvmet_rdma_map_sgl_inline(struct nvmet_rdma_rsp *rsp)
  static u16 nvmet_rdma_map_sgl_keyed(struct nvmet_rdma_rsp *rsp,
                 struct nvme_keyed_sgl_desc *sgl, bool invalidate)
  {
-       struct rdma_cm_id *cm_id = rsp->queue->cm_id;
+       struct rdma_cm_id *cm_id = rsp->cmd->queue->cm_id;
         u64 addr = le64_to_cpu(sgl->addr);
         u32 len = get_unaligned_le24(sgl->length);
         u32 key = get_unaligned_le32(sgl->key);
@@ -676,7 +676,7 @@ static u16 nvmet_rdma_map_sgl(struct nvmet_rdma_rsp *rsp)

  static bool nvmet_rdma_execute_command(struct nvmet_rdma_rsp *rsp)
  {
-       struct nvmet_rdma_queue *queue = rsp->queue;
+       struct nvmet_rdma_queue *queue = rsp->cmd->queue;

         if (unlikely(atomic_sub_return(1 + rsp->n_rdma,
                         &queue->sq_wr_avail) < 0)) {
@@ -703,11 +703,6 @@ static void nvmet_rdma_handle_command(struct nvmet_rdma_queue *queue,
  {
         u16 status;

-       cmd->queue = queue;
-       cmd->n_rdma = 0;
-       cmd->req.port = queue->port;
-
-
         ib_dma_sync_single_for_cpu(queue->dev->device,
                 cmd->cmd->sge[0].addr, cmd->cmd->sge[0].length,
                 DMA_FROM_DEVICE);
@@ -763,6 +758,8 @@ static void nvmet_rdma_recv_done(struct ib_cq *cq, struct ib_wc *wc)
         rsp->cmd = cmd;
         rsp->flags = 0;
         rsp->req.cmd = cmd->nvme_cmd;
+       rsp->n_rdma = 0;
+       rsp->req.port = queue->port;

         if (unlikely(queue->state != NVMET_RDMA_Q_LIVE)) {
                 unsigned long flags;
--

