* nvme-rdma corrupts memory upon timeout
@ 2018-02-25 15:10 Alon Horev
  2018-02-25 16:18 ` Bart Van Assche
  2018-02-25 17:45 ` Sagi Grimberg
  0 siblings, 2 replies; 10+ messages in thread
From: Alon Horev @ 2018-02-25 15:10 UTC (permalink / raw)


Hey,

We're running nvmf over a large cluster using RDMA. Sometimes there is
congestion that causes the nvme host driver to time out (we use a
4-second timeout).
Even though the host (initiator) times out and returns an error to
userspace, we can see the buffer being written after the I/O has
returned. This can obviously cause serious crashes and corruptions.
We suspect the same happens with writes but have yet to prove it.

We think we can spot the root cause: nvme_rdma_error_recovery()
handles the timeout in an asynchronous manner. It queues a task that
reconnects the nvme device. Until that task is executed by the worker
thread, the QP is open and an RDMA write can get through. Does this
make sense?
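
To illustrate, here is the timeout path as we read it, condensed from
drivers/nvme/host/rdma.c (the comments are ours and mark the window we
are describing):
--
static enum blk_eh_timer_return
nvme_rdma_timeout(struct request *rq, bool reserved)
{
        struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq);

        /* only *queues* err_work; returns without touching the qp */
        nvme_rdma_error_recovery(req->queue->ctrl);

        /* fail with DNR on cmd timeout */
        nvme_req(rq)->status = NVME_SC_ABORT_REQ | NVME_SC_DNR;

        /*
         * the request is completed here and its buffer may be freed
         * or reused, but until the worker runs the HCA can still
         * place a remote RDMA write into it
         */
        return BLK_EH_HANDLED;
}
--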

Some additional information: we use a keepalive and reconnect timeout
of 1 second, on ConnectX-4 with OFED 4.1. I verified the code hasn't
changed in the latest Linux sources.

Thanks, Alon Horev
VastData


* nvme-rdma corrupts memory upon timeout
  2018-02-25 15:10 nvme-rdma corrupts memory upon timeout Alon Horev
@ 2018-02-25 16:18 ` Bart Van Assche
  2018-02-25 17:45 ` Sagi Grimberg
  1 sibling, 0 replies; 10+ messages in thread
From: Bart Van Assche @ 2018-02-25 16:18 UTC (permalink / raw)


On Sun, 2018-02-25 at 17:10 +0200, Alon Horev wrote:
> We think we can spot the root cause: nvme_rdma_error_recovery()
> handles the timeout in an asynchronous manner. It queues a task that
> reconnects the nvme device. Until that task is executed by the worker
> thread, the QP is open and an RDMA write can get through. Does this
> make sense?

I think it's fine that error recovery happens asynchronously. Other drivers
(e.g. ib_srp) also use this approach. However, it seems to me that
nvme_rdma_error_recovery_work() does not wait until ongoing RDMA transfers
have finished. I think that's something that needs to be addressed. Has it
been considered to insert an ib_drain_qp() call in that function?
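
A minimal sketch of what I mean, assuming the queue/ctrl field names
from drivers/nvme/host/rdma.c (the helper itself is invented for
illustration, not existing code):
--
/* drain every I/O queue pair before the requests are failed, so that
 * no flushed completion or remote write can arrive afterwards */
static void nvme_rdma_drain_io_queues(struct nvme_rdma_ctrl *ctrl)
{
        int i;

        for (i = 1; i < ctrl->ctrl.queue_count; i++)
                ib_drain_qp(ctrl->queues[i].qp);
}
--
ib_drain_qp() moves the QP into the error state and blocks until all
outstanding work requests have been flushed.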

Another concern is that I think that there is a TOCTOU race in
nvme_cancel_request(). Has it been considered to make that function call
blk_abort_request() instead of blk_mq_request_started() +
blk_mq_complete_request()?
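
For reference, the check-then-complete pattern I am referring to,
abridged from nvme_cancel_request() in drivers/nvme/host/core.c:
--
void nvme_cancel_request(struct request *req, void *data, bool reserved)
{
        if (!blk_mq_request_started(req))       /* check ... */
                return;

        nvme_req(req)->status = NVME_SC_ABORT_REQ;
        blk_mq_complete_request(req);           /* ... then act */
}
--
A normal completion can slip in between the check and the complete
call, while blk_abort_request() routes the abort through the block
layer timeout machinery, which serializes against regular completion.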

Thanks,

Bart.


* nvme-rdma corrupts memory upon timeout
  2018-02-25 15:10 nvme-rdma corrupts memory upon timeout Alon Horev
  2018-02-25 16:18 ` Bart Van Assche
@ 2018-02-25 17:45 ` Sagi Grimberg
  2018-02-25 18:14   ` Sagi Grimberg
  1 sibling, 1 reply; 10+ messages in thread
From: Sagi Grimberg @ 2018-02-25 17:45 UTC (permalink / raw)



> Hey,

Hi Alon, thanks for reporting!

> Some additional information: we use a keepalive and reconnect timeout
> of 1 second, on ConnectX-4 with OFED 4.1. I verified the code hasn't
> changed in the latest Linux sources.

So we obviously cannot help you with OFED (or anything else that is
not upstream, for that matter). This mailing list exists to develop
upstream Linux and has no control over any other code distribution.

For OFED issues the correct address for filing bug reports would be:
http://bugs.openfabrics.org/

For Mellanox OFED I believe you probably already have a support
channel...

Now as to your issue,

 > We're running nvmf over a large cluster using RDMA. Sometimes there is
 > congestion that causes the nvme host driver to time out (we use a
 > 4-second timeout).
 > Even though the host (initiator) times out and returns an error to
 > userspace, we can see the buffer being written after the I/O has
 > returned. This can obviously cause serious crashes and corruptions.
 > We suspect the same happens with writes but have yet to prove it.
 >
 > We think we can spot the root cause: nvme_rdma_error_recovery()
 > handles the timeout in an asynchronous manner. It queues a task that
 > reconnects the nvme device. Until that task is executed by the worker
 > thread, the QP is open and an RDMA write can get through. Does this
 > make sense?

Yes, it does. The problem is that when an I/O timeout kicks error
recovery, we don't make sure to either invalidate the rkey or drain
the RDMA queue pair (either would do).
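
For context, nvme_rdma_stop_queue() is what provides the drain; it
looks roughly like this (paraphrased from the driver, not verbatim):
--
static void nvme_rdma_stop_queue(struct nvme_rdma_queue *queue)
{
        if (!test_and_clear_bit(NVME_RDMA_Q_LIVE, &queue->flags))
                return;
        rdma_disconnect(queue->cm_id);  /* stop the target side */
        ib_drain_qp(queue->qp);         /* flush all posted work requests */
}
--
nvme_rdma_destroy_io_queues() ends up calling it, so the patch below
simply reorders things such that the disconnect+drain happens before
nvme_cancel_request() completes the requests.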

Does this patch help?
--
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 2ef761b5a26e..856ae9a7615a 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -956,15 +956,15 @@ static void nvme_rdma_error_recovery_work(struct work_struct *work)

         if (ctrl->ctrl.queue_count > 1) {
                 nvme_stop_queues(&ctrl->ctrl);
+               nvme_rdma_destroy_io_queues(ctrl, false);
                 blk_mq_tagset_busy_iter(&ctrl->tag_set,
                                         nvme_cancel_request, &ctrl->ctrl);
-               nvme_rdma_destroy_io_queues(ctrl, false);
         }

         blk_mq_quiesce_queue(ctrl->ctrl.admin_q);
+       nvme_rdma_destroy_admin_queue(ctrl, false);
         blk_mq_tagset_busy_iter(&ctrl->admin_tag_set,
                                 nvme_cancel_request, &ctrl->ctrl);
-       nvme_rdma_destroy_admin_queue(ctrl, false);

         /*
          * queues are not a live anymore, so restart the queues to fail fast
@@ -1724,9 +1724,9 @@ static void nvme_rdma_shutdown_ctrl(struct nvme_rdma_ctrl *ctrl, bool shutdown)

         if (ctrl->ctrl.queue_count > 1) {
                 nvme_stop_queues(&ctrl->ctrl);
+               nvme_rdma_destroy_io_queues(ctrl, shutdown);
                 blk_mq_tagset_busy_iter(&ctrl->tag_set,
                                         nvme_cancel_request, &ctrl->ctrl);
-               nvme_rdma_destroy_io_queues(ctrl, shutdown);
         }

         if (shutdown)
@@ -1735,10 +1735,10 @@ static void nvme_rdma_shutdown_ctrl(struct nvme_rdma_ctrl *ctrl, bool shutdown)
                 nvme_disable_ctrl(&ctrl->ctrl, ctrl->ctrl.cap);

         blk_mq_quiesce_queue(ctrl->ctrl.admin_q);
+       nvme_rdma_destroy_admin_queue(ctrl, shutdown);
         blk_mq_tagset_busy_iter(&ctrl->admin_tag_set,
                                 nvme_cancel_request, &ctrl->ctrl);
         blk_mq_unquiesce_queue(ctrl->ctrl.admin_q);
-       nvme_rdma_destroy_admin_queue(ctrl, shutdown);
  }

  static void nvme_rdma_delete_ctrl(struct nvme_ctrl *ctrl)
--


* nvme-rdma corrupts memory upon timeout
  2018-02-25 17:45 ` Sagi Grimberg
@ 2018-02-25 18:14   ` Sagi Grimberg
  2018-02-26  7:53     ` Alon Horev
                       ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Sagi Grimberg @ 2018-02-25 18:14 UTC (permalink / raw)



> Does this patch help?
> -- 
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index 2ef761b5a26e..856ae9a7615a 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -956,15 +956,15 @@ static void nvme_rdma_error_recovery_work(struct work_struct *work)
> 
>          if (ctrl->ctrl.queue_count > 1) {
>                  nvme_stop_queues(&ctrl->ctrl);
> +               nvme_rdma_destroy_io_queues(ctrl, false);
>                  blk_mq_tagset_busy_iter(&ctrl->tag_set,
>                                          nvme_cancel_request, &ctrl->ctrl);
> -               nvme_rdma_destroy_io_queues(ctrl, false);
>          }
> 
>          blk_mq_quiesce_queue(ctrl->ctrl.admin_q);
> +       nvme_rdma_destroy_admin_queue(ctrl, false);
>          blk_mq_tagset_busy_iter(&ctrl->admin_tag_set,
>                                  nvme_cancel_request, &ctrl->ctrl);
> -       nvme_rdma_destroy_admin_queue(ctrl, false);
> 
>          /*
>           * queues are not a live anymore, so restart the queues to fail fast
> @@ -1724,9 +1724,9 @@ static void nvme_rdma_shutdown_ctrl(struct nvme_rdma_ctrl *ctrl, bool shutdown)
> 
>          if (ctrl->ctrl.queue_count > 1) {
>                  nvme_stop_queues(&ctrl->ctrl);
> +               nvme_rdma_destroy_io_queues(ctrl, shutdown);
>                  blk_mq_tagset_busy_iter(&ctrl->tag_set,
>                                          nvme_cancel_request, &ctrl->ctrl);
> -               nvme_rdma_destroy_io_queues(ctrl, shutdown);
>          }
> 
>          if (shutdown)
> @@ -1735,10 +1735,10 @@ static void nvme_rdma_shutdown_ctrl(struct nvme_rdma_ctrl *ctrl, bool shutdown)
>                  nvme_disable_ctrl(&ctrl->ctrl, ctrl->ctrl.cap);
> 
>          blk_mq_quiesce_queue(ctrl->ctrl.admin_q);
> +       nvme_rdma_destroy_admin_queue(ctrl, shutdown);
>          blk_mq_tagset_busy_iter(&ctrl->admin_tag_set,
>                                  nvme_cancel_request, &ctrl->ctrl);
>          blk_mq_unquiesce_queue(ctrl->ctrl.admin_q);
> -       nvme_rdma_destroy_admin_queue(ctrl, shutdown);
>  }
> 
>  static void nvme_rdma_delete_ctrl(struct nvme_ctrl *ctrl)
> --

Or maybe this should do a better job:
--
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 4c32518a6c81..e45801fe78c1 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -956,12 +956,14 @@ static void nvme_rdma_error_recovery_work(struct work_struct *work)

         if (ctrl->ctrl.queue_count > 1) {
                 nvme_stop_queues(&ctrl->ctrl);
+               nvme_rdma_stop_io_queues(ctrl);
                 blk_mq_tagset_busy_iter(&ctrl->tag_set,
                                         nvme_cancel_request, &ctrl->ctrl);
                 nvme_rdma_destroy_io_queues(ctrl, false);
         }

         blk_mq_quiesce_queue(ctrl->ctrl.admin_q);
+       nvme_rdma_stop_queue(&ctrl->queues[0]);
         blk_mq_tagset_busy_iter(&ctrl->admin_tag_set,
                                 nvme_cancel_request, &ctrl->ctrl);
         nvme_rdma_destroy_admin_queue(ctrl, false);
@@ -1729,9 +1731,12 @@ static void nvme_rdma_shutdown_ctrl(struct nvme_rdma_ctrl *ctrl, bool shutdown)

         if (ctrl->ctrl.queue_count > 1) {
                 nvme_stop_queues(&ctrl->ctrl);
+               nvme_rdma_stop_io_queues(ctrl);
                 blk_mq_tagset_busy_iter(&ctrl->tag_set,
                                         nvme_cancel_request, &ctrl->ctrl);
                 nvme_rdma_destroy_io_queues(ctrl, shutdown);
+               if (shutdown)
+                       nvme_start_queues(&ctrl->ctrl);
         }

         if (shutdown)
@@ -1740,10 +1745,11 @@ static void nvme_rdma_shutdown_ctrl(struct nvme_rdma_ctrl *ctrl, bool shutdown)
                 nvme_disable_ctrl(&ctrl->ctrl, ctrl->ctrl.cap);

         blk_mq_quiesce_queue(ctrl->ctrl.admin_q);
+       nvme_rdma_stop_queue(&ctrl->queues[0]);
         blk_mq_tagset_busy_iter(&ctrl->admin_tag_set,
                                 nvme_cancel_request, &ctrl->ctrl);
-       blk_mq_unquiesce_queue(ctrl->ctrl.admin_q);
         nvme_rdma_destroy_admin_queue(ctrl, shutdown);
+       blk_mq_unquiesce_queue(ctrl->ctrl.admin_q);
  }

  static void nvme_rdma_delete_ctrl(struct nvme_ctrl *ctrl)
--


* nvme-rdma corrupts memory upon timeout
  2018-02-25 18:14   ` Sagi Grimberg
@ 2018-02-26  7:53     ` Alon Horev
  2018-02-26 12:37       ` Sagi Grimberg
  2018-02-26 18:50     ` Bart Van Assche
  2018-02-26 22:20     ` Bart Van Assche
  2 siblings, 1 reply; 10+ messages in thread
From: Alon Horev @ 2018-02-26  7:53 UTC (permalink / raw)


This patch still returns to userspace after queuing work and may
result in corruption. Maybe we can flush the work queue after a
timeout?
Just to put things in perspective, we have around 100 subsystems
connected on a single host. This means the time frame between queuing
and execution may be larger than usual.
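
To make the flush idea concrete, something along these lines (a sketch
only; blocking in the timeout handler may have problems of its own):
--
static enum blk_eh_timer_return
nvme_rdma_timeout(struct request *rq, bool reserved)
{
        struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq);
        struct nvme_rdma_ctrl *ctrl = req->queue->ctrl;

        nvme_rdma_error_recovery(ctrl);
        /* don't complete rq before the teardown has actually run */
        flush_work(&ctrl->err_work);

        nvme_req(rq)->status = NVME_SC_ABORT_REQ | NVME_SC_DNR;
        return BLK_EH_HANDLED;
}
--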

Thanks, Alon

On Sun, Feb 25, 2018 at 8:14 PM, Sagi Grimberg <sagi@grimberg.me> wrote:
>
>> Does this patch help?
>> --
>> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
>> index 2ef761b5a26e..856ae9a7615a 100644
>> --- a/drivers/nvme/host/rdma.c
>> +++ b/drivers/nvme/host/rdma.c
>> @@ -956,15 +956,15 @@ static void nvme_rdma_error_recovery_work(struct work_struct *work)
>>
>>          if (ctrl->ctrl.queue_count > 1) {
>>                  nvme_stop_queues(&ctrl->ctrl);
>> +               nvme_rdma_destroy_io_queues(ctrl, false);
>>                  blk_mq_tagset_busy_iter(&ctrl->tag_set,
>>                                          nvme_cancel_request, &ctrl->ctrl);
>> -               nvme_rdma_destroy_io_queues(ctrl, false);
>>          }
>>
>>          blk_mq_quiesce_queue(ctrl->ctrl.admin_q);
>> +       nvme_rdma_destroy_admin_queue(ctrl, false);
>>          blk_mq_tagset_busy_iter(&ctrl->admin_tag_set,
>>                                  nvme_cancel_request, &ctrl->ctrl);
>> -       nvme_rdma_destroy_admin_queue(ctrl, false);
>>
>>          /*
>>           * queues are not a live anymore, so restart the queues to fail fast
>> @@ -1724,9 +1724,9 @@ static void nvme_rdma_shutdown_ctrl(struct nvme_rdma_ctrl *ctrl, bool shutdown)
>>
>>          if (ctrl->ctrl.queue_count > 1) {
>>                  nvme_stop_queues(&ctrl->ctrl);
>> +               nvme_rdma_destroy_io_queues(ctrl, shutdown);
>>                  blk_mq_tagset_busy_iter(&ctrl->tag_set,
>>                                          nvme_cancel_request, &ctrl->ctrl);
>> -               nvme_rdma_destroy_io_queues(ctrl, shutdown);
>>          }
>>
>>          if (shutdown)
>> @@ -1735,10 +1735,10 @@ static void nvme_rdma_shutdown_ctrl(struct nvme_rdma_ctrl *ctrl, bool shutdown)
>>                  nvme_disable_ctrl(&ctrl->ctrl, ctrl->ctrl.cap);
>>
>>          blk_mq_quiesce_queue(ctrl->ctrl.admin_q);
>> +       nvme_rdma_destroy_admin_queue(ctrl, shutdown);
>>          blk_mq_tagset_busy_iter(&ctrl->admin_tag_set,
>>                                  nvme_cancel_request, &ctrl->ctrl);
>>          blk_mq_unquiesce_queue(ctrl->ctrl.admin_q);
>> -       nvme_rdma_destroy_admin_queue(ctrl, shutdown);
>>   }
>>
>>   static void nvme_rdma_delete_ctrl(struct nvme_ctrl *ctrl)
>> --
>
>
> Or maybe this should do a better job:
> --
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index 4c32518a6c81..e45801fe78c1 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -956,12 +956,14 @@ static void nvme_rdma_error_recovery_work(struct work_struct *work)
>
>         if (ctrl->ctrl.queue_count > 1) {
>                 nvme_stop_queues(&ctrl->ctrl);
> +               nvme_rdma_stop_io_queues(ctrl);
>                 blk_mq_tagset_busy_iter(&ctrl->tag_set,
>                                         nvme_cancel_request, &ctrl->ctrl);
>                 nvme_rdma_destroy_io_queues(ctrl, false);
>         }
>
>         blk_mq_quiesce_queue(ctrl->ctrl.admin_q);
> +       nvme_rdma_stop_queue(&ctrl->queues[0]);
>         blk_mq_tagset_busy_iter(&ctrl->admin_tag_set,
>                                 nvme_cancel_request, &ctrl->ctrl);
>         nvme_rdma_destroy_admin_queue(ctrl, false);
> @@ -1729,9 +1731,12 @@ static void nvme_rdma_shutdown_ctrl(struct nvme_rdma_ctrl *ctrl, bool shutdown)
>
>         if (ctrl->ctrl.queue_count > 1) {
>                 nvme_stop_queues(&ctrl->ctrl);
> +               nvme_rdma_stop_io_queues(ctrl);
>                 blk_mq_tagset_busy_iter(&ctrl->tag_set,
>                                         nvme_cancel_request, &ctrl->ctrl);
>                 nvme_rdma_destroy_io_queues(ctrl, shutdown);
> +               if (shutdown)
> +                       nvme_start_queues(&ctrl->ctrl);
>         }
>
>         if (shutdown)
> @@ -1740,10 +1745,11 @@ static void nvme_rdma_shutdown_ctrl(struct nvme_rdma_ctrl *ctrl, bool shutdown)
>                 nvme_disable_ctrl(&ctrl->ctrl, ctrl->ctrl.cap);
>
>         blk_mq_quiesce_queue(ctrl->ctrl.admin_q);
> +       nvme_rdma_stop_queue(&ctrl->queues[0]);
>         blk_mq_tagset_busy_iter(&ctrl->admin_tag_set,
>                                 nvme_cancel_request, &ctrl->ctrl);
> -       blk_mq_unquiesce_queue(ctrl->ctrl.admin_q);
>         nvme_rdma_destroy_admin_queue(ctrl, shutdown);
> +       blk_mq_unquiesce_queue(ctrl->ctrl.admin_q);
>  }
>
>  static void nvme_rdma_delete_ctrl(struct nvme_ctrl *ctrl)
> --



-- 
Alon Horev
+972-524-517-627


* nvme-rdma corrupts memory upon timeout
  2018-02-26  7:53     ` Alon Horev
@ 2018-02-26 12:37       ` Sagi Grimberg
  2018-03-01  9:12         ` Sagi Grimberg
  0 siblings, 1 reply; 10+ messages in thread
From: Sagi Grimberg @ 2018-02-26 12:37 UTC (permalink / raw)



> This patch still returns to userspace after queuing work and may
> result in corruption.

That's probably the one request that timed out, as we complete it
early (before the teardown runs).

Does this patch on top help?
--
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index e793b0899d4e..50bbf88b82f6 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1590,10 +1590,7 @@ nvme_rdma_timeout(struct request *rq, bool reserved)
         /* queue error recovery */
         nvme_rdma_error_recovery(req->queue->ctrl);

-       /* fail with DNR on cmd timeout */
-       nvme_req(rq)->status = NVME_SC_ABORT_REQ | NVME_SC_DNR;
-
-       return BLK_EH_HANDLED;
+       return BLK_EH_RESET_TIMER;
  }

  /*
--
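
For reference, the timeout verdicts of this era (from
include/linux/blkdev.h):
--
enum blk_eh_timer_return {
        BLK_EH_NOT_HANDLED,     /* block layer completes the request */
        BLK_EH_HANDLED,         /* driver already completed the request */
        BLK_EH_RESET_TIMER,     /* re-arm the timer, rq stays in flight */
};
--
Returning BLK_EH_RESET_TIMER keeps the request in flight until error
recovery cancels it after the queues are torn down, instead of
completing it while the QP may still be live.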


* nvme-rdma corrupts memory upon timeout
  2018-02-25 18:14   ` Sagi Grimberg
  2018-02-26  7:53     ` Alon Horev
@ 2018-02-26 18:50     ` Bart Van Assche
  2018-02-26 22:20     ` Bart Van Assche
  2 siblings, 0 replies; 10+ messages in thread
From: Bart Van Assche @ 2018-02-26 18:50 UTC (permalink / raw)


On 02/25/18 10:14, Sagi Grimberg wrote:
> Or maybe this should do a better job:
> -- 
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index 4c32518a6c81..e45801fe78c1 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -956,12 +956,14 @@ static void nvme_rdma_error_recovery_work(struct work_struct *work)
> 
>          if (ctrl->ctrl.queue_count > 1) {
>                  nvme_stop_queues(&ctrl->ctrl);
> +               nvme_rdma_stop_io_queues(ctrl);
>                  blk_mq_tagset_busy_iter(&ctrl->tag_set,
>                                          nvme_cancel_request, &ctrl->ctrl);
>                  nvme_rdma_destroy_io_queues(ctrl, false);
>          }
> 
>          blk_mq_quiesce_queue(ctrl->ctrl.admin_q);
> +       nvme_rdma_stop_queue(&ctrl->queues[0]);
>          blk_mq_tagset_busy_iter(&ctrl->admin_tag_set,
>                                  nvme_cancel_request, &ctrl->ctrl);
>          nvme_rdma_destroy_admin_queue(ctrl, false);
> @@ -1729,9 +1731,12 @@ static void nvme_rdma_shutdown_ctrl(struct nvme_rdma_ctrl *ctrl, bool shutdown)
> 
>          if (ctrl->ctrl.queue_count > 1) {
>                  nvme_stop_queues(&ctrl->ctrl);
> +               nvme_rdma_stop_io_queues(ctrl);
>                  blk_mq_tagset_busy_iter(&ctrl->tag_set,
>                                          nvme_cancel_request, &ctrl->ctrl);
>                  nvme_rdma_destroy_io_queues(ctrl, shutdown);
> +               if (shutdown)
> +                       nvme_start_queues(&ctrl->ctrl);
>          }
> 
>          if (shutdown)
> @@ -1740,10 +1745,11 @@ static void nvme_rdma_shutdown_ctrl(struct nvme_rdma_ctrl *ctrl, bool shutdown)
>                  nvme_disable_ctrl(&ctrl->ctrl, ctrl->ctrl.cap);
> 
>          blk_mq_quiesce_queue(ctrl->ctrl.admin_q);
> +       nvme_rdma_stop_queue(&ctrl->queues[0]);
>          blk_mq_tagset_busy_iter(&ctrl->admin_tag_set,
>                                  nvme_cancel_request, &ctrl->ctrl);
> -       blk_mq_unquiesce_queue(ctrl->ctrl.admin_q);
>          nvme_rdma_destroy_admin_queue(ctrl, shutdown);
> +       blk_mq_unquiesce_queue(ctrl->ctrl.admin_q);
>  }
> 
>  static void nvme_rdma_delete_ctrl(struct nvme_ctrl *ctrl)

Hello Sagi,

Can you resend this patch using e-mail software that does not corrupt a 
patch? All tabs in your patch have been changed into 8 high-ASCII spaces 
(ASCII code 160).

Thanks,

Bart.


* nvme-rdma corrupts memory upon timeout
  2018-02-25 18:14   ` Sagi Grimberg
  2018-02-26  7:53     ` Alon Horev
  2018-02-26 18:50     ` Bart Van Assche
@ 2018-02-26 22:20     ` Bart Van Assche
  2 siblings, 0 replies; 10+ messages in thread
From: Bart Van Assche @ 2018-02-26 22:20 UTC (permalink / raw)


On 02/25/18 10:14, Sagi Grimberg wrote:
> Or maybe this should do a better job:
> -- 
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index 4c32518a6c81..e45801fe78c1 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -956,12 +956,14 @@ static void nvme_rdma_error_recovery_work(struct work_struct *work)
> 
>          if (ctrl->ctrl.queue_count > 1) {
>                  nvme_stop_queues(&ctrl->ctrl);
> +               nvme_rdma_stop_io_queues(ctrl);
>                  blk_mq_tagset_busy_iter(&ctrl->tag_set,
>                                          nvme_cancel_request, &ctrl->ctrl);
>                  nvme_rdma_destroy_io_queues(ctrl, false);
>          }
> 
>          blk_mq_quiesce_queue(ctrl->ctrl.admin_q);
> +       nvme_rdma_stop_queue(&ctrl->queues[0]);
>          blk_mq_tagset_busy_iter(&ctrl->admin_tag_set,
>                                  nvme_cancel_request, &ctrl->ctrl);

Hello Sagi,

With this change applied, I think what nvme_rdma_error_recovery_work() 
does for I/O and admin queues is as follows:
- Call blk_mq_quiesce_queue() to wait until concurrent .queue_rq() calls
   have finished.
- Call nvme_rdma_stop_io_queues() to change the QP state into "error"
   and to wait until all RDMA completions have been processed.
- Call blk_mq_tagset_busy_iter() to cancel any pending block layer
   requests.
- Call blk_mq_unfreeze_queue() to resume the request queues.
- Call nvme_rdma_reconnect_or_remove().

The above patch seems like an improvement to me but I don't think that 
it fixes the race between nvme_cancel_request() and 
nvme_rdma_queue_rq(). Has it been considered to modify 
nvme_rdma_error_recovery_work() as follows:
* Clear NVME_RDMA_Q_LIVE.
* Change the RDMA QP state into "error".
* Freeze and unfreeze the block layer request queue. The freeze will
   wait until all pending requests have finished.
* Call nvme_rdma_reconnect_or_remove().

That last sequence, sketched below, has the following advantages:
* Draining the RDMA QP explicitly is no longer necessary.
* It fixes the race with nvme_cancel_request() by not calling
   nvme_cancel_request().
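
A rough sketch of that sequence (helper and field names follow
drivers/nvme/host/rdma.c, but treat this as illustration, not a tested
patch):
--
static void nvme_rdma_error_recovery_work(struct work_struct *work)
{
        struct nvme_rdma_ctrl *ctrl = container_of(work,
                        struct nvme_rdma_ctrl, err_work);
        int i;

        for (i = 0; i < ctrl->ctrl.queue_count; i++) {
                struct nvme_rdma_queue *queue = &ctrl->queues[i];

                clear_bit(NVME_RDMA_Q_LIVE, &queue->flags);
                ib_drain_qp(queue->qp); /* QP -> error, WRs flushed */
        }

        /* the freeze waits until all pending requests have finished */
        nvme_start_freeze(&ctrl->ctrl);
        nvme_wait_freeze(&ctrl->ctrl);
        nvme_unfreeze(&ctrl->ctrl);

        nvme_rdma_reconnect_or_remove(ctrl);
}
--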

Thanks,

Bart.


* nvme-rdma corrupts memory upon timeout
  2018-02-26 12:37       ` Sagi Grimberg
@ 2018-03-01  9:12         ` Sagi Grimberg
  2018-03-29 13:07           ` Max Gurtovoy
  0 siblings, 1 reply; 10+ messages in thread
From: Sagi Grimberg @ 2018-03-01  9:12 UTC (permalink / raw)



>> This patch still returns to userspace after queuing work and may
>> result in corruption.
> 
> That's probably that one request being timed out as we complete
> it earlier.
> 
> Does this patch on top help?
> -- 
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index e793b0899d4e..50bbf88b82f6 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -1590,10 +1590,7 @@ nvme_rdma_timeout(struct request *rq, bool reserved)
>          /* queue error recovery */
>          nvme_rdma_error_recovery(req->queue->ctrl);
> 
> -       /* fail with DNR on cmd timeout */
> -       nvme_req(rq)->status = NVME_SC_ABORT_REQ | NVME_SC_DNR;
> -
> -       return BLK_EH_HANDLED;
> +       return BLK_EH_RESET_TIMER;
>  }
> 
>  /*
> -- 

Did this help?


* nvme-rdma corrupts memory upon timeout
  2018-03-01  9:12         ` Sagi Grimberg
@ 2018-03-29 13:07           ` Max Gurtovoy
  0 siblings, 0 replies; 10+ messages in thread
From: Max Gurtovoy @ 2018-03-29 13:07 UTC (permalink / raw)




On 3/1/2018 11:12 AM, Sagi Grimberg wrote:
> 
>>> This patch still returns to userspace after queuing work and may
>>> result in corruption.
>>
>> That's probably that one request being timed out as we complete
>> it earlier.
>>
>> Does this patch on top help?
>> -- 
>> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
>> index e793b0899d4e..50bbf88b82f6 100644
>> --- a/drivers/nvme/host/rdma.c
>> +++ b/drivers/nvme/host/rdma.c
>> @@ -1590,10 +1590,7 @@ nvme_rdma_timeout(struct request *rq, bool reserved)
>>          /* queue error recovery */
>>          nvme_rdma_error_recovery(req->queue->ctrl);
>>
>> -       /* fail with DNR on cmd timeout */
>> -       nvme_req(rq)->status = NVME_SC_ABORT_REQ | NVME_SC_DNR;
>> -
>> -       return BLK_EH_HANDLED;
>> +       return BLK_EH_RESET_TIMER;
>>  }
>>
>>  /*
>> -- 
> 
> Did this help?

It helped in our setup. Can we push this?
Regarding stopping/draining the QP before calling
blk_mq_tagset_busy_iter(): it seems right to me too, since we want to
stop getting completions from the HCA before we cancel all the
inflight requests.

We are completing the request from various contexts and can get a NULL
dereference in nvme_rdma_process_nvme_rsp() (in req->mr). We added
debug prints to check whether the rq was in flight during
nvme_rdma_process_nvme_rsp(), and it wasn't (another context had
already completed the request).
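
In other words, two contexts race on the same request (a sketch of the
pattern, not exact code):
--
/*
 * context 1: HCA completion processing
 *   nvme_rdma_recv_done()
 *     -> nvme_rdma_process_nvme_rsp()
 *          dereferences req->mr          <-- NULL/freed by now ...
 *
 * context 2: error recovery
 *   blk_mq_tagset_busy_iter(.., nvme_cancel_request, ..)
 *     -> blk_mq_complete_request(rq)     <-- ... because it completed here
 */
--
Stopping/draining the QP first guarantees context 1 can no longer run
by the time context 2 iterates over the tag set.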

-Max.


