* [PATCH] nvme-rdma: Always signal fabrics private commands
@ 2016-06-23 16:08 Sagi Grimberg
  2016-06-23 18:17 ` Steve Wise
  2016-06-24  7:07 ` Christoph Hellwig
  0 siblings, 2 replies; 18+ messages in thread
From: Sagi Grimberg @ 2016-06-23 16:08 UTC (permalink / raw)


Some RDMA adapters were observed to have issues
with selective completion signaling which might
cause a use-after-free condition when the device
unexpectedly reports a completion after the caller
context (wr_cqe) has already been freed.

The first time this was detected was for flush
requests that were not allocated from the tagset;
now we see it in the error path of the fabrics
connect (admin). The normal I/O selective signaling
is safe because we free the tagset only after all
the queue-pairs have been drained.

Reported-by: Steve Wise <swise at opengridcomputing.com>
Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
---
 drivers/nvme/host/rdma.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index b939f89ad936..bf141cb4e671 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1453,7 +1453,8 @@ static int nvme_rdma_queue_rq(struct blk_mq_hw_ctx *hctx,
 	ib_dma_sync_single_for_device(dev, sqe->dma,
 			sizeof(struct nvme_command), DMA_TO_DEVICE);
 
-	if (rq->cmd_type == REQ_TYPE_FS && req_op(rq) == REQ_OP_FLUSH)
+	if ((rq->cmd_type == REQ_TYPE_FS && req_op(rq) == REQ_OP_FLUSH) ||
+	    rq->cmd_type == REQ_TYPE_DRV_PRIV)
 		flush = true;
 	ret = nvme_rdma_post_send(queue, sqe, req->sge, req->num_sge,
 			req->need_inval ? &req->reg_wr.wr : NULL, flush);
-- 
1.9.1
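
For context, the flush flag computed above is what makes nvme_rdma_post_send()
issue a signaled rather than unsignaled send. A rough sketch of how selective
signaling looks with the kernel verbs API follows; the helper name
(post_send_sketch) and the exact flag handling are illustrative, not the
literal driver code:

#include <linux/string.h>
#include <rdma/ib_verbs.h>

/*
 * Illustrative sketch of a selectively signaled send with the kernel
 * verbs API.  "first" optionally chains an MR registration WR in front
 * of the send, as nvme-rdma does with req->reg_wr.wr.
 */
static int post_send_sketch(struct ib_qp *qp, struct ib_cqe *cqe,
		struct ib_sge *sge, u32 num_sge,
		struct ib_send_wr *first, bool signal)
{
	struct ib_send_wr wr, *bad_wr;

	memset(&wr, 0, sizeof(wr));
	wr.wr_cqe     = cqe;	/* completion context; ->done() runs on it */
	wr.sg_list    = sge;
	wr.num_sge    = num_sge;
	wr.opcode     = IB_WR_SEND;
	/*
	 * Unsignaled unless the caller asks otherwise; the patch above
	 * forces signaling for flush and fabrics-private commands.
	 */
	wr.send_flags = signal ? IB_SEND_SIGNALED : 0;

	if (first)
		first->next = &wr;
	else
		first = &wr;

	return ib_post_send(qp, first, &bad_wr);
}

The use-after-free discussed below happens when the ib_cqe passed here sits in
memory that is freed while the WR can still be flushed.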


* [PATCH] nvme-rdma: Always signal fabrics private commands
  2016-06-23 16:08 [PATCH] nvme-rdma: Always signal fabrics private commands Sagi Grimberg
@ 2016-06-23 18:17 ` Steve Wise
  2016-06-24  7:07 ` Christoph Hellwig
  1 sibling, 0 replies; 18+ messages in thread
From: Steve Wise @ 2016-06-23 18:17 UTC (permalink / raw)


This fixes the problem.  Thanks Sagi!

Tested-by: Steve Wise <swise at opengridcomputing.com>


* [PATCH] nvme-rdma: Always signal fabrics private commands
  2016-06-23 16:08 [PATCH] nvme-rdma: Always signal fabrics private commands Sagi Grimberg
  2016-06-23 18:17 ` Steve Wise
@ 2016-06-24  7:07 ` Christoph Hellwig
  2016-06-24 14:05   ` Steve Wise
  2016-06-26 16:41   ` Sagi Grimberg
  1 sibling, 2 replies; 18+ messages in thread
From: Christoph Hellwig @ 2016-06-24  7:07 UTC (permalink / raw)


On Thu, Jun 23, 2016@07:08:24PM +0300, Sagi Grimberg wrote:
> Some RDMA adapters were observed to have issues
> with selective completion signaling which might
> cause a use-after-free condition when the device
> unexpectedly reports a completion after the caller
> context (wr_cqe) has already been freed.

I'd really love to fully root cause this issue and find a way
to fix it in the driver or core.  This isn't really something
a ULP should have to care about, and I'm trying to understand how
the existing ULPs get away without this.

I think we should apply this anyway for now unless we can come up
with something better, but I'm not exactly happy about it.

> The first time this was detected was for flush
> requests that were not allocated from the tagset;
> now we see it in the error path of the fabrics
> connect (admin). The normal I/O selective signaling
> is safe because we free the tagset only after all
> the queue-pairs have been drained.

So for flush we needed this because the flush request is allocated
as part of the hctx, but pass through requests aren't really
special in terms of allocation.  What's the reason we need to
treat these special?


* [PATCH] nvme-rdma: Always signal fabrics private commands
  2016-06-24  7:07 ` Christoph Hellwig
@ 2016-06-24 14:05   ` Steve Wise
  2016-06-26 16:41   ` Sagi Grimberg
  1 sibling, 0 replies; 18+ messages in thread
From: Steve Wise @ 2016-06-24 14:05 UTC (permalink / raw)


> On Thu, Jun 23, 2016@07:08:24PM +0300, Sagi Grimberg wrote:
> > Some RDMA adapters were observed to have issues
> > with selective completion signaling which might
> > cause a use-after-free condition when the device
> > unexpectedly reports a completion after the caller
> > context (wr_cqe) has already been freed.
> 
> I'd really love to fully root cause this issue and find a way
> to fix it in the driver or core.  This isn't really something
> a ULP should have to care about, and I'm trying to understand how
> the existing ULPs get away without this.
>

Haven't we root caused it?  iw_cxgb4 cannot free up SQ slots containing
unsignaled WRs until a subsequent signaled WR is completed and polled by the
ULP.  If the QP is moved out of RTS before that happens then the unsignaled WRs
are completed as FLUSHED.  And NVMF is not ensuring that for all unsignaled WRs,
the wr_cqe remains around until the qp is flushed. 

From a quick browse of the ULPs that support iw_cxgb4, it looks like NFSRDMA
server always signals, and NFSRDMA client always posts chains that end in a
signaled WR (not 100% sure on this).  iser does control its signaling, and it
perhaps suffers from the same problem.  But the target side has only now become
enabled for iwarp/cxgb4, so we'll see if we hit the same problems.  It appears
isert always signals.  

> I think we should apply this anyway for now unless we can come up
> with something better, but I'm not exactly happy about it.
> 
> > The first time this was detected was for flush
> > requests that were not allocated from the tagset;
> > now we see it in the error path of the fabrics
> > connect (admin). The normal I/O selective signaling
> > is safe because we free the tagset only after all
> > the queue-pairs have been drained.
> 
> So for flush we needed this because the flush request is allocated
> as part of the hctx, but pass through requests aren't really
> special in terms of allocation.  What's the reason we need to
> treat these special?

Perhaps it just avoids the problem: being a signaled WR, it lets iw_cxgb4
know the preceding unsignaled WRs are complete...

I'm happy to help with guidance.  I'm not very familiar with the NVMF code above
its use of RDMA though.  And my solutions to try and fix this have all been
considered incorrect. :)


* [PATCH] nvme-rdma: Always signal fabrics private commands
  2016-06-24  7:07 ` Christoph Hellwig
  2016-06-24 14:05   ` Steve Wise
@ 2016-06-26 16:41   ` Sagi Grimberg
  2016-06-28  8:41     ` Christoph Hellwig
  1 sibling, 1 reply; 18+ messages in thread
From: Sagi Grimberg @ 2016-06-26 16:41 UTC (permalink / raw)



>> Some RDMA adapters were observed to have issues
>> with selective completion signaling which might
>> cause a use-after-free condition when the device
>> unexpectedly reports a completion after the caller
>> context (wr_cqe) has already been freed.
>
> I'd really love to fully root cause this issue and find a way
> to fix it in the driver or core.
> This isn't really something a ULP should have to care about, and I'm trying to understand how
> the existing ULPs get away without this.

It's a cxgb4-specific work-around (the only device this was observed
on). I'm not sure how this can be addressed in the core. We could note
that in a code comment.

> I think we should apply this anyway for now unless we can come up
> with something better, but I'm not exactly happy about it.
>
>> The first time this was detected was for flush
>> requests that were not allocated from the tagset;
>> now we see it in the error path of the fabrics
>> connect (admin). The normal I/O selective signaling
>> is safe because we free the tagset only after all
>> the queue-pairs have been drained.
>
> So for flush we needed this because the flush request is allocated
> as part of the hctx, but pass through requests aren't really
> special in terms of allocation.  What's the reason we need to
> treat these special?

OK, here's what I think is going on. We allocate the rdma queue and
issue the admin connect (unsignaled). The connect fails, and we do an
orderly teardown of the admin queue (free the tagset, put the device
and free the queue).

Because the cxgb4 driver is responsible for flushing pending work
requests and has no way of telling what the HW processed other than
the head/tail indexes (which are probably updated at completion time),
it sees the admin connect in the send queue, doesn't know whether the
HW did anything with it, and flushes it anyway.

Our error path frees the tagset before we free the queue (which drains
the qp), so we end up with a use-after-free condition (->done() lives in
freed tag memory).

Note that we must allocate the qp before we allocate the tagset because
we need the device when init_request callouts come. So we allocated
before, we free after. An alternative fix was to free the queue before
the tagset even though we allocated it before (as Steve suggested).
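
In code terms, the ordering described above is roughly the following. The
sketch_* names and the struct are made up for illustration; only the ordering
and the verbs/blk-mq calls mirror what the thread describes:

#include <linux/blk-mq.h>
#include <rdma/ib_verbs.h>
#include <rdma/rdma_cm.h>

/*
 * Illustrative sketch of the failing teardown order described above.
 * struct sketch_ctrl and the function are made up for clarity.
 */
struct sketch_ctrl {
	struct rdma_cm_id	*cm_id;
	struct ib_qp		*qp;
	struct blk_mq_tag_set	admin_tag_set;
};

static void sketch_admin_connect_error_path(struct sketch_ctrl *ctrl)
{
	/*
	 * 1. The admin connect was posted *unsignaled* and failed at the
	 *    NVMe level; its send WR may still sit in the cxgb4 SQ.
	 */

	/*
	 * 2. The tag set is freed first: the request that carried the
	 *    connect command, including its embedded ib_cqe (wr_cqe),
	 *    is gone.
	 */
	blk_mq_free_tag_set(&ctrl->admin_tag_set);

	/*
	 * 3. Only now is the queue torn down.  cxgb4 cannot tell whether
	 *    the HW consumed the unsignaled WR, so it flushes it, and the
	 *    core CQ code calls wc->wr_cqe->done() on freed memory.
	 */
	rdma_disconnect(ctrl->cm_id);
	ib_drain_qp(ctrl->qp);
}

Steve's patch later in the thread addresses exactly this by splitting the stop
(disconnect + drain) out of the free, so the drain can run before
blk_mq_free_tag_set() in the error path.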


* [PATCH] nvme-rdma: Always signal fabrics private commands
  2016-06-26 16:41   ` Sagi Grimberg
@ 2016-06-28  8:41     ` Christoph Hellwig
  2016-06-28 14:20       ` Steve Wise
  0 siblings, 1 reply; 18+ messages in thread
From: Christoph Hellwig @ 2016-06-28  8:41 UTC (permalink / raw)


On Sun, Jun 26, 2016@07:41:39PM +0300, Sagi Grimberg wrote:
> Our error path frees the tagset before we free the queue (which drains
> the qp), so we end up with a use-after-free condition (->done() lives in
> freed tag memory).
>
> Note that we must allocate the qp before we allocate the tagset because
> we need the device when init_request callouts come. So we allocated
> before, we free after. An alternative fix was to free the queue before
> the tagset even though we allocated it before (as Steve suggested).

Would draining, but not freeing the qp before freeing the tagset work?
That seems like the most sensible option here.


* [PATCH] nvme-rdma: Always signal fabrics private commands
  2016-06-28  8:41     ` Christoph Hellwig
@ 2016-06-28 14:20       ` Steve Wise
  2016-06-29 14:57         ` Steve Wise
  0 siblings, 1 reply; 18+ messages in thread
From: Steve Wise @ 2016-06-28 14:20 UTC (permalink / raw)


> On Sun, Jun 26, 2016@07:41:39PM +0300, Sagi Grimberg wrote:
> > Our error path frees the tagset before we free the queue (which drains
> > the qp), so we end up with a use-after-free condition (->done() lives in
> > freed tag memory).
> >
> > Note that we must allocate the qp before we allocate the tagset because
> > we need the device when init_request callouts come. So we allocated
> > before, we free after. An alternative fix was to free the queue before
> > the tagset even though we allocated it before (as Steve suggested).
> 
> Would draining, but not freeing the qp before freeing the tagset work?
> That seems like the most sensible option here.

disconnecting and draining, I think.


* [PATCH] nvme-rdma: Always signal fabrics private commands
  2016-06-28 14:20       ` Steve Wise
@ 2016-06-29 14:57         ` Steve Wise
  2016-06-30  6:36           ` 'Christoph Hellwig'
  0 siblings, 1 reply; 18+ messages in thread
From: Steve Wise @ 2016-06-29 14:57 UTC (permalink / raw)


> > On Sun, Jun 26, 2016@07:41:39PM +0300, Sagi Grimberg wrote:
> > > Our error path frees the tagset before we free the queue (which drains
> > > the qp), so we end up with a use-after-free condition (->done() lives in
> > > freed tag memory).
> > >
> > > Note that we must allocate the qp before we allocate the tagset because
> > > we need the device when init_request callouts come. So we allocated
> > > before, we free after. An alternative fix was to free the queue before
> > > the tagset even though we allocated it before (as Steve suggested).
> >
> > Would draining, but not freeing the qp before freeing the tagset work?
> > That seems like the most sensible option here.
> 
> disconnecting and draining, I think.

How about this?

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index e1205c0..d3a0396 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -604,23 +604,32 @@ out_destroy_cm_id:
        return ret;
 }

-static void nvme_rdma_free_queue(struct nvme_rdma_queue *queue)
+static void nvme_rdma_stop_queue(struct nvme_rdma_queue *queue)
 {
-       if (!test_and_clear_bit(NVME_RDMA_Q_CONNECTED, &queue->flags))
-               return;
-
        rdma_disconnect(queue->cm_id);
        ib_drain_qp(queue->qp);
+}
+
+static void nvme_rdma_free_queue(struct nvme_rdma_queue *queue)
+{
        nvme_rdma_destroy_queue_ib(queue);
        rdma_destroy_id(queue->cm_id);
 }

+static void nvme_rdma_stop_and_free_queue(struct nvme_rdma_queue *queue)
+{
+       if (!test_and_clear_bit(NVME_RDMA_Q_CONNECTED, &queue->flags))
+               return;
+       nvme_rdma_stop_queue(queue);
+       nvme_rdma_free_queue(queue);
+}
+
 static void nvme_rdma_free_io_queues(struct nvme_rdma_ctrl *ctrl)
 {
        int i;

        for (i = 1; i < ctrl->queue_count; i++)
-               nvme_rdma_free_queue(&ctrl->queues[i]);
+               nvme_rdma_stop_and_free_queue(&ctrl->queues[i]);
 }

 static int nvme_rdma_connect_io_queues(struct nvme_rdma_ctrl *ctrl)
@@ -653,7 +662,7 @@ static int nvme_rdma_init_io_queues(struct nvme_rdma_ctrl
*ctrl)

 out_free_queues:
        for (; i >= 1; i--)
-               nvme_rdma_free_queue(&ctrl->queues[i]);
+               nvme_rdma_stop_and_free_queue(&ctrl->queues[i]);

        return ret;
 }
@@ -662,7 +671,7 @@ static void nvme_rdma_destroy_admin_queue(struct
nvme_rdma_ctrl *ctrl)
 {
        nvme_rdma_free_qe(ctrl->queues[0].device->dev, &ctrl->async_event_sqe,
                        sizeof(struct nvme_command), DMA_TO_DEVICE);
-       nvme_rdma_free_queue(&ctrl->queues[0]);
+       nvme_rdma_stop_and_free_queue(&ctrl->queues[0]);
        blk_cleanup_queue(ctrl->ctrl.admin_q);
        blk_mq_free_tag_set(&ctrl->admin_tag_set);
        nvme_rdma_dev_put(ctrl->device);
@@ -705,7 +714,7 @@ static void nvme_rdma_reconnect_ctrl_work(struct work_struct
*work)
                        goto requeue;
        }

-       nvme_rdma_free_queue(&ctrl->queues[0]);
+       nvme_rdma_stop_and_free_queue(&ctrl->queues[0]);

        ret = blk_mq_reinit_tagset(&ctrl->admin_tag_set);
        if (ret)
@@ -1617,6 +1626,7 @@ static int nvme_rdma_configure_admin_queue(struct
nvme_rdma_ctrl *ctrl)
 out_cleanup_queue:
        blk_cleanup_queue(ctrl->ctrl.admin_q);
 out_free_tagset:
+       nvme_rdma_stop_queue(&ctrl->queues[0]);
        blk_mq_free_tag_set(&ctrl->admin_tag_set);
 out_put_dev:
        nvme_rdma_dev_put(ctrl->device);


* [PATCH] nvme-rdma: Always signal fabrics private commands
  2016-06-29 14:57         ` Steve Wise
@ 2016-06-30  6:36           ` 'Christoph Hellwig'
  2016-06-30 13:44             ` Steve Wise
  0 siblings, 1 reply; 18+ messages in thread
From: 'Christoph Hellwig' @ 2016-06-30  6:36 UTC (permalink / raw)


On Wed, Jun 29, 2016@09:57:16AM -0500, Steve Wise wrote:
> > disconnecting and draining, I think.
> 
> How about this?

This looks reasonable, although it came through totally whitespace
damaged..


* [PATCH] nvme-rdma: Always signal fabrics private commands
  2016-06-30  6:36           ` 'Christoph Hellwig'
@ 2016-06-30 13:44             ` Steve Wise
  2016-06-30 15:10               ` Steve Wise
  2016-07-13 10:08               ` Sagi Grimberg
  0 siblings, 2 replies; 18+ messages in thread
From: Steve Wise @ 2016-06-30 13:44 UTC (permalink / raw)


> >
> > How about this?
> 
> This looks reasonable, although it came through totally whitespace
> damaged..
> 

Yes, I just cut/pasted.  I'll send out a patch today.  Sagi, did you have any
comments?


* [PATCH] nvme-rdma: Always signal fabrics private commands
  2016-06-30 13:44             ` Steve Wise
@ 2016-06-30 15:10               ` Steve Wise
  2016-07-13 10:08               ` Sagi Grimberg
  1 sibling, 0 replies; 18+ messages in thread
From: Steve Wise @ 2016-06-30 15:10 UTC (permalink / raw)


> 
> > >
> > > How about this?
> >
> > This looks reasonable, although it came through totally whitespace
> > damaged..
> >
> 
> Yes, I just cut/pasted.  I'll send out a patch today.  Sagi, did you have any
> comments?
> 

Hey Christoph,

I see you added it to nvmf-all.3.  Should I still send out a formal patch? 


* [PATCH] nvme-rdma: Always signal fabrics private commands
  2016-06-30 13:44             ` Steve Wise
  2016-06-30 15:10               ` Steve Wise
@ 2016-07-13 10:08               ` Sagi Grimberg
  2016-07-13 10:11                 ` Sagi Grimberg
  1 sibling, 1 reply; 18+ messages in thread
From: Sagi Grimberg @ 2016-07-13 10:08 UTC (permalink / raw)



>>> How about this?
>>
>> This looks reasonable, although it came through totally whitespace
>> damaged..
>>
>
> Yes, I just cut/pasted.  I'll send out a patch today.  Sagi, did you have any
> comments?

Hi Steve,

Sorry for the late response, I was on vacation.

The patch looks fine,

Reviewed-by: Sagi Grimberg <sagi at grimberg.me>


* [PATCH] nvme-rdma: Always signal fabrics private commands
  2016-07-13 10:08               ` Sagi Grimberg
@ 2016-07-13 10:11                 ` Sagi Grimberg
  2016-07-13 14:28                   ` Steve Wise
  0 siblings, 1 reply; 18+ messages in thread
From: Sagi Grimberg @ 2016-07-13 10:11 UTC (permalink / raw)



> Hi Steve,
>
> Sorry for the late response, I was on vacation.
>
> The patch looks fine,
>
> Reviewed-by: Sagi Grimberg <sagi at grimberg.me>

Although, I think that we really should fix the
double completion for unsignaled wqes in cxgb4...
it might bite us elsewhere down the road.


* [PATCH] nvme-rdma: Always signal fabrics private commands
  2016-07-13 10:11                 ` Sagi Grimberg
@ 2016-07-13 14:28                   ` Steve Wise
  2016-07-13 14:47                     ` Sagi Grimberg
  0 siblings, 1 reply; 18+ messages in thread
From: Steve Wise @ 2016-07-13 14:28 UTC (permalink / raw)


 
> 
> > Hi Steve,
> >
> > Sorry for the late response, I was on vacation.
> >
> > The patch looks fine,
> >
> > Reviewed-by: Sagi Grimberg <sagi at grimberg.me>
> 
> Although, I think that we really should fix the
> double completion for unsignaled wqes in cxgb4...
> it might bite us elsewhere down the road.

Double completion?  When the QP exits RTS with pending unsignaled SQ WRs, cxgb4
doesn't know if those were actually completed by hardware, so they are completed
with FLUSH_ERR status.  I _could_ change cxgb4 to just eat those, but I'm a
little worried about breaking the iWARP Verbs semantics.  Perhaps I shouldn't
be.  It does seem to be causing lots of pain...
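
For reference, those flushed completions reach the ULP through the ib_cqe
->done() callback roughly as below. This is a generic sketch (struct
example_req and the handler are hypothetical), showing why the memory that
holds wr_cqe must outlive the QP flush:

#include <linux/kernel.h>
#include <rdma/ib_verbs.h>

/*
 * Generic sketch of a send completion handler under the ib_cqe API.
 * The structure wr_cqe is embedded in must still be live when a
 * flushed completion arrives during teardown.
 */
struct example_req {
	struct ib_cqe	cqe;
	/* ... command buffer, DMA mappings, etc. ... */
};

static void example_send_done(struct ib_cq *cq, struct ib_wc *wc)
{
	struct example_req *req =
		container_of(wc->wr_cqe, struct example_req, cqe);

	if (unlikely(wc->status != IB_WC_SUCCESS)) {
		/*
		 * Unsignaled sends that cxgb4 could not retire surface
		 * here as IB_WC_WR_FLUSH_ERR once the QP leaves RTS.
		 */
		pr_debug("req %p flushed: %s\n", req,
			 ib_wc_status_msg(wc->status));
		return;
	}

	/* ... normal completion handling for req ... */
}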


* [PATCH] nvme-rdma: Always signal fabrics private commands
  2016-07-13 14:28                   ` Steve Wise
@ 2016-07-13 14:47                     ` Sagi Grimberg
  2016-07-13 14:51                       ` Steve Wise
  0 siblings, 1 reply; 18+ messages in thread
From: Sagi Grimberg @ 2016-07-13 14:47 UTC (permalink / raw)



> Double completion?  When the QP exits RTS with pending unsignaled SQ WRs, cxgb4
> doesn't know if those were actually completed by hardware, so they are completed
> with FLUSH_ERR status.  I _could_ change cxgb4 to just eat those, but I'm a
> little worried about breaking the iWARP Verbs semantics.  Perhaps I shouldn't
> be.  It does seem to be causing lots of pain...

What exactly breaks iWARP semantics here?

Think of a case where we posted an unsignaled send, got a successful
reply from the peer, now we drain the qp, and the send which belongs
to a transaction that we already completed is flushed with error. Does
that sound like a correct behavior?


* [PATCH] nvme-rdma: Always signal fabrics private commands
  2016-07-13 14:47                     ` Sagi Grimberg
@ 2016-07-13 14:51                       ` Steve Wise
  2016-07-13 15:02                         ` Sagi Grimberg
  0 siblings, 1 reply; 18+ messages in thread
From: Steve Wise @ 2016-07-13 14:51 UTC (permalink / raw)


> > Double completion?  When the QP exits RTS with pending unsignaled SQ WRs,
> > cxgb4 doesn't know if those were actually completed by hardware, so they
> > are completed with FLUSH_ERR status.  I _could_ change cxgb4 to just eat
> > those, but I'm a little worried about breaking the iWARP Verbs semantics.
> > Perhaps I shouldn't be.  It does seem to be causing lots of pain...
> 
> What exactly breaks iWARP semantics here?
> 
> Think of a case where we posted an unsignaled send, got a successful
> reply from the peer, now we drain the qp, and the send which belongs
> to a transaction that we already completed is flushed with error. Does
> that sound like a correct behavior?

Well, from the specification, yes.    From
https://tools.ietf.org/html/draft-hilland-rddp-verbs-00#section-8.1.3.1 :

----
An Unsignaled WR is defined as completed successfully when all of
   the following rules are met:


   *   A Work Completion is retrieved from the CQ associated with the
       SQ where the unsignaled Work Request was posted,

   *   that Work Completion corresponds to a subsequent Work Request on
       the same Send Queue as the unsignaled Work Request, and

   *   the subsequent Work Request is ordered after the unsignaled Work
       Request as per the ordering rules. Depending on the Work Request
       used, this may require using the Local Fence indicator in order
       to guarantee ordering.
---

So in your example, even though the application knows the SEND made it because
the peer replied and generated an RQ completion, the iwarp provider does not
know the SEND made it...


* [PATCH] nvme-rdma: Always signal fabrics private commands
  2016-07-13 14:51                       ` Steve Wise
@ 2016-07-13 15:02                         ` Sagi Grimberg
  2016-07-13 15:12                           ` Steve Wise
  0 siblings, 1 reply; 18+ messages in thread
From: Sagi Grimberg @ 2016-07-13 15:02 UTC (permalink / raw)



>> Think of a case where we posted an unsignaled send, got a successful
>> reply from the peer, now we drain the qp, and the send which belongs
>> to a transaction that we already completed is flushed with error. Does
>> that sound like a correct behavior?
>
> Well, from the specification, yes.    From
> https://tools.ietf.org/html/draft-hilland-rddp-verbs-00#section-8.1.3.1 :
>
> ----
> An Unsignaled WR is defined as completed successfully when all of
>     the following rules are met:
>
>
>     *   A Work Completion is retrieved from the CQ associated with the
>         SQ where the unsignaled Work Request was posted,
>
>     *   that Work Completion corresponds to a subsequent Work Request on
>         the same Send Queue as the unsignaled Work Request, and
>
>     *   the subsequent Work Request is ordered after the unsignaled Work
>         Request as per the ordering rules. Depending on the Work Request
>         used, this may require using the Local Fence indicator in order
>         to guarantee ordering.
> ---

OK, thanks for educating me :)

> So in your example, even though the application knows the SEND made it because
> the peer replied and genereated an RQ completion, the iwarp provider does not
> know the SEND made it...

So we have two options here:

1. always make sure not to free anything related to SQEs until we
destroy the QP (hopefully won't bite us again, which is not a good
bet given that the sequence is not trivial).

2. always signal sends for iWARP (yukk...)

I pick poison 1 (for now...)
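
For anyone who does want to prototype option 2, the transport check could look
roughly like this; the helper and its parameters are illustrative, and only
rdma_protocol_iwarp() is the real core API:

#include <rdma/ib_verbs.h>

/*
 * Sketch of "option 2": force signaling on iWARP devices only.
 */
static bool sketch_must_signal(struct ib_qp *qp, u8 port_num, bool wanted)
{
	/*
	 * iWARP providers such as cxgb4 only retire unsignaled SQ WRs
	 * once a later signaled WR on the same SQ completes, so always
	 * signal there; keep selective signaling elsewhere.
	 */
	if (rdma_protocol_iwarp(qp->device, port_num))
		return true;

	return wanted;
}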


* [PATCH] nvme-rdma: Always signal fabrics private commands
  2016-07-13 15:02                         ` Sagi Grimberg
@ 2016-07-13 15:12                           ` Steve Wise
  0 siblings, 0 replies; 18+ messages in thread
From: Steve Wise @ 2016-07-13 15:12 UTC (permalink / raw)


> 
> 
> >> Think of a case where we posted an unsignaled send, got a successful
> >> reply from the peer, now we drain the qp, and the send which belongs
> >> to a transaction that we already completed is flushed with error. Does
> >> that sound like a correct behavior?
> >
> > Well, from the specification, yes.    From
> > https://tools.ietf.org/html/draft-hilland-rddp-verbs-00#section-8.1.3.1 :
> >
> > ----
> > An Unsignaled WR is defined as completed successfully when all of
> >     the following rules are met:
> >
> >
> >     *   A Work Completion is retrieved from the CQ associated with the
> >         SQ where the unsignaled Work Request was posted,
> >
> >     *   that Work Completion corresponds to a subsequent Work Request on
> >         the same Send Queue as the unsignaled Work Request, and
> >
> >     *   the subsequent Work Request is ordered after the unsignaled Work
> >         Request as per the ordering rules. Depending on the Work Request
> >         used, this may require using the Local Fence indicator in order
> >         to guarantee ordering.
> > ---
> 
> OK, thanks for educating me :)
>

No problem. :)  By the way, IB Verbs has the same rules.  From 1.3 of the IBTA
spec:
----
10.8.6 UNSIGNALED COMPLETIONS

An unsignaled Work Request that completed successfully is confirmed
when all of the following rules are met:

. A Work Completion is retrieved from the same CQ that is associated
with the Send Queue to which the unsignaled Work Request
was submitted.

. That Work Completion corresponds to a subsequent Work Request
on the same Send Queue as the unsignaled Work Request.

C10-108: The CI shall not access buffers associated with an Unsignaled
Work Request once a Work Completion has been retrieved that corresponds
to a subsequent Work Request on the same Send Queue.
----

 
> > So in your example, even though the application knows the SEND made it
> > because the peer replied and generated an RQ completion, the iwarp
> > provider does not know the SEND made it...
> 
> So we have two options here:
> 
> 1. always make sure not to free anything related to SQEs until we
> destroy the QP (hopefully won't bite us again, which is not a good
> bet given that the sequence is not trivial).
> 
> 2. always signal sends for iWARP (yukk...)
> 
> I pick poison 1 (for now...)
>

I agree.



Thread overview: 18+ messages
2016-06-23 16:08 [PATCH] nvme-rdma: Always signal fabrics private commands Sagi Grimberg
2016-06-23 18:17 ` Steve Wise
2016-06-24  7:07 ` Christoph Hellwig
2016-06-24 14:05   ` Steve Wise
2016-06-26 16:41   ` Sagi Grimberg
2016-06-28  8:41     ` Christoph Hellwig
2016-06-28 14:20       ` Steve Wise
2016-06-29 14:57         ` Steve Wise
2016-06-30  6:36           ` 'Christoph Hellwig'
2016-06-30 13:44             ` Steve Wise
2016-06-30 15:10               ` Steve Wise
2016-07-13 10:08               ` Sagi Grimberg
2016-07-13 10:11                 ` Sagi Grimberg
2016-07-13 14:28                   ` Steve Wise
2016-07-13 14:47                     ` Sagi Grimberg
2016-07-13 14:51                       ` Steve Wise
2016-07-13 15:02                         ` Sagi Grimberg
2016-07-13 15:12                           ` Steve Wise
