* [PATCH v2] nvme-rdma: support devices with queue size < 32
@ 2017-04-10 15:12 ` Marta Rybczynska
  0 siblings, 0 replies; 34+ messages in thread
From: Marta Rybczynska @ 2017-04-10 15:12 UTC (permalink / raw)
  To: Leon Romanovsky, Doug Ledford, Jason Gunthorpe,
	Christoph Hellwig, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, keith busch,
	axboe-b10kYP2dOMg, Max Gurtovoy
  Cc: Samuel Jones

In the case of a small NVMe-oF queue size (<32) we may enter
a deadlock: IB send completions are only signaled once every 32
requests, so with fewer than 32 send queue slots the queue fills
up before any completion is ever generated.

The error is seen as (using mlx5):
[ 2048.693355] mlx5_0:mlx5_ib_post_send:3765:(pid 7273):
[ 2048.693360] nvme nvme1: nvme_rdma_post_send failed with error code -12

This patch changes the way the signaling is done so
that it depends on the queue depth now. The magic define has
been removed completely.

Signed-off-by: Marta Rybczynska <marta.rybczynska-FNhOzJFKnXGHXe+LvDLADg@public.gmane.org>
Signed-off-by: Samuel Jones <sjones-FNhOzJFKnXGHXe+LvDLADg@public.gmane.org>
---
Changes from v1:
* signal by queue size/2, remove hardcoded 32
* support queue depth of 1

 drivers/nvme/host/rdma.c | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 47a479f..4de1b92 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1029,6 +1029,18 @@ static void nvme_rdma_send_done(struct ib_cq *cq, struct ib_wc *wc)
                nvme_rdma_wr_error(cq, wc, "SEND");
 }
 
+static inline int nvme_rdma_queue_sig_limit(struct nvme_rdma_queue *queue)
+{
+       int sig_limit;
+
+       /* We signal completion every queue depth/2 and also
+        * handle the case of possible device with queue_depth=1,
+        * where we would need to signal every message.
+        */
+       sig_limit = max(queue->queue_size / 2, 1);
+       return (++queue->sig_count % sig_limit) == 0;
+}
+
 static int nvme_rdma_post_send(struct nvme_rdma_queue *queue,
                struct nvme_rdma_qe *qe, struct ib_sge *sge, u32 num_sge,
                struct ib_send_wr *first, bool flush)
@@ -1056,9 +1068,6 @@ static int nvme_rdma_post_send(struct nvme_rdma_queue *queue,
         * Would have been way to obvious to handle this in hardware or
         * at least the RDMA stack..
         *
-        * This messy and racy code sniplet is copy and pasted from the iSER
-        * initiator, and the magic '32' comes from there as well.
-        *
         * Always signal the flushes. The magic request used for the flush
         * sequencer is not allocated in our driver's tagset and it's
         * triggered to be freed by blk_cleanup_queue(). So we need to
@@ -1066,7 +1075,7 @@ static int nvme_rdma_post_send(struct nvme_rdma_queue *queue,
         * embedded in request's payload, is not freed when __ib_process_cq()
         * calls wr_cqe->done().
         */
-       if ((++queue->sig_count % 32) == 0 || flush)
+       if (nvme_rdma_queue_sig_limit(queue) || flush)
                wr.send_flags |= IB_SEND_SIGNALED;
 
        if (first)
-- 
1.8.3.1
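
To make the failure mode concrete: with queue_size = 16 the send queue
has on the order of 16 slots, but with the hardcoded 32 none of the
first 16 sends is ever signaled, so no send WQE gets reclaimed and the
next ib_post_send() fails with -ENOMEM (the -12 in the log above).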

* Re: [PATCH v2] nvme-rdma: support devices with queue size < 32
  2017-04-10 15:12 ` Marta Rybczynska
@ 2017-04-10 15:16     ` Christoph Hellwig
  -1 siblings, 0 replies; 34+ messages in thread
From: Christoph Hellwig @ 2017-04-10 15:16 UTC (permalink / raw)
  To: Marta Rybczynska
  Cc: Leon Romanovsky, Doug Ledford, Jason Gunthorpe,
	Christoph Hellwig, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, keith busch,
	axboe-b10kYP2dOMg, Max Gurtovoy, Samuel Jones

> +static inline int nvme_rdma_queue_sig_limit(struct nvme_rdma_queue *queue)
> +{
> +       int sig_limit;
> +
> +       /* We signal completion every queue depth/2 and also
> +        * handle the case of possible device with queue_depth=1,
> +        * where we would need to signal every message.
> +        */
> +       sig_limit = max(queue->queue_size / 2, 1);
> +       return (++queue->sig_count % sig_limit) == 0;
> +}

I would love if we could do this at the ib_qp level.  I also know
you found a real bug and we want it fixed ASAP, so maybe for now
we should merge this one:

Reviewed-by: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>

and then do the generic version.  Marta, do you want to tackle the
generic version or should I look into this once I get a little time?

* Re: [PATCH v2] nvme-rdma: support devices with queue size < 32
  2017-04-10 15:16     ` Christoph Hellwig
@ 2017-04-10 15:21         ` Marta Rybczynska
  -1 siblings, 0 replies; 34+ messages in thread
From: Marta Rybczynska @ 2017-04-10 15:21 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Leon Romanovsky, Doug Ledford, Jason Gunthorpe,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, keith busch,
	axboe-b10kYP2dOMg, Max Gurtovoy, Samuel Jones

>> +static inline int nvme_rdma_queue_sig_limit(struct nvme_rdma_queue *queue)
>> +{
>> +       int sig_limit;
>> +
>> +       /* We signal completion every queue depth/2 and also
>> +        * handle the case of possible device with queue_depth=1,
>> +        * where we would need to signal every message.
>> +        */
>> +       sig_limit = max(queue->queue_size / 2, 1);
>> +       return (++queue->sig_count % sig_limit) == 0;
>> +}
> 
> I would love if we could do this at the ib_qp level.  I also know
> you found a real bug and we want it fixed ASAP, so maybe for now
> we should merge this one:
> 
> Reviewed-by: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>
> 
> and then do the generic version.  Marta, do you want to tackle the
> generic version or should I look into this once I get a little time?

I can look into whether I can do the same thing at the ib_qp level. It would
be great to have other people's feedback on this one before it is merged.
We're testing this one now, so shall we see what I can do in a week or
so? Is that fine for you?

Marta

* Re: [PATCH v2] nvme-rdma: support devices with queue size < 32
  2017-04-10 15:16     ` Christoph Hellwig
@ 2017-04-10 15:27         ` Bart Van Assche
  -1 siblings, 0 replies; 34+ messages in thread
From: Bart Van Assche @ 2017-04-10 15:27 UTC (permalink / raw)
  To: hch-jcswGhMUV9g, mrybczyn-FNhOzJFKnXGHXe+LvDLADg
  Cc: leonro-VPRAkNaXOzVWk0Htik3J/w, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	axboe-b10kYP2dOMg, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	maxg-VPRAkNaXOzVWk0Htik3J/w, samuel.jones-FNhOzJFKnXGHXe+LvDLADg,
	jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/,
	keith.busch-ral2JQCrhuEAvxtiuMwx3w,
	dledford-H+wXaHxf7aLQT0dZR+AlfA

On Mon, 2017-04-10 at 17:16 +0200, Christoph Hellwig wrote:
> > +static inline int nvme_rdma_queue_sig_limit(struct nvme_rdma_queue *queue)
> > +{
> > +       int sig_limit;
> > +
> > +       /* We signal completion every queue depth/2 and also
> > +        * handle the case of possible device with queue_depth=1,
> > +        * where we would need to signal every message.
> > +        */
> > +       sig_limit = max(queue->queue_size / 2, 1);
> > +       return (++queue->sig_count % sig_limit) == 0;
> > +}
> 
> I would love if we could do this at the ib_qp level.

Hello Christoph,

Please keep in mind that some but not all RDMA drivers need a construct like
this.

Bart.

* Re: [PATCH v2] nvme-rdma: support devices with queue size < 32
  2017-04-10 15:27         ` Bart Van Assche
@ 2017-04-10 15:31             ` hch
  -1 siblings, 0 replies; 34+ messages in thread
From: hch-jcswGhMUV9g @ 2017-04-10 15:31 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: hch-jcswGhMUV9g, mrybczyn-FNhOzJFKnXGHXe+LvDLADg,
	leonro-VPRAkNaXOzVWk0Htik3J/w, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	axboe-b10kYP2dOMg, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	maxg-VPRAkNaXOzVWk0Htik3J/w, samuel.jones-FNhOzJFKnXGHXe+LvDLADg,
	jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/,
	keith.busch-ral2JQCrhuEAvxtiuMwx3w,
	dledford-H+wXaHxf7aLQT0dZR+AlfA

On Mon, Apr 10, 2017 at 03:27:14PM +0000, Bart Van Assche wrote:
> > I would love if we could do this at the ib_qp level.
> 
> Hello Christoph,
> 
> Please keep in mind that some but not all RDMA drivers need a construct like
> this.

Sure, I don't want to force anyone to use it.  But I'd like to have
a generic helper that everyone can use instead of reinventing the wheel.
E.g. at least iser will need the same fix, and many drivers would
benefit from not always signalling completions even if they don't have
the ability yet.
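
As a rough sketch of what such a generic helper could look like (the
sig_count field on struct ib_qp and the helper itself are hypothetical
here, not an existing RDMA core API):

static inline bool ib_qp_should_signal(struct ib_qp *qp, u32 sq_size)
{
	/* Signal every sq_size/2 sends; a send queue of depth 1
	 * has to signal every send.
	 */
	u32 limit = max_t(u32, sq_size / 2, 1);

	if (++qp->sig_count < limit)
		return false;
	qp->sig_count = 0;
	return true;
}

A ULP such as nvme-rdma or iser would then do something like:

	if (ib_qp_should_signal(queue->qp, queue->queue_size) || flush)
		wr.send_flags |= IB_SEND_SIGNALED;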

* Re: [PATCH v2] nvme-rdma: support devices with queue size < 32
  2017-04-10 15:12 ` Marta Rybczynska
@ 2017-04-10 15:32     ` Bart Van Assche
  -1 siblings, 0 replies; 34+ messages in thread
From: Bart Van Assche @ 2017-04-10 15:32 UTC (permalink / raw)
  To: mrybczyn-FNhOzJFKnXGHXe+LvDLADg, leonro-VPRAkNaXOzVWk0Htik3J/w,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, hch-jcswGhMUV9g,
	axboe-b10kYP2dOMg, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	maxg-VPRAkNaXOzVWk0Htik3J/w,
	jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/,
	keith.busch-ral2JQCrhuEAvxtiuMwx3w,
	dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: samuel.jones-FNhOzJFKnXGHXe+LvDLADg

On Mon, 2017-04-10 at 17:12 +0200, Marta Rybczynska wrote:
> In the case of a small NVMe-oF queue size (<32) we may enter
> a deadlock: IB send completions are only signaled once every 32
> requests, so with fewer than 32 send queue slots the queue fills
> up before any completion is ever generated.
> 
> The error is seen as (using mlx5):
> [ 2048.693355] mlx5_0:mlx5_ib_post_send:3765:(pid 7273):
> [ 2048.693360] nvme nvme1: nvme_rdma_post_send failed with error code -12
> 
> This patch changes the way the signaling is done so
> that it depends on the queue depth now. The magic define has
> been removed completely.
> 
> Signed-off-by: Marta Rybczynska <marta.rybczynska-FNhOzJFKnXGHXe+LvDLADg@public.gmane.org>
> Signed-off-by: Samuel Jones <sjones-FNhOzJFKnXGHXe+LvDLADg@public.gmane.org>
> ---
> Changes from v1:
> * signal by queue size/2, remove hardcoded 32
> * support queue depth of 1
> 
>  drivers/nvme/host/rdma.c | 17 +++++++++++++----
>  1 file changed, 13 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index 47a479f..4de1b92 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -1029,6 +1029,18 @@ static void nvme_rdma_send_done(struct ib_cq *cq, struct ib_wc *wc)
>                 nvme_rdma_wr_error(cq, wc, "SEND");
>  }
>  
> +static inline int nvme_rdma_queue_sig_limit(struct nvme_rdma_queue *queue)
> +{
> +       int sig_limit;
> +
> +       /* We signal completion every queue depth/2 and also
> +        * handle the case of possible device with queue_depth=1,
> +        * where we would need to signal every message.
> +        */
> +       sig_limit = max(queue->queue_size / 2, 1);
> +       return (++queue->sig_count % sig_limit) == 0;
> +}
> +
>  static int nvme_rdma_post_send(struct nvme_rdma_queue *queue,
>                 struct nvme_rdma_qe *qe, struct ib_sge *sge, u32 num_sge,
>                 struct ib_send_wr *first, bool flush)
> @@ -1056,9 +1068,6 @@ static int nvme_rdma_post_send(struct nvme_rdma_queue *queue,
>          * Would have been way to obvious to handle this in hardware or
>          * at least the RDMA stack..
>          *
> -        * This messy and racy code sniplet is copy and pasted from the iSER
> -        * initiator, and the magic '32' comes from there as well.
> -        *
>          * Always signal the flushes. The magic request used for the flush
>          * sequencer is not allocated in our driver's tagset and it's
>          * triggered to be freed by blk_cleanup_queue(). So we need to
> @@ -1066,7 +1075,7 @@ static int nvme_rdma_post_send(struct nvme_rdma_queue *queue,
>          * embedded in request's payload, is not freed when __ib_process_cq()
>          * calls wr_cqe->done().
>          */
> -       if ((++queue->sig_count % 32) == 0 || flush)
> +       if (nvme_rdma_queue_sig_limit(queue) || flush)
>                 wr.send_flags |= IB_SEND_SIGNALED;
>  
>         if (first)

Hello Marta,

The approach of this patch is suboptimal from a performance point of view.
If the number of WRs submitted since the last signaled WR were tracked
in a member variable, that would allow getting rid of the (relatively
slow) division operation.

Bart.

* Re: [PATCH v2] nvme-rdma: support devices with queue size < 32
  2017-04-10 15:32     ` Bart Van Assche
@ 2017-04-11  8:52         ` Marta Rybczynska
  -1 siblings, 0 replies; 34+ messages in thread
From: Marta Rybczynska @ 2017-04-11  8:52 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: leonro-VPRAkNaXOzVWk0Htik3J/w, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	hch-jcswGhMUV9g, axboe-b10kYP2dOMg,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	maxg-VPRAkNaXOzVWk0Htik3J/w,
	jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/,
	keith busch, dledford-H+wXaHxf7aLQT0dZR+AlfA, samuel jones

> On Mon, 2017-04-10 at 17:12 +0200, Marta Rybczynska wrote:
>> In the case of a small NVMe-oF queue size (<32) we may enter
>> a deadlock: IB send completions are only signaled once every 32
>> requests, so with fewer than 32 send queue slots the queue fills
>> up before any completion is ever generated.
>> 
>> The error is seen as (using mlx5):
>> [ 2048.693355] mlx5_0:mlx5_ib_post_send:3765:(pid 7273):
>> [ 2048.693360] nvme nvme1: nvme_rdma_post_send failed with error code -12
>> 
>> This patch changes the way the signaling is done so
>> that it depends on the queue depth now. The magic define has
>> been removed completely.
>> 
>> Signed-off-by: Marta Rybczynska <marta.rybczynska-FNhOzJFKnXGHXe+LvDLADg@public.gmane.org>
>> Signed-off-by: Samuel Jones <sjones-FNhOzJFKnXGHXe+LvDLADg@public.gmane.org>
>> ---
>> Changes from v1:
>> * signal by queue size/2, remove hardcoded 32
>> * support queue depth of 1
>> 
>>  drivers/nvme/host/rdma.c | 17 +++++++++++++----
>>  1 file changed, 13 insertions(+), 4 deletions(-)
>> 
>> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
>> index 47a479f..4de1b92 100644
>> --- a/drivers/nvme/host/rdma.c
>> +++ b/drivers/nvme/host/rdma.c
>> @@ -1029,6 +1029,18 @@ static void nvme_rdma_send_done(struct ib_cq *cq, struct
>> ib_wc *wc)
>>                 nvme_rdma_wr_error(cq, wc, "SEND");
>>  }
>>  
>> +static inline int nvme_rdma_queue_sig_limit(struct nvme_rdma_queue *queue)
>> +{
>> +       int sig_limit;
>> +
>> +       /* We signal completion every queue depth/2 and also
>> +        * handle the case of possible device with queue_depth=1,
>> +        * where we would need to signal every message.
>> +        */
>> +       sig_limit = max(queue->queue_size / 2, 1);
>> +       return (++queue->sig_count % sig_limit) == 0;
>> +}
>> +
>>  static int nvme_rdma_post_send(struct nvme_rdma_queue *queue,
>>                 struct nvme_rdma_qe *qe, struct ib_sge *sge, u32 num_sge,
>>                 struct ib_send_wr *first, bool flush)
>> @@ -1056,9 +1068,6 @@ static int nvme_rdma_post_send(struct nvme_rdma_queue
>> *queue,
>>          * Would have been way to obvious to handle this in hardware or
>>          * at least the RDMA stack..
>>          *
>> -        * This messy and racy code sniplet is copy and pasted from the iSER
>> -        * initiator, and the magic '32' comes from there as well.
>> -        *
>>          * Always signal the flushes. The magic request used for the flush
>>          * sequencer is not allocated in our driver's tagset and it's
>>          * triggered to be freed by blk_cleanup_queue(). So we need to
>> @@ -1066,7 +1075,7 @@ static int nvme_rdma_post_send(struct nvme_rdma_queue
>> *queue,
>>          * embedded in request's payload, is not freed when __ib_process_cq()
>>          * calls wr_cqe->done().
>>          */
>> -       if ((++queue->sig_count % 32) == 0 || flush)
>> +       if (nvme_rdma_queue_sig_limit(queue) || flush)
>>                 wr.send_flags |= IB_SEND_SIGNALED;
>>  
>>         if (first)
> 
> Hello Marta,
> 
> The approach of this patch is suboptimal from a performance point of view.
> If the number of WRs submitted since the last signaled WR were tracked
> in a member variable, that would allow getting rid of the (relatively
> slow) division operation.
> 

Hello Bart,
I think that we can remove the division (the modulo by sig_limit). The sig_count
is a u8, so it is really the type of variable you propose. It doesn't seem to be
used anywhere else, so we can change the way it is used in the snippet to count
up to the signaling moment. That would give something like:

static inline int nvme_rdma_queue_sig_limit(struct nvme_rdma_queue *queue)
{
       int sig_limit;

       /* We signal completion every queue depth/2 and also
        * handle the case of possible device with queue_depth=1,
        * where we would need to signal every message.
        */
       sig_limit = max(queue->queue_size / 2, 1);
       queue->sig_count++;
       if (queue->sig_count < sig_limit)
           return 0;
       queue->sig_count = 0;
       return 1;
}


Do you like it better?

Marta

* Re: [PATCH v2] nvme-rdma: support devices with queue size < 32
  2017-04-11  8:52         ` Marta Rybczynska
@ 2017-04-11 10:50             ` Max Gurtovoy
  -1 siblings, 0 replies; 34+ messages in thread
From: Max Gurtovoy @ 2017-04-11 10:50 UTC (permalink / raw)
  To: Marta Rybczynska, Bart Van Assche
  Cc: leonro-VPRAkNaXOzVWk0Htik3J/w, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	hch-jcswGhMUV9g, axboe-b10kYP2dOMg,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/,
	keith busch, dledford-H+wXaHxf7aLQT0dZR+AlfA, samuel jones



On 4/11/2017 11:52 AM, Marta Rybczynska wrote:
>> On Mon, 2017-04-10 at 17:12 +0200, Marta Rybczynska wrote:
>>> In the case of a small NVMe-oF queue size (<32) we may enter
>>> a deadlock: IB send completions are only signaled once every 32
>>> requests, so with fewer than 32 send queue slots the queue fills
>>> up before any completion is ever generated.
>>>
>>> The error is seen as (using mlx5):
>>> [ 2048.693355] mlx5_0:mlx5_ib_post_send:3765:(pid 7273):
>>> [ 2048.693360] nvme nvme1: nvme_rdma_post_send failed with error code -12
>>>
>>> This patch changes the way the signaling is done so
>>> that it depends on the queue depth now. The magic define has
>>> been removed completely.
>>>
>>> Signed-off-by: Marta Rybczynska <marta.rybczynska-FNhOzJFKnXGHXe+LvDLADg@public.gmane.org>
>>> Signed-off-by: Samuel Jones <sjones-FNhOzJFKnXGHXe+LvDLADg@public.gmane.org>
>>> ---
>>> Changes from v1:
>>> * signal by queue size/2, remove hardcoded 32
>>> * support queue depth of 1
>>>
>>>  drivers/nvme/host/rdma.c | 17 +++++++++++++----
>>>  1 file changed, 13 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
>>> index 47a479f..4de1b92 100644
>>> --- a/drivers/nvme/host/rdma.c
>>> +++ b/drivers/nvme/host/rdma.c
>>> @@ -1029,6 +1029,18 @@ static void nvme_rdma_send_done(struct ib_cq *cq, struct
>>> ib_wc *wc)
>>>                 nvme_rdma_wr_error(cq, wc, "SEND");
>>>  }
>>>
>>> +static inline int nvme_rdma_queue_sig_limit(struct nvme_rdma_queue *queue)
>>> +{
>>> +       int sig_limit;
>>> +
>>> +       /* We signal completion every queue depth/2 and also
>>> +        * handle the case of possible device with queue_depth=1,
>>> +        * where we would need to signal every message.
>>> +        */
>>> +       sig_limit = max(queue->queue_size / 2, 1);
>>> +       return (++queue->sig_count % sig_limit) == 0;
>>> +}
>>> +
>>>  static int nvme_rdma_post_send(struct nvme_rdma_queue *queue,
>>>                 struct nvme_rdma_qe *qe, struct ib_sge *sge, u32 num_sge,
>>>                 struct ib_send_wr *first, bool flush)
>>> @@ -1056,9 +1068,6 @@ static int nvme_rdma_post_send(struct nvme_rdma_queue
>>> *queue,
>>>          * Would have been way to obvious to handle this in hardware or
>>>          * at least the RDMA stack..
>>>          *
>>> -        * This messy and racy code sniplet is copy and pasted from the iSER
>>> -        * initiator, and the magic '32' comes from there as well.
>>> -        *
>>>          * Always signal the flushes. The magic request used for the flush
>>>          * sequencer is not allocated in our driver's tagset and it's
>>>          * triggered to be freed by blk_cleanup_queue(). So we need to
>>> @@ -1066,7 +1075,7 @@ static int nvme_rdma_post_send(struct nvme_rdma_queue
>>> *queue,
>>>          * embedded in request's payload, is not freed when __ib_process_cq()
>>>          * calls wr_cqe->done().
>>>          */
>>> -       if ((++queue->sig_count % 32) == 0 || flush)
>>> +       if (nvme_rdma_queue_sig_limit(queue) || flush)
>>>                 wr.send_flags |= IB_SEND_SIGNALED;
>>>
>>>         if (first)
>>
>> Hello Marta,
>>
>> The approach of this patch is suboptimal from a performance point of view.
>> If the number of WRs submitted since the last signaled WR were tracked
>> in a member variable, that would allow getting rid of the (relatively
>> slow) division operation.
>>
>
> Hello Bart,
> I think that we can remove the division (the modulo by sig_limit). The sig_count
> is a u8, so it is really the type of variable you propose. It doesn't seem to be
> used anywhere else, so we can change the way it is used in the snippet to count
> up to the signaling moment. That would give something like:
>
> static inline int nvme_rdma_queue_sig_limit(struct nvme_rdma_queue *queue)
> {
>        int sig_limit;
>
>        /* We signal completion every queue depth/2 and also
>         * handle the case of possible device with queue_depth=1,
>         * where we would need to signal every message.
>         */
>        sig_limit = max(queue->queue_size / 2, 1);
>        queue->sig_count++;
>        if (queue->sig_count < sig_limit)
>            return 0;
>        queue->sig_count = 0;
>        return 1;
> }
>
>
> Do you like it better?

Hi Marta,
I think that Bart (and I agree) meant avoiding the division in the
fast path. You can add a new variable to rdma_ctrl called sig_limit and
set it once during the initialization stage. Then just do:

if ((++queue->sig_count % queue->ctrl->sig_limit) == 0 || flush)

in nvme_rdma_post_send function.

Also, as you mentioned, sig_count is a u8, so you need to avoid setting
sig_limit to a value bigger than 255. So please take the minimum of 32 (or
any other suggestion?) and the result of max(queue->queue_size / 2, 1);

thanks,
Max.

>
> Marta
>
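
A sketch of Max's suggestion (the ctrl->sig_limit field is hypothetical
here): compute the limit once at initialization, clamped so it stays
well below the u8 counter's range, and keep only the modulo in the
fast path:

	/* Once, at queue/controller initialization: */
	queue->ctrl->sig_limit = min(max(queue->queue_size / 2, 1), 32);

	/* Fast path in nvme_rdma_post_send(): */
	if ((++queue->sig_count % queue->ctrl->sig_limit) == 0 || flush)
		wr.send_flags |= IB_SEND_SIGNALED;

One caveat with the u8 counter: the modulo only stays consistent across
the wrap from 255 back to 0 when sig_limit divides 256, so rounding the
limit to a power of two would be the safer choice.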

* Re: [PATCH v2] nvme-rdma: support devices with queue size < 32
  2017-04-11 10:50             ` Max Gurtovoy
@ 2017-04-11 11:04                 ` Marta Rybczynska
  -1 siblings, 0 replies; 34+ messages in thread
From: Marta Rybczynska @ 2017-04-11 11:04 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: Bart Van Assche, leonro-VPRAkNaXOzVWk0Htik3J/w,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, hch-jcswGhMUV9g,
	axboe-b10kYP2dOMg, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/,
	keith busch, dledford-H+wXaHxf7aLQT0dZR+AlfA, samuel jones

> On 4/11/2017 11:52 AM, Marta Rybczynska wrote:
>>> On Mon, 2017-04-10 at 17:12 +0200, Marta Rybczynska wrote:
>>>> In the case of a small NVMe-oF queue size (<32) we may enter
>>>> a deadlock: IB send completions are only signaled once every 32
>>>> requests, so with fewer than 32 send queue slots the queue fills
>>>> up before any completion is ever generated.
>>>>
>>>> The error is seen as (using mlx5):
>>>> [ 2048.693355] mlx5_0:mlx5_ib_post_send:3765:(pid 7273):
>>>> [ 2048.693360] nvme nvme1: nvme_rdma_post_send failed with error code -12
>>>>
>>>> This patch changes the way the signaling is done so
>>>> that it depends on the queue depth now. The magic define has
>>>> been removed completely.
>>>>
>>>> Signed-off-by: Marta Rybczynska <marta.rybczynska-FNhOzJFKnXGHXe+LvDLADg@public.gmane.org>
>>>> Signed-off-by: Samuel Jones <sjones-FNhOzJFKnXGHXe+LvDLADg@public.gmane.org>
>>>> ---
>>>> Changes from v1:
>>>> * signal by queue size/2, remove hardcoded 32
>>>> * support queue depth of 1
>>>>
>>>>  drivers/nvme/host/rdma.c | 17 +++++++++++++----
>>>>  1 file changed, 13 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
>>>> index 47a479f..4de1b92 100644
>>>> --- a/drivers/nvme/host/rdma.c
>>>> +++ b/drivers/nvme/host/rdma.c
>>>> @@ -1029,6 +1029,18 @@ static void nvme_rdma_send_done(struct ib_cq *cq, struct
>>>> ib_wc *wc)
>>>>                 nvme_rdma_wr_error(cq, wc, "SEND");
>>>>  }
>>>>
>>>> +static inline int nvme_rdma_queue_sig_limit(struct nvme_rdma_queue *queue)
>>>> +{
>>>> +       int sig_limit;
>>>> +
>>>> +       /* We signal completion every queue depth/2 and also
>>>> +        * handle the case of possible device with queue_depth=1,
>>>> +        * where we would need to signal every message.
>>>> +        */
>>>> +       sig_limit = max(queue->queue_size / 2, 1);
>>>> +       return (++queue->sig_count % sig_limit) == 0;
>>>> +}
>>>> +
>>>>  static int nvme_rdma_post_send(struct nvme_rdma_queue *queue,
>>>>                 struct nvme_rdma_qe *qe, struct ib_sge *sge, u32 num_sge,
>>>>                 struct ib_send_wr *first, bool flush)
>>>> @@ -1056,9 +1068,6 @@ static int nvme_rdma_post_send(struct nvme_rdma_queue
>>>> *queue,
>>>>          * Would have been way to obvious to handle this in hardware or
>>>>          * at least the RDMA stack..
>>>>          *
>>>> -        * This messy and racy code sniplet is copy and pasted from the iSER
>>>> -        * initiator, and the magic '32' comes from there as well.
>>>> -        *
>>>>          * Always signal the flushes. The magic request used for the flush
>>>>          * sequencer is not allocated in our driver's tagset and it's
>>>>          * triggered to be freed by blk_cleanup_queue(). So we need to
>>>> @@ -1066,7 +1075,7 @@ static int nvme_rdma_post_send(struct nvme_rdma_queue
>>>> *queue,
>>>>          * embedded in request's payload, is not freed when __ib_process_cq()
>>>>          * calls wr_cqe->done().
>>>>          */
>>>> -       if ((++queue->sig_count % 32) == 0 || flush)
>>>> +       if (nvme_rdma_queue_sig_limit(queue) || flush)
>>>>                 wr.send_flags |= IB_SEND_SIGNALED;
>>>>
>>>>         if (first)
>>>
>>> Hello Marta,
>>>
>>> The approach of this patch is suboptimal from a performance point of view.
>>> If the number of WRs submitted since the last signaled WR were tracked
>>> in a member variable, that would allow getting rid of the (relatively
>>> slow) division operation.
>>>
>>
>> Hello Bart,
>> I think that we can remove the division (the modulo by sig_limit). The sig_count
>> is a u8, so it is really the type of variable you propose. It doesn't seem to be
>> used anywhere else, so we can change the way it is used in the snippet to count
>> up to the signaling moment. That would give something like:
>>
>> static inline int nvme_rdma_queue_sig_limit(struct nvme_rdma_queue *queue)
>> {
>>        int sig_limit;
>>
>>        /* We signal completion every queue depth/2 and also
>>         * handle the case of possible device with queue_depth=1,
>>         * where we would need to signal every message.
>>         */
>>        sig_limit = max(queue->queue_size / 2, 1);
>>        queue->sig_count++;
>>        if (queue->sig_count < sig_limit)
>>            return 0;
>>        queue->sig_count = 0;
>>        return 1;
>> }
>>
>>
>> Do you like it better?
> 
> Hi Marta,
> I think that Bart (and I agree) meant avoiding the division in the
> fast path. You can add a new variable to rdma_ctrl called sig_limit and
> set it once during the initialization stage. Then just do:
> 
> if ((++queue->sig_count % queue->ctrl->sig_limit) == 0 || flush)
> 
> in nvme_rdma_post_send function.
> 
> Also, as you mentioned, sig_count is a u8, so you need to avoid setting
> sig_limit to a value bigger than 255. So please take the minimum of 32 (or
> any other suggestion?) and the result of max(queue->queue_size / 2, 1);
> 

Hello Max,
Do you mean the division by 2 or the modulo? I would expect the division
by two to be compiled as a shift operation on most architectures, wouldn't it?

Marta
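
(A signed division by two does typically compile to a shift plus a small
sign correction, so the expensive operation in the snippet is the modulo
by the runtime sig_limit rather than the queue_size / 2.)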

* Re: [PATCH v2] nvme-rdma: support devices with queue size < 32
  2017-04-11  8:52         ` Marta Rybczynska
@ 2017-04-11 15:10             ` Bart Van Assche
  -1 siblings, 0 replies; 34+ messages in thread
From: Bart Van Assche @ 2017-04-11 15:10 UTC (permalink / raw)
  To: mrybczyn-FNhOzJFKnXGHXe+LvDLADg
  Cc: leonro-VPRAkNaXOzVWk0Htik3J/w, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	hch-jcswGhMUV9g, axboe-b10kYP2dOMg,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	maxg-VPRAkNaXOzVWk0Htik3J/w, samuel.jones-FNhOzJFKnXGHXe+LvDLADg,
	jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/,
	keith.busch-ral2JQCrhuEAvxtiuMwx3w,
	dledford-H+wXaHxf7aLQT0dZR+AlfA

On Tue, 2017-04-11 at 10:52 +0200, Marta Rybczynska wrote:
> I think that we can remove the division (the modulo by sig_limit). The sig_count
> is a u8, so it is really the type of variable you propose. It doesn't seem to be
> used anywhere else, so we can change the way it is used in the snippet to count
> up to the signaling moment. That would give something like:
> 
> static inline int nvme_rdma_queue_sig_limit(struct nvme_rdma_queue *queue)
> {
>        int sig_limit;
> 
>        /* We signal completion every queue depth/2 and also
>         * handle the case of possible device with queue_depth=1,
>         * where we would need to signal every message.
>         */
>        sig_limit = max(queue->queue_size / 2, 1);
>        queue->sig_count++;
>        if (queue->sig_count < sig_limit)
>            return 0;
>        queue->sig_count = 0;
>        return 1;
> }

Hello Marta,

Thank you for addressing my feedback. Although this is not important, I
think it is possible to optimize the above code further as follows (see the
sketch after the list):
* Instead of initializing / resetting sig_count to zero, initialize it to sig_limit.
* Instead of incrementing sig_count in nvme_rdma_queue_sig_limit(), decrement it.
* Instead of comparing sig_count against sig_limit, compare it against zero.
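
Put together, a minimal sketch of that countdown variant (assuming sig_count
is widened from u8 and initialized to the limit when the queue is set up):

	static inline int nvme_rdma_queue_sig_limit(struct nvme_rdma_queue *queue)
	{
		if (--queue->sig_count > 0)
			return 0;
		/* Re-arm the countdown; max() handles queue_depth == 1. */
		queue->sig_count = max(queue->queue_size / 2, 1);
		return 1;
	}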

Bart.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v3] nvme-rdma: support devices with queue size < 32
  2017-04-11 15:10             ` Bart Van Assche
@ 2017-04-20  9:43                 ` Marta Rybczynska
  -1 siblings, 0 replies; 34+ messages in thread
From: Marta Rybczynska @ 2017-04-20  9:43 UTC (permalink / raw)
  To: linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, leonro-VPRAkNaXOzVWk0Htik3J/w,
	axboe-b10kYP2dOMg, maxg-VPRAkNaXOzVWk0Htik3J/w,
	jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/,
	hch-jcswGhMUV9g, keith busch, samuel jones,
	dledford-H+wXaHxf7aLQT0dZR+AlfA, Bart Van Assche

In the case of a small NVMe-oF queue size (<32) we may enter
a deadlock: the IB completions are not signalled before 32
sends have been posted, so the send queue fills up.

The error is seen as (using mlx5):
[ 2048.693355] mlx5_0:mlx5_ib_post_send:3765:(pid 7273):
[ 2048.693360] nvme nvme1: nvme_rdma_post_send failed with error code -12

This patch changes the way the signalling is done so
that it now depends on the queue depth. The hardcoded
magic value has been removed completely.

Signed-off-by: Marta Rybczynska <marta.rybczynska-FNhOzJFKnXGHXe+LvDLADg@public.gmane.org>
Signed-off-by: Samuel Jones <sjones-FNhOzJFKnXGHXe+LvDLADg@public.gmane.org>
---
Changes in v3:
* avoid division in the fast path
* reverse sig_count logic to simplify the code: it now counts down
  from the queue depth/2 to 0
* change sig_count to int to avoid overflows for big queues

Changes in v2:
* signal by queue size/2, remove hardcoded 32
* support queue depth of 1

 drivers/nvme/host/rdma.c | 28 +++++++++++++++++++++++-----
 1 file changed, 23 insertions(+), 5 deletions(-)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 3d25add..ee6f747 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -88,7 +88,7 @@ enum nvme_rdma_queue_flags {
 
 struct nvme_rdma_queue {
 	struct nvme_rdma_qe	*rsp_ring;
-	u8			sig_count;
+	int			sig_count;
 	int			queue_size;
 	size_t			cmnd_capsule_len;
 	struct nvme_rdma_ctrl	*ctrl;
@@ -251,6 +251,15 @@ static int nvme_rdma_wait_for_cm(struct nvme_rdma_queue *queue)
 	return queue->cm_error;
 }
 
+static inline int nvme_rdma_init_sig_count(int queue_size)
+{
+	/* We signal completion every queue depth/2 and also
+	 * handle the case of possible device with queue_depth=1,
+	 * where we would need to signal every message.
+	 */
+	return max(queue_size / 2, 1);
+}
+
 static int nvme_rdma_create_qp(struct nvme_rdma_queue *queue, const int factor)
 {
 	struct nvme_rdma_device *dev = queue->device;
@@ -556,6 +565,8 @@ static int nvme_rdma_init_queue(struct nvme_rdma_ctrl *ctrl,
 
 	queue->queue_size = queue_size;
 
+	queue->sig_count = nvme_rdma_init_sig_count(queue_size);
+
 	queue->cm_id = rdma_create_id(&init_net, nvme_rdma_cm_handler, queue,
 			RDMA_PS_TCP, IB_QPT_RC);
 	if (IS_ERR(queue->cm_id)) {
@@ -1011,6 +1022,16 @@ static void nvme_rdma_send_done(struct ib_cq *cq, struct ib_wc *wc)
 		nvme_rdma_wr_error(cq, wc, "SEND");
 }
 
+static inline int nvme_rdma_queue_sig_limit(struct nvme_rdma_queue *queue)
+{
+	queue->sig_count--;
+	if (queue->sig_count == 0) {
+		queue->sig_count = nvme_rdma_init_sig_count(queue->queue_size);
+		return 1;
+	}
+	return 0;
+}
+
 static int nvme_rdma_post_send(struct nvme_rdma_queue *queue,
 		struct nvme_rdma_qe *qe, struct ib_sge *sge, u32 num_sge,
 		struct ib_send_wr *first, bool flush)
@@ -1038,9 +1059,6 @@ static int nvme_rdma_post_send(struct nvme_rdma_queue *queue,
 	 * Would have been way to obvious to handle this in hardware or
 	 * at least the RDMA stack..
 	 *
-	 * This messy and racy code sniplet is copy and pasted from the iSER
-	 * initiator, and the magic '32' comes from there as well.
-	 *
 	 * Always signal the flushes. The magic request used for the flush
 	 * sequencer is not allocated in our driver's tagset and it's
 	 * triggered to be freed by blk_cleanup_queue(). So we need to
@@ -1048,7 +1066,7 @@ static int nvme_rdma_post_send(struct nvme_rdma_queue *queue,
 	 * embeded in request's payload, is not freed when __ib_process_cq()
 	 * calls wr_cqe->done().
 	 */
-	if ((++queue->sig_count % 32) == 0 || flush)
+	if (nvme_rdma_queue_sig_limit(queue) || flush)
 		wr.send_flags |= IB_SEND_SIGNALED;
 
 	if (first)
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v3] nvme-rdma: support devices with queue size < 32
  2017-04-20  9:43                 ` Marta Rybczynska
@ 2017-04-20 11:37                     ` Sagi Grimberg
  -1 siblings, 0 replies; 34+ messages in thread
From: Sagi Grimberg @ 2017-04-20 11:37 UTC (permalink / raw)
  To: Marta Rybczynska, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, leonro-VPRAkNaXOzVWk0Htik3J/w,
	axboe-b10kYP2dOMg, maxg-VPRAkNaXOzVWk0Htik3J/w,
	jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/,
	hch-jcswGhMUV9g, keith busch, samuel jones,
	dledford-H+wXaHxf7aLQT0dZR+AlfA, Bart Van Assche

Looks good,

Reviewed-by: Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>

BTW, did you test with deeper queue depths (say 512)?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3] nvme-rdma: support devices with queue size < 32
  2017-04-20 11:37                     ` Sagi Grimberg
@ 2017-04-20 11:43                         ` Sagi Grimberg
  -1 siblings, 0 replies; 34+ messages in thread
From: Sagi Grimberg @ 2017-04-20 11:43 UTC (permalink / raw)
  To: Marta Rybczynska, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, leonro-VPRAkNaXOzVWk0Htik3J/w,
	axboe-b10kYP2dOMg, maxg-VPRAkNaXOzVWk0Htik3J/w,
	jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/,
	hch-jcswGhMUV9g, keith busch, samuel jones,
	dledford-H+wXaHxf7aLQT0dZR+AlfA, Bart Van Assche

> Looks good,
>
> Reviewed-by: Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
>
> BTW, did you test with deeper queue depths (say 512)?

Wait, taking it back...

Can you make nvme_rdma_queue_sig_limit() return a bool instead?

Also, looking at this closer, I'm pretty convinced that this
should be converted to an atomic. For iSER it's fine as is because
we are under the iSCSI connection lock, but here we need to
handle mutual exclusion.

This would be an incremental change though.
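
For illustration, an atomic variant along those lines might look like the
sketch below (an assumption, not necessarily what gets merged; it presumes
sig_count becomes an atomic_t and rounds the limit down to a power of two
so the wrap-around needs no explicit reset):

	static inline bool nvme_rdma_queue_sig_limit(struct nvme_rdma_queue *queue)
	{
		int limit = 1 << ilog2(max(queue->queue_size / 2, 1));

		return (atomic_inc_return(&queue->sig_count) & (limit - 1)) == 0;
	}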

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3] nvme-rdma: support devices with queue size < 32
  2017-04-20 11:43                         ` Sagi Grimberg
@ 2017-04-21  8:01                             ` Marta Rybczynska
  -1 siblings, 0 replies; 34+ messages in thread
From: Marta Rybczynska @ 2017-04-21  8:01 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, leonro-VPRAkNaXOzVWk0Htik3J/w,
	axboe-b10kYP2dOMg, maxg-VPRAkNaXOzVWk0Htik3J/w,
	jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/,
	hch-jcswGhMUV9g, keith busch, samuel jones,
	dledford-H+wXaHxf7aLQT0dZR+AlfA, Bart Van Assche



----- On 20 Apr 17, at 13:43, Sagi Grimberg sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org wrote:

>> Looks good,
>>
>> Reviewed-by: Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
>>
>> BTW, did you test with deeper queue depths (say 512)?

It's in progress. A small depth (16) has been running for several
days already.

> 
> Wait, taking it back...
> 
> Can you make nvme_rdma_queue_sig_limit() return a bool instead?

Sure.

> 
> Also, Looking at this closer, I'm pretty convinced that this
> should convert to atomic. For iSER its fine as is because
> we are under the iscsi connection lock, but here we need to
> handle mutual exclusion.
> 
> This would be an incremental change though.

I can see that. I was wondering about atomics when rewriting
this, and I agree that it will be cleaner.

Marta

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] nvme-rdma: support devices with queue size < 32
  2017-04-10 15:12 ` Marta Rybczynska
@ 2017-05-22 18:51     ` Christoph Hellwig
  -1 siblings, 0 replies; 34+ messages in thread
From: Christoph Hellwig @ 2017-05-22 18:51 UTC (permalink / raw)
  To: Marta Rybczynska
  Cc: Leon Romanovsky, Doug Ledford, Jason Gunthorpe,
	Christoph Hellwig, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, keith busch,
	axboe-b10kYP2dOMg, Max Gurtovoy, Samuel Jones

I've hand-applied this with a few formatting tweaks now that
I got the ACK from Sagi.  I'd love to see if someone could figure
out how we could move the improved atomic + ilog version into the
RDMA core somehow.
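
For illustration only, such a shared helper might look like the hypothetical
sketch below (name, placement and signature are invented here; nothing like
it exists in the tree yet):

	/* hypothetical helper, e.g. in include/rdma/ib_verbs.h */
	static inline bool ib_sq_sig_every_half(atomic_t *sig_count, int queue_size)
	{
		int limit = 1 << ilog2(max(queue_size / 2, 1));

		return (atomic_inc_return(sig_count) & (limit - 1)) == 0;
	}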

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] nvme-rdma: support devices with queue size < 32
  2017-05-22 18:51     ` Christoph Hellwig
@ 2017-05-23 15:32         ` Marta Rybczynska
  -1 siblings, 0 replies; 34+ messages in thread
From: Marta Rybczynska @ 2017-05-23 15:32 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Leon Romanovsky, Doug Ledford, Jason Gunthorpe,
	Christoph Hellwig, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, keith busch,
	axboe-b10kYP2dOMg, Max Gurtovoy, Samuel Jones


> I've hand-applied this with a few formatting tweaks now that
> I got the ACK from Sagi.  I'd love to see if someone could figure
> out how we could move the improved atomic + ilog version into the
> RDMA core somehow.

I'll submit the atomic+ilog version in nvme next week. I have no
access to the test systems this week.

Marta

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] nvme-rdma: support devices with queue size < 32
  2017-05-23 15:32         ` Marta Rybczynska
@ 2017-06-05  9:47             ` Marta Rybczynska
  -1 siblings, 0 replies; 34+ messages in thread
From: Marta Rybczynska @ 2017-06-05  9:47 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Leon Romanovsky, Doug Ledford, Jason Gunthorpe,
	Christoph Hellwig, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, keith busch,
	axboe-b10kYP2dOMg, Max Gurtovoy, Samuel Jones

>> I've hand-applied this with a few formatting tweaks now that
>> I got the ACK from Sagi.  I'd love to see if someone could figure
>> out how we could move the improved atomic + ilog version into the
>> RDMA core somehow.
> 
> I'll submit the atomic+ilog version in nvme next week. I have no
> access to the test systems this week.
> 

I've just posted the new version, with a changed title since the v2 is
merged. The new thread's subject is:
[PATCH] nvme-rdma: remove race conditions from IB signalling

Marta

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2017-06-05  9:47 UTC | newest]

Thread overview: 34+ messages
-- links below jump to the message on this page --
2017-04-10 15:12 [PATCH v2] nvme-rdma: support devices with queue size < 32 Marta Rybczynska
2017-04-10 15:12 ` Marta Rybczynska
     [not found] ` <1519881025.363156294.1491837154312.JavaMail.zimbra-FNhOzJFKnXGHXe+LvDLADg@public.gmane.org>
2017-04-10 15:16   ` Christoph Hellwig
2017-04-10 15:16     ` Christoph Hellwig
     [not found]     ` <20170410151657.GA15173-jcswGhMUV9g@public.gmane.org>
2017-04-10 15:21       ` Marta Rybczynska
2017-04-10 15:21         ` Marta Rybczynska
2017-04-10 15:27       ` Bart Van Assche
2017-04-10 15:27         ` Bart Van Assche
     [not found]         ` <1491838033.4199.3.camel-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
2017-04-10 15:31           ` hch-jcswGhMUV9g
2017-04-10 15:31             ` hch
2017-04-10 15:32   ` Bart Van Assche
2017-04-10 15:32     ` Bart Van Assche
     [not found]     ` <1491838338.4199.5.camel-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
2017-04-11  8:52       ` Marta Rybczynska
2017-04-11  8:52         ` Marta Rybczynska
     [not found]         ` <1807354347.364485979.1491900728245.JavaMail.zimbra-FNhOzJFKnXGHXe+LvDLADg@public.gmane.org>
2017-04-11 10:50           ` Max Gurtovoy
2017-04-11 10:50             ` Max Gurtovoy
     [not found]             ` <e65126fa-48b7-473e-72f6-4a3a8474d11e-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2017-04-11 11:04               ` Marta Rybczynska
2017-04-11 11:04                 ` Marta Rybczynska
2017-04-11 15:10           ` Bart Van Assche
2017-04-11 15:10             ` Bart Van Assche
     [not found]             ` <1491923426.2654.1.camel-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
2017-04-20  9:43               ` [PATCH v3] " Marta Rybczynska
2017-04-20  9:43                 ` Marta Rybczynska
     [not found]                 ` <1677617891.385461828.1492681414720.JavaMail.zimbra-FNhOzJFKnXGHXe+LvDLADg@public.gmane.org>
2017-04-20 11:37                   ` Sagi Grimberg
2017-04-20 11:37                     ` Sagi Grimberg
     [not found]                     ` <391faec7-924d-7af2-eb1b-b3106bd4193c-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
2017-04-20 11:43                       ` Sagi Grimberg
2017-04-20 11:43                         ` Sagi Grimberg
     [not found]                         ` <dc34a67f-a650-311d-f9ae-7e854c9d7067-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
2017-04-21  8:01                           ` Marta Rybczynska
2017-04-21  8:01                             ` Marta Rybczynska
2017-05-22 18:51   ` [PATCH v2] " Christoph Hellwig
2017-05-22 18:51     ` Christoph Hellwig
     [not found]     ` <20170522185127.GA27542-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
2017-05-23 15:32       ` Marta Rybczynska
2017-05-23 15:32         ` Marta Rybczynska
     [not found]         ` <2015307566.53586286.1495553554276.JavaMail.zimbra-FNhOzJFKnXGHXe+LvDLADg@public.gmane.org>
2017-06-05  9:47           ` Marta Rybczynska
2017-06-05  9:47             ` Marta Rybczynska
