All of lore.kernel.org
 help / color / mirror / Atom feed
From: Leon Romanovsky <leonro@mellanox.com>
To: Yamin Friedman <yaminf@mellanox.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>,
	Sagi Grimberg <sagi@grimberg.me>, Christoph Hellwig <hch@lst.de>,
	Or Gerlitz <ogerlitz@mellanox.com>,
	linux-rdma@vger.kernel.org
Subject: Re: [PATCH V2 2/4] RDMA/core: Introduce shared CQ pool API
Date: Mon, 18 May 2020 20:48:21 +0300	[thread overview]
Message-ID: <20200518174821.GB188135@unreal> (raw)
In-Reply-To: <a6281ce2-71e1-db0f-1038-8aa0db0aa6be@mellanox.com>

On Mon, May 18, 2020 at 04:16:05PM +0300, Yamin Friedman wrote:
>
> On 5/18/2020 11:30 AM, Leon Romanovsky wrote:
> > On Wed, May 13, 2020 at 02:52:41PM +0300, Yamin Friedman wrote:
> > > Allow a ULP to ask the core to provide a completion queue based on a
> > > least-used search on a per-device CQ pools. The device CQ pools grow in a
> > > lazy fashion when more CQs are requested.
> > >
> > > This feature reduces the amount of interrupts when using many QPs.
> > > Using shared CQs allows for more effcient completion handling. It also
> > > reduces the amount of overhead needed for CQ contexts.
> > >
> > > Test setup:
> > > Intel(R) Xeon(R) Platinum 8176M CPU @ 2.10GHz servers.
> > > Running NVMeoF 4KB read IOs over ConnectX-5EX across Spectrum switch.
> > > TX-depth = 32. The patch was applied in the nvme driver on both the target
> > > and initiator. Four controllers are accessed from each core. In the
> > > current test case we have exposed sixteen NVMe namespaces using four
> > > different subsystems (four namespaces per subsystem) from one NVM port.
> > > Each controller allocated X queues (RDMA QPs) and attached to Y CQs.
> > > Before this series we had X == Y, i.e for four controllers we've created
> > > total of 4X QPs and 4X CQs. In the shared case, we've created 4X QPs and
> > > only X CQs which means that we have four controllers that share a
> > > completion queue per core. Until fourteen cores there is no significant
> > > change in performance and the number of interrupts per second is less than
> > > a million in the current case.
> > > ==================================================
> > > |Cores|Current KIOPs  |Shared KIOPs  |improvement|
> > > |-----|---------------|--------------|-----------|
> > > |14   |2332           |2723          |16.7%      |
> > > |-----|---------------|--------------|-----------|
> > > |20   |2086           |2712          |30%        |
> > > |-----|---------------|--------------|-----------|
> > > |28   |1971           |2669          |35.4%      |
> > > |=================================================
> > > |Cores|Current avg lat|Shared avg lat|improvement|
> > > |-----|---------------|--------------|-----------|
> > > |14   |767us          |657us         |14.3%      |
> > > |-----|---------------|--------------|-----------|
> > > |20   |1225us         |943us         |23%        |
> > > |-----|---------------|--------------|-----------|
> > > |28   |1816us         |1341us        |26.1%      |
> > > ========================================================
> > > |Cores|Current interrupts|Shared interrupts|improvement|
> > > |-----|------------------|-----------------|-----------|
> > > |14   |1.6M/sec          |0.4M/sec         |72%        |
> > > |-----|------------------|-----------------|-----------|
> > > |20   |2.8M/sec          |0.6M/sec         |72.4%      |
> > > |-----|------------------|-----------------|-----------|
> > > |28   |2.9M/sec          |0.8M/sec         |63.4%      |
> > > ====================================================================
> > > |Cores|Current 99.99th PCTL lat|Shared 99.99th PCTL lat|improvement|
> > > |-----|------------------------|-----------------------|-----------|
> > > |14   |67ms                    |6ms                    |90.9%      |
> > > |-----|------------------------|-----------------------|-----------|
> > > |20   |5ms                     |6ms                    |-10%       |
> > > |-----|------------------------|-----------------------|-----------|
> > > |28   |8.7ms                   |6ms                    |25.9%      |
> > > |===================================================================
> > >
> > > Performance improvement with sixteen disks (sixteen CQs per core) is
> > > comparable.
> > >
> > > Signed-off-by: Yamin Friedman <yaminf@mellanox.com>
> > > Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
> > > Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
> > > ---
> > >   drivers/infiniband/core/core_priv.h |   4 ++
> > >   drivers/infiniband/core/cq.c        | 137 ++++++++++++++++++++++++++++++++++++
> > >   drivers/infiniband/core/device.c    |   2 +
> > >   include/rdma/ib_verbs.h             |  35 +++++++++
> > >   4 files changed, 178 insertions(+)
> > >
> > > diff --git a/drivers/infiniband/core/core_priv.h b/drivers/infiniband/core/core_priv.h
> > > index cf42acc..fa3151b 100644
> > > --- a/drivers/infiniband/core/core_priv.h
> > > +++ b/drivers/infiniband/core/core_priv.h
> > > @@ -414,4 +414,8 @@ void rdma_umap_priv_init(struct rdma_umap_priv *priv,
> > >   			 struct vm_area_struct *vma,
> > >   			 struct rdma_user_mmap_entry *entry);
> > >
> > > +void ib_cq_pool_init(struct ib_device *dev);
> > > +
> > > +void ib_cq_pool_destroy(struct ib_device *dev);
> > > +
> > >   #endif /* _CORE_PRIV_H */
> > > diff --git a/drivers/infiniband/core/cq.c b/drivers/infiniband/core/cq.c
> > > index 04046eb..5319c14 100644
> > > --- a/drivers/infiniband/core/cq.c
> > > +++ b/drivers/infiniband/core/cq.c
> > > @@ -7,7 +7,11 @@
> > >   #include <linux/slab.h>
> > >   #include <rdma/ib_verbs.h>
> > >
> > > +#include "core_priv.h"
> > > +
> > >   #include <trace/events/rdma_core.h>
> > > +/* Max size for shared CQ, may require tuning */
> > > +#define IB_MAX_SHARED_CQ_SZ		4096
> > >
> > >   /* # of WCs to poll for with a single call to ib_poll_cq */
> > >   #define IB_POLL_BATCH			16
> > > @@ -218,6 +222,7 @@ struct ib_cq *__ib_alloc_cq_user(struct ib_device *dev, void *private,
> > >   	cq->cq_context = private;
> > >   	cq->poll_ctx = poll_ctx;
> > >   	atomic_set(&cq->usecnt, 0);
> > > +	cq->comp_vector = comp_vector;
> > >
> > >   	cq->wc = kmalloc_array(IB_POLL_BATCH, sizeof(*cq->wc), GFP_KERNEL);
> > >   	if (!cq->wc)
> > > @@ -304,6 +309,8 @@ static void _ib_free_cq_user(struct ib_cq *cq, struct ib_udata *udata)
> > >   {
> > >   	if (WARN_ON_ONCE(atomic_read(&cq->usecnt)))
> > >   		return;
> > > +	if (WARN_ON_ONCE(cq->cqe_used != 0))
> > Let's do WARN_ON_ONCE(cq->cqe_used)
> >
> > > +		return;
> > >
> > >   	switch (cq->poll_ctx) {
> > >   	case IB_POLL_DIRECT:
> > > @@ -340,3 +347,133 @@ void ib_free_cq_user(struct ib_cq *cq, struct ib_udata *udata)
> > >   		_ib_free_cq_user(cq, udata);
> > >   }
> > >   EXPORT_SYMBOL(ib_free_cq_user);
> > > +
> > > +void ib_cq_pool_init(struct ib_device *dev)
> > > +{
> > > +	int i;
> > > +
> > > +	spin_lock_init(&dev->cq_pools_lock);
> > > +	for (i = 0; i < ARRAY_SIZE(dev->cq_pools); i++)
> > > +		INIT_LIST_HEAD(&dev->cq_pools[i]);
> > > +}
> > > +
> > > +void ib_cq_pool_destroy(struct ib_device *dev)
> > > +{
> > > +	struct ib_cq *cq, *n;
> > > +	int i;
> > > +
> > > +	for (i = 0; i < ARRAY_SIZE(dev->cq_pools); i++) {
> > > +		list_for_each_entry_safe(cq, n, &dev->cq_pools[i], pool_entry)
> > > +			_ib_free_cq_user(cq, NULL);
> > > +	}
> > > +
> > > +}
> > > +
> > > +static int ib_alloc_cqs(struct ib_device *dev, int nr_cqes,
> > > +			enum ib_poll_context poll_ctx)
> > > +{
> > > +	LIST_HEAD(tmp_list);
> > > +	struct ib_cq *cq;
> > > +	unsigned long flags;
> > > +	int nr_cqs, ret, i;
> > > +
> > > +	/*
> > > +	 * Allocated at least as many CQEs as requested, and otherwise
> > > +	 * a reasonable batch size so that we can share CQs between
> > > +	 * multiple users instead of allocating a larger number of CQs.
> > > +	 */
> > > +	nr_cqes = min(dev->attrs.max_cqe, max(nr_cqes, IB_MAX_SHARED_CQ_SZ));
> > > +	nr_cqs = min_t(int, dev->num_comp_vectors, num_online_cpus());
> > > +	for (i = 0; i < nr_cqs; i++) {
> > > +		cq = ib_alloc_cq(dev, NULL, nr_cqes, i, poll_ctx);
> > > +		if (IS_ERR(cq)) {
> > > +			ret = PTR_ERR(cq);
> > > +			goto out_free_cqs;
> > > +		}
> > > +		cq->shared = true;
> > > +		list_add_tail(&cq->pool_entry, &tmp_list);
> > > +	}
> > > +
> > > +	spin_lock_irqsave(&dev->cq_pools_lock, flags);
> > > +	list_splice(&tmp_list, &dev->cq_pools[poll_ctx - 1]);
> > > +	spin_unlock_irqrestore(&dev->cq_pools_lock, flags);
> > > +
> > > +	return 0;
> > > +
> > > +out_free_cqs:
> > > +	list_for_each_entry(cq, &tmp_list, pool_entry)
> > > +		ib_free_cq(cq);
> > > +	return ret;
> > > +}
> > > +
> > > +struct ib_cq *ib_cq_pool_get(struct ib_device *dev, unsigned int nr_cqe,
> > > +			     int comp_vector_hint,
> > > +			     enum ib_poll_context poll_ctx)
> > > +{
> > > +	static unsigned int default_comp_vector;
> > > +	int vector, ret, num_comp_vectors;
> > > +	struct ib_cq *cq, *found = NULL;
> > > +	unsigned long flags;
> > > +
> > > +	if (poll_ctx > ARRAY_SIZE(dev->cq_pools) || poll_ctx == IB_POLL_DIRECT)
> > > +		return ERR_PTR(-EINVAL);
> > > +
> > > +	num_comp_vectors = min_t(int, dev->num_comp_vectors,
> > > +				 num_online_cpus());
> > > +	/* Project the affinty to the device completion vector range */
> > > +	if (comp_vector_hint < 0)
> > > +		vector = default_comp_vector++ % num_comp_vectors;
> > > +	else
> > > +		vector = comp_vector_hint % num_comp_vectors;
> > > +
> > > +	/*
> > > +	 * Find the least used CQ with correct affinity and
> > > +	 * enough free CQ entries
> > > +	 */
> > > +	while (!found) {
> > > +		spin_lock_irqsave(&dev->cq_pools_lock, flags);
> > > +		list_for_each_entry(cq, &dev->cq_pools[poll_ctx - 1],
> > > +				    pool_entry) {
> > > +			if (vector != cq->comp_vector)
> > I think that this check worth to have a comment.
> > At least for me, it is not clear if it will work correctly if
> > comp_vector == 0.
> >
> > > +				continue;
> > > +			if (cq->cqe_used + nr_cqe > cq->cqe)
> > > +				continue;
> > > +			if (found && cq->cqe_used >= found->cqe_used)
> > > +				continue;
> > > +			found = cq;
> > > +			break;
> > > +		}
> > > +
> > > +		if (found) {
> > > +			found->cqe_used += nr_cqe;
> > > +			spin_unlock_irqrestore(&dev->cq_pools_lock, flags);
> > > +
> > > +			return found;
> > > +		}
> > > +		spin_unlock_irqrestore(&dev->cq_pools_lock, flags);
> > > +
> > > +		/*
> > > +		 * Didn't find a match or ran out of CQs in the device
> > > +		 * pool, allocate a new array of CQs.
> > > +		 */
> > > +		ret = ib_alloc_cqs(dev, nr_cqe, poll_ctx);
> > > +		if (ret)
> > > +			return ERR_PTR(ret);
> > > +	}
> > > +
> > > +	return found;
> > > +}
> > > +EXPORT_SYMBOL(ib_cq_pool_get);
> > > +
> > > +void ib_cq_pool_put(struct ib_cq *cq, unsigned int nr_cqe)
> > > +{
> > > +	unsigned long flags;
> > > +
> > > +	if (nr_cqe > cq->cqe_used)
> > > +		return;
> > Is it possible?
> > 1. It is racy
> > 2. It is a bug in the ib_cq_pool_put() caller.
>
> It is possible, the pool doesn't save the amount of cqes used per user.

So, #2 from the list above.

>
> I think to make it really secure I would have to never reduce the cqes used,
> save the number of active users, and have some form of garbage collection
> for used up CQs but that seems to me a lot for something that should not
> occur during proper use.
>
> Would it be better to just have a WARN for this case?

I think so.

Thanks

  reply	other threads:[~2020-05-18 17:48 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-05-13 11:52 [PATCH V2 0/4] Introducing RDMA shared CQ pool Yamin Friedman
2020-05-13 11:52 ` [PATCH V2 1/4] RDMA/core: Add protection for shared CQs used by ULPs Yamin Friedman
2020-05-18  7:54   ` Leon Romanovsky
2020-05-18 12:58     ` Yamin Friedman
2020-05-18 17:53       ` Leon Romanovsky
2020-05-13 11:52 ` [PATCH V2 2/4] RDMA/core: Introduce shared CQ pool API Yamin Friedman
2020-05-18  8:30   ` Leon Romanovsky
2020-05-18 13:16     ` Yamin Friedman
2020-05-18 17:48       ` Leon Romanovsky [this message]
2020-05-19  4:27         ` Devesh Sharma
2020-05-19  4:33           ` Leon Romanovsky
2020-05-13 11:52 ` [PATCH V2 3/4] nvme-rdma: use new shared CQ mechanism Yamin Friedman
2020-05-13 11:52 ` [PATCH V2 4/4] nvmet-rdma: " Yamin Friedman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200518174821.GB188135@unreal \
    --to=leonro@mellanox.com \
    --cc=hch@lst.de \
    --cc=jgg@mellanox.com \
    --cc=linux-rdma@vger.kernel.org \
    --cc=ogerlitz@mellanox.com \
    --cc=sagi@grimberg.me \
    --cc=yaminf@mellanox.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.