* [patch] infiniband: uverbs: limit the number of entries
From: Dan Carpenter @ 2010-10-07  7:16 UTC
  To: Roland Dreier
  Cc: Sean Hefty, Hal Rosenstock, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	kernel-janitors-u79uwXL29TY76Z2rM5mHXA

If we don't limit cmd.ne then the multiplications can overflow.  This
will allocate a small amount of RAM successfully for the "resp" and
"wc" buffers.  The heap will get corrupted when we call ib_poll_cq().

Documentation/infiniband/user_verbs.txt suggests this function is meant
for unprivileged access.

I chose to limit the number of entries to 1000.  That limits the
allocations to 52kb of RAM at the most.  I didn't want to choose a
lower number and break userspace for someone.

Also we don't necessarily fill the "resp" buffer so I changed the
kmalloc() to a kzalloc() to avoid an information leak.

CC: stable@kernel.org
Signed-off-by: Dan Carpenter <error27@gmail.com>

diff --git a/drivers/infiniband/core/uverbs.h b/drivers/infiniband/core/uverbs.h
--- a/drivers/infiniband/core/uverbs.h
+++ b/drivers/infiniband/core/uverbs.h
@@ -162,6 +162,7 @@ void ib_uverbs_srq_event_handler(struct ib_event *event, void *context_ptr);
 void ib_uverbs_event_handler(struct ib_event_handler *handler,
 			     struct ib_event *event);
 
+#define UVERBS_MAX_NUM_ENTRIES 1000
 #define IB_UVERBS_DECLARE_CMD(name)					\
 	ssize_t ib_uverbs_##name(struct ib_uverbs_file *file,		\
 				 const char __user *buf, int in_len,	\
diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -906,12 +906,15 @@ ssize_t ib_uverbs_poll_cq(struct ib_uverbs_file *file,
 	if (copy_from_user(&cmd, buf, sizeof cmd))
 		return -EFAULT;
 
+	if (cmd.ne > UVERBS_MAX_NUM_ENTRIES)
+		return -EINVAL;
+
 	wc = kmalloc(cmd.ne * sizeof *wc, GFP_KERNEL);
 	if (!wc)
 		return -ENOMEM;
 
 	rsize = sizeof *resp + cmd.ne * sizeof(struct ib_uverbs_wc);
-	resp = kmalloc(rsize, GFP_KERNEL);
+	resp = kzalloc(rsize, GFP_KERNEL);
 	if (!resp) {
 		ret = -ENOMEM;
 		goto out_wc;
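
[A rough sketch of the wraparound, for illustration only: the struct size
below is made up, not the real sizeof(struct ib_wc), and a 32-bit size_t
is assumed.]

#include <stdint.h>
#include <stdio.h>

struct ib_wc { char pad[64]; };		/* illustrative size only */

int main(void)
{
	uint32_t ne = 0x08000000;	/* attacker-controlled cmd.ne */

	/* 32-bit multiply: 0x08000000 * 64 == 0x200000000 wraps to 0 */
	uint32_t bytes = ne * (uint32_t)sizeof(struct ib_wc);

	printf("entries requested: %u, bytes allocated: %u\n", ne, bytes);
	return 0;
}

kmalloc() then happily returns the undersized (here zero-byte) buffer,
and ib_poll_cq() writes past the end of it.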

* Re: [patch] infiniband: uverbs: limit the number of entries
From: Jason Gunthorpe @ 2010-10-07 16:16 UTC
  To: Dan Carpenter
  Cc: Roland Dreier, Sean Hefty, Hal Rosenstock,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	kernel-janitors-u79uwXL29TY76Z2rM5mHXA

On Thu, Oct 07, 2010 at 09:16:10AM +0200, Dan Carpenter wrote:
> If we don't limit cmd.ne then the multiplications can overflow.  This
> will allocate a small amount of RAM successfully for the "resp" and
> "wc" buffers.  The heap will get corrupted when we call ib_poll_cq().

I think you could cap the number of returned entries to
UVERBS_MAX_NUM_ENTRIES rather than return EINVAL. That might be more
compatible with user space..

Jason

* Re: [patch] infiniband: uverbs: limit the number of entries
From: Dan Carpenter @ 2010-10-07 16:59 UTC
  To: Jason Gunthorpe
  Cc: Roland Dreier, Sean Hefty, Hal Rosenstock,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	kernel-janitors-u79uwXL29TY76Z2rM5mHXA

On Thu, Oct 07, 2010 at 10:16:49AM -0600, Jason Gunthorpe wrote:
> On Thu, Oct 07, 2010 at 09:16:10AM +0200, Dan Carpenter wrote:
> > If we don't limit cmd.ne then the multiplications can overflow.  This
> > will allocate a small amount of RAM successfully for the "resp" and
> > "wc" buffers.  The heap will get corrupted when we call ib_poll_cq().
> 
> I think you could cap the number of returned entries to
> UVERBS_MAX_NUM_ENTRIES rather than return EINVAL. That might be more
> compatible with user space..
> 

Good idea.  I don't actually have this hardware, so I can't test it, but
that definitely sounds reasonable.

If we did that then UVERBS_MAX_NUM_ENTRIES could be lower than 1000.
What is a reasonable number?

regards,
dan carpenter


* Re: [patch] infiniband: uverbs: limit the number of entries
From: Nicolas Palix @ 2010-10-08  7:59 UTC
  To: Dan Carpenter
  Cc: Jason Gunthorpe, Roland Dreier, Sean Hefty, Hal Rosenstock,
	linux-rdma, kernel-janitors

Hi,

On Thu, Oct 7, 2010 at 6:59 PM, Dan Carpenter <error27@gmail.com> wrote:
> On Thu, Oct 07, 2010 at 10:16:49AM -0600, Jason Gunthorpe wrote:
>> On Thu, Oct 07, 2010 at 09:16:10AM +0200, Dan Carpenter wrote:
>> > If we don't limit cmd.ne then the multiplications can overflow.  This
>> > will allocate a small amount of RAM successfully for the "resp" and
>> > "wc" buffers.  The heap will get corrupted when we call ib_poll_cq().
>>
>> I think you could cap the number of returned entries to
>> UVERBS_MAX_NUM_ENTRIES rather than return EINVAL. That might be more
>> compatible with user space..
>>
>
> Good idea.  I don't actually have this hardware, so I can't test it, but
> that definitely sounds reasonable.
>
> If we did that then UVERBS_MAX_NUM_ENTRIES could be lower than 1000.
> What is a reasonable number?

You can also use kcalloc to allocate wc.

>
> regards,
> dan carpenter
>



-- 
Nicolas Palix
Tel: +33 6 81 07 91 72
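
[kcalloc(), suggested above, carries a built-in overflow check: it returns
NULL when n * size would wrap, so the unchecked multiply goes away, though
a huge-but-valid cmd.ne can still tie up a lot of memory. A minimal sketch
of the pattern:]

	wc = kcalloc(cmd.ne, sizeof(*wc), GFP_KERNEL);
	if (!wc)	/* NULL on ENOMEM *or* if cmd.ne * sizeof(*wc) overflows */
		return -ENOMEM;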

* [patch v2] infiniband: uverbs: limit the number of entries
From: Dan Carpenter @ 2010-10-08 14:25 UTC
  To: Nicolas Palix
  Cc: Jason Gunthorpe, Roland Dreier, Sean Hefty, Hal Rosenstock,
	linux-rdma, kernel-janitors


If we don't limit cmd.ne then the multiplications can overflow.  This
will allocate a small amount of RAM successfully for the "resp" and
"wc" buffers.  The heap will get corrupted when we call ib_poll_cq().

Documentation/infiniband/user_verbs.txt suggests this function is meant
for unprivileged access.

I chose to limit the number of entries to 500.  That limits the
allocations to 26 kb of RAM at the most.

Also we don't necessarily fill the "resp" buffer so I changed the
kmalloc() to a kzalloc() to avoid an information leak.  And I changed
the wc allocation to kcalloc() as a cleanup.

CC: stable@kernel.org
Signed-off-by: Dan Carpenter <error27@gmail.com>

diff --git a/drivers/infiniband/core/uverbs.h b/drivers/infiniband/core/uverbs.h
--- a/drivers/infiniband/core/uverbs.h
+++ b/drivers/infiniband/core/uverbs.h
@@ -162,6 +162,7 @@ void ib_uverbs_srq_event_handler(struct ib_event *event, void *context_ptr);
 void ib_uverbs_event_handler(struct ib_event_handler *handler,
 			     struct ib_event *event);
 
+#define UVERBS_MAX_NUM_ENTRIES 500
 #define IB_UVERBS_DECLARE_CMD(name)					\
 	ssize_t ib_uverbs_##name(struct ib_uverbs_file *file,		\
 				 const char __user *buf, int in_len,	\
diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -906,12 +906,15 @@ ssize_t ib_uverbs_poll_cq(struct ib_uverbs_file *file,
 	if (copy_from_user(&cmd, buf, sizeof cmd))
 		return -EFAULT;
 
+	if (cmd.ne > UVERBS_MAX_NUM_ENTRIES)
+		cmd.ne = UVERBS_MAX_NUM_ENTRIES;
+
-	wc = kmalloc(cmd.ne * sizeof *wc, GFP_KERNEL);
+	wc = kcalloc(cmd.ne, sizeof *wc, GFP_KERNEL);
 	if (!wc)
 		return -ENOMEM;
 
 	rsize = sizeof *resp + cmd.ne * sizeof(struct ib_uverbs_wc);
-	resp = kmalloc(rsize, GFP_KERNEL);
+	resp = kzalloc(rsize, GFP_KERNEL);
 	if (!resp) {
 		ret = -ENOMEM;
 		goto out_wc;


* Re: [patch] infiniband: uverbs: limit the number of entries
From: Jason Gunthorpe @ 2010-10-09 23:16 UTC
  To: Dan Carpenter
  Cc: Roland Dreier, Sean Hefty, Hal Rosenstock,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	kernel-janitors-u79uwXL29TY76Z2rM5mHXA

On Thu, Oct 07, 2010 at 06:59:47PM +0200, Dan Carpenter wrote:
> On Thu, Oct 07, 2010 at 10:16:49AM -0600, Jason Gunthorpe wrote:
> > On Thu, Oct 07, 2010 at 09:16:10AM +0200, Dan Carpenter wrote:
> > > If we don't limit cmd.ne then the multiplications can overflow.  This
> > > will allocate a small amount of RAM successfully for the "resp" and
> > > "wc" buffers.  The heap will get corrupted when we call ib_poll_cq().
> > 
> > I think you could cap the number of returned entries to
> > UVERBS_MAX_NUM_ENTRIES rather than return EINVAL. That might be more
> > compatible with user space..
> > 
> 
> Good idea.  I don't actually have this hardware, so I can't test it, but
> that definitely sounds reasonable.

I was just writing some code related to this, and I'm not so sure
anymore. The problem is that the CQ has to be fully drained to properly
synchronize with the edge-triggered poll. I was just about to write
a check that assumes it is drained based on returning not equal to
what was requested rather than calling it again and possibly getting 0
back.

I'm wondering if other apps have done something like this? Things
will 'work' but it introduces a subtle race condition with CQ draining
and CQ poll events. 

I guess the ideal thing would be to code this routine to handle
an arbitrary count without allocating arbitrary memory - by copying
to user space in limited batches.

Jason
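
[A rough sketch of the batched draining Jason describes, with a
hypothetical POLL_BATCH size and the conversion/copy step elided:]

#define POLL_BATCH 16			/* hypothetical batch size */

	struct ib_wc wc[POLL_BATCH];
	u32 done = 0;

	while (done < cmd.ne) {
		int n = ib_poll_cq(cq, min_t(u32, cmd.ne - done, POLL_BATCH), wc);

		if (n < 0)
			return n;	/* hardware reported an error */
		if (n == 0)
			break;		/* CQ fully drained */
		/* convert the n entries and copy_to_user() them here */
		done += n;
	}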

* [patch v3] infiniband: uverbs: handle large number of entries
From: Dan Carpenter @ 2010-10-12 11:31 UTC
  To: Jason Gunthorpe
  Cc: Roland Dreier, Sean Hefty, Hal Rosenstock,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	kernel-janitors-u79uwXL29TY76Z2rM5mHXA

In the original code there was a potential integer overflow if you
passed in a large cmd.ne.  The calls to kmalloc() would allocate smaller
buffers than intended, leading to memory corruption.

There was also an information leak.

Documentation/infiniband/user_verbs.txt suggests this function is meant
for unprivileged access.

Jason Gunthorpe suggested that I should modify it to pass the data to
the user bit by bit and avoid the kmalloc() entirely.

CC: stable@kernel.org
Signed-off-by: Dan Carpenter <error27@gmail.com>
---
Please, please, check this.  I think I've done it right, but I don't
have the hardware and cannot test it.

It's strange to me that we return "in_len" on success.

struct ib_uverbs_poll_cq_resp is used by userspace libraries right?
Otherwise I could delete it.

diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index 6fcfbeb..b0788b6 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -891,68 +891,89 @@ out:
 	return ret ? ret : in_len;
 }
 
+static int copy_header_to_user(void __user *dest, u32 count)
+{
+	u32 header[2];  /* the second u32 is reserved */
+
+	memset(header, 0, sizeof(header));
+	if (copy_to_user(dest, header, sizeof(header)))
+		return -EFAULT;
+	return 0;
+}
+
+static int copy_wc_to_user(void __user *dest, struct ib_wc *wc)
+{
+	struct ib_uverbs_wc tmp;
+
+	memset(&tmp, 0, sizeof(tmp));
+
+	tmp.wr_id	   = wc->wr_id;
+	tmp.status	   = wc->status;
+	tmp.opcode	   = wc->opcode;
+	tmp.vendor_err	   = wc->vendor_err;
+	tmp.byte_len	   = wc->byte_len;
+	tmp.ex.imm_data	   = (__u32 __force) wc->ex.imm_data;
+	tmp.qp_num	   = wc->qp->qp_num;
+	tmp.src_qp	   = wc->src_qp;
+	tmp.wc_flags	   = wc->wc_flags;
+	tmp.pkey_index	   = wc->pkey_index;
+	tmp.slid	   = wc->slid;
+	tmp.sl		   = wc->sl;
+	tmp.dlid_path_bits = wc->dlid_path_bits;
+	tmp.port_num	   = wc->port_num;
+
+	if (copy_to_user(dest, &tmp, sizeof(tmp)))
+		return -EFAULT;
+	return 0;
+}
+
 ssize_t ib_uverbs_poll_cq(struct ib_uverbs_file *file,
 			  const char __user *buf, int in_len,
 			  int out_len)
 {
 	struct ib_uverbs_poll_cq       cmd;
-	struct ib_uverbs_poll_cq_resp *resp;
+	u8 __user                     *header_ptr;
+	u8 __user                     *data_ptr;
 	struct ib_cq                  *cq;
-	struct ib_wc                  *wc;
-	int                            ret = 0;
+	struct ib_wc                   wc;
+	u32                            count = 0;
+	int                            ret;
 	int                            i;
-	int                            rsize;
 
 	if (copy_from_user(&cmd, buf, sizeof cmd))
 		return -EFAULT;
 
-	wc = kmalloc(cmd.ne * sizeof *wc, GFP_KERNEL);
-	if (!wc)
-		return -ENOMEM;
-
-	rsize = sizeof *resp + cmd.ne * sizeof(struct ib_uverbs_wc);
-	resp = kmalloc(rsize, GFP_KERNEL);
-	if (!resp) {
-		ret = -ENOMEM;
-		goto out_wc;
-	}
-
 	cq = idr_read_cq(cmd.cq_handle, file->ucontext, 0);
-	if (!cq) {
-		ret = -EINVAL;
-		goto out;
-	}
+	if (!cq)
+		return -EINVAL;
 
-	resp->count = ib_poll_cq(cq, cmd.ne, wc);
+	/* we copy a struct ib_uverbs_poll_cq_resp to user space */
+	header_ptr = (void __user *)(unsigned long)cmd.response;
+	data_ptr = header_ptr + sizeof(u32) * 2;
 
-	put_cq_read(cq);
+	for (i = 0; i < cmd.ne; i++) {
+		ret = ib_poll_cq(cq, 1, &wc);
+		if (ret < 0)
+			goto out_put;
+		if (!ret)
+			break;
 
-	for (i = 0; i < resp->count; i++) {
-		resp->wc[i].wr_id 	   = wc[i].wr_id;
-		resp->wc[i].status 	   = wc[i].status;
-		resp->wc[i].opcode 	   = wc[i].opcode;
-		resp->wc[i].vendor_err 	   = wc[i].vendor_err;
-		resp->wc[i].byte_len 	   = wc[i].byte_len;
-		resp->wc[i].ex.imm_data    = (__u32 __force) wc[i].ex.imm_data;
-		resp->wc[i].qp_num 	   = wc[i].qp->qp_num;
-		resp->wc[i].src_qp 	   = wc[i].src_qp;
-		resp->wc[i].wc_flags 	   = wc[i].wc_flags;
-		resp->wc[i].pkey_index 	   = wc[i].pkey_index;
-		resp->wc[i].slid 	   = wc[i].slid;
-		resp->wc[i].sl 		   = wc[i].sl;
-		resp->wc[i].dlid_path_bits = wc[i].dlid_path_bits;
-		resp->wc[i].port_num 	   = wc[i].port_num;
+		ret = copy_wc_to_user(data_ptr, &wc);
+		if (ret)
+			goto out_put;
+		data_ptr += sizeof(struct ib_uverbs_wc);
+		count++;
 	}
 
-	if (copy_to_user((void __user *) (unsigned long) cmd.response, resp, rsize))
-		ret = -EFAULT;
+	ret = copy_header_to_user(header_ptr, count);
+	if (ret)
+		goto out_put;
 
-out:
-	kfree(resp);
+	ret = in_len;
 
-out_wc:
-	kfree(wc);
-	return ret ? ret : in_len;
+out_put:
+	put_cq_read(cq);
+	return ret;
 }
 
 ssize_t ib_uverbs_req_notify_cq(struct ib_uverbs_file *file,

* Re: [patch v3] infiniband: uverbs: handle large number of entries
From: Jason Gunthorpe @ 2010-10-12 21:01 UTC
  To: Dan Carpenter
  Cc: Roland Dreier, Sean Hefty, Hal Rosenstock,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	kernel-janitors-u79uwXL29TY76Z2rM5mHXA

On Tue, Oct 12, 2010 at 01:31:17PM +0200, Dan Carpenter wrote:
> In the original code there was a potential integer overflow if you
> passed in a large cmd.ne.  The calls to kmalloc() would allocate smaller
> buffers than intended, leading to memory corruption.

Keep in mind these are probably performance-sensitive APIs; I was
imagining batching a small number and then doing the copy_to_user?  No
idea what the various performance trade-offs are..

> Please, please, check this.  I've think I've done it right, but I don't
> have the hardware and can not test it.

Nor do I.. I actually don't know what hardware uses this path? The
Mellanox cards use a user-space-only version.
 
Maybe an iwarp card? I kinda recall some recent messages concerning
memory allocations in these paths for iwarp. I wonder if removing the
allocation is such a big win that the larger number of copy_to_user calls
does not matter?

> It's strange to me that we return "in_len" on success.

Agree..

> +static int copy_header_to_user(void __user *dest, u32 count)
> +{
> +	u32 header[2];  /* the second u32 is reserved */
> +
> +	memset(header, 0, sizeof(header));

Don't you need header[0] = count ?

Maybe:
  u32 header[2] = {count};

And let the compiler zero the other word optimally. Also, I'm not sure it
matters here, since you are zeroing user memory that isn't currently used..

> +static int copy_wc_to_user(void __user *dest, struct ib_wc *wc)
> +{
> +	struct ib_uverbs_wc tmp;
> +
> +	memset(&tmp, 0, sizeof(tmp));

I'd really like to see that memset go away for performance. Again
maybe use named initializers and let the compiler zero the uninitialized
members (does it zero padding, I wonder?). Or pre-zero this
memory outside the loop..

Jason
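
[On the initializer point: C99 (6.7.8) guarantees that array elements
without an explicit initializer are zeroed, so the reserved word needs no
memset:]

	u32 header[2] = { count };	/* header[1] is implicitly zero */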

* Re: [patch v3] infiniband: uverbs: handle large number of entries
From: Dan Carpenter @ 2010-10-13  9:05 UTC
  To: Jason Gunthorpe
  Cc: Roland Dreier, Sean Hefty, Hal Rosenstock,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	kernel-janitors-u79uwXL29TY76Z2rM5mHXA

On Tue, Oct 12, 2010 at 03:01:18PM -0600, Jason Gunthorpe wrote:
> On Tue, Oct 12, 2010 at 01:31:17PM +0200, Dan Carpenter wrote:
> > In the original code there was a potential integer overflow if you
> > passed in a large cmd.ne.  The calls to kmalloc() would allocate smaller
> > buffers than intended, leading to memory corruption.
> 
> Keep in mind these are probably performance-sensitive APIs; I was
> imagining batching a small number and then doing the copy_to_user?  No
> idea what the various performance trade-offs are..
> 
> > Please, please, check this.  I've think I've done it right, but I don't
> > have the hardware and can not test it.
> 
> Nor do I.. I actually don't know what hardware uses this path? The
> Mellanox cards use a user-space-only version.
>  
> Maybe an iwarp card? I kinda recall some recent messages concerning
> memory allocations in these paths for iwarp. I wonder if removing the
> allocation is such a big win the larger number of copy_to_user calls
> does not matter?
> 

Who knows?

The reason I'm writing this is to fix a potential security issue, but I
think that viewed from a holistic perspective this patch is also a
performance improvement over the original code because it avoids the big
kmalloc()s.  Doing the copy_to_user() in batches of PAGE_SIZE might be
better but it's more complicated and I'm very lazy... :/  If someone
steps up to do the benchmarks then I might take a look at it.

> > It's strange to me that we return "in_len" on success.
> 
> Agree..
> 
> > +static int copy_header_to_user(void __user *dest, u32 count)
> > +{
> > +	u32 header[2];  /* the second u32 is reserved */
> > +
> > +	memset(header, 0, sizeof(header));
> 
> Don't you need header[0] = count ?
> 

Yes.  Thank you for catching that.

> Maybe:
>   u32 header[2] = {count};
> 
> And let the compiler zero the other word optimally. Also, I'm not sure it
> matters here, since you are zeroing user memory that isn't currently used..

It does matter, because we don't want to leak information to the user.

> 
> > +static int copy_wc_to_user(void __user *dest, struct ib_wc *wc)
> > +{
> > +	struct ib_uverbs_wc tmp;
> > +
> > +	memset(&tmp, 0, sizeof(tmp));
> 
> I'd really like to see that memset go away for performance. Again
> maybe use named initializers and let the compiler zero the uninitialized
> members (does it zero padding, I wonder?). Or pre-zero this
> memory outside the loop..
> 

Good idea.  Yes, it does do zero padding.

regards,
dan carpenter

* [patch v4] infiniband: uverbs: handle large number of entries
From: Dan Carpenter @ 2010-10-13  9:13 UTC
  To: Jason Gunthorpe
  Cc: Roland Dreier, Sean Hefty, Hal Rosenstock,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	kernel-janitors-u79uwXL29TY76Z2rM5mHXA

In the original code there was a potential integer overflow if you
passed in a large cmd.ne.  The calls to kmalloc() would allocate smaller
buffers than intended, leading to memory corruption. There was also an 
information leak if resp wasn't all used.

Documentation/infiniband/user_verbs.txt suggests this function is meant
for unprivileged access.

Special thanks to Jason Gunthorpe for his help and advice.

CC: stable@kernel.org
Signed-off-by: Dan Carpenter <error27@gmail.com>
---
v4:  Fixed a bug where count wasn't sent to the user.
     Removed the calls to memset() and used C99 initializers instead.

diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index 6fcfbeb..5fc1e29 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -891,68 +891,88 @@ out:
 	return ret ? ret : in_len;
 }
 
+static int copy_header_to_user(void __user *dest, u32 count)
+{
+	u32 header[2] = {count, 0};  /* the second u32 is reserved */
+
+	if (copy_to_user(dest, header, sizeof(header)))
+		return -EFAULT;
+	return 0;
+}
+
+static int copy_wc_to_user(void __user *dest, struct ib_wc *wc)
+{
+	struct ib_uverbs_wc tmp = {
+	.wr_id		= wc->wr_id,
+	.status		= wc->status,
+	.opcode		= wc->opcode,
+	.vendor_err	= wc->vendor_err,
+	.byte_len	= wc->byte_len,
+	.ex = {
+		.imm_data = (__u32 __force) wc->ex.imm_data,
+	},
+	.qp_num		= wc->qp->qp_num,
+	.src_qp		= wc->src_qp,
+	.wc_flags	= wc->wc_flags,
+	.pkey_index	= wc->pkey_index,
+	.slid		= wc->slid,
+	.sl		= wc->sl,
+	.dlid_path_bits = wc->dlid_path_bits,
+	.port_num	= wc->port_num,
+	};
+
+	if (copy_to_user(dest, &tmp, sizeof(tmp)))
+		return -EFAULT;
+	return 0;
+}
+
 ssize_t ib_uverbs_poll_cq(struct ib_uverbs_file *file,
 			  const char __user *buf, int in_len,
 			  int out_len)
 {
 	struct ib_uverbs_poll_cq       cmd;
-	struct ib_uverbs_poll_cq_resp *resp;
+	u8 __user                     *header_ptr;
+	u8 __user                     *data_ptr;
 	struct ib_cq                  *cq;
-	struct ib_wc                  *wc;
-	int                            ret = 0;
+	struct ib_wc                   wc;
+	u32                            count = 0;
+	int                            ret;
 	int                            i;
-	int                            rsize;
 
 	if (copy_from_user(&cmd, buf, sizeof cmd))
 		return -EFAULT;
 
-	wc = kmalloc(cmd.ne * sizeof *wc, GFP_KERNEL);
-	if (!wc)
-		return -ENOMEM;
-
-	rsize = sizeof *resp + cmd.ne * sizeof(struct ib_uverbs_wc);
-	resp = kmalloc(rsize, GFP_KERNEL);
-	if (!resp) {
-		ret = -ENOMEM;
-		goto out_wc;
-	}
-
 	cq = idr_read_cq(cmd.cq_handle, file->ucontext, 0);
-	if (!cq) {
-		ret = -EINVAL;
-		goto out;
-	}
+	if (!cq)
+		return -EINVAL;
 
-	resp->count = ib_poll_cq(cq, cmd.ne, wc);
+	/* we copy a struct ib_uverbs_poll_cq_resp to user space */
+	header_ptr = (void __user *)(unsigned long)cmd.response;
+	data_ptr = header_ptr + sizeof(u32) * 2;
 
-	put_cq_read(cq);
+	for (i = 0; i < cmd.ne; i++) {
+		ret = ib_poll_cq(cq, 1, &wc);
+		if (ret < 0)
+			goto out_put;
+		if (!ret)
+			break;
 
-	for (i = 0; i < resp->count; i++) {
-		resp->wc[i].wr_id 	   = wc[i].wr_id;
-		resp->wc[i].status 	   = wc[i].status;
-		resp->wc[i].opcode 	   = wc[i].opcode;
-		resp->wc[i].vendor_err 	   = wc[i].vendor_err;
-		resp->wc[i].byte_len 	   = wc[i].byte_len;
-		resp->wc[i].ex.imm_data    = (__u32 __force) wc[i].ex.imm_data;
-		resp->wc[i].qp_num 	   = wc[i].qp->qp_num;
-		resp->wc[i].src_qp 	   = wc[i].src_qp;
-		resp->wc[i].wc_flags 	   = wc[i].wc_flags;
-		resp->wc[i].pkey_index 	   = wc[i].pkey_index;
-		resp->wc[i].slid 	   = wc[i].slid;
-		resp->wc[i].sl 		   = wc[i].sl;
-		resp->wc[i].dlid_path_bits = wc[i].dlid_path_bits;
-		resp->wc[i].port_num 	   = wc[i].port_num;
+		ret = copy_wc_to_user(data_ptr, &wc);
+		if (ret)
+			goto out_put;
+		data_ptr += sizeof(struct ib_uverbs_wc);
+		count++;
 	}
 
-	if (copy_to_user((void __user *) (unsigned long) cmd.response, resp, rsize))
-		ret = -EFAULT;
+	ret = copy_header_to_user(header_ptr, count);
+	if (ret)
+		goto out_put;
 
-out:
-	kfree(resp);
+	ret = in_len;
 
-out_wc:
-	kfree(wc);
-	return ret ? ret : in_len;
+out_put:
+	put_cq_read(cq);
+	return ret;
 }
 
 ssize_t ib_uverbs_req_notify_cq(struct ib_uverbs_file *file,

* Re: [patch v4] infiniband: uverbs: handle large number of entries
From: Dan Carpenter @ 2010-11-23  7:10 UTC
  To: Jason Gunthorpe
  Cc: Roland Dreier, Sean Hefty, Hal Rosenstock,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	kernel-janitors-u79uwXL29TY76Z2rM5mHXA

On Wed, Oct 13, 2010 at 11:13:12AM +0200, Dan Carpenter wrote:
> In the original code there was a potential integer overflow if you
> passed in a large cmd.ne.  The calls to kmalloc() would allocate smaller
> buffers than intended, leading to memory corruption. There was also an 
> information leak if resp wasn't all used.
> 
> Documentation/infiniband/user_verbs.txt suggests this function is meant
> for unprivileged access.
> 
> Special thanks to Jason Gunthorpe for his help and advice.
> 

Crap!  Apparently C99 initialization zeroes out the holes most of the
time but not all the time.

http://lkml.org/lkml/2010/11/22/367

I'm still waiting for some GCC people to chime in about what the rules
are here, but it looks like I may need to add memsets to this patch.

regards,
dan carpenter
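
[A small sketch of the hole problem, assuming a typical ABI where a u32
wants 4-byte alignment:]

	struct example {
		u8  op;		/* 1 byte */
				/* 3 bytes of padding here */
		u32 len;
	};

	struct example tmp = { .op = 1, .len = 2 };
	/*
	 * The named members are initialized, but C99 leaves the padding
	 * bytes of an automatic object unspecified, so copying tmp to
	 * user space can still leak 3 bytes of old stack data.  Only a
	 * memset() of the whole struct is guaranteed to clear them.
	 */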



* Re: [patch v4] infiniband: uverbs: handle large number of entries
From: Roland Dreier @ 2010-11-24 22:07 UTC
  To: Dan Carpenter
  Cc: Jason Gunthorpe, Roland Dreier, Sean Hefty, Hal Rosenstock,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	kernel-janitors-u79uwXL29TY76Z2rM5mHXA

 > Crap!  Apparently C99 initialization zeroes out the holes most of the
 > time but not all the time.

I think in this case we would be OK, since the structs involved here
don't have holes.  Right?

However, I think this final patch is not really what I would like here.
I didn't follow the discussion carefully, but it seems things ended up
off track.

poll_cq is a data path operation where performance matters, and the
whole point of passing in cmd.ne (the number of completion entries to
poll) is to allow the low level driver to batch things up to save
overhead.  So converting this into a loop that polls one at a time is
not the best thing to do.

What was wrong with:

 - Fixing the potential integer overflow by capping cmd.ne at some high
   but safe value (1000 say)

 - Fixing the information leaks by setting resp->reserved = 0 and inside
   the loop resp->wc[i].reserved = 0, and then only copying the actual
   number of completions polled to userspace

 - And as a bonus handling the (nearly impossible) case of ib_poll_cq
   returning a negative value

IOW a patch like the below (compile tested only):

diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index b342248..ec6e434 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -903,17 +903,17 @@ ssize_t ib_uverbs_poll_cq(struct ib_uverbs_file *file,
 	struct ib_wc                  *wc;
 	int                            ret = 0;
 	int                            i;
-	int                            rsize;
 
 	if (copy_from_user(&cmd, buf, sizeof cmd))
 		return -EFAULT;
 
+	cmd.ne = min_t(u32, cmd.ne, IB_UVERBS_POLL_CQ_MAX_ENTRIES);
+
 	wc = kmalloc(cmd.ne * sizeof *wc, GFP_KERNEL);
 	if (!wc)
 		return -ENOMEM;
 
-	rsize = sizeof *resp + cmd.ne * sizeof(struct ib_uverbs_wc);
-	resp = kmalloc(rsize, GFP_KERNEL);
+	resp = kmalloc(sizeof *resp + cmd.ne * sizeof *resp->wc, GFP_KERNEL);
 	if (!resp) {
 		ret = -ENOMEM;
 		goto out_wc;
@@ -925,10 +925,17 @@ ssize_t ib_uverbs_poll_cq(struct ib_uverbs_file *file,
 		goto out;
 	}
 
-	resp->count = ib_poll_cq(cq, cmd.ne, wc);
+	ret = ib_poll_cq(cq, cmd.ne, wc);
+	if (ret < 0) {
+		ret = -EIO;
+		goto out;
+	}
 
 	put_cq_read(cq);
 
+	resp->count    = ret;
+	resp->reserved = 0;
+
 	for (i = 0; i < resp->count; i++) {
 		resp->wc[i].wr_id 	   = wc[i].wr_id;
 		resp->wc[i].status 	   = wc[i].status;
@@ -944,9 +951,11 @@ ssize_t ib_uverbs_poll_cq(struct ib_uverbs_file *file,
 		resp->wc[i].sl 		   = wc[i].sl;
 		resp->wc[i].dlid_path_bits = wc[i].dlid_path_bits;
 		resp->wc[i].port_num 	   = wc[i].port_num;
+		resp->wc[i].reserved	   = 0;
 	}
 
-	if (copy_to_user((void __user *) (unsigned long) cmd.response, resp, rsize))
+	if (copy_to_user((void __user *) (unsigned long) cmd.response, resp,
+			 sizeof *resp + resp->count * sizeof *resp->wc))
 		ret = -EFAULT;
 
 out:
diff --git a/include/rdma/ib_user_verbs.h b/include/rdma/ib_user_verbs.h
index fe5b051..61b0286 100644
--- a/include/rdma/ib_user_verbs.h
+++ b/include/rdma/ib_user_verbs.h
@@ -278,6 +278,10 @@ struct ib_uverbs_resize_cq_resp {
 	__u64 driver_data[0];
 };
 
+enum {
+	IB_UVERBS_POLL_CQ_MAX_ENTRIES	= 1000
+};
+
 struct ib_uverbs_poll_cq {
 	__u64 response;
 	__u32 cq_handle;

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [patch v4] infiniband: uverbs: handle large number of entries
       [not found]                       ` <adahbf6gytv.fsf-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org>
@ 2010-11-24 22:18                           ` Jason Gunthorpe
  0 siblings, 0 replies; 48+ messages in thread
From: Jason Gunthorpe @ 2010-11-24 22:18 UTC (permalink / raw)
  To: Roland Dreier
  Cc: Dan Carpenter, Roland Dreier, Sean Hefty, Hal Rosenstock,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	kernel-janitors-u79uwXL29TY76Z2rM5mHXA

On Wed, Nov 24, 2010 at 02:07:40PM -0800, Roland Dreier wrote:
 
>  - Fixing the potential integer overflow by capping cmd.ne at some high
>    but safe value (1000 say)

This breaks how userspace wants to use poll, i.e. if you return anything
less than what I asked for, that means there is no more work to
do. Apps make this assumption, and it is a reasonable thing to
do. Otherwise apps need a dummy call to ibv_poll to avoid races.

So if you are worried about how many times ib_poll_cq is called then
bound the kzalloc size and wrap the whole thing in a loop, but
realistically I have to think the performance trade-off of
kzalloc/free vs calling ib_poll more often is not entirely obvious.
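
Rough sketch of what I mean (untested; CHUNK is whatever bound you
pick):

	u32 polled = 0;

	while (polled < cmd.ne) {
		int n = min_t(u32, cmd.ne - polled, CHUNK);

		ret = ib_poll_cq(cq, n, wc);
		if (ret <= 0)
			break;
		/* ... convert and copy 'ret' entries to userspace ... */
		polled += ret;
		if (ret < n)
			break;		/* CQ drained, stop early */
	}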

Who uses this path anyhow?

Jason

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [patch v4] infiniband: uverbs: handle large number of entries
       [not found]                           ` <20101124221845.GH2369-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2010-11-25  4:05                               ` Roland Dreier
  0 siblings, 0 replies; 48+ messages in thread
From: Roland Dreier @ 2010-11-25  4:05 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Dan Carpenter, Roland Dreier, Sean Hefty, Hal Rosenstock,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	kernel-janitors-u79uwXL29TY76Z2rM5mHXA

 > This breaks how userspace wants to use poll, ie if you return anything
 > less than what I asked for then that means there is no more work to
 > do. Apps make this assumption, and is a reasonable thing to
 > do. Otherwise apps need a dummy call to ibv_poll to avoid races.

OIC.  forgot about that.  hmm...

 > So if you are worried about how many times ib_poll_cq is called then
 > bound the kzalloc size and wrap the whole thing in a loop, but
 > realistically I have to think the performance trade off of
 > kzalloc/free vs calling ib_poll more often is not entirely obvious.

That's true... maybe doing things one at a time but avoiding the allocs
is the right tradeoff.

 > Who uses this path anyhow?

AFAICT, cxgb3, cxgb4, nes, ipath and qib.

 - R.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [patch v4] infiniband: uverbs: handle large number of entries
       [not found]                               ` <adad3pugi90.fsf-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org>
@ 2010-11-25  4:13                                   ` Jason Gunthorpe
  0 siblings, 0 replies; 48+ messages in thread
From: Jason Gunthorpe @ 2010-11-25  4:13 UTC (permalink / raw)
  To: Roland Dreier
  Cc: Dan Carpenter, Roland Dreier, Sean Hefty, Hal Rosenstock,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	kernel-janitors-u79uwXL29TY76Z2rM5mHXA

On Wed, Nov 24, 2010 at 08:05:47PM -0800, Roland Dreier wrote:

>  > So if you are worried about how many times ib_poll_cq is called then
>  > bound the kzalloc size and wrap the whole thing in a loop, but
>  > realistically I have to think the performance trade off of
>  > kzalloc/free vs calling ib_poll more often is not entirely obvious.
> 
> That's true... maybe doing things one at a time but avoiding the allocs
> is the right tradeoff.

Hmm, considering your list is everything but Mellanox, maybe it makes
much more sense to push the copy_to_user down into the driver -
i.e. an ibv_poll_cq_user - then the driver can construct each CQ entry
on the stack and copy it to userspace, avoiding the double copy, the
allocation, and any fixed overhead of ibv_poll_cq.
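
Something like this in each driver (sketch only -- foo_next_cqe() is
an invented helper that fills one entry in userspace wc format):

	int foo_poll_cq_user(struct ib_cq *ibcq, int ne,
			     struct ib_uverbs_wc __user *dst)
	{
		struct ib_uverbs_wc uwc;
		int i;

		for (i = 0; i < ne; i++) {
			if (!foo_next_cqe(ibcq, &uwc))	/* build on the stack */
				break;
			if (copy_to_user(&dst[i], &uwc, sizeof uwc))
				return -EFAULT;
		}
		return i;	/* number of completions copied */
	}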

A bigger change to be sure, but remember this old thread:

http://www.mail-archive.com/linux-rdma@vger.kernel.org/msg05114.html

2x improvement by removing allocs on the post path..

Jason

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: ibv_post_send/recv kernel path optimizations (was:  uverbs: handle large number of entries)
       [not found]                                   ` <20101125041337.GA11049-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2010-11-25 15:00                                     ` Or Gerlitz
       [not found]                                       ` <4CEE7A22.2040706-smomgflXvOZWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 48+ messages in thread
From: Or Gerlitz @ 2010-11-25 15:00 UTC (permalink / raw)
  To: Walukiewicz, Miroslaw
  Cc: Jason Gunthorpe, Roland Dreier, Roland Dreier, Sean Hefty,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA

Jason Gunthorpe wrote:
> Hmm, considering your list is everything but Mellanox, maybe it makes much more sense to push the copy_to_user down into the driver - ie a ibv_poll_cq_user - then the driver can construct each CQ entry on the stack and copy it to userspace, avoid the double copy, allocation and avoid any fixed overhead of ibv_poll_cq.
>
> A bigger change to be sure, but remember this old thread:
> http://www.mail-archive.com/linux-rdma@vger.kernel.org/msg05114.html
> 2x improvement by removing allocs on the post path..
Hi Mirek,

Any updates on your findings with the patches?

Or.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* RE: ibv_post_send/recv kernel path optimizations (was:  uverbs: handle large number of entries)
       [not found]                                       ` <4CEE7A22.2040706-smomgflXvOZWk0Htik3J/w@public.gmane.org>
@ 2010-11-26 11:56                                         ` Walukiewicz, Miroslaw
       [not found]                                           ` <BE2BFE91933D1B4089447C644860408064B44854-IGOiFh9zz4wLt2AQoY/u9bfspsVTdybXVpNB7YpNyf8@public.gmane.org>
  0 siblings, 1 reply; 48+ messages in thread
From: Walukiewicz, Miroslaw @ 2010-11-26 11:56 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Jason Gunthorpe, Roland Dreier, Roland Dreier, Hefty, Sean,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA

Some time ago we discussed the possibility of removing the use of nes_ud_sksq in the IMA driver, as it is a blocker for pushing the IMA solution to kernel.org.

The proposal was to use the optimized OFED transmit path through /dev/infiniband/rdma_cm instead of the private nes_ud_sksq device.

I implemented such a solution to check the performance impact and to look for ways to optimize the existing code.

I made a simple send test (sendto in kernel) using my Nehalem i7 machine.
The current nes_ud_sksq implementation achieved about 1.25 million pkts/sec.
The OFED path (with the rdma_cm call) achieved about 0.9 million pkts/sec.


I ran oprofile on the rdma_cm code and got the following results:

        samples  %        linenr info                 app name                 symbol name
2586067  24.5323  nes_uverbs.c:558            libnes-rdmav2.so         nes_upoll_cq
1198042  11.3650  (no location information)   vmlinux                  __up_read
539258    5.1156  (no location information)   vmlinux                  copy_user_generic_string
407884    3.8693  msa_verbs.c:1692            libmsa.so.1.0.0          msa_post_send
304569    2.8892  msa_verbs.c:2098            libmsa.so.1.0.0          usq_sendmsg_noblock
299954    2.8455  (no location information)   vmlinux                  __kmalloc
297463    2.8218  (no location information)   libibverbs.so.1.0.0      /usr/lib64/libibverbs.so.1.0.0
267951    2.5419  uverbs_cmd.c:1433           ib_uverbs.ko             ib_uverbs_post_send
264709    2.5111  (no location information)   vmlinux                  kfree
205107    1.9457  port.c:2947                 libmsa.so.1.0.0          sendto
146225    1.3871  (no location information)   vmlinux                  __down_read
145941    1.3844  (no location information)   libpthread-2.5.so        __write_nocancel
139934    1.3275  nes_ud.c:1746               iw_nes.ko                nes_ud_post_send_new_path
131879    1.2510  send.c:32                   msa_tst                  blocking_test_send(void*)
127519    1.2097  (no location information)   vmlinux                  system_call
123552    1.1721  port.c:858                  libmsa.so.1.0.0          find_mcast
109249    1.0364  nes_verbs.c:3478            iw_nes.ko                nes_post_send
92060     0.8733  (no location information)   vmlinux                  vfs_write
90187     0.8555  uverbs_cmd.c:144            ib_uverbs.ko             __idr_get_uobj
89563     0.8496  nes_uverbs.c:1460           libnes-rdmav2.so         nes_upost_send

From the trace it looks like __up_read() (11%) wastes the most time.  It
is called from idr_read_qp() when put_uobj_read() is called.

The copy_from_user(&cmd, buf, sizeof cmd) calls account for 5%; the copy
is done twice from ib_uverbs_post_send() for IMA and once in
ib_uverbs_write() for each frame.

__kmalloc/kfree (5%) is the third significant cost.  It is called twice
for each frame transmitted.

Altogether it is about a 20% performance loss compared to the
nes_ud_sksq path, which we give up when we use the OFED path.

What I can change is the kmalloc/kfree usage - it is possible to allocate once at start and use pre-allocated buffers.
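
For example, roughly (sketch only; MAX_CMD_SIZE is a bound I would
still have to pick, and this needs locking if several threads write
concurrently):

	/* once, at context creation time */
	file->wr_buf = kmalloc(MAX_CMD_SIZE, GFP_KERNEL);

	/* per frame, instead of kmalloc(cmd.wqe_size, ...) + kfree() */
	if (cmd.wqe_size > MAX_CMD_SIZE)
		return -EINVAL;
	user_wr = file->wr_buf;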

I don't see any way to optimize the idr_read_qp() usage or the copy_from_user.  In the current approach we use a shared page and a separate nes_ud_sksq handle for each created QP, so there is no need for any user-space data copy or QP lookup.

Do you have any idea how can we optimize this path?

Regards,

Mirek


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: ibv_post_send/recv kernel path optimizations
       [not found]                                           ` <BE2BFE91933D1B4089447C644860408064B44854-IGOiFh9zz4wLt2AQoY/u9bfspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2010-12-01  8:11                                             ` Or Gerlitz
       [not found]                                               ` <4CF60343.7050602-smomgflXvOZWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 48+ messages in thread
From: Or Gerlitz @ 2010-12-01  8:11 UTC (permalink / raw)
  To: Walukiewicz, Miroslaw, Jason Gunthorpe, Roland Dreier
  Cc: Roland Dreier, Hefty, Sean, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On 11/26/2010 1:56 PM, Walukiewicz, Miroslaw wrote:
> From the trace it looks like the __up_read() - 11% wastes most of time. It is called from idr_read_qp when a  put_uobj_read is called. if (copy_from_user(&cmd, buf, sizeof cmd))  - 5% it is called twice from ib_uverbs_post_send() for IMA and once in ib_uverbs_write() per each frame... and __kmalloc/kfree - 5% is the third function that has a big meaning. It is called twice for each frame transmitted. It is about 20% of performance loss comparing to nes_ud_sksq path which we miss when we use a OFED path.
>
> What I can modify is a kmalloc/kfree optimization - it is possible to make allocation only at start and use pre-allocated buffers. I don't see any way for optimalization of idr_read_qp usage or copy_user. In current approach we use a shared page and a separate nes_ud_sksq handle for each created QP so there is no need for any user space data copy or QP lookup.
As was mentioned earlier in this thread, and repeated here, the 
kmalloc/kfree can be removed.  As for the 2nd copy_from_user, I don't see 
why the ib uverbs flow (BTW - the data path has nothing to do with the 
rdma_cm, you're working with /dev/infiniband/uverbsX) can't be enhanced, 
e.g. to support a shared page which is allocated and mmapped from uverbs 
to user space and used in the same manner your implementation does. The 
1st copy_from_user should add pretty much nothing, and if it does, it can 
be replaced with a different user/kernel IPC mechanism which costs less. 
So we're basically left with the idr_read_qp; I wonder what other 
people think about if/how this can be optimized?

Or.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* RE: ibv_post_send/recv kernel path optimizations
       [not found]                                               ` <4CF60343.7050602-smomgflXvOZWk0Htik3J/w@public.gmane.org>
@ 2010-12-08 12:14                                                 ` Walukiewicz, Miroslaw
       [not found]                                                   ` <BE2BFE91933D1B4089447C64486040806673CF38-IGOiFh9zz4wLt2AQoY/u9bfspsVTdybXVpNB7YpNyf8@public.gmane.org>
  2010-12-14 14:12                                                 ` Walukiewicz, Miroslaw
  1 sibling, 1 reply; 48+ messages in thread
From: Walukiewicz, Miroslaw @ 2010-12-08 12:14 UTC (permalink / raw)
  To: Or Gerlitz, Jason Gunthorpe, Roland Dreier
  Cc: Roland Dreier, Hefty, Sean, linux-rdma-u79uwXL29TY76Z2rM5mHXA

Or, 

>I don't see why the ib uverbs flow (BTW - the data path has nothing to do with the 
>rdma_cm, you're working with /dev/infiniband/uverbsX), can't be enhanced 
>e.g to support shared-page which is allocated && mmaped from uverbs to 
>user space and used in the same manner your implementation does.

The problem that I see is that mmap is currently used for mapping the doorbell page in different drivers.

We can use it for mapping a page for transmit/receive operations only if we are able to differentiate whether we need to map the doorbell or our shared page.

The second problem is that this rx/tx mmap should map a separate page per QP to avoid unnecessary QP lookups, so the page identifier passed to mmap should be based on the QP identifier.

I cannot find the specific code for /dev/infiniband/uverbsX. Does this device driver share the same functions as /dev/infiniband/rdma_cm, or does it have its own implementation?

Mirek


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: ibv_post_send/recv kernel path optimizations
       [not found]                                                   ` <BE2BFE91933D1B4089447C64486040806673CF38-IGOiFh9zz4wLt2AQoY/u9bfspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2010-12-08 18:30                                                     ` Jason Gunthorpe
  2010-12-08 19:04                                                     ` Roland Dreier
  1 sibling, 0 replies; 48+ messages in thread
From: Jason Gunthorpe @ 2010-12-08 18:30 UTC (permalink / raw)
  To: Walukiewicz, Miroslaw
  Cc: Or Gerlitz, Roland Dreier, Roland Dreier, Hefty, Sean,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Wed, Dec 08, 2010 at 12:14:51PM +0000, Walukiewicz, Miroslaw wrote:
> Or, 
> 
> >I don't see why the ib uverbs flow (BTW - the data path has nothing to do with the 
> >rdma_cm, you're working with /dev/infiniband/uverbsX), can't be enhanced 
> >e.g to support shared-page which is allocated && mmaped from uverbs to 
> >user space and used in the same manner your implementation does.
> 
> The problem that I see is that the mmap is currently used for
> mapping of doorbell page in different drivers.
> 
> We can use it for mapping a page for transmit/receive operation when
> we are able to differentiate that we need to map Doorbell or our
> shared page.

There is the 64 bit file offset field to mmap which I think is
driver-specific. You could use 0 for the doorbell page, QPN*PAGE_SIZE
+ QPN_OFFSET for the per-QP page, etc.
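
i.e. in your ib_dev mmap handler, roughly (names made up):

	static int foo_mmap(struct ib_ucontext *context,
			    struct vm_area_struct *vma)
	{
		if (vma->vm_pgoff == 0)		/* doorbell page */
			return foo_map_doorbell(context, vma);

		/* any other offset encodes a QP number */
		return foo_map_qp_page(context, vma,
				       vma->vm_pgoff - QPN_OFFSET);
	}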

> The second problem is that this rx/tx mmap should map the separate
> page per QP to avoid the unnecessary QP lookups so page identifier
> passed to mmap should be based on QP identifier.
> 
> I cannot find a specific code for /dev/infiniband/uverbsX. Is this
> device driver sharing the same functions like /dev/infiniband/rdmacm
> or it has own implementation.

It is in drivers/infiniband/core/uverbs*

For mmap the call is just routed to the driver's ib_dev mmap function,
so you can do whatever you want in your driver and match the
functionality in your userspace libibverbs driver library.

I think you should be able to implement your driver-specific
optimization within the uverbs framework - that would be best all
round.

Jason

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: ibv_post_send/recv kernel path optimizations
       [not found]                                                   ` <BE2BFE91933D1B4089447C64486040806673CF38-IGOiFh9zz4wLt2AQoY/u9bfspsVTdybXVpNB7YpNyf8@public.gmane.org>
  2010-12-08 18:30                                                     ` Jason Gunthorpe
@ 2010-12-08 19:04                                                     ` Roland Dreier
  1 sibling, 0 replies; 48+ messages in thread
From: Roland Dreier @ 2010-12-08 19:04 UTC (permalink / raw)
  To: Walukiewicz, Miroslaw
  Cc: Or Gerlitz, Jason Gunthorpe, Roland Dreier, Hefty, Sean,
	linux-rdma@vger.kernel.org

 > The problem that I see is that the mmap is currently used for mapping
 > of doorbell page in different drivers.

The driver can use different offsets into the file to map different
things.  For example I believe ehca, ipath and qib already do this.

 > I cannot find a specific code for /dev/infiniband/uverbsX. Is this
 > device driver sharing the same functions like /dev/infiniband/rdmacm
 > or it has own implementation.

It is in drivers/infiniband/core/uverbs_main.c.

 - R.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* RE: ibv_post_send/recv kernel path optimizations
       [not found]                                               ` <4CF60343.7050602-smomgflXvOZWk0Htik3J/w@public.gmane.org>
  2010-12-08 12:14                                                 ` Walukiewicz, Miroslaw
@ 2010-12-14 14:12                                                 ` Walukiewicz, Miroslaw
       [not found]                                                   ` <BE2BFE91933D1B4089447C644860408066ABCF66-IGOiFh9zz4wLt2AQoY/u9bfspsVTdybXVpNB7YpNyf8@public.gmane.org>
  1 sibling, 1 reply; 48+ messages in thread
From: Walukiewicz, Miroslaw @ 2010-12-14 14:12 UTC (permalink / raw)
  To: Or Gerlitz, Jason Gunthorpe, Roland Dreier
  Cc: Roland Dreier, Hefty, Sean, linux-rdma-u79uwXL29TY76Z2rM5mHXA

Or,

I looked into the shared page approach for passing post_send/post_recv info. I still have some concerns.

The shared page must be allocated per QP, and there should be a common way to allocate such a page for each driver.

As Jason and Roland said, the best way to pass this parameter through mmap is the offset. There is no common way the offset is used per driver; it is a driver-specific parameter.

The next problem is how many shared pages the driver should allocate to share with user space. They must have room for every buffer posted by the application. This is a big concern for post_recv, where a large number of buffers are posted. The current implementation has no such limit.

Even if a common offset definition were defined and accepted, the shared page must be stored in the ib_qp structure. When post_send is called for many QPs, there is a single entry point to ib_uverbs_post_send using write to /dev/infiniband/uverbsX. In that case a lookup in the QP store (idr_read_qp) is necessary to find the correct ibv_qp structure, which is a big time consumer on the path.

The NES IMA kernel path also has such a QP lookup, but the QP number format is designed to make the lookup very quick. The QP numbers in OFED have no defined structure, so generic lookup functions like idr_read_qp() must be used.

Regards,

Mirek



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: ibv_post_send/recv kernel path optimizations
       [not found]                                                   ` <BE2BFE91933D1B4089447C644860408066ABCF66-IGOiFh9zz4wLt2AQoY/u9bfspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2010-12-14 18:17                                                     ` Jason Gunthorpe
       [not found]                                                       ` <20101214181735.GA2506-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 48+ messages in thread
From: Jason Gunthorpe @ 2010-12-14 18:17 UTC (permalink / raw)
  To: Walukiewicz, Miroslaw
  Cc: Or Gerlitz, Roland Dreier, Roland Dreier, Hefty, Sean,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Tue, Dec 14, 2010 at 02:12:56PM +0000, Walukiewicz, Miroslaw wrote:
> Or,
> 
> I looked into shared page approach of passing post_send/post_recv
> info. I still have some concerns.
> 
> The shared page must be allocated per QP and there should be a
> common way to allocate such page for each driver.

Why must it be common?

Aren't these pages just hidden away in your driver library or is
something visible to the app? I get that they are somehow used to
insert/remove the extra header information that the hardware cannot
stuff, but how?

> As Jason and Roland said, the best way to pass this parameter
> through mmap is offset. There is no common way how the Offset is
> used per driver and it is driver specific parameter.

Right, offset is driver specific, it is only used between your
userspace driver library and your kernel driver. You can define
whatever you want.

So use something like QPN*4096 + 1

> The next problem is how many shared pages should driver allocate to
> share with user-space. They should contain place for each posted
> buffer by application. It is a big concern to post_recv where large
> number of buffers are posted.  Current implementation has no such
> limit.

I don't see the problem. mmap also has a length argument. So
mmap(0,16*4096,PROT_READ|PROT_WRITE,MAP_SHARED,fd,QPN*4096+1)

Means map 16 pages associated with QPN. You don't have to have the
offset and length 'make sense' they are just parameters.

> Even the common offset definition would be defined and accepted, the
> shared page must be stored in ib_qp structure.  When a post_send is

You don't need a common definition, it is driver specific.

> called for many QPs, there is a single entry point to
> ib_uverbs_post_send using write to /dev/infiniband/uverbsX. In that
> case there is a lookup to QP store (idr_read_qp) necessary to find a
> correct ibv_qp Structure, what is a big time consumer on the path.

I don't think this should be such a big problem. The simplest solution
would be to front the idr_read_qp with a small learning hashing table.
If you have only one active QP per verbs instance then you avoid the
idr calls. I'm guessing your raw ethernet app uses one QP?
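
E.g. a one-entry cache in front of the idr lookup, roughly (sketch
only -- 'last_qp' is an invented field, and it still needs the same
locking/refcounting that idr_read_qp does today):

	struct ib_qp *qp = file->last_qp;

	if (!qp || qp->uobject->id != cmd.qp_handle) {
		qp = idr_read_qp(cmd.qp_handle, file->ucontext);
		if (!qp)
			return -EINVAL;
		file->last_qp = qp;	/* remember for the next post */
	}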

> The NES IMA kernel path also has such QP lookup but the QP number
> format is designed to make such lookup very quickly.  The QP numbers
> in OFED are not defined so generic lookup functions like
> idr_read_qp() must be use.

Maybe look at moving the QPN to ibv_qp translation into the driver
then - or better yet, move allocation out of the driver, if Mellanox
could change their FW.  You are right that we could do this much
faster if the QPN was structured in some way.

Jason

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: ibv_post_send/recv kernel path optimizations
       [not found]                                                       ` <20101214181735.GA2506-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2010-12-27 12:38                                                         ` Or Gerlitz
       [not found]                                                           ` <4D1888CB.2010101-hKgKHo2Ms0FWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 48+ messages in thread
From: Or Gerlitz @ 2010-12-27 12:38 UTC (permalink / raw)
  To: Jason Gunthorpe, Walukiewicz, Miroslaw
  Cc: Roland Dreier, Hefty, Sean, linux-rdma-u79uwXL29TY76Z2rM5mHXA

Jason Gunthorpe wrote:
> Walukiewicz, Miroslaw wrote:

>> called for many QPs, there is a single entry point to
>> ib_uverbs_post_send using write to /dev/infiniband/uverbsX. In that
>> case there is a lookup to QP store (idr_read_qp) necessary to find a
>> correct ibv_qp Structure, what is a big time consumer on the path.

> I don't think this should be such a big problem. The simplest solution
> would be to front the idr_read_qp with a small learning hashing table.

Yes, there must be a few ways (e.g. as Jason suggested) to do this house-keeping
much more efficiently, in a manner that fits the fast path - which maybe wasn't
the mindset when this code was written, as its primary use was to invoke control
plane commands.

>> The NES IMA kernel path also has such QP lookup but the QP number
>> format is designed to make such lookup very quickly.  The QP numbers in
>> OFED are not defined so generic lookup functions like idr_read_qp() must be use.

> Maybe look at moving the QPN to ibv_qp translation into the driver
> then - or better yet, move allocation out of the driver, if Mellanox
> could change their FW.. You are right that we could do this much
> faster if the QPN was structured in some way

I think there should be some validation at the uverbs level, as the caller is an
untrusted user-space application, e.g. similar to what is done for each system
call on a file descriptor.

Or.


^ permalink raw reply	[flat|nested] 48+ messages in thread

* RE: ibv_post_send/recv kernel path optimizations
       [not found]                                                           ` <4D1888CB.2010101-hKgKHo2Ms0FWk0Htik3J/w@public.gmane.org>
@ 2010-12-27 15:18                                                             ` Walukiewicz, Miroslaw
       [not found]                                                               ` <BE2BFE91933D1B4089447C644860408066C547E0-IGOiFh9zz4wLt2AQoY/u9bfspsVTdybXVpNB7YpNyf8@public.gmane.org>
  0 siblings, 1 reply; 48+ messages in thread
From: Walukiewicz, Miroslaw @ 2010-12-27 15:18 UTC (permalink / raw)
  To: Or Gerlitz, Jason Gunthorpe
  Cc: Roland Dreier, Hefty, Sean, linux-rdma-u79uwXL29TY76Z2rM5mHXA

> I don't think this should be such a big problem. The simplest solution
> would be to front the idr_read_qp with a small learning hashing table.

I implemented a very small hash table and achieved performance comparable to the previous solution.

>I think there should be some validation at the uverbs level, as the caller is an
>untrusted user-space application, e.g. similar to what is done for each system call on a file descriptor

Such validation is performed during QP creation. When the user has insufficient privileges, the RAW QP is not created.
For me there is no need to double-check this capability. When the QP is not created, it is not in the hash list, so there is no way to send packets.

The patch for ofed-1.5.3 is below. I will try to push it to kernel.org after porting.

    The uverbs post_send/post_recv path is modified to do a pre-lookup
    for RAW_ETH QPs. When a RAW_ETH QP is found, the driver-specific path
    is used for posting buffers, for example using a shared page approach
    in cooperation with the user-space library.

Signed-off-by: Mirek Walukiewicz <miroslaw.walukiewicz-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---

 drivers/infiniband/core/uverbs_cmd.c |   94 +++++++++++++++++++++++++++++++++-
 include/rdma/ib_verbs.h              |    4 +
 2 files changed, 95 insertions(+), 3 deletions(-)


diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index 3520182..234324f 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -269,6 +269,54 @@ static void put_xrcd_read(struct ib_uobject *uobj)
 	put_uobj_read(uobj);
 }
 
+#define RAW_QP_HASH(h) ((((h) >> 24) ^ ((h) >> 16) ^ \
+			((h) >> 8) ^ (h)) & (MAX_RAW_QP_HASH - 1))
+
+static void init_raw_qp_hash(struct ib_ucontext *ucontext)
+{
+	int idx;
+
+	for (idx = 0; idx < MAX_RAW_QP_HASH; idx++)
+		INIT_LIST_HEAD(&ucontext->raw_qp_hash[idx]);
+}
+
+static void raw_qp_hash_add(struct ib_ucontext *ucontext,
+			u32 qp_handle,
+			struct ib_qp *qp)
+{
+	int key;
+
+	if (qp->qp_type != IB_QPT_RAW_ETH)
+		return;
+
+	key = RAW_QP_HASH(qp_handle);
+	list_add(&qp->uobject->raw_qp_list, &ucontext->raw_qp_hash[key]);
+}
+
+static void raw_qp_hash_del(struct ib_qp *qp)
+{
+	if (qp->qp_type != IB_QPT_RAW_ETH)
+		return;
+
+	list_del(&qp->uobject->raw_qp_list);
+}
+
+static struct ib_qp *raw_qp_lookup(u32 qp_handle, struct ib_ucontext *ucontext)
+{
+	int key;
+	struct ib_qp *qp;
+	struct ib_uobject *uobj, *tmp;
+
+	key = RAW_QP_HASH(qp_handle);
+	list_for_each_entry_safe(uobj, tmp,
+				&ucontext->raw_qp_hash[key], raw_qp_list) {
+		qp = uobj->object;
+		if (qp->uobject->id == qp_handle)
+			return qp;
+	}
+	return NULL;
+}
+
 ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file,
 			      const char __user *buf,
 			      int in_len, int out_len)
@@ -313,6 +361,7 @@ ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file,
 	INIT_LIST_HEAD(&ucontext->srq_list);
 	INIT_LIST_HEAD(&ucontext->ah_list);
 	INIT_LIST_HEAD(&ucontext->xrc_domain_list);
+	init_raw_qp_hash(ucontext);
 	ucontext->closing = 0;
 
 	resp.num_comp_vectors = file->device->num_comp_vectors;
@@ -1155,6 +1204,7 @@ ssize_t ib_uverbs_create_qp(struct ib_uverbs_file *file,
 
 	mutex_lock(&file->mutex);
 	list_add_tail(&obj->uevent.uobject.list, &file->ucontext->qp_list);
+	raw_qp_hash_add(file->ucontext, obj->uevent.uobject.id, qp);
 	mutex_unlock(&file->mutex);
 
 	obj->uevent.uobject.live = 1;
@@ -1412,6 +1462,7 @@ ssize_t ib_uverbs_destroy_qp(struct ib_uverbs_file *file,
 
 	mutex_lock(&file->mutex);
 	list_del(&uobj->list);
+	raw_qp_hash_del(qp);
 	mutex_unlock(&file->mutex);
 
 	ib_uverbs_release_uevent(file, &obj->uevent);
@@ -1450,6 +1501,25 @@ ssize_t ib_uverbs_post_send(struct ib_uverbs_file *file,
 	if (cmd.wqe_size < sizeof (struct ib_uverbs_send_wr))
 		return -EINVAL;
 
+	mutex_lock(&file->mutex);
+	qp = raw_qp_lookup(cmd.qp_handle, file->ucontext);
+	mutex_unlock(&file->mutex);
+	if (qp) {
+		if (qp->qp_type == IB_QPT_RAW_ETH) {
+			resp.bad_wr = 0;
+			ret = qp->device->post_send(qp, NULL, &bad_wr);
+			if (ret)
+				resp.bad_wr = cmd.wr_count;
+
+			if (copy_to_user((void __user *) (unsigned long)
+					cmd.response,
+					&resp,
+					sizeof resp))
+				ret = -EFAULT;
+			goto out_raw_qp;
+		}
+	}
+
 	user_wr = kmalloc(cmd.wqe_size, GFP_KERNEL);
 	if (!user_wr)
 		return -ENOMEM;
@@ -1579,7 +1649,7 @@ out_put:
 
 out:
 	kfree(user_wr);
-
+out_raw_qp:
 	return ret ? ret : in_len;
 }
 
@@ -1664,7 +1734,6 @@ err:
 		kfree(wr);
 		wr = next;
 	}
-
 	return ERR_PTR(ret);
 }
 
@@ -1681,6 +1750,25 @@ ssize_t ib_uverbs_post_recv(struct ib_uverbs_file *file,
 	if (copy_from_user(&cmd, buf, sizeof cmd))
 		return -EFAULT;
 
+	mutex_lock(&file->mutex);
+	qp = raw_qp_lookup(cmd.qp_handle, file->ucontext);
+	mutex_unlock(&file->mutex);
+	if (qp) {
+		if (qp->qp_type == IB_QPT_RAW_ETH) {
+			resp.bad_wr = 0;
+			ret = qp->device->post_recv(qp, NULL, &bad_wr);
+			if (ret)
+				resp.bad_wr = cmd.wr_count;
+
+			if (copy_to_user((void __user *) (unsigned long)
+					cmd.response,
+					&resp,
+					sizeof resp))
+				ret = -EFAULT;
+			goto out_raw_qp;
+		}
+	}
+
 	wr = ib_uverbs_unmarshall_recv(buf + sizeof cmd,
 				       in_len - sizeof cmd, cmd.wr_count,
 				       cmd.sge_count, cmd.wqe_size);
@@ -1713,7 +1801,7 @@ out:
 		kfree(wr);
 		wr = next;
 	}
-
+out_raw_qp:
 	return ret ? ret : in_len;
 }
 
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index f5b054a..adf1dd8 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -838,6 +838,8 @@ struct ib_fmr_attr {
 	u8	page_shift;
 };
 
+#define MAX_RAW_QP_HASH 8
+
 struct ib_ucontext {
 	struct ib_device       *device;
 	struct list_head	pd_list;
@@ -848,6 +850,7 @@ struct ib_ucontext {
 	struct list_head	srq_list;
 	struct list_head	ah_list;
 	struct list_head	xrc_domain_list;
+	struct list_head	raw_qp_hash[MAX_RAW_QP_HASH];
 	int			closing;
 };
 
@@ -859,6 +862,7 @@ struct ib_uobject {
 	int			id;		/* index into kernel idr */
 	struct kref		ref;
 	struct rw_semaphore	mutex;		/* protects .live */
+	struct list_head        raw_qp_list;    /* raw qp hash */
 	int			live;
 };

Mirek


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: ibv_post_send/recv kernel path optimizations
       [not found]                                                               ` <BE2BFE91933D1B4089447C644860408066C547E0-IGOiFh9zz4wLt2AQoY/u9bfspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2010-12-27 15:22                                                                 ` Or Gerlitz
       [not found]                                                                   ` <4D18AF2D.1010109-smomgflXvOZWk0Htik3J/w@public.gmane.org>
  2011-01-05 18:16                                                                 ` Roland Dreier
  1 sibling, 1 reply; 48+ messages in thread
From: Or Gerlitz @ 2010-12-27 15:22 UTC (permalink / raw)
  To: Walukiewicz, Miroslaw
  Cc: Jason Gunthorpe, Roland Dreier, Hefty, Sean,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA

On 12/27/2010 5:18 PM, Walukiewicz, Miroslaw wrote:
> I implemented the very small hash table and I achieved performance 
> comparable to previous solution.

Just to clarify, when saying "achieved performance comparable to 
previous solution" you refer to the approach which bypasses uverbs on 
the post send path? Also, why enhance only the raw qp flow?

Or.


^ permalink raw reply	[flat|nested] 48+ messages in thread

* RE: ibv_post_send/recv kernel path optimizations
       [not found]                                                                   ` <4D18AF2D.1010109-smomgflXvOZWk0Htik3J/w@public.gmane.org>
@ 2010-12-27 15:40                                                                     ` Walukiewicz, Miroslaw
  0 siblings, 0 replies; 48+ messages in thread
From: Walukiewicz, Miroslaw @ 2010-12-27 15:40 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Jason Gunthorpe, Roland Dreier, Hefty, Sean,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA

> Just to clarify, when saying "achieved performance comparable to 
> previous solution" you refer to the approach which bypasses uverbs on 
> the post send path? Also, why enhance only the raw qp flow?

I am comparing to my previous solution, which uses a private device for passing information about packets. Compared to the current approach I see more than a 20% improvement.

This solution introduces a new path for posting buffers using a shared page approach. It works the following way:
1. Create the RAW QP and add it to the RAW QP hash list.
2. The user-space library mmaps the shared page (this is a device-specific action and must be implemented separately for each driver).
3. When posting buffers, the library puts the buffer info into the shared page and calls uverbs.
4. uverbs detects the RAW QP and informs the driver, bypassing the current path.

The solution cannot be shared between RDMA drivers because it needs a redesign of the driver (the shared page format is vendor-specific).
Right now only the NES driver implements the RAW QP path through the kernel (other vendors use a pure user-space solution), so no other vendor will use this path.

There is a possibility to add a new QP capability or attribute that will inform uverbs that the new transmit path is used, so the solution could be extended to all drivers.

Mirek




^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: ibv_post_send/recv kernel path optimizations
       [not found]                                                               ` <BE2BFE91933D1B4089447C644860408066C547E0-IGOiFh9zz4wLt2AQoY/u9bfspsVTdybXVpNB7YpNyf8@public.gmane.org>
  2010-12-27 15:22                                                                 ` Or Gerlitz
@ 2011-01-05 18:16                                                                 ` Roland Dreier
       [not found]                                                                   ` <ada4o9nfc6e.fsf-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org>
  1 sibling, 1 reply; 48+ messages in thread
From: Roland Dreier @ 2011-01-05 18:16 UTC (permalink / raw)
  To: Walukiewicz, Miroslaw
  Cc: Or Gerlitz, Jason Gunthorpe, Hefty, Sean, linux-rdma@vger.kernel.org

 > The patch for ofed-1.5.3 looks like below. I will try to push it to kernel.org after porting.
 > 
 >     Now the uverbs post_send/post_recv path is modified to do a pre-lookup
 >     for RAW_ETH QPs. When a RAW_ETH QP is found, the driver-specific path
 >     is used for posting buffers, for example by using a shared-page
 >     approach in cooperation with the user-space library.

I don't quite see why a hash table helps performance much vs. an IDR.
Is the actual IDR lookup a significant part of the cost?  (By the way,
instead of list_head you could use hlist_head to make your hash table
denser and save cache footprint -- that way an 8-entry table on 64-bit
systems fits in one cacheline)
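
A minimal sketch of such an hlist-based table (not from the patch; the names are hypothetical, and it assumes the three-argument form of hlist_for_each_entry() -- older kernels take a fourth cursor argument):

#include <linux/list.h>
#include <linux/types.h>

#define QP_HASH_BITS	3
#define QP_HASH_SIZE	(1 << QP_HASH_BITS)	/* 8 buckets = 64 bytes */

static struct hlist_head qp_hash[QP_HASH_SIZE];	/* zeroed = all empty */

struct hashed_qp {
	u32			handle;
	struct hlist_node	node;	/* one pointer smaller than list_head */
};

static struct hlist_head *qp_bucket(u32 handle)
{
	return &qp_hash[handle & (QP_HASH_SIZE - 1)];
}

static void qp_hash_add(struct hashed_qp *qp)
{
	hlist_add_head(&qp->node, qp_bucket(qp->handle));
}

static struct hashed_qp *qp_hash_find(u32 handle)
{
	struct hashed_qp *qp;

	/* Locking is deliberately omitted here -- exactly the concern
	 * raised in the next paragraph. */
	hlist_for_each_entry(qp, qp_bucket(handle), node)
		if (qp->handle == handle)
			return qp;
	return NULL;
}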

Also it seems that you get rid of all the locking on QPs when you look
them up in your hash table.  What protects against userspace posting a
send in one thread and destroying the QP in another thread, and ending
up having the destroy complete before the send is posted (leading to
use-after-free in the kernel)?

I would guess that your speedup is really coming from getting rid of
locking that is actually required for correctness.  Maybe I'm wrong
though, I'm just guessing here.

 - R.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* RE: ibv_post_send/recv kernel path optimizations
       [not found]                                                                   ` <ada4o9nfc6e.fsf-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org>
@ 2011-01-10 14:15                                                                     ` Walukiewicz, Miroslaw
       [not found]                                                                       ` <BE2BFE91933D1B4089447C644860408066DDDF31-IGOiFh9zz4wLt2AQoY/u9bfspsVTdybXVpNB7YpNyf8@public.gmane.org>
  0 siblings, 1 reply; 48+ messages in thread
From: Walukiewicz, Miroslaw @ 2011-01-10 14:15 UTC (permalink / raw)
  To: Roland Dreier
  Cc: Or Gerlitz, Jason Gunthorpe, Hefty, Sean,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA

Roland,

You are right that most of the speed-up comes from avoiding the semaphores, but that is not all of it.

From the oprofile traces, the semaphores account for half of the difference.

The next contributor is the copy_from_user() and kmalloc()/kfree() usage (in my proposal the shared-page method is used instead).

In my opinion, the responsibility for cases like protecting a QP against destroy during a buffer post (and other similar cases) should be moved to the vendor driver. The OFED code should only route the code path to the driver.

Mirek


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: ibv_post_send/recv kernel path optimizations
       [not found]                                                                       ` <BE2BFE91933D1B4089447C644860408066DDDF31-IGOiFh9zz4wLt2AQoY/u9bfspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2011-01-10 20:38                                                                         ` Roland Dreier
       [not found]                                                                           ` <adawrmc7av2.fsf-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 48+ messages in thread
From: Roland Dreier @ 2011-01-10 20:38 UTC (permalink / raw)
  To: Walukiewicz, Miroslaw
  Cc: Or Gerlitz, Jason Gunthorpe, Hefty, Sean, linux-rdma@vger.kernel.org

 > You are right that most of the speed-up comes from avoiding the semaphores, but that is not all of it.
 > 
 > From the oprofile traces, the semaphores account for half of the difference.
 > 
 > The next contributor is the copy_from_user() and kmalloc()/kfree() usage (in my proposal the shared-page method is used instead).

OK, but in any case the switch from idr to hash table seems to be
insignificant.  I agree that using a shared page is a good idea, but
removing locking needed for correctness is not a good optimization.

 > In my opinion, the responsibility for cases like protecting a QP
 > against destroy during a buffer post (and other similar cases) should
 > be moved to the vendor driver. The OFED code should only route the
 > code path to the driver.

Not sure what OFED code you're talking about.  We're discussing the
kernel uverbs code, right?

In any case I'd be interested in seeing how it looks if you move the
protection into the individual drivers.  I'd be worried about having to
duplicate the same code everywhere (which leads to bugs in individual
drivers) -- I guess this could be resolved by having the code be a
library that individual drivers call into.  But also I'm not sure if I
see how you could make such a scheme work -- you need to make sure that
the data structures used in the uverbs dispatch to drivers remain
consistent.
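
For illustration, a minimal sketch of what such a shared helper could look like, using the same rwsem pattern that idr_read_qp()/put_qp_read() rely on today; the qp_guard names are hypothetical:

#include <linux/rwsem.h>

/* Hypothetical helper a driver could embed in its QP struct. */
struct qp_guard {
	struct rw_semaphore sem;
};

static inline void qp_guard_init(struct qp_guard *g)
{
	init_rwsem(&g->sem);
}

/* Fast path: many posters may hold the guard concurrently. */
static inline void qp_guard_post_begin(struct qp_guard *g)
{
	down_read(&g->sem);
}

static inline void qp_guard_post_end(struct qp_guard *g)
{
	up_read(&g->sem);
}

/* Destroy path: returns only once no post is in flight, so the QP
 * cannot be freed underneath a concurrent post_send/post_recv. */
static inline void qp_guard_destroy(struct qp_guard *g)
{
	down_write(&g->sem);
}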

In the end I don't think we should go too far optimizing the
non-kernel-bypass case of verbs -- the main thing we're designing for is
kernel bypass hardware, after all.  Perhaps you could make your case go
faster by using a different file descriptor for each QP or something
(you could pass the fd back as part of the driver-specific create QP path)?
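
A hedged sketch of that per-QP descriptor idea, using anon_inode_getfd(); my_qp, the fops and the write handler are hypothetical:

#include <linux/anon_inodes.h>
#include <linux/fs.h>
#include <linux/module.h>

struct my_qp {
	/* vendor-private QP state */
	int dummy;
};

/* filp->private_data is the QP itself: no idr lookup, no shared lock. */
static ssize_t qp_fd_write(struct file *filp, const char __user *buf,
			   size_t count, loff_t *pos)
{
	struct my_qp *qp = filp->private_data;

	(void)qp;
	/* ... parse the post request and ring the doorbell ... */
	return count;
}

static const struct file_operations qp_fd_fops = {
	.owner	= THIS_MODULE,
	.write	= qp_fd_write,
};

/* Called from the driver-specific create-QP path; the returned fd number
 * would be copied into the create-QP response for the user library. */
static int create_qp_fd(struct my_qp *qp)
{
	return anon_inode_getfd("[raw-qp]", &qp_fd_fops, qp, O_RDWR);
}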

 - R.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* RE: ibv_post_send/recv kernel path optimizations
       [not found]                                                                           ` <adawrmc7av2.fsf-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org>
@ 2011-01-21 11:41                                                                             ` Walukiewicz, Miroslaw
       [not found]                                                                               ` <BE2BFE91933D1B4089447C644860408066F95285-IGOiFh9zz4wLt2AQoY/u9bfspsVTdybXVpNB7YpNyf8@public.gmane.org>
  0 siblings, 1 reply; 48+ messages in thread
From: Walukiewicz, Miroslaw @ 2011-01-21 11:41 UTC (permalink / raw)
  To: Roland Dreier
  Cc: Or Gerlitz, Jason Gunthorpe, Hefty, Sean,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA

Roland, 

You are right that the idr implementation makes an insignificant difference in performance. I made a version that keeps the idr and the semaphore usage, and I see only a minimal change compared to the hash table. Now only a shared page is used instead of kmalloc() and copy_to_user().

I simplified the changes to uverbs and achieved the performance I wanted. The patch now looks like below.

Are these changes acceptable for k.org?

Regards,

Mirek

--- ../SOURCES_19012011/ofa_kernel-1.5.3/drivers/infiniband/core/uverbs_cmd.c	2011-01-19 05:37:55.000000000 +0100
+++ ofa_kernel-1.5.3_idr_qp/drivers/infiniband/core/uverbs_cmd.c	2011-01-21 04:10:07.000000000 +0100
@@ -1449,15 +1449,29 @@
 
 	if (cmd.wqe_size < sizeof (struct ib_uverbs_send_wr))
 		return -EINVAL;
+	qp = idr_read_qp(cmd.qp_handle, file->ucontext);
+	if (!qp)
+		goto out_raw_qp;
+
+	if (qp->qp_type == IB_QPT_RAW_ETH) {
+		resp.bad_wr = 0;
+		ret = qp->device->post_send(qp, NULL, NULL);
+		if (ret)
+			resp.bad_wr = cmd.wr_count;
+
+		if (copy_to_user((void __user *) (unsigned long)
+				cmd.response,
+				&resp,
+				sizeof resp))
+			ret = -EFAULT;
+		put_qp_read(qp);
+		goto out_raw_qp;
+	}
 
 	user_wr = kmalloc(cmd.wqe_size, GFP_KERNEL);
 	if (!user_wr)
 		return -ENOMEM;
 
-	qp = idr_read_qp(cmd.qp_handle, file->ucontext);
-	if (!qp)
-		goto out;
-
 	is_ud = qp->qp_type == IB_QPT_UD;
 	sg_ind = 0;
 	last = NULL;
@@ -1577,9 +1591,8 @@
 		wr = next;
 	}
 
-out:
 	kfree(user_wr);
-
+out_raw_qp:
 	return ret ? ret : in_len;
 }
 
@@ -1681,16 +1694,31 @@
 	if (copy_from_user(&cmd, buf, sizeof cmd))
 		return -EFAULT;
 
+	qp = idr_read_qp(cmd.qp_handle, file->ucontext);
+	if (!qp)
+		goto out_raw_qp;
+
+	if (qp->qp_type == IB_QPT_RAW_ETH) {
+		resp.bad_wr = 0;
+		ret = qp->device->post_recv(qp, NULL, NULL);
+		if (ret)
+			resp.bad_wr = cmd.wr_count;
+
+		if (copy_to_user((void __user *) (unsigned long)
+				cmd.response,
+				&resp,
+				sizeof resp))
+			ret = -EFAULT;
+		put_qp_read(qp);
+		goto out_raw_qp;
+	}
+
 	wr = ib_uverbs_unmarshall_recv(buf + sizeof cmd,
 				       in_len - sizeof cmd, cmd.wr_count,
 				       cmd.sge_count, cmd.wqe_size);
 	if (IS_ERR(wr))
 		return PTR_ERR(wr);
 
-	qp = idr_read_qp(cmd.qp_handle, file->ucontext);
-	if (!qp)
-		goto out;
-
 	resp.bad_wr = 0;
 	ret = qp->device->post_recv(qp, wr, &bad_wr);
 
@@ -1707,13 +1735,13 @@
 			 &resp, sizeof resp))
 		ret = -EFAULT;
 
-out:
 	while (wr) {
 		next = wr->next;
 		kfree(wr);
 		wr = next;
 	}
 
+out_raw_qp:
 	return ret ? ret : in_len;
 }





^ permalink raw reply	[flat|nested] 48+ messages in thread

* RE: ibv_post_send/recv kernel path optimizations
       [not found]                                                                               ` <BE2BFE91933D1B4089447C644860408066F95285-IGOiFh9zz4wLt2AQoY/u9bfspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2011-01-21 15:49                                                                                 ` Hefty, Sean
       [not found]                                                                                   ` <CF9C39F99A89134C9CF9C4CCB68B8DDF25C1D51C31-osO9UTpF0USkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
  0 siblings, 1 reply; 48+ messages in thread
From: Hefty, Sean @ 2011-01-21 15:49 UTC (permalink / raw)
  To: Walukiewicz, Miroslaw, Roland Dreier
  Cc: Or Gerlitz, Jason Gunthorpe, linux-rdma-u79uwXL29TY76Z2rM5mHXA

> +	qp = idr_read_qp(cmd.qp_handle, file->ucontext);
> +	if (!qp)
> +		goto out_raw_qp;
> +
> +	if (qp->qp_type == IB_QPT_RAW_ETH) {
> +		resp.bad_wr = 0;
> +		ret = qp->device->post_send(qp, NULL, NULL);

This looks odd to me and can definitely confuse someone reading the code.  It adds assumptions to uverbs about the underlying driver implementation and ties that to the QP type.  I don't know if it makes more sense to key off something in the cmd or define some other property of the QP, but the NULL parameters into post_send are non-intuitive.
 
> +		if (ret)
> +			resp.bad_wr = cmd.wr_count;

Is this always the case?

- Sean

^ permalink raw reply	[flat|nested] 48+ messages in thread

* RE: ibv_post_send/recv kernel path optimizations
       [not found]                                                                                   ` <CF9C39F99A89134C9CF9C4CCB68B8DDF25C1D51C31-osO9UTpF0USkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2011-01-24 17:03                                                                                     ` Walukiewicz, Miroslaw
  0 siblings, 0 replies; 48+ messages in thread
From: Walukiewicz, Miroslaw @ 2011-01-24 17:03 UTC (permalink / raw)
  To: Hefty, Sean, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Or Gerlitz, Jason Gunthorpe, linux-rdma-u79uwXL29TY76Z2rM5mHXA

Sean,

The assumption here is that the user-space library prepares the vendor-specific data in user space, using a shared page allocated by the vendor driver. The information about the posted buffers is passed not through ib_wr but through the shared page. That is why the ib_wr pointers in post_send are not set: they are not passed to the kernel at all, which avoids copying them in the kernel.

As there is no ib_wr structure in the kernel, there is no reference to bad_wr or to the particular buffer that failed in this context, so the only reasonable operation state that can be passed through bad_wr is binary: operation successful (bad_wr == 0) or not (bad_wr != 0).

Instead of special-casing the RAW QP, it is also possible to signal the buffer-posting method through the QP create flags:

enum ib_qp_create_flags {
        IB_QP_CREATE_IPOIB_UD_LSO               = 1 << 0,
        IB_QP_CREATE_BLOCK_MULTICAST_LOOPBACK   = 1 << 1,
        IB_QP_CREATE_USE_SHARED_PAGE            = 1 << 2,
};

In that case the new method could be used with any QP type and would remain backward compatible.
Regards,

Mirek

^ permalink raw reply	[flat|nested] 48+ messages in thread

end of thread

Thread overview: 48+ messages
2010-10-07  7:16 [patch] infiniband: uverbs: limit the number of entries Dan Carpenter
2010-10-07  7:16 ` Dan Carpenter
2010-10-07 16:16 ` Jason Gunthorpe
2010-10-07 16:16   ` Jason Gunthorpe
     [not found]   ` <20101007161649.GD21206-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2010-10-07 16:59     ` Dan Carpenter
2010-10-07 16:59       ` Dan Carpenter
2010-10-08  7:59       ` Nicolas Palix
2010-10-08  7:59         ` Nicolas Palix
     [not found]         ` <AANLkTin5zou2JHsdDyhGESuxyPonOs3kLo9Th0vg-kd8-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-10-08 14:25           ` [patch v2] " Dan Carpenter
2010-10-08 14:25             ` Dan Carpenter
2010-10-09 23:16       ` [patch] " Jason Gunthorpe
2010-10-09 23:16         ` Jason Gunthorpe
     [not found]         ` <20101009231607.GA24649-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2010-10-12 11:31           ` [patch v3] infiniband: uverbs: handle large " Dan Carpenter
2010-10-12 11:31             ` Dan Carpenter
2010-10-12 21:01             ` Jason Gunthorpe
2010-10-12 21:01               ` Jason Gunthorpe
     [not found]               ` <20101012210118.GR24268-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2010-10-13  9:05                 ` Dan Carpenter
2010-10-13  9:05                   ` Dan Carpenter
2010-10-13  9:13                 ` [patch v4] " Dan Carpenter
2010-10-13  9:13                   ` Dan Carpenter
2010-11-23  7:10                   ` Dan Carpenter
2010-11-23  7:10                     ` Dan Carpenter
2010-11-24 22:07                     ` Roland Dreier
2010-11-24 22:07                       ` Roland Dreier
     [not found]                       ` <adahbf6gytv.fsf-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org>
2010-11-24 22:18                         ` Jason Gunthorpe
2010-11-24 22:18                           ` Jason Gunthorpe
     [not found]                           ` <20101124221845.GH2369-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2010-11-25  4:05                             ` Roland Dreier
2010-11-25  4:05                               ` Roland Dreier
     [not found]                               ` <adad3pugi90.fsf-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org>
2010-11-25  4:13                                 ` Jason Gunthorpe
2010-11-25  4:13                                   ` Jason Gunthorpe
     [not found]                                   ` <20101125041337.GA11049-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2010-11-25 15:00                                     ` ibv_post_send/recv kernel path optimizations (was: uverbs: handle large number of entries) Or Gerlitz
     [not found]                                       ` <4CEE7A22.2040706-smomgflXvOZWk0Htik3J/w@public.gmane.org>
2010-11-26 11:56                                         ` Walukiewicz, Miroslaw
     [not found]                                           ` <BE2BFE91933D1B4089447C644860408064B44854-IGOiFh9zz4wLt2AQoY/u9bfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2010-12-01  8:11                                             ` ibv_post_send/recv kernel path optimizations Or Gerlitz
     [not found]                                               ` <4CF60343.7050602-smomgflXvOZWk0Htik3J/w@public.gmane.org>
2010-12-08 12:14                                                 ` Walukiewicz, Miroslaw
     [not found]                                                   ` <BE2BFE91933D1B4089447C64486040806673CF38-IGOiFh9zz4wLt2AQoY/u9bfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2010-12-08 18:30                                                     ` Jason Gunthorpe
2010-12-08 19:04                                                     ` Roland Dreier
2010-12-14 14:12                                                 ` Walukiewicz, Miroslaw
     [not found]                                                   ` <BE2BFE91933D1B4089447C644860408066ABCF66-IGOiFh9zz4wLt2AQoY/u9bfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2010-12-14 18:17                                                     ` Jason Gunthorpe
     [not found]                                                       ` <20101214181735.GA2506-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2010-12-27 12:38                                                         ` Or Gerlitz
     [not found]                                                           ` <4D1888CB.2010101-hKgKHo2Ms0FWk0Htik3J/w@public.gmane.org>
2010-12-27 15:18                                                             ` Walukiewicz, Miroslaw
     [not found]                                                               ` <BE2BFE91933D1B4089447C644860408066C547E0-IGOiFh9zz4wLt2AQoY/u9bfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2010-12-27 15:22                                                                 ` Or Gerlitz
     [not found]                                                                   ` <4D18AF2D.1010109-smomgflXvOZWk0Htik3J/w@public.gmane.org>
2010-12-27 15:40                                                                     ` Walukiewicz, Miroslaw
2011-01-05 18:16                                                                 ` Roland Dreier
     [not found]                                                                   ` <ada4o9nfc6e.fsf-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org>
2011-01-10 14:15                                                                     ` Walukiewicz, Miroslaw
     [not found]                                                                       ` <BE2BFE91933D1B4089447C644860408066DDDF31-IGOiFh9zz4wLt2AQoY/u9bfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2011-01-10 20:38                                                                         ` Roland Dreier
     [not found]                                                                           ` <adawrmc7av2.fsf-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org>
2011-01-21 11:41                                                                             ` Walukiewicz, Miroslaw
     [not found]                                                                               ` <BE2BFE91933D1B4089447C644860408066F95285-IGOiFh9zz4wLt2AQoY/u9bfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2011-01-21 15:49                                                                                 ` Hefty, Sean
     [not found]                                                                                   ` <CF9C39F99A89134C9CF9C4CCB68B8DDF25C1D51C31-osO9UTpF0USkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
2011-01-24 17:03                                                                                     ` Walukiewicz, Miroslaw
