* [RFC] ibv_post_send()/ibv_post_recv() kernel path optimizations
@ 2010-08-06 10:03 Walukiewicz, Miroslaw
       [not found] ` <BE2BFE91933D1B4089447C64486040805BD83E5F-IGOiFh9zz4wLt2AQoY/u9bfspsVTdybXVpNB7YpNyf8@public.gmane.org>
  0 siblings, 1 reply; 6+ messages in thread
From: Walukiewicz, Miroslaw @ 2010-08-06 10:03 UTC (permalink / raw)
  To: Roland Dreier; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

The ibv_post_send()/ibv_post_recv() path through the kernel 
(using /dev/infiniband/rdmacm) could be optimized by removing the dynamic memory allocations on that path. 

Currently the transmit/receive path works the following way:
The user calls ibv_post_send(), which calls a vendor-specific function. 
When the path should go through the kernel, ibv_cmd_post_send() is called.
That function creates the POST_SEND message body that is passed to the kernel. 
As the number of SGEs is not known in advance, the message body is allocated dynamically 
(see libibverbs/src/cmd.c)
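
Roughly, the allocation in question looks like this (a sketch for illustration, not the literal cmd.c code):

struct ibv_post_send *cmd;
size_t cmd_size = sizeof(*cmd) +
                  wr_count  * sizeof(struct ibv_kern_send_wr) +
                  sge_count * sizeof(struct ibv_kern_sge);

/* per-call buffer for the command; the size depends on the WR/SGE
 * counts of this particular post */
cmd = alloca(cmd_size);
/* fill the header, append each send_wr and its sg_list behind it,
 * then write() the whole buffer to the uverbs command file */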

In the kernel the message body is parsed, and the wr and sge structures are recreated using dynamic allocations again. 
The goal of this operation is to rebuild a structure similar to the one in user space. 
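
Simplified, the kernel side today does something like this (sketch only; user_wr here stands for the i-th work request parsed out of the command buffer, not an actual variable in ib_uverbs_post_send()):

struct ib_send_wr *wr = NULL, *last = NULL, *next;
int i;

for (i = 0; i < cmd.wr_count; ++i) {
        /* one allocation per WR, with room for its sge array appended */
        next = kmalloc(sizeof(*next) +
                       user_wr->num_sge * sizeof(struct ib_sge), GFP_KERNEL);
        if (!next)
                goto err_free_list;
        /* copy wr_id, opcode, send_flags, ... from user_wr and point
         * next->sg_list at the memory behind the header */
        next->next = NULL;
        if (!last)
                wr = next;
        else
                last->next = next;
        last = next;
}
/* qp->device->post_send(qp, wr, &bad_wr), then kfree() every entry */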

The proposed optimization removes these dynamic allocations 
by redefining the structure passed to the kernel. 
From 

struct ibv_post_send {
        __u32 command;
        __u16 in_words;
        __u16 out_words;
        __u64 response;
        __u32 qp_handle;
        __u32 wr_count;
        __u32 sge_count;
        __u32 wqe_size;
        struct ibv_kern_send_wr send_wr[0];
};
To 

struct ibv_post_send {
        __u32 command;
        __u16 in_words;
        __u16 out_words;
        __u64 response;
        __u32 qp_handle;
        __u32 wr_count;
        __u32 sge_count;
        __u32 wqe_size;
        struct ibv_kern_send_wr send_wr[512];
};

A similar change is required in the kernel's struct ib_uverbs_post_send defined in /ofa_kernel/include/rdma/ib_uverbs.h.

This change limits the number of send_wr entries passed from unlimited (assured by dynamic allocation) to a reasonable maximum of 512. 
I think this number should be the maximum number of QP entries available for send. 
As IB/iWARP applications are low-latency applications, the number of WRs passed per call is never unbounded in practice.

As a result, instead of allocating dynamically, ibv_cmd_post_send() fills the proposed structure 
directly and passes it to the kernel. Whenever the number of send_wr entries exceeds the limit, ENOMEM is returned.
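
A minimal sketch of the intended ibv_cmd_post_send() path (hypothetical code, assuming the 512-entry limit above):

struct ibv_post_send cmd;          /* fixed-size, 512 send_wr slots */
struct ibv_send_wr *i;
unsigned n = 0;

for (i = wr; i; i = i->next) {
        if (n >= 512)
                return ENOMEM;     /* more WRs than the fixed array holds */
        /* translate *i into cmd.send_wr[n]: wr_id, opcode, sg_list, ... */
        n++;
}
cmd.wr_count = n;
/* write(context->cmd_fd, &cmd, sizeof cmd) and check the response as today */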

In the kernel, in ib_uverbs_post_send(), instead of dynamically allocating the ib_send_wr structures, 
a table of 512 ib_send_wr structures will be defined and 
all entries will be linked into a singly linked list, so the qp->device->post_send(qp, wr, &bad_wr) API will not change. 
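
Roughly (hypothetical sketch; the table would have to live off the kernel stack, e.g. allocated once per user context, since 512 entries are far too large for a stack frame):

struct ib_send_wr *tbl = file->wr_table;   /* hypothetical per-context table */
struct ib_send_wr *bad_wr;
int i, ret;

for (i = 0; i < cmd.wr_count; ++i) {
        /* copy cmd.send_wr[i] from the command buffer into tbl[i] ... */
        tbl[i].next = (i + 1 < cmd.wr_count) ? &tbl[i + 1] : NULL;
}
ret = qp->device->post_send(qp, tbl, &bad_wr);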

As far as I know, no driver currently uses that kernel path for posting buffers, so the iWARP multicast acceleration 
implemented in the NES driver would be the first user of the optimized path. 

Regards,

Mirek

Signed-off-by: Mirek Walukiewicz <miroslaw.walukiewicz-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>




* Re: [RFC] ibv_post_send()/ibv_post_recv() kernel path optimizations
       [not found] ` <BE2BFE91933D1B4089447C64486040805BD83E5F-IGOiFh9zz4wLt2AQoY/u9bfspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2010-08-06 15:57   ` Roland Dreier
       [not found]     ` <adak4o320op.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>
  2010-08-06 16:32   ` Jason Gunthorpe
  2010-08-06 18:00   ` Ralph Campbell
  2 siblings, 1 reply; 6+ messages in thread
From: Roland Dreier @ 2010-08-06 15:57 UTC (permalink / raw)
  To: Walukiewicz, Miroslaw; +Cc: linux-rdma@vger.kernel.org

 > The proposed optimization removes these dynamic allocations 
 > by redefining the structure passed to the kernel. 

 > To 
 > 
 > struct ibv_post_send {
 >         __u32 command;
 >         __u16 in_words;
 >         __u16 out_words;
 >         __u64 response;
 >         __u32 qp_handle;
 >         __u32 wr_count;
 >         __u32 sge_count;
 >         __u32 wqe_size;
 >         struct ibv_kern_send_wr send_wr[512];
 > };

I don't see how this can possibly work.  Where does the scatter/gather
list go if you make this have a fixed size array of send_wr?

Also I don't see why you need to change the user/kernel ABI at all to
get rid of dynamic allocations... can't you just have the kernel keep a
cached send_wr allocation (say, per user context) and reuse that?  (ie
allocate memory but don't free the first time into post_send, and only
reallocate if a bigger send request comes, and only free when destroying
the context)
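
Something along these lines, as a sketch (wr_buf/wr_buf_size are hypothetical
per-context fields, and locking against concurrent posts on the same context
is ignored here):

static void *get_cached_wr_buf(struct ib_uverbs_file *file, size_t size)
{
        if (size > file->wr_buf_size) {
                kfree(file->wr_buf);
                file->wr_buf = kmalloc(size, GFP_KERNEL);
                file->wr_buf_size = file->wr_buf ? size : 0;
        }
        return file->wr_buf;   /* freed only when the context is destroyed */
}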

 - R.
-- 
Roland Dreier <rolandd-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org> || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html


* Re: [RFC] ibv_post_send()/ibv_post_recv() kernel path optimizations
       [not found] ` <BE2BFE91933D1B4089447C64486040805BD83E5F-IGOiFh9zz4wLt2AQoY/u9bfspsVTdybXVpNB7YpNyf8@public.gmane.org>
  2010-08-06 15:57   ` Roland Dreier
@ 2010-08-06 16:32   ` Jason Gunthorpe
       [not found]     ` <20100806163237.GJ11306-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2010-08-06 18:00   ` Ralph Campbell
  2 siblings, 1 reply; 6+ messages in thread
From: Jason Gunthorpe @ 2010-08-06 16:32 UTC (permalink / raw)
  To: Walukiewicz, Miroslaw; +Cc: Roland Dreier, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Fri, Aug 06, 2010 at 11:03:36AM +0100, Walukiewicz, Miroslaw wrote:

> Currently the transmit/receive path works the following way: the user
> calls ibv_post_send(), which calls a vendor-specific function.  When
> the path should go through the kernel, ibv_cmd_post_send() is called.
> That function creates the POST_SEND message body that is passed to the
> kernel.  As the number of SGEs is not known in advance, the message
> body is allocated dynamically (see libibverbs/src/cmd.c)

Do you have any benchmarks that show the alloca is a measurable
overhead?  I'm pretty skeptical... alloca will generally boil down to
one or two assembly instructions adjusting the stack pointer, and not
even that if you are lucky and it can be merged into the function
prologue.

> In the kernel the message body is parsed, and the wr and sge
> structures are recreated using dynamic allocations again.  The goal of
> this operation is to rebuild a structure similar to the one in user space.

.. the kmalloc call(s) on the other hand definitely seems worth
looking at ..

> In the kernel, in ib_uverbs_post_send(), instead of dynamically
> allocating the ib_send_wr structures, a table of 512 ib_send_wr
> structures will be defined and all entries will be linked into a
> singly linked list, so the qp->device->post_send(qp, wr, &bad_wr) API
> will not change.

Isn't there a kernel API already for managing a pool of
pre-allocated fixed-size allocations?
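
(A kmem_cache, for instance -- rough sketch only, with MAX_SGE as a
made-up per-WR bound:)

static struct kmem_cache *send_wr_cache;

send_wr_cache = kmem_cache_create("uverbs_send_wr",
                                  sizeof(struct ib_send_wr) +
                                  MAX_SGE * sizeof(struct ib_sge),
                                  0, 0, NULL);

struct ib_send_wr *wr = kmem_cache_alloc(send_wr_cache, GFP_KERNEL);
/* fill and post, then kmem_cache_free(send_wr_cache, wr) */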

It isn't clear to me that is even necessary, Roland is right, all you
really need is a per-context (+per-cpu?) buffer you can grab, fill,
and put back.

> As far as I know, no driver currently uses that kernel path for
> posting buffers, so the iWARP multicast acceleration implemented in
> the NES driver would be the first user of the optimized path.

??

Jason


* Re: [RFC] ibv_post_send()/ibv_post_recv() kernel path optimizations
       [not found] ` <BE2BFE91933D1B4089447C64486040805BD83E5F-IGOiFh9zz4wLt2AQoY/u9bfspsVTdybXVpNB7YpNyf8@public.gmane.org>
  2010-08-06 15:57   ` Roland Dreier
  2010-08-06 16:32   ` Jason Gunthorpe
@ 2010-08-06 18:00   ` Ralph Campbell
  2 siblings, 0 replies; 6+ messages in thread
From: Ralph Campbell @ 2010-08-06 18:00 UTC (permalink / raw)
  To: Walukiewicz, Miroslaw; +Cc: Roland Dreier, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Fri, 2010-08-06 at 03:03 -0700, Walukiewicz, Miroslaw wrote:
> As far as I know, no driver currently uses that kernel path for posting buffers, so the iWARP multicast acceleration 
> implemented in the NES driver would be the first user of the optimized path. 

The libipathverbs.so plug-in for libibverbs and
the ib_ipath and ib_qib kernel modules use this path for
ibv_post_send().



* RE: [RFC] ibv_post_send()/ibv_post_recv() kernel path optimizations
       [not found]     ` <adak4o320op.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>
@ 2010-08-10  7:33       ` Walukiewicz, Miroslaw
  0 siblings, 0 replies; 6+ messages in thread
From: Walukiewicz, Miroslaw @ 2010-08-10  7:33 UTC (permalink / raw)
  To: Roland Dreier; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

I agree with you that changing the kernel ABI is not necessary. 
I will follow your directions regarding a single allocation at the start. 

Regards,

Mirek 



* RE: [RFC] ibv_post_send()/ibv_post_recv() kernel path optimizations
       [not found]     ` <20100806163237.GJ11306-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2010-08-10  7:39       ` Walukiewicz, Miroslaw
  0 siblings, 0 replies; 6+ messages in thread
From: Walukiewicz, Miroslaw @ 2010-08-10  7:39 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Roland Dreier, linux-rdma-u79uwXL29TY76Z2rM5mHXA

Hello Jason, 

> Do you have any benchmarks that show the alloca is a measurable
> overhead?

We changed the overall path (both kernel and user space) to an allocation-less approach and 
achieved twice better latency on the call into the kernel driver. I have no data on which part 
is dominant - kernel or user space. I think I will have some measurements next week, so I will share 
my results.

> Roland is right, all you
> really need is a per-context (+per-cpu?) buffer you can grab, fill,
> and put back.

I agree. I will go in this direction.

Regards,

Mirek



Thread overview: 6+ messages
2010-08-06 10:03 [RFC] ibv_post_send()/ibv_post_recv() kernel path optimizations Walukiewicz, Miroslaw
     [not found] ` <BE2BFE91933D1B4089447C64486040805BD83E5F-IGOiFh9zz4wLt2AQoY/u9bfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2010-08-06 15:57   ` Roland Dreier
     [not found]     ` <adak4o320op.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>
2010-08-10  7:33       ` Walukiewicz, Miroslaw
2010-08-06 16:32   ` Jason Gunthorpe
     [not found]     ` <20100806163237.GJ11306-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2010-08-10  7:39       ` Walukiewicz, Miroslaw
2010-08-06 18:00   ` Ralph Campbell
