* Kernel fast memory registration API proposal [RFC]
@ 2015-07-10  9:09 Sagi Grimberg
       [not found] ` <559F8BD1.9080308-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 68+ messages in thread
From: Sagi Grimberg @ 2015-07-10  9:09 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: Christoph Hellwig, Jason Gunthorpe, Steve Wise, Or Gerlitz,
	Oren Duer, Chuck Lever, Bart Van Assche, Liran Liss, Hefty, Sean,
	Doug Ledford, Tom Talpey

Hi,

Given the last discussions on our in-kernel memory registration API I
thought I'd propose another approach to address this.

As I said before, I think the stack needs to consolidate on a single
memory registration scheme. That scheme is the standard FRWR.

As you know, MRs have a consumer representation (e.g. struct ib_mr) and
a provider-private context. In order to support registration by generic
kernel consumers without enforcing the HW-specific way of registering
memory, we keep the HW specifics in the provider-private context.

struct provider_mr {
	u64		*page_list; // or whatever the HW uses
	... <more private stuff> ...
	struct ib_mr	ibmr;
};

And then provide helpers to populate the MR with generic kernel
structures such as struct scatterlist (for scsi and other ULPs),
struct page (for NFS) or struct bio_vec (for block ULPs later on).

/**
 * ib_mr_set_sg() - populate memory region buffers
 *	array from a SG list
 * @mr:		memory region
 * @sg:		sg list
 * @sg_nents:	number of elements in the sg
 *
 * Can fail if the HW is not able to register this
 * sg list. In case of failure the caller is responsible
 * for handling it (bounce-buffer, multiple registrations...)
 */
int ib_mr_set_sg(struct ib_mr *mr,
		 struct scatterlist *sg,
		 unsigned short sg_nents);

/**
 * ib_mr_set_pages() - populate memory region buffers
 *	array from a pages array
 * @mr:		memory region
 * @pages:	struct page array
 * @npages:	number of pages in the page array
 */
int ib_mr_set_pages(struct ib_mr *mr,
		    struct page **pages,
		    unsigned int npages);

In the future we can easily add a biovec helper (if needed):

/**
 * ib_mr_set_biovec() - populate memory region buffers
 *	array from a bio_vec
 * @mr:		memory region
 * @bio_vec:	bio vector
 * @bi_vcnt:	number of elements in the bio_vec
 *
 * Can fail if the HW is not able to register this
 * bio_vec. In case of failure the caller is responsible
 * for handling it (bounce-buffer, multiple registrations...)
 */
int ib_mr_set_biovec(struct ib_mr *mr,
		     struct bio_vec *bio_vec,
		     unsigned short bi_vcnt);

These helpers allow the driver to hide the HW-specific mechanics of
memory registration.

We *keep* the FRWR work request interface so the consumers can keep
track of what happens on their queue-pair when registering/invalidating
memory regions. However, the API is dramatically simpler.

struct ib_send_wr {
          ...
          union {
                  ...
                  struct {
                          struct ib_mr    *mr;
                          u64             iova;
                          u32             length;
                          int             access_flags;
                  } fast_reg;
                  ...
          } wr;
          ...
};

We can consider moving the iova and length into the population helpers.
I wasn't sure which is better...
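For illustration, folding them in would give the helper a signature
like this (a hypothetical variant, not part of the proposal as posted):

	/*
	 * Hypothetical: iova, length (and possibly access flags) set
	 * at population time, so the work request only carries the mr.
	 */
	int ib_mr_set_sg(struct ib_mr *mr,
			 struct scatterlist *sg,
			 unsigned short sg_nents,
			 u64 iova,
			 u32 length);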


Here is an example of how a ULP would use this:

int my_driver_register_sg(struct scatterlist *sg,
			  unsigned short sg_nents)
{
	struct ib_send_wr frwr, *bad_wr;
	struct ib_mr *mr;
	int rc;
	struct ib_mr_init_attr mr_attr = {
		.mr_type = IB_MR_TYPE_FAST_REG,
		.max_reg_descs = sg_nents,
	};

	/*
	 * Allocate the MR.
	 * Along with the MR the driver will allocate a page list
	 * in its private context.
	 */
	mr = ib_create_mr(my_pd, &mr_attr);

	/*
	 * Set the SG list in the MR; fails if the sg
	 * list is not well aligned (caller should handle
	 * it) or the mr does not have enough room to fit the sg.
	 */
	rc = ib_mr_set_sg(mr, sg, sg_nents);
	if (rc)
		/* HW does not support - need to handle it */

	/* register the MR */
	frwr.opcode = IB_WR_FAST_REG_MR;
	frwr.wr_id = my_wrid;
	frwr.wr.fast_reg.mr = mr;
	frwr.wr.fast_reg.iova = ib_sg_dma_address(&sg[0]);
	frwr.wr.fast_reg.length = length;
	frwr.wr.fast_reg.access_flags = my_flags;

	ib_post_send(my_qp, &frwr, &bad_wr);

	/* do SEND/RDMA/RECV ... */

	/* Do local invalidate if needed */

	/* Free the MR */
	ib_dereg_mr(mr);
}

This generic approach allows us, for example, to add arbitrary SG list
support just by adding a flag to the MR allocation (which will fail
if the device does not support it):

int my_driver_register_arb_sg(struct scatterlist *sg,
			      unsigned short sg_nents)
{
	struct ib_send_wr frwr, *bad_wr;
	struct ib_mr *mr;
	struct ib_mr_init_attr mr_attr = {
		.mr_type = IB_MR_TYPE_FAST_REG,
		.create_flags = IB_MR_REG_ARB_SG,
		.max_reg_descs = sg_nents,
	};

	/* The rest is exactly the same... */
}



That is the general direction,

Thoughts? Comments?

Sagi.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: Kernel fast memory registration API proposal [RFC]
       [not found] ` <559F8BD1.9080308-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2015-07-11 10:39   ` Christoph Hellwig
       [not found]     ` <20150711103920.GE14741-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
  2015-07-13 16:30   ` Jason Gunthorpe
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 68+ messages in thread
From: Christoph Hellwig @ 2015-07-11 10:39 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig,
	Jason Gunthorpe, Steve Wise, Or Gerlitz, Oren Duer, Chuck Lever,
	Bart Van Assche, Liran Liss, Hefty, Sean, Doug Ledford,
	Tom Talpey

On Fri, Jul 10, 2015 at 12:09:37PM +0300, Sagi Grimberg wrote:
> And then provide helpers to populate the MR with generic kernel
> structures such as struct scatterlist (for scsi and other ULPs),
> struct page (for NFS) or struct bio_vec (for block ULPs later on).

Please stick to struct scatterlist for now.  Future block ULPs
will use that as well, as the only way you can do a multi-page
DMA mapping is the scatterlist.  A page is just a subset of
an SGL, and we can map a page using a one-element SGL trivially,
as we do in lots of places.
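For reference, the one-element case really is a two-liner with the
stock scatterlist helpers (a sketch; `page` stands in for whatever
single page the ULP wants mapped):

	struct scatterlist sg;

	/* map a single page as a one-element SG list */
	sg_init_table(&sg, 1);
	sg_set_page(&sg, page, PAGE_SIZE, 0);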

>          union {
>                  ...
>                  struct {
>                          struct ib_mr    *mr;
>                          u64             iova;
>                          u32             length;
>                          int             access_flags;
>                  } fast_reg;
>                  ...
>          } wr;
>          ...
> };
> 
> We can consider moving the iova and length to the population helpers.
> Wasn't sure what is better...

Move it to the population helpers.  The walk done in them is used to
generate those values anyway, so there is no need to expose them in
any sort of public API.  I'd also move the access_flags initialization
to the helpers, leaving post to do just that: post an already
initialized MR structure.

* Re: Kernel fast memory registration API proposal [RFC]
       [not found]     ` <20150711103920.GE14741-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
@ 2015-07-12  7:57       ` Sagi Grimberg
       [not found]         ` <55A21DF6.6090909-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 68+ messages in thread
From: Sagi Grimberg @ 2015-07-12  7:57 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Jason Gunthorpe, Steve Wise,
	Or Gerlitz, Oren Duer, Chuck Lever, Bart Van Assche, Liran Liss,
	Hefty, Sean, Doug Ledford, Tom Talpey

On 7/11/2015 1:39 PM, Christoph Hellwig wrote:
> On Fri, Jul 10, 2015 at 12:09:37PM +0300, Sagi Grimberg wrote:
>> And then provide helpers to populate the MR with generic kernel
>> structures such as struct scatterlist (for scsi and other ULPs),
>> struct page (for NFS) or struct bio_vec (for block ULPs later on).
>
> Please stick to struct scatterlist for now.  Future block ULPs
> will use that as well, as the only way you can do a multi-page
> DMA mapping is the scatterlist.

I see.

> A page is just a subset of an SGL, and we can map a page using a
> one-element SGL trivially, as we do in lots of places.
>

But won't that make sunrpc RDMA consumers hold an extra scatterlist
just for memory registration?

Chuck, would a scatterlist API make life easier for you?

>> We can consider moving the iova and length to the population helpers.
>> Wasn't sure what is better...
>
> Move it to the population helpers.  The walk done in them is used to
> generate those values anyway, so there is no need to expose them in
> any sort of public API.

Yea, makes sense.

>  I'd also move the access_flags initialization to the helpers,
> leaving post to do just that: post an already initialized MR structure.

I can do that, but it's not really related to the SG.

Say a consumer wants to register an SG for peer X and then change
access permissions for peer Y. There is no real reason to re-init
the mappings, right?
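With the proposed WR layout that would just mean re-posting the
registration with different flags while reusing the mapping (a sketch
based on the example in the original post; peer_y_flags is made up):

	/* mappings were already set via ib_mr_set_sg(); only the
	 * access rights change for peer Y */
	frwr.wr.fast_reg.mr = mr;
	frwr.wr.fast_reg.access_flags = peer_y_flags;
	ib_post_send(my_qp, &frwr, &bad_wr);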

* Re: Kernel fast memory registration API proposal [RFC]
       [not found]         ` <55A21DF6.6090909-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2015-07-12 18:15           ` Chuck Lever
       [not found]             ` <96901C8F-D916-4ECF-8DA4-C5C67FB8539E-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 68+ messages in thread
From: Chuck Lever @ 2015-07-12 18:15 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Jason Gunthorpe, Steve Wise, Or Gerlitz, Oren Duer,
	Bart Van Assche, Liran Liss, Hefty, Sean, Doug Ledford,
	Tom Talpey


On Jul 12, 2015, at 3:57 AM, Sagi Grimberg <sagig-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org> wrote:

> On 7/11/2015 1:39 PM, Christoph Hellwig wrote:
>> On Fri, Jul 10, 2015 at 12:09:37PM +0300, Sagi Grimberg wrote:
>>> And then provide helpers to populate the MR with generic kernel
>>> structures such as struct scatterlist (for scsi and other ULPs),
>>> struct page (for NFS) or struct bio_vec (for block ULPs later on).
>> 
>> Please stick to struct scatterlist for now.  Future block ULPs
>> will use that as well, as the only way you can do a multi-page
>> DMA mapping is the scatterlist.
> 
> I see.
> 
>> A page is just a subset of an SGL, and we can map a page using a
>> one-element SGL trivially, as we do in lots of places.
>> 
> 
> But won't that make sunrpc RDMA consumers hold an extra scatterlist
> just for memory registration?

Yes. xprtrdma’s FMR implementation already has a physaddrs array
for this purpose, and its FRWR implementation keeps an
ib_fast_reg_page_list array for each MR.


> Chuck, Would a scatterlist API make life easier for you?

No benefit for me.

The NFS upper layer already slices and dices I/O until it is a
stream of contiguous single I/O requests for the server.

It passes down a vector of struct page pointers which xprtrdma’s
memory registration logic has to walk through and convert into
something the provider can deal with.

See fmr_op_map and frwr_op_map. The loop in there becomes costly
as the number of pages involved in an I/O request increases.

I suppose an s/g list wouldn’t be much different from the current
arrangement. And if NFS and SunRPC are the only users that deal
with struct page, then there’s no code sharing benefit to
providing a provider API based on struct page.


--
Chuck Lever




* Re: Kernel fast memory registration API proposal [RFC]
       [not found]             ` <96901C8F-D916-4ECF-8DA4-C5C67FB8539E-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
@ 2015-07-13  6:47               ` Christoph Hellwig
       [not found]                 ` <20150713064701.GB31842-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
  0 siblings, 1 reply; 68+ messages in thread
From: Christoph Hellwig @ 2015-07-13  6:47 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Sagi Grimberg, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Jason Gunthorpe, Steve Wise, Or Gerlitz, Oren Duer,
	Bart Van Assche, Liran Liss, Hefty, Sean, Doug Ledford,
	Tom Talpey

On Sun, Jul 12, 2015 at 02:15:56PM -0400, Chuck Lever wrote:
> > Chuck, Would a scatterlist API make life easier for you?
> 
> No benefit for me.
> 
> The NFS upper layer already slices and dices I/O until it is a
> stream of contiguous single I/O requests for the server.
> 
> It passes down a vector of struct page pointers which xprtrdma’s
> memory registration logic has to walk through and convert into
> something the provider can deal with.
> 
> See fmr_op_map and frwr_op_map. The loop in there becomes costly
> as the number of pages involved in an I/O request increases.
> 
> I suppose an s/g list wouldn’t be much different from the current
> arrangement. And if NFS and SunRPC are the only users that deal
> with struct page, then there’s no code sharing benefit to
> providing a provider API based on struct page.

NFS really should be using something more similar to a scatterlist,
as it maps pretty well to the sk_frags in the network layer as well.

Struct scatterlist is important because it's the way the DMA mapping
functions take a multi-page argument, so anyone who wants to batch
the S/G mapping calls needs it.  It might be worthwhile to find a way
to replace the struct ib_sge arguments in the RDMA code with a
scatterlist plus a separate key argument to simplify the calling
conventions and avoid the need to allocate two lists.  Note that I
think for IB and IB-like transports we'll always use the same lkey
anyway, and from what I understood about iWarp it probably should
generate the lkey as part of its DMA mapping operations instead of
relying on the ULP to generate one.

* Re: Kernel fast memory registration API proposal [RFC]
       [not found]                 ` <20150713064701.GB31842-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
@ 2015-07-13 14:16                   ` Chuck Lever
       [not found]                     ` <1D9C0527-E277-4C3F-A80D-C4FBAA3D82E9-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 68+ messages in thread
From: Chuck Lever @ 2015-07-13 14:16 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Sagi Grimberg, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Jason Gunthorpe, Steve Wise, Or Gerlitz, Oren Duer,
	Bart Van Assche, Liran Liss, Hefty, Sean, Doug Ledford,
	Tom Talpey


On Jul 13, 2015, at 2:47 AM, Christoph Hellwig <hch-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> wrote:

> On Sun, Jul 12, 2015 at 02:15:56PM -0400, Chuck Lever wrote:
>>> Chuck, Would a scatterlist API make life easier for you?
>> 
>> No benefit for me.
>> 
>> The NFS upper layer already slices and dices I/O until it is a
>> stream of contiguous single I/O requests for the server.
>> 
>> It passes down a vector of struct page pointers which xprtrdma’s
>> memory registration logic has to walk through and convert into
>> something the provider can deal with.
>> 
>> See fmr_op_map and frwr_op_map. The loop in there becomes costly
>> as the number of pages involved in an I/O request increases.
>> 
>> I suppose an s/g list wouldn’t be much different from the current
>> arrangement. And if NFS and SunRPC are the only users that deal
>> with struct page, then there’s no code sharing benefit to
>> providing a provider API based on struct page.
> 
> NFS really should be using something more similar to a scatterlist,
> as it maps pretty well to the sk_frags in the network layer as well.
> 
> Struct scatterlist is important because it's the way the DMA mapping
> functions take a multi-page argument, so anyone who wants to batch
> the S/G mapping calls needs it.

An excellent topic to bring up on linux-nfs-u79uwXL29TaiAVqoAR/hOA@public.gmane.org

In the meantime, I think rpcrdma.ko would have to be responsible for
converting struct page to struct scatterlist.
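A minimal sketch of what that conversion could look like (the function
name and parameters are made up; only sg_init_table()/sg_set_page()
are the real scatterlist API):

	/* Hypothetical rpcrdma helper: wrap an NFS page vector in an
	 * sg table so it can be fed to a scatterlist-based reg API. */
	static int rpcrdma_pages_to_sg(struct page **pages, int npages,
				       unsigned int first_off,
				       unsigned int len,
				       struct scatterlist *sgl)
	{
		int i;

		sg_init_table(sgl, npages);
		for (i = 0; i < npages; i++) {
			unsigned int off = (i == 0) ? first_off : 0;
			unsigned int seg = min_t(unsigned int, len,
						 PAGE_SIZE - off);

			sg_set_page(&sgl[i], pages[i], seg, off);
			len -= seg;
		}
		return npages;
	}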


--
Chuck Lever




* Re: Kernel fast memory registration API proposal [RFC]
       [not found] ` <559F8BD1.9080308-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  2015-07-11 10:39   ` Christoph Hellwig
@ 2015-07-13 16:30   ` Jason Gunthorpe
       [not found]     ` <20150713163015.GA23832-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2015-07-15  7:32   ` Christoph Hellwig
  2015-07-19  5:45   ` Sagi Grimberg
  3 siblings, 1 reply; 68+ messages in thread
From: Jason Gunthorpe @ 2015-07-13 16:30 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig, Steve Wise,
	Or Gerlitz, Oren Duer, Chuck Lever, Bart Van Assche, Liran Liss,
	Hefty, Sean, Doug Ledford, Tom Talpey

On Fri, Jul 10, 2015 at 12:09:37PM +0300, Sagi Grimberg wrote:
> Given the last discussions on our in-kernel memory registration API I
> thought I'd propose another approach to address this.

I assume you can put your new indirect registrations under this API
without changing the callers?

>          /*
>           * Set the SG list in the MR, fail if the sg
> 	  * list is not well aligned (caller should handle
> 	  * it) or mr does not have enough room to fit the sg.
>           */
>          rc = ib_mr_set_sg(mr, sg, sg_nents);
> 	 if (rc)
> 		/* HW does not support - Need to handle it */

I think this call should also do the post.

There seems to be nothing the ULP can customize about the post step
that needs to be exposed... and bundling the post allows more
flexibility to implement different schemes without impacting the ULPs.

Jason

* Re: Kernel fast memory registration API proposal [RFC]
       [not found]     ` <20150713163015.GA23832-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2015-07-14  8:39       ` Sagi Grimberg
       [not found]         ` <55A4CABC.5050807-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 68+ messages in thread
From: Sagi Grimberg @ 2015-07-14  8:39 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig, Steve Wise,
	Or Gerlitz, Oren Duer, Chuck Lever, Bart Van Assche, Liran Liss,
	Hefty, Sean, Doug Ledford, Tom Talpey

On 7/13/2015 7:30 PM, Jason Gunthorpe wrote:
> On Fri, Jul 10, 2015 at 12:09:37PM +0300, Sagi Grimberg wrote:
>> Given the last discussions on our in-kernel memory registration API I
>> thought I'd propose another approach to address this.
>
> I assume you can put your new indirect registrations under this API
> without changing the callers?

Correct, it's a bonus. The main idea here is to simplify the API for
ULPs.

>
>>           /*
>>            * Set the SG list in the MR, fail if the sg
>> 	  * list is not well aligned (caller should handle
>> 	  * it) or mr does not have enough room to fit the sg.
>>            */
>>           rc = ib_mr_set_sg(mr, sg, sg_nents);
>> 	 if (rc)
>> 		/* HW does not support - Need to handle it */
>
> I think this call should also do the post.
>
> There seems to be nothing the ULP can customize about the post step
> that needs to be exposed... and bundling the post allows more
> flexibility to implement different schemes without impacting the ULPs.

This is exactly what I don't want to do. I don't think that implicit
posting is a good idea for reasons that I mentioned earlier:

"This is where I have a problem. Providing an API that may or may not
post a work request on my QP is confusing, and I don't understand its
semantics at all. Do I need to reserve slots on my QP? should I ask for
a completion? If we suppress the completion will I see an error
completion? What should I expect to find in the wr_id?"

We're much better off keeping the post interface in place but
making it much simpler.

Sagi.

* Re: Kernel fast memory registration API proposal [RFC]
       [not found]                     ` <1D9C0527-E277-4C3F-A80D-C4FBAA3D82E9-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
@ 2015-07-14  8:50                       ` Sagi Grimberg
       [not found]                         ` <55A4CD5B.9030000-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 68+ messages in thread
From: Sagi Grimberg @ 2015-07-14  8:50 UTC (permalink / raw)
  To: Chuck Lever, Christoph Hellwig
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Jason Gunthorpe, Steve Wise,
	Or Gerlitz, Oren Duer, Bart Van Assche, Liran Liss, Hefty, Sean,
	Doug Ledford, Tom Talpey

On 7/13/2015 5:16 PM, Chuck Lever wrote:

>> NFS really should be using something more similar to a scatterlist,
>> as it maps pretty well to the sk_frags in the network layer as well.
>>
>> Struct scatterlist is imprtant because it's the way the DMA mapping
>> functions takes a multi-page argument, so ayone who wants to batch
>> the S/G mapping calls needs it.
>
> An excellent topic to bring up on linux-nfs-u79uwXL29TaiAVqoAR/hOA@public.gmane.org
>
> In the meantime, I think rpcrdma.ko would have to be responsible for
> converting struct page to struct scatterlist.
>

Fine by me, so I take it that you are OK with the proposed API?

* RE: Kernel fast memory registration API proposal [RFC]
       [not found]         ` <55A4CABC.5050807-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2015-07-14 14:42           ` Steve Wise
  2015-07-14 15:33           ` Christoph Hellwig
  1 sibling, 0 replies; 68+ messages in thread
From: Steve Wise @ 2015-07-14 14:42 UTC (permalink / raw)
  To: 'Sagi Grimberg', 'Jason Gunthorpe'
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, 'Christoph Hellwig',
	'Or Gerlitz', 'Oren Duer', 'Chuck Lever',
	'Bart Van Assche', 'Liran Liss',
	'Hefty, Sean', 'Doug Ledford',
	'Tom Talpey'



> -----Original Message-----
> From: Sagi Grimberg [mailto:sagig-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org]
> Sent: Tuesday, July 14, 2015 3:39 AM
> To: Jason Gunthorpe
> Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; Christoph Hellwig; Steve Wise; Or Gerlitz; Oren Duer; Chuck Lever; Bart Van Assche; Liran Liss;
Hefty,
> Sean; Doug Ledford; Tom Talpey
> Subject: Re: Kernel fast memory registration API proposal [RFC]
> 
> On 7/13/2015 7:30 PM, Jason Gunthorpe wrote:
> > On Fri, Jul 10, 2015 at 12:09:37PM +0300, Sagi Grimberg wrote:
> >> Given the last discussions on our in-kernel memory registration API I
> >> thought I'd propose another approach to address this.
> >
> > I assume you can put your new indirect registrations under this API
> > without changing the callers?
> 
> Correct, it's a bonus. The main idea here is to simplify the API for
> ULPs.
> 
> >
> >>           /*
> >>            * Set the SG list in the MR, fail if the sg
> >> 	  * list is not well aligned (caller should handle
> >> 	  * it) or mr does not have enough room to fit the sg.
> >>            */
> >>           rc = ib_mr_set_sg(mr, sg, sg_nents);
> >> 	 if (rc)
> >> 		/* HW does not support - Need to handle it */
> >
> > I think this call should also do the post.
> >
> > There seems to be nothing the ULP can customize about the post step
> > that needs to be exposed... and bundling the post allows more
> > flexibility to implement different schemes without impacting the ULPs.
> 
> This is exactly what I don't want to do. I don't think that implicit
> posting is a good idea for reasons that I mentioned earlier:
> 
> "This is where I have a problem. Providing an API that may or may not
> post a work request on my QP is confusing, and I don't understand its
> semantics at all. Do I need to reserve slots on my QP? should I ask for
> a completion? If we suppress the completion will I see an error
> completion? What should I expect to find in the wr_id?"
> 
> We're much better off with keeping the post interface in place but
> have it much simpler.
> 
> Sagi.

I agree... I don't like the idea of WR posting done under the covers.



* Re: Kernel fast memory registration API proposal [RFC]
       [not found]         ` <55A4CABC.5050807-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  2015-07-14 14:42           ` Steve Wise
@ 2015-07-14 15:33           ` Christoph Hellwig
       [not found]             ` <20150714153347.GA11026-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
  1 sibling, 1 reply; 68+ messages in thread
From: Christoph Hellwig @ 2015-07-14 15:33 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Jason Gunthorpe, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Christoph Hellwig, Steve Wise, Or Gerlitz, Oren Duer,
	Chuck Lever, Bart Van Assche, Liran Liss, Hefty, Sean,
	Doug Ledford, Tom Talpey

On Tue, Jul 14, 2015 at 11:39:24AM +0300, Sagi Grimberg wrote:
> This is exactly what I don't want to do. I don't think that implicit
> posting is a good idea for reasons that I mentioned earlier:
> 
> "This is where I have a problem. Providing an API that may or may not
> post a work request on my QP is confusing, and I don't understand its
> semantics at all. Do I need to reserve slots on my QP? should I ask for
> a completion? If we suppress the completion will I see an error
> completion? What should I expect to find in the wr_id?"
> 
> We're much better off with keeping the post interface in place but
> have it much simpler.

The ULP doesn't care if it needs to reserve the slot, and it generally
doesn't care about the notification either unless it needs to handle an
error.

Instead, if the ib_device knows that an MR needs a post, it can set the
right reservation through a helper.

The completions are another mad nightmare in the RDMA stack API.  Every
other subsystem would just allow the submitter to attach a function
pointer that handles the completion to the posted item.  But no, the
RDMA stack instead requires ID allocators and gigantic boilerplate code
in the consumer to untangle that gigantic mess again.

If we sort that out first, the ULP doesn't have to care about FR
notifications.
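In other words, something like this (purely hypothetical names)
instead of demultiplexing on wr_id:

	/* hypothetical: completion handler travels with the posted item */
	struct ib_cqe {
		void (*done)(struct ib_cq *cq, struct ib_wc *wc);
	};

	/* the ULP embeds it and recovers its context with container_of() */
	static void my_fastreg_done(struct ib_cq *cq, struct ib_wc *wc)
	{
		struct my_request *req = container_of(wc->wr_cqe,
						      struct my_request,
						      cqe);
		/* handle registration completion / error for req */
	}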

* Re: Kernel fast memory registration API proposal [RFC]
       [not found]             ` <20150714153347.GA11026-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
@ 2015-07-14 15:53               ` Jason Gunthorpe
       [not found]                 ` <20150714155340.GA7399-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2015-07-14 16:12               ` Sagi Grimberg
  1 sibling, 1 reply; 68+ messages in thread
From: Jason Gunthorpe @ 2015-07-14 15:53 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Sagi Grimberg, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Steve Wise,
	Or Gerlitz, Oren Duer, Chuck Lever, Bart Van Assche, Liran Liss,
	Hefty, Sean, Doug Ledford, Tom Talpey

On Tue, Jul 14, 2015 at 08:33:47AM -0700, Christoph Hellwig wrote:
> On Tue, Jul 14, 2015 at 11:39:24AM +0300, Sagi Grimberg wrote:
> > This is exactly what I don't want to do. I don't think that implicit
> > posting is a good idea for reasons that I mentioned earlier:
> > 
> > "This is where I have a problem. Providing an API that may or may not
> > post a work request on my QP is confusing, and I don't understand its
> > semantics at all. Do I need to reserve slots on my QP? should I ask for
> > a completion? If we suppress the completion will I see an error
> > completion? What should I expect to find in the wr_id?"
> > 
> > We're much better off with keeping the post interface in place but
> > have it much simpler.
> 
> The ULP doesn't care if it needs to reserve the slot, and it generally
> doesn't care about the notification either unless it needs to handle an
> error.
> 
> Instead, if the ib_device knows that an MR needs a post, it can set the
> right reservation through a helper.
> 
> The completions are another mad nightmare in the RDMA stack API.  Every
> other subsystem would just allow the submitter to attach a function
> pointer that handles the completion to the posted item.  But no, the
> RDMA stack instead requires ID allocators and gigantic boilerplate code
> in the consumer to untangle that gigantic mess again.
> 
> If we sort that out first, the ULP doesn't have to care about FR
> notifications.

Right. We need to move away from our past. It was sort of reasonable
when we had brand new hardware and nobody knew what ULPs would look
like to just expose the raw HW primitives, and raw SQEs.

Now, the exercise should be a very simple code refactoring. Take the
duplicate stuff out of Lustre/iSER/SRP/NFS and share it. You can't ask
for a safer world to build an API from:

If all of those do posts w/ temp MR in sleepable contexts, then that
is OK for the shared API to require sleep.

If all those do FRMR setup, then post REG, then post ACT, then factor
that three-step pattern.

If all those do if (FMR) ... else ... then factor that pattern too.

Ditto for the iWarp RDMA READ lkey business.

If all those want callbacks from work completions, then factor that.

It isn't even a question of API design, or what people like or don't
like. If those 4 ULPs do the same damn stuff, then factor it.

"It doesn't feel right" is not really a helpful response to an API
factoring exercise. "ULP XYZ cannot do that because of ABC" is a much
more productive reply.

We'd still have the low level API for new ULPs to experiment with, if
they really need.

I'm really disappointed by the negative emails on this subject..

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: Kernel fast memory registration API proposal [RFC]
       [not found]             ` <20150714153347.GA11026-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
  2015-07-14 15:53               ` Jason Gunthorpe
@ 2015-07-14 16:12               ` Sagi Grimberg
       [not found]                 ` <55A534D1.6030008-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  1 sibling, 1 reply; 68+ messages in thread
From: Sagi Grimberg @ 2015-07-14 16:12 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jason Gunthorpe, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Steve Wise,
	Or Gerlitz, Oren Duer, Chuck Lever, Bart Van Assche, Liran Liss,
	Hefty, Sean, Doug Ledford, Tom Talpey

On 7/14/2015 6:33 PM, Christoph Hellwig wrote:
> On Tue, Jul 14, 2015 at 11:39:24AM +0300, Sagi Grimberg wrote:
>> This is exactly what I don't want to do. I don't think that implicit
>> posting is a good idea for reasons that I mentioned earlier:
>>
>> "This is where I have a problem. Providing an API that may or may not
>> post a work request on my QP is confusing, and I don't understand its
>> semantics at all. Do I need to reserve slots on my QP? should I ask for
>> a completion? If we suppress the completion will I see an error
>> completion? What should I expect to find in the wr_id?"
>>
>> We're much better off with keeping the post interface in place but
>> have it much simpler.
>
> The ULP doesn't care if it needs to reserve the slot, and it generally
> doesn't care about the notification either unless it needs to handle an
> error.

That's generally correct. But the ULP may want to suppress notifications
of other posts as well (for example a storage initiator may want to
suppress send completions since it needs to reserve the request context
until it receives the status anyway - which is a receive completion). It
needs to take into account what was posted and what wasn't, to avoid
wrapping.

If I'm not mistaken, iWARP requires asking for a completion at least
once every send-queue length, and empirically IB drivers require it too.
So again, I don't think implicit posting is a very good idea.

> The completions are another mad nightmare in the RDMA stack API.  Every
> other subsystem would just allow submitter to attach a function pointer
> that handles the completion to the posted item.

This is actually a good idea. And it can be easily added I think (at
least to the mlx drivers, which I know better). It can help remove a
switch statement for the ULPs when handling the completions.

   But no, the RDMA stack
> instead require ID allocators and gigantic boilerplate code in the
> consumer to untangle that gigantic mess again.

I don't think the wr_id was ever intended to be an allocated tag. It's
the ULP context, usually a pointer to a transaction structure. Even
with passing a function pointer you need to get back your original
context, don't you?

^ permalink raw reply	[flat|nested] 68+ messages in thread

* RE: Kernel fast memory registration API proposal [RFC]
       [not found]                 ` <55A534D1.6030008-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2015-07-14 16:16                   ` Steve Wise
  2015-07-14 17:29                     ` Tom Talpey
  2015-07-14 16:35                   ` Jason Gunthorpe
  1 sibling, 1 reply; 68+ messages in thread
From: Steve Wise @ 2015-07-14 16:16 UTC (permalink / raw)
  To: 'Sagi Grimberg', 'Christoph Hellwig'
  Cc: 'Jason Gunthorpe',
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, 'Or Gerlitz',
	'Oren Duer', 'Chuck Lever',
	'Bart Van Assche', 'Liran Liss',
	'Hefty, Sean', 'Doug Ledford',
	'Tom Talpey'



> -----Original Message-----
> From: Sagi Grimberg [mailto:sagig-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org]
> Sent: Tuesday, July 14, 2015 11:12 AM
> To: Christoph Hellwig
> Cc: Jason Gunthorpe; linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; Steve Wise; Or Gerlitz; Oren Duer; Chuck Lever; Bart Van Assche; Liran Liss;
Hefty,
> Sean; Doug Ledford; Tom Talpey
> Subject: Re: Kernel fast memory registration API proposal [RFC]
> 
> On 7/14/2015 6:33 PM, Christoph Hellwig wrote:
> > On Tue, Jul 14, 2015 at 11:39:24AM +0300, Sagi Grimberg wrote:
> >> This is exactly what I don't want to do. I don't think that implicit
> >> posting is a good idea for reasons that I mentioned earlier:
> >>
> >> "This is where I have a problem. Providing an API that may or may not
> >> post a work request on my QP is confusing, and I don't understand its
> >> semantics at all. Do I need to reserve slots on my QP? should I ask for
> >> a completion? If we suppress the completion will I see an error
> >> completion? What should I expect to find in the wr_id?"
> >>
> >> We're much better off with keeping the post interface in place but
> >> have it much simpler.
> >
> > The ULP doesn't care if it needs to reserve the slot, and it generally
> > doesn't care about the notification either unless it needs to handle an
> > error.
> 
> That's generally correct. But the ULP may want to suppress notifications
> of other posts as well (for example a storage initiator may want to
> suppress send completions since it needs to reserve the request context
> until it receives the status anyway - which is a receive completion). It
> needs to take what was posted and what not into account to avoid
> wrapping.
> 
> If I'm not mistaken, iWARP requires asking for a completion at least once
> every send-queue length, and empirically IB drivers require it too. So again,

Correct.  

> I don't think implicit posting is a very good idea.
>
> > The completions are another mad nightmare in the RDMA stack API.  Every
> > other subsystem would just allow submitter to attach a function pointer
> > that handles the completion to the posted item.
> 
> This is actually a good idea. And it can be easily added I think (at
> least to the mlx drivers, which I know better). It can help remove a
> switch statement for the ULPs when handling the completions.
> 
>    But no, the RDMA stack
> > instead require ID allocators and gigantic boilerplate code in the
> > consumer to untangle that gigantic mess again.
> 
> I don't think the wr_id was ever intended to be an allocated tag. It's
> the ULP context, usually a pointer to a transaction structure. Even
> with passing a function pointer you need to get back your original
> context don't you?


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: Kernel fast memory registration API proposal [RFC]
       [not found]                 ` <55A534D1.6030008-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  2015-07-14 16:16                   ` Steve Wise
@ 2015-07-14 16:35                   ` Jason Gunthorpe
       [not found]                     ` <20150714163506.GC7399-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  1 sibling, 1 reply; 68+ messages in thread
From: Jason Gunthorpe @ 2015-07-14 16:35 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Steve Wise,
	Or Gerlitz, Oren Duer, Chuck Lever, Bart Van Assche, Liran Liss,
	Hefty, Sean, Doug Ledford, Tom Talpey

On Tue, Jul 14, 2015 at 07:12:01PM +0300, Sagi Grimberg wrote:
> >The ULP doesn't care if it needs to reserve the slot, and it generally
> >doesn't care about the notification either unless it needs to handle an
> >error.
> 
> That's generally correct. But the ULP may want to suppress notifications
> of other posts as well (for example a storage initiator may want to
> suppress send completions since it needs to reserve the request context
> until it receives the status anyway - which is a receive completion). It
> needs to take what was posted and what not into account to avoid
> wrapping.

I don't see how this is a concern, API wise. All callers are pairing
the MR post with a real op anyhow.

> If I'm not mistaken, iWARP requires asking for a completion at least once
> every send-queue length, and empirically IB drivers require it too. So again,
> I don't think implicit posting is a very good idea.

Yes, all proper use of RDMA requires counting SQEs consumed and
freeing them only on completion. Nobody should be using the EAGAIN
scheme. But it is trivial to handle, you don't start an op until you
can post the entire compound SQE, and you must always tag the last SQE
for signal if the # of SQEs left drops below the # required for the
largest compound (or another similar scheme).

But you should never have to signal on the MR post, it should always
be at the end of the compound, on the RDMA, SEND, or invalidate..

That is just basic 'how do you use a SQ properly', every ULP should
do it; wrapping posts in an API doesn't change anything fundamental.

Jason

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: Kernel fast memory registration API proposal [RFC]
       [not found]                 ` <20150714155340.GA7399-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2015-07-14 16:46                   ` Sagi Grimberg
       [not found]                     ` <55A53CFA.7070509-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 68+ messages in thread
From: Sagi Grimberg @ 2015-07-14 16:46 UTC (permalink / raw)
  To: Jason Gunthorpe, Christoph Hellwig
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Steve Wise, Or Gerlitz,
	Oren Duer, Chuck Lever, Bart Van Assche, Liran Liss, Hefty, Sean,
	Doug Ledford, Tom Talpey

> I'm really disappointed by the negative emails on this subject..

Jason,

I'm really not trying to be negative. I'm hearing you out, and I agree
with a lot of what you have to say. I just don't agree with all of it.

You are right, ULPs do the same thing, the same wrong thing of
maintaining a fallback policy for memory registration. We should have
only one way - FRWR.

Which drivers don't support FRWR that we need to do other things for?
ipath - deprecated
mthca - soon to be deprecated
ehca - Not sure what is going on there. They only have phys_mr
        anyway, which just lost its only caller in the kernel
amso1100 - git log shows almost zero activity with this driver
usnic - which is completely uninteresting to kernel ULPs because:

int usnic_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
                                 struct ib_send_wr **bad_wr)
{
         usnic_dbg("\n");
         return -EINVAL;
}

So my point is, it's a great deal of effort to combine these
in any API that we come up with for a bunch of esoteric drivers.
Let's just let them die on their own, please.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: Kernel fast memory registration API proposal [RFC]
       [not found]                     ` <20150714163506.GC7399-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2015-07-14 16:55                       ` Sagi Grimberg
       [not found]                         ` <55A53F0B.5050009-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 68+ messages in thread
From: Sagi Grimberg @ 2015-07-14 16:55 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Steve Wise,
	Or Gerlitz, Oren Duer, Chuck Lever, Bart Van Assche, Liran Liss,
	Hefty, Sean, Doug Ledford, Tom Talpey

On 7/14/2015 7:35 PM, Jason Gunthorpe wrote:
> On Tue, Jul 14, 2015 at 07:12:01PM +0300, Sagi Grimberg wrote:
>>> The ULP doesn't care if it needs to reserve the slot, and it generally
>>> doesn't care about the notification either unless it needs to handle an
>>> error.
>>
>> That's generally correct. But the ULP may want to suppress notifications
>> of other posts as well (for example a storage initiator may want to
>> suppress send completions since it needs to reserve the request context
>> until it receives the status anyway - which is a receive completion). It
>> needs to take what was posted and what not into account to avoid
>> wrapping.
>
> I don't see how this is a concern, API wise. All callers are pairing
> the MR post with a real op anyhow.
>
>> If I'm not mistaken, iWARP requires asking for a completion at least once
>> every send-queue length, and empirically IB drivers require it too. So again,
>> I don't think implicit posting is a very good idea.
>
> Yes, all proper use of RDMA requires counting SQEs consumed and
> freeing them only on completion. Nobody should be using the EAGAIN
> scheme. But it is trivial to handle, you don't start an op until you
> can post the entire compound SQE, and you must always tag the last SQE
> for signal if the # of SQEs left drops below the # required for the
> largest compound (or another similar scheme).
>
> But you should never have to signal on the MR post, it should always
> be at the end of the compound, on the RDMA, SEND, or invalidate..
>
> That is just basic 'how do you use a SQ properly', every ULP should
> do it; wrapping posts in an API doesn't change anything fundamental.
>

Well, I still think keeping the post interface is better.

But if people think that it's better to have an API that always does
implicit posting without notification, and then silently consumes error
or flush completions, I can try to look at it as well.

Sagi.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: Kernel fast memory registration API proposal [RFC]
       [not found]                     ` <55A53CFA.7070509-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2015-07-14 17:08                       ` Jason Gunthorpe
       [not found]                         ` <20150714170808.GA19814-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 68+ messages in thread
From: Jason Gunthorpe @ 2015-07-14 17:08 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Steve Wise,
	Or Gerlitz, Oren Duer, Chuck Lever, Bart Van Assche, Liran Liss,
	Hefty, Sean, Doug Ledford, Tom Talpey

On Tue, Jul 14, 2015 at 07:46:50PM +0300, Sagi Grimberg wrote:
> >I'm really disappointed by the negative emails on this subject..

> I'm really not trying to be negative. I'm hearing you out, and I agree
> with a lot of what you have to say. I just don't agree with all of it.

Sure, I just find it hard to be constructive against "I don't like it" :)

> Which drivers don't support FRWR that we need to do other things for?
> ipath - deprecated

We have permission to move this to staging and then RM it, so yay!

> mthca - soon to be deprecated

This one I have a problem with. There is a lot of mthca hardware out
there, it feels wrong to nuke it.. If we can continue to support the
FMR scheme it uses transparently, that would be excellent.

I'm not hearing a strong reason why that shouldn't be the case...

> ehca - Not sure what is going on there. They only have phys_mr
>        anyway, which just lost its only caller in the kernel

I thought it supported fmr:

drivers/infiniband/hw/ehca/ehca_main.c: shca->ib_device.alloc_fmr           = ehca_alloc_fmr;
drivers/infiniband/hw/ehca/ehca_main.c: shca->ib_device.map_phys_fmr        = ehca_map_phys_fmr;
drivers/infiniband/hw/ehca/ehca_main.c: shca->ib_device.unmap_fmr           = ehca_unmap_fmr;
drivers/infiniband/hw/ehca/ehca_main.c: shca->ib_device.dealloc_fmr         = ehca_dealloc_fmr;

?

I'm not sure what the status of ehca is, but I somehow suspect any
remaining users are going to be on an old vendor kernel forever..

> amso1100 - git log shows almost zero activity with this driver

IIRC, this card was already/nearly discontinued and obsolete when the
driver was added. I certainly wouldn't object to removing it, or totally
breaking it for kernel storage ULPs - Steve?

> usnic - who is completely not interesting to the kernel ULPs because:

Right.

> So my point is, it's a great deal of effort to combine these
> in any API that we come up with for a bunch of esoteric drivers.
> Let's just let them die on their own, please.

If amso1100 is the only iWarp driver that needs phys_mr, I'd be happy
to drop it and drop the requirement to support that. Is that right Steve?

But mthca, I have a hard time calling that esoteric.. Heck, even I
still have various mthca cards around here that are regularly used for
SDR/DDR/CX4 testing on modern kernels..

Jason

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: Kernel fast memory registration API proposal [RFC]
       [not found]                         ` <55A53F0B.5050009-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2015-07-14 17:09                           ` Jason Gunthorpe
       [not found]                             ` <20150714170859.GB19814-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 68+ messages in thread
From: Jason Gunthorpe @ 2015-07-14 17:09 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Steve Wise,
	Or Gerlitz, Oren Duer, Chuck Lever, Bart Van Assche, Liran Liss,
	Hefty, Sean, Doug Ledford, Tom Talpey

On Tue, Jul 14, 2015 at 07:55:39PM +0300, Sagi Grimberg wrote:

> But, if people think that it's better to have an API that does implicit
> posting always without notification, and then silently consume error or
> flush completions. I can try and look at it as well.

Can we do FMR transparently if we bundle the post? If yes, I'd call
that a winner..

Jason

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: Kernel fast memory registration API proposal [RFC]
  2015-07-14 16:16                   ` Steve Wise
@ 2015-07-14 17:29                     ` Tom Talpey
  0 siblings, 0 replies; 68+ messages in thread
From: Tom Talpey @ 2015-07-14 17:29 UTC (permalink / raw)
  To: Steve Wise, 'Sagi Grimberg', 'Christoph Hellwig'
  Cc: 'Jason Gunthorpe',
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, 'Or Gerlitz',
	'Oren Duer', 'Chuck Lever',
	'Bart Van Assche', 'Liran Liss',
	'Hefty, Sean', 'Doug Ledford'

On 7/14/2015 12:16 PM, Steve Wise wrote:
>
>
>> -----Original Message-----
>> From: Sagi Grimberg [mailto:sagig-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org]
>> Sent: Tuesday, July 14, 2015 11:12 AM
>> To: Christoph Hellwig
>> Cc: Jason Gunthorpe; linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; Steve Wise; Or Gerlitz; Oren Duer; Chuck Lever; Bart Van Assche; Liran Liss;
> Hefty,
>> Sean; Doug Ledford; Tom Talpey
>> Subject: Re: Kernel fast memory registration API proposal [RFC]
>>
>> On 7/14/2015 6:33 PM, Christoph Hellwig wrote:
>>> On Tue, Jul 14, 2015 at 11:39:24AM +0300, Sagi Grimberg wrote:
>>>> This is exactly what I don't want to do. I don't think that implicit
>>>> posting is a good idea for reasons that I mentioned earlier:
>>>>
>>>> "This is where I have a problem. Providing an API that may or may not
>>>> post a work request on my QP is confusing, and I don't understand its
>>>> semantics at all. Do I need to reserve slots on my QP? should I ask for
>>>> a completion? If we suppress the completion will I see an error
>>>> completion? What should I expect to find in the wr_id?"
>>>>
>>>> We're much better off with keeping the post interface in place but
>>>> have it much simpler.
>>>
>>> The ULP doesn't care if it needs to reserve the slot, and it generally
>>> doesn't care about the notification either unless it needs to handle an
>>> error.
>>
>> That's generally correct. But the ULP may want to suppress notifications
>> of other posts as well (for example a storage initiator may want to
>> suppress send completions since it needs to reserve the request context
>> until it receives the status anyway - which is a receive completion). It
>> needs to take what was posted and what not into account to avoid
>> wrapping.
>>
>> If I'm not mistaken, iWARP requires asking for a completion at least once
>> every send-queue length, and empirically IB drivers require it too. So again,
>
> Correct.

Indeed, correct. But this has nothing to do with the protocol. It's
necessary because without a completion, the driver can't keep the
work queue and completion queue pointers properly in sync with the
hardware. If the tail pointer catches up and crosses the head pointer,
boom.

This is a pure verbs issue, completely local behavior, and applies
to all providers.

>> I don't think implicit posting is a very good idea.

Specifically, it will work only if completions are never suppressed,
if the upper layer is aware of the additional work requests and their
ordering implications, and the overhead of the interrupts is acceptable.
None of these preconditions are desirable.

>>
>>> The completions are another mad nightmare in the RDMA stack API.  Every
>>> other subsystem would just allow submitter to attach a function pointer
>>> that handles the completion to the posted item.
>>
>> This is actually a good idea. And it can be easily added I think (at
>> least to the mlx drivers, which I know better). It can help remove a
>> switch statement for the ULPs when handling the completions.
>>
>>     But no, the RDMA stack
>>> instead require ID allocators and gigantic boilerplate code in the
>>> consumer to untangle that gigantic mess again.
>>
>> I don't think the wr_id was ever intended to be an allocated tag. It's
>> the ULP context, usually a pointer to a transaction structure. Even
>> with passing a function pointer you need to get back your original
>> context don't you?
>
>
>


^ permalink raw reply	[flat|nested] 68+ messages in thread

* RE: Kernel fast memory registration API proposal [RFC]
       [not found]                         ` <20150714170808.GA19814-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2015-07-14 18:07                           ` Steve Wise
  2015-07-15  3:05                           ` Doug Ledford
  1 sibling, 0 replies; 68+ messages in thread
From: Steve Wise @ 2015-07-14 18:07 UTC (permalink / raw)
  To: 'Jason Gunthorpe', 'Sagi Grimberg'
  Cc: 'Christoph Hellwig',
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, 'Or Gerlitz',
	'Oren Duer', 'Chuck Lever',
	'Bart Van Assche', 'Liran Liss',
	'Hefty, Sean', 'Doug Ledford',
	'Tom Talpey'



> -----Original Message-----
> From: Jason Gunthorpe [mailto:jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org]
> Sent: Tuesday, July 14, 2015 12:08 PM
> To: Sagi Grimberg
> Cc: Christoph Hellwig; linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; Steve Wise; Or Gerlitz; Oren Duer; Chuck Lever; Bart Van Assche; Liran Liss;
Hefty,
> Sean; Doug Ledford; Tom Talpey
> Subject: Re: Kernel fast memory registration API proposal [RFC]
> 
> On Tue, Jul 14, 2015 at 07:46:50PM +0300, Sagi Grimberg wrote:
> > >I'm really disappointed by the negative emails on this subject..
> 
> > I'm really not trying to be negative. I'm hearing you out, and I agree
> > with a lot of what you have to say. I just don't agree with all of it.
> 
> Sure, I just find it hard to be constructive against "I don't like it" :)
> 
> > Which drivers don't support FRWR that we need to do other things for?
> > ipath - deprecated
> 
> We have permission to move this to staging and then RM it, so yay!
> 
> > mthca - soon to be deprecated
> 
> This one I have a problem with. There is a lot of mthca hardware out
> there, it feels wrong to nuke it.. If we can continue to support the
> FMR scheme it uses transparently, that would be excellent.
> 
> I'm not hearing a strong reason why that shouldn't be the case...
> 
> > ehca - Not sure what is going on there. They only have phys_mr
> >        anyway, which just lost its only caller in the kernel
> 
> I thought it supported fmr:
> 
> drivers/infiniband/hw/ehca/ehca_main.c: shca->ib_device.alloc_fmr           = ehca_alloc_fmr;
> drivers/infiniband/hw/ehca/ehca_main.c: shca->ib_device.map_phys_fmr        = ehca_map_phys_fmr;
> drivers/infiniband/hw/ehca/ehca_main.c: shca->ib_device.unmap_fmr           = ehca_unmap_fmr;
> drivers/infiniband/hw/ehca/ehca_main.c: shca->ib_device.dealloc_fmr         = ehca_dealloc_fmr;
> 
> ?
> 
> I'm not sure what the status of ehca is, but I somehow suspect any
> remaining users are going to be on an old vendor kernel forever..
> 
> > amso1100 - git log shows almost zero activity with this driver
> 
> IIRC, this card was already/nearly discontinued and obsolete when the
> driver was added. I certainly wouldn't object to removing it, or totally
> breaking it for kernel storage ULPs - Steve?
>

We can remove it or stage it then remove.
 
> > usnic - who is completely not interesting to the kernel ULPs because:
> 
> Right.
> 
> > So my point is, it's a great deal of effort to combine these
> > in any API that we come up with for a bunch of esoteric drivers.
> > Let's just let them die on their own, please.
> 
> If amso1100 is the only iWarp driver that needs phys_mr, I'd be happy
> to drop it and drop the requirement to support that. Is that right Steve?
> 


Yes.

> But mthca, I have a hard time calling that esoteric.. Heck, even I
> still have various mthca cards around here that are regularly used for
> SDR/DDR/CX4 testing on modern kernels..
> 
> Jason


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: Kernel fast memory registration API proposal [RFC]
       [not found]                         ` <55A4CD5B.9030000-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2015-07-14 18:58                           ` Chuck Lever
  0 siblings, 0 replies; 68+ messages in thread
From: Chuck Lever @ 2015-07-14 18:58 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Jason Gunthorpe, Steve Wise, Or Gerlitz, Oren Duer,
	Bart Van Assche, Liran Liss, Hefty, Sean, Doug Ledford,
	Tom Talpey


On Jul 14, 2015, at 4:50 AM, Sagi Grimberg <sagig-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org> wrote:

> On 7/13/2015 5:16 PM, Chuck Lever wrote:
> 
>>> NFS really should be using something more similar to a scatterlist,
>>> as it maps pretty well to the sk_frags in the network layer as well.
>>> 
>>> Struct scatterlist is imprtant because it's the way the DMA mapping
>>> functions takes a multi-page argument, so ayone who wants to batch
>>> the S/G mapping calls needs it.
>> 
>> An excellent topic to bring up on linux-nfs-u79uwXL29TaiAVqoAR/hOA@public.gmane.org
>> 
>> In the meantime, I think rpcrdma.ko would have to be responsible for
>> converting struct page to struct scatterlist.
>> 
> 
> Fine by me, so I take it that you are OK with the proposed API?

I think I can make struct scatterlist work as well as struct
ib_fast_reg_page_list works now.

If Christoph is correct, the pay-off for this change would be when
the NFS upper layers are converted to use struct scatterlist instead
of struct page.

Until then, I don’t see a specific benefit for RPC/RDMA. But I will
go with the flow.


--
Chuck Lever




^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: Kernel fast memory registration API proposal [RFC]
       [not found]                         ` <20150714170808.GA19814-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2015-07-14 18:07                           ` Steve Wise
@ 2015-07-15  3:05                           ` Doug Ledford
       [not found]                             ` <55A5CDE2.4060904-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  1 sibling, 1 reply; 68+ messages in thread
From: Doug Ledford @ 2015-07-15  3:05 UTC (permalink / raw)
  To: Jason Gunthorpe, Sagi Grimberg
  Cc: Christoph Hellwig, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Steve Wise,
	Or Gerlitz, Oren Duer, Chuck Lever, Bart Van Assche, Liran Liss,
	Hefty, Sean, Tom Talpey

[-- Attachment #1: Type: text/plain, Size: 1985 bytes --]

On 07/14/2015 01:08 PM, Jason Gunthorpe wrote:
> On Tue, Jul 14, 2015 at 07:46:50PM +0300, Sagi Grimberg wrote:
>> Which drivers don't support FRWR that we need to do other things for?
>> ipath - deprecated
> 
> We have permission to move this to staging and then RM it, so yay!

Correct.

>> mthca - soon to be deprecated
> 
> This one I have a problem with. There is a lot of mthca hardware out
> there, it feels wrong to nuke it.. If we can continue to support the
> FMR scheme it uses transparently, that would be excellent.
> 
> I'm not hearing a strong reason why that shouldn't be the case...

I'm not so sure about deprecating mthca either.  There are still a
number of people I know that like to buy cheap SDR/DDR switches on EBay
and pair them with cheap mthca cards and have cheap 10/20GBit/s home
networks.  We have a couple people inside Red Hat that occasionally tell
people what to look for on EBay to do just this.

>> ehca - Not sure what is going on there. They only have phys_mr
>>        anyway, which just lost its only caller in the kernel
> 
> I thought it supported fmr:
> 
> drivers/infiniband/hw/ehca/ehca_main.c: shca->ib_device.alloc_fmr           = ehca_alloc_fmr;
> drivers/infiniband/hw/ehca/ehca_main.c: shca->ib_device.map_phys_fmr        = ehca_map_phys_fmr;
> drivers/infiniband/hw/ehca/ehca_main.c: shca->ib_device.unmap_fmr           = ehca_unmap_fmr;
> drivers/infiniband/hw/ehca/ehca_main.c: shca->ib_device.dealloc_fmr         = ehca_dealloc_fmr;
> 
> ?
> 
> I'm not sure what the status of ehca is, but I somehow suspect any
> remaining users are going to be on an old vendor kernel forever..

Nothing of note has been done on ehca for a long time, and I wouldn't be
surprised if there is some bit rot in this driver.  We need to speak
with IBM, but I think it's a candidate for future removal.

-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 884 bytes --]

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: Kernel fast memory registration API proposal [RFC]
       [not found] ` <559F8BD1.9080308-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  2015-07-11 10:39   ` Christoph Hellwig
  2015-07-13 16:30   ` Jason Gunthorpe
@ 2015-07-15  7:32   ` Christoph Hellwig
       [not found]     ` <20150715073233.GA11535-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
  2015-07-19  5:45   ` Sagi Grimberg
  3 siblings, 1 reply; 68+ messages in thread
From: Christoph Hellwig @ 2015-07-15  7:32 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig,
	Jason Gunthorpe, Steve Wise, Or Gerlitz, Oren Duer, Chuck Lever,
	Bart Van Assche, Liran Liss, Hefty, Sean, Doug Ledford,
	Tom Talpey

Hi Sagi,

I went over your proposal based on reviewing the ongoing MR threads
and my implementation of a similar in-driver abstraction, so here
are some proposed updates.

> struct provider_mr {
> 	u64		*page_list; // or what ever the HW uses
> 	... <more private stuff> ...
> 	struct ib_mr	ibmr;
> };

Call this rdma_mr to fit the scheme we use for "generic" APIs in the
RDMA stack?

Also let's hash out the API for allocating it. You suggest the existing
but currently almost unused ib_create_mr API, which isn't quite suitable
as it doesn't transparently allocate the page list or other MR-specific
data.  Another odd bit in ib_create_mr is that it doesn't actually say
which kind of MR to allocate.

I'd also get rid of the horrible style of using structs even for simple
attributes.

so how about:


int rdma_create_mr(struct ib_pd *pd, enum rdma_mr_type mr,
	u32 max_pages, int flags);

>   *     array from a SG list
>   * @mr:          memory region
>   * @sg:          sg list
>   * @sg_nents:    number of elements in the sg
>   *
>   * Can fail if the HW is not able to register this
>   * sg list. In case of failure - caller is responsible
>   * to handle it (bounce-buffer, multiple registrations...)
>   */
> int ib_mr_set_sg(struct ib_mr *mr,
>                  struct scatterlist *sg,
>                  unsigned short sg_nents);

Call this rdma_map_sg?
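To make the contract concrete, here is a rough userspace model of what the sg-to-page-list population could do behind such a call. All names here (sg_ent, sketch_mr), the 4K HW page size, and the first/last-entry alignment rule are illustrative assumptions, not the real kernel types or any particular device's rules:

```c
#include <stdint.h>
#include <stddef.h>

#define HW_PAGE_SIZE  4096u
#define HW_PAGE_MASK  (HW_PAGE_SIZE - 1)

struct sg_ent {                 /* stand-in for struct scatterlist */
	uint64_t dma_addr;
	uint32_t length;
};

struct sketch_mr {              /* stand-in for the provider MR */
	uint64_t page_list[32];
	unsigned int npages;
	unsigned int max_pages;
};

/* Returns 0 on success, -1 if the HW could not register this list
 * (a gap in the middle, or more pages than the MR supports). */
static int sketch_mr_set_sg(struct sketch_mr *mr,
			    const struct sg_ent *sg, unsigned short nents)
{
	unsigned short i;

	mr->npages = 0;
	for (i = 0; i < nents; i++) {
		uint64_t addr = sg[i].dma_addr;
		uint64_t end = addr + sg[i].length;

		/* Only the first entry may start mid-page, and only
		 * the last may end mid-page (FRWR-style constraint). */
		if (i > 0 && (addr & HW_PAGE_MASK))
			return -1;
		if (i < nents - 1 && (end & HW_PAGE_MASK))
			return -1;

		for (addr &= ~(uint64_t)HW_PAGE_MASK; addr < end;
		     addr += HW_PAGE_SIZE) {
			if (mr->npages >= mr->max_pages)
				return -1;
			mr->page_list[mr->npages++] = addr;
		}
	}
	return 0;
}
```

The failure returns model the "caller is responsible to handle it" case from the kerneldoc: on -1 the caller would bounce-buffer or split into multiple registrations.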

>          /* register the MR */
>          frwr.opcode = IB_WR_FAST_REG_MR;
>          frwr.wrid = my_wrid;
>          frwr.wr.fast_reg.mr = mr;
>          frwr.wr.fast_reg.iova = ib_sg_dma_address(&sg[0]);
>          frwr.wr.fast_reg.length = length;
>          frwr.wr.fast_reg.access_flags = my_flags;

Provide a helper to hide all this behind the scenes please:

void rdma_init_mr_wr(struct ib_send_wr *wr, struct rdma_mr *mr,
		u64 wr_id, int mr_access_flags);

Or, if we go with Jason's suggestion, split "int mr_access_flags" into
"bool remote, bool is_write".

To support FMRs, if we care enough, we'd create a purely in-memory MR
in rdma_create_mr and then map it with ib_fmr_pool_map_phys via a helper
that the driver can call instead of rdma_init_mr_wr.
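As a sketch of how an rdma_init_mr_wr() helper and the remote/is_write split could fit together (the structs, opcode, and flag values below are stand-ins chosen for illustration, not the real ib_send_wr or IB_ACCESS_* constants):

```c
#include <stdint.h>

enum { OP_FAST_REG_MR = 1 };
enum { ACC_LOCAL_WRITE = 1, ACC_REMOTE_READ = 2, ACC_REMOTE_WRITE = 4 };

struct sketch_wr {              /* stand-in for struct ib_send_wr */
	int opcode;
	uint64_t wrid;
	struct {
		void *mr;
		uint64_t iova;
		uint64_t length;
		int access_flags;
	} fast_reg;
};

/* Takes "remote"/"is_write" instead of raw access flags: a
 * local-only registration can never gain remote rights. */
static void sketch_init_mr_wr(struct sketch_wr *wr, void *mr,
			      uint64_t wrid, uint64_t iova, uint64_t length,
			      int remote, int is_write)
{
	wr->opcode = OP_FAST_REG_MR;
	wr->wrid = wrid;
	wr->fast_reg.mr = mr;
	wr->fast_reg.iova = iova;
	wr->fast_reg.length = length;
	if (remote && is_write)
		wr->fast_reg.access_flags = ACC_REMOTE_WRITE | ACC_LOCAL_WRITE;
	else if (remote)
		wr->fast_reg.access_flags = ACC_REMOTE_READ;
	else
		wr->fast_reg.access_flags = ACC_LOCAL_WRITE;
}
```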
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: Kernel fast memory registration API proposal [RFC]
       [not found]                             ` <20150714170859.GB19814-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2015-07-15  8:01                               ` Sagi Grimberg
       [not found]                                 ` <55A6136A.8010204-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 68+ messages in thread
From: Sagi Grimberg @ 2015-07-15  8:01 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Steve Wise,
	Or Gerlitz, Oren Duer, Chuck Lever, Bart Van Assche, Liran Liss,
	Hefty, Sean, Doug Ledford, Tom Talpey

On 7/14/2015 8:09 PM, Jason Gunthorpe wrote:
> On Tue, Jul 14, 2015 at 07:55:39PM +0300, Sagi Grimberg wrote:
>
>> But, if people think that it's better to have an API that does implicit
>> posting always without notification, and then silently consume error or
>> flush completions. I can try and look at it as well.
>
> Can we do FMR transparently if we bundle the post? If yes, I'd call
> that a winner..

Doing FMR transparently is not possible, as the FMR unmap flow sleeps.
Unlike NFS, iSER unmaps from soft-IRQ context and SRP unmaps from
hard-IRQ context. Changing those contexts to thread context is not
acceptable. The best we can do is use FMR pools transparently.
Other than polluting the API and its semantics, I suspect people will
have other problems with it (leaving the MRs open).

I suggest starting with what I proposed. At a later stage (if we
still think it's needed) we can add a higher-level API that hides the
post, something like:

rdma_reg_sg(struct ib_qp *qp,
             struct ib_mr *mr,
             struct scatterlist *sg,
             int sg_nents,
             u64 offset,
             u64 length,
             int access_flags)

rdma_unreg_mr(struct ib_qp *qp,
               struct ib_mr *mr)


Or incorporate that with a pool API, something like:

rdma_create_fr_pool(struct ib_qp *qp,
                     int nmrs,
                     int mr_size,
                     int create_flags)

rdma_destroy_fr_pool(struct rdma_fr_pool *pool)

rdma_fr_reg_sg(struct rdma_fr_pool *pool,
                struct scatterlist *sg,
                int sg_nents,
                u64 offset,
                u64 length,
                int access_flags)

rdma_fr_unreg_mr(struct rdma_fr_pool *pool,
                  struct ib_mr *mr)
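A toy model of the pool variant, just to show the intended lifetime: reg takes a free MR from the pool and unreg returns it. The names and the fixed pool size below are assumptions for illustration only:

```c
#include <stddef.h>

#define POOL_MAX 8

struct sketch_fr_mr { int in_use; };

struct sketch_fr_pool {
	struct sketch_fr_mr mrs[POOL_MAX];
	int nmrs;
};

static int sketch_fr_pool_init(struct sketch_fr_pool *p, int nmrs)
{
	int i;

	if (nmrs > POOL_MAX)
		return -1;
	p->nmrs = nmrs;
	for (i = 0; i < nmrs; i++)
		p->mrs[i].in_use = 0;
	return 0;
}

/* "Register": grab a free MR.  NULL means the caller must fall back
 * (bounce buffer, multiple registrations, waiting for an unreg). */
static struct sketch_fr_mr *sketch_fr_reg(struct sketch_fr_pool *p)
{
	int i;

	for (i = 0; i < p->nmrs; i++) {
		if (!p->mrs[i].in_use) {
			p->mrs[i].in_use = 1;
			return &p->mrs[i];
		}
	}
	return NULL;
}

static void sketch_fr_unreg(struct sketch_fr_pool *p, struct sketch_fr_mr *mr)
{
	(void)p;
	mr->in_use = 0;
}
```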


Note that I expect problems with both approaches, but
we can look into it...

Sagi.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: Kernel fast memory registration API proposal [RFC]
       [not found]     ` <20150715073233.GA11535-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
@ 2015-07-15  8:33       ` Sagi Grimberg
       [not found]         ` <55A61AE3.8020609-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  2015-07-15 17:07       ` Jason Gunthorpe
  1 sibling, 1 reply; 68+ messages in thread
From: Sagi Grimberg @ 2015-07-15  8:33 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Jason Gunthorpe, Steve Wise,
	Or Gerlitz, Oren Duer, Chuck Lever, Bart Van Assche, Liran Liss,
	Hefty, Sean, Doug Ledford, Tom Talpey

On 7/15/2015 10:32 AM, Christoph Hellwig wrote:
> Hi Sagi,
>
> I went over your proposal based on reviewing the ongoing MR threads
> and my implementation of a similar in-driver abstraction, so here
> are some proposed updates.
>
>> struct provider_mr {
>> 	u64		*page_list; // or what ever the HW uses
>> 	... <more private stuff> ...
>> 	struct ib_mr	ibmr;
>> };
>
> Call this rdma_mr to fit the scheme we use for "generic" APIs in the
> RDMA stack?

Umm, I think this can become weird given all other primitives have
ib_ prefix. I'd prefer to keep that prefix to stay consistent, and have
an incremental change to do it for all the primitives (structs & verbs).

>
> Also let's hash out the API for allocating it. You suggest the existing
> but currently almost unused ib_create_mr API, which isn't quite suitable
> as it doesn't transparently allocate the page list or other MR-specific
> data.  Another odd bit in ib_create_mr is that it doesn't actually say
> which kind of MR to allocate.

I can change it to be whatever we want. Unlike other mr allocation APIs,
it can be easily extended without changing the callers.

>
> I'd also get rid of the horrible style of using structs even for simple
> attributes.

The reason I thought an attr struct would be beneficial is that it can be
easily extended without changing every single caller (we might have
more attributes in the future?). Plus, it is consistent with QP and
CQ creation.

If you feel strongly about it, I can change it.

>
> so how about:
>
>
> int rdma_create_mr(struct ib_pd *pd, enum rdma_mr_type mr,
> 	u32 max_pages, int flags);
>
>>    *     array from a SG list
>>    * @mr:          memory region
>>    * @sg:          sg list
>>    * @sg_nents:    number of elements in the sg
>>    *
>>    * Can fail if the HW is not able to register this
>>    * sg list. In case of failure - caller is responsible
>>    * to handle it (bounce-buffer, multiple registrations...)
>>    */
>> int ib_mr_set_sg(struct ib_mr *mr,
>>                   struct scatterlist *sg,
>>                   unsigned short sg_nents);
>
> Call this rdma_map_sg?

OK.

>
>>           /* register the MR */
>>           frwr.opcode = IB_WR_FAST_REG_MR;
>>           frwr.wrid = my_wrid;
>>           frwr.wr.fast_reg.mr = mr;
>>           frwr.wr.fast_reg.iova = ib_sg_dma_address(&sg[0]);
>>           frwr.wr.fast_reg.length = length;
>>           frwr.wr.fast_reg.access_flags = my_flags;
>
> Provide a helper to hide all this behind the scenes please:
>
> void rdma_init_mr_wr(struct ib_send_wr *wr, struct rdma_mr *mr,
> 		u64 wr_id, int mr_access_flags);
>
> Or, if we go with Jason's suggestion, split "int mr_access_flags" into
> "bool remote, bool is_write".

Yea I can do that...

>
> To support FRMs if we care enough we'd create a purely in-memory MR
> in rdma_create_mr and then map it to ib_fmr_pool_map_phys with a helper
> that the driver can call instead of rdma_init_mr_wr.
>

Let's take it up at a later stage.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: Kernel fast memory registration API proposal [RFC]
       [not found]                             ` <55A5CDE2.4060904-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2015-07-15  8:52                               ` Sagi Grimberg
  0 siblings, 0 replies; 68+ messages in thread
From: Sagi Grimberg @ 2015-07-15  8:52 UTC (permalink / raw)
  To: Doug Ledford, Jason Gunthorpe
  Cc: Christoph Hellwig, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Steve Wise,
	Or Gerlitz, Oren Duer, Chuck Lever, Bart Van Assche, Liran Liss,
	Hefty, Sean, Tom Talpey

On 7/15/2015 6:05 AM, Doug Ledford wrote:
> On 07/14/2015 01:08 PM, Jason Gunthorpe wrote:
>> On Tue, Jul 14, 2015 at 07:46:50PM +0300, Sagi Grimberg wrote:
>>> Which drivers don't support FRWR such that we need to do other things?
>>> ipath - deprecated
>>
>> We have permission to move this to staging and then RM it, so yay!
>
> Correct.
>
>>> mthca - soon to be deprecated
>>
>> This one I have a problem with. There is a lot of mthca hardware out
>> there, it feels wrong to nuke it.. If we can continue to support the
>> FMR scheme it uses transparently, that would be excellent.
>>
>> I'm not hearing a strong reason why that shouldn't be the case...
>
> I'm not so sure about deprecating mthca either.  There are still a
> number of people I know that like to buy cheap SDR/DDR switches on EBay
> and pair them with cheap mthca cards and have cheap 10/20GBit/s home
> networks.  We have a couple people inside Red Hat that occasionally tell
> people what to look for on EBay to do just this.

I didn't mean deprecate it altogether, I meant that I don't think we
should do backflips for a single (old) driver.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: Kernel fast memory registration API proposal [RFC]
       [not found]         ` <55A61AE3.8020609-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2015-07-15  9:07           ` Christoph Hellwig
  2015-07-15 19:15           ` Jason Gunthorpe
  1 sibling, 0 replies; 68+ messages in thread
From: Christoph Hellwig @ 2015-07-15  9:07 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Jason Gunthorpe, Steve Wise,
	Or Gerlitz, Oren Duer, Chuck Lever, Bart Van Assche, Liran Liss,
	Hefty, Sean, Doug Ledford, Tom Talpey

On Wed, Jul 15, 2015 at 11:33:39AM +0300, Sagi Grimberg wrote:
> Umm, I think this can become weird given all other primitives have
> ib_ prefix. I'd prefer to keep that prefix to stay consistent, and have
> an incremental change to do it for all the primitives (structs & verbs).

Fine with me, we're getting into bikeshedding here anyway..

> >I'd also get rid of the horrible style of using structs even for simple
> >attributes.
> 
> The reason I thought an attr struct would benefit is that it can be
> easily extended without changing every single caller (we might have
> more attributes in the future?). Plus, it is consistent with QP and
> CQ creation.
> 
> If you feel strongly about it, I can change it.

It's a very hard-to-follow API, and extending callers for new features
is not an issue but actually a "feature", as it requires/encourages
a review of each caller to check how the new feature fits in for them.

> >To support FRMs if we care enough we'd create a purely in-memory MR
> >in rdma_create_mr and then map it to ib_fmr_pool_map_phys with a helper
> >that the driver can call instead of rdma_init_mr_wr.
> >
> 
> Let's take it up at a later stage.

For the API to be useful it should cover the existing users.  I think
the FMR pool support is fairly trivial, so if you come up with the FR
version I'll volunteer to add it.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: Kernel fast memory registration API proposal [RFC]
       [not found]                                 ` <55A6136A.8010204-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2015-07-15 14:32                                   ` Chuck Lever
       [not found]                                     ` <A9EF2F26-E737-4E80-B2E3-F8D6406F9893-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
  2015-07-15 18:31                                   ` Jason Gunthorpe
  1 sibling, 1 reply; 68+ messages in thread
From: Chuck Lever @ 2015-07-15 14:32 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Jason Gunthorpe, Christoph Hellwig,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Steve Wise, Or Gerlitz,
	Oren Duer, Bart Van Assche, Liran Liss, Hefty, Sean,
	Doug Ledford, Tom Talpey


On Jul 15, 2015, at 4:01 AM, Sagi Grimberg <sagig-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org> wrote:

> On 7/14/2015 8:09 PM, Jason Gunthorpe wrote:
>> On Tue, Jul 14, 2015 at 07:55:39PM +0300, Sagi Grimberg wrote:
>> 
>>> But, if people think that it's better to have an API that does implicit
>>> posting always without notification, and then silently consume error or
>>> flush completions. I can try and look at it as well.
>> 
>> Can we do FMR transparently if we bundle the post? If yes, I'd call
>> that a winner..
> 
> Doing FMR transparently is not possible, as the FMR unmap flow sleeps.
> Unlike NFS, iSER unmaps from soft-IRQ context and SRP unmaps from
> hard-IRQ context.

The context in which RPC/RDMA performs FMR unmap mustn’t sleep.
RPC/RDMA is in roughly the same situation as the other initiators.


> Changing those contexts to thread context is not
> acceptable. The best we can do is use FMR pools transparently.
> Other than polluting the API and its semantics, I suspect people will
> have other problems with it (leaving the MRs open).

Count me in that group.

I would rather not build a non-deterministic delay into the
unmap interface. Using a pool or having map do an implicit
unmap are both solutions I’d rather avoid.

In both situations, MRs can be left mapped indefinitely if,
say, the workload pauses.


> I suggest starting with what I proposed. At a later stage (if we
> still think it's needed) we can add a higher-level API that hides the
> post, something like:

> rdma_reg_sg(struct ib_qp *qp,
>            struct ib_mr *mr,
>            struct scatterlist *sg,
>            int sg_nents,
>            u64 offset,
>            u64 length,
>            int access_flags)

I still wonder what “length” means in the context of a scatterlist.


> rdma_unreg_mr(struct ib_qp *qp,
>              struct ib_mr *mr)

An implicit caveat to using this is that the ULP would have to
ensure the “qp” parameter is not NULL and that the referenced
QP will not be destroyed during this call.

So these calls have to be serialized with transport connect and
device removal.

The philosophical preference would be that the API should take
care of this itself, but I’m not smart enough to see how that
can be done.


> Or incorporate that with a pool API, something like:

FRWR does not need a pool. I’d rather not burden this API
with what is essentially an FMR workaround that introduces a
non-deterministic exposure of the data in each MR.


> rdma_create_fr_pool(struct ib_qp *qp,
>                    int nmrs,
>                    int mr_size,
>                    int create_flags)
> 
> rdma_destroy_fr_pool(struct rdma_fr_pool *pool)
> 
> rdma_fr_reg_sg(struct rdma_fr_pool *pool,
>               struct scatterlist *sg,
>               int sg_nents,
>               u64 offset,
>               u64 length,
>               int access_flags)
> 
> rdma_fr_unreg_mr(struct rdma_fr_pool *pool,
>                 struct ib_mr *mr)
> 
> 
> Note that I expect problems with both approaches, but
> we can look into it...
> 
> Sagi.

--
Chuck Lever




^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: Kernel fast memory registration API proposal [RFC]
       [not found]                                     ` <A9EF2F26-E737-4E80-B2E3-F8D6406F9893-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
@ 2015-07-15 14:39                                       ` Chuck Lever
  2015-07-15 17:19                                       ` Jason Gunthorpe
  2015-07-16  6:52                                       ` Sagi Grimberg
  2 siblings, 0 replies; 68+ messages in thread
From: Chuck Lever @ 2015-07-15 14:39 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Jason Gunthorpe, Christoph Hellwig,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Steve Wise, Or Gerlitz,
	Oren Duer, Bart Van Assche, Liran Liss, Hefty, Sean,
	Doug Ledford, Tom Talpey


On Jul 15, 2015, at 10:32 AM, Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote:

> 
> On Jul 15, 2015, at 4:01 AM, Sagi Grimberg <sagig-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org> wrote:
> 
>> On 7/14/2015 8:09 PM, Jason Gunthorpe wrote:
>>> On Tue, Jul 14, 2015 at 07:55:39PM +0300, Sagi Grimberg wrote:
>>> 
>>>> But, if people think that it's better to have an API that does implicit
>>>> posting always without notification, and then silently consume error or
>>>> flush completions. I can try and look at it as well.
>>> 
>>> Can we do FMR transparently if we bundle the post? If yes, I'd call
>>> that a winner..
>> 
>> Doing FMR transparently is not possible as the unmap flow is scheduling.
>> Unlike NFS, iSER unmaps from a soft-IRQ context, SRP unmaps from
>> hard-IRQ context.
> 
> The context in which RPC/RDMA performs FMR unmap mustn’t sleep.
> RPC/RDMA is in roughly the same situation as the other initiators.
> 
> 
>> Changing the context to thread context is not
>> acceptable. The best we can do is using FMR_POOLs transparently.
>> Other than polluting the API and its semantics I suspect people will
>> have other problems with it (leaving the MRs open).
> 
> Count me in that group.
> 
> I would rather not build a non-deterministic delay into the
> unmap interface. Using a pool or having map do an implicit
> unmap are both solutions I’d rather avoid.
> 
> In both situations, MRs can be left mapped indefinitely if,
> say, the workload pauses.
> 
> 
>> I suggest to start with what I proposed. And in a later stage (if we
>> still think its needed) we can have a higher level API that hides the
>> post, something like:
> 
>> rdma_reg_sg(struct ib_qp *qp,
>>           struct ib_mr *mr,
>>           struct scatterlist *sg,
>>           int sg_nents,
>>           u64 offset,
>>           u64 length,
>>           int access_flags)
> 
> I still wonder what “length” means in the context of a scatterlist.
> 
> 
>> rdma_unreg_mr(struct ib_qp *qp,
>>             struct ib_mr *mr)
> 
> An implicit caveat to using this is that the ULP would have to
> ensure the “qp” parameter is not NULL and that the referenced
> QP will not be destroyed during this call.
> 
> So these calls have to be serialized with transport connect and
> device removal.
> 
> The philosophical preference would be that the API should take
> care of this itself, but I’m not smart enough to see how that
> can be done.

Well, OK, there is an obvious way to do this: QP reference counting.


>> Or incorporate that with a pool API, something like:
> 
> FRWR does not need a pool. I’d rather not burden this API
> with what is essentially an FMR workaround that introduces a
> non-deterministic exposure of the data in each MR.
> 
> 
>> rdma_create_fr_pool(struct ib_qp *qp,
>>                   int nmrs,
>>                   int mr_size,
>>                   int create_flags)
>> 
>> rdma_destroy_fr_pool(struct rdma_fr_pool *pool)
>> 
>> rdma_fr_reg_sg(struct rdma_fr_pool *pool,
>>              struct scatterlist *sg,
>>              int sg_nents,
>>              u64 offset,
>>              u64 length,
>>              int access_flags)
>> 
>> rdma_fr_unreg_mr(struct rdma_fr_pool *pool,
>>                struct ib_mr *mr)
>> 
>> 
>> Note that I expect problems with both approaches, but
>> we can look into it...
>> 
>> Sagi.
> 
> --
> Chuck Lever
> 
> 
> 

--
Chuck Lever




^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: Kernel fast memory registration API proposal [RFC]
       [not found]     ` <20150715073233.GA11535-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
  2015-07-15  8:33       ` Sagi Grimberg
@ 2015-07-15 17:07       ` Jason Gunthorpe
       [not found]         ` <20150715170750.GA23588-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  1 sibling, 1 reply; 68+ messages in thread
From: Jason Gunthorpe @ 2015-07-15 17:07 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Sagi Grimberg, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Steve Wise,
	Or Gerlitz, Oren Duer, Chuck Lever, Bart Van Assche, Liran Liss,
	Hefty, Sean, Doug Ledford, Tom Talpey

On Wed, Jul 15, 2015 at 12:32:33AM -0700, Christoph Hellwig wrote:
> int rdma_create_mr(struct ib_pd *pd, enum rdma_mr_type mr,
> 	u32 max_pages, int flags);
> 
> >   *     array from a SG list
> >   * @mr:          memory region
> >   * @sg:          sg list
> >   * @sg_nents:    number of elements in the sg
> >   *
> >   * Can fail if the HW is not able to register this
> >   * sg list. In case of failure - caller is responsible
> >   * to handle it (bounce-buffer, multiple registrations...)
> >   */
> > int ib_mr_set_sg(struct ib_mr *mr,
> >                  struct scatterlist *sg,
> >                  unsigned short sg_nents);
> 
> Call this rdma_map_sg?
> 
> >          /* register the MR */
> >          frwr.opcode = IB_WR_FAST_REG_MR;
> >          frwr.wrid = my_wrid;
> >          frwr.wr.fast_reg.mr = mr;
> >          frwr.wr.fast_reg.iova = ib_sg_dma_address(&sg[0]);
> >          frwr.wr.fast_reg.length = length;
> >          frwr.wr.fast_reg.access_flags = my_flags;
> 
> Provide a helper to hide all this behind the scenes please:
> 
> void rdma_init_mr_wr(struct ib_send_wr *wr, struct rdma_mr *mr,
> 		u64 wr_id, int mr_access_flags);
>
> Or, if we go with Jason's suggestion, split "int mr_access_flags" into
> "bool remote, bool is_write".

Yes please. Considering the security implications we need to be much
more careful API wise here. This is more of a code-as-documentation
issue than a functional issue.

Let's avoid the issue of implicit posting, but still delegate the WR
construction to the driver:

 rdma_map_sg_lkey(u32 *lkey_out,
                  struct rdma_mr *mr,
		  const struct scatterlist *sg,
 	          unsigned short sg_nents,
		  unsigned int ops_supported,
		  struct ib_send_wr *post_wr,
		  u64 wrid)
 rdma_map_sg_rkey(.. same args ..)

Used as:

 rdma_map_sg_lkey(&wr[1].lkey,mr,sg,sg_nents,RDMA_OP_SEND,
 		 &wr[0],...);
 wr[1].opcode = IB_SEND;
 ib_post_send(wr..)

I'd probably implement the above as two wrappers around a
generic driver/core callback. All the wrappers do is translate
ops_supported to the IB_ACCESS scheme.

The two entry points interpret their ops_supported differently
(source/sink in Steve's model), so it becomes impossible to
inadvertently create a remote access lkey. And the API makes it
unnatural to use a MR as both a lkey and rkey.

Just thinking .. I'd probably drop the wrid and have rdma_map_sg_x
create an unsignaled wr by default. If the caller wants to force
signaling they can fill in the wrid and change the flags.
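A minimal model of the two wrappers around one core callback, showing only the ops_supported-to-access translation (the flag names and values here are made up for illustration; the real IB_ACCESS_* constants differ):

```c
enum {
	RDMA_OP_SEND       = 1,
	RDMA_OP_RDMA_READ  = 2,
	RDMA_OP_RDMA_WRITE = 4,
};
enum { ACC_LOCAL_WRITE = 1, ACC_REMOTE_READ = 2, ACC_REMOTE_WRITE = 4 };

/* Core translation, shared by both wrappers. */
static int sketch_ops_to_access(unsigned int ops, int is_rkey)
{
	int acc = 0;

	if (is_rkey) {
		/* rkey side: the remote peer reads from / writes to us. */
		if (ops & RDMA_OP_RDMA_READ)
			acc |= ACC_REMOTE_READ;
		if (ops & RDMA_OP_RDMA_WRITE)
			acc |= ACC_REMOTE_WRITE | ACC_LOCAL_WRITE;
	} else {
		/* lkey side: a local READ pulls data into our memory,
		 * so it needs local write; SEND/WRITE sources need
		 * nothing.  Remote bits are never set here. */
		if (ops & RDMA_OP_RDMA_READ)
			acc |= ACC_LOCAL_WRITE;
	}
	return acc;
}

static int sketch_map_sg_lkey_access(unsigned int ops_supported)
{
	return sketch_ops_to_access(ops_supported, 0);
}

static int sketch_map_sg_rkey_access(unsigned int ops_supported)
{
	return sketch_ops_to_access(ops_supported, 1);
}
```

Because the lkey path never sets remote bits, a remote-access lkey cannot be created by accident, which is the security property argued for above.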

---------

I'm still unhappy with this, from a broad perspective. Look at what
any ULP has to implement to do a RDMA READ properly:

if (iwarp)
{
    rdma_map_sg_lkey(&wr[1].lkey,mr,sg,sg_nents,RDMA_OP_RDMA_READ,
 		 &wr[0],...);
    wr[1].opcode = IB_RDMA_READ;
    wr[2].opcode = IB_WR_LOCAL_INV;
    wr[2].invalidate_rkey = wr[1].lkey;
}
else if (sg_nents <= device->read_sg_limit)
{
    wr[0].opcode = IB_RDMA_READ;
    wr[0].lkey = pd->local_dma_lkey;
    ... translate struct scatterlist to a wr ...
}
else if (fmr)
{
     ... ? ...
}
else { // For IB
    rdma_map_sg_lkey(&wr[1].lkey,mr,sg,sg_nents,RDMA_OP_RDMA_READ,
 		 &wr[0],...);
    wr[1].opcode = IB_RDMA_READ;
    // We can lazy invalidate when we recycle the MR (?)
}

And that isn't even considering the possibility of using multiple
RDMA_READ ops instead of an MR (which would be smarter than FMR).

I see the above as a common, mandatory, pattern that should be
factored.. If a ULP is doing RDMA_READ and it isn't doing the above,
it is broken. It either doesn't support iWarp, or it is throwing IB
performance away.

Jason

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: Kernel fast memory registration API proposal [RFC]
       [not found]                                     ` <A9EF2F26-E737-4E80-B2E3-F8D6406F9893-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
  2015-07-15 14:39                                       ` Chuck Lever
@ 2015-07-15 17:19                                       ` Jason Gunthorpe
       [not found]                                         ` <20150715171926.GB23588-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2015-07-16  6:52                                       ` Sagi Grimberg
  2 siblings, 1 reply; 68+ messages in thread
From: Jason Gunthorpe @ 2015-07-15 17:19 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Sagi Grimberg, Christoph Hellwig,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Steve Wise, Or Gerlitz,
	Oren Duer, Bart Van Assche, Liran Liss, Hefty, Sean,
	Doug Ledford, Tom Talpey

On Wed, Jul 15, 2015 at 10:32:55AM -0400, Chuck Lever wrote:

> I would rather not build a non-deterministic delay into the
> unmap interface. Using a pool or having map do an implicit
> unmap are both solutions I’d rather avoid.

Can you explain how NFS is using FMR today? When does it unmap a FMR
rkey and lkey?

If NFS/etc currently have a hole on rkey invalidation when using FMR,
and that hole simply cannot reasonably be solved, I'm actually mildly OK
with enshrining that in a new MR API..

So, it would seem to me, the only major addition we'd need to Sagi's
draft to support FMR, would be a way to catch the completion (the
rdma_unreg_mr) and trigger async MR recycling async in a work queue.

Sagi, how does cleanup of the temporary FRMR work in your draft
proposal? What does the ULP do upon completion?

[Also, just mildly curious, how do we get into an unsleepable
 context anyhow? is the IB completion pending callback called in a
 sleepable context?]

Jason

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: Kernel fast memory registration API proposal [RFC]
       [not found]                                 ` <55A6136A.8010204-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  2015-07-15 14:32                                   ` Chuck Lever
@ 2015-07-15 18:31                                   ` Jason Gunthorpe
       [not found]                                     ` <20150715183129.GC23588-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  1 sibling, 1 reply; 68+ messages in thread
From: Jason Gunthorpe @ 2015-07-15 18:31 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Steve Wise,
	Or Gerlitz, Oren Duer, Chuck Lever, Bart Van Assche, Liran Liss,
	Hefty, Sean, Doug Ledford, Tom Talpey

On Wed, Jul 15, 2015 at 11:01:46AM +0300, Sagi Grimberg wrote:
> On 7/14/2015 8:09 PM, Jason Gunthorpe wrote:
> >On Tue, Jul 14, 2015 at 07:55:39PM +0300, Sagi Grimberg wrote:
> >
> >>But, if people think that it's better to have an API that does implicit
> >>posting always without notification, and then silently consume error or
> >>flush completions. I can try and look at it as well.
> >
> >Can we do FMR transparently if we bundle the post? If yes, I'd call
> >that a winner..
> 
> Doing FMR transparently is not possible, as the FMR unmap flow sleeps.
> Unlike NFS, iSER unmaps from soft-IRQ context and SRP unmaps from
> hard-IRQ context. Changing those contexts to thread context is not
> acceptable. The best we can do is use FMR pools transparently.
> Other than polluting the API and its semantics, I suspect people will
> have other problems with it (leaving the MRs open).

Upon deeper thought, I think I see a fairly simple solution here.

1) Really, we probably never need a FMR for the lkey side, we should
   just use multiple READ/WRITE ops to get a long enough SG list.
   Even if this is not performant on mhca/ehca.

   If we absolutely need FMR for SEND/RECV lkey (do we? Anyone know?),
   then I have some good thoughts on how to make that work transparently.

   However, rather than do all that, I'd probably choose to just
   bounce buffer the few rare SEND/RECVs that need a MR. I'm guessing
   the usage is 0 or near zero??

2) The FMR completion flow for rkey is actually the same as the FRWR flow:
    - Catch the SEND that says the READ/WRITE is done
    - Issue an async invalidate
    - Catch the invalidate completion

So, my simple proposal is to have the core wrap mthca/ehca's
poll_cq. The flow works like this:

  - ULP calls a 'rdma_post_close_rkey' helper
     * For FRWR this posts the INVALIDATE
     * For FMR this triggers a work queue that issues the invalidate
       async
  - ULP calls poll_cq
     * For FRWR no change, the driver is called directly
     * For FMR, the poll_cq wrapper looks at a 2nd queue
       filled in by the async work queue above. If it has entries they
       are copied out as IB_WC_LOCAL_INV before calling the driver's
       poll_cq.
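As a rough userspace sketch of this two-queue poll (every name here — fmr_sw_cq, rdma_poll_cq_wrapper, driver_poll_cq — is invented for illustration; a real version would sit in the core verbs layer and wrap a struct ib_cq):

```c
/* Hypothetical sketch of the poll_cq wrapper described above. The
 * wrapper first drains a software queue filled in by the async
 * invalidate work, reporting those entries as IB_WC_LOCAL_INV, then
 * falls through to the driver's real poll_cq. Opcode value and all
 * names are stand-ins, not the kernel's definitions. */
#include <assert.h>

enum { IB_WC_LOCAL_INV = 7 };           /* stand-in opcode value */

struct ib_wc { int opcode; unsigned long long wr_id; };

/* software queue of FMR invalidates completed by the work queue */
#define SW_CQ_DEPTH 16
static struct ib_wc fmr_sw_cq[SW_CQ_DEPTH];
static int fmr_sw_head, fmr_sw_tail;

/* stub for the driver's real poll_cq: nothing pending in this demo */
static int driver_poll_cq(struct ib_wc *wc, int num) { return 0; }

static int rdma_poll_cq_wrapper(struct ib_wc *wc, int num)
{
    int n = 0;

    /* copy out software completions first */
    while (n < num && fmr_sw_head != fmr_sw_tail) {
        wc[n++] = fmr_sw_cq[fmr_sw_head];
        fmr_sw_head = (fmr_sw_head + 1) % SW_CQ_DEPTH;
    }
    /* then let the hardware CQ fill the rest */
    return n + driver_poll_cq(wc + n, num - n);
}
```

The point of the sketch is that a ULP polling through the wrapper cannot tell whether a IB_WC_LOCAL_INV came from hardware (FRWR) or from the software queue (FMR).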

This works best under the API I was talking about before, using
posting helpers to form the right SQEs for the hardware being used.

I'm not exactly clear on the recycling rules for either FRWR or FMR -
are they use-once-then-destroy, or can they be reused?

Basically.. I think something along your idea is a good first step, it
unifies the driver API for the posting MR schemes.

The next step would be the posting helpers I've been talking about
that do all the complicated logic for the ULPs. Those helpers would be
able to hide the OP segmentation and FMR rkey using the above
schemes.

This sounds very workable? Christoph?

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 68+ messages in thread

* RE: Kernel fast memory registration API proposal [RFC]
       [not found]                                         ` <20150715171926.GB23588-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2015-07-15 18:39                                           ` Steve Wise
  2015-07-15 21:25                                           ` Chuck Lever
  1 sibling, 0 replies; 68+ messages in thread
From: Steve Wise @ 2015-07-15 18:39 UTC (permalink / raw)
  To: 'Jason Gunthorpe', 'Chuck Lever'
  Cc: 'Sagi Grimberg', 'Christoph Hellwig',
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, 'Or Gerlitz',
	'Oren Duer', 'Bart Van Assche',
	'Liran Liss', 'Hefty, Sean',
	'Doug Ledford', 'Tom Talpey'



> -----Original Message-----
> From: Jason Gunthorpe [mailto:jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org]
> Sent: Wednesday, July 15, 2015 12:19 PM
> To: Chuck Lever
> Cc: Sagi Grimberg; Christoph Hellwig; linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; Steve Wise; Or Gerlitz; Oren Duer; Bart Van Assche; Liran Liss; Hefty,
> Sean; Doug Ledford; Tom Talpey
> Subject: Re: Kernel fast memory registration API proposal [RFC]
> 
> On Wed, Jul 15, 2015 at 10:32:55AM -0400, Chuck Lever wrote:
> 
> > I would rather not build a non-deterministic delay into the
> > unmap interface. Using a pool or having map do an implicit
> > unmap are both solutions I’d rather avoid.
> 
> Can you explain how NFS is using FMR today? When does it unmap a FMR
> rkey and lkey?
> 
> If NFS/etc currently have a hole on rkey invalidation when using FMR,
> and that hole simply cannot reasonably be solved, I'm actually mildly OK
> with enshrining that in a new MR API..
> 
> So, it would seem to me, the only major addition we'd need to Sagi's
> draft to support FMR, would be a way to catch the completion (the
> rdma_unreg_mr) and trigger async MR recycling async in a work queue.
> 
> Sagi, how does cleanup of the temporary FRMR work in your draft
> proposal? What does the ULP do upon completion?
> 
> [Also, just mildly curious, how do we get into an unsleepable
>  context anyhow? is the IB completion pending callback called in a
>  sleepable context?]
> 

From Documentation/infiniband/core_locking.txt:

  The context in which completion event and asynchronous event
  callbacks run is not defined.  Depending on the low-level driver, it
  may be process context, softirq context, or interrupt context.
  Upper level protocol consumers may not sleep in a callback.



* RE: Kernel fast memory registration API proposal [RFC]
       [not found]                                     ` <20150715183129.GC23588-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2015-07-15 18:50                                       ` Steve Wise
  2015-07-15 19:09                                         ` Jason Gunthorpe
  2015-07-16  8:02                                       ` Christoph Hellwig
  1 sibling, 1 reply; 68+ messages in thread
From: Steve Wise @ 2015-07-15 18:50 UTC (permalink / raw)
  To: 'Jason Gunthorpe', 'Sagi Grimberg'
  Cc: 'Christoph Hellwig',
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, 'Or Gerlitz',
	'Oren Duer', 'Chuck Lever',
	'Bart Van Assche', 'Liran Liss',
	'Hefty, Sean', 'Doug Ledford',
	'Tom Talpey'



> -----Original Message-----
> From: linux-rdma-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org [mailto:linux-rdma-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org] On Behalf Of Jason Gunthorpe
> Sent: Wednesday, July 15, 2015 1:31 PM
> To: Sagi Grimberg
> Cc: Christoph Hellwig; linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; Steve Wise; Or Gerlitz; Oren Duer; Chuck Lever; Bart Van Assche; Liran Liss;
Hefty,
> Sean; Doug Ledford; Tom Talpey
> Subject: Re: Kernel fast memory registration API proposal [RFC]
> 
> On Wed, Jul 15, 2015 at 11:01:46AM +0300, Sagi Grimberg wrote:
> > On 7/14/2015 8:09 PM, Jason Gunthorpe wrote:
> > >On Tue, Jul 14, 2015 at 07:55:39PM +0300, Sagi Grimberg wrote:
> > >
> > >>But, if people think that it's better to have an API that does implicit
> > >>posting always without notification, and then silently consume error or
> > >>flush completions. I can try and look at it as well.
> > >
> > >Can we do FMR transparently if we bundle the post? If yes, I'd call
> > >that a winner..
> >
> > Doing FMR transparently is not possible as the unmap flow is scheduling.
> > Unlike NFS, iSER unmaps from a soft-IRQ context, SRP unmaps from
> > hard-IRQ context. Changing the context to thread context is not
> > acceptable. The best we can do is using FMR_POOLs transparently.
> > Other than polluting the API and its semantics I suspect people will
> > have other problems with it (leaving the MRs open).
> 
> Upon deeper thought, I think I see a fairly simple solution here.
> 
> 1) Really, we probably never need a FMR for the lkey side, we should
>    just use multiple READ/WRITE ops to get a long enough SG list.
>    Even if this is not performant on mhca/ehca.
> 
>    If we absolutely need FMR for SEND/RECV lkey (do we? Anyone know?),
>    then I have some good thoughts on how to make that work transparent..
> 
>    However, rather than do all that, I'd probably choose to just
>    bounce buffer the few rare SEND/RECVs that need a MR. I'm guessing
>    the usage is 0 or near zero??
> 
> 2) The FMR completion flow for rkey is actually the same as the FRWR flow:
>     - Catch the SEND that says the READ/WRITE is done
>     - Issue an async invalidate
>     - Catch the invalidate completion
> 
> So, my simple proposal is to have the core wrap mthca/ehca's
> poll_cq. The flow works like this:
> 
>   - ULP calls a 'rdma_post_close_rkey' helper
>      * For FRWR this posts the INVALIDATE

Note: Some send operations automatically invalidate an rkey (and the lkey for IB?).  This is intended to avoid having to post the
invalidate WR explicitly.  Namely IB_WR_READ_WITH_INV and IB_WR_SEND_WITH_INV.

>      * For FMR this triggers a work queue that issues the invalidate
>        async
>   - ULP calls poll_cq
>      * For FRWR no change, the driver is called directly
>      * For FMR, the poll_cq wrapper looks at a 2nd queue
>        filled in by the async work queue above. If it has entries they
>        are copied out as IB_WC_LOCAL_INV before calling the driver's
>        poll_cq.
> 
> This works best under the API I was talking about before, using
> posting helpers to form the right SQEs for the hardware being used.
> 
> I'm not exactly clear on the recycling rules for either FRWR or FMR -
> are they use-once-then-destroy, or can they be reused?
> 

For FRWRs, the MR can be reused with the same key values, or the bottom 8b of the keys can be modified before re-registering using
ib_update_fast_reg_key().  This allows applications to detect when using stale keys. 
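For reference, the key-bump pattern looks roughly like this (struct ib_mr is stubbed down to its two keys; ib_update_fast_reg_key() is shown with the masking the real helper performs, but treat this as a sketch, not the kernel source):

```c
/* Sketch of FRWR key recycling: the bottom 8 bits of the lkey/rkey are
 * the application-owned "key" portion, and bumping them before
 * re-registration lets the ULP detect use of a stale key. */
#include <assert.h>

struct ib_mr { unsigned lkey, rkey; };

static void ib_update_fast_reg_key(struct ib_mr *mr, unsigned char newkey)
{
    /* replace only the low 8 bits of each key */
    mr->lkey = (mr->lkey & 0xffffff00u) | newkey;
    mr->rkey = (mr->rkey & 0xffffff00u) | newkey;
}

/* typical ULP pattern: increment the 8-bit key on every re-registration */
static void recycle_mr_key(struct ib_mr *mr)
{
    ib_update_fast_reg_key(mr, (unsigned char)(mr->rkey + 1));
}
```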

> Basically.. I think something along your idea is a good first step, it
> unifies the driver API for the posting MR schemes.
> 
> The next step would be the posting helpers I've been talking about
> that do all the complicated logic for the ULPs. Those helpers would be
> able to hide the OP segmentation and FMR rkey using the above
> schemes.
> 
> This sounds very workable? Christoph?
> 
> Jason

* Re: Kernel fast memory registration API proposal [RFC]
  2015-07-15 18:50                                       ` Steve Wise
@ 2015-07-15 19:09                                         ` Jason Gunthorpe
       [not found]                                           ` <20150715190947.GE23588-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 68+ messages in thread
From: Jason Gunthorpe @ 2015-07-15 19:09 UTC (permalink / raw)
  To: Steve Wise
  Cc: 'Sagi Grimberg', 'Christoph Hellwig',
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, 'Or Gerlitz',
	'Oren Duer', 'Chuck Lever',
	'Bart Van Assche', 'Liran Liss',
	'Hefty, Sean', 'Doug Ledford',
	'Tom Talpey'

> >   - ULP calls a 'rdma_post_close_rkey' helper
> >      * For FRWR this posts the INVALIDATE
> 
> Note: Some send operations automatically invalidate an rkey (and the
> lkey for IB?).  This is intended to avoid having to post the
> invalidate WR explicitly.  Namely IB_WR_READ_WITH_INV and
> IB_WR_SEND_WITH_INV.

IB_WR_RDMA_READ_WITH_INV is only meaningful/implemented for iWarp. Can
you confirm what it does?

 wr.opcode = IB_WR_RDMA_READ_WITH_INV;
 wr.sg_list[0].lkey = LKEY;
 wr.wr.rdma.rkey = RKEY;

Does it locally invalidate LKEY or remotely invalidate RKEY? I'm
feeling pretty confident to guess it invalidates LKEY..

'rdma_post_close_rkey' wouldn't work with lkeys, hence the name :)

IB_WR_SEND_WITH_INV is sadly never used in the kernel. It does
invalidate an rkey, and would have to interact with a
'rdma_post_close_rkey'.

A generic flow could be something like:

  rc = rdma_post_close_rkey_wc(qp, rkey, wc, ...);
  if (rc > 0)
    // invalidate triggered by the peer, all done.
  else if (rc == 0)
    // Peer didn't invalidate, func queued it locally, ULP waits for the next completion, then all done.
  else
    // Error

Where the _wc variant checks the wc to confirm that the requested rkey
has been invalidated.

Jason

* Re: Kernel fast memory registration API proposal [RFC]
       [not found]         ` <55A61AE3.8020609-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  2015-07-15  9:07           ` Christoph Hellwig
@ 2015-07-15 19:15           ` Jason Gunthorpe
  1 sibling, 0 replies; 68+ messages in thread
From: Jason Gunthorpe @ 2015-07-15 19:15 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Steve Wise,
	Or Gerlitz, Oren Duer, Chuck Lever, Bart Van Assche, Liran Liss,
	Hefty, Sean, Doug Ledford, Tom Talpey

On Wed, Jul 15, 2015 at 11:33:39AM +0300, Sagi Grimberg wrote:

> >Call this rdma_mr to fit the scheme we use for "generic" APIs in the
> >RDMA stack?
> 
> Umm, I think this can become weird given all other primitives have
> ib_ prefix. I'd prefer to keep that prefix to stay consistent, and have
> an incremental change to do it for all the primitives (structs & verbs).

We've sort of had an informal convention that new preferred
cross-technology APIs are using rdma_ - but I don't really care.

Jason

* RE: Kernel fast memory registration API proposal [RFC]
       [not found]                                           ` <20150715190947.GE23588-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2015-07-15 19:26                                             ` Steve Wise
  0 siblings, 0 replies; 68+ messages in thread
From: Steve Wise @ 2015-07-15 19:26 UTC (permalink / raw)
  To: 'Jason Gunthorpe'
  Cc: 'Sagi Grimberg', 'Christoph Hellwig',
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, 'Or Gerlitz',
	'Oren Duer', 'Chuck Lever',
	'Bart Van Assche', 'Liran Liss',
	'Hefty, Sean', 'Doug Ledford',
	'Tom Talpey'



> -----Original Message-----
> From: Jason Gunthorpe [mailto:jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org]
> Sent: Wednesday, July 15, 2015 2:10 PM
> To: Steve Wise
> Cc: 'Sagi Grimberg'; 'Christoph Hellwig'; linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; 'Or Gerlitz'; 'Oren Duer'; 'Chuck Lever'; 'Bart Van Assche';
'Liran
> Liss'; 'Hefty, Sean'; 'Doug Ledford'; 'Tom Talpey'
> Subject: Re: Kernel fast memory registration API proposal [RFC]
> 
> > >   - ULP calls a 'rdma_post_close_rkey' helper
> > >      * For FRWR this posts the INVALIDATE
> >
> > Note: Some send operations automatically invalidate an rkey (and the
> > lkey for IB?).  This is intended to avoid having to post the
> > invalidate WR explicitly.  Namely IB_WR_READ_WITH_INV and
> > IB_WR_SEND_WITH_INV.
> 
> IB_WR_RDMA_READ_WITH_INV is only meaningful/implemented for iWarp. Can
> you confirm what it does?
> 
>  wr.opcode = IB_WR_RDMA_READ_WITH_INV;
>  wr.sg_list[0].lkey = LKEY;
>  wr.wr.rdma.rkey = RKEY;
> 
> Does it locally invalidate LKEY or remotely invalidate RKEY? I'm
> feeling pretty confident to guess it invalidates LKEY..

Yes, it invalidates the MR associated with LKEY.

> 
> 'rdma_post_close_rkey' wouldn't work with lkeys, hence the name :)
> 
> IB_WR_SEND_WITH_INV is sadly never used in the kernel. It does
> invalidate an rkey, and would have to interact with a
> 'rdma_post_close_rkey'.
> 
> A generic flow could be something like:
> 
>   rc = rdma_post_close_rkey_wc(qp, rkey, wc, ...);
>   if (rc > 0)
>     // invalidate triggered by the peer, all done.
>   else if (rc == 0)
>     // Peer didn't invalidate, func queued it locally, ULP waits for the next completion, then all done.
>   else
>     // Error
> 
> Where the _wc variant checks the wc to confirm that the requested rkey
> has been invalidated.
> 
> Jason


* Re: Kernel fast memory registration API proposal [RFC]
       [not found]                                         ` <20150715171926.GB23588-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2015-07-15 18:39                                           ` Steve Wise
@ 2015-07-15 21:25                                           ` Chuck Lever
       [not found]                                             ` <F2C64EE9-38A5-4DEE-B60E-AD8430FE1049-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
  1 sibling, 1 reply; 68+ messages in thread
From: Chuck Lever @ 2015-07-15 21:25 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Sagi Grimberg, Christoph Hellwig,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Steve Wise, Or Gerlitz,
	Oren Duer, Bart Van Assche, Liran Liss, Hefty, Sean,
	Doug Ledford, Tom Talpey


On Jul 15, 2015, at 1:19 PM, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote:

> On Wed, Jul 15, 2015 at 10:32:55AM -0400, Chuck Lever wrote:
> 
>> I would rather not build a non-deterministic delay into the
>> unmap interface. Using a pool or having map do an implicit
>> unmap are both solutions I’d rather avoid.
> 
> Can you explain how NFS is using FMR today? When does it unmap a FMR
> rkey and lkey?

Client side:

The content in RPC send buffers is sent with RDMA SEND using
local_dma_lkey, or with an lkey obtained from an MR allocated via
ib_dma_get_mr(). These are used to send RPC-over-RDMA headers
usually along with the content of small RPC requests.

These are left registered, essentially, during the lifetime of
the transport, and access to them is only local.

NFS READ and WRITE data payloads are mapped with ib_map_phys_mr()
just before the RPC is sent, and those payloads are unmapped
with ib_unmap_fmr() as soon as the client sees the server’s RPC
reply.

These memory regions require an rkey, which is sent in the RPC
call to the server. The server performs RDMA READ or WRITE on
these regions.

I don’t think the server ever uses FMR to register the target
memory regions for RDMA READ and WRITE.


> If NFS/etc currently have a hole on rkey invalidation when using FMR,
> and that hole simply cannot reasonably be solved, I'm actually mildly OK
> with enshrining that in a new MR API..
> 
> So, it would seem to me, the only major addition we'd need to Sagi's
> draft to support FMR, would be a way to catch the completion (the
> rdma_unreg_mr) and trigger async MR recycling async in a work queue.

As long as it is guaranteed that the unmap is scheduled as soon
as each RPC operation is complete, that might be tolerable, ie:

  rdma_unreg_mr_async()

Where the API hands the MR to a work queue to be unmapped, and
guarantees the MR cannot be reused until it knows it is unmapped.
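A minimal state-machine sketch of that contract (all names invented; a real rdma_unreg_mr_async() would queue_work() onto a workqueue whose work function calls ib_unmap_fmr() in sleepable context):

```c
/* Hypothetical sketch of rdma_unreg_mr_async(): hand the MR to a work
 * item and refuse reuse until the deferred unmap has run. Everything
 * here is a stand-in for illustration. */
#include <assert.h>
#include <stdbool.h>

enum mr_state { MR_MAPPED, MR_UNMAP_PENDING, MR_FREE };

struct demo_mr { enum mr_state state; };

/* queue the unmap; the MR may not be reused from this point on */
static void rdma_unreg_mr_async(struct demo_mr *mr)
{
    mr->state = MR_UNMAP_PENDING;   /* queue_work(unmap_wq, ...) here */
}

/* work function, runs later in sleepable context */
static void mr_unmap_work(struct demo_mr *mr)
{
    /* ib_unmap_fmr() would be called here */
    mr->state = MR_FREE;
}

static bool mr_can_reuse(const struct demo_mr *mr)
{
    return mr->state == MR_FREE;
}
```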

I’m sure there’s a hole in there I’m missing.


> Sagi, how does cleanup of the temporary FRMR work in your draft
> proposal? What does the ULP do upon completion?
> 
> [Also, just mildly curious, how do we get into an unsleepable
> context anyhow? is the IB completion pending callback called in a
> sleepable context?]
> 
> Jason

--
Chuck Lever




* Re: Kernel fast memory registration API proposal [RFC]
       [not found]                                             ` <F2C64EE9-38A5-4DEE-B60E-AD8430FE1049-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
@ 2015-07-15 22:49                                               ` Jason Gunthorpe
       [not found]                                                 ` <20150715224928.GA941-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 68+ messages in thread
From: Jason Gunthorpe @ 2015-07-15 22:49 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Sagi Grimberg, Christoph Hellwig,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Steve Wise, Or Gerlitz,
	Oren Duer, Bart Van Assche, Liran Liss, Hefty, Sean,
	Doug Ledford, Tom Talpey

On Wed, Jul 15, 2015 at 05:25:11PM -0400, Chuck Lever wrote:

> NFS READ and WRITE data payloads are mapped with ib_map_phys_mr()
> just before the RPC is sent, and those payloads are unmapped
> with ib_unmap_fmr() as soon as the client sees the server’s RPC
> reply.

Okay.. but.. ib_unmap_fmr is the thing that sleeps, so you must
already have a sleepable context when you call it?

I was poking around to see how NFS is working (to see how we might fit
a different API under here), I didn't find the call to ro_unmap I'd
expect? xprt_rdma_free is presumbly the place, but how it relates to
rpcrdma_reply_handler I could not obviously see. Does the upper layer
call back to xprt_rdma_free before any of the RDMA buffers are
touched?  Can you clear up the call chain for me?

Second, the FRWR stuff looks deeply suspicious: it is posting an
IB_WR_LOCAL_INV, but the completion of that (in frwr_sendcompletion)
triggers nothing. Handoff to the kernel must be done only after seeing
IB_WC_LOCAL_INV, never before.

Third all the unmaps do something like this:

frwr_op_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg)
{
	invalidate_wr.opcode = IB_WR_LOCAL_INV;
   [..]
	while (seg1->mr_nsegs--)
		rpcrdma_unmap_one(ia->ri_device, seg++);
	read_lock(&ia->ri_qplock);
	rc = ib_post_send(ia->ri_id->qp, &invalidate_wr, &bad_wr);

That is the wrong order: the DMA unmap of rpcrdma_unmap_one must only
be done once the invalidate is complete. For FMR this is ib_unmap_fmr
returning, for FRWR it is when you see IB_WC_LOCAL_INV.
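To illustrate the ordering being asked for, here is a toy sketch with invented stub names (the real fix would move the rpcrdma_unmap_one() loop into the IB_WC_LOCAL_INV completion handler):

```c
/* Sketch of the correct ordering: post the LOCAL_INV first, and only
 * DMA-unmap the segments when its completion (IB_WC_LOCAL_INV)
 * arrives. All types and helpers here are stand-ins. */
#include <assert.h>

enum { SEG_MAPPED, SEG_UNMAPPED };

struct demo_seg { int state; };

/* called only from the IB_WC_LOCAL_INV completion handler */
static void frwr_inv_done(struct demo_seg *segs, int nsegs)
{
    int i;
    for (i = 0; i < nsegs; i++)
        segs[i].state = SEG_UNMAPPED;   /* rpcrdma_unmap_one() here */
}

/* the unmap entry point only posts the invalidate; no DMA unmap yet */
static int frwr_op_unmap_fixed(struct demo_seg *segs, int nsegs)
{
    (void)segs; (void)nsegs;
    /* ib_post_send(qp, &invalidate_wr, &bad_wr) would go here; the
     * segments stay DMA-mapped until frwr_inv_done() runs. */
    return 0;
}
```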

Finally, where is the flow control for posting the IB_WR_LOCAL_INV to
the SQ? I'm guessing there is some kind of implicit flow control here
where the SEND buffer is recycled during RECV of the response, and
this limits the SQ usage, then there are guaranteed 3x as many SQEs as
SEND buffers to accommodate the REG_MR and INVALIDATE WRs??

> These memory regions require an rkey, which is sent in the RPC
> call to the server. The server performs RDMA READ or WRITE on
> these regions.
> 
> I don’t think the server ever uses FMR to register the target
> memory regions for RDMA READ and WRITE.

What happens if you hit the SGE limit when constructing the RDMA
READ/WRITE? Upper layer forbids that? What about iWARP, how do you
avoid the 1 SGE limit on RDMA READ?

Jason

* Re: Kernel fast memory registration API proposal [RFC]
       [not found]                                     ` <A9EF2F26-E737-4E80-B2E3-F8D6406F9893-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
  2015-07-15 14:39                                       ` Chuck Lever
  2015-07-15 17:19                                       ` Jason Gunthorpe
@ 2015-07-16  6:52                                       ` Sagi Grimberg
       [not found]                                         ` <55A754BC.6010706-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  2 siblings, 1 reply; 68+ messages in thread
From: Sagi Grimberg @ 2015-07-16  6:52 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Jason Gunthorpe, Christoph Hellwig,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Steve Wise, Or Gerlitz,
	Oren Duer, Bart Van Assche, Liran Liss, Hefty, Sean,
	Doug Ledford, Tom Talpey

On 7/15/2015 5:32 PM, Chuck Lever wrote:
>
> On Jul 15, 2015, at 4:01 AM, Sagi Grimberg <sagig-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org> wrote:
>
>> On 7/14/2015 8:09 PM, Jason Gunthorpe wrote:
>>> On Tue, Jul 14, 2015 at 07:55:39PM +0300, Sagi Grimberg wrote:
>>>
>>>> But, if people think that it's better to have an API that does implicit
>>>> posting always without notification, and then silently consume error or
>>>> flush completions. I can try and look at it as well.
>>>
>>> Can we do FMR transparently if we bundle the post? If yes, I'd call
>>> that a winner..
>>
>> Doing FMR transparently is not possible as the unmap flow is scheduling.
>> Unlike NFS, iSER unmaps from a soft-IRQ context, SRP unmaps from
>> hard-IRQ context.
>
> The context in which RPC/RDMA performs FMR unmap mustn’t sleep.
> RPC/RDMA is in roughly the same situation as the other initiators.
>
>
>> Changing the context to thread context is not
>> acceptable. The best we can do is using FMR_POOLs transparently.
>> Other than polluting the API and its semantics I suspect people will
>> have other problems with it (leaving the MRs open).
>
> Count me in that group.
>
> I would rather not build a non-deterministic delay into the
> unmap interface. Using a pool or having map do an implicit
> unmap are both solutions I’d rather avoid.
>
> In both situations, MRs can be left mapped indefinitely if,
> say, the workload pauses.
>
>
>> I suggest to start with what I proposed. And in a later stage (if we
>> still think its needed) we can have a higher level API that hides the
>> post, something like:
>
>> rdma_reg_sg(struct ib_qp *qp,
>>             struct ib_mr *mr,
>>             struct scatterlist *sg,
>>             int sg_nents,
>>             u64 offset,
>>             u64 length,
>>             int access_flags)
>
> I still wonder what “length” means in the context of a scatterlist.

Would byte_count be a more explanatory name?

* Re: Kernel fast memory registration API proposal [RFC]
       [not found]                                     ` <20150715183129.GC23588-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2015-07-15 18:50                                       ` Steve Wise
@ 2015-07-16  8:02                                       ` Christoph Hellwig
  1 sibling, 0 replies; 68+ messages in thread
From: Christoph Hellwig @ 2015-07-16  8:02 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Sagi Grimberg, Christoph Hellwig,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Steve Wise, Or Gerlitz,
	Oren Duer, Chuck Lever, Bart Van Assche, Liran Liss, Hefty, Sean,
	Doug Ledford, Tom Talpey

On Wed, Jul 15, 2015 at 12:31:29PM -0600, Jason Gunthorpe wrote:
> This sounds very workable? Christoph?

This is close to what I had initially envisioned, but with all the
discussions here I'd rather start out with something simpler, e.g.
Sagi's proposal with a few refinements.  Once we have all the ULPs
using those MR-type specific helpers we can decide to consolidate
it into your scheme, or maybe at that point just give up on FMRs.

* Re: Kernel fast memory registration API proposal [RFC]
       [not found]                                         ` <55A754BC.6010706-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2015-07-16  8:07                                           ` Christoph Hellwig
       [not found]                                             ` <20150716080702.GD9093-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
  0 siblings, 1 reply; 68+ messages in thread
From: Christoph Hellwig @ 2015-07-16  8:07 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Chuck Lever, Jason Gunthorpe, Christoph Hellwig,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Steve Wise, Or Gerlitz,
	Oren Duer, Bart Van Assche, Liran Liss, Hefty, Sean,
	Doug Ledford, Tom Talpey

On Thu, Jul 16, 2015 at 09:52:44AM +0300, Sagi Grimberg wrote:
> >>I suggest to start with what I proposed. And in a later stage (if we
> >>still think its needed) we can have a higher level API that hides the
> >>post, something like:
> >
> >>rdma_reg_sg(struct ib_qp *qp,
> >>            struct ib_mr *mr,
> >>            struct scatterlist *sg,
> >>            int sg_nents,
> >>            u64 offset,
> >>            u64 length,
> >>            int access_flags)
> >
> >I still wonder what “length” means in the context of a scatterlist.
> 
> Would byte_count be a more explanatory name?

What do you even need it for?  The total length is implicitly stored in
the S/G list as the sum of all the elements' lengths.

What's the offset for?  To allow skipping parts of a S/G list previously
mapped?

* Re: Kernel fast memory registration API proposal [RFC]
       [not found]                                             ` <20150716080702.GD9093-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
@ 2015-07-16  8:29                                               ` Sagi Grimberg
       [not found]                                                 ` <55A76B84.30504-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 68+ messages in thread
From: Sagi Grimberg @ 2015-07-16  8:29 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chuck Lever, Jason Gunthorpe, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Steve Wise, Or Gerlitz, Oren Duer, Bart Van Assche, Liran Liss,
	Hefty, Sean, Doug Ledford, Tom Talpey

On 7/16/2015 11:07 AM, Christoph Hellwig wrote:
> On Thu, Jul 16, 2015 at 09:52:44AM +0300, Sagi Grimberg wrote:
>>>> I suggest to start with what I proposed. And in a later stage (if we
>>>> still think its needed) we can have a higher level API that hides the
>>>> post, something like:
>>>
>>>> rdma_reg_sg(struct ib_qp *qp,
>>>>             struct ib_mr *mr,
>>>>             struct scatterlist *sg,
>>>>             int sg_nents,
>>>>             u64 offset,
>>>>             u64 length,
>>>>             int access_flags)
>>>
>>> I still wonder what “length” means in the context of a scatterlist.
>>
>> Would byte_count be a more explanatory name?
>
> What do you even need it for?  The total length is implicitly stored in
> the S/G list as the list of all elements.
>
> What's the offset for?  To allow skipping parts of a S/G list previously
> mapped?
>

These were added when I thought of the pages helper. The memory region
HW tables consist of physical addresses, a first byte offset, and a
byte length to indicate when the region ends in the last page.

However, if we have only sg lists, I agree that the length is derived
from the sg_list itself and the offset will be sg[0]->offset.

I can drop it, unless anyone can think of a use-case where a ULP would
want to register a region with an offset different from sg[0]->offset
and/or that ends before sum(sg->length).
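To make the redundancy concrete, here is a toy derivation of both values from the list alone (struct demo_sg stands in for struct scatterlist):

```c
/* Sketch of the point above: with only an sg list, the registration
 * offset and byte count fall out of the list itself, making explicit
 * offset/length parameters redundant for the common case. */
#include <assert.h>

struct demo_sg { unsigned offset, length; };

/* the region's first-byte offset is just the first entry's offset */
static unsigned sg_first_offset(const struct demo_sg *sg)
{
    return sg[0].offset;
}

/* the region's byte count is the sum of all entry lengths */
static unsigned long sg_total_length(const struct demo_sg *sg, int nents)
{
    unsigned long len = 0;
    int i;

    for (i = 0; i < nents; i++)
        len += sg[i].length;
    return len;
}
```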

* Re: Kernel fast memory registration API proposal [RFC]
       [not found]         ` <20150715170750.GA23588-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2015-07-16 12:21           ` Sagi Grimberg
       [not found]             ` <55A7A1B0.5000808-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 68+ messages in thread
From: Sagi Grimberg @ 2015-07-16 12:21 UTC (permalink / raw)
  To: Jason Gunthorpe, Christoph Hellwig
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Steve Wise, Or Gerlitz,
	Oren Duer, Chuck Lever, Bart Van Assche, Liran Liss, Hefty, Sean,
	Doug Ledford, Tom Talpey

On 7/15/2015 8:07 PM, Jason Gunthorpe wrote:
> On Wed, Jul 15, 2015 at 12:32:33AM -0700, Christoph Hellwig wrote:
>> int rdma_create_mr(struct ib_pd *pd, enum rdma_mr_type mr,
>> 	u32 max_pages, int flags);
>>
>>>    *     array from a SG list
>>>    * @mr:          memory region
>>>    * @sg:          sg list
>>>    * @sg_nents:    number of elements in the sg
>>>    *
>>>    * Can fail if the HW is not able to register this
>>>    * sg list. In case of failure - caller is responsible
>>>    * to handle it (bounce-buffer, multiple registrations...)
>>>    */
>>> int ib_mr_set_sg(struct ib_mr *mr,
>>>                   struct scatterlist *sg,
>>>                   unsigned short sg_nents);
>>
>> Call this rdma_map_sg?
>>
>>>           /* register the MR */
>>>           frwr.opcode = IB_WR_FAST_REG_MR;
>>>           frwr.wrid = my_wrid;
>>>           frwr.wr.fast_reg.mr = mr;
>>>           frwr.wr.fast_reg.iova = ib_sg_dma_address(&sg[0]);
>>>           frwr.wr.fast_reg.length = length;
>>>           frwr.wr.fast_reg.access_flags = my_flags;
>>
>> Provide a helper to hide all this behind the scenes please:
>>
>> void rdma_init_mr_wr(struct ib_send_wr *wr, struct rdma_mr *mr,
>> 		u64 wr_id, int mr_access_flags);
>>
>> Or if we got with Jason's suggestion split "int mr_access_flags" into
>> "bool remote, bool is_write".
>
> Yes please. Considering the security implications we need to be much
> more careful API wise here. This is more of a code-as-documentation
> issue than a functional issue.

I gotta say,

these suggestions of bool/write or supported_ops with a convert helper
seem (to me at least) to make things more complicated.

Why not just set the access_flags as they are?
I want local use?
set IB_ACCESS_LOCAL_WRITE
I want a peer to read from me?
set IB_ACCESS_REMOTE_READ
I want a peer to write to me?
IB_ACCESS_REMOTE_WRITE
...

isn't it much simpler?

If we want to mask out the iWARP differences, we can use Steve's
roles_to_access() helper thing...
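For what it's worth, a sketch of what such a helper could look like
(flag values and names here are made up; the real constants are the
IB_ACCESS_* bits in the kernel's ib_verbs.h):

```c
/* Made-up flag values -- illustration only. */
enum {
	ACCESS_LOCAL_WRITE	= 1 << 0,
	ACCESS_REMOTE_WRITE	= 1 << 1,
	ACCESS_REMOTE_READ	= 1 << 2,
};

enum mr_role {
	ROLE_LOCAL_READ_SINK,	/* we post RDMA READ into this MR */
	ROLE_PEER_READS,	/* peer RDMA READs from this MR */
	ROLE_PEER_WRITES,	/* peer RDMA WRITEs into this MR */
};

/*
 * The ULP states the operation it wants; the helper picks the access
 * bits and hides the iWARP quirk that the local sink of an RDMA READ
 * must also allow remote write (READ responses arrive as tagged
 * writes there).
 */
static int roles_to_access(enum mr_role role, int is_iwarp)
{
	switch (role) {
	case ROLE_PEER_READS:
		return ACCESS_REMOTE_READ;
	case ROLE_PEER_WRITES:
		return ACCESS_LOCAL_WRITE | ACCESS_REMOTE_WRITE;
	case ROLE_LOCAL_READ_SINK:
		return is_iwarp ?
			ACCESS_LOCAL_WRITE | ACCESS_REMOTE_WRITE :
			ACCESS_LOCAL_WRITE;
	}
	return 0;
}
```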

* RE: Kernel fast memory registration API proposal [RFC]
       [not found]                                                 ` <55A76B84.30504-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2015-07-16 14:25                                                   ` Steve Wise
  2015-07-16 14:40                                                     ` Sagi Grimberg
  0 siblings, 1 reply; 68+ messages in thread
From: Steve Wise @ 2015-07-16 14:25 UTC (permalink / raw)
  To: 'Sagi Grimberg', 'Christoph Hellwig'
  Cc: 'Chuck Lever', 'Jason Gunthorpe',
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, 'Or Gerlitz',
	'Oren Duer', 'Bart Van Assche',
	'Liran Liss', 'Hefty, Sean',
	'Doug Ledford', 'Tom Talpey'



> -----Original Message-----
> From: Sagi Grimberg [mailto:sagig-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org]
> Sent: Thursday, July 16, 2015 3:30 AM
> To: Christoph Hellwig
> Cc: Chuck Lever; Jason Gunthorpe; linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; Steve Wise; Or Gerlitz; Oren Duer; Bart Van Assche; Liran Liss;
Hefty, Sean;
> Doug Ledford; Tom Talpey
> Subject: Re: Kernel fast memory registration API proposal [RFC]
> 
> [...]
> 
> These were added when I thought of the pages helper. The memory region
> HW tables consist of physical addresses, a first byte offset, and a
> byte length to indicate when the region ends in the last page.
> 
> However, if we have only sg lists, I agree that the length is derived
> from the sg_list itself and the offset will be sg[0]->offset.
> 
> I can drop it, unless anyone can think of a use-case where a ULP would
> want to register a region with a different offset from sg[0]->offset
> and/or ends before the sum(sg->length).

What if the sg list has to be chunked up due to the device's FRWR pbl depth limits?  Or is that handled underneath rdma_reg_sg()?


* Re: Kernel fast memory registration API proposal [RFC]
  2015-07-16 14:25                                                   ` Steve Wise
@ 2015-07-16 14:40                                                     ` Sagi Grimberg
  0 siblings, 0 replies; 68+ messages in thread
From: Sagi Grimberg @ 2015-07-16 14:40 UTC (permalink / raw)
  To: Steve Wise, 'Christoph Hellwig'
  Cc: 'Chuck Lever', 'Jason Gunthorpe',
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, 'Or Gerlitz',
	'Oren Duer', 'Bart Van Assche',
	'Liran Liss', 'Hefty, Sean',
	'Doug Ledford', 'Tom Talpey'


>> I can drop it, unless anyone can think of a use-case where a ULP would
>> want to register a region with a different offset from sg[0]->offset
>> and/or ends before the sum(sg->length).
>
> What if the sg list has to be chunked up due to the device's FRWR pbl depth limits?  Or is that handled underneath rdma_reg_sg()?
>

Should be handled underneath...
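i.e. something along these lines inside the helper (a sketch with
made-up names; only the arithmetic matters):

```c
/*
 * If the device caps a fast-reg page list at max_pbl entries, a
 * registration covering npages pages must be split into this many
 * MRs -- the split the helper would do under the covers, transparent
 * to the ULP.
 */
static int regs_needed(int npages, int max_pbl)
{
	return (npages + max_pbl - 1) / max_pbl;
}
```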

* Re: Kernel fast memory registration API proposal [RFC]
       [not found]                                                 ` <20150715224928.GA941-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2015-07-16 14:45                                                   ` Chuck Lever
       [not found]                                                     ` <F0518DEF-D43C-4CB6-89ED-CA3E94A4DD72-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 68+ messages in thread
From: Chuck Lever @ 2015-07-16 14:45 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Sagi Grimberg, Christoph Hellwig,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Steve Wise, Or Gerlitz,
	Oren Duer, Bart Van Assche, Liran Liss, Hefty, Sean,
	Doug Ledford, Tom Talpey


On Jul 15, 2015, at 6:49 PM, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote:

> On Wed, Jul 15, 2015 at 05:25:11PM -0400, Chuck Lever wrote:
> 
>> NFS READ and WRITE data payloads are mapped with ib_map_phys_mr()
>> just before the RPC is sent, and those payloads are unmapped
>> with ib_unmap_fmr() as soon as the client sees the server’s RPC
>> reply.
> 
> Okay.. but.. ib_unmap_fmr is the thing that sleeps, so you must
> already have a sleepable context when you call it?

The RPC scheduler operates on the assumption that the processing
during each step does not sleep.

We’re not holding a lock, so a short sleep here works. In general
this kind of thing can deadlock pretty easily, but right at this
step I think it’s avoiding deadlock "by accident."

For some time, I’ve been considering deferring ib_unmap_fmr() to
a work queue, but FMR is operational and is a bit of an antique
so I haven’t put much effort into bettering it.

The point is, this is not something that should be perpetuated
into a new API, and certainly the other initiators have a hard
intolerance for a sleep.


> I was poking around to see how NFS is working (to see how we might fit
> a different API under here), I didn't find the call to ro_unmap I'd
> expect? xprt_rdma_free is presumably the place, but how it relates to
> rpcrdma_reply_handler I could not obviously see. Does the upper layer
> call back to xprt_rdma_free before any of the RDMA buffers are
> touched?  Can you clear up the call chain for me?

The server performs RDMA READ and WRITE operations, then SENDs the
RPC reply.

On the client, rpcrdma_recvcq_upcall() is invoked when the RPC
reply arrives and the RECV completes.

rpcrdma_schedule_tasklet() queues the incoming RPC reply on a
global list and spanks our reply tasklet.

The tasklet invokes rpcrdma_reply_handler() for each reply on the
list.

The reply handler parses the incoming reply, looks up the XID and
matches it to a waiting RPC request (xprt_lookup_rqst). It then
wakes that request (xprt_complete_rqst). The tasklet goes to the
next reply on the global list.

The RPC scheduler sees the awoken RPC request and steps the
finished request through to completion, at which point
xprt_release() is invoked to retire the request slot.

Here resources allocated to the RPC request are freed. For
RPC/RDMA transports, xprt->ops->buf_free is xprt_rdma_free().
xprt_rdma_free() invokes the ro_unmap method to unmap/invalidate
the MRs involved with the RPC request.


> Second, the FRWR stuff looks deeply suspicious, it is posting a
> IB_WR_LOCAL_INV, but the completion of that (in frwr_sendcompletion)
> triggers nothing. Handoff to the kernel must be done only after seeing
> IB_WC_LOCAL_INV, never before.

I don’t understand. Our LOCAL_INV is typically unsignalled because
there’s nothing to do in the normal case. frwr_sendcompletion is
there to handle only flushed sends.

Send queue ordering and the mw_list prevent each MR from being
reused before it is truly invalidated.


> Third all the unmaps do something like this:
> 
> frwr_op_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg)
> {
> 	invalidate_wr.opcode = IB_WR_LOCAL_INV;
>   [..]
> 	while (seg1->mr_nsegs--)
> 		rpcrdma_unmap_one(ia->ri_device, seg++);
> 	read_lock(&ia->ri_qplock);
> 	rc = ib_post_send(ia->ri_id->qp, &invalidate_wr, &bad_wr);
> 
> That is the wrong order, the DMA unmap of rpcrdma_unmap_one must only
> be done once the invalidate is complete. For FR this is ib_unmap_fmr
> returning, for FRWR it is when you see IB_WC_LOCAL_INV.

I’m assuming you mean the DMA unmap has to be done after LINV
completes.

I’m not sure it matters here, because when the RPC reply shows
up at the client, it already means the server isn’t going to
access that MR/rkey again. (If the server does access that MR
again, it would be a protocol violation).

Can you provide an example in another kernel ULP?


> Finally, where is the flow control for posting the IB_WR_LOCAL_INV to
> the SQ? I'm guessing there is some kind of implicit flow control here
> where the SEND buffer is recycled during RECV of the response, and
> this limits the SQ usage, then there are guaranteed 3x as many SQEs as
> SEND buffers to accommodate the REG_MR and INVALIDATE WRs??

RPC/RDMA provides flow control via credits. The peers agree on
a maximum number of concurrent outstanding RPC requests.
Typically that is 32, though implementations are increasing that
default to 128.

There’s a comment in frwr_op_open that explains how we calculate
the maximum number of send queue entries for each credit.


>> These memory regions require an rkey, which is sent in the RPC
>> call to the server. The server performs RDMA READ or WRITE on
>> these regions.
>> 
>> I don’t think the server ever uses FMR to register the target
>> memory regions for RDMA READ and WRITE.
> 
> What happens if you hit the SGE limit when constructing the RDMA
> READ/WRITE? Upper layer forbids that? What about iWARP, how do you
> avoid the 1 SGE limit on RDMA READ?

I’m much less familiar with the server side. Maybe Steve knows,
but I suspect the RPC/RDMA code is careful to construct more
READ Work Requests if it runs out of sges.


--
Chuck Lever




* RE: Kernel fast memory registration API proposal [RFC]
       [not found]                                                     ` <F0518DEF-D43C-4CB6-89ED-CA3E94A4DD72-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
@ 2015-07-16 14:56                                                       ` Steve Wise
  2015-07-16 17:40                                                       ` Jason Gunthorpe
  1 sibling, 0 replies; 68+ messages in thread
From: Steve Wise @ 2015-07-16 14:56 UTC (permalink / raw)
  To: 'Chuck Lever', 'Jason Gunthorpe'
  Cc: 'Sagi Grimberg', 'Christoph Hellwig',
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, 'Or Gerlitz',
	'Oren Duer', 'Bart Van Assche',
	'Liran Liss', 'Hefty, Sean',
	'Doug Ledford', 'Tom Talpey'



> -----Original Message-----
> From: linux-rdma-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org [mailto:linux-rdma-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org] On Behalf Of Chuck Lever
> Sent: Thursday, July 16, 2015 9:46 AM
> To: Jason Gunthorpe
> Cc: Sagi Grimberg; Christoph Hellwig; linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; Steve Wise; Or Gerlitz; Oren Duer; Bart Van Assche; Liran Liss;
Hefty,
> Sean; Doug Ledford; Tom Talpey
> Subject: Re: Kernel fast memory registration API proposal [RFC]
> 
> 
> [...]
>
> >> These memory regions require an rkey, which is sent in the RPC
> >> call to the server. The server performs RDMA READ or WRITE on
> >> these regions.
> >>
> >> I don't think the server ever uses FMR to register the target
> >> memory regions for RDMA READ and WRITE.
> >
> > What happens if you hit the SGE limit when constructing the RDMA
> > READ/WRITE? Upper layer forbids that? What about iWARP, how do you
> > avoid the 1 SGE limit on RDMA READ?
> 
> I'm much less familiar with the server side. Maybe Steve knows,
> but I suspect the RPC/RDMA code is careful to construct more
> READ Work Requests if it runs out of sges.


The server chunks it up based on the device limits and issues a series of RDMA READs as required.  See rdma_read_chunk_frmr() and
rdma_read_chunks(), which calls rdma_read_chunk_frmr() via xprt->sc_reader.


* Re: Kernel fast memory registration API proposal [RFC]
       [not found]                                                     ` <F0518DEF-D43C-4CB6-89ED-CA3E94A4DD72-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
  2015-07-16 14:56                                                       ` Steve Wise
@ 2015-07-16 17:40                                                       ` Jason Gunthorpe
       [not found]                                                         ` <20150716174046.GB3680-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  1 sibling, 1 reply; 68+ messages in thread
From: Jason Gunthorpe @ 2015-07-16 17:40 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Sagi Grimberg, Christoph Hellwig,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Steve Wise, Or Gerlitz,
	Oren Duer, Bart Van Assche, Liran Liss, Hefty, Sean,
	Doug Ledford, Tom Talpey

On Thu, Jul 16, 2015 at 10:45:46AM -0400, Chuck Lever wrote:

> For some time, I’ve been considering deferring ib_unmap_fmr() to
> a work queue, but FMR is operational and is a bit of an antique
> so I haven’t put much effort into bettering it.

Okay, I think I get it..

The fmr unmap could be made async, but there would need to be some
additional logic to defer reuse of the MR until it completes..
 
> The reply handler parses the incoming reply, looks up the XID and
> matches it to a waiting RPC request (xprt_lookup_rqst). It then
> wakes that request (xprt_complete_rqst). The tasklet goes to the
> next reply on the global list.
> 
> The RPC scheduler sees the awoken RPC request and steps the
> finished request through to completion, at which point
> xprt_release() is invoked to retire the request slot.

At what point does the RPC layer touch the RDMA'd data? Based on your
description it must be after xprt_release is called?

> > Second, the FRWR stuff looks deeply suspicious, it is posting a
> > IB_WR_LOCAL_INV, but the completion of that (in frwr_sendcompletion)
> > triggers nothing. Handoff to the kernel must be done only after seeing
> > IB_WC_LOCAL_INV, never before.
> 
> I don’t understand. Our LOCAL_INV is typically unsignalled because
> there’s nothing to do in the normal case. frwr_sendcompletion is
> there to handle only flushed sends.

The purpose of invalidate in the spec is to fence the RDMA. The
correct, secure, way to use RDMA is to invalidate a rkey, then DMA
flush its backing memory, then access that data with the CPU.

The scheme NFS is using for FRWR is trading security for performance
by running the invalidate async - so a black-hat peer can maliciously
abuse that.

Recovering the lost performance is what SEND WITH INVALIDATE is used
for.

Usually such a trade off should be a user option..

> I’m not sure it matters here, because when the RPC reply shows
> up at the client, it already means the server isn’t going to
> access that MR/rkey again. (If the server does access that MR
> again, it would be a protocol violation).

A simple contained protocol violation would be fine, but this elevates
into threatening the whole machine if it happens:

 - Random memory corruption: Freeing the RDMA buffers before
   invalidate completes recycles them to other users while the remote
   can still write to them
 - CPU mis-operation: A validating parse can become broken if the data
   under it is changed by the remote during operation
 - Linux DMA API contract failure:
    * If bounce buffering, recycling the buffer will allow corruption
      of someone else's memory
    * Cache coherence is lost, resulting in unpredictable data
      corruption, happening unpredictably in time (ie perhaps after
      the buffer is recycled to some other use)
    * If using the IOMMU a spurious DMA may machine check and hard
      stop the machine.

[I wouldn't say this is serious enough to address immediately, but it
 means NFS is not an example of best practices..]
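The correct ordering described above, as a stub skeleton (nothing here
is the real API; the steps merely stand in for the LOCAL_INV post, its
completion, and the DMA unmap, recorded so the sequence can be
checked):

```c
/*
 * Skeleton of the secure teardown ordering: invalidate the rkey,
 * wait for the invalidate completion, DMA unmap/sync, and only then
 * let the CPU touch the data.
 */
enum step {
	STEP_POST_LOCAL_INV,	/* post IB_WR_LOCAL_INV */
	STEP_REAP_INV_CQE,	/* wait for IB_WC_LOCAL_INV */
	STEP_DMA_UNMAP,		/* dma_unmap_sg()/dma_sync */
	STEP_CPU_ACCESS,	/* hand the buffer to the ULP */
};

#define MAX_STEPS 8
static enum step step_log[MAX_STEPS];
static int nsteps;

static void do_step(enum step s)
{
	if (nsteps < MAX_STEPS)
		step_log[nsteps++] = s;
}

static void secure_mr_teardown(void)
{
	do_step(STEP_POST_LOCAL_INV);
	do_step(STEP_REAP_INV_CQE);	/* must precede the unmap */
	do_step(STEP_DMA_UNMAP);	/* must precede CPU access */
	do_step(STEP_CPU_ACCESS);
}
```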

> Can you provide an example in another kernel ULP?

I'm still looking at them, sadly there is a lot of 'interesting' stuff
in the ULPs :(

> > Finally, where is the flow control for posting the IB_WR_LOCAL_INV to
> > the SQ? I'm guessing there is some kind of implicit flow control here
> > where the SEND buffer is recycled during RECV of the response, and
> > this limits the SQ usage, then there are guaranteed 3x as many SQEs as
> > SEND buffers to accommodate the REG_MR and INVALIDATE WRs??
> 
> RPC/RDMA provides flow control via credits. The peers agree on
> a maximum number of concurrent outstanding RPC requests.
> Typically that is 32, though implementations are increasing that
> default to 128.
> 
> There’s a comment in frwr_op_open that explains how we calculate
> the maximum number of send queue entries for each credit.

That is what it looked like..

So, this scheme is dangerous too. The RDMA spec doesn't guarantee that
the receive side and send side run in lock step - that is to say, even
though you got a RECV that indicates a SEND was executed, that doesn't
mean that SQ or SCQ space has been made available to post a new SEND.

This is primarily because SQ/SCQ space is not freed until the ACK is
processed, and the behavior of ACKs is not simple: particularly if a
compound ACK is delivered concurrently with the reply's SEND.

So, lets say you completely fill the SQ with SENDs, and all the sends
get dumped on the wire. The SQ is still full. The far side returns a
SEND+ACK combo packet for the first send (either together or close in
time), the local HCA+system now runs two concurrent tracks: the first
is updating the SQ/SCQ space to reflect the ACK and the second is
delivering the SEND as a receive. This is a classic race, if the CPU
sees the SEND's recv and tries to post before the CPU sees the ACK
side then the QP blows up as full.

I've actually hit this - under very heavy loading (ie a continuously
full SQ), a similar scheme that relied on the RCQ to track SQ/SCQ
usage would very rarely blow up with a full SQ/SCQ because of the
above lack of synchronicity.

The only absolutely correct way to run the RDMA stack is to keep track
of SQ/SCQ space directly, and only update that tracking by processing
SCQEs.

For us, the sucky part, is any new API we build will have to be based
on 'the correct way to use RDMA' (eg, invalidate used as security,
direct SQ/SCQ tracking, etc), retrofitting may be difficult on
ULPs that are taking shortcuts. :( :(
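A toy model of that direct tracking (nothing HCA-specific here; the
point is that free SQ space only grows when SCQEs are reaped, never
when a reply is received):

```c
/*
 * Direct SQ accounting: slots are consumed at post time and released
 * only when the corresponding send completions (SCQEs) are reaped --
 * never inferred from the arrival of the peer's reply.
 */
struct sq_tracker {
	int depth;	/* SQEs the SQ was created with */
	int used;	/* posted, not yet completed */
};

/* returns 0 on success, -1 if the post would overflow the SQ */
static int sq_post(struct sq_tracker *sq, int nwr)
{
	if (sq->used + nwr > sq->depth)
		return -1;
	sq->used += nwr;
	return 0;
}

/* called only from SCQ processing */
static void sq_reap(struct sq_tracker *sq, int nscqe)
{
	sq->used -= nscqe;
}
```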

Jason

* Re: Kernel fast memory registration API proposal [RFC]
       [not found]             ` <55A7A1B0.5000808-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2015-07-16 18:08               ` Jason Gunthorpe
       [not found]                 ` <20150716180806.GC3680-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 68+ messages in thread
From: Jason Gunthorpe @ 2015-07-16 18:08 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Steve Wise,
	Or Gerlitz, Oren Duer, Chuck Lever, Bart Van Assche, Liran Liss,
	Hefty, Sean, Doug Ledford, Tom Talpey

On Thu, Jul 16, 2015 at 03:21:04PM +0300, Sagi Grimberg wrote:
> I gotta say,
> 
> these suggestions of bool/write or supported_ops with a convert helper
> seem (to me at least) to make things more complicated.
> 
> Why not just set the access_flags as they are?
> I want local use?
> set IB_ACCESS_LOCAL_WRITE
> I want a peer to read from me?
> set IB_ACCESS_REMOTE_READ
> I want a peer to write to me?
> IB_ACCESS_REMOTE_WRITE
> ...
> 
> isn't it much simpler?

I don't have a really strong preference, but I like the idea of using
the OP names because it means the API cannot be used wrong.

Ie this, is absolutely wrong:

 rdma_map_sg_lkey(..,IB_ACCESS_REMOTE_READ);

But this:

 rdma_map_sg_lkey(..,IB_ACCESS_REMOTE_WRITE);

Is only wrong on IB.

While,
 rdma_map_sg_lkey(..,RDMA_OP_RDMA_READ)

Is always right, and can never be used wrong, and thus is less
complicated (for the ULP).

> If we want to mask out iWARP difference, we use the Steve's
> roles_to_access() helper thing...

If that scheme exists, it's usage should be mandatory, IMHO.

However, iwarp needs more than just flag translation, it also can't
use the local_dma_lkey ..

If we can come to a scheme to support iWarp (wrappers!) then the
rdma_map_sg_lkey is largely an internal function and it can stick with
IB_ACCESS flags...

The main use of rdma_map_sg_lkey is to support iWarp's RDMA READ
requirement.

rdma_map_sg_rkey is the commonly used API, and I can't think of a reason
why it needs the roles business beyond symmetry..

Jason

* Re: Kernel fast memory registration API proposal [RFC]
       [not found]                                                         ` <20150716174046.GB3680-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2015-07-16 20:07                                                           ` Chuck Lever
       [not found]                                                             ` <F8484ABB-BED9-463F-8AEA-EB898EBDD93C-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 68+ messages in thread
From: Chuck Lever @ 2015-07-16 20:07 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Sagi Grimberg, Christoph Hellwig,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Steve Wise, Or Gerlitz,
	Oren Duer, Bart Van Assche, Liran Liss, Hefty, Sean,
	Doug Ledford, Tom Talpey


On Jul 16, 2015, at 1:40 PM, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote:

> On Thu, Jul 16, 2015 at 10:45:46AM -0400, Chuck Lever wrote:

>> The reply handler parses the incoming reply, looks up the XID and
>> matches it to a waiting RPC request (xprt_lookup_rqst). It then
>> wakes that request (xprt_complete_rqst). The tasklet goes to the
>> next reply on the global list.
>> 
>> The RPC scheduler sees the awoken RPC request and steps the
>> finished request through to completion, at which point
>> xprt_release() is invoked to retire the request slot.
> 
> At what point does the RPC layer touch the RDMA'd data? Based on your
> description it must be after xprt_release is called?

There are three different data transfer mechanisms with RPC/RDMA.

1.  RDMA SEND: used for the RPC/RDMA header and small RPC messages

2.  RDMA READ: the server reads a data payload (NFS WRITE), or a
    large RPC call

3.  RDMA WRITE: the server writes a data payload (NFS READ), or
    a large RPC reply

In case 2, the RDMA READs are done as part of an RPC call, just
after the server receives the RPC call message. So, the READ is
complete before the server can even start processing the RPC call.

The MRs are registered only for remote read. I don’t think
catastrophic harm can occur on the client in this case if the
invalidation and DMA sync comes late. In fact, I’m unsure why
a DMA sync is even necessary as the MR is invalidated in this
case.

In case 3, the RDMA WRITEs are done as part of an RPC reply,
just before the server sends the RPC reply message. The server’s
send queue sees to it that the WRITE completes before the server
SENDs the reply.

The RPC client does not touch the data payload with NFS READ. But
there is no co-ordination between the LOCAL_INVALIDATE and when
the upper layer is finally awake and accessing that data.

Ultimately, yes, we’ll have to do the invalidation earlier so
that the ULP cannot touch that data payload until the MR has
been knocked down and synced.

In the case of incoming data payloads (NFS READ) the DMA sync
ordering is probably an important issue. The sync has to happen
before the ULP can touch the data, 100% of the time.

That could be addressed by performing a DMA sync on the write
list or reply chunk MRs right in the RPC reply handler (before
xprt_complete_rqst).


>>> Second, the FRWR stuff looks deeply suspicious, it is posting a
>>> IB_WR_LOCAL_INV, but the completion of that (in frwr_sendcompletion)
>>> triggers nothing. Handoff to the kernel must be done only after seeing
>>> IB_WC_LOCAL_INV, never before.
>> 
>> I don’t understand. Our LOCAL_INV is typically unsignalled because
>> there’s nothing to do in the normal case. frwr_sendcompletion is
>> there to handle only flushed sends.
> 
> The purpose of invalidate in the spec is to fence the RDMA. The
> correct, secure, way to use RDMA is to invalidate a rkey, then DMA
> flush its backing memory, then access that data with the CPU.
> 
> The scheme NFS is using for FRWR is trading security for performance
> by running the invalidate async - so a black-hat peer can maliciously
> abuse that.
> 
> Recovering the lost performance is what SEND WITH INVALIDATE is used
> for.
> 
> Usually such a trade off should be a user option..
> 
>> I’m not sure it matters here, because when the RPC reply shows
>> up at the client, it already means the server isn’t going to
>> access that MR/rkey again. (If the server does access that MR
>> again, it would be a protocol violation).
> 
> A simple contained protocol violation would be fine, but this elevates
> into threatening the whole machine if it happens:
> 
> - Random memory corruption: Freeing the RDMA buffers before
>   invalidate completes recycles them to other users while the remote
>   can still write to them
> - CPU mis-operation: A validating parse can become broken if the data
>   under it is changed by the remote during operation
> - Linux DMA API contract failure:
>    * If bounce buffering, recycling the buffer will allow corruption
>      of someone else's memory
>    * Cache coherence is lost, resulting in unpredictable data
>      corruption, happening unpredictably in time (ie perhaps after
>      the buffer is recycled to some other use)
>    * If using the IOMMU a spurious DMA may machine check and hard
>      stop the machine.
> 
> [I wouldn't say this is serious enough to address immediately, but it
> means NFS is not an example of best practices..]
> 
>> Can you provide an example in another kernel ULP?
> 
> I'm still looking at them, sadly there is a lot of 'interesting' stuff
> in the ULPs :(
> 
>>> Finally, where is the flow control for posting the IB_WR_LOCAL_INV to
>>> the SQ? I'm guessing there is some kind of implicit flow control here
>>> where the SEND buffer is recycled during RECV of the response, and
> this limits the SQ usage, then there are guaranteed 3x as many SQEs as
>>> SEND buffers to accommodate the REG_MR and INVALIDATE WRs??
>> 
>> RPC/RDMA provides flow control via credits. The peers agree on
>> a maximum number of concurrent outstanding RPC requests.
>> Typically that is 32, though implementations are increasing that
>> default to 128.
>> 
>> There’s a comment in frwr_op_open that explains how we calculate
>> the maximum number of send queue entries for each credit.
> 
> That is what it looked like..
> 
> So, this scheme is dangerous too. The RDMA spec doesn't guarantee that
> the receive side and send side run in lock step - that is to say, even
> though you got a RECV that indicates a SEND was executed, that doesn't
> mean that SQ or SCQ space has been made available to post a new SEND.
> 
> This is primarily because SQ/SCQ space is not freed until the ACK is
> processed, and the behavior of ACKs is not simple: particularly if a
> compound ACK is delivered concurrently with the reply's SEND.
> 
> So, let's say you completely fill the SQ with SENDs, and all the sends
> get dumped on the wire. The SQ is still full. The far side returns a
> SEND+ACK combo packet for the first send (either together or close in
> time), the local HCA+system now runs two concurrent tracks: the first
> is updating the SQ/SCQ space to reflect the ACK and the second is
> delivering the SEND as a receive. This is a classic race: if the CPU
> sees the SEND's recv and tries to post before it sees the ACK
> side, the QP blows up as full.
> 
> I've actually hit this - under very heavy loading (ie a continuously
> full SQ), a similar scheme that relied on the RCQ to track SQ/SCQ
> usage would very rarely blow up with a full SQ/SCQ because of the
> above lack of synchronicity.

> The only absolutely correct way to run the RDMA stack is to keep track
> of SQ/SCQ space directly, and only update that tracking by processing
> SCQEs.

In other words, the only time it is truly safe to do a post_send is
after you’ve received a send completion that indicates you have
space on the send queue.

The problem then is how do you make the RDMA consumer wait until
there is send queue space. I suppose the xprt_complete_rqst()
could be postponed in this case, or simulated xprt congestion
could be used to prevent starting new RPCs while the send queue
is full.


> For us, the sucky part, is any new API we build will have to be based
> on 'the correct way to use RDMA' (eg, invalidate used as security,
> direct SQ/SCQ tracking, etc), retrofitting may be difficult on
> ULPs that are taking shortcuts. :( :(


--
Chuck Lever



--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: Kernel fast memory registration API proposal [RFC]
       [not found]                                                             ` <F8484ABB-BED9-463F-8AEA-EB898EBDD93C-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
@ 2015-07-16 20:49                                                               ` Jason Gunthorpe
       [not found]                                                                 ` <20150716204932.GA10638-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 68+ messages in thread
From: Jason Gunthorpe @ 2015-07-16 20:49 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Sagi Grimberg, Christoph Hellwig,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Steve Wise, Or Gerlitz,
	Oren Duer, Bart Van Assche, Liran Liss, Hefty, Sean,
	Doug Ledford, Tom Talpey

On Thu, Jul 16, 2015 at 04:07:04PM -0400, Chuck Lever wrote:

> The MRs are registered only for remote read. I don’t think
> catastrophic harm can occur on the client in this case if the
> invalidation and DMA sync comes late. In fact, I’m unsure why
> a DMA sync is even necessary as the MR is invalidated in this
> case.

For RDMA, the worst case would be some kind of information leakage or
machine check halt.

For read side the DMA API should be called before posting the FRWR, no
completion side issues.

> In the case of incoming data payloads (NFS READ) the DMA sync
> ordering is probably an important issue. The sync has to happen
> before the ULP can touch the data, 100% of the time.

Absolutely, the sync is critical.

> That could be addressed by performing a DMA sync on the write
> list or reply chunk MRs right in the RPC reply handler (before
> xprt_complete_rqst).

That sounds good to me, much more in line with what I'd expect to
see. The fmr unmap and invalidate post should also be in the reply
handler (for flow control reasons, see below)

> > The only absolutely correct way to run the RDMA stack is to keep track
> > of SQ/SCQ space directly, and only update that tracking by processing
> > SCQEs.
> 
> In other words, the only time it is truly safe to do a post_send is
> after you’ve received a send completion that indicates you have
> space on the send queue.

Yes.

Use a scheme where you suppress signaling and use the SQE accounting to
request a completion entry and signal around every 1/2 length of the
SQ.

Use the WRID in some way to encode the # SQEs each completion
represents.

I've used a scheme where the wrid is a wrapping index into
an array of SQ length long, that holds any meta information..

That makes it trivial to track SQE accounting and avoids memory
allocations for wrids.

Generically:

  posted_sqes -= (wc->wr_id - last_wrid);
  for (i = last_wrid; i != wc->wr_id; ++i)
    complete(wr_data[i % sq_len].ptr);
  last_wrid = wc->wr_id;

Many other options, too.

-----

There is a bit more going on too, *technically* the HCA owns the
buffer until a SCQE is produced. The recv proves the peer will drop
any re-transmits of the message, but it doesn't prove that the local
HCA won't create a re-transmit. Lost acks or other protocol weirdness
could *potentially* cause buffer re-read in the general RDMA
framework.

So if you use recv to drive re-use of the SEND buffer memory, it is
important that the SEND buffer remain full of data to send to that
peer and not be kfree'd, dma unmapped, or reused for another peer's
data.

kfree/dma unmap/etc may only be done on a SEND buffer after seeing a
SCQE proving that buffer is done, or tearing down the QP and halting
the send side.

> The problem then is how do you make the RDMA consumer wait until
> there is send queue space. I suppose the xprt_complete_rqst()

It depends on the overall ULP design..

For work that is created by the recv queue (ie invalidates, new posts,
etc) I've successfully simply stopped polling the rq if the sq doesn't
have room to issue the largest single compound a recv would require.

Ie on the client side a recv may require issuing an INVALIDATE, so
when the SQ fills, stop processing recv.

> could be postponed in this case, or simulated xprt congestion
> could be used to prevent starting new RPCs while the send queue
> is full.

Then the other half is async new work from someplace else, like the rq
case above, stop async work from advancing if the SQ cannot hold the
largest required compound. Sounds like this is 2 (FRWR, SEND) for NFS
client.
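
To make that gating concrete: here is a toy userspace model of the
loop described above (all names and counts are hypothetical, not from
any existing ULP) - recvs are drained only while the SQ can absorb the
worst-case work one recv can generate:

```c
#include <assert.h>

#define MAX_SQES_PER_RECV 2  /* worst case one recv can generate */

static unsigned int sq_space = 8;   /* free SQEs right now        */
static unsigned int rq_pending = 5; /* recvs waiting in the RCQ   */
static unsigned int rq_handled;

/* Drain the recv queue only while the SQ can absorb the
 * worst-case work a single recv might post (e.g. an INVALIDATE). */
static void drain_recv_queue(void)
{
	while (rq_pending && sq_space >= MAX_SQES_PER_RECV) {
		rq_pending--;
		rq_handled++;
		sq_space -= MAX_SQES_PER_RECV; /* recv consumed SQE credit */
	}
}
```

The last recv stays parked in the RCQ until a send completion returns
SQE credit, at which point draining resumes - backpressure without any
risk of overflowing the SQ.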

Jason


* Re: Kernel fast memory registration API proposal [RFC]
       [not found]                                                                 ` <20150716204932.GA10638-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2015-07-17 15:03                                                                   ` Chuck Lever
       [not found]                                                                     ` <62F9F5B8-0A18-4DF8-B47E-7408BFFE9904-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 68+ messages in thread
From: Chuck Lever @ 2015-07-17 15:03 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Sagi Grimberg, Christoph Hellwig,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Steve Wise, Or Gerlitz,
	Oren Duer, Bart Van Assche, Liran Liss, Hefty, Sean,
	Doug Ledford, Tom Talpey


On Jul 16, 2015, at 4:49 PM, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote:

> On Thu, Jul 16, 2015 at 04:07:04PM -0400, Chuck Lever wrote:
> 
>> The MRs are registered only for remote read. I don’t think
>> catastrophic harm can occur on the client in this case if the
>> invalidation and DMA sync comes late. In fact, I’m unsure why
>> a DMA sync is even necessary as the MR is invalidated in this
>> case.
> 
> For RDMA, the worst case would be some kind of information leakage or
> machine check halt.
> 
> For read side the DMA API should be called before posting the FRWR, no
> completion side issues.

It is: rpcrdma_map_one() is done by .ro_map in both the RDMA READ
and WRITE cases.

Just to confirm: you’re saying that for MRs that are read-accessed,
no matching ib_dma_unmap_{page,single}() is required ?


>> In the case of incoming data payloads (NFS READ) the DMA sync
>> ordering is probably an important issue. The sync has to happen
>> before the ULP can touch the data, 100% of the time.
> 
> Absolutely, the sync is critical.
> 
>> That could be addressed by performing a DMA sync on the write
>> list or reply chunk MRs right in the RPC reply handler (before
>> xprt_complete_rqst).
> 
> That sounds good to me, much more in line with what I'd expect to
> see. The fmr unmap and invalidate post should also be in the reply
> handler (for flow control reasons, see below)

Sure. It might be possible to move both the DMA unmap and the
invalidate into the reply handler without a lot of surgery.
We’ll see.

There would be some performance cost. That’s unfortunate because
the scenarios we’re guarding against are exceptionally rare.


>>> The only absolutely correct way to run the RDMA stack is to keep track
>>> of SQ/SCQ space directly, and only update that tracking by processing
>>> SCQEs.
>> 
>> In other words, the only time it is truly safe to do a post_send is
>> after you’ve received a send completion that indicates you have
>> space on the send queue.
> 
> Yes.
> 
> Use a scheme where you suppress signaling and use the SQE accounting to
> request a completion entry and signal around every 1/2 length of the
> SQ.

Actually Sagi and I have found we can’t leave more than about 80
sends unsignalled, no matter how long the pre-allocated SQ is.

xprtrdma caps the maximum number of unsignalled sends at 20,
though, as a margin of error. That gives about 95% send completion
mitigation.

Since most send completions are silenced, xprtrdma relies on seeing
the completion of a _subsequent_ WR.

So, if my reply handler were to issue a LOCAL_INV WR and wait for
its completion, then the completion of send WRs submitted before
that one, even if they are silent, is guaranteed.

In the cases where the reply handler issues a LOCAL_INV, waiting
for its completion before allowing the next RPC to be sent is
enough to guarantee space on the SQ, I would think.

For FMR and smaller RPCs that don’t need RDMA, we’d probably
have to wait on the completion of the RDMA SEND of the RPC call
message.

So, we could get away with signalling only the last send WR issued
for each RPC.


> Use the WRID in some way to encode the # SQEs each completion
> represents.
> 
> I've used a scheme where the wrid is a wrapping index into
> an array of SQ length long, that holds any meta information..
> 
> That makes it trivial to track SQE accounting and avoids memory
> allocations for wrids.
> 
> Generically:
> 
>  posted_sqes -= (wc->wr_id - last_wrid);
>  for (i = last_wrid; i != wc->wr_id; ++i)
>    complete(wr_data[i % sq_len].ptr);
>  last_wrid = wc->wr_id;
> 
> Many other options, too.
> 
> -----
> 
> There is a bit more going on too, *technically* the HCA owns the
> buffer until a SCQE is produced. The recv proves the peer will drop
> any re-transmits of the message, but it doesn't prove that the local
> HCA won't create a re-transmit. Lost acks or other protocol weirdness
> could *potentially* cause buffer re-read in the general RDMA
> framework.
> 
> So if you use recv to drive re-use of the SEND buffer memory, it is
> important that the SEND buffer remain full of data to send to that
> peer and not be kfree'd, dma unmapped, or reused for another peer's
> data.
> 
> kfree/dma unmap/etc may only be done on a SEND buffer after seeing a
> SCQE proving that buffer is done, or tearing down the QP and halting
> the send side.

The buffers the client uses to send an RPC call are DMA mapped once
when the transport is created, and a local lkey is used in the SEND
WR.

They are re-used for the next RPCs in the pipe, but as far as I can
tell the client’s send buffer contains the RPC call data until the
RPC request slot is retired (xprt_release).

I need to review the mechanism in rpcrdma_buffer_get() to see if
that logic does prevent early re-use.

--
Chuck Lever





* Re: Kernel fast memory registration API proposal [RFC]
       [not found]                                                                     ` <62F9F5B8-0A18-4DF8-B47E-7408BFFE9904-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
@ 2015-07-17 17:21                                                                       ` Jason Gunthorpe
       [not found]                                                                         ` <20150717172141.GA15808-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 68+ messages in thread
From: Jason Gunthorpe @ 2015-07-17 17:21 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Sagi Grimberg, Christoph Hellwig,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Steve Wise, Or Gerlitz,
	Oren Duer, Bart Van Assche, Liran Liss, Hefty, Sean,
	Doug Ledford, Tom Talpey

On Fri, Jul 17, 2015 at 11:03:45AM -0400, Chuck Lever wrote:
> 
> On Jul 16, 2015, at 4:49 PM, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote:
> 
> > On Thu, Jul 16, 2015 at 04:07:04PM -0400, Chuck Lever wrote:
> > 
> >> The MRs are registered only for remote read. I don’t think
> >> catastrophic harm can occur on the client in this case if the
> >> invalidation and DMA sync comes late. In fact, I’m unsure why
> >> a DMA sync is even necessary as the MR is invalidated in this
> >> case.
> > 
> > For RDMA, the worst case would be some kind of information leakage or
> > machine check halt.
> > 
> > For read side the DMA API should be called before posting the FRWR, no
> > completion side issues.
> 
> It is: rpcrdma_map_one() is done by .ro_map in both the RDMA READ
> and WRITE cases.
> 
> Just to confirm: you’re saying that for MRs that are read-accessed,
> no matching ib_dma_unmap_{page,single}() is required ?

Sorry, I wasn't clear: dma_map/unmap must always be paired, and they
should ideally be in the right order:
 dma_map(.)
 create MR
 invalidate MR
 dma_unmap()

Remember the DMA API could spin up IOMMU mappings or otherwise,
pairing is critical, ordering is critical, and I'd have some concern
around timeliness too..

When I said no issues, I was talking about running the MR invalidate
async with RPC processing. That should be fine.

But the dma unmap should be done from the SCQ processing loop, after
it is known the INVALIDATE for the ACCESS_REMOTE_READ MR is
complete. You could perhaps suppress completion for the invalidate, as
long as there is a scheme to track the needed invalidate (see my last
email)

An ACCESS_REMOTE_WRITE MR side is basically the same, except the
INVALIDATE should be signaled and RPC processing should resume from
the SCQ side.

This is where you'd put a 'server trust' performance option to run
even the write invalidate async, then the dma_unmap should be done
when the invalidate is posted.

> Sure. It might be possible to move both the DMA unmap and the
> invalidate into the reply handler without a lot of surgery.
> We’ll see.
> 
> There would be some performance cost. That’s unfortunate because
> the scenarios we’re guarding against are exceptionally rare.

NFS needs to learn to do SEND WITH INVALIDATE to mitigate the
invalidate cost...

> > Use a scheme where you suppress signaling and use the SQE accounting to
> > request a completion entry and signal around every 1/2 length of the
> > SQ.
> 
> Actually Sagi and I have found we can’t leave more than about 80
> sends unsignalled, no matter how long the pre-allocated SQ is.

Hum, I'm pretty sure I've done more than that before on mlx4 and
mthca. Certainly, I can't think of any reason (spec wise) for the
above to be true. Sagi, do you know what this is?

The fact you see unexplained problems like this is more likely to be a
reflection of NFS not following the rules for running the SQ, than a
driver bug. QP blow ups and posting failures are exactly the symptoms
of not following the rules :)

Once the ULP is absolutely certain, by direct accounting of consumed
SQEs that it is not over posting, would I look for a driver/hw
problem....

> Since most send completions are silenced, xprtrdma relies on seeing
> the completion of a _subsequent_ WR.

Right, since you don't care about the sends, you only need enough
information and signalling to flow control the SQ/SCQ. But, a SEND
that would otherwise be silenced, should be signaled if it falls at
the 1/2 mark, or is the last WR placed into a becoming full SQ. That
minimum basic mandatory signalling is required to avoid deadlocking.

> So, if my reply handler were to issue a LOCAL_INV WR and wait for
> its completion, then the completion of send WRs submitted before
> that one, even if they are silent, is guaranteed.

Yes, the SQ is strongly ordered.

> In the cases where the reply handler issues a LOCAL_INV, waiting
> for its completion before allowing the next RPC to be sent is
> enough to guarantee space on the SQ, I would think.

> For FMR and smaller RPCs that don’t need RDMA, we’d probably
> have to wait on the completion of the RDMA SEND of the RPC call
> message.

> So, we could get away with signalling only the last send WR issued
> for each RPC.

I think I see you thinking about how to bolt on a different implicit
accounting scheme, again using inference about X completing meaning Y
is available?

I'm sure that can be made to work (and I think you've got the right
reasoning), but I strongly don't recommend it - it is complicated and
brittle to maintain. ie Perhaps NFS had a reasonable scheme like this
once, but the FRWR additions appear to have damaged its basic
unstated assumptions.

Directly track the number of SQEs used and available, use WARN_ON
before every post to make sure the invariant isn't violated.

Because NFS has a mixed scheme where only INVALIDATE is required
synchronous, I'd optimize for free flow without requiring SEND to be
signaled.

Based on your comments, I think an accounting scheme like this makes
sense:
 0. Broadly we want to have three pools for a RPC slot:
     - Submitted to the upper layer and available for immediate use
     - Submitted to the network and currently executing
     - Waiting for resources to recycle
       * A recv buffer is posted to the local RQ
       * The far end has posted its recv buffer to its RQ
       * The SQ/SCQ has available space to issue any RPC
 1. Each RPC slot takes a maximum of N SQE credits. Figure this
    constant out at the start of time. I suspect it is 3 when using FRWR.
 2. When you pass a RPC slot to the upper layer, either at the start
    of time, or when completing recvs, decrease the SQE accounting
    by N. ie the upper layer is now free to use that RPC slot at any
    moment, the maximum N SQEs it could require are guaranteed
    available and nothing can steal them.

    If N SQEs are not available then do not give the slot to the
    upper layer.
 3. When the RPC is actually submitted figure out how many SQEs it
    really needs and adjust the accounting. Ie if only 1 is needed then
    return 2 SQE credits.
 4. Track SQE credit use at SCQ time using some scheme, and return
    credit for explicitly and implicitly completed SQEs.
 5. Figure out the right place to inject the 3rd pool of #0. This can
    absolutely be done by deferring advancing the recvQ until the RPC
    recycling conditions are all met, but it would be better
    (latency wise) to process the recv and then defer recycling the
    empty RPC slot.

Use signaling when necessary: at the 1/2 point, for all SQEs when free
space is < N (deadlock avoidance) and when NFS needs to wait for a
sync invalidate.

It sounds more complicated than it is. :)

If you have a work load with no sync-invalidates then the above still
functions at full speed without requiring extra SEND signaling.
sync-invalidates cause SQE credits to recycle faster and guarantee we
won't do the deferral in #5.

Size the SQ length to be at least something like 2*N*(the # of RPC)
slots..

I'd say the above is broadly typical for what I'd consider correct use
of a RDMA QP.. The three flow control loops of #0 should be fairly obvious
and explicit in the code.
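
A minimal userspace sketch of the credit bookkeeping in steps 1-4
might look like this (all names and the SQ_LEN/SQE_PER_SLOT constants
are hypothetical, not taken from any existing ULP):

```c
#include <assert.h>
#include <stdbool.h>

#define SQ_LEN       192  /* >= 2 * N * (# of RPC slots)          */
#define SQE_PER_SLOT 3    /* N: worst case per RPC slot with FRWR */

static unsigned int sqe_avail = SQ_LEN;

/* Step 2: reserve the worst-case N SQEs before handing an RPC
 * slot to the upper layer; hold the slot back if they aren't there. */
static bool rpc_slot_reserve(void)
{
	if (sqe_avail < SQE_PER_SLOT)
		return false;
	sqe_avail -= SQE_PER_SLOT;
	return true;
}

/* Step 3: at post time, return the credits this RPC turned out
 * not to need. */
static void rpc_slot_post(unsigned int sqes_used)
{
	assert(sqes_used >= 1 && sqes_used <= SQE_PER_SLOT);
	sqe_avail += SQE_PER_SLOT - sqes_used;
}

/* Step 4: an SCQE returns credit for itself plus any earlier
 * unsignalled WRs it implicitly completes. */
static void scq_reap(unsigned int sqes_completed)
{
	sqe_avail += sqes_completed;
	assert(sqe_avail <= SQ_LEN); /* invariant: never over-return */
}
```

The asserts here play the role of the WARN_ON suggested above: every
post path checks the invariant directly instead of inferring SQ space
from recv traffic.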

> > kfree/dma unmap/etc may only be done on a SEND buffer after seeing a
> > SCQE proving that buffer is done, or tearing down the QP and halting
> > the send side.
> 
> The buffers the client uses to send an RPC call are DMA mapped once
> when the transport is created, and a local lkey is used in the SEND
> WR.
> 
> They are re-used for the next RPCs in the pipe, but as far as I can
> tell the client’s send buffer contains the RPC call data until the
> RPC request slot is retired (xprt_release).

It is what I'd expect based on your past descriptions - just making
sure you are aware :)

Jason


* Re: Kernel fast memory registration API proposal [RFC]
       [not found]                                                                         ` <20150717172141.GA15808-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2015-07-17 19:26                                                                           ` Chuck Lever
       [not found]                                                                             ` <9A70883F-9963-42D0-9F5C-EF49F822A037-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 68+ messages in thread
From: Chuck Lever @ 2015-07-17 19:26 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Sagi Grimberg, Christoph Hellwig,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Steve Wise, Or Gerlitz,
	Oren Duer, Bart Van Assche, Liran Liss, Hefty, Sean,
	Doug Ledford, Tom Talpey


On Jul 17, 2015, at 1:21 PM, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote:

> On Fri, Jul 17, 2015 at 11:03:45AM -0400, Chuck Lever wrote:
>> 
>> On Jul 16, 2015, at 4:49 PM, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote:
>> 
>>> Use a scheme where you suppress signaling and use the SQE accounting to
>>> request a completion entry and signal around every 1/2 length of the
>>> SQ.
>> 
>> Actually Sagi and I have found we can’t leave more than about 80
>> sends unsignalled, no matter how long the pre-allocated SQ is.
> 
> Hum, I'm pretty sure I've done more than that before on mlx4 and
> mthca. Certainly, I can't think of any reason (spec wise) for the
> above to be true. Sagi, do you know what this is?
> 
> The fact you see unexplained problems like this is more likely to be a
> reflection of NFS not following the rules for running the SQ, than a
> driver bug. QP blow ups and posting failures are exactly the symptoms
> of not following the rules :)
> 
> Once the ULP is absolutely certain, by direct accounting of consumed
> SQEs that it is not over posting, would I look for a driver/hw
> problem....
> 
>> Since most send completions are silenced, xprtrdma relies on seeing
>> the completion of a _subsequent_ WR.
> 
> Right, since you don't care about the sends, you only need enough
> information and signalling to flow control the SQ/SCQ. But, a SEND
> that would otherwise be silenced, should be signaled if it falls at
> the 1/2 mark, or is the last WR placed into a becoming full SQ. That
> minimum basic mandatory signalling is required to avoid deadlocking.
> 
>> So, if my reply handler were to issue a LOCAL_INV WR and wait for
>> its completion, then the completion of send WRs submitted before
>> that one, even if they are silent, is guaranteed.
> 
> Yes, the SQ is strongly ordered.
> 
>> In the cases where the reply handler issues a LOCAL_INV, waiting
>> for its completion before allowing the next RPC to be sent is
>> enough to guarantee space on the SQ, I would think.
> 
>> For FMR and smaller RPCs that don’t need RDMA, we’d probably
>> have to wait on the completion of the RDMA SEND of the RPC call
>> message.
> 
>> So, we could get away with signalling only the last send WR issued
>> for each RPC.
> 
> I think I see you thinking about how to bolt on a different implicit
> accounting scheme, again using inference about X completing meaning Y
> is available?
> 
> I'm sure that can be made to work (and I think you've got the right
> reasoning), but I strongly don't recommend it - it is complicated and
> brittle to maintain. ie Perhaps NFS had a reasonable scheme like this
> once, but the FRWR additions appear to have damaged its basic
> unstated assumptions.
> 
> Directly track the number of SQEs used and available, use WARN_ON
> before every post to make sure the invariant isn't violated.
> 
> Because NFS has a mixed scheme where only INVALIDATE is required
> synchronous, I'd optimize for free flow without requiring SEND to be
> signaled.
> 
> Based on your comments, I think an accounting scheme like this makes
> sense:
> 0. Broadly we want to have three pools for a RPC slot:
>     - Submitted to the upper layer and available for immediate use
>     - Submitted to the network and currently executing
>     - Waiting for resources to recycle
>       * A recv buffer is posted to the local RQ
>       * The far end has posted its recv buffer to its RQ
>       * The SQ/SCQ has available space to issue any RPC
> 1. Each RPC slot takes a maximum of N SQE credits. Figure this
>    constant out at the start of time. I suspect it is 3 when using FRWR.
> 2. When you pass a RPC slot to the upper layer, either at the start
>    of time, or when completing recvs, decrease the SQE accounting
>    by N. ie the upper layer is now free to use that RPC slot at any
>    moment, the maximum N SQEs it could require are guaranteed
>    available and nothing can steal them.
> 
>    If N SQEs are not available then do not give the slot to the
>    upper layer.
> 3. When the RPC is actually submitted figure out how many SQEs it
>    really needs and adjust the accounting. Ie if only 1 is needed then
>    return 2 SQE credits.
> 4. Track SQE credits use at SCQ time using some scheme, and return
>    credit for explicitly and implicitly completed SQEs.
> 5. Figure out the right place to inject the 3rd pool of #0. This can
>    absolutely be done by deferring advancing the recvQ until the RPC
>    recycling conditions are all met, but it would be better
>    (latency wise) to process the recv and then defer recycling the
>    empty RPC slot.
> 
> Use signaling when necessary: at the 1/2 point, for all SQEs when free
> space is < N (deadlock avoidance) and when NFS needs to wait for a
> sync invalidate.
> 
> It sounds more complicated than it is. :)
> 
> If you have a work load with no sync-invalidates then the above still
> functions at full speed without requiring extra SEND signaling.
> sync-invalidates cause SQE credits to recycle faster and guarantee we
> won't do the deferral in #5.
> 
> Size the SQ length to be at least something like 2*N*(the # of RPC)
> slots..
> 
> I'd say the above is broadly typical for what I'd consider correct use
> of a RDMA QP.. The three flow control loops of #0 should be fairly obvious
> and explicit in the code.

Jason, thanks for your comments and your time.

The send queue overflows I saw may indeed be related to the
current design which assumes the receive completion for an RPC
reply always implies the send queue has space for the next RPC
operation’s send WRs.

I wonder if I introduced this problem when I split the completion
queue.

Some send queue accounting is already in place (see DECR_CQCOUNT).
I’m sure that can be enhanced. What may be missing is a check for
available send queue resources before dispatching the next RPC.

The three #0 pools make sense, and I can see a mapping onto the
current RPC client design. When the send queue is full, the next
RPC can be deferred by not calling xprt_release_rqst_cong() until
this RPC’s resources are free.

However, if we start signaling more aggressively when the send
queue is full, that means intentionally multiplying the completion
and interrupt rate when the workload is heaviest. That could have
performance scalability consequences.


--
Chuck Lever





* Re: Kernel fast memory registration API proposal [RFC]
       [not found]                                                                             ` <9A70883F-9963-42D0-9F5C-EF49F822A037-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
@ 2015-07-17 20:36                                                                               ` Jason Gunthorpe
  0 siblings, 0 replies; 68+ messages in thread
From: Jason Gunthorpe @ 2015-07-17 20:36 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Sagi Grimberg, Christoph Hellwig,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Steve Wise, Or Gerlitz,
	Oren Duer, Bart Van Assche, Liran Liss, Hefty, Sean,
	Doug Ledford, Tom Talpey

On Fri, Jul 17, 2015 at 03:26:04PM -0400, Chuck Lever wrote:
> > I'd say the above is broadly typical for what I'd consider correct use
> > of a RDMA QP.. The three flow control loops of #0 should be fairly obvious
> > and explicit in the code.
> 
> Jason, thanks for your comments and your time.

No problem, I hope you can work something out and keep participating
in the various new API discussions!

> Some send queue accounting is already in place (see DECR_CQCOUNT).
> I’m sure that can be enhanced. What may be missing is a check for
> available send queue resources before dispatching the next RPC.

Just some more clarity and colour: I talked about tracking SQEs; this
is explicitly monitoring the SQ and preventing overflow, but I'm
assuming that there is a 1:1 mapping of SQ to CQ -> ie the CQ is not
shared.

In this case, the SQE limit is the smaller of the two queues and
tracking the SQEs tracks the CQ space.

If the CQ is shared, then the CQ itself should also be tracked, and
nobody can post to a related Q without CQ space. This forms a fourth
flow control loop.

So language-wise, talk about tracking SQEs (send queue entries), and
if you have shared CQs then add a CQ count.

Implementation-wise, I often use wrapping 64-bit counters to keep
track of this stuff. Every SQE post increments the head and every send
CQ reap increments the tail; (head - tail) < limit is the main math.

This lets the counter be used as a record and aids debugging; see
below.

> However, if we start signaling more aggressively when the send
> queue is full, that means intentionally multiplying the completion
> and interrupt rate when the workload is heaviest. That could have
> performance scalability consequences.

Consider, it is also possible that the SQ is full because we
are not signaling enough: There are many unreaped entries.

There are many different schemes that are possible here.. What I
described was something simple and easy to understand, while still
thinking about various deadlock situations.

Something like this is a more complete example:

uint64_t head_sqe;
uint64_t tail_sqe;
uint64_t signaled_sqe;

if (need_signal ||
    (head_sqe - signaled_sqe) >= sqe_limit/2 ||
    ((head_sqe - tail_sqe) >= (sqe_limit - N) &&
     (head_sqe - signaled_sqe) >= sqe_limit/4 &&
     ring64_gt(signaled_sqe, tail_sqe))) {
  wr[0].send_flags |= IB_SEND_SIGNALED;
  signaled_sqe = head_sqe;
}

ib_post(..,1);

head_sqe += 1;
assert(head_sqe - tail_sqe <= sqe_limit);

- Every SQE that crosses a 1/2 marker gets a signal at the marker.
- Upon going full we start signaling, unless we signaled recently
  and the last signal has not been reaped.
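The ring64_gt helper above isn't defined in the snippet; a
self-contained userspace sketch of the wrapping-counter scheme
(illustrative names, not kernel code) could look like this:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sketch of the wrapping 64-bit counter scheme described above.
 * All names here are illustrative, not from any kernel API. */
struct sq_track {
	uint64_t head;  /* incremented on every SQE post */
	uint64_t tail;  /* incremented on every send CQE reap */
	uint64_t limit; /* send queue depth */
};

/* True when 'a' is logically ahead of 'b', even across u64 wrap. */
static bool ring64_gt(uint64_t a, uint64_t b)
{
	return (int64_t)(a - b) > 0;
}

static bool sq_can_post(const struct sq_track *t)
{
	/* head - tail is the number of outstanding SQEs; unsigned
	 * subtraction keeps this correct across wraparound. */
	return t->head - t->tail < t->limit;
}

static void sq_post(struct sq_track *t)
{
	assert(sq_can_post(t));
	t->head++;
}

static void sq_reap(struct sq_track *t)
{
	t->tail++;
}
```

The unsigned subtraction is what makes the wraparound harmless, and is
also why the counters double as a debugging record: head and tail never
reset.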

Jason

* Re: Kernel fast memory registration API proposal [RFC]
       [not found]                 ` <20150716180806.GC3680-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2015-07-19  5:33                   ` Sagi Grimberg
       [not found]                     ` <55AB36A4.1070102-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 68+ messages in thread
From: Sagi Grimberg @ 2015-07-19  5:33 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Steve Wise,
	Or Gerlitz, Oren Duer, Chuck Lever, Bart Van Assche, Liran Liss,
	Hefty, Sean, Doug Ledford, Tom Talpey

On 7/16/2015 9:08 PM, Jason Gunthorpe wrote:
> On Thu, Jul 16, 2015 at 03:21:04PM +0300, Sagi Grimberg wrote:
>> I gotta say,
>>
>> these suggestions of bool/write or supported_ops with a convert helper
>> seem (to me at least) to make things more complicated.
>>
>> Why not just set the access_flags as they are?
>> I want local use?
>> set IB_ACCESS_LOCAL_WRITE
>> I want a peer to read from me?
>> set IB_ACCESS_REMOTE_READ
>> I want a peer to write to me?
>> IB_ACCESS_REMOTE_WRITE
>> ...
>>
>> isn't it much simpler?
>
> I don't have a really strong preference, but I like the idea of using
> the OP names because it means the API cannot be used wrong.
>
> Ie this, is absolutely wrong:
>
>   rdma_map_sg_lkey(..,IB_ACCESS_REMOTE_READ);
>
> But this:
>
>   rdma_map_sg_lkey(..,IB_ACCESS_REMOTE_WRITE);
>
> Is only wrong on IB.
>
> While,
>   rdma_map_sg_lkey(..,RDMA_OP_RDMA_READ)
>
> Is always right, and can never be used wrong, and thus is less
> complicated (for the ULP).

I was thinking that the user won't explicitly say which key it registers
and it will be decided from the registration itself.
Meaning, the registration code will do:

if (access & (IB_ACCESS_REMOTE_READ | IB_ACCESS_REMOTE_WRITE))
         register rkey...
else
         register lkey...

Will that work with iWARP? or will this break because
iWARP needs REMOTE_WRITE for lkeys?
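A userspace sketch of the ambiguity (hypothetical flag and op names,
not the proposed kernel API): on iWARP the lkey used as the local sink
of an RDMA READ must carry REMOTE_WRITE, so a test like
`if (access & REMOTE_*)` would wrongly classify that registration as an
rkey, while op-based naming stays unambiguous:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flags and ops only -- not the kernel ib_verbs API. */
#define ACC_LOCAL_WRITE   (1 << 0)
#define ACC_REMOTE_WRITE  (1 << 1)
#define ACC_REMOTE_READ   (1 << 2)

enum mr_op {
	MR_OP_SEND_SRC,    /* lkey: local source of a SEND/WRITE */
	MR_OP_READ_SINK,   /* lkey: local sink of an RDMA READ */
	MR_OP_PEER_READ,   /* rkey: remote peer reads from us */
	MR_OP_PEER_WRITE,  /* rkey: remote peer writes to us */
};

/* Why op names beat raw flags: the flags a key needs depend on the
 * transport, but the intended operation does not. */
static int op_to_access(enum mr_op op, bool is_iwarp)
{
	switch (op) {
	case MR_OP_SEND_SRC:
		return 0;
	case MR_OP_READ_SINK:
		/* The iWARP quirk: the local READ sink needs
		 * REMOTE_WRITE; on IB, LOCAL_WRITE is enough. */
		return ACC_LOCAL_WRITE |
		       (is_iwarp ? ACC_REMOTE_WRITE : 0);
	case MR_OP_PEER_READ:
		return ACC_REMOTE_READ;
	case MR_OP_PEER_WRITE:
		return ACC_REMOTE_WRITE;
	}
	return 0;
}
```

With raw flags, REMOTE_WRITE on an iWARP READ-sink lkey is
indistinguishable from an rkey meant for peer writes, which is exactly
the case the deduction above would get wrong.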

* Re: Kernel fast memory registration API proposal [RFC]
       [not found] ` <559F8BD1.9080308-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
                     ` (2 preceding siblings ...)
  2015-07-15  7:32   ` Christoph Hellwig
@ 2015-07-19  5:45   ` Sagi Grimberg
       [not found]     ` <55AB3976.7060202-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  3 siblings, 1 reply; 68+ messages in thread
From: Sagi Grimberg @ 2015-07-19  5:45 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig, Jason Gunthorpe
  Cc: Steve Wise, Or Gerlitz, Oren Duer, Chuck Lever, Bart Van Assche,
	Liran Liss, Hefty, Sean, Doug Ledford, Tom Talpey


> /**
>    * ib_mr_set_sg() - populate memory region buffers
>    *     array from a SG list
>    * @mr:          memory region
>    * @sg:          sg list
>    * @sg_nents:    number of elements in the sg
>    *
>    * Can fail if the HW is not able to register this
>    * sg list. In case of failure - caller is responsible
>    * to handle it (bounce-buffer, multiple registrations...)
>    */
> int ib_mr_set_sg(struct ib_mr *mr,
>                   struct scatterlist *sg,
>                   unsigned short sg_nents);

I'm thinking now that this should have an input argument
of block_size. Maybe in the future ULPs would want to register
huge pages, it will be a shame to map it into PAGE_SIZE chunks...

Current users will just pass PAGE_SIZE.

Thoughts?
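For concreteness, here is what a block_size granularity buys when
building the HW page list -- a userspace sketch with illustrative
names, not the proposed kernel helper:

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of what a block_size argument implies: chunk one contiguous
 * region into block_size-aligned entries for the HW page list.
 * Userspace illustration; names are not the proposed kernel API. */
static size_t fill_page_list(unsigned long addr, size_t len,
			     unsigned long block_size,
			     unsigned long *pages, size_t max_pages)
{
	unsigned long blk = addr & ~(block_size - 1); /* align down */
	unsigned long end = addr + len;
	size_t n = 0;

	for (; blk < end && n < max_pages; blk += block_size)
		pages[n++] = blk;
	return n; /* number of HW page-list entries consumed */
}
```

With block_size = PAGE_SIZE this matches today's behaviour; a caller
registering huge pages and passing a 2MB block_size would consume 512x
fewer page-list entries for the same region.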

* Re: Kernel fast memory registration API proposal [RFC]
       [not found]     ` <55AB3976.7060202-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2015-07-20 16:18       ` Jason Gunthorpe
       [not found]         ` <20150720161821.GA18336-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 68+ messages in thread
From: Jason Gunthorpe @ 2015-07-20 16:18 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig, Steve Wise,
	Or Gerlitz, Oren Duer, Chuck Lever, Bart Van Assche, Liran Liss,
	Hefty, Sean, Doug Ledford, Tom Talpey

On Sun, Jul 19, 2015 at 08:45:26AM +0300, Sagi Grimberg wrote:
> 
> >/**
> >   * ib_mr_set_sg() - populate memory region buffers
> >   *     array from a SG list
> >   * @mr:          memory region
> >   * @sg:          sg list
> >   * @sg_nents:    number of elements in the sg
> >   *
> >   * Can fail if the HW is not able to register this
> >   * sg list. In case of failure - caller is responsible
> >   * to handle it (bounce-buffer, multiple registrations...)
> >   */
> >int ib_mr_set_sg(struct ib_mr *mr,
> >                  struct scatterlist *sg,
> >                  unsigned short sg_nents);
> 
> I'm thinking now that this should have an input argument
> of block_size. Maybe in the future ULPs would want to register
> huge pages, it will be a shame to map it into PAGE_SIZE chunks...

Why wouldn't it just transparently support huge pages? sg seems to
have enough information.

Jason

* Re: Kernel fast memory registration API proposal [RFC]
       [not found]                     ` <55AB36A4.1070102-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2015-07-20 16:23                       ` Jason Gunthorpe
       [not found]                         ` <20150720162340.GB18336-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 68+ messages in thread
From: Jason Gunthorpe @ 2015-07-20 16:23 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Steve Wise,
	Or Gerlitz, Oren Duer, Chuck Lever, Bart Van Assche, Liran Liss,
	Hefty, Sean, Doug Ledford, Tom Talpey

On Sun, Jul 19, 2015 at 08:33:24AM +0300, Sagi Grimberg wrote:
> I was thinking that the user won't explicitly say which key it registers
> and it will be decided from the registration itself.
> Meaning, the registration code will do:

Please don't..

> if (access & (IB_ACCESS_REMOTE_READ | IB_ACCESS_REMOTE_WRITE))
>         register rkey...
> else
>         register lkey...
> 
> Will that work with iWARP? or will this break because
> iWARP needs REMOTE_WRITE for lkeys?

Right, it will break.

Access flags are only weakly related to lkey/rkey.

It needs to be explicit. We have spots in the API that take lkeys and
other spots that take rkeys - the caller must always know the intended
use of the key it is requesting; there is no reason not to describe
that explicitly.

Jason

* Re: Kernel fast memory registration API proposal [RFC]
       [not found]         ` <20150720161821.GA18336-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2015-07-20 16:27           ` Sagi Grimberg
       [not found]             ` <55AD2188.50708-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 68+ messages in thread
From: Sagi Grimberg @ 2015-07-20 16:27 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig, Steve Wise,
	Or Gerlitz, Oren Duer, Chuck Lever, Bart Van Assche, Liran Liss,
	Hefty, Sean, Doug Ledford, Tom Talpey

>> I'm thinking now that this should have an input argument
>> of block_size. Maybe in the future ULPs would want to register
>> huge pages, it will be a shame to map it into PAGE_SIZE chunks...
>
> Why wouldn't it just transparently support huge pages? sg seems to
> have enough information.

I'm not sure I know how to do that, can you explain how please?

* Re: Kernel fast memory registration API proposal [RFC]
       [not found]                         ` <20150720162340.GB18336-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2015-07-20 16:29                           ` Sagi Grimberg
  0 siblings, 0 replies; 68+ messages in thread
From: Sagi Grimberg @ 2015-07-20 16:29 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Steve Wise,
	Or Gerlitz, Oren Duer, Chuck Lever, Bart Van Assche, Liran Liss,
	Hefty, Sean, Doug Ledford, Tom Talpey

On 7/20/2015 7:23 PM, Jason Gunthorpe wrote:
> On Sun, Jul 19, 2015 at 08:33:24AM +0300, Sagi Grimberg wrote:
>> I was thinking that the user won't explicitly say which key it registers
>> and it will be decided from the registration itself.
>> Meaning, the registration code will do:
>
> Please don't..
>
>> if (access & (IB_ACCESS_REMOTE_READ | IB_ACCESS_REMOTE_WRITE))
>>          register rkey...
>> else
>>          register lkey...
>>
>> Will that work with iWARP? or will this break because
>> iWARP needs REMOTE_WRITE for lkeys?
>
> Right, it will break.
>
> Access flags are only weakly related to lkey/rkey.
>
> It needs to be explicit. We have spots in the API that take lkeys and
> other spots that take rkeys - the caller must always know the intended
> use of the key it is requesting; there is no reason not to describe
> that explicitly.

OK, keep it explicit. got it.

* Re: Kernel fast memory registration API proposal [RFC]
       [not found]             ` <55AD2188.50708-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2015-07-20 17:00               ` Jason Gunthorpe
       [not found]                 ` <20150720170033.GA20350-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 68+ messages in thread
From: Jason Gunthorpe @ 2015-07-20 17:00 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig, Steve Wise,
	Or Gerlitz, Oren Duer, Chuck Lever, Bart Van Assche, Liran Liss,
	Hefty, Sean, Doug Ledford, Tom Talpey

On Mon, Jul 20, 2015 at 07:27:52PM +0300, Sagi Grimberg wrote:
> >>I'm thinking now that this should have an input argument
> >>of block_size. Maybe in the future ULPs would want to register
> >>huge pages, it will be a shame to map it into PAGE_SIZE chunks...
> >
> >Why wouldn't it just transparently support huge pages? sg seems to
> >have enough information.
> 
> I'm not sure I know how to do that, can you explain how please?

Scan the scatter list; if the pages are all the same length and
aligned on their length, then that is the huge page size; otherwise
use 4k.

Convert to a shift.

See if the driver supports that as a huge page size (currently
missing from our driver API).

Fill in the wr as you'd expect using the computed value as
wr.fast_reg.page_shift

Really, today nothing can use huge pages with IB_WR_FAST_REG_MR
because there is no way to negotiate with the driver the supported
page shift values. Hopefully your new version can get closer to
fixing that by taking the wr construction away from the caller.
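The scan described above can be sketched in userspace against a mock
scatterlist (illustrative names; a real implementation would also have
to tolerate a shorter final entry):

```c
#include <assert.h>
#include <stddef.h>

/* Mock of one scatterlist entry: a physically contiguous chunk. */
struct seg {
	unsigned long addr;
	unsigned long len;
};

/* If every segment has the same length and is aligned on that length,
 * that length is the (huge) page size; otherwise fall back to 4k. */
static unsigned long detect_page_size(const struct seg *sg, size_t n)
{
	unsigned long ps = n ? sg[0].len : 4096;
	size_t i;

	/* a page size candidate must be a power of two */
	if (!ps || (ps & (ps - 1)))
		return 4096;
	for (i = 0; i < n; i++)
		if (sg[i].len != ps || (sg[i].addr & (ps - 1)))
			return 4096;
	return ps;
}

/* "Convert to a shift" for wr.fast_reg.page_shift. */
static unsigned int size_to_shift(unsigned long size)
{
	unsigned int shift = 0;

	while ((1UL << shift) < size)
		shift++;
	return shift;
}
```

The remaining step would be checking the computed shift against what
the device claims to support before filling in the wr.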

Jason

* Re: Kernel fast memory registration API proposal [RFC]
       [not found]                 ` <20150720170033.GA20350-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2015-07-20 17:07                   ` Sagi Grimberg
       [not found]                     ` <55AD2AB4.8010209-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 68+ messages in thread
From: Sagi Grimberg @ 2015-07-20 17:07 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig, Steve Wise,
	Or Gerlitz, Oren Duer, Chuck Lever, Bart Van Assche, Liran Liss,
	Hefty, Sean, Doug Ledford, Tom Talpey

On 7/20/2015 8:00 PM, Jason Gunthorpe wrote:
> On Mon, Jul 20, 2015 at 07:27:52PM +0300, Sagi Grimberg wrote:
>>>> I'm thinking now that this should have an input argument
>>>> of block_size. Maybe in the future ULPs would want to register
>>>> huge pages, it will be a shame to map it into PAGE_SIZE chunks...
>>>
>>> Why wouldn't it just transparently support huge pages? sg seems to
>>> have enough information.
>>
>> I'm not sure I know how to do that, can you explain how please?
>
> Scan the scatter list, if the pages are all the same length and
> aligned on their length then that is the huge page size, otherwise use
> 4k.

Bleh... seems like a great effort just to find that out. Isn't it
better to just ask for a page_size arg?

>
> Convert to a shift.
>
> See if the driver supports that as a huge page size (currently
> missing from our driver API).

It's not missing: we have the device attribute page_size_cap, which
is a bitmask of supported page shifts (if I'm not mistaken).

>
> Fill in the wr as you'd expect using the computed value as
> wr.fast_reg.page_shift
>
> Really, today nothing can use huge pages with IB_WR_FAST_REG_MR
> because there is no way to negotiate with the driver the supported
> page shift values. Hopefully your new version can get closer to
> fixing that by taking the wr construction away from the caller.

It is negotiable. Most drivers don't negotiate it though... SRP is
the only one that does.

* Re: Kernel fast memory registration API proposal [RFC]
       [not found]                     ` <55AD2AB4.8010209-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2015-07-20 19:50                       ` Jason Gunthorpe
       [not found]                         ` <20150720195027.GA24162-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 68+ messages in thread
From: Jason Gunthorpe @ 2015-07-20 19:50 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig, Steve Wise,
	Or Gerlitz, Oren Duer, Chuck Lever, Bart Van Assche, Liran Liss,
	Hefty, Sean, Doug Ledford, Tom Talpey

On Mon, Jul 20, 2015 at 08:07:00PM +0300, Sagi Grimberg wrote:
> On 7/20/2015 8:00 PM, Jason Gunthorpe wrote:
> >On Mon, Jul 20, 2015 at 07:27:52PM +0300, Sagi Grimberg wrote:
> >>>>I'm thinking now that this should have an input argument
> >>>>of block_size. Maybe in the future ULPs would want to register
> >>>>huge pages, it will be a shame to map it into PAGE_SIZE chunks...
> >>>
> >>>Why wouldn't it just transparently support huge pages? sg seems to
> >>>have enough information.
> >>
> >>I'm not sure I know how to do that, can you explain how please?
> >
> >Scan the scatter list, if the pages are all the same length and
> >aligned on their length then that is the huge page size, otherwise use
> >4k.
> 
> Bleh... seems like a great effort just to find that out. Isn't it
> better to just ask for a page_size arg?

So who computes page_size and how? Don't just punt things to a caller
without really explaining how the caller is supposed to use it
correctly.

For a value like this, it is a property of the scatter list. It should
either be computed when the scatterlist is created, or computed when
the scatterlist is passed to the HW.

Since IB is probably the only driver that would need to compute this,
then IB should do it at the driver level, and not burden the block
layer/etc with useless work.

Unless you think the ULP can get the same value faster..

> It not missing, we have device attribute page_size_cap which is
> a bitmask of supported page shifts (if I'm not mistaken).

Hum. That is what it should be..

Some drivers are wrong:

#define C2_MIN_PAGESIZE  1024
drivers/infiniband/hw/amso1100/c2_rnic.c:       props->page_size_cap       = ~(C2_MIN_PAGESIZE-1);

Many set it to PAGE_SIZE, which seems bonkers:

drivers/infiniband/hw/usnic/usnic_ib_verbs.c:   props->page_size_cap = USNIC_UIOM_PAGE_SIZE;
drivers/infiniband/hw/usnic/usnic_uiom.h:#define USNIC_UIOM_PAGE_SIZE           (PAGE_SIZE)
drivers/infiniband/hw/ipath/ipath_verbs.c:      props->page_size_cap = PAGE_SIZE;
drivers/infiniband/hw/qib/qib_verbs.c:  props->page_size_cap = PAGE_SIZE;

mlx5 seems to support only 1 page size, Sagi: I assume that needs fixing?

drivers/infiniband/hw/mlx5/main.c:      props->page_size_cap       = 1ull << MLX5_CAP_GEN(mdev, log_pg_sz);

ocrdma, cxgb4, mlx4, mthca look pretty good, and support various huge
pages.

> It is negotiable. Most drivers don't negotiate it though... srp is
> the only one who does it.

Well SRP does this:

drivers/infiniband/ulp/srp/ib_srp.c:    mr_page_shift           = max(12, ffs(dev_attr->page_size_cap) - 1);
drivers/infiniband/ulp/srp/ib_srp.c:    srp_dev->mr_page_size   = 1 << mr_page_shift;

So it always uses 4096 on supported IB hardware and no huge page
support is enabled. This seems like the wrong way to use
page_size_cap...

Hopefully moving SRP to your new API will fix that.
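A quick userspace demonstration of why that expression pins the shift
to 12: ffs() returns the lowest set bit of the capability mask, i.e.
the smallest supported page size, so selecting the largest supported
page would need the highest set bit instead:

```c
#include <strings.h> /* POSIX ffs() */

/* Why SRP's
 *     mr_page_shift = max(12, ffs(page_size_cap) - 1)
 * always resolves to 12 when 4k is supported: ffs() finds the LOWEST
 * set bit. Finding the largest supported page size needs the highest
 * set bit, computed here the portable way (like fls(cap) - 1). */
static int highest_shift(unsigned long cap)
{
	int shift = -1;

	while (cap) {
		shift++;
		cap >>= 1;
	}
	return shift;
}
```

With a cap advertising 4k, 2M and 1G, the ffs-based formula yields 12
(4096 pages) while the highest set bit would yield 30 (1G pages).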

Jason

* Re: Kernel fast memory registration API proposal [RFC]
       [not found]                         ` <20150720195027.GA24162-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2015-07-21 11:40                           ` Sagi Grimberg
       [not found]                             ` <55AE2FA2.3000601-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 68+ messages in thread
From: Sagi Grimberg @ 2015-07-21 11:40 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig, Steve Wise,
	Or Gerlitz, Oren Duer, Chuck Lever, Bart Van Assche, Liran Liss,
	Hefty, Sean, Doug Ledford, Tom Talpey

>>
>> Bleh... seems like a great effort just to find that out. Isn't it
>> better to just ask for a page_size arg?
>
> So who computes page_size and how? Don't just punt things to a caller
> without really explaining how the caller is supposed to use it
> correctly.

I'd imagine that the ULP knows when it registers huge-pages.
OK, I can scan the scatterlist and check it.

>> It not missing, we have device attribute page_size_cap which is
>> a bitmask of supported page shifts (if I'm not mistaken).
>
> Hum. That is what it should be..
>
> Some drivers are wrong:
>
> #define C2_MIN_PAGESIZE  1024
> drivers/infiniband/hw/amso1100/c2_rnic.c:       props->page_size_cap       = ~(C2_MIN_PAGESIZE-1);
>
> Many set it to PAGE_SIZE, which seems bonkers:
>
> drivers/infiniband/hw/usnic/usnic_ib_verbs.c:   props->page_size_cap = USNIC_UIOM_PAGE_SIZE;
> drivers/infiniband/hw/usnic/usnic_uiom.h:#define USNIC_UIOM_PAGE_SIZE           (PAGE_SIZE)
> drivers/infiniband/hw/ipath/ipath_verbs.c:      props->page_size_cap = PAGE_SIZE;
> drivers/infiniband/hw/qib/qib_verbs.c:  props->page_size_cap = PAGE_SIZE;
>
> mlx5 seems to support only 1 page size, Sagi: I assume that needs fixing?
>
> drivers/infiniband/hw/mlx5/main.c:      props->page_size_cap       = 1ull << MLX5_CAP_GEN(mdev, log_pg_sz);

Yep, fixing it now.

>
> ocrdma, cxgb4, mlx4, mthca look pretty good, and support various huge
> pages.
>
>> It is negotiable. Most drivers don't negotiate it though... srp is
>> the only one who does it.
>
> Well SRP does this:
>
> drivers/infiniband/ulp/srp/ib_srp.c:    mr_page_shift           = max(12, ffs(dev_attr->page_size_cap) - 1);
> drivers/infiniband/ulp/srp/ib_srp.c:    srp_dev->mr_page_size   = 1 << mr_page_shift;
>
> So it always uses 4096 on supported IB hardware and no huge page
> support is enabled. This seems like the wrong way to use
> page_size_cap...
>
> Hopefully moving SRP to your new API will fix that.

I have no plans to attempt to find the biggest aligned page_size
that is supported by the device.

I'm only going to check HUGE_PAGE; if that's not aligned I'll use
PAGE_SIZE, and if that's not aligned - fail.

Sagi.

* Re: Kernel fast memory registration API proposal [RFC]
       [not found]                             ` <55AE2FA2.3000601-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2015-07-21 16:00                               ` Jason Gunthorpe
  0 siblings, 0 replies; 68+ messages in thread
From: Jason Gunthorpe @ 2015-07-21 16:00 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig, Steve Wise,
	Or Gerlitz, Oren Duer, Chuck Lever, Bart Van Assche, Liran Liss,
	Hefty, Sean, Doug Ledford, Tom Talpey

On Tue, Jul 21, 2015 at 02:40:18PM +0300, Sagi Grimberg wrote:

> I'd imagine that the ULP knows when it registers huge-pages.
> OK, I can scan the scatterlist and check it.

At some point in the stack.. Not necessarily in the IB ULP though..

It is something that can be reviewed when the API is done, if it can
be sped up by shifting stuff around..

> >So it always uses 4096 on supported IB hardware and no huge page
> >support is enabled. This seems like the wrong way to use
> >page_size_cap...
> >
> >Hopefully moving SRP to your new API will fix that.
> 
> I have no plans in attempting the try to find the biggest aligned
> page_size that is supported by the device.

It might be worth doing if there is a single sg entry that is 'big',
but I don't think we have workloads like that today; someone else can
worry about that.. I'd probably just leave a note in a comment about
the possibility.

> I'm only going to check HUGE_PAGE, and if not aligned I'll use
> PAGE_SIZE and if that's not aligned - fail.

Seems reasonable.

Jason

     [not found]                                     ` <20150715183129.GC23588-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-07-15 18:50                                       ` Steve Wise
2015-07-15 19:09                                         ` Jason Gunthorpe
     [not found]                                           ` <20150715190947.GE23588-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-07-15 19:26                                             ` Steve Wise
2015-07-16  8:02                                       ` Christoph Hellwig
2015-07-15  7:32   ` Christoph Hellwig
     [not found]     ` <20150715073233.GA11535-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
2015-07-15  8:33       ` Sagi Grimberg
     [not found]         ` <55A61AE3.8020609-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2015-07-15  9:07           ` Christoph Hellwig
2015-07-15 19:15           ` Jason Gunthorpe
2015-07-15 17:07       ` Jason Gunthorpe
     [not found]         ` <20150715170750.GA23588-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-07-16 12:21           ` Sagi Grimberg
     [not found]             ` <55A7A1B0.5000808-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2015-07-16 18:08               ` Jason Gunthorpe
     [not found]                 ` <20150716180806.GC3680-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-07-19  5:33                   ` Sagi Grimberg
     [not found]                     ` <55AB36A4.1070102-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2015-07-20 16:23                       ` Jason Gunthorpe
     [not found]                         ` <20150720162340.GB18336-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-07-20 16:29                           ` Sagi Grimberg
2015-07-19  5:45   ` Sagi Grimberg
     [not found]     ` <55AB3976.7060202-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2015-07-20 16:18       ` Jason Gunthorpe
     [not found]         ` <20150720161821.GA18336-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-07-20 16:27           ` Sagi Grimberg
     [not found]             ` <55AD2188.50708-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2015-07-20 17:00               ` Jason Gunthorpe
     [not found]                 ` <20150720170033.GA20350-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-07-20 17:07                   ` Sagi Grimberg
     [not found]                     ` <55AD2AB4.8010209-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2015-07-20 19:50                       ` Jason Gunthorpe
     [not found]                         ` <20150720195027.GA24162-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-07-21 11:40                           ` Sagi Grimberg
     [not found]                             ` <55AE2FA2.3000601-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2015-07-21 16:00                               ` Jason Gunthorpe
