* Status of "ummunot" branch?
From: Jeff Squyres (jsquyres) @ 2013-05-28 17:51 UTC
  To: linux-rdma@vger.kernel.org

Roland --

I see a ummunot branch on your kernel tree at git.kernel.org (https://git.kernel.org/cgit/linux/kernel/git/roland/infiniband.git/log/?h=ummunot).

Just curious -- what's the status of this tree?  I ask because, as an MPI guy, I would *love* to see this stuff integrated into the kernel and libibverbs.

-- 
Jeff Squyres
jsquyres@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/


* Re: Status of "ummunot" branch?
From: Roland Dreier @ 2013-05-28 17:52 UTC
  To: Jeff Squyres (jsquyres); +Cc: linux-rdma@vger.kernel.org

On Tue, May 28, 2013 at 10:51 AM, Jeff Squyres (jsquyres) wrote:
> I see a ummunot branch on your kernel tree at git.kernel.org (https://git.kernel.org/cgit/linux/kernel/git/roland/infiniband.git/log/?h=ummunot).
>
> Just curious -- what's the status of this tree?  I ask because, as an MPI guy, I would *love* to see this stuff integrated into the kernel and libibverbs.

Haven't touched it in quite a while except to keep it building.  Needs
work to finish up.

* Re: Status of "ummunot" branch?
From: Jeff Squyres (jsquyres) @ 2013-05-28 18:30 UTC
  To: Roland Dreier; +Cc: linux-rdma@vger.kernel.org

On May 28, 2013, at 1:52 PM, Roland Dreier wrote:

> Haven't touched it in quite a while except to keep it building.  Needs
> work to finish up.

What kinds of things still need to be done?  (I don't know if we could work on this or not; just asking to scope out what would need to be done at this point.)

Has anything been done on the userspace side?

-- 
Jeff Squyres
jsquyres@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/


* Re: Status of "ummunot" branch?
From: Or Gerlitz @ 2013-05-29  8:53 UTC
  To: Jeff Squyres (jsquyres); +Cc: linux-rdma@vger.kernel.org

On Tue, May 28, 2013 at 8:51 PM, Jeff Squyres (jsquyres) wrote:

>  I ask because, as an MPI guy, I would *love* to see this stuff integrated into the kernel and libibverbs.


Hi Jeff,

Have you looked at ODP? See
https://www.openfabrics.org/resources/document-downloads/presentations/doc_download/568-on-demand-paging-for-user-space-networking.html

Or.

* Re: Status of "ummunot" branch?
From: Jeff Squyres (jsquyres) @ 2013-05-29 22:56 UTC
  To: Or Gerlitz; +Cc: linux-rdma@vger.kernel.org

On May 29, 2013, at 4:53 AM, Or Gerlitz wrote:

> Have you looked at ODP? See
> https://www.openfabrics.org/resources/document-downloads/presentations/doc_download/568-on-demand-paging-for-user-space-networking.html


Is this upstream?

Has this been run by the MPI implementor community?

The limitation of a max of 2 concurrent page faults seems fairly significant.

-- 
Jeff Squyres
jsquyres@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/


* Re: Status of "ummunot" branch?
From: Or Gerlitz @ 2013-05-30  5:09 UTC
  To: Jeff Squyres (jsquyres); +Cc: Or Gerlitz, linux-rdma@vger.kernel.org

On 30/05/2013 01:56, Jeff Squyres (jsquyres) wrote:
> On May 29, 2013, at 4:53 AM, Or Gerlitz wrote:
>
>> Have you looked at ODP? See
>> https://www.openfabrics.org/resources/document-downloads/presentations/doc_download/568-on-demand-paging-for-user-space-networking.html
>
> Is this upstream?

No

> Has this been run by the MPI implementor community?

The team that works on this here isn't ready for submission, so no
community runs have been made yet.


> The limitation of a max of 2 concurrent page faults seems fairly significant.
>

Let me check.

* Re: Status of "ummunot" branch?
From: Jeff Squyres (jsquyres) @ 2013-05-30 15:52 UTC
  To: Or Gerlitz; +Cc: Or Gerlitz, linux-rdma@vger.kernel.org

On May 30, 2013, at 1:09 AM, Or Gerlitz wrote:

>> Has this been run by the MPI implementor community?
> 
> The team that works on this here isn't ready for submission, so no community runs have been made yet.

If this is a solution to an MPI problem, it would seem like a good idea to run the specifics of this proposal by the MPI *implementor* community first (not *users*).

I say this because Mellanox also proposed the concept of a "shared send queue" as a solution to MPI RC scalability problems a while ago (around the time XRC first debuted, IIRC?), and the MPI community universally hated it.

-- 
Jeff Squyres
jsquyres@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/


* Re: Status of "ummunot" branch?
From: Jeff Squyres (jsquyres) @ 2013-06-04  1:24 UTC
  To: Or Gerlitz; +Cc: linux-rdma@vger.kernel.org

On May 29, 2013, at 1:53 AM, Or Gerlitz wrote:

> Have you looked at ODP? See
> https://www.openfabrics.org/resources/document-downloads/presentations/doc_download/568-on-demand-paging-for-user-space-networking.html


Is the idea behind ODP that, at the beginning of time, you register the entire memory space (i.e., NULL to 2^64) and then never worry about registered memory?

-- 
Jeff Squyres
jsquyres@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/


* Re: Status of "ummunot" branch?
From: Or Gerlitz @ 2013-06-04  8:37 UTC
  To: Haggai Eran; +Cc: Jeff Squyres (jsquyres), linux-rdma@vger.kernel.org

On 04/06/2013 04:24, Jeff Squyres (jsquyres) wrote:
> On May 29, 2013, at 1:53 AM, Or Gerlitz wrote:
>
>> Have you looked at ODP? See
>> https://www.openfabrics.org/resources/document-downloads/presentations/doc_download/568-on-demand-paging-for-user-space-networking.html
>
> Is the idea behind ODP that, at the beginning of time, you register the entire memory space (i.e., NULL to 2^64) and then never worry about registered memory?
>

Adding Haggai from the team that works on ODP. Haggai, Jeff also made a
comment in this thread (http://marc.info/?t=136976347600006&r=1&w=2)
that the limitation of a max of 2 concurrent page faults seems fairly
significant, which you might want to address too.

Or.

* Re: Status of "ummunot" branch?
From: Haggai Eran @ 2013-06-04  9:54 UTC
  To: Or Gerlitz; +Cc: Jeff Squyres (jsquyres), linux-rdma@vger.kernel.org

On 04/06/2013 11:37, Or Gerlitz wrote:
> On 04/06/2013 04:24, Jeff Squyres (jsquyres) wrote:
>> On May 29, 2013, at 1:53 AM, Or Gerlitz wrote:
>>
>>> Have you looked at ODP? See
>>> https://www.openfabrics.org/resources/document-downloads/presentations/doc_download/568-on-demand-paging-for-user-space-networking.html
>>>
>>
>> Is the idea behind ODP that, at the beginning of time, you register
>> the entire memory space (i.e., NULL to 2^64) and then never worry
>> about registered memory?
>>

We wish to get there eventually. In our current implementation you still
have to register an on-demand memory region explicitly. The difference
from a regular memory region is that the pages in the region aren't
pinned.
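
For illustration, a minimal sketch of what registering such an unpinned
region looks like from user space. The IBV_ACCESS_ON_DEMAND flag shown
is an assumption relative to this thread (it is the name the eventual
upstream ODP support settled on), not an API that existed at the time:

#include <stddef.h>
#include <infiniband/verbs.h>

struct ibv_mr *reg_odp_mr(struct ibv_pd *pd, void *buf, size_t len)
{
    /* Same ibv_reg_mr() verb as a pinned registration, plus an
     * on-demand access flag: the HCA may fault pages in as needed
     * instead of pinning them all at registration time. */
    return ibv_reg_mr(pd, buf, len,
                      IBV_ACCESS_LOCAL_WRITE |
                      IBV_ACCESS_REMOTE_READ |
                      IBV_ACCESS_REMOTE_WRITE |
                      IBV_ACCESS_ON_DEMAND);
}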

> 
> Adding Haggai from the team that works on ODP. Haggai, Jeff also made a
> comment over this thread http://marc.info/?t=136976347600006&r=1&w=2
> that a limitation of a max of 2 concurrent page faults seems fairly
> significant which you might want to address too.

We chose to support only 2 concurrent page faults per QP since this
allows us to maintain order between the QP's operations and the
user-space code using it.

Regards,
Haggai

* Re: Status of "ummunot" branch?
From: Jeff Squyres (jsquyres) @ 2013-06-04 10:56 UTC
  To: Haggai Eran; +Cc: Or Gerlitz, linux-rdma@vger.kernel.org

On Jun 4, 2013, at 2:54 AM, Haggai Eran wrote:

> We wish to get there eventually. In our current implementation you still
> have to register an on-demand memory region explicitly. The difference
> from a regular memory region is that the pages in the region aren't
> pinned.

Does this mean that an MPI implementation still has to register memory upon usage, and maintain its own registered memory cache?

> We chose to support only 2 concurrent page faults per QP since this
> allows us to maintain order between the QP's operations and the
> user-space code using it.


I talked to someone who was at the OpenFabrics workshop and saw the ODP presentation in person; he tells me that a fault will be incurred when a page is not in the HCA's TLB cache (vs. when a registered page is not in memory and must be swapped back in), and that this will trigger an RNR NAK.

Is this correct?

He was very concerned about the size of the TLB on the HCA, and therefore what the actual run-time behavior would be for sending around large messages via MPI -- i.e., would RDMA'ing 1GB messages now incur this HCA-must-reload-its-TLB-and-therefore-incur-RNR-NAKs behavior?

-- 
Jeff Squyres
jsquyres@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/


* Re: Status of "ummunot" branch?
From: Haggai Eran @ 2013-06-04 11:50 UTC
  To: Jeff Squyres (jsquyres); +Cc: Or Gerlitz, linux-rdma@vger.kernel.org

On 04/06/2013 13:56, Jeff Squyres (jsquyres) wrote:
> On Jun 4, 2013, at 2:54 AM, Haggai Eran wrote:
> 
>> We wish to get there eventually. In our current implementation you still
>> have to register an on-demand memory region explicitly. The difference
>> from a regular memory region is that the pages in the region aren't
>> pinned.
> 
> Does this mean that an MPI implementation still has to register memory upon usage, and maintain its own registered memory cache?
Yes. However, since registration doesn't pin memory, you can leave
registered memory regions in the cache for longer periods, and you can
register larger memory regions without needing to back them with
physical memory.

> 
>> We chose to support only 2 concurrent page faults per QP since this
>> allows us to maintain order between the QP's operations and the
>> user-space code using it.
> 
> 
> I talked to someone who was at the OpenFabrics workshop and saw the ODP presentation in person; he tells me that a fault will be incurred when a page is not in the HCA's TLB cache (vs. when a registered page is not in memory and must be swapped back in), and that this will trigger an RNR NAK.
> 
> Is this correct?

Our HCAs use their own page tables, in addition to a TLB cache. A miss
in the TLB cache that can be filled from the HCA's page tables will not
cause an RNR NAK, since the HCA can fill it relatively fast without the
help of the operating system. If the page is missing from the HCA's page
table though it will trigger a page fault and ask the OS to bring that
page. Since this might take longer, in these cases we send an RNR NAK.

> 
> He was very concerned about the size of the TLB on the HCA, and therefore what the actual run-time behavior would be for sending around large messages via MPI -- i.e., would RDMA'ing 1GB messages now incur this HCA-must-reload-its-TLB-and-therefore-incur-RNR-NAKs behavior?
> 
We have a mechanism to prefetch the pages needed for a large message
upon the first page fault, which can also help amortize the cost of
the page fault for larger messages.

Haggai

* Re: Status of "ummunot" branch?
From: Jason Gunthorpe @ 2013-06-04 17:04 UTC
  To: Haggai Eran
  Cc: Jeff Squyres (jsquyres), Or Gerlitz, linux-rdma@vger.kernel.org

On Tue, Jun 04, 2013 at 02:50:33PM +0300, Haggai Eran wrote:

> Our HCAs use their own page tables, in addition to a TLB cache. A miss
> in the TLB cache that can be filled from the HCA's page tables will not
> cause an RNR NAK, since the HCA can fill it relatively fast without the
> help of the operating system. If the page is missing from the HCA's page
> table though it will trigger a page fault and ask the OS to bring that
> page. Since this might take longer, in these cases we send an RNR NAK.

I also saw the presentation at the OFA conference and had several
questions..

So, my assumption:
 - There is a fast small TLB inside the HCA
 - There is a larger page table the HCA accesses inside the host
   memory

AFAIK, this is basically the construction we have today, and the
larger page table is expected to be fully populated.

Thus, I assume, on-demand allows pages that are 'absent' in the larger
page table to generate faults to the CPU?

So how does lifetime work here?

 - Can you populate the larger page table as soon as registration
   happens, relying on mmu notifier and HCA faults to keep it
   consistent?
 - After a fault happens are the faulted pages pinned? How does
   lifetime work here? What happens when the kernel wants to evict
   a page that has currently ongoing RDMA? What happens if user space
   munmaps something while the remote is doing RDMA to it?
 - If I recall the presentation, the fault-in operation was very slow,
   what is the cause for this?

> > He was very concerned about the size of the TLB on the HCA,
> > and therefore what the actual run-time behavior would be for
> > sending around large messages via MPI -- i.e., would RDMA'ing 1GB
> > messages now incur this
> > HCA-must-reload-its-TLB-and-therefore-incur-RNR-NAKs behavior?
> > 
> We have a mechanism to prefetch the pages needed for a large message
> upon the first page fault, which can also help amortize the cost of
> the page fault for larger messages.

My reaction was that a pre-fault WR is needed to make this performant.

But, I also don't fully understand why we need so many faults from the
HCA in the first place. If you've properly solved the lifetime issues
then the initial registration can meaningfully pre-initialize the page
table in many cases, and computing the physical address of a page
should not be so expensive.

Jason

* Re: Status of "ummunot" branch?
From: Jeff Squyres (jsquyres) @ 2013-06-04 20:13 UTC
  To: Haggai Eran; +Cc: Or Gerlitz, linux-rdma@vger.kernel.org

On Jun 4, 2013, at 4:50 AM, Haggai Eran wrote:

>> Does this mean that an MPI implementation still has to register memory upon usage, and maintain its own registered memory cache?
> Yes. However, since registration doesn't pin memory, you can leave
> registered memory regions in the cache for longer periods, and you can
> register larger memory regions without needing to back them with
> physical memory.

Hmm; I'm confused.  How does this fix the MPI-needs-to-intercept-freed-memory problem?

>> 
>>> We chose to support only 2 concurrent page faults per QP since this
>>> allows us to maintain order between the QP's operations and the
>>> user-space code using it.
>> 
>> 
>> I talked to someone who was at the OpenFabrics workshop and saw the ODP presentation in person; he tells me that a fault will be incurred when a page is not in the HCA's TLB cache (vs. when a registered page is not in memory and must be swapped back in), and that this will trigger an RNR NAK.
>> 
>> Is this correct?
> 
> Our HCAs use their own page tables, in addition to a TLB cache. A miss
> in the TLB cache that can be filled from the HCA's page tables will not
> cause an RNR NAK, since the HCA can fill it relatively fast without the
> help of the operating system. If the page is missing from the HCA's page
> table though it will trigger a page fault and ask the OS to bring that
> page. Since this might take longer, in these cases we send an RNR NAK.

Ok.

But the primary use case I care about is fixing the MPI-needs-to-intercept-freed-memory problem, and it doesn't sound like ODP fixes this.

>> He was very concerned about the size of the TLB on the HCA, and therefore what the actual run-time behavior would be for sending around large messages via MPI -- i.e., would RDMA'ing 1GB messages now incur this HCA-must-reload-its-TLB-and-therefore-incur-RNR-NAKs behavior?
>> 
> We have a mechanism to prefetch the pages needed for a large message
> upon the first page fault, which can also help amortize the cost of
> the page fault for larger messages.


Ok, thanks.

-- 
Jeff Squyres
jsquyres@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/


* Re: Status of "ummunot" branch?
From: Haggai Eran @ 2013-06-05  7:09 UTC
  To: Jason Gunthorpe
  Cc: Jeff Squyres (jsquyres), Or Gerlitz, linux-rdma@vger.kernel.org

On 04/06/2013 20:04, Jason Gunthorpe wrote:
> Thus, I assume, on-demand allows pages that are 'absent' in the larger
> page table to generate faults to the CPU?
Yes, that's correct.

> So how does lifetime work here?
> 
>  - Can you populate the larger page table as soon as registration
>    happens, relying on mmu notifier and HCA faults to keep it
>    consistent?
We prefer not to keep the entire page table in sync, since we want to
allow registration of larger portions of the virtual address space, and
much of that memory isn't needed by the HCA.

>  - After a fault happens are the faulted pages pinned?
After a page fault happens the faulted pages are mapped in using
get_user_pages, but they are immediately released.

> How does lifetime work here? What happens when the kernel wants to
> evict a page that has currently ongoing RDMA?
If the kernel tries to evict a page that has ongoing RDMA, the
driver will update the HCA before the kernel can free the page. If the
RDMA operation is still ongoing, it will trigger a page fault.
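
For illustration, a rough sketch (not the actual driver code) of how an
mmu notifier gives the driver that hook, using the kernel API of that
era; my_hca_invalidate() is a made-up placeholder:

#include <linux/mmu_notifier.h>

static void my_invalidate_range_start(struct mmu_notifier *mn,
                                      struct mm_struct *mm,
                                      unsigned long start,
                                      unsigned long end)
{
    /* Called before the kernel unmaps or frees pages in [start, end):
     * invalidate the matching HCA mappings and flush its cached
     * translations, so any in-flight access faults instead of
     * touching reused pages. */
    my_hca_invalidate(mm, start, end);
}

static const struct mmu_notifier_ops my_mn_ops = {
    .invalidate_range_start = my_invalidate_range_start,
};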

> What happens if user space munmaps something while the remote is
> doing RDMA to it?
We want to allow the user to register memory areas that are unmapped. We
only require that the user have some VMA backing the addresses used for
RDMA operations, during the course of these operations. If the user
munmaps something in the middle of an RDMA operation, this will trigger
a page fault, which will in turn close the QP doing the operation with
an error.

>  - If I recall the presentation, the fault-in operation was very slow,
>    what is the cause for this?
Page faults involve stopping the QP, reading the WQE to get the page
ranges needed, bringing the pages to memory using get_user_pages,
updating the HCA's page table (and flushing its caches) and resuming the
QP. With short messages, the commands sent to the device are dominant,
while with larger messages, get_user_pages becomes dominant.

> 
>>> He was very concerned about the size of the TLB on the HCA,
>>> and therefore what the actual run-time behavior would be for
>>> sending around large messages via MPI -- i.e., would RDMA'ing 1GB
>>> messages now incur this
>>> HCA-must-reload-its-TLB-and-therefore-incur-RNR-NAKs behavior?
>>>
>> We have a mechanism to prefetch the pages needed for a large message
>> upon the first page fault, which can also help amortize the cost of
>> the page fault for larger messages.
> 
> My reaction was that a pre-fault WR is needed to make this performant.
> 
> But, I also don't fully understand why we need so many faults from the
> HCA in the first place. If you've properly solved the lifetime issues
> then the initial registration can meaningfully pre-initialize the page
> table in many cases, and computing the physical address of a page
> should not be so expensive.

We have implemented a prefetching verb, but I think that in many cases,
with smart enough prefetching logic in the page fault handler, it won't
be needed.
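
For illustration, a sketch of what driving such a prefetch looks like
from user space. The ibv_advise_mr() call shown is the interface that
later landed in libibverbs, used here only as a stand-in for the
experimental verb being described:

#include <stdint.h>
#include <infiniband/verbs.h>

int prefetch_range(struct ibv_pd *pd, struct ibv_mr *mr,
                   void *addr, size_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)addr,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };

    /* Ask the driver to fault the pages in and map them on the HCA
     * before any WQE touches them, avoiding first-touch RNR NAKs. */
    return ibv_advise_mr(pd, IBV_ADVISE_MR_ADVICE_PREFETCH_WRITE,
                         IBV_ADVISE_MR_FLAG_FLUSH, &sge, 1);
}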

Haggai

* Re: Status of "ummunot" branch?
From: Haggai Eran @ 2013-06-05  7:14 UTC
  To: Jeff Squyres (jsquyres); +Cc: Or Gerlitz, linux-rdma@vger.kernel.org

On 04/06/2013 23:13, Jeff Squyres (jsquyres) wrote:
> On Jun 4, 2013, at 4:50 AM, Haggai Eran wrote:
> 
>>> Does this mean that an MPI implementation still has to register memory upon usage, and maintain its own registered memory cache?
>> Yes. However, since registration doesn't pin memory, you can leave
>> registered memory regions in the cache for longer periods, and you can
>> register larger memory regions without needing to back them with
>> physical memory.
> 
> Hmm; I'm confused.  How does this fix the MPI-needs-to-intercept-freed-memory problem?
Well, there is no problem if an application frees registered memory (in
an on-demand paging memory region) and that memory is returned to the
OS. The OS will invalidate these pages, and the HCA will no longer be
able to use them. This means that the registration cache doesn't have to
de-register memory immediately when it is freed.

Haggai

* Re: Status of "ummunot" branch?
From: Jeff Squyres (jsquyres) @ 2013-06-05 12:45 UTC
  To: Haggai Eran; +Cc: Or Gerlitz, linux-rdma@vger.kernel.org

On Jun 5, 2013, at 12:14 AM, Haggai Eran wrote:

>> Hmm; I'm confused.  How does this fix the MPI-needs-to-intercept-freed-memory problem?
> Well, there is no problem if an application frees registered memory (in
> an on-demand paging memory region) and that memory is returned to the
> OS. The OS will invalidate these pages, and the HCA will no longer be
> able to use them. This means that the registration cache doesn't have to
> de-register memory immediately when it is freed.


(must... resist... urge... to... throw... furniture...)

This is why features should not be introduced to solve MPI problems without an understanding of what the MPI problems are.  :-)  Please go talk to the Mellanox MPI team.

Forgive me for being frustrated; memory registration and all the pain that it entails was highlighted as ***the #1 problem*** by *5 major MPI implementations* at the Sonoma 2009 workshop (see https://www.openfabrics.org/resources/document-downloads/presentations/doc_download/301-mpi-update-and-requirements-panel-all-presentations.html, starting at slide 7 in the "openmpi" slide deck).  

Why don't we have something like ummunotify yet?
Why don't we have non-blocking memory registration yet?
...etc.

-- 
Jeff Squyres
jsquyres@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/


* Re: Status of "ummunot" branch?
From: Haggai Eran @ 2013-06-05 13:39 UTC
  To: Jeff Squyres (jsquyres)
  Cc: Or Gerlitz, linux-rdma@vger.kernel.org, Shachar Raindel

On 05/06/2013 15:45, Jeff Squyres (jsquyres) wrote:
> On Jun 5, 2013, at 12:14 AM, Haggai Eran wrote:
> 
>>> Hmm; I'm confused.  How does this fix the MPI-needs-to-intercept-freed-memory problem?
>> Well, there is no problem if an application frees registered memory (in
>> an on-demand paging memory region) and that memory is returned to the
>> OS. The OS will invalidate these pages, and the HCA will no longer be
>> able to use them. This means that the registration cache doesn't have to
>> de-register memory immediately when it is freed.
> 
> 
> (must... resist... urge... to... throw... furniture...)
(ducking and taking cover :-) )

> 
> This is why features should not be introduced to solve MPI problems without an understanding of what the MPI problems are.  :-)  Please go talk to the Mellanox MPI team.
> 
> Forgive me for being frustrated; memory registration and all the pain that it entails was highlighted as ***the #1 problem*** by *5 major MPI implementations* at the Sonoma 2009 workshop (see https://www.openfabrics.org/resources/document-downloads/presentations/doc_download/301-mpi-update-and-requirements-panel-all-presentations.html, starting at slide 7 in the "openmpi" slide deck).  
Perhaps I'm missing something, but I believe ODP deals with the first
two problems in the list (slide 8), even if it doesn't solve them
completely.

You no longer need to do dangerous tricks to catch free, munmap, sbrk.
As I explained above, these operations can work on an ODP MR without
allowing the HCA to use the invalidated mappings.

In the future we want to implement an implicit memory region covering
the entire process address space, thus eliminating the need for memory
registration almost completely (you might still want memory
registration, or memory windows, in order to control permissions of
remote operations).
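
For illustration, a sketch of what such an implicit registration could
look like; the NULL/SIZE_MAX convention is an assumption here (it is
the one the eventual upstream "implicit ODP" support adopted):

#include <stdint.h>
#include <infiniband/verbs.h>

struct ibv_mr *reg_whole_address_space(struct ibv_pd *pd)
{
    /* One MR covering the entire process address space; nothing is
     * pinned up front, and the HCA faults pages in on demand. */
    return ibv_reg_mr(pd, NULL, SIZE_MAX,
                      IBV_ACCESS_LOCAL_WRITE |
                      IBV_ACCESS_REMOTE_READ |
                      IBV_ACCESS_REMOTE_WRITE |
                      IBV_ACCESS_ON_DEMAND);
}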

We can also allow fork to work with our implementation. Copy-on-write
will work with ODP regions by invalidating the HCA's page tables before
modifying the pages to be read-only. A page fault from the HCA can then
refill the pages, or even break COW in case of a write.

> Why don't we have something like ummunotify yet?
I think that the problem we are trying to solve is better handled inside
the kernel. If you are going to change the HCA's memory mappings, you'd
have to go through the kernel anyway.

Haggai

* Re: Status of "ummunot" branch?
From: Jeff Squyres (jsquyres) @ 2013-06-05 16:53 UTC
  To: Haggai Eran
  Cc: Or Gerlitz, linux-rdma@vger.kernel.org, Shachar Raindel

On Jun 5, 2013, at 6:39 AM, Haggai Eran wrote:

> Perhaps I'm missing something, but I believe ODP deals with the first
> two problems in the list (slide 8), even if it doesn't solve them
> completely.

Unfortunately, it does not.  If we could register(0 ... 2^64) and never have to worry about registered memory, that might be cool (depending on how that actually works) -- more below.

See this blog post that describes the freed registered memory issue:

    http://blogs.cisco.com/performance/registered-memory-rma-rdma-and-mpi-implementations/

and consider the following valid user code:

a = malloc(x);    // a gets (va=0x100, pa=0x12345) back from malloc
MPI_Send(a, ...); // MPI registers 0x100 for len=x, and saves (0x100,x) in reg cache
free(a);
a = malloc(x);    // a gets (va=0x100, pa=0x98765) back from malloc
MPI_Send(a, ...); // MPI sees a=0x100 and thinks that it is already registered
// ...kaboom

In short, MPI has to intercept free/sbrk/whatever so that it can update its registration cache.
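
For illustration, a deliberately naive registration cache (all helper
names here are hypothetical) keyed only by virtual address, showing
exactly where the kaboom comes from when free/sbrk is not intercepted:

#include <stddef.h>
#include <infiniband/verbs.h>

struct reg_entry { void *va; size_t len; struct ibv_mr *mr; };
static struct reg_entry cache[64];

static struct ibv_mr *get_mr(struct ibv_pd *pd, void *va, size_t len)
{
    for (size_t i = 0; i < 64; i++)
        if (cache[i].mr && cache[i].va == va && cache[i].len >= len)
            /* The 2nd malloc() that reuses va hits here: the cached
             * MR still maps the *old* physical pages. */
            return cache[i].mr;

    struct ibv_mr *mr = ibv_reg_mr(pd, va, len, IBV_ACCESS_LOCAL_WRITE);
    for (size_t i = 0; i < 64; i++)
        if (!cache[i].mr) {
            cache[i] = (struct reg_entry){ va, len, mr };
            break;
        }
    return mr;
}

Intercepting free() lets the MPI library evict (and deregister) the
stale entry before the virtual address can be reused.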

> In the future we want to implement an implicit memory region covering
> the entire process address space, thus eliminating the need for memory
> registration almost completely (you might still want memory
> registration, or memory windows, in order to control permissions of
> remote operations).

This would be great, as long as it's fast, transparent, and has no subtle implementation effects (like causing additional RNR NAKs for pages that are still in memory, which, according to your descriptions, it sounds like it won't).

> We can also allow fork to work with our implementation. Copy-on-write
> will work with ODP regions by invalidating the HCA's page tables before
> modifying the pages to be read-only. A page fault from the HCA can then
> refill the pages, or even break COW in case of a write.

That would be cool, too.  fork() has been a continuing problem -- solving that problem would be wonderful.

If this ODP stuff becomes a new verb, it would be good:

- if these fork-fixing / register-infinite capabilities can be queried at run time (maybe on ibv_device_cap_flags?) so that ULPs can know to use this functionality
- if driver owners can get a heads up so that they can know to implement it

>> Why don't we have something like ummunotify yet?
> I think that the problem we are trying to solve is better handled inside
> the kernel. If you are going to change the HCA's memory mappings, you'd
> have to go through the kernel anyway.

If/when you allow registering all memory, then I think you're right -- the MPI-must-intercept-free/sbrk-whatever issue may go away (that's why I started this thread asking about register(0 .. 2^64)).  But without that, unless I'm missing something, I don't think it solves the MPI-must-catch-free-sbrk-etc. issues...?  And therefore, having some kind of ummunotify-like functionality as a verb would be a Very Good Thing.

-- 
Jeff Squyres
jsquyres@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/


* Re: Status of "ummunot" branch?
From: Jason Gunthorpe @ 2013-06-05 17:14 UTC
  To: Jeff Squyres (jsquyres)
  Cc: Haggai Eran, Or Gerlitz, linux-rdma@vger.kernel.org, Shachar Raindel

On Wed, Jun 05, 2013 at 04:53:48PM +0000, Jeff Squyres (jsquyres) wrote:
> On Jun 5, 2013, at 6:39 AM, Haggai Eran wrote:
> 
> > Perhaps I'm missing something, but I believe ODP deals with the first
> > two problems in the list (slide 8), even if it doesn't solve them
> > completely.
> 
> Unfortunately, it does not.  If we could register(0 ... 2^64) and
> never have to worry about registered memory, that might be cool
> (depending on how that actually works) -- more below.
> 
> See this blog post that describes the freed registered memory issue:
> 
>     http://blogs.cisco.com/performance/registered-memory-rma-rdma-and-mpi-implementations/
> 
> and consider the following valid user code:
> 
> a = malloc(x);    // a gets (va=0x100, pa=0x12345) back from malloc
> MPI_Send(a, ...); // MPI registers 0x100 for len=x, and saves (0x100,x) in reg cache
> free(a);
> a = malloc(x);    // a gets (va=0x100, pa=0x98765) back from malloc
> MPI_Send(a, ...); // MPI sees a=0x100 and thinks that it is already registered
> // ...kaboom
> 
> In short, MPI has to intercept free/sbrk/whatever so that it can
> update its registration cache.

ODP is supposed to completely solve this problem. The HCA's view and the
kernel's view of the virtual-to-physical mapping become 100% synchronized,
and there is no 'kaboom'. The kernel updates the HCA after the free,
and after the 2nd malloc to 100% match the current virtual memory map
in the process.

MPI still has to register the memory in the first place..

.. and somehow stuff has to be managed to avoid HCA page faults in
   common cases
.. and the feature must be discoverable
.. and and and ..

The biggest issue to me is going to be efficiently prefetching receive
buffers so that RNR acks are avoided in all common cases...

> solves the MPI-must-catch-free-sbrk-etc. issues...?  And therefore,
> having some kind of ummunotify-like functionality as a verb would be
> a Very Good Thing.

AFAIK the ummunotify user space API was nak'd by the core kernel
guys. I got the impression people thought it would be acceptable as an
RDMA API, not a general API. So it is waiting on someone to recast the
function within verbs to make progress...

Jason

* Re: Status of "ummunot" branch?
From: Jeff Squyres (jsquyres) @ 2013-06-05 18:10 UTC
  To: Jason Gunthorpe
  Cc: Haggai Eran, Or Gerlitz, linux-rdma@vger.kernel.org, Shachar Raindel

On Jun 5, 2013, at 10:14 AM, Jason Gunthorpe wrote:

>> a = malloc(x);    // a gets (va=0x100, pa=0x12345) back from malloc
>> MPI_Send(a, ...); // MPI registers 0x100 for len=x, and saves (0x100,x) in reg cache
>> free(a);
>> a = malloc(x);    // a gets (va=0x100, pa=0x98765) back from malloc
>> MPI_Send(a, ...); // MPI sees a=0x100 and thinks that it is already registered
>> // ...kaboom
> 
> ODP is supposed to completely solve this problem. The HCA's view and the
> kernel's view of the virtual-to-physical mapping become 100% synchronized,
> and there is no 'kaboom'. The kernel updates the HCA after the free,
> and after the 2nd malloc to 100% match the current virtual memory map
> in the process.

Are you saying that the 2nd malloc will magically be registered (with the new physical address)?

> AFAIK the ummunotify user space API was nak'd by the core kernel
> guys.

It was NAK'ed by Linus, saying "fix your own network stack; this is not needed in the general purpose part of the kernel" (remember that Roland initially developed this as a standalone, non-IB-related kernel module).  

> I got the impression people thought it would be acceptable as an
> RDMA API, not a general API. So it is waiting on someone to recast the
> function within verbs to make progress...

'zactly.  Roland has this ummunot branch in his git tree, where he is in the middle of incorporating this functionality from the original ummunotify standalone kernel module into libibverbs and ibcore.

I started this thread asking the status of that branch.

-- 
Jeff Squyres
jsquyres@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/


* Re: Status of "ummunot" branch?
From: Jason Gunthorpe @ 2013-06-05 18:18 UTC
  To: Jeff Squyres (jsquyres)
  Cc: Haggai Eran, Or Gerlitz, linux-rdma@vger.kernel.org, Shachar Raindel

On Wed, Jun 05, 2013 at 06:10:11PM +0000, Jeff Squyres (jsquyres) wrote:
> On Jun 5, 2013, at 10:14 AM, Jason Gunthorpe wrote:
> 
> >> a = malloc(x);    // a gets (va=0x100, pa=0x12345) back from malloc
> >> MPI_Send(a, ...); // MPI registers 0x100 for len=x, and saves (0x100,x) in reg cache
> >> free(a);
> >> a = malloc(x);    // a gets (va=0x100, pa=0x98765) back from malloc
> >> MPI_Send(a, ...); // MPI sees a=0x100 and thinks that it is already registered
> >> // ...kaboom
> > 
> > ODP is supposed to completely solve this problem. The HCA's view and the
> > kernel's view of the virtual-to-physical mapping become 100% synchronized,
> > and there is no 'kaboom'. The kernel updates the HCA after the free,
> > and after the 2nd malloc to 100% match the current virtual memory map
> > in the process.
> 
> Are you saying that the 2nd malloc will magically be registered
> (with the new physical address)?

Yes, that is the whole point.

ODP fundamentally fixes the *bug* where the HCA's view of process
memory can become inconsistent with the kernel's view.

'magically be registered' is the wrong way to think about it - the
registration of VA=0x100 is simply kept, and any change to the
underlying physical mapping of the VA is synchronized with the HCA.

> 'zactly.  Roland has this ummunot branch in his git tree, where he
> is in the middle of incorporating this functionality from the
> original ummunotify standalone kernel module into libibverbs and
> ibcore.

Right, this was discussed at the Enterprise Summit a few weeks
ago. I'm sure Roland would welcome patches...

Jason

* Re: Status of "ummunot" branch?
From: Jeff Squyres (jsquyres) @ 2013-06-05 18:45 UTC
  To: Jason Gunthorpe
  Cc: Haggai Eran, Or Gerlitz, linux-rdma@vger.kernel.org, Shachar Raindel

On Jun 5, 2013, at 11:18 AM, Jason Gunthorpe wrote:

>> Are you saying that the 2nd malloc will magically be registered
>> (with the new physical address)?
> 
> Yes, that is the whole point.

Interesting.

> ODP fundamentally fixes the *bug* where the HCA's view of process
> memory can become inconsistent with the kernel's view.

Hum.  I was under the impression that with today's code (i.e., not ODP), if you

a = malloc(N);
ibv_reg_mr(..., a, N, ...);
free(a);

(assuming that the memory actually left the process at free)

Then the relevant kernel verbs driver was notified, and would unregister that memory.  ...but I'm an MPI guy, not a kernel guy -- it seems like you're saying that my impression was wrong (which doesn't currently matter because we intercept free/sbrk and unregister such memory, anyway).

> 'magically be registered' is the wrong way to think about it - the
> registration of VA=0x100 is simply kept, and any change to the
> underlying physical mapping of the VA is synchronized with the HCA.

What happens if you:

a = malloc(N * page_size);
ibv_reg_mr(..., a, N * page_size, ...);
free(a);
// incoming RDMA arrives targeted at buffer a

Or if you:

a = malloc(N * page_size);
ibv_reg_mr(..., a, N * page_size, ...);
free(a);
a = malloc(N / 2 * page_size);
// incoming RDMA arrives targeted at buffer a that is of length (N*page_size)

It does seem quite odd, abstractly speaking, that a registration would survive a free/re-malloc (which is arguably a "different" buffer).

That being said, it still seems like MPI needs a registration cache.  It is several good steps forward if we don't need to intercept free/sbrk/whatever, but when MPI_Send(buf, ...) is invoked, we still have to check that the entire buf is registered.  If ibv_reg_mr(..., 0, 2^64, ...) was supported, that would obviate the entire need for registration caches.  That would be wonderful.

> Right, this was discussed at the Enterprise Summit a few weeks
> ago. I'm sure Roland would welcome patches...


That's why I asked at the beginning of this thread.  He didn't provide any details about what still needs to be done, though.  :-)

-- 
Jeff Squyres
jsquyres@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/


* Re: Status of "ummunot" branch?
From: Jason Gunthorpe @ 2013-06-05 19:05 UTC
  To: Jeff Squyres (jsquyres)
  Cc: Haggai Eran, Or Gerlitz, linux-rdma@vger.kernel.org, Shachar Raindel

On Wed, Jun 05, 2013 at 06:45:13PM +0000, Jeff Squyres (jsquyres) wrote:

> Hum.  I was under the impression that with today's code (i.e., not ODP), if you
> 
> a = malloc(N);
> ibv_reg_mr(..., a, N, ...);
> free(a);
> 
> (assuming that the memory actually left the process at free)
> 
> Then the relevant kernel verbs driver was notified, and would
> unregister that memory.  ...but I'm an MPI guy, not a kernel guy --
> it seems like you're saying that my impression was wrong (which
> doesn't currently matter because we intercept free/sbrk and
> unregister such memory, anyway).

Sadly no, what happens is that once you do ibv_reg_mr that 'HCA
virtual address' is forever tied to the physical memory under the
'process virtual address' *at that moment*.

So in the case above, RDMA can continue after the free, and it
continues to hit the same *physical* memory that it always hit, but
due to the free the process has lost access to that memory (the kernel
keeps the physical memory reserved for RDMA purposes until unreg
though).

This is fundamentally why you need to intercept mmap/munmap/sbrk - if
the process's VM mapping is changed through those syscalls then the
HCA's VM and the process VM becomes de-synchronized.

> > 'magically be registered' is the wrong way to think about it - the
> > registration of VA=0x100 is simply kept, and any change to the
> > underlying physical mapping of the VA is synchronized with the HCA.
> 
> What happens if you:
> 
> a = malloc(N * page_size);
> ibv_reg_mr(..., a, N * page_size, ...);
> free(a);
> // incoming RDMA arrives targeted at buffer a

Haggai should comment on this, but my impression/expectation was that
you'll get a remote protection fault.

> Or if you:
> 
> a = malloc(N * page_size);
> ibv_reg_mr(..., a, N * page_size, ...);
> free(a);
> a = malloc(N / 2 * page_size);
> // incoming RDMA arrives targeted at buffer a that is of length (N*page_size)

Again, I expect a remote protection fault.

Note, of course, that both of these cases are only true if the underlying
VM is manipulated in a way that leaves the pages unmapped (e.g.,
mmap/munmap, not free).

I would also assume that attempts to RDMA-write read-only pages
will protection-fault as well.

> It does seem quite odd, abstractly speaking, that a registration
> would survive a free/re-malloc (which is arguably a "different"
> buffer).

Not at all: the purpose of the registration is to allow access via
RDMA to a portion of the process's address space. The address space
doesn't change, but what it is mapped to can vary.

So - the ODP semantics make much more sense, so much so I'm not sure
we need an ODP flag at all, but that can be discussed when the patches
are proposed...

> That being said, it still seems like MPI needs a registration cache.
> It is several good steps forward if we don't need to intercept
> free/sbrk/whatever, but when MPI_Send(buf, ...) is invoked, we still
> have to check that the entire buf is registered.  If ibv_reg_mr(...,
> 0, 2^64, ...) was supported, that would obviate the entire need for
> registration caches.  That would be wonderful.

Yes, except that this shifts around where the registration overhead
ends up. Basically the HCA driver now has the registration cache you
had in MPI, and all the same overheads still exist. No free lunch
here :(

Haggai: A verb to resize a registration would probably be a helpful
step. MPI could maintain one registration that covers the sbrk
region and one registration that covers the heap, much easier than
searching tables and things.
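
For illustration, a sketch of what such a resize could look like; the
ibv_rereg_mr() interface shown is the one that later appeared in
libibverbs, used here as an assumption:

#include <stddef.h>
#include <infiniband/verbs.h>

/* Grow an existing registration so it keeps covering a heap that has
 * grown to new_len bytes starting at base. */
int grow_heap_mr(struct ibv_mr *mr, void *base, size_t new_len)
{
    return ibv_rereg_mr(mr, IBV_REREG_MR_CHANGE_TRANSLATION,
                        NULL /* keep PD */, base, new_len,
                        0 /* access unchanged */);
}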

Also bear in mind that all RDMA access protections will be disabled if
you register the entire process VM, the remote(s) can scribble/read
everything..

Jason

* Re: Status of "ummunot" branch?
       [not found]                                                                 ` <20130605190529.GA3044-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2013-06-06  2:58                                                                   ` Jeff Squyres (jsquyres)
  2013-06-06  5:52                                                                   ` Haggai Eran
  1 sibling, 0 replies; 40+ messages in thread
From: Jeff Squyres (jsquyres) @ 2013-06-06  2:58 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Haggai Eran, Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Shachar Raindel

On Jun 5, 2013, at 12:05 PM, Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:

>> It does seem quite odd, abstractly speaking, that a registration
>> would survive a free/re-malloc (which is arguably a "different"
>> buffer).
> 
> Not at all: the purpose of the registration is to allow access via
> RDMA to a portion of the process's address space. The address space
> doesn't change, but what it is mapped to can vary.

I still think it's really weird.  When I do this:

a = malloc(N);
ibv_reg_mr(..., a, N, ...);
free(a);
b = malloc(M);

If b just happens to be partially or wholly registered by some quirk of the malloc() system (i.e., some/all of the virtual address space in b happens to have been covered by a prior malloc/ibv_reg_mr)... that's just weird.

>> If ibv_reg_mr(...,
>> 0, 2^64, ...) was supported, that would obviate the entire need for
>> registration caches.  That would be wonderful.
> 
> Yes, except that this shifts around where the registration overhead
> ends up. Basically the HCA driver now has the registration cache you
> had in MPI, and all the same overheads still exist.

There are fewer verbs drivers than applications, right?

> Haggai: A verb to resize a registration would probably be a helpful
> step. MPI could maintain one registration that covers the sbrk
> region and one registration that covers the heap, much easier than
> searching tables and things.

If we still have to register buffers piecemeal, a non-blocking registration verb would be quite helpful.

> Also bear in mind that all RDMA access protections will be disabled if
> you register the entire process VM: the remote(s) can scribble over and
> read everything...


No problem for MPI/HPC...  :-)

-- 
Jeff Squyres
jsquyres-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Status of "ummunot" branch?
       [not found]                                                                 ` <20130605190529.GA3044-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2013-06-06  2:58                                                                   ` Jeff Squyres (jsquyres)
@ 2013-06-06  5:52                                                                   ` Haggai Eran
       [not found]                                                                     ` <51B023B9.9050000-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  1 sibling, 1 reply; 40+ messages in thread
From: Haggai Eran @ 2013-06-06  5:52 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jeff Squyres (jsquyres),
	Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Shachar Raindel

On 05/06/2013 22:05, Jason Gunthorpe wrote:
> On Wed, Jun 05, 2013 at 06:45:13PM +0000, Jeff Squyres (jsquyres) wrote:
>> What happens if you:
>>
>> a = malloc(N * page_size);
>> ibv_reg_mr(..., a, N * page_size, ...);
>> free(a);
>> // incoming RDMA arrives targeted at buffer a
> 
> Haggai should comment on this, but my impression/expectation was that
> you'll get a remote protection fault.
> 
>> Or if you:
>>
>> a = malloc(N * page_size);
>> ibv_reg_mr(..., a, N * page_size, ...);
>> free(a);
>> a = malloc(N / 2 * page_size);
>> // incoming RDMA arrives targeted at buffer a that is of length (N*page_size)
> 
> Again, I expect a remote protection fault.
> 
> Note, of course, that both of these cases hold only if the underlying
> VM is manipulated in a way that unmaps the pages (eg mmap/munmap, not
> free).

That's right. If pages are unmapped and a remote operation tries to
access them, the QP will be closed with a protection error.

> 
> I would also assume that attempts to RDMA-write read-only pages
> protection fault as well.
Right.

> Haggai: A verb to resize a registration would probably be a helpful
> step. MPI could maintain one registration that covers the sbrk
> region and one registration that covers the heap, much easier than
> searching tables and things.

That's a nice idea. Even without this verb, though, I think it is
possible to develop a registration cache that covers those regions.
When you find that some part of your region is not registered, you can
register a new, larger region that covers everything you need. For new
operations you only use the newer region. Once the previous, smaller
region is no longer used, you de-register it.
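
Roughly, something like this sketch - all names are invented, and a
real cache would have to defer the dereg until in-flight operations
drain:

#include <stddef.h>
#include <infiniband/verbs.h>

struct reg_cache {
    struct ibv_mr *mr;       /* current covering registration */
    char          *base;
    size_t         len;
    int            inflight; /* operations still using this MR */
};

static struct ibv_mr *cache_cover(struct reg_cache *c, struct ibv_pd *pd,
                                  char *buf, size_t len)
{
    if (c->mr && buf >= c->base && buf + len <= c->base + c->len)
        return c->mr;                        /* already covered */

    /* Grow: one new MR spanning both the old region and the new
     * buffer (assumes the union is fully mapped, or an ODP MR). */
    char *lo = (c->mr && c->base < buf) ? c->base : buf;
    char *hi = (c->mr && c->base + c->len > buf + len) ?
                c->base + c->len : buf + len;
    struct ibv_mr *bigger = ibv_reg_mr(pd, lo, hi - lo,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_WRITE);

    if (!bigger)
        return NULL;
    if (c->mr && c->inflight == 0)
        ibv_dereg_mr(c->mr);   /* old, smaller MR no longer in use */
    c->mr   = bigger;
    c->base = lo;
    c->len  = hi - lo;
    return c->mr;
}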

What do you think?

Haggai

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Status of "ummunot" branch?
       [not found]                                                                     ` <51B023B9.9050000-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2013-06-06 23:33                                                                       ` Jeff Squyres (jsquyres)
       [not found]                                                                         ` <EF66BBEB19BADC41AC8CCF5F684F07FC4F66B79C-nsZYYkk5h5QQ2GdVW7+PtKBKnGwkPULj@public.gmane.org>
  0 siblings, 1 reply; 40+ messages in thread
From: Jeff Squyres (jsquyres) @ 2013-06-06 23:33 UTC (permalink / raw)
  To: Haggai Eran
  Cc: Jason Gunthorpe, Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Shachar Raindel

On Jun 5, 2013, at 10:52 PM, Haggai Eran <haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:

>> Haggai: A verb to resize a registration would probably be a helpful
>> step. MPI could maintain one registration that covers the sbrk
>> region and one registration that covers the heap, much easier than
>> searching tables and things.
> 
> That's a nice idea. Even without this verb, though, I think it is
> possible to develop a registration cache that covers those regions.
> When you find that some part of your region is not registered, you can
> register a new, larger region that covers everything you need. For new
> operations you only use the newer region. Once the previous, smaller
> region is no longer used, you de-register it.


I'm not sure what you mean.  Are you saying I should do something like this:

MPI_Init() {
// the first MPI function invoked
  mpi_sbrk_save = sbrk();
  ibv_reg_mr(..., 0, mpi_sbrk_save, ...);
  ...
}

MPI_Send(buffer, ...) {
  if (mpi_sbrk_save != sbrk()) {
      mpi_sbrk_save = sbrk();
      ibv_rereg_mr(..., 0, mpi_sbrk_save, ...);
  }
  ...
}

I don't think this covers other memory regions, like those added via mmap, right?

-- 
Jeff Squyres
jsquyres-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Status of "ummunot" branch?
       [not found]                                                                         ` <EF66BBEB19BADC41AC8CCF5F684F07FC4F66B79C-nsZYYkk5h5QQ2GdVW7+PtKBKnGwkPULj@public.gmane.org>
@ 2013-06-07 22:59                                                                           ` Jeff Squyres (jsquyres)
       [not found]                                                                             ` <EF66BBEB19BADC41AC8CCF5F684F07FC4F66E403-nsZYYkk5h5QQ2GdVW7+PtKBKnGwkPULj@public.gmane.org>
  0 siblings, 1 reply; 40+ messages in thread
From: Jeff Squyres (jsquyres) @ 2013-06-07 22:59 UTC (permalink / raw)
  To: Haggai Eran
  Cc: Jason Gunthorpe, Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Shachar Raindel

On Jun 6, 2013, at 4:33 PM, Jeff Squyres (jsquyres) <jsquyres-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org> wrote:

> I don't think this covers other memory regions, like those added via mmap, right?


We talked about this at the MPI Forum this week; it doesn't seem like ODP fixes any MPI problems.

1. MPI still has to have a memory registration cache, because ibv_reg_mr(0...sbrk()) doesn't cover the stack or mmap'ed memory, etc.

2. MPI still has to intercept (at least) munmap().

3. Having mmap/malloc/etc. return "new" memory that may already be registered because of a prior memory registration and subsequent munmap/free/etc. is just plain weird.  Worse, if we re-register it, ref counts could grow such that the actual registration never expires until the process dies (which could lead to processes with abnormally large memory footprints, because they never actually let go of memory, since it is still registered).

4. Even if MPI checks the value of sbrk() and re-registers (0...sbrk()) when sbrk() increases, this would seem to create a lot of work for the kernel -- which is both slow and synchronous.  Example:

a = malloc(5GB);
MPI_Send(a, 1, MPI_CHAR, ...); // MPI sends 1 byte

Then the MPI_Send of 1 byte will have to pay the cost of registering 5GB of new memory.

-----

Unless we've misunderstood this (and there's definitely a chance that we have!), it doesn't sound like ODP solves anything for MPI.  Especially since HPC applications almost never swap (in fact, swap is usually disabled in HPC environments).

What MPI wants is:

1. verbs for ummunotify-like functionality
2. non-blocking memory registration verbs; poll the cq to know when it has completed

-- 
Jeff Squyres
jsquyres-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Status of "ummunot" branch?
       [not found]                                                                             ` <EF66BBEB19BADC41AC8CCF5F684F07FC4F66E403-nsZYYkk5h5QQ2GdVW7+PtKBKnGwkPULj@public.gmane.org>
@ 2013-06-07 23:57                                                                               ` Jason Gunthorpe
       [not found]                                                                                 ` <20130607235731.GA25942-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 40+ messages in thread
From: Jason Gunthorpe @ 2013-06-07 23:57 UTC (permalink / raw)
  To: Jeff Squyres (jsquyres)
  Cc: Haggai Eran, Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Shachar Raindel

On Fri, Jun 07, 2013 at 10:59:43PM +0000, Jeff Squyres (jsquyres) wrote:

> > I don't think this covers other memory regions, like those added via mmap, right?
>  
> We talked about this at the MPI Forum this week; it doesn't seem
> like ODP fixes any MPI problems.

ODP without 'register all address space' changes the nature of the
problem, and fixes only one problem.

You do need to cache registrations, and all the tuning parameters (how
much do I cache, how long do I hold it for, etc, etc) all still apply.

What goes away (is fixed) is the need for intercepts and the need to
purge address space from the cache because the backing registration
has become non-coherent/invalid. Registrations are always
coherent/valid with ODP.

This cache, and the associated optimization problem, can never go
away. With a 'register all of memory' semantic the cache can move into
the kernel, but the performance implication and overheads are all
still present, just migrated.

> 2. MPI still has to intercept (at least) munmap().

Curious to know what for? 

If you want to prune registrations (ie to reduce memory footprint),
this can be done lazily at any time (eg in a background thread or
something). Read /proc/self/maps and purge all the registrations
pointing to unmapped memory. Similar to garbage collection.
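
For example, the sweep might test each cached range like this (a
sketch only; walking the MR list is left out):

#include <stdio.h>

/* Returns nonzero if [lo, hi) still overlaps any mapping in
 * /proc/self/maps. Conservative: keeps the MR if the file can't
 * be read. */
static int range_mapped(unsigned long lo, unsigned long hi)
{
    FILE *f = fopen("/proc/self/maps", "r");
    char line[256];
    int found = 0;

    if (!f)
        return 1;
    while (fgets(line, sizeof(line), f)) {
        unsigned long start, end;
        if (sscanf(line, "%lx-%lx", &start, &end) == 2 &&
            start < hi && end > lo) {
            found = 1;
            break;
        }
    }
    fclose(f);
    return found;
}

The GC pass would then ibv_dereg_mr() every cached MR for which
range_mapped() returns 0.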

There is no harm in keeping a registration for a long period, except
for the memory footprint in the kernel.

> 3. Having mmap/malloc/etc. return "new" memory that may already be
> registered because of a prior memory registration and subsequent
> munmap/free/etc. is just plain weird.  Worse, if we re-register it,
> ref counts could grow such that the actual registration never
> expires until the process dies (which could lead to processes with
> abnormally large memory footprints, because they never actually let
> go of memory, since it is still registered).

This is entirely on the registration cache implementation to sort
out; there are lots of performance/memory trade-offs.

It is only weird when you think about it in terms of buffers. Memory
registration has to do with address space, not buffers.

> What MPI wants is:
> 
> 1. verbs for ummunotify-like functionality
> 2. non-blocking memory registration verbs; poll the cq to know when it has completed

To me, ODP with an additional 'register all address space' semantic, plus
an asynchronous prefetch does both of these for you.

1. ummunotify functionality and caching is now in the kernel, under
   ODP. RDMA access to an 'all of memory' registration always does the
   right thing.
2. asynchronous prefetch (eg as a work request) triggers ODP and
   kernel actions to ready a subset of memory for RDMA, including
   all the work that memory registration does today (get_user_pages,
   COW break, etc)
   
Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: Status of "ummunot" branch?
       [not found]                                                                                 ` <20130607235731.GA25942-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2013-06-10  9:17                                                                                   ` Liran Liss
  2013-06-10 14:49                                                                                   ` Jeff Squyres (jsquyres)
  1 sibling, 0 replies; 40+ messages in thread
From: Liran Liss @ 2013-06-10  9:17 UTC (permalink / raw)
  To: Jason Gunthorpe, Jeff Squyres (jsquyres)
  Cc: Haggai Eran, Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Shachar Raindel

Here are a few more clarifications:

1) ODP MRs can cover address ranges that do not have a mapping at registration time.

This means that MPI can register in advance, say, the lower GBs of the address space, covering malloc's primary arena.
Thus, there is no need to adjust to each increase in sbrk().

Similarly, you can register the stack region up to the maximum size of the stack.
The stack can grow and shrink, and ODP will always use the current mapping.

2) Virtual addresses covered by an ODP MR must have a valid mapping when they are accessed (during send/receive WQE processing or as a target of an RDMA/atomic operation).
So, Jeff, the only thing you need to make sure of is that you don't free() a buffer that you have posted but haven't yet gotten a completion for - but I guess that this is something that you already do... :)

For example, in the following scenario:
a. reg_mr(first GB of the address space)

b. p = malloc()
c. post_send(p)
d. poll for completion
e. free(p)

f. p = malloc()
g. post_send(p)
h. poll for completion
i. free(p)

(c) may incur a page fault (if not pre-fetched or faulted-in by another thread).
(e) happens after the completion, so it is guaranteed that (c), when processed by HW, uses the correct application buffer with the current virt-to-phys mapping (at HW access time).

The reallocation may or may not change the virtual-to-physical mappings.
The message may or may not be paged out (ODP does not hold a reference on the page).
In any case, when (g) is processed, it always uses the current mapping.
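
For instance, step (a) could look like the sketch below. I'm hedging
on the exact spelling of the ODP access flag (written here as
IBV_ACCESS_ON_DEMAND); pd is an existing protection domain:

#include <infiniband/verbs.h>

struct ibv_mr *reg_low_gb(struct ibv_pd *pd)
{
    /* Under ODP, pages in [0, 1GB) need not be mapped at
     * registration time; faults are resolved on access. */
    return ibv_reg_mr(pd, (void *)0, 1UL << 30,
                      IBV_ACCESS_LOCAL_WRITE |
                      IBV_ACCESS_REMOTE_READ |
                      IBV_ACCESS_REMOTE_WRITE |
                      IBV_ACCESS_ON_DEMAND);
}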

--Liran

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Status of "ummunot" branch?
       [not found]                                                                                 ` <20130607235731.GA25942-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2013-06-10  9:17                                                                                   ` Liran Liss
@ 2013-06-10 14:49                                                                                   ` Jeff Squyres (jsquyres)
       [not found]                                                                                     ` <EF66BBEB19BADC41AC8CCF5F684F07FC4F676E59-nsZYYkk5h5QQ2GdVW7+PtKBKnGwkPULj@public.gmane.org>
  1 sibling, 1 reply; 40+ messages in thread
From: Jeff Squyres (jsquyres) @ 2013-06-10 14:49 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Haggai Eran, Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Shachar Raindel

On Jun 7, 2013, at 4:57 PM, Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:

>> We talked about this at the MPI Forum this week; it doesn't seem
>> like ODP fixes any MPI problems.
> 
> ODP without 'register all address space' changes the nature of the
> problem, and fixes only one problem.

I agree that pushing all registration issues out of the application and (somewhere) into the verbs stack would be a nice solution.

> You do need to cache registrations, and all the tuning parameters (how
> much do I cache, how long do I hold it for, etc, etc) all still apply.
> 
> What goes away (is fixed) is the need for intercepts and the need to
> purge address space from the cache because the backing registration
> has become non-coherent/invalid. Registrations are always
> coherent/valid with ODP.

> This cache, and the associated optimization problem, can never go
> away. With a 'register all of memory' semantic the cache can move into
> the kernel, but the performance implication and overheads are all
> still present, just migrated.

Good summary; and you corrected some of my mistakes -- thanks.

That being said, everyone I've talked to about ODP finds it very, very strange that the kernel would keep memory registrations around for memory that is no longer part of a process.  Not only does it lead to the "new memory is magically already registered" semantic that I find weird, it's just plain *odd* for the kernel to maintain state for something that doesn't exist any more.  It feels dirty.

Sidenote: I was just informed today that the current way MPI implementations implement registration cache coherence (glibc malloc hooks) has been deprecated and will be removed from glibc (http://sourceware.org/ml/libc-alpha/2011-05/msg00103.html).  This really puts on the pressure to find a new / proper solution.

>> What MPI wants is:
>> 
>> 1. verbs for ummunotify-like functionality
>> 2. non-blocking memory registration verbs; poll the cq to know when it has completed
> 
> To me, ODP with an additional 'register all address space' semantic, plus
> an asynchronous prefetch does both of these for you.
> 
> 1. ummunotify functionality and caching is now in the kernel, under
>   ODP. RDMA access to an 'all of memory' registration always does the
>   right thing.

"Register all address space" is the moral equivalent of not having userspace registration, so let's talk about it in those terms.  Specifically, there's a subtle difference between:

a) telling verbs to register (0...2^64) 
   --> Which is weird because it tells verbs to register memory that isn't in my address space
b) telling verbs that the app doesn't want to handle registration
   --> How that gets implemented is not important (from userspace's point of view) -- if the kernel chooses to implement that by registering non-existent memory, that's the kernel's problem

I guess I'm arguing that registering non-existent memory is not the Right Thing.

Regardless of what solution is devised for registered memory management (ummunotify, ODP, or something else), a non-blocking verb for registering memory would still be a Very Useful Thing.

-- 
Jeff Squyres
jsquyres-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: Status of "ummunot" branch?
       [not found]                                                                                     ` <EF66BBEB19BADC41AC8CCF5F684F07FC4F676E59-nsZYYkk5h5QQ2GdVW7+PtKBKnGwkPULj@public.gmane.org>
@ 2013-06-10 15:56                                                                                       ` Liran Liss
       [not found]                                                                                         ` <D554B471892C914E90E136467281724DAD695B50-fViJhHBwANKuSA5JZHE7gA@public.gmane.org>
  2013-06-10 17:26                                                                                       ` Jason Gunthorpe
  1 sibling, 1 reply; 40+ messages in thread
From: Liran Liss @ 2013-06-10 15:56 UTC (permalink / raw)
  To: Jeff Squyres (jsquyres), Jason Gunthorpe
  Cc: Haggai Eran, Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Shachar Raindel

> -----Original Message-----
> From: linux-rdma-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org [mailto:linux-rdma-
> owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org] On Behalf Of Jeff Squyres (jsquyres)
> Sent: Monday, June 10, 2013 5:50 PM
> To: Jason Gunthorpe
> Cc: Haggai Eran; Or Gerlitz; linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; Shachar Raindel
> Subject: Re: Status of "ummunot" branch?
> 
> On Jun 7, 2013, at 4:57 PM, Jason Gunthorpe
> <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
> 
> >> We talked about this at the MPI Forum this week; it doesn't seem like
> >> ODP fixes any MPI problems.
> >
> > ODP without 'register all address space' changes the nature of the
> > problem, and fixes only one problem.
> 
> I agree that pushing all registration issues out of the application and
> (somewhere) into the verbs stack would be a nice solution.
> 
> > You do need to cache registrations, and all the tuning parameters (how
> > much do I cache, how long do I hold it for, etc, etc) all still apply.
> >
> > What goes away (is fixed) is the need for intercepts and the need to
> > purge address space from the cache because the backing registration
> > has become non-coherent/invalid. Registrations are always
> > coherent/valid with ODP.
> 
> > This cache, and the associated optimization problem, can never go
> > away. With a 'register all of memory' semantic the cache can move into
> > the kernel, but the performance implication and overheads are all
> > still present, just migrated.
> 
> Good summary; and you corrected some of my mistakes -- thanks.
> 
> That being said, everyone I've talked to about ODP finds it very, very strange
> that the kernel would keep memory registrations around for memory that is
> no longer part of a process.  Not only does it lead to the "new memory is
> magically already registered" semantic that I find weird, it's just plain *odd*
> for the kernel to maintain state for something that doesn't exist any more.  It
> feels dirty.
> 
> Sidenote: I was just informed today that the current way MPI
> implementations implement registration cache coherence (glibc malloc
> hooks) has been deprecated and will be removed from glibc
> (http://sourceware.org/ml/libc-alpha/2011-05/msg00103.html).  This really
> puts on the pressure to find a new / proper solution.
> 
> >> What MPI wants is:
> >>
> >> 1. verbs for ummunotify-like functionality 2. non-blocking memory
> >> registration verbs; poll the cq to know when it has completed
> >
> > To me, ODP with an additional 'register all address space' semantic,
> > plus an asynchronous prefetch does both of these for you.
> >
> > 1. ummunotify functionality and caching is now in the kernel, under
> >   ODP. RDMA access to an 'all of memory' registration always does the
> >   right thing.
> 
> "Register all address space" is the moral equivalent of not having userspace
> registration, so let's talk about it in those terms.  Specifically, there's a subtle
> difference between:
> 
> a) telling verbs to register (0...2^64)
>    --> Which is weird because it tells verbs to register memory that isn't in my
> address space


Another way to look at it is "specify IO access permissions" for address space ranges.
This could be useful to implement a buffer pool to be used for a specific MR only, yet still map/unmap memory within this pool on the fly to optimize physical memory utilization.
In this case, you would provide smaller ranges than 2^64...


> b) telling verbs that the app doesn't want to handle registration
>    --> How that gets implemented is not important (from userspace's point of
> view) -- if the kernel chooses to implement that by registering non-existent
> memory, that's the kernel's problem
> 
> I guess I'm arguing that registering non-existent memory is not the Right
> Thing.
> 
> Regardless of what solution is devised for registered memory management
> (ummunotify, ODP, or something else), a non-blocking verb for registering
> memory would still be a Very Useful Thing.
> 
> --
> Jeff Squyres
> jsquyres-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the
> body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at
> http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Status of "ummunot" branch?
       [not found]                                                                                     ` <EF66BBEB19BADC41AC8CCF5F684F07FC4F676E59-nsZYYkk5h5QQ2GdVW7+PtKBKnGwkPULj@public.gmane.org>
  2013-06-10 15:56                                                                                       ` Liran Liss
@ 2013-06-10 17:26                                                                                       ` Jason Gunthorpe
       [not found]                                                                                         ` <20130610172627.GC2391-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  1 sibling, 1 reply; 40+ messages in thread
From: Jason Gunthorpe @ 2013-06-10 17:26 UTC (permalink / raw)
  To: Jeff Squyres (jsquyres)
  Cc: Haggai Eran, Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Shachar Raindel

On Mon, Jun 10, 2013 at 02:49:24PM +0000, Jeff Squyres (jsquyres) wrote:
> On Jun 7, 2013, at 4:57 PM, Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
> 
> >> We talked about this at the MPI Forum this week; it doesn't seem
> >> like ODP fixes any MPI problems.
> > 
> > ODP without 'register all address space' changes the nature of the
> > problem, and fixes only one problem.
> 
> I agree that pushing all registration issues out of the application
> and (somewhere) into the verbs stack would be a nice solution.

Well, it creates a mess in another sense, because now you've lost
context. When your MPI goes to do a 1-byte send the kernel may well
prefetch a few megabytes of page tables, whereas an implementation in
userspace still has the context and can say, no, I don't need that...

Maybe a prefetch WR can restore the lost context, dunno...

> That being said, everyone I've talked to about ODP finds it very,
> very strange that the kernel would keep memory registrations around
> for memory that is no longer part of a process.  Not only does it

MRs are badly named. They are not 'memory registrations'. They are
'address registrations'. If you don't conflate address === memory in
your head, then it doesn't seem weird :)

The memory the address space points to is flexible.

The address space is tied to the lifetime of the process.

It doesn't matter if there is no memory mapped to the address space,
the address space is still there.

Liran had a good example. You can register address space and then use
mmap/munmap/MAP_FIXED to mess around with where it points to.

A practical example of using this would be to avoid the need to send
scatter buffer pointers to the remote. The remote writes into a memory
ring and the ring is made 'endless' by clever use of remapping.

> "Register all address space" is the moral equivalent of not having
> userspace registration, so let's talk about it in those terms.
> Specifically, there's a subtle difference between:
> 
> a) telling verbs to register (0...2^64) 
> b) telling verbs that the app doesn't want to handle registration

I agree, a verb to do 'B' is a cleaner choice than trying to cram this
kind of API into A...

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Status of "ummunot" branch?
       [not found]                                                                                         ` <D554B471892C914E90E136467281724DAD695B50-fViJhHBwANKuSA5JZHE7gA@public.gmane.org>
@ 2013-06-12 21:10                                                                                           ` Jeff Squyres (jsquyres)
       [not found]                                                                                             ` <EF66BBEB19BADC41AC8CCF5F684F07FC4F6808D7-nsZYYkk5h5QQ2GdVW7+PtKBKnGwkPULj@public.gmane.org>
  0 siblings, 1 reply; 40+ messages in thread
From: Jeff Squyres (jsquyres) @ 2013-06-12 21:10 UTC (permalink / raw)
  To: Liran Liss
  Cc: Jason Gunthorpe, Haggai Eran, Or Gerlitz,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Shachar Raindel

On Jun 10, 2013, at 11:56 AM, Liran Liss <liranl-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:

>> "Register all address space" is the moral equivalent of not having userspace
>> registration, so let's talk about it in those terms.  Specifically, there's a subtle
>> difference between:
>> 
>> a) telling verbs to register (0...2^64)
>>   --> Which is weird because it tells verbs to register memory that isn't in my
>> address space
> 
> Another way to look at it is "specify IO access permissions" for address space ranges.
> This could be useful to implement a buffer pool to be used for a specific MR only, yet still map/unmap memory within this pool on the fly to optimize physical memory utilization.
> In this case, you would provide smaller ranges than 2^64...


Hmm; I'm not sure I understand.

Userspace doesn't control what virtual addresses it gets back from mmap/etc.  So how is what you're talking about different from regular/reactive memory registration? (vs. pre-emptively registering a whole pile of memory that doesn't exist yet)

Specifically: I'm confused because you said you could (preemptively) register some small regions (that assumedly don't yet exist in your virtual memory address space) and use them as memory pools.  But given that userspace doesn't control its virtual address ranges, I'm not sure how that's useful.

-- 
Jeff Squyres
jsquyres-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Status of "ummunot" branch?
       [not found]                                                                                             ` <EF66BBEB19BADC41AC8CCF5F684F07FC4F6808D7-nsZYYkk5h5QQ2GdVW7+PtKBKnGwkPULj@public.gmane.org>
@ 2013-06-12 21:17                                                                                               ` Jason Gunthorpe
       [not found]                                                                                                 ` <20130612211742.GA8625-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 40+ messages in thread
From: Jason Gunthorpe @ 2013-06-12 21:17 UTC (permalink / raw)
  To: Jeff Squyres (jsquyres)
  Cc: Liran Liss, Haggai Eran, Or Gerlitz,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Shachar Raindel

On Wed, Jun 12, 2013 at 09:10:57PM +0000, Jeff Squyres (jsquyres) wrote:

> > Another way to look at it is "specify IO access permissions" for
> > address space ranges.  This could be useful to implement a buffer
> > pool to be used for a specific MR only, yet still map/unmap memory
> > within this pool on the fly to optimize physical memory
> > utilization.  In this case, you would provide smaller ranges than
> > 2^64...

> Hmm; I'm not sure I understand.
> 
> Userspace doesn't control what virtual addresses it gets back from
> mmap/etc.  

Yes, it can, via MAP_FIXED. There are lots of fun tricks you can play
using that.
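
One such trick, as a sketch (the fixed hint address is arbitrary and
platform-dependent, so treat this as illustrative only):

#include <sys/mman.h>

static void *claim_fixed_range(void)
{
    /* Reserve 1GB of address space without committing memory. The
     * kernel may ignore the hint, so 'base' comes from the return. */
    void *base = mmap((void *)0x100000000000UL, 1UL << 30, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE,
                      -1, 0);
    if (base == MAP_FAILED)
        return NULL;

    /* Commit the first 1MB at a known, fixed address; MAP_FIXED
     * atomically replaces the PROT_NONE reservation there. */
    return mmap(base, 1UL << 20, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
}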

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Status of "ummunot" branch?
       [not found]                                                                                         ` <20130610172627.GC2391-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2013-06-12 21:18                                                                                           ` Jeff Squyres (jsquyres)
       [not found]                                                                                             ` <EF66BBEB19BADC41AC8CCF5F684F07FC4F680A2B-nsZYYkk5h5QQ2GdVW7+PtKBKnGwkPULj@public.gmane.org>
  0 siblings, 1 reply; 40+ messages in thread
From: Jeff Squyres (jsquyres) @ 2013-06-12 21:18 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Haggai Eran, Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Shachar Raindel

On Jun 10, 2013, at 1:26 PM, Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:

>> I agree that pushing all registration issues out of the application
>> and (somewhere) into the verbs stack would be a nice solution.
> 
> Well, it creates a mess in another sense, because now you've lost
> context. When your MPI goes to do a 1-byte send the kernel may well
> prefetch a few megabytes of page tables, whereas an implementation in
> userspace still has the context and can say, no, I don't need that...

It seems like there are Big Problems on either side of this problem (userspace and kernel).

I thought that ummunotify was a good balance between the two -- MPI kept its registration caches (which are annoying, but we have long since understood that *someone* has to maintain them), but it gets a bulletproof way to keep them coherent.  That is what is missing in today's solutions: bulletproofness (plus we have to use the horrid glibc malloc hooks, which are deprecated and are going away).

>> That being said, everyone I've talked to about ODP finds it very,
>> very strange that the kernel would keep memory registrations around
>> for memory that is no longer part of a process.  Not only does it
> 
> MRs are badly named. They are not 'memory registrations'. They are
> 'address registrations'. If you don't conflate address === memory in
> your head, then it doesn't seem weird :)
> 
> The memory the address space points to is flexible.
> 
> The address space is tied to the lifetime of the process.
> 
> It doesn't matter if there is no memory mapped to the address space,
> the address space is still there.
> 
> Liran had a good example. You can register address space and then use
> mmap/munmap/MAP_FIXED to mess around with where it points to

...but this is not how people write applications.  Real apps use malloc (and some direct mmap, and perhaps even some shared memory).  They don't pay attention to the contiguiousness (is that a word?) of memory/addresses in the large scale.  To be clear: the most tightly bound codes *do* actually care about cache hits and locality, but that's in the small scale -- not in the large scale.  I would find it hard to believe that a real code would pay attention to where in its address range a given malloc() returns, for example.

*That's* what makes this whole concept weird.

It seems like this is a perfect kernel space concept, but is quite foreign to userspace developers.

> A practical example of using this would be to avoid the need to send
> scatter buffer pointers to the remote. The remote writes into a memory
> ring and the ring is made 'endless' by clever use of remapping.

I don't understand -- please explain your example a bit more...?

-- 
Jeff Squyres
jsquyres-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Status of "ummunot" branch?
       [not found]                                                                                             ` <EF66BBEB19BADC41AC8CCF5F684F07FC4F680A2B-nsZYYkk5h5QQ2GdVW7+PtKBKnGwkPULj@public.gmane.org>
@ 2013-06-12 21:47                                                                                               ` Jason Gunthorpe
       [not found]                                                                                                 ` <20130612214708.GD8625-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 40+ messages in thread
From: Jason Gunthorpe @ 2013-06-12 21:47 UTC (permalink / raw)
  To: Jeff Squyres (jsquyres)
  Cc: Haggai Eran, Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Shachar Raindel

On Wed, Jun 12, 2013 at 09:18:34PM +0000, Jeff Squyres (jsquyres) wrote:

> > Well, it creates a mess in another sense, because now you've lost
> > context. When your MPI goes to do a 1byte send the kernel may well
> > prefetch a few megabytes of page tables, whereas an implementation in
> > userspace still has the context and can say, no I don't need that..
> 
> It seems like there are Big Problems on either side of this problem
> (userspace and kernel).
> 
> I thought that ummunotify was a good balance between the two -- MPI
> kept its registration caches (which are annoying, but we have
> long since understood that *someone* has to maintain them), but it
> gets a bulletproof way to keep them coherent.  That is what is
> missing in today's solutions: bulletproofness (plus we have to use
> the horrid glibc malloc hooks, which are deprecated and are going
> away).

Ditto.

Someone has to finish the ummunotify rewrite Roland
started. Realistically, MPI is going to be the only user; can someone
from the MPI world do this?
 
> > It doesn't matter if there is no memory mapped to the address space,
> > the address space is still there.
> > 
> > Liran had a good example. You can register address space and then use
> > mmap/munmap/MAP_FIXED to mess around with where it points to
> 
> ...but this is not how people write applications.  Real apps use
> malloc (and some direct mmap, and perhaps even some shared memory).

*shrug* I used MAP_FIXED for some RDMA regions in my IB verbs apps,
specifically to create specialized high-performance memory
structures.

It isn't a general purpose technique for non-RDMA apps - but
especially when combined with ODP it is useful in some places.

> > A practical example of using this would be to avoid the need to send
> > scatter buffer pointers to the remote. The remote writes into a memory
> > ring and the ring is made 'endless' by clever use of remapping.
> 
> I don't understand -- please explain your example a bit more...?

You have a memory pool.

There are two mappings to this physical memory, one for the CPU to
use, one for RDMA to use.

The RDMA mapping is a linear ring, the remote just spews linearly via
RDMA WRITE.

When messages arrive the CPU xlates the RDMA ring virtual address to
the CPU address, and accesses the memory from there.

It then finds a free block in the memory pool and remaps it into the
RDMA pool and tells the remote that there is more free memory.

From the perspective of the remote this creates an endless, apparently
linear, ring.

When the CPU is done with its memory it adds it back to the free
block pool.

At the start of time the RDMA ring maps 1:1 to the CPU pool.  As xfers
happen the RDMA ring maps non-linearly depending on when the CPU is
done with the memory.

There are lots of details to make this work, but you avoid sending s/g
lists, and generally make communication more asynchronous.

s/g lists are expensive. A 1GB ring requires nearly 2MB to describe
with s/g lists, and a 40Gb NIC can turn that ring over 4 times per
second!

You can do something similar with sends, but sends have to
pre-size buffers, whereas this scheme lets you send any size message
with optimal memory usage.
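
The remap step in that scheme could be as small as this sketch
(pool_fd, block_off, and the ring window are invented bookkeeping;
the pool would live in a shared file or POSIX shm object):

#include <sys/types.h>
#include <sys/mman.h>

/* Map the pool block at 'block_off' into the ring window at 'ring_va',
 * atomically replacing whatever the window mapped before - that
 * replacement is what makes the ring look endless to the remote. */
static void *remap_into_ring(int pool_fd, off_t block_off,
                             void *ring_va, size_t block_len)
{
    return mmap(ring_va, block_len, PROT_READ | PROT_WRITE,
                MAP_SHARED | MAP_FIXED, pool_fd, block_off);
}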

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Status of "ummunot" branch?
       [not found]                                                                                                 ` <20130612211742.GA8625-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2013-06-14 22:48                                                                                                   ` Jeff Squyres (jsquyres)
  0 siblings, 0 replies; 40+ messages in thread
From: Jeff Squyres (jsquyres) @ 2013-06-14 22:48 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liran Liss, Haggai Eran, Or Gerlitz,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Shachar Raindel

On Jun 12, 2013, at 5:17 PM, Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:

> Yes, it can, via MAP_FIXED. There are lots of fun tricks you can play
> using that.


You're missing the point.

Normal users (i.e., MPI users) don't do that.  They call malloc() and they get what they get.

The whole point of upper-layer APIs is that they hide all the network stuff from the application programmer.  Verbs is *hard* for the mere mortal to program.  MPI can do a great deal to hide the complexities of verbs from app developers, but one major concession that MPI (intentionally) made is that the *application provides the buffer*, not MPI.

Hence, we're stuck with what buffers the user passes in.

This is the root of the whole "MPI has a registration cache" issue.

-- 
Jeff Squyres
jsquyres-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Status of "ummunot" branch?
       [not found]                                                                                                 ` <20130612214708.GD8625-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2013-06-14 22:53                                                                                                   ` Jeff Squyres (jsquyres)
       [not found]                                                                                                     ` <EF66BBEB19BADC41AC8CCF5F684F07FC4F6886C8-nsZYYkk5h5QQ2GdVW7+PtKBKnGwkPULj@public.gmane.org>
  0 siblings, 1 reply; 40+ messages in thread
From: Jeff Squyres (jsquyres) @ 2013-06-14 22:53 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Haggai Eran, Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Shachar Raindel

On Jun 12, 2013, at 5:47 PM, Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:

> Someone has to finish the ummunotify rewrite Roland
> started. Realistically, MPI is going to be the only user; can someone
> from the MPI world do this?

1. I tried to ask what needed to be done at the beginning of this thread and didn't get much of an answer.

2. We've (all) been asking for this functionality *for years*; I even helped with the first implementation.  Can't the verbs community finish it?  :-)  MPI is probably your biggest customer, after all...

>> ...but this is not how people write applications.  Real apps use
>> malloc (and some direct mmap, and perhaps even some shared memory).
> 
> *shrug* I used MAP_FIXED for some RDMA regions in my IB verbs apps,
> specifically to create specialized high-performance memory
> structures.

But you're not a chemist writing Fortran code to effect n-body simulations.

The target audience for MPI is scientists and engineers who are not (and should not be) network / systems developers.  They're focusing on their formulae and applications -- as they should be.

> It isn't a general purpose technique for non-RDMA apps - but
> especially when combined with ODP it is useful in some places.

I have no doubt that ODP solves problems for someone.  It just doesn't seem to solve the very-long-standing MPI issues with verbs and registration caches.

-- 
Jeff Squyres
jsquyres-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Status of "ummunot" branch?
       [not found]                                                                                                     ` <EF66BBEB19BADC41AC8CCF5F684F07FC4F6886C8-nsZYYkk5h5QQ2GdVW7+PtKBKnGwkPULj@public.gmane.org>
@ 2013-06-14 23:11                                                                                                       ` Jason Gunthorpe
  0 siblings, 0 replies; 40+ messages in thread
From: Jason Gunthorpe @ 2013-06-14 23:11 UTC (permalink / raw)
  To: Jeff Squyres (jsquyres)
  Cc: Haggai Eran, Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Shachar Raindel

On Fri, Jun 14, 2013 at 10:53:24PM +0000, Jeff Squyres (jsquyres) wrote:
> On Jun 12, 2013, at 5:47 PM, Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
> 
> > Someone has to finish the ummunotify rewrite Roland
> > started. Realistically, MPI is going to be the only user; can someone
> > from the MPI world do this?
> 
> 1. I tried to ask what needed to be done at the beginning of this
> thread and didn't get much of an answer.

AFAIK, only Roland knows the state of his rewrite; hopefully he will
comment... I told you everything I know about what happened with the
last attempt :|

> 2. We've (all) been asking for this functionality *for years*; I
> even helped with the first implementation.  Can't the verbs
> community finish it?  :-) MPI is probably your biggest customer,
> after all...

There isn't much of a verbs community, to be honest. Unless you can
get a vendor to commit resources to the project, I wouldn't expect
much.

> I have no doubt that ODP solves problems for someone.  It just
> doesn't seem to solve the very-long-standing MPI issues with verbs
> and registration caches.

As we've discussed, it helps write a registration cache.

The 'All of address space' variant that Haggai mentioned should also
be very interesting.

But this is all future stuff.

FWIW, I agree with you that ummunot is something MPI really needs to
function.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2013-06-14 23:11 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-05-28 17:51 Status of "ummunot" branch? Jeff Squyres (jsquyres)
2013-05-28 17:52 ` Roland Dreier
2013-05-28 18:30   ` Jeff Squyres (jsquyres)
2013-05-29  8:53 ` Or Gerlitz
2013-05-29 22:56   ` Jeff Squyres (jsquyres)
2013-05-30  5:09     ` Or Gerlitz
2013-05-30 15:52       ` Jeff Squyres (jsquyres)
2013-06-04  1:24   ` Jeff Squyres (jsquyres)
2013-06-04  8:37     ` Or Gerlitz
2013-06-04  9:54       ` Haggai Eran
2013-06-04 10:56         ` Jeff Squyres (jsquyres)
2013-06-04 11:50           ` Haggai Eran
2013-06-04 17:04             ` Jason Gunthorpe
2013-06-05  7:09               ` Haggai Eran
2013-06-04 20:13             ` Jeff Squyres (jsquyres)
2013-06-05  7:14               ` Haggai Eran
2013-06-05 12:45                 ` Jeff Squyres (jsquyres)
2013-06-05 13:39                   ` Haggai Eran
2013-06-05 16:53                     ` Jeff Squyres (jsquyres)
2013-06-05 17:14                       ` Jason Gunthorpe
2013-06-05 18:10                         ` Jeff Squyres (jsquyres)
2013-06-05 18:18                           ` Jason Gunthorpe
2013-06-05 18:45                             ` Jeff Squyres (jsquyres)
2013-06-05 19:05                               ` Jason Gunthorpe
2013-06-06  2:58                                 ` Jeff Squyres (jsquyres)
2013-06-06  5:52                                 ` Haggai Eran
2013-06-06 23:33                                   ` Jeff Squyres (jsquyres)
2013-06-07 22:59                                     ` Jeff Squyres (jsquyres)
2013-06-07 23:57                                       ` Jason Gunthorpe
2013-06-10  9:17                                         ` Liran Liss
2013-06-10 14:49                                         ` Jeff Squyres (jsquyres)
2013-06-10 15:56                                           ` Liran Liss
2013-06-12 21:10                                             ` Jeff Squyres (jsquyres)
2013-06-12 21:17                                               ` Jason Gunthorpe
2013-06-14 22:48                                                 ` Jeff Squyres (jsquyres)
2013-06-10 17:26                                           ` Jason Gunthorpe
2013-06-12 21:18                                             ` Jeff Squyres (jsquyres)
2013-06-12 21:47                                               ` Jason Gunthorpe
2013-06-14 22:53                                                 ` Jeff Squyres (jsquyres)
2013-06-14 23:11                                                   ` Jason Gunthorpe
