linux-kernel.vger.kernel.org archive mirror
* Re: [PATCH v1 for-next 00/16] On demand paging
       [not found] ` <5405D2D8.1040700@mellanox.com>
@ 2014-09-03 20:21   ` Or Gerlitz
       [not found]     ` <CAOha14xthZHSpS_T+XRgZcPqwaZvtMw0iGTzKjTyjdBuLhJ4Eg@mail.gmail.com>
  0 siblings, 1 reply; 8+ messages in thread
From: Or Gerlitz @ 2014-09-03 20:21 UTC (permalink / raw)
  To: Roland Dreier; +Cc: linux-rdma, Greg Kroah-Hartman, Sagi Grimberg, linux-kernel

On Tue, Sep 2, 2014, Or Gerlitz <ogerlitz@mellanox.com> wrote:
> On 7/3/2014 11:44 AM, Haggai Eran wrote:
>>
>> Hi Roland,
>>
>> I understand that you were reluctant to review these patches as long as
>> there was an ongoing debate on whether or not the i_mmap_mutex should be
>> changed into a spinlock.
>>
>> It seems that the debate concluded with the decision to change it into a
>> rwsem [1], as apparently this provides the optimal performance with the new
>> optimistic spinning patch [2].
>>
>> I believe this means that there will be no problem adding paging support
>> to the RDMA stack that depends on sleepable MMU notifiers.
>
>
> Hi Roland,
>
> The ODP patch set was initially posted a full six months ago (March 2nd,
> 2014). We posted it prior to LSF so you could discuss it with Sagi while
> he was there, yet there has been no comment from your side so far. It's
> really (really) hard to do proper kernel development when the sub-system
> maintainer provides almost no concrete feedback over half a year.
>
> Can you please go ahead and tell us your position regarding these
> features/patches?

Hi Roland,

Bump. Can you comment here? These patches have been worked on for a long
time by a dedicated group and implement a strategic feature for the RDMA
industry.
I don't see how the RDMA kernel maintainer can leave the development team
up in the air without any comment on their work for half a year.

Or.


>> Changes from V0: http://marc.info/?l=linux-rdma&m=139375790322547&w=2
>>
>> - Rebased against latest upstream / for-next branch.
>> - Removed dependency on patches that were accepted upstream.
>> - Removed pre-patches that were accepted upstream [3].
>> - Add extended uverb call for querying device (patch 1) and use kernel
>>   device attributes to report ODP capabilities through the new uverb
>>   entry instead of having a special verb.
>> - Allow upgrading page access permissions during page faults.
>> - Minor fixes to issues that came up during regression testing of the
>>   patches.
>>
>> The following set of patches implements on-demand paging (ODP) support
>> in the RDMA stack and in the mlx5_ib Infiniband driver.
>>
>> What is on-demand paging?
>>
>> Applications register memory with an RDMA adapter using system calls,
>> and subsequently post IO operations that refer to the corresponding
>> virtual addresses directly to HW. Until now, this was achieved by
>> pinning the memory during the registration calls. The goal of on demand
>> paging is to avoid pinning the pages of registered memory regions (MRs).
>> This will allow users the same flexibility they get when swapping any
>> other part of their process's address space. Instead of requiring the
>> entire MR to fit in physical memory, we can allow the MR to be larger,
>> and only keep the current working set in physical memory.
>>
>> This can make programming with RDMA much simpler. Today, developers that
>> are working with more data than their RAM can hold need either to
>> deregister and reregister memory regions throughout their process's
>> life, or keep a single memory region and copy the data to it. On demand
>> paging will allow these developers to register a single MR at the
>> beginning of their process's life, and let the operating system manage
>> which pages need to be fetched at a given time. In the future, we might
>> be able to provide a single memory access key for each process that
>> would expose the entire process's address space as one large memory
>> region, and developers wouldn't need to register memory regions at all.
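As a rough illustration of the registration model this enables, a user-space
application could register one large pageable MR up front and reuse it. The
IBV_ACCESS_ON_DEMAND flag name below is an assumption that mirrors the
kernel-side IB_ACCESS_ON_DEMAND flag discussed later in this thread; it is
not defined by this cover letter.

#include <infiniband/verbs.h>

/* Sketch only: register a large buffer as an on-demand-paging MR.
 * IBV_ACCESS_ON_DEMAND is assumed to mirror the kernel's
 * IB_ACCESS_ON_DEMAND access flag; without it, ibv_reg_mr() keeps the
 * traditional pinned behaviour, so existing applications are unaffected. */
static struct ibv_mr *register_odp_mr(struct ibv_pd *pd, void *buf, size_t len)
{
	return ibv_reg_mr(pd, buf, len,
			  IBV_ACCESS_LOCAL_WRITE |
			  IBV_ACCESS_REMOTE_WRITE |
			  IBV_ACCESS_ON_DEMAND);
}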
>>
>> How do page faults generally work?
>>
>> With pinned memory regions, the driver would map the virtual addresses
>> to bus addresses, and pass these addresses to the HCA to associate them
>> with the new MR. With ODP, the driver is now allowed to mark some of the
>> pages in the MR as not-present. When the HCA attempts to perform memory
>> access for a communication operation, it notices the page is not
>> present, and raises a page fault event to the driver. In addition, the
>> HCA performs whatever operation is required by the transport protocol to
>> suspend communication until the page fault is resolved.
>>
>> Upon receiving the page fault interrupt, the driver first needs to know
>> on which virtual address the page fault occurred, and on what memory
>> key. When handling send/receive operations, this information is inside
>> the work queue. The driver reads the needed work queue elements, and
>> parses them to gather the address and memory key. For other RDMA
>> operations, the event generated by the HCA only contains the virtual
>> address and rkey, as there are no work queue elements involved.
>>
>> Having the rkey, the driver can find the relevant memory region in its
>> data structures, and calculate the actual pages needed to complete the
>> operation. It then uses get_user_pages to bring the needed pages back
>> into memory, obtains DMA mappings, and passes the addresses to the HCA.
>> Finally, the driver notifies the HCA it can continue operation on the
>> queue pair that encountered the page fault. The pages that
>> get_user_pages returned are unpinned immediately by releasing their
>> reference.
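The flow just described can be condensed into a rough, driver-agnostic
sketch. Every type and helper name below is a placeholder standing in for a
step in the text above, not a symbol from the patches.

/* Pseudocode sketch of the page-fault service flow described above;
 * all names are placeholders for the steps in the text. */
static void service_odp_page_fault(struct fault_event *ev)
{
	u64 addr;
	u32 key;
	struct odp_mr *mr;
	int npages;

	/* 1. Recover the faulting address and memory key: parse the WQE
	 *    for send/receive work, or take them from the event for other
	 *    RDMA operations. */
	get_fault_address_and_key(ev, &addr, &key);

	/* 2. Find the MR behind the key and compute which pages the
	 *    operation touches. */
	mr = find_mr_by_key(key);
	npages = pages_covering(mr, addr, ev->length);

	/* 3. Bring the pages in with get_user_pages(), DMA-map them and
	 *    push the new mappings to the HCA. */
	fault_in_and_map(mr, addr, npages);

	/* 4. Tell the HCA to resume the stalled QP, then drop the page
	 *    references so the pages stay unpinned. */
	resume_queue_pair(ev->qp);
	put_page_references(mr, addr, npages);
}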
>>
>> How are invalidations handled?
>>
>> The patches add infrastructure to subscribe the RDMA stack as an mmu
>> notifier client [4]. Each process that uses ODP registers a notifier
>> client. Page invalidation notifications are passed to the mlx5_ib
>> driver, which updates the HCA with new, not-present mappings. Only
>> after flushing the HCA's page table caches does the notifier return,
>> allowing the kernel to release the pages.
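A condensed sketch of that notifier pattern is below, using the in-kernel
mmu_notifier callbacks of the kernels this series targets; struct
odp_context and the two odp_* helpers are placeholders for the
mlx5_ib-specific work, not functions from the patches.

#include <linux/mmu_notifier.h>

/* Sketch: the callback must not return until the HCA can no longer
 * access the affected pages, hence the synchronous flush. */
struct odp_context {
	struct mmu_notifier mn;
	/* ... per-process ODP state ... */
};

static void odp_invalidate_range_start(struct mmu_notifier *mn,
				       struct mm_struct *mm,
				       unsigned long start, unsigned long end)
{
	struct odp_context *ctx = container_of(mn, struct odp_context, mn);

	/* Mark the affected HCA page-table entries not-present ... */
	odp_update_mappings_not_present(ctx, start, end);
	/* ... and flush the HCA's page-table caches before returning, so
	 * that the kernel may then safely release the pages. */
	odp_flush_hca_pte_caches(ctx, start, end);
}

/* Each process using ODP registers one notifier on its mm. */
static const struct mmu_notifier_ops odp_mn_ops = {
	.invalidate_range_start	= odp_invalidate_range_start,
};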
>>
>> What operations are supported?
>>
>> Currently only send, receive and RDMA write operations are supported on
>> the RC transport, as well as send operations on the UD transport. We hope
>> to implement support for other transports and operations in the future.
>>
>> The structure of the patchset
>>
>> Patches 1-6:
>> The first set of patches adds page fault support to the IB core layer,
>> allowing MRs to be registered without their pages being pinned. Patch 1
>> adds an extended verb to query device attributes, and patch 2 adds
>> capability bits, configuration options, and a method for querying the
>> paging capabilities from user-space. The next two patches (3-4) make
>> some necessary changes to the ib_umem type. Patches 5 and 6 add paging
>> support and invalidation support respectively.
>>
>> Patches 7-12:
>> This set of patches adds small pieces of new functionality to the mlx5
>> driver and builds toward paging support. Patch 7 makes changes to the
>> UMR mechanism (an internal mechanism used by mlx5 to update device page
>> mappings). Patch 8 adds infrastructure support for page fault handling
>> to the mlx5_core module. Patch 9 queries the device for paging
>> capabilities, and patch 11 adds a function to do partial device page
>> table updates. Finally, patch 12 adds a helper function to read
>> information from user-space work queues in the driver's context.
>>
>> Patches 13-16:
>> The final part of this patch set adds paging support to the mlx5 driver
>> itself. Patch 13 adds to mlx5_ib the infrastructure to handle page faults
>> coming from mlx5_core. Patch 14 adds the code to handle UD send page
>> faults and RC send and receive page faults. Patch 15 adds support for
>> page faults caused by RDMA write operations, and patch 16 adds
>> invalidation support to the mlx5 driver, allowing pages to be unmapped
>> dynamically.
>>
>> [1] [PATCH 0/5] mm: i_mmap_mutex to rwsem
>>      https://lkml.org/lkml/2013/6/24/683
>>
>> [2] Re: Performance regression from switching lock to rw-sem for anon-vma
>> tree
>>      https://lkml.org/lkml/2013/6/17/452
>>
>> [3] pre-patches that were accepted upstream:
>>    a74d241 IB/mlx5: Refactor UMR to have its own context struct
>>    48fea83 IB/mlx5: Set QP offsets and parameters for user QPs and not
>> just for kernel QPs
>>    b475598 mlx5_core: Store MR attributes in mlx5_mr_core during creation
>> and after UMR
>>    8605933 IB/mlx5: Add MR to radix tree in reg_mr_callback
>>
>> [4] Integrating KVM with the Linux Memory Management (presentation),
>>      Andrea Arcangeli
>>
>> http://www.linux-kvm.org/wiki/images/3/33/KvmForum2008%24kdf2008_15.pdf
>>
>>
>> Haggai Eran (11):
>>    IB/core: Add an extended user verb to query device attributes
>>    IB/core: Replace ib_umem's offset field with a full address
>>    IB/core: Add umem function to read data from user-space
>>    IB/mlx5: Enhance UMR support to allow partial page table update
>>    net/mlx5_core: Add support for page faults events and low level
>>      handling
>>    IB/mlx5: Implement the ODP capability query verb
>>    IB/mlx5: Changes in memory region creation to support on-demand
>>      paging
>>    IB/mlx5: Add mlx5_ib_update_mtt to update page tables after creation
>>    IB/mlx5: Add function to read WQE from user-space
>>    IB/mlx5: Page faults handling infrastructure
>>    IB/mlx5: Handle page faults
>>
>> Sagi Grimberg (1):
>>    IB/core: Add flags for on demand paging support
>>
>> Shachar Raindel (4):
>>    IB/core: Add support for on demand paging regions
>>    IB/core: Implement support for MMU notifiers regarding on demand
>>      paging regions
>>    IB/mlx5: Add support for RDMA write responder page faults
>>    IB/mlx5: Implement on demand paging by adding support for MMU
>>      notifiers
>>
>>   drivers/infiniband/Kconfig                     |  11 +
>>   drivers/infiniband/core/Makefile               |   1 +
>>   drivers/infiniband/core/umem.c                 |  63 +-
>>   drivers/infiniband/core/umem_odp.c             | 620 ++++++++++++++++++++
>>   drivers/infiniband/core/umem_rbtree.c          |  94 +++
>>   drivers/infiniband/core/uverbs.h               |   1 +
>>   drivers/infiniband/core/uverbs_cmd.c           | 170 ++++--
>>   drivers/infiniband/core/uverbs_main.c          |   5 +-
>>   drivers/infiniband/hw/amso1100/c2_provider.c   |   2 +-
>>   drivers/infiniband/hw/ehca/ehca_mrmw.c         |   2 +-
>>   drivers/infiniband/hw/ipath/ipath_mr.c         |   2 +-
>>   drivers/infiniband/hw/mlx5/Makefile            |   1 +
>>   drivers/infiniband/hw/mlx5/main.c              |  39 +-
>>   drivers/infiniband/hw/mlx5/mem.c               |  67 ++-
>>   drivers/infiniband/hw/mlx5/mlx5_ib.h           | 114 +++-
>>   drivers/infiniband/hw/mlx5/mr.c                | 303 ++++++++--
>>   drivers/infiniband/hw/mlx5/odp.c               | 770 +++++++++++++++++++++++++
>>   drivers/infiniband/hw/mlx5/qp.c                | 198 +++++--
>>   drivers/infiniband/hw/nes/nes_verbs.c          |   4 +-
>>   drivers/infiniband/hw/ocrdma/ocrdma_verbs.c    |   2 +-
>>   drivers/infiniband/hw/qib/qib_mr.c             |   2 +-
>>   drivers/net/ethernet/mellanox/mlx5/core/eq.c   |  11 +-
>>   drivers/net/ethernet/mellanox/mlx5/core/fw.c   |  35 +-
>>   drivers/net/ethernet/mellanox/mlx5/core/main.c |   8 +-
>>   drivers/net/ethernet/mellanox/mlx5/core/qp.c   | 134 ++++-
>>   include/linux/mlx5/device.h                    |  73 ++-
>>   include/linux/mlx5/driver.h                    |  20 +-
>>   include/linux/mlx5/qp.h                        |  63 ++
>>   include/rdma/ib_umem.h                         |  29 +-
>>   include/rdma/ib_umem_odp.h                     | 156 +++++
>>   include/rdma/ib_verbs.h                        |  47 +-
>>   include/uapi/rdma/ib_user_verbs.h              |  25 +
>>   32 files changed, 2907 insertions(+), 165 deletions(-)
>>   create mode 100644 drivers/infiniband/core/umem_odp.c
>>   create mode 100644 drivers/infiniband/core/umem_rbtree.c
>>   create mode 100644 drivers/infiniband/hw/mlx5/odp.c
>>   create mode 100644 include/rdma/ib_umem_odp.h
>>
>


* Re: [PATCH v1 for-next 00/16] On demand paging
       [not found]     ` <CAOha14xthZHSpS_T+XRgZcPqwaZvtMw0iGTzKjTyjdBuLhJ4Eg@mail.gmail.com>
@ 2014-09-03 21:15       ` Roland Dreier
  2014-09-04 17:45         ` Jerome Glisse
  2014-09-09 14:21         ` Haggai Eran
  0 siblings, 2 replies; 8+ messages in thread
From: Roland Dreier @ 2014-09-03 21:15 UTC (permalink / raw)
  To: Latchesar Ionkov
  Cc: Or Gerlitz, linux-rdma, Greg Kroah-Hartman, Sagi Grimberg, Linux Kernel

> I would like to note that we at Los Alamos National Laboratory are very
> interested in this functionality and it would be great if it gets accepted.

Have you done any review or testing of these changes?  If so can you
share the results?

 - R.


* Re: [PATCH v1 for-next 00/16] On demand paging
  2014-09-03 21:15       ` Roland Dreier
@ 2014-09-04 17:45         ` Jerome Glisse
  2014-09-09 14:21         ` Haggai Eran
  1 sibling, 0 replies; 8+ messages in thread
From: Jerome Glisse @ 2014-09-04 17:45 UTC (permalink / raw)
  To: Roland Dreier
  Cc: Latchesar Ionkov, Or Gerlitz, linux-rdma, Greg Kroah-Hartman,
	Sagi Grimberg, Linux Kernel

On Wed, Sep 03, 2014 at 02:15:51PM -0700, Roland Dreier wrote:
> > I would like to note that we at Los Alamos National Laboratory are very
> > interested in this functionality and it would be great if it gets accepted.
> 
> Have you done any review or testing of these changes?  If so can you
> share the results?

Jumping in here: I am working on a similar issue, i.e. on a subsystem that
allows a device to mirror a process address space (or part of it) using its
own MMU and page table.

While I am aiming at providing a generic API, it is not yet fully cooked,
so I think having a driver implement its own code in the meantime is
something we have to live with. I share the frustration of Sagi and Haggai
over how hard it is to get any kind of review.

I have not yet reviewed this code, but it is nearing the top of my todo
list, even though I am not an authoritative figure in either mm or RDMA.

For anyone interested in taking a peek at HMM (the subsystem I am working
on to provide the same functionality and more):

http://marc.info/?l=linux-mm&m=140933942705466&w=4

Cheers,
Jérôme

> 
>  - R.


* Re: [PATCH v1 for-next 00/16] On demand paging
  2014-09-03 21:15       ` Roland Dreier
  2014-09-04 17:45         ` Jerome Glisse
@ 2014-09-09 14:21         ` Haggai Eran
  2014-09-10  8:51           ` Haggai Eran
  2014-09-12 21:16           ` Or Gerlitz
  1 sibling, 2 replies; 8+ messages in thread
From: Haggai Eran @ 2014-09-09 14:21 UTC (permalink / raw)
  To: Roland Dreier, Latchesar Ionkov
  Cc: Or Gerlitz, linux-rdma, Greg Kroah-Hartman, Sagi Grimberg, Linux Kernel

On 04/09/2014 00:15, Roland Dreier wrote:
> Have you done any review or testing of these changes?  If so can you
> share the results?

We have tested this feature thoroughly inside Mellanox. We ran random
tests that performed MR registrations, memory mappings and unmappings,
calls to madvise with MADV_DONTNEED for invalidations, sending and
receiving of data, and RDMA operations. The tests validated the integrity
of the data, and we verified the integrity of kernel memory by running
them under a debugging kernel.
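For readers who want a feel for that test pattern, a minimal user-space
sketch of one iteration is below. The verbs calls and the
IBV_ACCESS_ON_DEMAND flag are assumptions about the user-space side of this
series; the actual Mellanox test suite is not shown here.

#include <string.h>
#include <sys/mman.h>
#include <infiniband/verbs.h>

/* Sketch of one iteration of the kind of test described above: map
 * memory, register it as a pageable (ODP) MR, invalidate it with
 * MADV_DONTNEED, then exercise it again so the HCA must fault it back. */
static int odp_madvise_iteration(struct ibv_pd *pd, size_t len)
{
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	struct ibv_mr *mr;

	if (buf == MAP_FAILED)
		return -1;

	mr = ibv_reg_mr(pd, buf, len,
			IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_ON_DEMAND);
	if (!mr) {
		munmap(buf, len);
		return -1;
	}

	memset(buf, 0xab, len);			/* fault the pages in */
	madvise(buf, len, MADV_DONTNEED);	/* force an invalidation */

	/* ... post sends/receives or RDMA operations against mr here and
	 * verify data integrity on completion ... */

	ibv_dereg_mr(mr);
	munmap(buf, len);
	return 0;
}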

Best regards,
Haggai



* Re: [PATCH v1 for-next 00/16] On demand paging
  2014-09-09 14:21         ` Haggai Eran
@ 2014-09-10  8:51           ` Haggai Eran
  2014-09-10  9:28             ` Sagi Grimberg
  2014-09-12 21:16           ` Or Gerlitz
  1 sibling, 1 reply; 8+ messages in thread
From: Haggai Eran @ 2014-09-10  8:51 UTC (permalink / raw)
  To: Roland Dreier, Latchesar Ionkov
  Cc: Or Gerlitz, linux-rdma, Greg Kroah-Hartman, Sagi Grimberg, Linux Kernel

On 09/09/2014 17:21, Haggai Eran wrote:
> On 04/09/2014 00:15, Roland Dreier wrote:
>> Have you done any review or testing of these changes?  If so can you
>> share the results?
> 
> We have tested this feature thoroughly inside Mellanox. We ran random
> tests that performed MR registrations, memory mappings and unmappings,
> calls to madvise with MADV_DONTNEED for invalidations, sending and
> receiving of data, and RDMA operations. The test validated the integrity
> of the data, and we verified the integrity of kernel memory by running
> the tests under a debugging kernel.

To add a note on performance testing of these patches: we have tested ODP
on several setups, including low-level RDMA micro-benchmarks, MPI
applications, and iSER. In all cases, ODP delivers the *same* bare-metal
performance as obtained with standard MRs, in terms of both BW and latency.
In addition, the performance of standard MRs is not affected by the
presence of ODP applications.

The main benefits of ODP are the simplified programming model, simplified
management, and avoiding worst-case memory commitment.
For example, we were able to run multiple concurrent instances of iSER
targets, allowing over-commitment that otherwise wouldn’t be possible.
In the MPI case, both IMB (Pallas) and applications achieved the same
performance as the pin-down cache, with minimal memory locking
privileges while avoiding any glibc hooks for detecting invalidations.

Regards,
Haggai


* Re: [PATCH v1 for-next 00/16] On demand paging
  2014-09-10  8:51           ` Haggai Eran
@ 2014-09-10  9:28             ` Sagi Grimberg
  0 siblings, 0 replies; 8+ messages in thread
From: Sagi Grimberg @ 2014-09-10  9:28 UTC (permalink / raw)
  To: Haggai Eran, Roland Dreier, Latchesar Ionkov
  Cc: Or Gerlitz, linux-rdma, Greg Kroah-Hartman, Sagi Grimberg, Linux Kernel

On 9/10/2014 11:51 AM, Haggai Eran wrote:
<SNIP>
>
> The main benefits of ODP is the simplified programming model, simplified
> management, and avoiding worst-case memory commitment.
> For example, we were able to run multiple concurrent instances of iSER
> targets, allowing over-commitment that otherwise wouldn’t be possible.

Just wanted to add that we're talking about TGT, which is a user-space
target, so its RDMA memory regions were allowed to be pageable (i.e.
not pinned).

Cheers,
Sagi.


* Re: [PATCH v1 for-next 00/16] On demand paging
  2014-09-09 14:21         ` Haggai Eran
  2014-09-10  8:51           ` Haggai Eran
@ 2014-09-12 21:16           ` Or Gerlitz
  2014-09-17 15:18             ` Or Gerlitz
  1 sibling, 1 reply; 8+ messages in thread
From: Or Gerlitz @ 2014-09-12 21:16 UTC (permalink / raw)
  To: Roland Dreier
  Cc: Latchesar Ionkov, linux-rdma, Greg Kroah-Hartman, Sagi Grimberg,
	Linux Kernel, Haggai Eran

On Tue, Sep 9, 2014, Haggai Eran <haggaie@mellanox.com> wrote:
> On 04/09/2014, Roland Dreier wrote:

>> Have you done any review or testing of these changes?  If so can you
>> share the results?

> We have tested this feature thoroughly inside Mellanox. We ran random
> tests that performed MR registrations, memory mappings and unmappings,
> calls to madvise with MADV_DONTNEED for invalidations, sending and
> receiving of data, and RDMA operations. The test validated the integrity
> of the data, and we verified the integrity of kernel memory by running
> the tests under a debugging kernel.

Hi Roland,

Per your request, we provided the information on the tests conducted with
the patches.

Note that the patches can't really disrupt existing applications that
don't set the new IB_ACCESS_ON_DEMAND MR flag when they register memory.
Also, the whole set of changes to the umem area depends on building with
CONFIG_INFINIBAND_ON_DEMAND_PAGING -- all in all, everything is in place
to protect against any potential regressions this series could introduce.
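As a sketch of the two guards just described, the opt-in pattern looks
roughly like the following; the placement of these checks within the series
is an assumption, and reg_odp_mr()/reg_pinned_mr() are placeholder names.

/* Sketch: ODP is doubly opt-in -- a Kconfig option and a per-MR flag.
 * Placement of these checks is an assumption; the two helpers are
 * placeholders, not functions from the patches. */
static int example_reg_mr(struct ib_pd *pd, u64 start, u64 length, int access)
{
#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
	if (access & IB_ACCESS_ON_DEMAND)
		return reg_odp_mr(pd, start, length, access);	/* new path */
#endif
	return reg_pinned_mr(pd, start, length, access);	/* unchanged */
}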

As you haven't provided any feedback for more than six months, and we have
all of the above in place (a report on stability tests, performance data,
and mechanics to avoid regressions), I think it would be fair to get this
picked up for the coming merge window. Thoughts?

Or.


* Re: [PATCH v1 for-next 00/16] On demand paging
  2014-09-12 21:16           ` Or Gerlitz
@ 2014-09-17 15:18             ` Or Gerlitz
  0 siblings, 0 replies; 8+ messages in thread
From: Or Gerlitz @ 2014-09-17 15:18 UTC (permalink / raw)
  To: Roland Dreier
  Cc: Or Gerlitz, Latchesar Ionkov, linux-rdma, Greg Kroah-Hartman,
	Sagi Grimberg, Linux Kernel, Haggai Eran

On 9/13/2014 12:16 AM, Or Gerlitz wrote:
> Per your request we provided the information on tests conducted with
> the patches.
>
> Note that the patches can't really disrupt existing applications that
> don't set the new IB_ACCESS_ON_DEMAND MR flag when they register
> memory. Also the whole set of changes to the umem area is dependent on building with
> CONFIG_INFINIBAND_ON_DEMAND_PAGING -- all in all, everything is in
> place for protecting against potential regression that this series
> could introduce.
>
> As you didn't provide any feedback for > six months, and we have all
> the above in place (report on stability tests, performance data and
> mechanics to avoid regressions) I think it would be fair to get this
> picked for the coming merge window, thoughts?

Roland,

Can you please comment here? Not only have you not provided any feedback
on the patches, you are not even willing to say whether the data we gave
addresses your questions on testing and performance. Are you planning to
pick this up for the next merge window?

Or.


