From mboxrd@z Thu Jan 1 00:00:00 1970
From: Dan Williams
Date: Fri, 13 Oct 2017 11:22:21 -0700
Subject: Re: [PATCH v9 0/6] MAP_DIRECT for DAX userspace flush
In-Reply-To: <20171013173145.GA18702@obsidianresearch.com>
References: <150776922692.9144.16963640112710410217.stgit@dwillia2-desk3.amr.corp.intel.com>
 <20171012142319.GA11254@lst.de>
 <20171013065716.GB26461@lst.de>
 <20171013163822.GA17411@obsidianresearch.com>
 <20171013173145.GA18702@obsidianresearch.com>
Sender: "Linux-nvdimm"
To: Jason Gunthorpe
Cc: "J. Bruce Fields", Jan Kara, Andrew Morton, Arnd Bergmann,
 "Darrick J. Wong", Linux API, "linux-nvdimm@lists.01.org",
 Dave Chinner, linux-xfs@vger.kernel.org, Linux MM, Al Viro,
 Andy Lutomirski, Jeff Layton, linux-fsdevel, Linus Torvalds,
 Christoph Hellwig

On Fri, Oct 13, 2017 at 10:31 AM, Jason Gunthorpe wrote:
> On Fri, Oct 13, 2017 at 10:01:04AM -0700, Dan Williams wrote:
>> On Fri, Oct 13, 2017 at 9:38 AM, Jason Gunthorpe wrote:
>> > On Fri, Oct 13, 2017 at 08:14:55AM -0700, Dan Williams wrote:
>> >
>> >> scheme specific to RDMA which seems like a waste to me when we can
>> >> generically signal an event on the fd for any event that effects any
>> >> of the vma's on the file. The FL_LAYOUT lease impacts the entire file,
>> >> so as far as I can see delaying the notification until MR-init is too
>> >> late, too granular, and too RDMA specific.
>> >
>> > But for RDMA a FD is not what we care about - we want the MR handle so
>> > the app knows which MR needs fixing.
>>
>> I'd rather put the onus on userspace to remember where it used a
>> MAP_DIRECT mapping and be aware that all the mappings of that file are
>> subject to a lease break. Sure, we could build up a pile of kernel
>> infrastructure to notify on a per-MR basis, but I think that would
>> only be worth it if leases were range based. As it is, the entire file
>> is covered by a lease instance and all MRs that might reference that
>> file get one notification. That said, we can always arrange for a
>> per-driver callback at lease-break time so that it can do something
>> above and beyond the default notification.
>
> I don't think that really represents how lots of apps actually use
> RDMA.
>
> RDMA is often buried down in the software stack (eg in a MPI), and by
> the time a mapping gets used for RDMA transfer the link between the
> FD, mmap and the MR is totally opaque.
>
> Having a MR specific notification means the low level RDMA libraries
> have a chance to deal with everything for the app.
>
> Eg consider a HPC app using MPI that uses some DAX aware library to
> get DAX backed mmap's. It then passes memory in those mmaps to the
> MPI library to do transfers. The MPI creates the MR on demand.
>
> So, who should be responsible for MR coherency? Today we say the MPI
> is responsible. But we can't really expect the MPI
> to hook SIGIO and somehow try to reverse engineer what MRs are
> impacted from a FD that may not even still be open.

Ok, that's good insight that I didn't have. Userspace needs more help
than just an fd notification.
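
To make the bookkeeping problem quoted above concrete, here is a minimal
sketch of what "remember where it used a MAP_DIRECT mapping" would demand
of userspace: a table that ties each MR back to the file behind it, so
that a single whole-file lease break can be turned into per-MR
invalidation. Everything in it is hypothetical and for illustration only:
struct mr_record, struct mr_table, the file_id key (a stand-in for
something stable like a device/inode pair, since the fd may already be
closed) and the mr_handle values are not part of any existing kernel or
libibverbs interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_MRS 16

/* Hypothetical record: one registered memory region and its backing file. */
struct mr_record {
    uint64_t file_id;   /* stand-in for a stable file identity */
    uint32_t mr_handle; /* stand-in for the handle an RDMA library tracks */
    int valid;          /* cleared once the file's lease has been broken */
    int in_use;         /* slot is occupied */
};

struct mr_table {
    struct mr_record recs[MAX_MRS];
};

/* Record a new MR. Returns 0 on success, -1 if the table is full. */
static int mr_table_add(struct mr_table *t, uint64_t file_id,
                        uint32_t mr_handle)
{
    assert(t != NULL);
    for (size_t i = 0; i < MAX_MRS; i++) {
        if (!t->recs[i].in_use) {
            t->recs[i].file_id = file_id;
            t->recs[i].mr_handle = mr_handle;
            t->recs[i].valid = 1;
            t->recs[i].in_use = 1;
            return 0;
        }
    }
    return -1;
}

/*
 * The lease covers the whole file, so one break invalidates every MR that
 * references it. Returns the number of MRs that were invalidated.
 */
static int mr_table_lease_break(struct mr_table *t, uint64_t file_id)
{
    int invalidated = 0;

    assert(t != NULL);
    for (size_t i = 0; i < MAX_MRS; i++) {
        struct mr_record *r = &t->recs[i];

        if (r->in_use && r->valid && r->file_id == file_id) {
            r->valid = 0;
            invalidated++;
        }
    }
    return invalidated;
}

/* Returns 1 if the MR is known and still valid, 0 otherwise. */
static int mr_table_is_valid(const struct mr_table *t, uint32_t mr_handle)
{
    assert(t != NULL);
    for (size_t i = 0; i < MAX_MRS; i++) {
        const struct mr_record *r = &t->recs[i];

        if (r->in_use && r->mr_handle == mr_handle)
            return r->valid;
    }
    return 0;
}
```

The sketch only works if whoever creates the MR also knows the file
identity. In the MPI case described above the library registering memory
never sees the fd, which is why an fd-only notification leaves it nothing
to key this table on.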
> I think, if you want to build a uAPI for notification of MR lease
> break, then you need show how it fits into the above software model:
>  - How it can be hidden in a RDMA specific library

So, here's a strawman: can ibv_poll_cq() start returning ibv_wc_status
== IBV_WC_LOC_PROT_ERR when file coherency is lost? This would make the
solution generic across DAX and non-DAX. What's your feeling for how
well applications are prepared to deal with that status return?

>  - How lease break can be done hitlessly, so the library user never
>    needs to know it is happening or see failed/missed transfers

iommu redirect should be hitless and behave like the page cache case
where RDMA targets pages that are no longer part of the file.

>  - Whatever fast path checking is needed does not kill performance

What do you consider a fast path? I was assuming that memory
registration is a slow path, and iommu operations are asynchronous so
should not impact performance of ongoing operations beyond typical
iommu overhead.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm
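
As a rough illustration of the ibv_poll_cq() strawman above, the sketch
below shows the kind of completion handling a library would need if a
protection error came to mean "file coherency was lost". It is an
assumption-laden model, not libibverbs code: the enum wc_status_sketch
values are local stand-ins for the real enum ibv_wc_status constants
(real code would include <infiniband/verbs.h> and test wc.status against
IBV_WC_LOC_PROT_ERR), and the enum wc_action policy of re-registering on
a protection error is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Local stand-ins for the relevant enum ibv_wc_status values, so the
 * sketch compiles without libibverbs. These are not the library's names.
 */
enum wc_status_sketch {
    WC_SKETCH_SUCCESS,
    WC_SKETCH_LOC_PROT_ERR,
    WC_SKETCH_OTHER_ERR,
};

/* Hypothetical actions a library could take for each completion. */
enum wc_action {
    WC_ACT_DONE,       /* transfer finished, nothing more to do */
    WC_ACT_REREGISTER, /* MR may be stale: re-register and repost */
    WC_ACT_FAIL,       /* unrelated error: report it to the caller */
};

/* Map one completion status to the action the library would take. */
static enum wc_action classify_completion(enum wc_status_sketch status)
{
    switch (status) {
    case WC_SKETCH_SUCCESS:
        return WC_ACT_DONE;
    case WC_SKETCH_LOC_PROT_ERR:
        /* Under the strawman this is how a lease break would surface. */
        return WC_ACT_REREGISTER;
    default:
        return WC_ACT_FAIL;
    }
}

/* Count the completions in a polled batch that need re-registration. */
static size_t count_stale(const enum wc_status_sketch *statuses, size_t n)
{
    size_t stale = 0;

    assert(statuses != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (classify_completion(statuses[i]) == WC_ACT_REREGISTER)
            stale++;
    }
    return stale;
}
```

Note that this path only runs after a transfer has already failed, so on
its own it addresses the "hidden in a RDMA specific library" point but
not the "hitless" one.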