From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from ipmail07.adl2.internode.on.net (ipmail07.adl2.internode.on.net [150.101.137.131]) by ml01.01.org (Postfix) with ESMTP id 08D4D20352604 for ; Mon, 30 Oct 2017 04:17:02 -0700 (PDT) Date: Mon, 30 Oct 2017 22:20:48 +1100 From: Dave Chinner Subject: Re: [PATCH v3 00/13] dax: fix dma vs truncate and remove 'page-less' support Message-ID: <20171030112048.GA4133@dastard> References: <150846713528.24336.4459262264611579791.stgit@dwillia2-desk3.amr.corp.intel.com> <20171020074750.GA13568@lst.de> <20171020093148.GA20304@lst.de> <20171026105850.GA31161@quack2.suse.cz> <20171030020023.GG3666@dastard> <20171030083807.GA23278@quack2.suse.cz> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <20171030083807.GA23278@quack2.suse.cz> List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: linux-nvdimm-bounces@lists.01.org Sender: "Linux-nvdimm" To: Jan Kara Cc: Michal Hocko , Benjamin Herrenschmidt , Dave Hansen , Heiko Carstens , "J. Bruce Fields" , linux-mm , Paul Mackerras , Jeff Layton , Sean Hefty , Matthew Wilcox , linux-rdma@vger.kernel.org, Michael Ellerman , Christoph Hellwig , Jason Gunthorpe , Doug Ledford , Hal Rosenstock , Martin Schwidefsky , Alexander Viro , Gerald Schaefer , "linux-nvdimm@lists.01.org" , Linux Kernel Mailing List , linux-xfs@vger.kernel.org, linux-fsdevel , Andrew Morton , "Darrick J. Wong" , "Kirill A. Shutemov" List-ID: On Mon, Oct 30, 2017 at 09:38:07AM +0100, Jan Kara wrote: > Hi, > > On Mon 30-10-17 13:00:23, Dave Chinner wrote: > > On Sun, Oct 29, 2017 at 04:46:44PM -0700, Dan Williams wrote: > > > Coming back to this since Dave has made clear that new locking to > > > coordinate get_user_pages() is a no-go. > > > > > > We can unmap to force new get_user_pages() attempts to block on the > > > per-fs mmap lock, but if punch-hole finds any elevated pages it needs > > > to drop the mmap lock and wait. We need this lock dropped to get > > > around the problem that the driver will not start to drop page > > > references until it has elevated the page references on all the pages > > > in the I/O. If we need to drop the mmap lock that makes it impossible > > > to coordinate this unlock/retry loop within truncate_inode_pages_range > > > which would otherwise be the natural place to land this code. > > > > > > Would it be palatable to unmap and drain dma in any path that needs to > > > detach blocks from an inode? Something like the following that builds > > > on dax_wait_dma() tried to achieve, but does not introduce a new lock > > > for the fs to manage: > > > > > > retry: > > > per_fs_mmap_lock(inode); > > > unmap_mapping_range(mapping, start, end); /* new page references > > > cannot be established */ > > > if ((dax_page = dax_dma_busy_page(mapping, start, end)) != NULL) { > > > per_fs_mmap_unlock(inode); /* new page references can happen, > > > so we need to start over */ > > > wait_for_page_idle(dax_page); > > > goto retry; > > > } > > > truncate_inode_pages_range(mapping, start, end); > > > per_fs_mmap_unlock(inode); > > > > These retry loops you keep proposing are just bloody horrible. They > > are basically just a method for blocking an operation until whatever > > condition is preventing the invalidation goes away. IMO, that's an > > ugly solution no matter how much lipstick you dress it up with. > > > > i.e. the blocking loops mean the user process is going to be blocked > > for arbitrary lengths of time. That's not a solution, it's just > > passing the buck - now the userspace developers need to work around > > truncate/hole punch being randomly blocked for arbitrary lengths of > > time. > > So I see substantial difference between how you and Christoph think this > should be handled. Christoph writes in [1]: > > The point is that we need to prohibit long term elevated page counts > with DAX anyway - we can't just let people grab allocated blocks forever > while ignoring file system operations. For stage 1 we'll just need to > fail those, and in the long run they will have to use a mechanism > similar to FL_LAYOUT locks to deal with file system allocation changes. > > So Christoph wants to block truncate until references are released, forbid > long term references until userspace acquiring them supports some kind of > lease-breaking. OTOH you suggest truncate should just proceed leaving > blocks allocated until references are released. I don't see what I'm suggesting is a solution to long term elevated page counts. Just something that can park extents until layout leases are broken and references released. That's a few tens of seconds at most. > We cannot have both... I'm leaning more towards the approach > Christoph suggests as it puts the burned to the place which is > causing it - the application having long term references - and > applications needing this should be sufficiently rare that we > don't have to devise a general mechanism in the kernel for this. I have no problems with blocking truncate forever if that's the desired solution for an elevated page count due to a DMA reference to a page. But that has absolutely nothing to do with the filesystem though - it's a page reference vs mapping invalidation problem, not a filesystem/inode problem. Perhaps pages with active DAX DMA mapping references need a page flag to indicate that invalidation must block on the page similar to the writeback flag... > If the solution Christoph suggests is acceptable to you, I think > we should first write a patch to forbid acquiring long term > references to DAX blocks. On top of that we can implement > mechanism to block truncate while there are short term references > pending (and for that retry loops would be IMHO acceptable). The problem with retry loops is that they are making a mess of an already complex set of locking contraints on the indoe IO path. It's rapidly descending into an unmaintainable mess - falling off the locking cliff only make sthe code harder to maintain - please look for solutions that don't require new locks or lock retry loops. Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm From mboxrd@z Thu Jan 1 00:00:00 1970 From: Dave Chinner Subject: Re: [PATCH v3 00/13] dax: fix dma vs truncate and remove 'page-less' support Date: Mon, 30 Oct 2017 22:20:48 +1100 Message-ID: <20171030112048.GA4133@dastard> References: <150846713528.24336.4459262264611579791.stgit@dwillia2-desk3.amr.corp.intel.com> <20171020074750.GA13568@lst.de> <20171020093148.GA20304@lst.de> <20171026105850.GA31161@quack2.suse.cz> <20171030020023.GG3666@dastard> <20171030083807.GA23278@quack2.suse.cz> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Return-path: Content-Disposition: inline In-Reply-To: <20171030083807.GA23278@quack2.suse.cz> Sender: owner-linux-mm@kvack.org To: Jan Kara Cc: Dan Williams , Christoph Hellwig , Michal Hocko , Benjamin Herrenschmidt , Dave Hansen , Heiko Carstens , "J. Bruce Fields" , linux-mm , Paul Mackerras , Sean Hefty , Jeff Layton , Matthew Wilcox , linux-rdma@vger.kernel.org, Michael Ellerman , Jason Gunthorpe , Doug Ledford , Hal Rosenstock , linux-fsdevel , Alexander Viro , Gerald Schaefer , "linux-nvdimm@lists.01.org" , Linux List-Id: linux-rdma@vger.kernel.org On Mon, Oct 30, 2017 at 09:38:07AM +0100, Jan Kara wrote: > Hi, > > On Mon 30-10-17 13:00:23, Dave Chinner wrote: > > On Sun, Oct 29, 2017 at 04:46:44PM -0700, Dan Williams wrote: > > > Coming back to this since Dave has made clear that new locking to > > > coordinate get_user_pages() is a no-go. > > > > > > We can unmap to force new get_user_pages() attempts to block on the > > > per-fs mmap lock, but if punch-hole finds any elevated pages it needs > > > to drop the mmap lock and wait. We need this lock dropped to get > > > around the problem that the driver will not start to drop page > > > references until it has elevated the page references on all the pages > > > in the I/O. If we need to drop the mmap lock that makes it impossible > > > to coordinate this unlock/retry loop within truncate_inode_pages_range > > > which would otherwise be the natural place to land this code. > > > > > > Would it be palatable to unmap and drain dma in any path that needs to > > > detach blocks from an inode? Something like the following that builds > > > on dax_wait_dma() tried to achieve, but does not introduce a new lock > > > for the fs to manage: > > > > > > retry: > > > per_fs_mmap_lock(inode); > > > unmap_mapping_range(mapping, start, end); /* new page references > > > cannot be established */ > > > if ((dax_page = dax_dma_busy_page(mapping, start, end)) != NULL) { > > > per_fs_mmap_unlock(inode); /* new page references can happen, > > > so we need to start over */ > > > wait_for_page_idle(dax_page); > > > goto retry; > > > } > > > truncate_inode_pages_range(mapping, start, end); > > > per_fs_mmap_unlock(inode); > > > > These retry loops you keep proposing are just bloody horrible. They > > are basically just a method for blocking an operation until whatever > > condition is preventing the invalidation goes away. IMO, that's an > > ugly solution no matter how much lipstick you dress it up with. > > > > i.e. the blocking loops mean the user process is going to be blocked > > for arbitrary lengths of time. That's not a solution, it's just > > passing the buck - now the userspace developers need to work around > > truncate/hole punch being randomly blocked for arbitrary lengths of > > time. > > So I see substantial difference between how you and Christoph think this > should be handled. Christoph writes in [1]: > > The point is that we need to prohibit long term elevated page counts > with DAX anyway - we can't just let people grab allocated blocks forever > while ignoring file system operations. For stage 1 we'll just need to > fail those, and in the long run they will have to use a mechanism > similar to FL_LAYOUT locks to deal with file system allocation changes. > > So Christoph wants to block truncate until references are released, forbid > long term references until userspace acquiring them supports some kind of > lease-breaking. OTOH you suggest truncate should just proceed leaving > blocks allocated until references are released. I don't see what I'm suggesting is a solution to long term elevated page counts. Just something that can park extents until layout leases are broken and references released. That's a few tens of seconds at most. > We cannot have both... I'm leaning more towards the approach > Christoph suggests as it puts the burned to the place which is > causing it - the application having long term references - and > applications needing this should be sufficiently rare that we > don't have to devise a general mechanism in the kernel for this. I have no problems with blocking truncate forever if that's the desired solution for an elevated page count due to a DMA reference to a page. But that has absolutely nothing to do with the filesystem though - it's a page reference vs mapping invalidation problem, not a filesystem/inode problem. Perhaps pages with active DAX DMA mapping references need a page flag to indicate that invalidation must block on the page similar to the writeback flag... > If the solution Christoph suggests is acceptable to you, I think > we should first write a patch to forbid acquiring long term > references to DAX blocks. On top of that we can implement > mechanism to block truncate while there are short term references > pending (and for that retry loops would be IMHO acceptable). The problem with retry loops is that they are making a mess of an already complex set of locking contraints on the indoe IO path. It's rapidly descending into an unmaintainable mess - falling off the locking cliff only make sthe code harder to maintain - please look for solutions that don't require new locks or lock retry loops. Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752908AbdJ3LUz (ORCPT ); Mon, 30 Oct 2017 07:20:55 -0400 Received: from ipmail07.adl2.internode.on.net ([150.101.137.131]:46356 "EHLO ipmail07.adl2.internode.on.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751749AbdJ3LUx (ORCPT ); Mon, 30 Oct 2017 07:20:53 -0400 Date: Mon, 30 Oct 2017 22:20:48 +1100 From: Dave Chinner To: Jan Kara Cc: Dan Williams , Christoph Hellwig , Michal Hocko , Benjamin Herrenschmidt , Dave Hansen , Heiko Carstens , "J. Bruce Fields" , linux-mm , Paul Mackerras , Sean Hefty , Jeff Layton , Matthew Wilcox , linux-rdma@vger.kernel.org, Michael Ellerman , Jason Gunthorpe , Doug Ledford , Hal Rosenstock , linux-fsdevel , Alexander Viro , Gerald Schaefer , "linux-nvdimm@lists.01.org" , Linux Kernel Mailing List , linux-xfs@vger.kernel.org, Martin Schwidefsky , Andrew Morton , "Darrick J. Wong" , "Kirill A. Shutemov" Subject: Re: [PATCH v3 00/13] dax: fix dma vs truncate and remove 'page-less' support Message-ID: <20171030112048.GA4133@dastard> References: <150846713528.24336.4459262264611579791.stgit@dwillia2-desk3.amr.corp.intel.com> <20171020074750.GA13568@lst.de> <20171020093148.GA20304@lst.de> <20171026105850.GA31161@quack2.suse.cz> <20171030020023.GG3666@dastard> <20171030083807.GA23278@quack2.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20171030083807.GA23278@quack2.suse.cz> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Oct 30, 2017 at 09:38:07AM +0100, Jan Kara wrote: > Hi, > > On Mon 30-10-17 13:00:23, Dave Chinner wrote: > > On Sun, Oct 29, 2017 at 04:46:44PM -0700, Dan Williams wrote: > > > Coming back to this since Dave has made clear that new locking to > > > coordinate get_user_pages() is a no-go. > > > > > > We can unmap to force new get_user_pages() attempts to block on the > > > per-fs mmap lock, but if punch-hole finds any elevated pages it needs > > > to drop the mmap lock and wait. We need this lock dropped to get > > > around the problem that the driver will not start to drop page > > > references until it has elevated the page references on all the pages > > > in the I/O. If we need to drop the mmap lock that makes it impossible > > > to coordinate this unlock/retry loop within truncate_inode_pages_range > > > which would otherwise be the natural place to land this code. > > > > > > Would it be palatable to unmap and drain dma in any path that needs to > > > detach blocks from an inode? Something like the following that builds > > > on dax_wait_dma() tried to achieve, but does not introduce a new lock > > > for the fs to manage: > > > > > > retry: > > > per_fs_mmap_lock(inode); > > > unmap_mapping_range(mapping, start, end); /* new page references > > > cannot be established */ > > > if ((dax_page = dax_dma_busy_page(mapping, start, end)) != NULL) { > > > per_fs_mmap_unlock(inode); /* new page references can happen, > > > so we need to start over */ > > > wait_for_page_idle(dax_page); > > > goto retry; > > > } > > > truncate_inode_pages_range(mapping, start, end); > > > per_fs_mmap_unlock(inode); > > > > These retry loops you keep proposing are just bloody horrible. They > > are basically just a method for blocking an operation until whatever > > condition is preventing the invalidation goes away. IMO, that's an > > ugly solution no matter how much lipstick you dress it up with. > > > > i.e. the blocking loops mean the user process is going to be blocked > > for arbitrary lengths of time. That's not a solution, it's just > > passing the buck - now the userspace developers need to work around > > truncate/hole punch being randomly blocked for arbitrary lengths of > > time. > > So I see substantial difference between how you and Christoph think this > should be handled. Christoph writes in [1]: > > The point is that we need to prohibit long term elevated page counts > with DAX anyway - we can't just let people grab allocated blocks forever > while ignoring file system operations. For stage 1 we'll just need to > fail those, and in the long run they will have to use a mechanism > similar to FL_LAYOUT locks to deal with file system allocation changes. > > So Christoph wants to block truncate until references are released, forbid > long term references until userspace acquiring them supports some kind of > lease-breaking. OTOH you suggest truncate should just proceed leaving > blocks allocated until references are released. I don't see what I'm suggesting is a solution to long term elevated page counts. Just something that can park extents until layout leases are broken and references released. That's a few tens of seconds at most. > We cannot have both... I'm leaning more towards the approach > Christoph suggests as it puts the burned to the place which is > causing it - the application having long term references - and > applications needing this should be sufficiently rare that we > don't have to devise a general mechanism in the kernel for this. I have no problems with blocking truncate forever if that's the desired solution for an elevated page count due to a DMA reference to a page. But that has absolutely nothing to do with the filesystem though - it's a page reference vs mapping invalidation problem, not a filesystem/inode problem. Perhaps pages with active DAX DMA mapping references need a page flag to indicate that invalidation must block on the page similar to the writeback flag... > If the solution Christoph suggests is acceptable to you, I think > we should first write a patch to forbid acquiring long term > references to DAX blocks. On top of that we can implement > mechanism to block truncate while there are short term references > pending (and for that retry loops would be IMHO acceptable). The problem with retry loops is that they are making a mess of an already complex set of locking contraints on the indoe IO path. It's rapidly descending into an unmaintainable mess - falling off the locking cliff only make sthe code harder to maintain - please look for solutions that don't require new locks or lock retry loops. Cheers, Dave. -- Dave Chinner david@fromorbit.com From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Date: Mon, 30 Oct 2017 22:20:48 +1100 From: Dave Chinner To: Jan Kara Cc: Dan Williams , Christoph Hellwig , Michal Hocko , Benjamin Herrenschmidt , Dave Hansen , Heiko Carstens , "J. Bruce Fields" , linux-mm , Paul Mackerras , Sean Hefty , Jeff Layton , Matthew Wilcox , linux-rdma@vger.kernel.org, Michael Ellerman , Jason Gunthorpe , Doug Ledford , Hal Rosenstock , linux-fsdevel , Alexander Viro , Gerald Schaefer , "linux-nvdimm@lists.01.org" , Linux Kernel Mailing List , linux-xfs@vger.kernel.org, Martin Schwidefsky , Andrew Morton , "Darrick J. Wong" , "Kirill A. Shutemov" Subject: Re: [PATCH v3 00/13] dax: fix dma vs truncate and remove 'page-less' support Message-ID: <20171030112048.GA4133@dastard> References: <150846713528.24336.4459262264611579791.stgit@dwillia2-desk3.amr.corp.intel.com> <20171020074750.GA13568@lst.de> <20171020093148.GA20304@lst.de> <20171026105850.GA31161@quack2.suse.cz> <20171030020023.GG3666@dastard> <20171030083807.GA23278@quack2.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20171030083807.GA23278@quack2.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: On Mon, Oct 30, 2017 at 09:38:07AM +0100, Jan Kara wrote: > Hi, > > On Mon 30-10-17 13:00:23, Dave Chinner wrote: > > On Sun, Oct 29, 2017 at 04:46:44PM -0700, Dan Williams wrote: > > > Coming back to this since Dave has made clear that new locking to > > > coordinate get_user_pages() is a no-go. > > > > > > We can unmap to force new get_user_pages() attempts to block on the > > > per-fs mmap lock, but if punch-hole finds any elevated pages it needs > > > to drop the mmap lock and wait. We need this lock dropped to get > > > around the problem that the driver will not start to drop page > > > references until it has elevated the page references on all the pages > > > in the I/O. If we need to drop the mmap lock that makes it impossible > > > to coordinate this unlock/retry loop within truncate_inode_pages_range > > > which would otherwise be the natural place to land this code. > > > > > > Would it be palatable to unmap and drain dma in any path that needs to > > > detach blocks from an inode? Something like the following that builds > > > on dax_wait_dma() tried to achieve, but does not introduce a new lock > > > for the fs to manage: > > > > > > retry: > > > per_fs_mmap_lock(inode); > > > unmap_mapping_range(mapping, start, end); /* new page references > > > cannot be established */ > > > if ((dax_page = dax_dma_busy_page(mapping, start, end)) != NULL) { > > > per_fs_mmap_unlock(inode); /* new page references can happen, > > > so we need to start over */ > > > wait_for_page_idle(dax_page); > > > goto retry; > > > } > > > truncate_inode_pages_range(mapping, start, end); > > > per_fs_mmap_unlock(inode); > > > > These retry loops you keep proposing are just bloody horrible. They > > are basically just a method for blocking an operation until whatever > > condition is preventing the invalidation goes away. IMO, that's an > > ugly solution no matter how much lipstick you dress it up with. > > > > i.e. the blocking loops mean the user process is going to be blocked > > for arbitrary lengths of time. That's not a solution, it's just > > passing the buck - now the userspace developers need to work around > > truncate/hole punch being randomly blocked for arbitrary lengths of > > time. > > So I see substantial difference between how you and Christoph think this > should be handled. Christoph writes in [1]: > > The point is that we need to prohibit long term elevated page counts > with DAX anyway - we can't just let people grab allocated blocks forever > while ignoring file system operations. For stage 1 we'll just need to > fail those, and in the long run they will have to use a mechanism > similar to FL_LAYOUT locks to deal with file system allocation changes. > > So Christoph wants to block truncate until references are released, forbid > long term references until userspace acquiring them supports some kind of > lease-breaking. OTOH you suggest truncate should just proceed leaving > blocks allocated until references are released. I don't see what I'm suggesting is a solution to long term elevated page counts. Just something that can park extents until layout leases are broken and references released. That's a few tens of seconds at most. > We cannot have both... I'm leaning more towards the approach > Christoph suggests as it puts the burned to the place which is > causing it - the application having long term references - and > applications needing this should be sufficiently rare that we > don't have to devise a general mechanism in the kernel for this. I have no problems with blocking truncate forever if that's the desired solution for an elevated page count due to a DMA reference to a page. But that has absolutely nothing to do with the filesystem though - it's a page reference vs mapping invalidation problem, not a filesystem/inode problem. Perhaps pages with active DAX DMA mapping references need a page flag to indicate that invalidation must block on the page similar to the writeback flag... > If the solution Christoph suggests is acceptable to you, I think > we should first write a patch to forbid acquiring long term > references to DAX blocks. On top of that we can implement > mechanism to block truncate while there are short term references > pending (and for that retry loops would be IMHO acceptable). The problem with retry loops is that they are making a mess of an already complex set of locking contraints on the indoe IO path. It's rapidly descending into an unmaintainable mess - falling off the locking cliff only make sthe code harder to maintain - please look for solutions that don't require new locks or lock retry loops. Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org