From: Dan Williams
Date: Mon, 30 Oct 2017 10:51:30 -0700
Subject: Re: [PATCH v3 00/13] dax: fix dma vs truncate and remove 'page-less' support
To: Dave Chinner
Cc: Jan Kara, Christoph Hellwig, Michal Hocko, Benjamin Herrenschmidt, Dave Hansen, Heiko Carstens, "J. Bruce Fields", linux-mm, Paul Mackerras, Jeff Layton, Matthew Wilcox, linux-rdma, Michael Ellerman, Jason Gunthorpe, Doug Ledford, Sean Hefty, Hal Rosenstock, linux-fsdevel, Alexander Viro, Gerald Schaefer, "linux-nvdimm@lists.01.org", Linux Kernel Mailing List, linux-xfs@vger.kernel.org, Martin Schwidefsky, Andrew Morton, "Darrick J. Wong", "Kirill A. Shutemov"

On Mon, Oct 30, 2017 at 4:20 AM, Dave Chinner wrote:
> On Mon, Oct 30, 2017 at 09:38:07AM +0100, Jan Kara wrote:
>> Hi,
>>
>> On Mon 30-10-17 13:00:23, Dave Chinner wrote:
>> > On Sun, Oct 29, 2017 at 04:46:44PM -0700, Dan Williams wrote:
>> > > Coming back to this since Dave has made clear that new locking to
>> > > coordinate get_user_pages() is a no-go.
>> > >
>> > > We can unmap to force new get_user_pages() attempts to block on the
>> > > per-fs mmap lock, but if punch-hole finds any elevated pages it needs
>> > > to drop the mmap lock and wait. We need this lock dropped to get
>> > > around the problem that the driver will not start to drop page
>> > > references until it has elevated the page references on all the pages
>> > > in the I/O. If we need to drop the mmap lock, that makes it impossible
>> > > to coordinate this unlock/retry loop within truncate_inode_pages_range,
>> > > which would otherwise be the natural place to land this code.
>> > >
>> > > Would it be palatable to unmap and drain dma in any path that needs to
>> > > detach blocks from an inode? Something like the following, which builds
>> > > on what dax_wait_dma() tried to achieve, but does not introduce a new
>> > > lock for the fs to manage:
>> > >
>> > > retry:
>> > >     per_fs_mmap_lock(inode);
>> > >     unmap_mapping_range(mapping, start, end); /* new page references
>> > >                                                  cannot be established */
>> > >     if ((dax_page = dax_dma_busy_page(mapping, start, end)) != NULL) {
>> > >         per_fs_mmap_unlock(inode); /* new page references can happen,
>> > >                                       so we need to start over */
>> > >         wait_for_page_idle(dax_page);
>> > >         goto retry;
>> > >     }
>> > >     truncate_inode_pages_range(mapping, start, end);
>> > >     per_fs_mmap_unlock(inode);
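Spelled out as a self-contained C sketch of the same pattern;
per_fs_mmap_lock()/per_fs_mmap_unlock(), dax_dma_busy_page(), and
wait_for_page_idle() are the hypothetical helpers from the proposal
above, not existing kernel APIs:

    /*
     * Minimal sketch of the proposed unmap/drain retry pattern. Only
     * unmap_mapping_range() and truncate_inode_pages_range() are
     * existing kernel interfaces; the rest are assumed helpers.
     */
    static void dax_detach_blocks(struct inode *inode, loff_t start, loff_t end)
    {
            struct address_space *mapping = inode->i_mapping;
            struct page *dax_page;

    retry:
            per_fs_mmap_lock(inode);
            /* Tear down all mappings of the range so that no new page
             * references can be established via page faults. */
            unmap_mapping_range(mapping, start, end - start + 1, 0);
            dax_page = dax_dma_busy_page(mapping, start, end);
            if (dax_page) {
                    /* Drop the lock before sleeping: the DMA user may
                     * still be elevating counts on other pages in its
                     * I/O, and it drops nothing until it holds
                     * references on them all. */
                    per_fs_mmap_unlock(inode);
                    wait_for_page_idle(dax_page);
                    goto retry;
            }
            truncate_inode_pages_range(mapping, start, end);
            per_fs_mmap_unlock(inode);
    }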
>> >
>> > These retry loops you keep proposing are just bloody horrible. They
>> > are basically just a method for blocking an operation until whatever
>> > condition is preventing the invalidation goes away. IMO, that's an
>> > ugly solution no matter how much lipstick you dress it up with.
>> >
>> > i.e. the blocking loops mean the user process is going to be blocked
>> > for arbitrary lengths of time. That's not a solution, it's just
>> > passing the buck - now the userspace developers need to work around
>> > truncate/hole punch being randomly blocked for arbitrary lengths of
>> > time.
>>
>> So I see a substantial difference between how you and Christoph think
>> this should be handled. Christoph writes in [1]:
>>
>>   The point is that we need to prohibit long term elevated page counts
>>   with DAX anyway - we can't just let people grab allocated blocks
>>   forever while ignoring file system operations. For stage 1 we'll
>>   just need to fail those, and in the long run they will have to use a
>>   mechanism similar to FL_LAYOUT locks to deal with file system
>>   allocation changes.
>>
>> So Christoph wants to block truncate until references are released, and
>> forbid long term references until the userspace acquiring them supports
>> some kind of lease-breaking. OTOH you suggest truncate should just
>> proceed, leaving blocks allocated until references are released.
>
> I don't see what I'm suggesting as a solution to long term elevated
> page counts. Just something that can park extents until layout
> leases are broken and references released. That's a few tens of
> seconds at most.
>
>> We cannot have both... I'm leaning more towards the approach
>> Christoph suggests, as it puts the burden on the place which is
>> causing it - the application holding long term references - and
>> applications needing this should be sufficiently rare that we
>> don't have to devise a general mechanism in the kernel for this.
>
> I have no problems with blocking truncate forever if that's the
> desired solution for an elevated page count due to a DMA reference
> to a page. But that has absolutely nothing to do with the filesystem
> - it's a page reference vs mapping invalidation problem, not
> a filesystem/inode problem.
>
> Perhaps pages with active DAX DMA mapping references need a page
> flag to indicate that invalidation must block on the page, similar to
> the writeback flag...

We effectively already have this flag, since pages where
is_zone_device_page() == true can only have their reference count
elevated by get_user_pages(). More importantly, we cannot block
invalidation on an elevated page count, because that page count may
never drop until all references have been acquired. I.e.
iov_iter_get_pages() grabs a range of pages, potentially across
multiple vmas, and does not drop any references in the range until all
pages have had their count elevated.
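A sketch of that ordering hazard, assuming a driver that pins a
multi-page buffer up front; pin_io_range() and submit_and_wait() are
hypothetical stand-ins for the real driver path, and
get_user_pages_fast() is shown with its circa-4.14 signature:

    /*
     * Illustrative only: all pages are pinned before the first
     * reference is dropped, so a truncate that sleeps waiting for one
     * page's count to fall, while simultaneously blocking the faults
     * needed to pin the remaining pages, can wait forever.
     */
    static int pin_io_range(unsigned long uaddr, int nr_pages,
                            struct page **pages)
    {
            int i, npages;

            npages = get_user_pages_fast(uaddr, nr_pages, 1, pages);
            if (npages <= 0)
                    return npages;

            submit_and_wait(pages, npages);   /* hypothetical I/O */

            for (i = 0; i < npages; i++)
                    put_page(pages[i]);       /* refs drop only here */
            return 0;
    }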
>> If the solution Christoph suggests is acceptable to you, I think
>> we should first write a patch to forbid acquiring long term
>> references to DAX blocks. On top of that we can implement a
>> mechanism to block truncate while there are short term references
>> pending (and for that, retry loops would be IMHO acceptable).
>
> The problem with retry loops is that they are making a mess of an
> already complex set of locking constraints on the inode IO path. It's
> rapidly descending into an unmaintainable mess - falling off the
> locking cliff only makes the code harder to maintain - please look
> for solutions that don't require new locks or lock retry loops.

I was hoping to make the retry loop no worse than the one we already
perform for xfs_break_layouts(), so that the approach can be easily
shared between ext4 and xfs. However, before we get there we need
quite a bit of rework (require struct page for dax, use pfns in the
dax radix, disable long-held page reference counts for DAX, i.e.
RDMA / V4L2...). I'll submit those preparation steps first and then we
can circle back to the "how to wait for DAX-DMA to end" problem.
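For reference, the xfs_break_layouts() retry loop mentioned above
looked roughly like this in kernels of that era (paraphrased from
fs/xfs/xfs_pnfs.c; exact details may differ):

    int xfs_break_layouts(struct inode *inode, uint *iolock)
    {
            struct xfs_inode *ip = XFS_I(inode);
            int error;

            /*
             * break_layout() in non-blocking mode returns -EWOULDBLOCK
             * while pNFS layout leases are outstanding: drop the inode
             * lock, wait for the lease to be recalled, retake the lock,
             * and retry.
             */
            while ((error = break_layout(inode, false)) == -EWOULDBLOCK) {
                    xfs_iunlock(ip, *iolock);
                    error = break_layout(inode, true);
                    *iolock = XFS_IOLOCK_EXCL;
                    xfs_ilock(ip, *iolock);
            }
            return error;
    }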