From: Dan Williams <dan.j.williams@intel.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Michal Hocko <mhocko@suse.com>, Jan Kara <jack@suse.cz>,
	Benjamin Herrenschmidt <benh@kernel.crashing.org>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Heiko Carstens <heiko.carstens@de.ibm.com>,
	"J. Bruce Fields" <bfields@fieldses.org>,
	linux-mm <linux-mm@kvack.org>, Paul Mackerras <paulus@samba.org>,
	Jeff Layton <jlayton@poochiereds.net>,
	Christoph Hellwig <hch@lst.de>,
	Matthew Wilcox <mawilcox@microsoft.com>,
	linux-rdma <linux-rdma@vger.kernel.org>,
	Michael Ellerman <mpe@ellerman.id.au>,
	Jason Gunthorpe <jgunthorpe@obsidianresearch.com>,
	Doug Ledford <dledford@redhat.com>,
	Sean Hefty <sean.hefty@intel.com>,
	Hal Rosenstock <hal.rosenstock@gmail.com>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Gerald Schaefer <gerald.schaefer@de.ibm.com>,
	"linux-nvdimm@lists.01.org" <linux-nvdimm@lists.01.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	linux-xfs@vger.kernel.org,
	Martin Schwidefsky <schwidefsky@de.ibm.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	"Darrick J. Wong" <darrick.wong@oracle.com>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Subject: Re: [PATCH v3 00/13] dax: fix dma vs truncate and remove 'page-less' support
Date: Mon, 30 Oct 2017 10:51:30 -0700	[thread overview]
Message-ID: <CAPcyv4jhrPz5Rcx=oLi7EVsR2_wVcKLo1Ekouj369HXu_Nf_nw@mail.gmail.com> (raw)
In-Reply-To: <20171030112048.GA4133@dastard>

On Mon, Oct 30, 2017 at 4:20 AM, Dave Chinner <david@fromorbit.com> wrote:
> On Mon, Oct 30, 2017 at 09:38:07AM +0100, Jan Kara wrote:
>> Hi,
>>
>> On Mon 30-10-17 13:00:23, Dave Chinner wrote:
>> > On Sun, Oct 29, 2017 at 04:46:44PM -0700, Dan Williams wrote:
>> > > Coming back to this since Dave has made clear that new locking to
>> > > coordinate get_user_pages() is a no-go.
>> > >
>> > > We can unmap to force new get_user_pages() attempts to block on the
>> > > per-fs mmap lock, but if punch-hole finds any pages with elevated
>> > > reference counts it needs to drop the mmap lock and wait. We need this
>> > > lock dropped to get around the problem that the driver will not start
>> > > to drop page references until it has elevated the references on all
>> > > the pages in the I/O. If we need to drop the mmap lock, that makes it
>> > > impossible to coordinate this unlock/retry loop within
>> > > truncate_inode_pages_range(), which would otherwise be the natural
>> > > place to land this code.
>> > >
>> > > Would it be palatable to unmap and drain dma in any path that needs to
>> > > detach blocks from an inode? Something like the following, which builds
>> > > on what dax_wait_dma() tried to achieve but does not introduce a new
>> > > lock for the fs to manage:
>> > >
>> > > retry:
>> > >     per_fs_mmap_lock(inode);
>> > >     /* new page references cannot be established */
>> > >     unmap_mapping_range(mapping, start, end);
>> > >     if ((dax_page = dax_dma_busy_page(mapping, start, end)) != NULL) {
>> > >         /* new page references can happen, so we need to start over */
>> > >         per_fs_mmap_unlock(inode);
>> > >         wait_for_page_idle(dax_page);
>> > >         goto retry;
>> > >     }
>> > >     truncate_inode_pages_range(mapping, start, end);
>> > >     per_fs_mmap_unlock(inode);
>> >
>> > These retry loops you keep proposing are just bloody horrible.  They
>> > are basically just a method for blocking an operation until whatever
>> > condition is preventing the invalidation goes away. IMO, that's an
>> > ugly solution no matter how much lipstick you dress it up with.
>> >
>> > i.e. the blocking loops mean the user process is going to be blocked
>> > for arbitrary lengths of time. That's not a solution, it's just
>> > passing the buck - now the userspace developers need to work around
>> > truncate/hole punch being randomly blocked for arbitrary lengths of
>> > time.
>>
>> So I see a substantial difference between how you and Christoph think this
>> should be handled. Christoph writes in [1]:
>>
>> The point is that we need to prohibit long term elevated page counts
>> with DAX anyway - we can't just let people grab allocated blocks forever
>> while ignoring file system operations.  For stage 1 we'll just need to
>> fail those, and in the long run they will have to use a mechanism
>> similar to FL_LAYOUT locks to deal with file system allocation changes.
>>
>> So Christoph wants to block truncate until references are released, and to
>> forbid long term references until the userspace code acquiring them supports
>> some kind of lease-breaking. OTOH you suggest truncate should just proceed,
>> leaving blocks allocated until references are released.
>
> I don't see what I'm suggesting as a solution to long term elevated
> page counts, just something that can park extents until layout
> leases are broken and references released. That's a few tens of
> seconds at most.
>
>> We cannot have both... I'm leaning more towards the approach
>> Christoph suggests as it puts the burden on the place which is
>> causing it - the application having long term references - and
>> applications needing this should be sufficiently rare that we
>> don't have to devise a general mechanism in the kernel for this.
>
> I have no problems with blocking truncate forever if that's the
> desired solution for an elevated page count due to a DMA reference
> to a page. But that has absolutely nothing to do with the filesystem
> though - it's a page reference vs mapping invalidation problem, not
> a filesystem/inode problem.
>
> Perhaps pages with active DAX DMA mapping references need a page
> flag to indicate that invalidation must block on the page similar to
> the writeback flag...

We effectively already have this flag since pages where
is_zone_device_page() == true can only have their reference count
elevated by get_user_pages().
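
To make that concrete, here is a minimal sketch of what a busy-page scan
over a range could look like given that property. This is my
illustration, not code from the series: dax_radix_to_page() is a made-up
helper, and the sketch assumes an idle DAX page sits at a reference
count of one:

    /*
     * Hypothetical sketch only.  Walk the range looking for the first
     * page whose count is elevated above the idle baseline, which for
     * a ZONE_DEVICE page can only mean an outstanding
     * get_user_pages() reference.
     */
    static struct page *dax_dma_busy_page(struct address_space *mapping,
                                          pgoff_t start, pgoff_t end)
    {
            pgoff_t index;

            for (index = start; index <= end; index++) {
                    /* made-up helper: radix entry -> struct page */
                    struct page *page = dax_radix_to_page(mapping, index);

                    if (page && page_ref_count(page) > 1)
                            return page;
            }
            return NULL;
    }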

More importantly, we cannot block invalidation on an elevated page
count, because that count may not drop until all the references have
been acquired. I.e. iov_iter_get_pages() grabs a range of pages,
potentially across multiple vmas, and does not drop any reference in
the range until every page in the range has had its count elevated.
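
Schematically (my illustration, not from the thread), the hazard is the
classic two-phase acquire pattern:

    /*
     * Illustration only: iov_iter_get_pages() elevates every page in
     * the I/O before any reference is dropped.
     *
     *   GUP / driver thread             truncate thread
     *   -------------------             ---------------
     *   take ref on p0, p1
     *                                   sees p0 busy; blocks waiting
     *                                   for p0 to go idle, holding a
     *                                   lock the fault path needs
     *   faults on p2, blocks on
     *   that same lock
     *
     * Neither side can make progress, which is why the waiter has to
     * drop its locks and retry rather than block in place.
     */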

>> If the solution Christoph suggests is acceptable to you, I think
>> we should first write a patch to forbid acquiring long term
>> references to DAX blocks.  On top of that we can implement
>> mechanism to block truncate while there are short term references
>> pending (and for that retry loops would be IMHO acceptable).
>
> The problem with retry loops is that they are making a mess of an
> already complex set of locking constraints on the inode IO path. It's
> rapidly descending into an unmaintainable mess - falling off the
> locking cliff only makes the code harder to maintain - please look
> for solutions that don't require new locks or lock retry loops.

I was hoping to make the retry loop no worse than the one we already
perform for xfs_break_layouts(), and then the approach can be easily
shared between ext4 and xfs.
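
For reference, the existing XFS loop has roughly this shape --
simplified from my reading of fs/xfs/xfs_pnfs.c around this time, so
treat the details as approximate:

    int xfs_break_layouts(struct inode *inode, uint *iolock)
    {
            struct xfs_inode *ip = XFS_I(inode);
            int error;

            /*
             * Drop the fs locks, wait for the layout lease to be
             * broken, retake the locks, and re-check until no lease
             * remains.
             */
            while ((error = break_layout(inode, false)) == -EWOULDBLOCK) {
                    xfs_iunlock(ip, *iolock);
                    error = break_layout(inode, true);
                    *iolock = XFS_IOLOCK_EXCL;
                    xfs_ilock(ip, *iolock);
            }

            return error;
    }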

However, before we get there, we need quite a bit of rework (require
struct page for dax, store pfns in the dax radix, disable long-held page
reference counts for DAX, e.g. RDMA / V4L2...). I'll submit those
preparation steps first and then we can circle back to the "how to
wait for DAX-DMA to end" problem.

Thread overview: 143+ messages
2017-10-20  2:38 [PATCH v3 00/13] dax: fix dma vs truncate and remove 'page-less' support Dan Williams
2017-10-20  2:39 ` [PATCH v3 01/13] dax: quiet bdev_dax_supported() Dan Williams
2017-10-20  2:39 ` [PATCH v3 02/13] dax: require 'struct page' for filesystem dax Dan Williams
2017-10-20  7:57   ` Christoph Hellwig
2017-10-20 15:23     ` Dan Williams
2017-10-20 16:29       ` Christoph Hellwig
2017-10-20 22:29         ` Dan Williams
2017-10-21  3:20           ` Matthew Wilcox
2017-10-21  4:16             ` Dan Williams
2017-10-21  8:15               ` Christoph Hellwig
2017-10-23  5:18         ` Martin Schwidefsky
2017-10-23  8:55           ` Dan Williams
2017-10-23 10:44             ` Martin Schwidefsky
2017-10-23 11:20               ` Dan Williams
2017-10-20  2:39 ` [PATCH v3 03/13] dax: stop using VM_MIXEDMAP for dax Dan Williams
2017-10-20  2:39 ` [PATCH v3 04/13] dax: stop using VM_HUGEPAGE " Dan Williams
2017-10-20  2:39 ` [PATCH v3 05/13] dax: stop requiring a live device for dax_flush() Dan Williams
2017-10-20  2:39 ` [PATCH v3 06/13] dax: store pfns in the radix Dan Williams
2017-10-20  2:39 ` [PATCH v3 07/13] dax: warn if dma collides with truncate Dan Williams
2017-10-20  2:39 ` [PATCH v3 08/13] tools/testing/nvdimm: add 'bio_delay' mechanism Dan Williams
2017-10-20  2:39 ` [PATCH v3 09/13] IB/core: disable memory registration of fileystem-dax vmas Dan Williams
2017-10-20  2:39 ` [PATCH v3 10/13] mm: disable get_user_pages_fast() for dax Dan Williams
2017-10-20  2:39 ` [PATCH v3 11/13] fs: use smp_load_acquire in break_{layout,lease} Dan Williams
2017-10-20 12:39   ` Jeffrey Layton
2017-10-20  2:40 ` [PATCH v3 12/13] dax: handle truncate of dma-busy pages Dan Williams
2017-10-20 13:05   ` Jeff Layton
2017-10-20 15:42     ` Dan Williams
2017-10-20 16:32       ` Christoph Hellwig
2017-10-20 17:27         ` Dan Williams
2017-10-20 20:36           ` Brian Foster
2017-10-21  8:11           ` Christoph Hellwig
2017-10-20  2:40 ` [PATCH v3 13/13] xfs: wire up FL_ALLOCATED support Dan Williams
2017-10-20  7:47 ` [PATCH v3 00/13] dax: fix dma vs truncate and remove 'page-less' support Christoph Hellwig
2017-10-20  9:31   ` Christoph Hellwig
2017-10-26 10:58     ` Jan Kara
2017-10-26 23:51       ` Williams, Dan J
2017-10-27  6:48         ` Dave Chinner
2017-10-27 11:42           ` Dan Williams
2017-10-29 21:52             ` Dave Chinner
2017-10-27  6:45       ` Christoph Hellwig
2017-10-29 23:46       ` Dan Williams
2017-10-30  2:00         ` Dave Chinner
2017-10-30  8:38           ` Jan Kara
2017-10-30 11:20             ` Dave Chinner
2017-10-30 17:51               ` Dan Williams [this message]
