All of lore.kernel.org
 help / color / mirror / Atom feed
From: Dan Williams <dan.j.williams@intel.com>
To: Jeff Layton <jlayton@kernel.org>
Cc: linux-xfs@vger.kernel.org, Jan Kara <jack@suse.cz>,
	Matthew Wilcox <mawilcox@microsoft.com>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Dave Chinner <david@fromorbit.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Christoph Hellwig <hch@lst.de>,
	"J. Bruce Fields" <bfields@fieldses.org>,
	Linux MM <linux-mm@kvack.org>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	"Darrick J. Wong" <darrick.wong@oracle.com>,
	"linux-nvdimm@lists.01.org" <linux-nvdimm@lists.01.org>
Subject: Re: [PATCH v3 12/13] dax: handle truncate of dma-busy pages
Date: Fri, 20 Oct 2017 08:42:00 -0700	[thread overview]
Message-ID: <CAPcyv4hXCJYTkUKs6NiOp=8kgExu+bgZnVn_v+Os7fVUc2NxFg@mail.gmail.com> (raw)
In-Reply-To: <1508504726.5572.41.camel@kernel.org>

On Fri, Oct 20, 2017 at 6:05 AM, Jeff Layton <jlayton@kernel.org> wrote:
> On Thu, 2017-10-19 at 19:40 -0700, Dan Williams wrote:
>> get_user_pages() pins file backed memory pages for access by dma
>> devices. However, it only pins the memory pages not the page-to-file
>> offset association. If a file is truncated the pages are mapped out of
>> the file and dma may continue indefinitely into a page that is owned by
>> a device driver. This breaks coherency of the file vs dma, but the
>> assumption is that if userspace wants the file-space truncated it does
>> not matter what data is inbound from the device, it is not relevant
>> anymore.
>>
>> The assumptions of the truncate-page-cache model are broken by DAX where
>> the target DMA page *is* the filesystem block. Leaving the page pinned
>> for DMA, but truncating the file block out of the file, means that the
>> filesytem is free to reallocate a block under active DMA to another
>> file!
>>
>> Here are some possible options for fixing this situation ('truncate' and
>> 'fallocate(punch hole)' are synonymous below):
>>
>>     1/ Fail truncate while any file blocks might be under dma
>>
>>     2/ Block (sleep-wait) truncate while any file blocks might be under
>>        dma
>>
>>     3/ Remap file blocks to a "lost+found"-like file-inode where
>>        dma can continue and we might see what inbound data from DMA was
>>        mapped out of the original file. Blocks in this file could be
>>        freed back to the filesystem when dma eventually ends.
>>
>>     4/ Disable dax until option 3 or another long term solution has been
>>        implemented. However, filesystem-dax is still marked experimental
>>        for concerns like this.
>>
>> Option 1 will throw failures where userspace has never expected them
>> before, option 2 might hang the truncating process indefinitely, and
>> option 3 requires per filesystem enabling to remap blocks from one inode
>> to another.  Option 2 is implemented in this patch for the DAX path with
>> the expectation that non-transient users of get_user_pages() (RDMA) are
>> disallowed from setting up dax mappings and that the potential delay
>> introduced to the truncate path is acceptable compared to the response
>> time of the page cache case. This can only be seen as a stop-gap until
>> we can solve the problem of safely sequestering unallocated filesystem
>> blocks under active dma.
>>
>
> FWIW, I like #3 a lot more than #2 here. I get that it's quite a bit
> more work though, so no objection to this as a stop-gap fix.

I agree, but it needs quite a bit more thought and restructuring of
the truncate path. I also wonder how we reclaim those stranded
filesystem blocks, but a first approximation is wait for the
administrator to delete them or auto-delete them at the next mount.
XFS seems well prepared to reflink-swap these DMA blocks around, but
I'm not sure about EXT4.

>
>
>> The solution introduces a new FL_ALLOCATED lease to pin the allocated
>> blocks in a dax file while dma might be accessing them. It behaves
>> identically to an FL_LAYOUT lease save for the fact that it is
>> immediately sheduled to be reaped, and that the only path that waits for
>> its removal is the truncate path. We can not reuse FL_LAYOUT directly
>> since that would deadlock in the case where userspace did a direct-I/O
>> operation with a target buffer backed by an mmap range of the same file.
>>
>> Credit / inspiration for option 3 goes to Dave Hansen, who proposed
>> something similar as an alternative way to solve the problem that
>> MAP_DIRECT was trying to solve.
>>
>> Cc: Jan Kara <jack@suse.cz>
>> Cc: Jeff Moyer <jmoyer@redhat.com>
>> Cc: Dave Chinner <david@fromorbit.com>
>> Cc: Matthew Wilcox <mawilcox@microsoft.com>
>> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
>> Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
>> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>> Cc: Jeff Layton <jlayton@poochiereds.net>
>> Cc: "J. Bruce Fields" <bfields@fieldses.org>
>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>> Reported-by: Christoph Hellwig <hch@lst.de>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> ---
>>  fs/Kconfig          |    1
>>  fs/dax.c            |  188 +++++++++++++++++++++++++++++++++++++++++++++++++++
>>  fs/locks.c          |   17 ++++-
>>  include/linux/dax.h |   23 ++++++
>>  include/linux/fs.h  |   22 +++++-
>>  mm/gup.c            |   27 ++++++-
>>  6 files changed, 268 insertions(+), 10 deletions(-)
>>
>> diff --git a/fs/Kconfig b/fs/Kconfig
>> index 7aee6d699fd6..a7b31a96a753 100644
>> --- a/fs/Kconfig
>> +++ b/fs/Kconfig
>> @@ -37,6 +37,7 @@ source "fs/f2fs/Kconfig"
>>  config FS_DAX
>>       bool "Direct Access (DAX) support"
>>       depends on MMU
>> +     depends on FILE_LOCKING
>>       depends on !(ARM || MIPS || SPARC)
>>       select FS_IOMAP
>>       select DAX
>> diff --git a/fs/dax.c b/fs/dax.c
>> index b03f547b36e7..e0a3958fc5f2 100644
>> --- a/fs/dax.c
>> +++ b/fs/dax.c
>> @@ -22,6 +22,7 @@
>>  #include <linux/genhd.h>
>>  #include <linux/highmem.h>
>>  #include <linux/memcontrol.h>
>> +#include <linux/file.h>
>>  #include <linux/mm.h>
>>  #include <linux/mutex.h>
>>  #include <linux/pagevec.h>
>> @@ -1481,3 +1482,190 @@ int dax_iomap_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
>>       }
>>  }
>>  EXPORT_SYMBOL_GPL(dax_iomap_fault);
>> +
>> +enum dax_lease_flags {
>> +     DAX_LEASE_PAGES,
>> +     DAX_LEASE_BREAK,
>> +};
>> +
>> +struct dax_lease {
>> +     struct page **dl_pages;
>> +     unsigned long dl_nr_pages;
>> +     unsigned long dl_state;
>> +     struct file *dl_file;
>> +     atomic_t dl_count;
>> +     /*
>> +      * Once the lease is taken and the pages have references we
>> +      * start the reap_work to poll for lease release while acquiring
>> +      * fs locks that synchronize with truncate. So, either reap_work
>> +      * cleans up the dax_lease instances or truncate itself.
>> +      *
>> +      * The break_work sleepily polls for DMA completion and then
>> +      * unlocks/removes the lease.
>> +      */
>> +     struct delayed_work dl_reap_work;
>> +     struct delayed_work dl_break_work;
>> +};
>> +
>> +static void put_dax_lease(struct dax_lease *dl)
>> +{
>> +     if (atomic_dec_and_test(&dl->dl_count)) {
>> +             fput(dl->dl_file);
>> +             kfree(dl);
>> +     }
>> +}
>
> Any reason not to use the new refcount_t type for dl_count? Seems like a
> good place for it.

I'll take a look.

>> +
>> +static void dax_lease_unlock_one(struct work_struct *work)
>> +{
>> +     struct dax_lease *dl = container_of(work, typeof(*dl),
>> +                     dl_break_work.work);
>> +     unsigned long i;
>> +
>> +     /* wait for the gup path to finish recording pages in the lease */
>> +     if (!test_bit(DAX_LEASE_PAGES, &dl->dl_state)) {
>> +             schedule_delayed_work(&dl->dl_break_work, HZ);
>> +             return;
>> +     }
>> +
>> +     /* barrier pairs with dax_lease_set_pages() */
>> +     smp_mb__after_atomic();
>> +
>> +     /*
>> +      * If we see all pages idle at least once we can remove the
>> +      * lease. If we happen to race with someone else taking a
>> +      * reference on a page they will have their own lease to protect
>> +      * against truncate.
>> +      */
>> +     for (i = 0; i < dl->dl_nr_pages; i++)
>> +             if (page_ref_count(dl->dl_pages[i]) > 1) {
>> +                     schedule_delayed_work(&dl->dl_break_work, HZ);
>> +                     return;
>> +             }
>> +     vfs_setlease(dl->dl_file, F_UNLCK, NULL, (void **) &dl);
>> +     put_dax_lease(dl);
>> +}
>> +
>> +static void dax_lease_reap_all(struct work_struct *work)
>> +{
>> +     struct dax_lease *dl = container_of(work, typeof(*dl),
>> +                     dl_reap_work.work);
>> +     struct file *file = dl->dl_file;
>> +     struct inode *inode = file_inode(file);
>> +     struct address_space *mapping = inode->i_mapping;
>> +
>> +     if (mapping->a_ops->dax_flush_dma) {
>> +             mapping->a_ops->dax_flush_dma(inode);
>> +     } else {
>> +             /* FIXME: dax-filesystem needs to add dax-dma support */
>> +             break_allocated(inode, true);
>> +     }
>> +     put_dax_lease(dl);
>> +}
>> +
>> +static bool dax_lease_lm_break(struct file_lock *fl)
>> +{
>> +     struct dax_lease *dl = fl->fl_owner;
>> +
>> +     if (!test_and_set_bit(DAX_LEASE_BREAK, &dl->dl_state)) {
>> +             atomic_inc(&dl->dl_count);
>> +             schedule_delayed_work(&dl->dl_break_work, HZ);
>> +     }
>> +
>
> I haven't gone over this completely, but what prevents you from doing a
> 0->1 transition on the dl_count here, and possibly doing a use-after
> free?
>
> Ahh ok...I guess we know that we hold a reference since this is on the
> flc_lease list? Fair enough. Still, might be worth a comment there as to
> why that's safe.

Right, we hold a reference count at the beginning of time that is only
dropped when the lease is unlocked. If the break happens before unlock
we take this reference while the break_work is running. I'll add this
as a comment.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

WARNING: multiple messages have this Message-ID (diff)
From: Dan Williams <dan.j.williams@intel.com>
To: Jeff Layton <jlayton@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Jan Kara <jack@suse.cz>, Matthew Wilcox <mawilcox@microsoft.com>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Dave Chinner <david@fromorbit.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"J. Bruce Fields" <bfields@fieldses.org>,
	Linux MM <linux-mm@kvack.org>, Jeff Moyer <jmoyer@redhat.com>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	"Darrick J. Wong" <darrick.wong@oracle.com>,
	Ross Zwisler <ross.zwisler@linux.intel.com>,
	linux-xfs@vger.kernel.org, Christoph Hellwig <hch@lst.de>,
	"linux-nvdimm@lists.01.org" <linux-nvdimm@lists.01.org>
Subject: Re: [PATCH v3 12/13] dax: handle truncate of dma-busy pages
Date: Fri, 20 Oct 2017 08:42:00 -0700	[thread overview]
Message-ID: <CAPcyv4hXCJYTkUKs6NiOp=8kgExu+bgZnVn_v+Os7fVUc2NxFg@mail.gmail.com> (raw)
In-Reply-To: <1508504726.5572.41.camel@kernel.org>

On Fri, Oct 20, 2017 at 6:05 AM, Jeff Layton <jlayton@kernel.org> wrote:
> On Thu, 2017-10-19 at 19:40 -0700, Dan Williams wrote:
>> get_user_pages() pins file backed memory pages for access by dma
>> devices. However, it only pins the memory pages not the page-to-file
>> offset association. If a file is truncated the pages are mapped out of
>> the file and dma may continue indefinitely into a page that is owned by
>> a device driver. This breaks coherency of the file vs dma, but the
>> assumption is that if userspace wants the file-space truncated it does
>> not matter what data is inbound from the device, it is not relevant
>> anymore.
>>
>> The assumptions of the truncate-page-cache model are broken by DAX where
>> the target DMA page *is* the filesystem block. Leaving the page pinned
>> for DMA, but truncating the file block out of the file, means that the
>> filesytem is free to reallocate a block under active DMA to another
>> file!
>>
>> Here are some possible options for fixing this situation ('truncate' and
>> 'fallocate(punch hole)' are synonymous below):
>>
>>     1/ Fail truncate while any file blocks might be under dma
>>
>>     2/ Block (sleep-wait) truncate while any file blocks might be under
>>        dma
>>
>>     3/ Remap file blocks to a "lost+found"-like file-inode where
>>        dma can continue and we might see what inbound data from DMA was
>>        mapped out of the original file. Blocks in this file could be
>>        freed back to the filesystem when dma eventually ends.
>>
>>     4/ Disable dax until option 3 or another long term solution has been
>>        implemented. However, filesystem-dax is still marked experimental
>>        for concerns like this.
>>
>> Option 1 will throw failures where userspace has never expected them
>> before, option 2 might hang the truncating process indefinitely, and
>> option 3 requires per filesystem enabling to remap blocks from one inode
>> to another.  Option 2 is implemented in this patch for the DAX path with
>> the expectation that non-transient users of get_user_pages() (RDMA) are
>> disallowed from setting up dax mappings and that the potential delay
>> introduced to the truncate path is acceptable compared to the response
>> time of the page cache case. This can only be seen as a stop-gap until
>> we can solve the problem of safely sequestering unallocated filesystem
>> blocks under active dma.
>>
>
> FWIW, I like #3 a lot more than #2 here. I get that it's quite a bit
> more work though, so no objection to this as a stop-gap fix.

I agree, but it needs quite a bit more thought and restructuring of
the truncate path. I also wonder how we reclaim those stranded
filesystem blocks, but a first approximation is wait for the
administrator to delete them or auto-delete them at the next mount.
XFS seems well prepared to reflink-swap these DMA blocks around, but
I'm not sure about EXT4.

>
>
>> The solution introduces a new FL_ALLOCATED lease to pin the allocated
>> blocks in a dax file while dma might be accessing them. It behaves
>> identically to an FL_LAYOUT lease save for the fact that it is
>> immediately sheduled to be reaped, and that the only path that waits for
>> its removal is the truncate path. We can not reuse FL_LAYOUT directly
>> since that would deadlock in the case where userspace did a direct-I/O
>> operation with a target buffer backed by an mmap range of the same file.
>>
>> Credit / inspiration for option 3 goes to Dave Hansen, who proposed
>> something similar as an alternative way to solve the problem that
>> MAP_DIRECT was trying to solve.
>>
>> Cc: Jan Kara <jack@suse.cz>
>> Cc: Jeff Moyer <jmoyer@redhat.com>
>> Cc: Dave Chinner <david@fromorbit.com>
>> Cc: Matthew Wilcox <mawilcox@microsoft.com>
>> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
>> Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
>> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>> Cc: Jeff Layton <jlayton@poochiereds.net>
>> Cc: "J. Bruce Fields" <bfields@fieldses.org>
>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>> Reported-by: Christoph Hellwig <hch@lst.de>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> ---
>>  fs/Kconfig          |    1
>>  fs/dax.c            |  188 +++++++++++++++++++++++++++++++++++++++++++++++++++
>>  fs/locks.c          |   17 ++++-
>>  include/linux/dax.h |   23 ++++++
>>  include/linux/fs.h  |   22 +++++-
>>  mm/gup.c            |   27 ++++++-
>>  6 files changed, 268 insertions(+), 10 deletions(-)
>>
>> diff --git a/fs/Kconfig b/fs/Kconfig
>> index 7aee6d699fd6..a7b31a96a753 100644
>> --- a/fs/Kconfig
>> +++ b/fs/Kconfig
>> @@ -37,6 +37,7 @@ source "fs/f2fs/Kconfig"
>>  config FS_DAX
>>       bool "Direct Access (DAX) support"
>>       depends on MMU
>> +     depends on FILE_LOCKING
>>       depends on !(ARM || MIPS || SPARC)
>>       select FS_IOMAP
>>       select DAX
>> diff --git a/fs/dax.c b/fs/dax.c
>> index b03f547b36e7..e0a3958fc5f2 100644
>> --- a/fs/dax.c
>> +++ b/fs/dax.c
>> @@ -22,6 +22,7 @@
>>  #include <linux/genhd.h>
>>  #include <linux/highmem.h>
>>  #include <linux/memcontrol.h>
>> +#include <linux/file.h>
>>  #include <linux/mm.h>
>>  #include <linux/mutex.h>
>>  #include <linux/pagevec.h>
>> @@ -1481,3 +1482,190 @@ int dax_iomap_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
>>       }
>>  }
>>  EXPORT_SYMBOL_GPL(dax_iomap_fault);
>> +
>> +enum dax_lease_flags {
>> +     DAX_LEASE_PAGES,
>> +     DAX_LEASE_BREAK,
>> +};
>> +
>> +struct dax_lease {
>> +     struct page **dl_pages;
>> +     unsigned long dl_nr_pages;
>> +     unsigned long dl_state;
>> +     struct file *dl_file;
>> +     atomic_t dl_count;
>> +     /*
>> +      * Once the lease is taken and the pages have references we
>> +      * start the reap_work to poll for lease release while acquiring
>> +      * fs locks that synchronize with truncate. So, either reap_work
>> +      * cleans up the dax_lease instances or truncate itself.
>> +      *
>> +      * The break_work sleepily polls for DMA completion and then
>> +      * unlocks/removes the lease.
>> +      */
>> +     struct delayed_work dl_reap_work;
>> +     struct delayed_work dl_break_work;
>> +};
>> +
>> +static void put_dax_lease(struct dax_lease *dl)
>> +{
>> +     if (atomic_dec_and_test(&dl->dl_count)) {
>> +             fput(dl->dl_file);
>> +             kfree(dl);
>> +     }
>> +}
>
> Any reason not to use the new refcount_t type for dl_count? Seems like a
> good place for it.

I'll take a look.

>> +
>> +static void dax_lease_unlock_one(struct work_struct *work)
>> +{
>> +     struct dax_lease *dl = container_of(work, typeof(*dl),
>> +                     dl_break_work.work);
>> +     unsigned long i;
>> +
>> +     /* wait for the gup path to finish recording pages in the lease */
>> +     if (!test_bit(DAX_LEASE_PAGES, &dl->dl_state)) {
>> +             schedule_delayed_work(&dl->dl_break_work, HZ);
>> +             return;
>> +     }
>> +
>> +     /* barrier pairs with dax_lease_set_pages() */
>> +     smp_mb__after_atomic();
>> +
>> +     /*
>> +      * If we see all pages idle at least once we can remove the
>> +      * lease. If we happen to race with someone else taking a
>> +      * reference on a page they will have their own lease to protect
>> +      * against truncate.
>> +      */
>> +     for (i = 0; i < dl->dl_nr_pages; i++)
>> +             if (page_ref_count(dl->dl_pages[i]) > 1) {
>> +                     schedule_delayed_work(&dl->dl_break_work, HZ);
>> +                     return;
>> +             }
>> +     vfs_setlease(dl->dl_file, F_UNLCK, NULL, (void **) &dl);
>> +     put_dax_lease(dl);
>> +}
>> +
>> +static void dax_lease_reap_all(struct work_struct *work)
>> +{
>> +     struct dax_lease *dl = container_of(work, typeof(*dl),
>> +                     dl_reap_work.work);
>> +     struct file *file = dl->dl_file;
>> +     struct inode *inode = file_inode(file);
>> +     struct address_space *mapping = inode->i_mapping;
>> +
>> +     if (mapping->a_ops->dax_flush_dma) {
>> +             mapping->a_ops->dax_flush_dma(inode);
>> +     } else {
>> +             /* FIXME: dax-filesystem needs to add dax-dma support */
>> +             break_allocated(inode, true);
>> +     }
>> +     put_dax_lease(dl);
>> +}
>> +
>> +static bool dax_lease_lm_break(struct file_lock *fl)
>> +{
>> +     struct dax_lease *dl = fl->fl_owner;
>> +
>> +     if (!test_and_set_bit(DAX_LEASE_BREAK, &dl->dl_state)) {
>> +             atomic_inc(&dl->dl_count);
>> +             schedule_delayed_work(&dl->dl_break_work, HZ);
>> +     }
>> +
>
> I haven't gone over this completely, but what prevents you from doing a
> 0->1 transition on the dl_count here, and possibly doing a use-after
> free?
>
> Ahh ok...I guess we know that we hold a reference since this is on the
> flc_lease list? Fair enough. Still, might be worth a comment there as to
> why that's safe.

Right, we hold a reference count at the beginning of time that is only
dropped when the lease is unlocked. If the break happens before unlock
we take this reference while the break_work is running. I'll add this
as a comment.

WARNING: multiple messages have this Message-ID (diff)
From: Dan Williams <dan.j.williams@intel.com>
To: Jeff Layton <jlayton@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Jan Kara <jack@suse.cz>, Matthew Wilcox <mawilcox@microsoft.com>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Dave Chinner <david@fromorbit.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"J. Bruce Fields" <bfields@fieldses.org>,
	Linux MM <linux-mm@kvack.org>, Jeff Moyer <jmoyer@redhat.com>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	"Darrick J. Wong" <darrick.wong@oracle.com>,
	Ross Zwisler <ross.zwisler@linux.intel.com>,
	linux-xfs@vger.kernel.org, Christoph Hellwig <hch@lst.de>,
	"linux-nvdimm@lists.01.org" <linux-nvdimm@lists.01.org>
Subject: Re: [PATCH v3 12/13] dax: handle truncate of dma-busy pages
Date: Fri, 20 Oct 2017 08:42:00 -0700	[thread overview]
Message-ID: <CAPcyv4hXCJYTkUKs6NiOp=8kgExu+bgZnVn_v+Os7fVUc2NxFg@mail.gmail.com> (raw)
In-Reply-To: <1508504726.5572.41.camel@kernel.org>

On Fri, Oct 20, 2017 at 6:05 AM, Jeff Layton <jlayton@kernel.org> wrote:
> On Thu, 2017-10-19 at 19:40 -0700, Dan Williams wrote:
>> get_user_pages() pins file backed memory pages for access by dma
>> devices. However, it only pins the memory pages not the page-to-file
>> offset association. If a file is truncated the pages are mapped out of
>> the file and dma may continue indefinitely into a page that is owned by
>> a device driver. This breaks coherency of the file vs dma, but the
>> assumption is that if userspace wants the file-space truncated it does
>> not matter what data is inbound from the device, it is not relevant
>> anymore.
>>
>> The assumptions of the truncate-page-cache model are broken by DAX where
>> the target DMA page *is* the filesystem block. Leaving the page pinned
>> for DMA, but truncating the file block out of the file, means that the
>> filesytem is free to reallocate a block under active DMA to another
>> file!
>>
>> Here are some possible options for fixing this situation ('truncate' and
>> 'fallocate(punch hole)' are synonymous below):
>>
>>     1/ Fail truncate while any file blocks might be under dma
>>
>>     2/ Block (sleep-wait) truncate while any file blocks might be under
>>        dma
>>
>>     3/ Remap file blocks to a "lost+found"-like file-inode where
>>        dma can continue and we might see what inbound data from DMA was
>>        mapped out of the original file. Blocks in this file could be
>>        freed back to the filesystem when dma eventually ends.
>>
>>     4/ Disable dax until option 3 or another long term solution has been
>>        implemented. However, filesystem-dax is still marked experimental
>>        for concerns like this.
>>
>> Option 1 will throw failures where userspace has never expected them
>> before, option 2 might hang the truncating process indefinitely, and
>> option 3 requires per filesystem enabling to remap blocks from one inode
>> to another.  Option 2 is implemented in this patch for the DAX path with
>> the expectation that non-transient users of get_user_pages() (RDMA) are
>> disallowed from setting up dax mappings and that the potential delay
>> introduced to the truncate path is acceptable compared to the response
>> time of the page cache case. This can only be seen as a stop-gap until
>> we can solve the problem of safely sequestering unallocated filesystem
>> blocks under active dma.
>>
>
> FWIW, I like #3 a lot more than #2 here. I get that it's quite a bit
> more work though, so no objection to this as a stop-gap fix.

I agree, but it needs quite a bit more thought and restructuring of
the truncate path. I also wonder how we reclaim those stranded
filesystem blocks, but a first approximation is wait for the
administrator to delete them or auto-delete them at the next mount.
XFS seems well prepared to reflink-swap these DMA blocks around, but
I'm not sure about EXT4.

>
>
>> The solution introduces a new FL_ALLOCATED lease to pin the allocated
>> blocks in a dax file while dma might be accessing them. It behaves
>> identically to an FL_LAYOUT lease save for the fact that it is
>> immediately sheduled to be reaped, and that the only path that waits for
>> its removal is the truncate path. We can not reuse FL_LAYOUT directly
>> since that would deadlock in the case where userspace did a direct-I/O
>> operation with a target buffer backed by an mmap range of the same file.
>>
>> Credit / inspiration for option 3 goes to Dave Hansen, who proposed
>> something similar as an alternative way to solve the problem that
>> MAP_DIRECT was trying to solve.
>>
>> Cc: Jan Kara <jack@suse.cz>
>> Cc: Jeff Moyer <jmoyer@redhat.com>
>> Cc: Dave Chinner <david@fromorbit.com>
>> Cc: Matthew Wilcox <mawilcox@microsoft.com>
>> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
>> Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
>> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>> Cc: Jeff Layton <jlayton@poochiereds.net>
>> Cc: "J. Bruce Fields" <bfields@fieldses.org>
>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>> Reported-by: Christoph Hellwig <hch@lst.de>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> ---
>>  fs/Kconfig          |    1
>>  fs/dax.c            |  188 +++++++++++++++++++++++++++++++++++++++++++++++++++
>>  fs/locks.c          |   17 ++++-
>>  include/linux/dax.h |   23 ++++++
>>  include/linux/fs.h  |   22 +++++-
>>  mm/gup.c            |   27 ++++++-
>>  6 files changed, 268 insertions(+), 10 deletions(-)
>>
>> diff --git a/fs/Kconfig b/fs/Kconfig
>> index 7aee6d699fd6..a7b31a96a753 100644
>> --- a/fs/Kconfig
>> +++ b/fs/Kconfig
>> @@ -37,6 +37,7 @@ source "fs/f2fs/Kconfig"
>>  config FS_DAX
>>       bool "Direct Access (DAX) support"
>>       depends on MMU
>> +     depends on FILE_LOCKING
>>       depends on !(ARM || MIPS || SPARC)
>>       select FS_IOMAP
>>       select DAX
>> diff --git a/fs/dax.c b/fs/dax.c
>> index b03f547b36e7..e0a3958fc5f2 100644
>> --- a/fs/dax.c
>> +++ b/fs/dax.c
>> @@ -22,6 +22,7 @@
>>  #include <linux/genhd.h>
>>  #include <linux/highmem.h>
>>  #include <linux/memcontrol.h>
>> +#include <linux/file.h>
>>  #include <linux/mm.h>
>>  #include <linux/mutex.h>
>>  #include <linux/pagevec.h>
>> @@ -1481,3 +1482,190 @@ int dax_iomap_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
>>       }
>>  }
>>  EXPORT_SYMBOL_GPL(dax_iomap_fault);
>> +
>> +enum dax_lease_flags {
>> +     DAX_LEASE_PAGES,
>> +     DAX_LEASE_BREAK,
>> +};
>> +
>> +struct dax_lease {
>> +     struct page **dl_pages;
>> +     unsigned long dl_nr_pages;
>> +     unsigned long dl_state;
>> +     struct file *dl_file;
>> +     atomic_t dl_count;
>> +     /*
>> +      * Once the lease is taken and the pages have references we
>> +      * start the reap_work to poll for lease release while acquiring
>> +      * fs locks that synchronize with truncate. So, either reap_work
>> +      * cleans up the dax_lease instances or truncate itself.
>> +      *
>> +      * The break_work sleepily polls for DMA completion and then
>> +      * unlocks/removes the lease.
>> +      */
>> +     struct delayed_work dl_reap_work;
>> +     struct delayed_work dl_break_work;
>> +};
>> +
>> +static void put_dax_lease(struct dax_lease *dl)
>> +{
>> +     if (atomic_dec_and_test(&dl->dl_count)) {
>> +             fput(dl->dl_file);
>> +             kfree(dl);
>> +     }
>> +}
>
> Any reason not to use the new refcount_t type for dl_count? Seems like a
> good place for it.

I'll take a look.

>> +
>> +static void dax_lease_unlock_one(struct work_struct *work)
>> +{
>> +     struct dax_lease *dl = container_of(work, typeof(*dl),
>> +                     dl_break_work.work);
>> +     unsigned long i;
>> +
>> +     /* wait for the gup path to finish recording pages in the lease */
>> +     if (!test_bit(DAX_LEASE_PAGES, &dl->dl_state)) {
>> +             schedule_delayed_work(&dl->dl_break_work, HZ);
>> +             return;
>> +     }
>> +
>> +     /* barrier pairs with dax_lease_set_pages() */
>> +     smp_mb__after_atomic();
>> +
>> +     /*
>> +      * If we see all pages idle at least once we can remove the
>> +      * lease. If we happen to race with someone else taking a
>> +      * reference on a page they will have their own lease to protect
>> +      * against truncate.
>> +      */
>> +     for (i = 0; i < dl->dl_nr_pages; i++)
>> +             if (page_ref_count(dl->dl_pages[i]) > 1) {
>> +                     schedule_delayed_work(&dl->dl_break_work, HZ);
>> +                     return;
>> +             }
>> +     vfs_setlease(dl->dl_file, F_UNLCK, NULL, (void **) &dl);
>> +     put_dax_lease(dl);
>> +}
>> +
>> +static void dax_lease_reap_all(struct work_struct *work)
>> +{
>> +     struct dax_lease *dl = container_of(work, typeof(*dl),
>> +                     dl_reap_work.work);
>> +     struct file *file = dl->dl_file;
>> +     struct inode *inode = file_inode(file);
>> +     struct address_space *mapping = inode->i_mapping;
>> +
>> +     if (mapping->a_ops->dax_flush_dma) {
>> +             mapping->a_ops->dax_flush_dma(inode);
>> +     } else {
>> +             /* FIXME: dax-filesystem needs to add dax-dma support */
>> +             break_allocated(inode, true);
>> +     }
>> +     put_dax_lease(dl);
>> +}
>> +
>> +static bool dax_lease_lm_break(struct file_lock *fl)
>> +{
>> +     struct dax_lease *dl = fl->fl_owner;
>> +
>> +     if (!test_and_set_bit(DAX_LEASE_BREAK, &dl->dl_state)) {
>> +             atomic_inc(&dl->dl_count);
>> +             schedule_delayed_work(&dl->dl_break_work, HZ);
>> +     }
>> +
>
> I haven't gone over this completely, but what prevents you from doing a
> 0->1 transition on the dl_count here, and possibly doing a use-after
> free?
>
> Ahh ok...I guess we know that we hold a reference since this is on the
> flc_lease list? Fair enough. Still, might be worth a comment there as to
> why that's safe.

Right, we hold a reference count at the beginning of time that is only
dropped when the lease is unlocked. If the break happens before unlock
we take this reference while the break_work is running. I'll add this
as a comment.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2017-10-20 15:38 UTC|newest]

Thread overview: 143+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-10-20  2:38 [PATCH v3 00/13] dax: fix dma vs truncate and remove 'page-less' support Dan Williams
2017-10-20  2:38 ` Dan Williams
2017-10-20  2:38 ` Dan Williams
2017-10-20  2:38 ` Dan Williams
2017-10-20  2:39 ` [PATCH v3 01/13] dax: quiet bdev_dax_supported() Dan Williams
2017-10-20  2:39   ` Dan Williams
2017-10-20  2:39   ` Dan Williams
2017-10-20  2:39 ` [PATCH v3 02/13] dax: require 'struct page' for filesystem dax Dan Williams
2017-10-20  2:39   ` Dan Williams
2017-10-20  2:39   ` Dan Williams
2017-10-20  7:57   ` Christoph Hellwig
2017-10-20  7:57     ` Christoph Hellwig
2017-10-20 15:23     ` Dan Williams
2017-10-20 15:23       ` Dan Williams
2017-10-20 15:23       ` Dan Williams
2017-10-20 16:29       ` Christoph Hellwig
2017-10-20 16:29         ` Christoph Hellwig
2017-10-20 16:29         ` Christoph Hellwig
2017-10-20 16:29         ` Christoph Hellwig
2017-10-20 22:29         ` Dan Williams
2017-10-20 22:29           ` Dan Williams
2017-10-20 22:29           ` Dan Williams
2017-10-21  3:20           ` Matthew Wilcox
2017-10-21  3:20             ` Matthew Wilcox
2017-10-21  3:20             ` Matthew Wilcox
2017-10-21  4:16             ` Dan Williams
2017-10-21  4:16               ` Dan Williams
2017-10-21  4:16               ` Dan Williams
2017-10-21  8:15               ` Christoph Hellwig
2017-10-21  8:15                 ` Christoph Hellwig
2017-10-21  8:15                 ` Christoph Hellwig
2017-10-23  5:18         ` Martin Schwidefsky
2017-10-23  5:18           ` Martin Schwidefsky
2017-10-23  5:18           ` Martin Schwidefsky
2017-10-23  8:55           ` Dan Williams
2017-10-23  8:55             ` Dan Williams
2017-10-23 10:44             ` Martin Schwidefsky
2017-10-23 10:44               ` Martin Schwidefsky
2017-10-23 10:44               ` Martin Schwidefsky
2017-10-23 11:20               ` Dan Williams
2017-10-23 11:20                 ` Dan Williams
2017-10-23 11:20                 ` Dan Williams
2017-10-20  2:39 ` [PATCH v3 03/13] dax: stop using VM_MIXEDMAP for dax Dan Williams
2017-10-20  2:39   ` Dan Williams
2017-10-20  2:39   ` Dan Williams
2017-10-20  2:39 ` [PATCH v3 04/13] dax: stop using VM_HUGEPAGE " Dan Williams
2017-10-20  2:39   ` Dan Williams
2017-10-20  2:39   ` Dan Williams
2017-10-20  2:39 ` [PATCH v3 05/13] dax: stop requiring a live device for dax_flush() Dan Williams
2017-10-20  2:39   ` Dan Williams
2017-10-20  2:39   ` Dan Williams
2017-10-20  2:39 ` [PATCH v3 06/13] dax: store pfns in the radix Dan Williams
2017-10-20  2:39   ` Dan Williams
2017-10-20  2:39   ` Dan Williams
2017-10-20  2:39 ` [PATCH v3 07/13] dax: warn if dma collides with truncate Dan Williams
2017-10-20  2:39   ` Dan Williams
2017-10-20  2:39   ` Dan Williams
2017-10-20  2:39 ` [PATCH v3 08/13] tools/testing/nvdimm: add 'bio_delay' mechanism Dan Williams
2017-10-20  2:39   ` Dan Williams
2017-10-20  2:39   ` Dan Williams
2017-10-20  2:39 ` [PATCH v3 09/13] IB/core: disable memory registration of fileystem-dax vmas Dan Williams
2017-10-20  2:39   ` Dan Williams
2017-10-20  2:39   ` Dan Williams
2017-10-20  2:39 ` [PATCH v3 10/13] mm: disable get_user_pages_fast() for dax Dan Williams
2017-10-20  2:39   ` Dan Williams
2017-10-20  2:39   ` Dan Williams
2017-10-20  2:39 ` [PATCH v3 11/13] fs: use smp_load_acquire in break_{layout,lease} Dan Williams
2017-10-20  2:39   ` Dan Williams
2017-10-20  2:39   ` Dan Williams
2017-10-20 12:39   ` Jeffrey Layton
2017-10-20 12:39     ` Jeffrey Layton
2017-10-20 12:39     ` Jeffrey Layton
2017-10-20 12:39     ` Jeffrey Layton
2017-10-20  2:40 ` [PATCH v3 12/13] dax: handle truncate of dma-busy pages Dan Williams
2017-10-20  2:40   ` Dan Williams
2017-10-20  2:40   ` Dan Williams
2017-10-20 13:05   ` Jeff Layton
2017-10-20 13:05     ` Jeff Layton
2017-10-20 13:05     ` Jeff Layton
2017-10-20 15:42     ` Dan Williams [this message]
2017-10-20 15:42       ` Dan Williams
2017-10-20 15:42       ` Dan Williams
2017-10-20 16:32       ` Christoph Hellwig
2017-10-20 16:32         ` Christoph Hellwig
2017-10-20 16:32         ` Christoph Hellwig
2017-10-20 17:27         ` Dan Williams
2017-10-20 17:27           ` Dan Williams
2017-10-20 17:27           ` Dan Williams
2017-10-20 20:36           ` Brian Foster
2017-10-20 20:36             ` Brian Foster
2017-10-20 20:36             ` Brian Foster
2017-10-21  8:11           ` Christoph Hellwig
2017-10-21  8:11             ` Christoph Hellwig
2017-10-20  2:40 ` [PATCH v3 13/13] xfs: wire up FL_ALLOCATED support Dan Williams
2017-10-20  2:40   ` Dan Williams
2017-10-20  2:40   ` Dan Williams
2017-10-20  7:47 ` [PATCH v3 00/13] dax: fix dma vs truncate and remove 'page-less' support Christoph Hellwig
2017-10-20  7:47   ` Christoph Hellwig
2017-10-20  7:47   ` Christoph Hellwig
2017-10-20  7:47   ` Christoph Hellwig
2017-10-20  9:31   ` Christoph Hellwig
2017-10-20  9:31     ` Christoph Hellwig
2017-10-20  9:31     ` Christoph Hellwig
2017-10-26 10:58     ` Jan Kara
2017-10-26 10:58       ` Jan Kara
2017-10-26 10:58       ` Jan Kara
2017-10-26 10:58       ` Jan Kara
2017-10-26 23:51       ` Williams, Dan J
2017-10-26 23:51         ` Williams, Dan J
2017-10-26 23:51         ` Williams, Dan J
2017-10-26 23:51         ` Williams, Dan J
2017-10-27  6:48         ` Dave Chinner
2017-10-27  6:48           ` Dave Chinner
2017-10-27  6:48           ` Dave Chinner
2017-10-27  6:48           ` Dave Chinner
2017-10-27  6:48           ` Dave Chinner
2017-10-27 11:42           ` Dan Williams
2017-10-27 11:42             ` Dan Williams
2017-10-27 11:42             ` Dan Williams
2017-10-29 21:52             ` Dave Chinner
2017-10-29 21:52               ` Dave Chinner
2017-10-29 21:52               ` Dave Chinner
2017-10-27  6:45       ` Christoph Hellwig
2017-10-27  6:45         ` Christoph Hellwig
2017-10-27  6:45         ` Christoph Hellwig
2017-10-29 23:46       ` Dan Williams
2017-10-29 23:46         ` Dan Williams
2017-10-29 23:46         ` Dan Williams
2017-10-30  2:00         ` Dave Chinner
2017-10-30  2:00           ` Dave Chinner
2017-10-30  2:00           ` Dave Chinner
2017-10-30  2:00           ` Dave Chinner
2017-10-30  8:38           ` Jan Kara
2017-10-30  8:38             ` Jan Kara
2017-10-30  8:38             ` Jan Kara
2017-10-30 11:20             ` Dave Chinner
2017-10-30 11:20               ` Dave Chinner
2017-10-30 11:20               ` Dave Chinner
2017-10-30 11:20               ` Dave Chinner
2017-10-30 17:51               ` Dan Williams
2017-10-30 17:51                 ` Dan Williams
2017-10-30 17:51                 ` Dan Williams
2017-10-30 17:51                 ` Dan Williams

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAPcyv4hXCJYTkUKs6NiOp=8kgExu+bgZnVn_v+Os7fVUc2NxFg@mail.gmail.com' \
    --to=dan.j.williams@intel.com \
    --cc=akpm@linux-foundation.org \
    --cc=bfields@fieldses.org \
    --cc=darrick.wong@oracle.com \
    --cc=dave.hansen@linux.intel.com \
    --cc=david@fromorbit.com \
    --cc=hch@lst.de \
    --cc=jack@suse.cz \
    --cc=jlayton@kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-nvdimm@lists.01.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=mawilcox@microsoft.com \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.