From: Dan Williams <dan.j.williams@intel.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>, linux-nvdimm@lists.01.org,
	"Darrick J. Wong" <darrick.wong@oracle.com>,
	Christoph Hellwig <hch@lst.de>,
	linux-xfs <linux-xfs@vger.kernel.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [PATCH v4 18/18] xfs, dax: wire up dax_flush_dma support via a new xfs_sync_dma helper
Date: Tue, 2 Jan 2018 18:21:13 -0800	[thread overview]
Message-ID: <CAPcyv4gXxoLwZUyHYGUN2i71fjT_r=Mf+USYDLKwPGEYco+1PQ@mail.gmail.com> (raw)
In-Reply-To: <20180102230027.GV5858@dastard>

On Tue, Jan 2, 2018 at 3:00 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Sat, Dec 23, 2017 at 04:57:37PM -0800, Dan Williams wrote:
>> xfs_break_layouts() scans for active pNFS layouts, drops locks and
>> rescans for those layouts to be broken. xfs_sync_dma performs
>> xfs_break_layouts and also scans for active dax-dma pages, drops locks
>> and rescans for those pages to go idle.
>>
>> dax_flush_dma handles synchronizing against new page-busy events
>> (get_user_pages). It invalidates all mappings to trigger the
>> get_user_pages slow path, which will eventually block on the
>> XFS_MMAPLOCK. If it finds a dma-busy page it waits for a page-idle
>> callback that will fire when the page reference count reaches 1 (recall
>> ZONE_DEVICE pages are idle at count 1). While it is waiting, it drops
>> locks so we do not deadlock the process that might be trying to elevate
>> the page count of more pages before arranging for any of them to go
>> idle, as is typically the case for iov_iter_get_pages.
>>
>> dax_flush_dma relies on the fs-provided wait_atomic_t_action_f
>> (xfs_wait_dax_page) to handle evaluating the page reference count and
>> dropping locks when waiting.
>
> I don't see a problem with supporting this functionality, but I
> see lots of problems with the code being presented.
> First of all,
> I think the "sync dma" abstraction here is all wrong.
>
> In the case of the filesystem, we don't care about whether DMA has
> completed or not, and we *shouldn't have to care* about deep, dark
> secrets of other subsystems.
>
> If I read the current code, I see this in all the "truncate" paths:
>
>	start op
>	break layout leases
>	change layout
>
> and in the IO path:
>
>	start IO
>	break layout leases
>	map IO
>	issue IO
>
> What this change does is make the truncate paths read:
>
>	start op
>	sync DMA
>	change layout
>
> but the IO path is unchanged. (This is not explained in comments or
> commit messages.)
>
> And I look at that "sync DMA" step and wonder why the hell we need
> to "sync DMA", because DMA has nothing to do with high level
> filesystem code. It doesn't tell me anything obvious about why we
> need to do this, nor does it tell me what we're actually
> synchronising against.
>
> What we care about in the filesystem code is whether there are
> existing external references to the file layout. If there's an
> external reference, then it has to be broken before we can proceed
> and modify the file layout. We don't care what owns that reference,
> just that it has to be broken before we continue.
>
> AFAIC, these DMA references are just another external layout
> reference that needs to be broken. IOWs, this "sync DMA" complexity
> needs to go inside xfs_break_layouts() as it is part of breaking the
> external reference to the file layout - it does not replace the
> layout breaking abstraction, and so the implementation needs to
> reflect that.

These two sentences from the xfs_break_layouts() comment scared me down
this path of distinguishing dax-dma waiting from pNFS layout lease
break waiting:

---
"Additionally we call it during the write operation, where we aren't
concerned about exposing unallocated blocks but just want to provide
basic synchronization between a local writer and pNFS clients.
mmap writes would also benefit from this sort of synchronization, but
due to the tricky locking rules in the page fault path we don't
bother."
---

I was not sure about holding XFS_MMAPLOCK_EXCL over xfs_break_layouts()
where it has historically not been held, and I was worried about the
potential deadlock of requiring all pages to be unmapped and idle
during a write. I.e. would we immediately deadlock if userspace
performed direct-I/O to a file with a source buffer that was mapped
from that same file?

In general, though, I agree that xfs_break_layouts() should comprehend
both cases. I'll investigate whether the deadlock is real, and perhaps
add a flag to xfs_break_layouts() to distinguish the IO path from the
truncate paths, to at least make that detail internal to the layout
breaking mechanism.

>> + * Synchronize [R]DMA before changing the file's block map. For pNFS,
>> + * recall all layouts. For DAX, wait for transient DMA to complete. All
>> + * other DMA is handled by pinning page cache pages.
>> + *
>> + * iolock must be held XFS_IOLOCK_SHARED or XFS_IOLOCK_EXCL on entry
>> + * and will be XFS_IOLOCK_EXCL and XFS_MMAPLOCK_EXCL on exit.
>> + */
>> +int xfs_sync_dma(
>> +	struct inode		*inode,
>> +	uint			*iolock)
>> +{
>> +	struct xfs_inode	*ip = XFS_I(inode);
>> +	int			error;
>> +
>> +	while (true) {
>> +		error = xfs_break_layouts(inode, iolock);
>> +		if (error)
>> +			break;
>> +
>> +		xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
>> +		*iolock |= XFS_MMAPLOCK_EXCL;
>> +
>> +		error = dax_flush_dma(inode->i_mapping, xfs_wait_dax_page);
>> +		if (error <= 0)
>> +			break;
>> +		xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
>> +		*iolock &= ~XFS_MMAPLOCK_EXCL;
>> +	}
>
> At this level the loop seems sub-optimal. If we don't drop the
> IOLOCK, then we have no reason to call xfs_break_layouts() a second
> time. Hence in isolation this loop doesn't make sense.
> Yes, I
> realise that dax_flush_dma() can result in all locks on the inode
> being dropped, but that's hidden in another function whose calling
> scope is not at all obvious from this code.
>
> Also, xfs_wait_dax_page() assumes we have IOLOCK_EXCL held when it
> is called. Nothing enforces the requirement that xfs_sync_dma() is
> passed XFS_IOLOCK_EXCL, and so such assumptions cannot be made.
> Even if it was, I really dislike the idea of a function that
> /assumes/ lock state - that's a landmine that will bite us in the
> rear end at some unexpected point in the future. If you need to
> cycle held locks on an inode, you need to pass the held lock state
> to the function.

I agree, and I thought about this, but at the time the callback is made
the only way we could pass the lock context to xfs_wait_dax_page()
would be to temporarily store it in the 'struct page', which seemed
ugly at first glance. However, we do have space: we could alias it with
the other unused 8 bytes of page->lru. At least that cleans up the
filesystem interface to not need to make implicit locking
assumptions... I'll go that route.

> Another gripe I have is that calling xfs_sync_dma() implies the mmap
> lock is held exclusively on return. Hiding this locking inside
> xfs_sync_dma removes the code documentation that large tracts of
> code are protected against page faults by the XFS_MMAPLOCK_EXCL lock
> call. Instead of knowing at a glance that the truncate path
> (xfs_vn_setattr) is protected against page faults, I have to
> remember that xfs_sync_dma() now does this.
>
> This goes back to my initial comments of "what the hell does "sync
> dma" mean in a filesystem context?" - it certainly doesn't make me
> think about inode locking. I don't like hidden/implicit locking
> like this, because it breaks both the least-surprise and the
> self-documenting code principles....

Thanks for this Dave, it's clear.
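For concreteness, the flag-based xfs_break_layouts() rework discussed above might look roughly like the following. This is a hypothetical sketch with stubbed types so it stands alone outside the kernel; break_pnfs_leases(), wait_dax_dma_idle(), and the enum names are illustrative stand-ins, not the actual kernel interface:

```c
#include <assert.h>

typedef unsigned int uint;
struct inode { int dummy; };	/* toy stand-in for the kernel's struct inode */

/* Hypothetical: one argument telling xfs_break_layouts() which external
 * layout references must be gone before the caller may proceed. */
enum layout_break_reason {
	BREAK_WRITE,	/* IO path: only recall pNFS layouts */
	BREAK_UNMAP,	/* truncate paths: also wait for dax-dma idle */
};

/* Stubbed stand-ins for the real kernel routines. */
static int break_pnfs_leases(struct inode *inode, uint *iolock)
{
	(void)inode; (void)iolock;
	return 0;	/* no pNFS layouts outstanding */
}

static int wait_dax_dma_idle(struct inode *inode, uint *iolock)
{
	(void)inode; (void)iolock;
	return 0;	/* 0: already idle; >0 would mean "locks cycled, retry" */
}

int xfs_break_layouts(struct inode *inode, uint *iolock,
		      enum layout_break_reason reason)
{
	int error;

	do {
		error = break_pnfs_leases(inode, iolock);
		if (error || reason == BREAK_WRITE)
			return error;
		/* truncate path: also drive dax-dma references to idle;
		 * a positive return means the locks were cycled, so the
		 * lease scan has to be redone */
		error = wait_dax_dma_idle(inode, iolock);
	} while (error > 0);

	return error;
}
```

The point of the single entry point is that callers keep saying "break the layout" and never learn which subsystem held the reference.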
I'll spin a new version with some of the reworks proposed above, but
holler if those don't address your core concern.

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm
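Stepping back, the retry protocol under discussion - scan for busy pages under the locks, drop the locks and wait for the page-idle callback, rescan - can be modeled outside the kernel with a toy refcount. Everything below is illustrative; none of these names are kernel interfaces, and the "wait" simply simulates the DMA completing while the locks are dropped:

```c
#include <assert.h>

/* Toy, single-threaded model of the drop-locks-and-retry protocol:
 * "pages" are idle at refcount 1 (like ZONE_DEVICE pages), and waiting
 * on a busy page stands in for the page-idle callback firing after the
 * locks have been dropped. */

#define NPAGES 4

static int refcount[NPAGES] = { 1, 3, 1, 2 };	/* two pages are dma-busy */

static int find_busy_page(void)
{
	for (int i = 0; i < NPAGES; i++)
		if (refcount[i] > 1)
			return i;
	return -1;	/* everything idle */
}

static void wait_for_page_idle(int page)
{
	/* pretend the DMA completes and the extra references are put
	 * while we sleep with the locks dropped */
	refcount[page] = 1;
}

/* Returns how many times the locks had to be cycled before every page
 * went idle and the block map could be changed safely. */
static int drive_pages_idle(void)
{
	int lock_cycles = 0;

	for (;;) {
		/* the real code re-takes IOLOCK/MMAPLOCK here */
		int busy = find_busy_page();

		if (busy < 0)
			return lock_cycles;

		/* drop locks, wait for the idle callback, rescan */
		lock_cycles++;
		wait_for_page_idle(busy);
	}
}
```

The rescan after every wait is what makes the protocol safe: new references may have been taken while the locks were dropped, so nothing found in an earlier pass can be trusted.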