From: Dave Chinner <david@fromorbit.com>
To: Christoph Hellwig <hch@lst.de>
Cc: ira.weiny@intel.com, linux-kernel@vger.kernel.org,
Alexander Viro <viro@zeniv.linux.org.uk>,
"Darrick J. Wong" <darrick.wong@oracle.com>,
Dan Williams <dan.j.williams@intel.com>,
"Theodore Y. Ts'o" <tytso@mit.edu>, Jan Kara <jack@suse.cz>,
linux-ext4@vger.kernel.org, linux-xfs@vger.kernel.org,
linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH V4 07/13] fs: Add locking for a dynamic address space operations state
Date: Tue, 25 Feb 2020 11:09:37 +1100 [thread overview]
Message-ID: <20200225000937.GA10776@dread.disaster.area> (raw)
In-Reply-To: <20200224175603.GE7771@lst.de>
On Mon, Feb 24, 2020 at 06:56:03PM +0100, Christoph Hellwig wrote:
> On Sat, Feb 22, 2020 at 09:44:19AM +1100, Dave Chinner wrote:
> > > I am very much against this. There is a reason why we don't support
> > > changes of ops vectors at runtime anywhere else, because it is
> > > horribly complicated and impossible to get right. IFF we ever want
> > > to change the DAX vs non-DAX mode (which I'm still not sold on) the
> > > right way is to just add a few simple conditionals and merge the
> > > aops, which is much easier to reason about, and less costly in overall
> > > overhead.
> >
> > *cough*
> >
> > That's exactly what the original code did. And it was broken
> > because page faults call multiple aops that are dependent on the
> > result of the previous aop calls setting up the state correctly for
> > the latter calls. And when S_DAX changes between those calls, shit
> > breaks.
>
> No, the original code was broken because it didn't serialize between
> DAX and buffer access.
>
> Take a step back and look where the problems are, and they are not
> mostly with the aops. In fact the only aop useful for DAX is
> is ->writepages, and how it uses ->writepages is actually a bit of
> an abuse of that interface.
The races are all through the fops, too, which is one of the reasons
Darrick mentioned we should probably move this up to file ops
level...
> So what we really need it just a way to prevent switching the flag
> when a file is mapped,
That's not sufficient.
We also have to prevent the file from being mapped *while we are
switching*. Nothing in the mmap() path actually serialises against
filesystem operations, and the initial behavioural checks in the
page fault path are similarly unserialised against changing the
S_DAX flag.
e.g. there's a race against ->mmap() with switching the the S_DAX
flag. In xfs_file_mmap():
file_accessed(file);
vma->vm_ops = &xfs_file_vm_ops;
if (IS_DAX(inode))
vma->vm_flags |= VM_HUGEPAGE;
So if this runs while a switch from true to false is occurring,
we've got a non-dax VMA with huge pages enabled, and the non-dax
page fault path doesn't support this.
If we are really lucky, we'll have IS_DAX() set just long
enough to get through this check in xfs_filemap_huge_fault():
if (!IS_DAX(file_inode(vmf->vma->vm_file)))
return VM_FAULT_FALLBACK;
and then we'll get into __xfs_filemap_fault() and block on the
MMAPLOCK. When we actually get that lock, S_DAX will now be false
and we have a huge page fault running through a path (page cache IO)
that doesn't support huge pages...
This is the sort of landmine switching S_DAX without serialisation
causes. The methods themselves look sane, but it's cross-method
dependencies for a stable S_DAX flag that is the real problem.
Yes, we can probably fix this by adding XFS_MMAPLOCK_SHARED here,
but means every filesystem has to solve the same race conditions
itself. That's hard to get right and easy to get wrong. If the
VFS/mm subsystem provides the infrastructure needed to use this
functionality safely, then that is hard for filesystem developers to
get wrong.....
> and in the normal read/write path ensure the
> flag can't be switch while I/O is going on, which could either be
> done by ensuring it is only switched under i_rwsem or equivalent
> protection, or by setting the DAX flag once in the iocb similar to
> IOCB_DIRECT.
The iocb path is not the problem - that's entirely serialised
against S_DAX changes by the i_rwsem. The problem is that we have no
equivalent filesystem level serialisation for the entire mmap/page
fault path, and it checks S_DAX all over the place.
It's basically the same limitation that we have with mmap vs direct
IO - we can't lock out mmap when we do direct IO, so we cannot
guarantee coherency between the two. Similar here - we cannot
lockout mmap in any sane way, so we cannot guarantee coherency
between mmap and changing the S_DAX flag.
That's the underlying problem we need to solve here.
/me wonders if the best thing to do is to add a ->fault callout to
tell the filesystem to lock/unlock the inode right up at the top of
the page fault path, outside even the mmap_sem. That means all the
methods that the page fault calls are protected against S_DAX
changes, and it gives us a low cost method of serialising page
faults against DIO (e.g. via inode_dio_wait())....
> And they easiest way to get all this done is as a first step to
> just allowing switching the flag when no blocks are allocated at
> all, similar to how the rt flag works.
False equivalence - it is not similar because the RT flag changes
and their associated state checks are *already fully serialised* by
the XFS_ILOCK_EXCL. S_DAX accesses have no such serialisation, and
that's the problem we need to solve...
In reality, I don't really care how we solve this problem, but we've
been down the path you are suggesting at least twice before and each
time we've ended up in a "that doesn't work" corner and had to
persue other solutions. I don't like going around in circles.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
next prev parent reply other threads:[~2020-02-25 0:09 UTC|newest]
Thread overview: 54+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-02-21 0:41 [PATCH V4 00/13] Enable per-file/per-directory DAX operations V4 ira.weiny
2020-02-21 0:41 ` [PATCH V4 01/13] fs/xfs: Remove unnecessary initialization of i_rwsem ira.weiny
2020-02-21 1:26 ` Dave Chinner
2020-02-27 17:52 ` Ira Weiny
2020-02-21 0:41 ` [PATCH V4 02/13] fs/xfs: Clarify lockdep dependency for xfs_isilocked() ira.weiny
2020-02-21 1:34 ` Dave Chinner
2020-02-21 23:00 ` Ira Weiny
2020-02-21 0:41 ` [PATCH V4 03/13] fs: Remove unneeded IS_DAX() check ira.weiny
2020-02-21 1:42 ` Dave Chinner
2020-02-21 23:04 ` Ira Weiny
2020-02-21 17:42 ` Christoph Hellwig
2020-02-21 0:41 ` [PATCH V4 04/13] fs/stat: Define DAX statx attribute ira.weiny
2020-02-21 0:41 ` [PATCH V4 05/13] fs/xfs: Isolate the physical DAX flag from enabled ira.weiny
2020-02-21 0:41 ` [PATCH V4 06/13] fs/xfs: Create function xfs_inode_enable_dax() ira.weiny
2020-02-22 0:28 ` Darrick J. Wong
2020-02-23 15:07 ` Ira Weiny
2020-02-21 0:41 ` [PATCH V4 07/13] fs: Add locking for a dynamic address space operations state ira.weiny
2020-02-21 17:44 ` Christoph Hellwig
2020-02-21 22:44 ` Dave Chinner
2020-02-21 23:26 ` Dan Williams
2020-02-24 17:56 ` Christoph Hellwig
2020-02-25 0:09 ` Dave Chinner [this message]
2020-02-25 17:36 ` Christoph Hellwig
2020-02-25 19:37 ` Jeff Moyer
2020-02-26 9:28 ` Jonathan Halliday
2020-02-26 11:31 ` Jan Kara
2020-02-26 11:56 ` Jonathan Halliday
2020-02-26 16:10 ` Ira Weiny
2020-02-26 16:46 ` Dan Williams
2020-02-26 17:20 ` Jan Kara
2020-02-26 17:54 ` Dan Williams
2020-02-25 21:03 ` Ira Weiny
2020-02-26 11:17 ` Jan Kara
2020-02-26 15:57 ` Ira Weiny
2020-02-22 0:33 ` Darrick J. Wong
2020-02-23 15:03 ` Ira Weiny
2020-02-21 0:41 ` [PATCH V4 08/13] fs: Prevent DAX state change if file is mmap'ed ira.weiny
2020-02-21 0:41 ` [PATCH V4 09/13] fs/xfs: Add write aops lock to xfs layer ira.weiny
2020-02-22 0:31 ` Darrick J. Wong
2020-02-23 15:04 ` Ira Weiny
2020-02-24 0:34 ` Dave Chinner
2020-02-24 19:57 ` Ira Weiny
2020-02-24 22:32 ` Dave Chinner
2020-02-25 21:12 ` Ira Weiny
2020-02-25 22:59 ` Dave Chinner
2020-02-26 18:02 ` Ira Weiny
2020-02-21 0:41 ` [PATCH V4 10/13] fs/xfs: Clean up locking in dax invalidate ira.weiny
2020-02-21 17:45 ` Christoph Hellwig
2020-02-21 18:06 ` Ira Weiny
2020-02-21 0:41 ` [PATCH V4 11/13] fs/xfs: Allow toggle of effective DAX flag ira.weiny
2020-02-21 0:41 ` [PATCH V4 12/13] fs/xfs: Remove xfs_diflags_to_linux() ira.weiny
2020-02-21 0:41 ` [PATCH V4 13/13] Documentation/dax: Update Usage section ira.weiny
2020-02-26 22:48 ` [PATCH V4 00/13] Enable per-file/per-directory DAX operations V4 Jeff Moyer
2020-02-27 2:43 ` Ira Weiny
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20200225000937.GA10776@dread.disaster.area \
--to=david@fromorbit.com \
--cc=dan.j.williams@intel.com \
--cc=darrick.wong@oracle.com \
--cc=hch@lst.de \
--cc=ira.weiny@intel.com \
--cc=jack@suse.cz \
--cc=linux-ext4@vger.kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-xfs@vger.kernel.org \
--cc=tytso@mit.edu \
--cc=viro@zeniv.linux.org.uk \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).