On Mon, Mar 02, 2020 at 12:26:44PM +1100, Dave Chinner wrote: > > /* > > diff --git a/mm/filemap.c b/mm/filemap.c > > index 1784478270e1..3a7863ba51b9 100644 > > --- a/mm/filemap.c > > +++ b/mm/filemap.c > > @@ -2293,6 +2293,8 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter) > > * and return. Otherwise fallthrough to buffered io for > > * the rest of the read. Buffered reads will not work for > > * DAX files, so don't bother trying. > > + * > > + * IS_DAX is protected under ->read_iter lock > > */ > > if (retval < 0 || !count || iocb->ki_pos >= size || > > IS_DAX(inode)) > > This check is in the DIO path, be we can't do DIO on DAX enabled > files to begin with, so we can only get here if S_DAX is not set on > the file. > > Further, if IOCB_DIRECT is set, neither ext4 nor XFS call > generic_file_read_iter(); they run the iomap_dio_rw() path directly > instead. Only ext2 calls generic_file_read_iter() to do direct IO, > so it's the only filesystem that needs this IS_DAX() check in it. > > I think we should fix ext2 to be implemented like ext4 and XFS - > they implement the buffered IO fallback, should it be required, > themselves and never use generic_file_read_iter() for direct IO. > > That would allow us to add this to generic_file_read_iter(): > > if (WARN_ON_ONCE(IS_DAX(inode)) > return -EINVAL; > > to indicate that this should never be called directly on a DAX > capable filesystem. This places all the responsibility for managing > DAX behaviour on the filesystem, which then allows us to reason more > solidly about how the filesystem IO paths use and check the S_DAX > flag. > > i.e. changing the on-disk flag already locks out the filesystem IO > path via the i_rwsem(), and all the filesystem IO paths (buffered, > direct IO and dax) are serialised by this flag. Hence we can check > once in the filesystem path once we have the i_rwsem held and > know that S_DAX will not change until we release it. > > ..... and now I realise what I was sitting on the fence about.... > > I don't like the aops locking in call_read/write_iter() because it > is actually redundant: the filesystem should be doing the necessary > locking in the IO path via the i_rwsem to prevent S_DAX from > changing while it is doing the IO. > > IOWs, we need to restructure the locking inside the filesystem > read_iter and write_iter methods so that the i_rwsem protects the > S_DAX flag from changing dynamically. They all do: > > if (dax) > do_dax_io() > if (direct) > do_direct_io() > do_buffered_io() > > And then we take the i_rwsem inside each of those functions and do > the IO. What we actually need to do is something like this: > > inode_lock_shared() > if (dax) > do_dax_io() > if (direct) > do_direct_io() > do_buffered_io() > inode_unlock_shared() > > And remove the inode locking from inside the individual IO methods > themselves. It's a bit more complex than this because buffered > writes require exclusive locking, but this completely removes the > need for holding an aops lock over these methods. > > I've attached a couple of untested patches (I've compiled them, so > they must be good!) to demonstrate what I mean for the XFS IO path. > The read side removes a heap of duplicate code, but the write side > is .... unfortunately complex. Have to think about that more. And now with patches... Cheers, Dave. -- Dave Chinner david@fromorbit.com