All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jan Kara <jack@suse.cz>
To: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>, Christoph Hellwig <hch@infradead.org>,
	Theodore Ts'o <tytso@mit.edu>,
	linux-ext4@vger.kernel.org, Lukas Czerner <lczerner@redhat.com>,
	linux-xfs@vger.kernel.org
Subject: Re: [PATCH] ext4: introduce per-inode DAX flag
Date: Wed, 30 Aug 2017 12:00:59 +0200	[thread overview]
Message-ID: <20170830100059.GC28354@quack2.suse.cz> (raw)
In-Reply-To: <20170829225717.GG10621@dastard>

On Wed 30-08-17 08:57:17, Dave Chinner wrote:
> On Tue, Aug 29, 2017 at 05:49:22PM +0200, Jan Kara wrote:
> > On Mon 28-08-17 20:10:14, Dave Chinner wrote:
> > > On Mon, Aug 28, 2017 at 12:38:54AM -0700, Christoph Hellwig wrote:
> > > > On Sat, Aug 26, 2017 at 09:33:58AM +1000, Dave Chinner wrote:
> > > > > > Nah, -o dax works very well.  It's just the flag instead of the -o dax
> > > > > > option or rather switching it on a mapped file will probably be very dangerous.
> > > > > 
> > > > > In what way is it dangerous, Christoph?
> > > > 
> > > > When I run the following script as a normal user:
> > > > 
> > > > FSXDIR=~/xfstests/ltp/
> > > > FILE=/mnt/foo
> > > > 
> > > > ${FSXDIR}/fsx $FILE &
> > > > 
> > > > while true; do
> > > >     xfs_io -c 'chattr +x' $FILE
> > > >     xfs_io -c 'chattr -x' $FILE
> > > > done
> > > >
> > > > I get this nice little crash:
> > > 
> > > Can you please package that up into an xfstest?
> > > 
> > > > root@testvm:~# sh test.sh
> > > > skipping zero size read
> > > > skipping insert range behind EOF
> > > > truncating to largest ever: 0x3a290
> > > > zero_range to largest ever: 0x3a8d1
> > > > zero_range to largest ever: 0x3fe3e
> > > > zero_range to largest ever: 0x40000
> > > > [  344.898390] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
> > > > [  344.899306] IP: iomap_page_mkwrite+0x17/0xf0
> > > > [  344.899795] PGD 7db37067
> > > > [  344.899796] P4D 7db37067
> > > > [  344.900099] PUD 78c61067
> > > > [  344.900389] PMD 0
> > > > [  344.900665]
> > > > [  344.901075] Oops: 0000 [#1] SMP
> > > > [  344.901536] Modules linked in:
> > > > [  344.901716] CPU: 3 PID: 6052 Comm: fsx Not tainted 4.12.0+ #2199
> > > > [  344.901716] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
> > > > [  344.901716] task: ffff880079a0da00 task.stack: ffffc900068a4000
> > > > [  344.901716] RIP: 0010:iomap_page_mkwrite+0x17/0xf0
> > > > [  344.901716] RSP: 0000:ffffc900068a7d38 EFLAGS: 00010246
> > > > [  344.901716] RAX: ffff8800798dd0d0 RBX: 0000000000000200 RCX: 0000000000000001
> > > > [  344.901716] RDX: 0000000070eb898e RSI: ffffffff82109010 RDI: ffffc900068a7df0
> > > > [  344.901716] RBP: ffffc900068a7d60 R08: ffffffff82ff9fa8 R09: 0000000000000000
> > > > [  344.901716] R10: ffffc900068a7cb0 R11: ffffffff8159b5cc R12: ffffffff82109010
> > > > [  344.901716] R13: 0000000000000000 R14: ffffc900068a7df0 R15: ffff88007da89580
> > >                       ^^^^^^^^^^^^^^^^
> > > 
> > > vmf->page is null.
> > > 
> > > Which means IS_DAX changed half way through a fault, despite us
> > > holding the MMAPLOCK and protecting all the filesystem side of the
> > > fault code from races.
> > > 
> > > Seems to me that even allowing filesystems to switch between
> > > different mapping tree behaviours based on an inode flag is a
> > > fundamentally broken model. The fault action that needs to taken by
> > > the filesystem has already been predetermined by the fault
> > > processing that has already occurred and placed into the contents of
> > > the vmf we've been passed.
> > 
> > I don't think the problem is actually within MM in this particular case.
> > The problem seems to be that xfs_filemap_fault() checks IS_DAX without
> > holding MMAPLOCK and so it can change after that test and before the test
> > in xfs_filemap_page_mkwrite().
> 
> First thing I did was fix that, and it still paniced with vmf->page
> = null in iomap_page_mkwrite, then I realised it was irrelevant

OK, drat.

> because if it can change between xfs_filemap_fault() and
> xfs_filemap_page_mkwrite() it can change anytime in the fault path
> we are not holding the MMAPLOCK....

Agreed but...

> > > Hence I think that if we need to process the fault as a DAX fault
> > > then the vmf needs to tell us that, not require us to look up an
> > > inode flag to determine what to do. ANd if the inode flag changes,
> > > then that needs to be propagated through the mapping and VMAs in a
> > > sane fashion, not just run an invalidation from the filesystem. I
> > > don't know enough about the VM code to say anything useful about how
> > > this needs to be set up, but it's clear that mapping invalidation
> > > and behaviour swaps can't be completely serialised against page
> > > faults from the filesystem side.
> > 
> > But there is no difference in vmf setup from generic MM side. In particular
> > vmf->page is set by the ->fault handler and then it is passed to
> > ->page_mkwrite handler.
> 
> Yes there is. DAX can call the .fault() handler directly from
> handle_pte_fault() for write faults on DAX when there is no pte
> installed. In which case vmf->page is null because we don't go
> through do_wp_page() to install the page we faulted on in the vmf
> before calling .page_mkwrite().
> 
> IOWs, if we fault between the invalidation in the IS_DAX changing
> code and dropping the MMAPLOCK once the transaction inode change is
> complete, then that fault will find an empty pte. It will then call
> ->filemap_fault with vmf->page = NULL, and block on the MMAPLOCK
> until the filesystem transaction is complete. Then it checks
> IS_DAX, finds it's not set, so assumes that we have a page cache
> fault, not a DAX fault....

So if you get to xfs_filemap_fault(), lock MMAPLOCK, then check IS_DAX() -
not set - then you go to filemap_fault() which is perfectly fine with
vmf->page being NULL. What am I missing?

The oops can happen if you call iomap_page_mkwrite() with vmf->page set
to NULL. However I don't see how that is possible after all IS_DAX checks
are under MMAPLOCK - iomap_page_mkwrite() cannot be called through
xfs_filemap_fault() anymore once DAX checks are consistent. And if it gets
called through xfs_filemap_page_mkwrite(), it means that
xfs_filemap_fault() was called before and must have returned
VM_FAULT_LOCKED which means that it has set vmf->page to a correct page.
[sorry, the page lock and entry lock I was speaking about in my previous
email was indeed wrong]

I've tried to reproduce this on my machine but for some reason setting of
per-inode DAX flag does not work. I have (with today's checkout of
xfsprogs):

marvin5:~/:[0]# cat /proc/mounts
...
/dev/ram0 /mnt xfs rw,relatime,attr2,dax,inode64,noquota 0 0

marvin5:~/:[0]# ls /mnt/
fsxfile  fsxfile.fsxgood  fsxfile.fsxlog

marvin5:~/:[0]# ./xfsprogs-dev/io/xfs_io -c 'lsattr' /mnt/fsxfile
-p-------------- /mnt/fsxfile 

marvin5:~/:[0]# ./xfsprogs-dev/io/xfs_io -c 'chattr +x' /mnt/fsxfile

marvin5:~/:[0]# ./xfsprogs-dev/io/xfs_io -c 'lsattr' /mnt/fsxfile
-p-------------- /mnt/fsxfile 

No DAX flag set and no error... What am I doing wrong?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

  reply	other threads:[~2017-08-30 10:01 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-08-02 16:09 [PATCH] ext4: introduce per-inode DAX flag Lukas Czerner
2017-08-05  8:46 ` Christoph Hellwig
2017-08-07 12:12   ` Lukas Czerner
2017-08-08  9:00     ` Lukas Czerner
2017-08-11 10:01       ` Christoph Hellwig
2017-08-11 12:11         ` Lukas Czerner
2017-08-11 12:58           ` Christoph Hellwig
2017-08-11 13:41             ` Lukas Czerner
2017-08-24 18:20               ` Theodore Ts'o
2017-08-25  7:54                 ` Christoph Hellwig
2017-08-25 15:14                   ` Theodore Ts'o
2017-08-25 15:40                     ` Christoph Hellwig
2017-08-25 16:28                       ` Theodore Ts'o
2017-08-25 23:33                       ` Dave Chinner
2017-08-28  7:38                         ` Christoph Hellwig
2017-08-28 10:10                           ` Dave Chinner
2017-08-29 15:49                             ` Jan Kara
2017-08-29 22:57                               ` Dave Chinner
2017-08-30 10:00                                 ` Jan Kara [this message]
2017-08-30 12:34                                   ` Christoph Hellwig
2017-08-30 15:00                                     ` Theodore Ts'o
2017-08-30 15:30                                       ` Lukas Czerner
2017-08-30 15:29                                     ` Jan Kara
2017-08-30 16:05                                   ` Jan Kara

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170830100059.GC28354@quack2.suse.cz \
    --to=jack@suse.cz \
    --cc=david@fromorbit.com \
    --cc=hch@infradead.org \
    --cc=lczerner@redhat.com \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=tytso@mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.