linux-kernel.vger.kernel.org archive mirror
From: Ross Zwisler <ross.zwisler@linux.intel.com>
To: Jan Kara <jack@suse.cz>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-kernel@vger.kernel.org,
	"Darrick J. Wong" <darrick.wong@oracle.com>,
	"Theodore Ts'o" <tytso@mit.edu>,
	Andreas Dilger <adilger.kernel@dilger.ca>,
	Christoph Hellwig <hch@lst.de>,
	Dan Williams <dan.j.williams@intel.com>,
	Dave Chinner <david@fromorbit.com>,
	linux-ext4@vger.kernel.org, linux-nvdimm@lists.01.org,
	linux-xfs@vger.kernel.org, stable@vger.kernel.org
Subject: Re: [PATCH 6/9] ext4: safely transition S_DAX on journaling changes
Date: Wed, 6 Sep 2017 11:09:46 -0600
Message-ID: <20170906170946.GC17663@linux.intel.com>
In-Reply-To: <20170906094700.GC27916@quack2.suse.cz>

On Wed, Sep 06, 2017 at 11:47:00AM +0200, Jan Kara wrote:
> On Tue 05-09-17 16:35:38, Ross Zwisler wrote:
> > The IOCTL path which switches the journaling mode for an inode is currently
> > unsafe because it doesn't properly do a writeback and invalidation on the
> > inode.  In XFS, for example, safe transitions of S_DAX are handled by
> > xfs_ioctl_setattr_dax_invalidate() which locks out page faults and I/O,
> > does a writeback via filemap_write_and_wait() and an invalidation via
> > invalidate_inode_pages2().
> > 
> > Without this in place we can see the following kernel warning when we try
> > and insert a DAX exceptional entry but find that a dirty page cache page is
> > still in the mapping->radix_tree:
> > 
> >  WARNING: CPU: 4 PID: 1052 at mm/filemap.c:262 __delete_from_page_cache+0x375/0x550
> >  Modules linked in: dax_pmem nd_pmem device_dax nd_btt nfit libnvdimm
> >  CPU: 4 PID: 1052 Comm: small Not tainted 4.13.0-rc6-00055-gac26931 #3
> >  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.fc25 04/01/2014
> >  task: ffff88020ccd0000 task.stack: ffffc900021d4000
> >  RIP: 0010:__delete_from_page_cache+0x375/0x550
> >  RSP: 0000:ffffc900021d7b90 EFLAGS: 00010002
> >  RAX: 002fffc00001123d RBX: ffffffffffffffff RCX: ffff8801d9440d68
> >  RDX: 0000000000000000 RSI: ffffffff81fd5b84 RDI: ffffffff81f6f0e5
> >  RBP: ffffc900021d7be0 R08: 0000000000000000 R09: ffff8801f9938c70
> >  R10: 0000000000000021 R11: ffff8801f9938c91 R12: ffff8801d9440d70
> >  R13: ffffea0007fdda80 R14: 0000000000000001 R15: ffff8801d9440d68
> >  FS:  00007feacc041700(0000) GS:ffff880211800000(0000) knlGS:0000000000000000
> >  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >  CR2: 0000000010420000 CR3: 000000020cfd8000 CR4: 00000000000006e0
> >  Call Trace:
> >   dax_insert_mapping_entry+0x158/0x2c0
> >   dax_iomap_fault+0x1020/0x1bb0
> >   ext4_dax_huge_fault+0xc8/0x160
> >   ext4_dax_fault+0x10/0x20
> >   __do_fault+0x20/0x110
> >   __handle_mm_fault+0x97d/0x1120
> >   handle_mm_fault+0x188/0x2f0
> >   __do_page_fault+0x28f/0x590
> >   trace_do_page_fault+0x58/0x2c0
> >   do_async_page_fault+0x2c/0x90
> >   async_page_fault+0x28/0x30
> > 
> > I'm pretty sure we could make a test that shows userspace visible data
> > corruption as well in this scenario.
> > 
> > Make it safe to change the journaling mode and turn on or off S_DAX by
> > adding locking to properly lock out page faults (i_mmap_sem) and then doing
> > the writeback and invalidate.  I/O is already held off because all callers
> > of ext4_ioctl_setflags() hold the inode lock.
> 
> Yeah, this is a good point. It is just that this is not enough as I
> discovered in [1]. You also need to tear down & recreate VMAs when changing
> DAX flag which is a bit tricky. So for now I think returning EBUSY when
> file is mmaped and we'd like to flip DAX flag is the best solution. Hmm?
> 
> [1] https://www.spinics.net/lists/linux-xfs/msg09859.html

Yeah, thanks for the link. I totally missed this discussion (obviously).

Cool, I'll rework this for v2.
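
To make the EBUSY suggestion concrete, here is a minimal userspace sketch of
the check. It is not the v2 patch: struct model_inode, its fields and
model_set_dax() are all hypothetical names invented for illustration. In the
kernel the "is it mmaped" test would presumably be something along the lines
of mapping_mapped(inode->i_mapping), done while page faults are locked out so
a new mapping can't race with the check.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for an inode; not a kernel structure. */
struct model_inode {
	bool dax;		/* models the S_DAX flag */
	unsigned int nr_mmaps;	/* models "file is currently mmaped" */
};

/*
 * Flip the modelled DAX flag, refusing while any mapping exists.
 * Returns 0 on success, or -EBUSY if the file is mmaped and the flag
 * would actually change.
 */
int model_set_dax(struct model_inode *inode, bool want_dax)
{
	if (inode->dax == want_dax)
		return 0;	/* no transition needed */
	if (inode->nr_mmaps > 0)
		return -EBUSY;	/* would need VMA teardown; refuse instead */
	inode->dax = want_dax;
	return 0;
}
```

With one mapping outstanding the call fails with -EBUSY and leaves the flag
untouched; once the mapping count drops to zero the same call succeeds.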

> > The locking for this new code is complex because of the following:
> > 
> > 1) filemap_write_and_wait() eventually calls ext4_writepages(), which
> > acquires the sbi->s_journal_flag_rwsem.  This lock ranks above the
> > jbd2_handle which is eventually taken by ext4_journal_start().  This
> > essentially means that the writeback has to happen outside of the context
> > of an active journal handle (outside of ext4_journal_start() to
> > ext4_journal_stop().)
> > 
> > 2) To lock out page faults we take a write lock on the ei->i_mmap_sem, and
> > this lock again ranks above the jbd2_handle taken by ext4_journal_start().
> > So, as with the writeback code in 1) above we have to take ei->i_mmap_sem
> > outside of the context of an active journal handle.
> 
> Welcome to the joy of fs locking ;)

:)  Well, I feel like I learned a lot more about ext4 during this patch set!
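
For anyone following along, the ordering constraint in 1) and 2) can be
modelled in a few lines of userspace C. This is only a toy: struct model_ctx
and every model_*() function are made-up names, and returning -EDEADLK stands
in for what would really be a lock-ordering violation. Doing the writeback
while the modelled i_mmap_sem is held is also an assumption of the sketch.
The one rule it encodes is that i_mmap_sem and the writeback both have to
happen before the journal handle is started, never inside it.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the relevant per-task state; not kernel code. */
struct model_ctx {
	bool handle_active;	/* between journal start and journal stop */
	bool mmap_sem_held;	/* models ei->i_mmap_sem held for write */
};

void model_journal_start(struct model_ctx *c) { c->handle_active = true; }
void model_journal_stop(struct model_ctx *c) { c->handle_active = false; }

/* i_mmap_sem ranks above the handle: refuse if a handle is active. */
int model_lock_mmap_sem(struct model_ctx *c)
{
	if (c->handle_active)
		return -EDEADLK;
	c->mmap_sem_held = true;
	return 0;
}

void model_unlock_mmap_sem(struct model_ctx *c) { c->mmap_sem_held = false; }

/* Writeback takes a lock that also ranks above the handle. */
int model_writeback(struct model_ctx *c)
{
	return c->handle_active ? -EDEADLK : 0;
}

/* Ordering: lock out faults, write back, and only then start the handle. */
int model_change_journal_flag(struct model_ctx *c)
{
	int err = model_lock_mmap_sem(c);

	if (err)
		return err;
	err = model_writeback(c);
	if (!err) {
		model_journal_start(c);
		/* ... flag update would go here ... */
		model_journal_stop(c);
	}
	model_unlock_mmap_sem(c);
	return err;
}
```

Calling model_change_journal_flag() on a zeroed context succeeds and leaves
nothing held, while trying model_lock_mmap_sem() or model_writeback() after
model_journal_start() is rejected.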

> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > CC: stable@vger.kernel.org
> 
> 								Honza
> 
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
