linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jan Kara <jack@suse.cz>
To: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>, Jan Kara <jack@suse.cz>,
	Dan Williams <dan.j.williams@intel.com>,
	Matthew Wilcox <willy@linux.intel.com>,
	Christoph Hellwig <hch@infradead.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Andrew Morton <akpm@linux-foundation.org>,
	Jan Kara <jack@suse.com>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	linux-nvdimm <linux-nvdimm@ml01.01.org>
Subject: Re: [PATCH 2/2] dax: fix bdev NULL pointer dereferences
Date: Mon, 8 Feb 2016 14:48:40 +0100	[thread overview]
Message-ID: <20160208134840.GC9451@quack.suse.cz> (raw)
In-Reply-To: <20160207064349.GA25849@linux.intel.com>

On Sat 06-02-16 23:43:49, Ross Zwisler wrote:
> On Sun, Feb 07, 2016 at 10:40:53AM +1100, Dave Chinner wrote:
> > On Fri, Feb 05, 2016 at 03:25:00PM -0700, Ross Zwisler wrote:
> > > On Thu, Feb 04, 2016 at 09:29:57PM +0100, Jan Kara wrote:
> > > > I think changes aren't very intrusive so we can feed them in during RC
> > > > phase and frankly, you have to move to using ->writepages() anyway to make
> > > > sync(2) work reliably.
> > > 
> > > I've been looking into this a bit more, and I don't think we actually want to
> > > have DAX flushing in response to sync(2) or syncfs(2).  Here's some text from
> > > the BUGS section of the sync(2) man page:
> > > 
> > > BUGS
> > > 	According to the standard specification (e.g., POSIX.1-2001), sync()
> > > 	schedules the writes,  but  may  return before  the  actual  writing
> > > 	is done.  However, since version 1.3.20 Linux does actually wait.
> > > 	(This still does not guarantee data integrity: modern disks have large
> > > 	caches.) 
> > > 
> > > Based on this I don't believe that it is a requirement that sync and syncfs
> > > actually flush the data durably to media before they return - they just need
> > > to make sure it has been sent to the device, which is always true for all
> > > writes PMEM via DAX, even without any calls to dax_writeback_mapping_range().
> > 
> > That's an assumption we've already pointed out as being platform
> > dependent, not to mention also being undesirable from a performance
> > point of view (writes are 10x slower into pmem than into the page
> > cache using the same physical memory!).
> > 
> > Further, the ordering constraints of modern filesystems mean that
> > they guarantee durability of data written back when sync() is run.
> > i.e.  ext4, XFS, btrfs, etc all ensure that sync guarantees data
> > integrity is maintained across all the data and metadata written
> > back during sync().
> > 
> > e.g. for XFS we do file size update transactions at IO completion.
> > sync() triggers writeback of data, then runs a transaction that
> > modifies the file size so that the data is valid on disk. We
> > absolutely need to ensure that this transaction is durable before
> > sync() returns, otherwise we lose that data if a failure occurs
> > immediately after sync() returns because the size update is not on
> > disk.
> > 
> > Users are right to complain when data written before a sync() call
> > is made does not accessible after a crash/reboot because we failed
> > to make it durable. That's why ->sync_fs(wait) is called at the end
> > of the sync() implementation - it enables filesystems to ensure all
> > data and metadata written during the sync processing is on durable
> > storage.
> > 
> > IOWs, we can't language-lawyer or weasel-word our way out of
> > providing durability guarantees for sync().
> 
> To be clear, we are only talking about user data that was mmaped.  All I/Os
> initiated by the filesystem for metadata, and all user issued I/Os will be
> durable on media before the I/O is completed.  Nothing we do or don't do for
> fsync, msync, sync or syncfs() will break filesystem metadata guarantees.

Yes, but be careful here:

So far ext4 (and AFAIK XFS as well) have depended on the fact that
blkdev_issue_flush() (or WRITE_FLUSH request) will flush all the volatile
caches for the storage. We do such calls / writes from transaction commit
code to make sure that all metadata *and data* writes reach permanent
storage before we make some metadata changes visible. This is necessary so
that we don't expose uninitialized block contents after a power failure
(think of making i_size increase / block allocation visible after power
failure without data being durable). 

For PMEM, we ignore blkdev_issue_flush() / WRITE_FLUSH requests on the
grounds that writes are either done bypassing caches (the case for metadata
IO and IO done via dax_do_io()). Currently we are fine even for mmap
because both ext4 & XFS zero out allocated blocks using non-cached writes
so even though latest data needn't be on persistent storage we won't expose
stale data. But it is a catch that may hit us in future. So from this POV
flushing caches from ->writepages() would also make PMEM give more similar
guarantees to fs as ordinary block storage.

> I think the question then is: "Do users currently rely on data written to an
> mmap to be durable on media after a sync() or syncfs(), in spite of the BUGS
> section quoted above?"
> 
> If the answer for this is "yes", then I agree that we need to enhance the
> current DAX fsync code to cover the sync use case.

IMO the answer is yes.

> However, as stated previously I don't think that it is as easy as just moving
> the call to dax_writeback_mapping_range() to the sync() call path.  On my
> system sync() is called every 5 seconds, and flushing every page of every DAX
> mmap that has ever been dirtied every 5 seconds is unreasonable.

It is not sync() that is called but background writeback. But you are
correct that ->writepages() will get called for a dirty inode every 5
seconds.

> If we need to support the sync() use case with our DAX mmap flushing code, I
> think the only way to do that is to clear out radix entries as we flush them,
> so that the sync calls that happen every 5 seconds are only flushing new
> writes.

I agree and I forgot about the issue that we currently don't clear out
dirty radix tree entries for DAX. I also agree that clearing of written
radix tree entries needs to be implemented before flushing caches from
->writepages() is practical.

> I think we're actually pretty far from this, at least for v4.5.  The issue is
> that I don't think we can safely clear radix tree dirty entries during the DAX
> code that is called via ->writepages().  To do this correctly we would need to
> also mark the PTE as clean so that when the userspace process next writes to
> their mmap mapping we would get a new fault to make the page writable. This
> would allow us to re-dirty the DAX entry in the radix tree.

Yup, so I agree that clearing of radix tree entries probably is not a 4.5
material.

> I implemented code to do this in v2 of my set, but ripped it out in v3:
> 
> https://lkml.org/lkml/2015/11/13/759 (DAX fsync v2)
> https://lkml.org/lkml/2015/12/8/583  (DAX fsync v3)
> 
> The race that compelled this removal is described here:
> 
> https://lists.01.org/pipermail/linux-nvdimm/2016-January/004057.html
> 
> If this is really where we need to be, that's fine, we'll have to figure out
> some way to defeat the race and get this code into v4.6.

I think we need to find a way to safely clear radix tree entries anyway.
And I agree that without page lock it is not easy. I see two possibilities
here:

1) Protect dax_writeback_mapping_range() with exclusive
EXT4_I(inode)->i_mmap_sem and provide new callback called from
wp_page_reuse() to mark index dirty in the radix tree (that would take
shared i_mmap_sem). This is relatively expensive but it should work.

Or

2) We could devote one bit in the radix tree exceptional entry as a bitlock
equivalent to page lock. This will likely scale better but it is more
intrusive.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

  reply	other threads:[~2016-02-08 13:48 UTC|newest]

Thread overview: 63+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-01-28 19:35 [PATCH 1/2] block: fix pfn_mkwrite() DAX fault handler Ross Zwisler
2016-01-28 19:35 ` [PATCH 2/2] dax: fix bdev NULL pointer dereferences Ross Zwisler
2016-01-28 20:21   ` Dan Williams
2016-01-28 21:38   ` Christoph Hellwig
2016-01-29 18:28     ` Ross Zwisler
2016-01-29 23:34       ` Ross Zwisler
2016-01-30  0:18         ` Dan Williams
2016-01-31 22:44         ` Dave Chinner
2016-01-30  5:28       ` Matthew Wilcox
2016-01-30  6:01         ` Dan Williams
2016-01-30  7:08           ` Jared Hulbert
2016-01-31  2:32           ` Matthew Wilcox
2016-01-31  6:12             ` Ross Zwisler
2016-01-31 10:55               ` Matthew Wilcox
2016-01-31 16:38                 ` Dan Williams
2016-01-31 18:07                   ` Matthew Wilcox
2016-01-31 18:18                     ` Dan Williams
2016-01-31 18:27                       ` Matthew Wilcox
2016-01-31 18:50                         ` Dan Williams
2016-01-31 19:51                     ` Dan Williams
2016-02-01 13:44             ` Matthew Wilcox
2016-02-01 14:51         ` Jan Kara
2016-02-01 20:49           ` Matthew Wilcox
2016-02-01 21:47           ` Dave Chinner
2016-02-02  6:06             ` Jared Hulbert
2016-02-02  6:46               ` Dan Williams
2016-02-02  8:05                 ` Jared Hulbert
2016-02-02 16:51                   ` Dan Williams
2016-02-02 21:46                     ` Jared Hulbert
2016-02-03  0:34                       ` Matthew Wilcox
2016-02-03  1:21                         ` Jared Hulbert
2016-02-02 11:17             ` Jan Kara
2016-02-02 16:33               ` Dan Williams
2016-02-02 16:46                 ` Jan Kara
2016-02-02 17:10                   ` Dan Williams
2016-02-02 17:34                     ` Ross Zwisler
2016-02-02 17:46                       ` Dan Williams
2016-02-02 17:47                         ` Dan Williams
2016-02-02 18:24                           ` Ross Zwisler
2016-02-02 18:46                         ` Matthew Wilcox
2016-02-02 18:59                           ` Dan Williams
2016-02-02 20:14                             ` Matthew Wilcox
2016-02-03 11:09                           ` Jan Kara
2016-02-03 10:46                       ` Jan Kara
2016-02-03 20:13                         ` Ross Zwisler
2016-02-04  9:15                           ` Jan Kara
2016-02-04 23:38                             ` Ross Zwisler
2016-02-06 23:15                             ` Dave Chinner
2016-02-07  5:27                               ` Ross Zwisler
2016-02-04 19:56                         ` Ross Zwisler
2016-02-04 20:29                           ` Jan Kara
2016-02-04 22:19                             ` Ross Zwisler
2016-02-05 22:25                             ` Ross Zwisler
2016-02-06 23:40                               ` Dave Chinner
2016-02-07  6:43                                 ` Ross Zwisler
2016-02-08 13:48                                   ` Jan Kara [this message]
2016-02-07  8:38                               ` Christoph Hellwig
2016-02-08 15:55                                 ` Ross Zwisler
2016-02-02 18:41               ` Ross Zwisler
2016-02-02 18:53                 ` Ross Zwisler
2016-02-02  0:02     ` Ross Zwisler
2016-02-02  7:10       ` Dave Chinner
2016-02-02 10:34       ` Jan Kara

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20160208134840.GC9451@quack.suse.cz \
    --to=jack@suse.cz \
    --cc=akpm@linux-foundation.org \
    --cc=dan.j.williams@intel.com \
    --cc=david@fromorbit.com \
    --cc=hch@infradead.org \
    --cc=jack@suse.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-nvdimm@ml01.01.org \
    --cc=ross.zwisler@linux.intel.com \
    --cc=viro@zeniv.linux.org.uk \
    --cc=willy@linux.intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).