From: Ross Zwisler <ross.zwisler@linux.intel.com>
To: Jan Kara <jack@suse.cz>, Dave Chinner <david@fromorbit.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>,
	Dan Williams <dan.j.williams@intel.com>,
	Matthew Wilcox <willy@linux.intel.com>,
	Christoph Hellwig <hch@infradead.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	linux-nvdimm <linux-nvdimm@ml01.01.org>
Subject: Re: [PATCH 2/2] dax: fix bdev NULL pointer dereferences
Date: Wed, 3 Feb 2016 13:13:28 -0700
Message-ID: <20160203201328.GC29275@linux.intel.com>
In-Reply-To: <20160203104611.GF12574@quack.suse.cz>

On Wed, Feb 03, 2016 at 11:46:11AM +0100, Jan Kara wrote:
> On Tue 02-02-16 10:34:56, Ross Zwisler wrote:
<>
> > Oh, no, the primary issue with calling get_block() in the fsync path isn't
> > performance.  It's that we don't have any idea what get_block() function to
> > call.
> > 
> > The fault handler calls all come from the filesystem directly, so they can
> > easily give us an appropriate get_block() function pointer.  But the
> > dax_writeback_mapping_range() calls come from the generic code in
> > mm/filemap.c, and don't know what get_block() to pass in.
> > 
> > During one iteration I had the calls to dax_writeback_mapping_range()
> > happening in the filesystem fsync code so that it could pass in get_block(),
> > but Dave Chinner pointed out that this misses other paths in the filesystem
> > that need to have things flushed via a call to filemap_write_and_wait_range().
> 
> Let's clear this up a bit: The problem with using ->fsync() method is that
> it doesn't get called for sync(2). 

Ugh, yep.  We need to correct this.

> We could use ->sync_fs() to flush caches
> in case of sync(2) (that's what's happening for normal storage) but the
> problem with PMEM is that "flush all cached data" operation effectively
> means iterate through all modified pages and we didn't want to implement
> this for DAX fsync code.

I'm not sure what you mean by this - isn't that exactly what we implemented
for the DAX fsync code?  We iterate through all of the dirty pages recorded
in the radix tree and flush them to media.

It seems like we could and should do the same thing for sync(2) and syncfs()?
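
Roughly, the loop I have in mind looks like the sketch below.  This is
illustrative only - dax_flush_entry() is a stand-in name rather than the real
helper, and the locking needed to keep entries from disappearing under us is
omitted:

/*
 * Sketch only: walk the radix tree entries that the fault handlers
 * tagged dirty and write the corresponding cache lines back to media.
 * dax_flush_entry() is a made-up name.
 */
static void dax_flush_dirty_entries(struct address_space *mapping,
                                    pgoff_t start, pgoff_t end)
{
        struct radix_tree_iter iter;
        void **slot;

        rcu_read_lock();
        radix_tree_for_each_tagged(slot, &mapping->page_tree, &iter,
                                   start, PAGECACHE_TAG_DIRTY) {
                if (iter.index > end)
                        break;
                /* the entry records the sector/pfn we saved at fault time */
                dax_flush_entry(mapping, iter.index,
                                radix_tree_deref_slot(slot));
        }
        rcu_read_unlock();
}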

> So we have decided to do cache flushing for DAX at a different point - mark
> inodes which may have writes cached as dirty and use writeback code for the
> cache flushing. But looking at it now, we have actually chosen a wrong
> place to do the flushing in the writeback path - note that sync(2) writes
> data via __writeback_single_inode() -> do_writepages() and thus doesn't
> even get to filemap_write_and_wait().

True - but I think we basically have to add another case for sync() regardless
of what we do, since AFAICS the fsync() and sync() paths never intersect.  So
if we call into the DAX code at the filesystem level we'll need to do it in
both ->fsync and ->sync_fs/->writepages, and if we do it in common MM code
we'll need to do it in both filemap_write_and_wait_range() and do_writepages()
(or similar).
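
To make the common-MM-code option concrete, the hooks would sit in roughly
the two places sketched below.  The bodies are heavily simplified, and the
argument list I'm using for dax_writeback_mapping_range() is just a
placeholder - what it actually needs to take (bdev? get_block?) is exactly
the open question:

/*
 * Sketch only: where the DAX flush calls would live if we keep them in
 * the generic mm code.  Bodies are simplified and the
 * dax_writeback_mapping_range() arguments are placeholders.
 */

/* fsync()/msync() path */
int filemap_write_and_wait_range(struct address_space *mapping,
                                 loff_t lstart, loff_t lend)
{
        if (mapping->host && IS_DAX(mapping->host))
                return dax_writeback_mapping_range(mapping, lstart, lend);

        /* ... existing page cache writeback + wait ... */
        return 0;
}

/* sync(2)/syncfs(2) path, via __writeback_single_inode() */
int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
{
        if (mapping->host && IS_DAX(mapping->host))
                return dax_writeback_mapping_range(mapping, wbc->range_start,
                                                   wbc->range_end);

        /* ... existing ->writepages()/generic_writepages() call ... */
        return 0;
}

Either way the point is the same: there are two distinct entry points to
cover, whichever layer we pick.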

Here is the comment from Dave Chinner that convinced me to move the calls to
dax_writeback_mapping_range() into the generic mm code:

   > Lastly, this flushing really needs to be inside
   > filemap_write_and_wait_range(), because we call the writeback code
   > from many more places than just fsync to ensure ordering of various
   > operations such that files are in known state before proceeding
   > (e.g. hole punch).

https://lkml.org/lkml/2015/11/16/847

Both ext4 and XFS call filemap_write_and_wait_range() outside of ->fsync for
hole punch, truncate, and block relocation (xfs_shift_file_space() and
ext4_collapse_range()/ext4_insert_range()).

I think having DAX special case code sprinkled over all of these call sites
is an indication that we've made a bad choice, and that we should instead
centralize in the mm layer with filemap_write_and_wait_range() and
do_writepages().

OTOH, I think it might be true that we don't actually need to cover all of
these non-fsync/sync cases.  With the page cache, dirty data is actively lost
if we evict a dirty page without flushing it to media first.  For DAX,
though, the data stays associated with the physical address it was written to
whether or not it is still sitting in the processor cache - really the only
reason I see to flush is in response to fsync/msync/sync/syncfs, so that the
data is durable on media in case of a power loss.  The case where we could
throw dirty data out of the page cache and essentially lose writes simply
doesn't exist for DAX.

Does this sound correct, or am I missing something?
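
To be explicit about what "flush to media" means above: for DAX it is just
writing back the CPU cache lines that cover the kernel mapping of the pmem,
followed by a store fence.  Simplified sketch (the real code goes through the
pmem API helpers rather than calling the x86 routine directly):

/*
 * Simplified: write the cache lines covering a pmem range back to
 * media, then fence.
 */
static void dax_flush_range(void *kaddr, size_t size)
{
        clflush_cache_range(kaddr, size);
        wmb();
}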

> So revisiting the decision I see two options:
> 
> 1) Move the DAX flushing code from filemap_write_and_wait() into
> ->writepages() fs callback. There the filesystem can provide all the
> information it needs including bdev, get_block callback, or whatever.

This seems fine as long as we add it to ->fsync as well since ->writepages is
never called in that path, and as long as we are okay with skipping DAX
writebacks on hole punch, truncate, and block relocation.
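
For illustration, option 1) could look something like the sketch below for
ext4.  The dax_writeback_mapping_range() signature used here (taking the
filesystem's get_block) is made up for the example - the point is just that
the fs-level callbacks can hand the DAX code the pieces the generic mm code
can't discover on its own:

/*
 * Sketch of option 1) for ext4, with a made-up
 * dax_writeback_mapping_range() signature that takes the filesystem's
 * get_block.  The same two hooks would be needed in XFS.
 */
static int ext4_dax_writepages(struct address_space *mapping,
                               struct writeback_control *wbc)
{
        return dax_writeback_mapping_range(mapping, ext4_get_block,
                                           wbc->range_start, wbc->range_end);
}

static int ext4_dax_fsync(struct file *file, loff_t start, loff_t end,
                          int datasync)
{
        int err;

        /* flush the dirty DAX ranges first... */
        err = dax_writeback_mapping_range(file->f_mapping, ext4_get_block,
                                          start, end);
        if (err)
                return err;

        /* ...then the usual journal commit / flush would go here */
        return 0;
}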

> 2) Back out even further and implement own tracking and iteration of inodes
> to write.
> 
> So far I still think 2) is not worth the complexity (although it would
> bring DAX code closer to how things behave with standard storage) so I
> would go for 1).
