Re: [PATCH 2/2] dax: fix bdev NULL pointer dereferences

From: Matthew Wilcox <willy@linux.intel.com>
To: Jared Hulbert <jaredeh@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>,
	Dave Chinner <david@fromorbit.com>, Jan Kara <jack@suse.cz>,
	Ross Zwisler <ross.zwisler@linux.intel.com>,
	Christoph Hellwig <hch@infradead.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Andrew Morton <akpm@linux-foundation.org>,
	Jan Kara <jack@suse.com>,
	Linux FS Devel <linux-fsdevel@vger.kernel.org>,
	linux-nvdimm <linux-nvdimm@ml01.01.org>
Subject: Re: [PATCH 2/2] dax: fix bdev NULL pointer dereferences
Date: Tue, 2 Feb 2016 19:34:16 -0500
Message-ID: <20160203003416.GD3260@linux.intel.com>
In-Reply-To: <CA+ZsKJ4rrgQNnnrdvmnTP2GcrZna83+yUV_GFBhEQ6HDKqd7HA@mail.gmail.com>

On Tue, Feb 02, 2016 at 01:46:06PM -0800, Jared Hulbert wrote:
> On Tue, Feb 2, 2016 at 8:51 AM, Dan Williams <dan.j.williams@intel.com> wrote:
> >> The filesystem I'm concerned with is AXFS
> >> (https://www.kernel.org/doc/ols/2008/ols2008v1-pages-211-218.pdf).
> >> Which I've been planning on trying to merge again due to a recent
> >> resurgence of interest.  The device model for AXFS is... weird.  It
> >> can use one or two devices at a time of any mix of NOR MTD, NAND MTD,
> >> block, and unmanaged physical memory.  It's a terribly useful model
> >> for embedded.  Anyway AXFS is readonly so hacking in a read only
> >> dax_fault_nodev() and dax_file_read() would work fine, looks easy
> >> enough.  But... it would be cool if similar small embedded focused RW
> >> filesystems were enabled.
> >
> > Are those also out of tree?
> 
> Of course.  Merging embedded filesystems is like merging regular
> filesystems except 98% of you reviewers don't want it merged.

You should at least be able to get it into staging these days.  I mean,
look at some of the junk that's in staging ... and I don't think AXFS was
nearly as bad.

> IMO you're making DAX more complex by overly coupling to the bdev and
> I think it could bite you later.  I submit this rework of the radix
> tree and confusion about where to get the real bdev as evidence.  I'm
> guessing that it won't be the last time.  It's unnecessary to couple
> it like this, and in fact is not how the vfs has been layered in the
> past.

Huh?  The rework to use the radix tree for PFNs was done with one eye
firmly on your usage case.  Just because I had to thread the get_block
interface through it for the moment doesn't mean that I didn't have
the "how do we get rid of get_block entirely" question on my mind.

Using get_block seemed like the right idea three years ago.  I didn't
know just how fundamentally ext4 and XFS disagree on how it should be
used.

> To look at the downside, consider dax_fault().  It's called on a
> fault to a user memory map, uses the filesystem's get_block() to look up
> a sector so you can ask a block device to convert it to an address on
> a DIMM.  Come on, that's awkward.  Everything around dax_fault() is
> dripping with memory semantic interfaces, the dax_fault() calls are
> fundamentally about memory, the pmem calls are memory, the hardware is
> memory, and yet it directly calls bdev_direct_access().  It's out of
> place.

What was out of place was the old 'get_xip_mem' in address_space
operations.  Returning a kernel virtual address and a PFN from a
filesystem operation?  That looks awful.  All the other operations deal
in struct pages, file offsets and occasionally sectors.  Of course, we
don't have a struct page, so a pfn makes sense, but the kernel virtual
address being returned was a gargantuan layering problem.

> The legacy vfs/mm code didn't have this layering problem either.  Even
> filemap_fault() that dax_fault() is modeled after doesn't call any
> bdev methods directly, when it needs something it asks the filesystem
> with a ->readpage().  The precedent is that you ask the filesystem
> for what you need.  Look at the get_bdev() thing you've concluded you
> need.  It _almost_ makes my point.  I just happen to be of the opinion
> that you don't actually want or need the bdev, you want the pfn/kaddr
> so you can flush or map or memcpy().

You want the pfn.  The device driver doesn't have enough information to
give you a (coherent with userspace) kaddr.  That's what (some future
arch-specific implementation of) dax_map_pfn() is for.  That's why it
takes 'index' as a parameter, so you can calculate where it'll be mapped
in userspace, and determine an appropriate kernel virtual address to
use for it.
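
To make the role of 'index' concrete, here is a rough userspace sketch of
the idea, not the actual kernel code.  It assumes an architecture with a
virtually indexed cache, where two virtual addresses for the same page are
only coherent if they share a cache colour, and it assumes shared mappings
are placed so that the user address colour follows the file page index.
Every name here (SKETCH_PAGE_SHIFT, SKETCH_NR_COLOURS, sketch_colour(),
sketch_map_kaddr()) is hypothetical, and the 4-colour, 4 KiB page geometry
is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed geometry: 4 KiB pages, 4 cache colours (16 KiB alias span). */
#define SKETCH_PAGE_SHIFT  12
#define SKETCH_PAGE_SIZE   ((uintptr_t)1 << SKETCH_PAGE_SHIFT)
#define SKETCH_NR_COLOURS  4UL
#define SKETCH_ALIAS_SPAN  (SKETCH_NR_COLOURS * SKETCH_PAGE_SIZE)

/*
 * Cache colour that the page at file offset 'index' is assumed to have
 * in userspace.  Assumption: shared mappings are aligned so that the
 * colour of the user address depends only on the file page index.
 */
static unsigned long sketch_colour(unsigned long index)
{
	return index % SKETCH_NR_COLOURS;
}

/*
 * Pick a kernel virtual address with the same colour as the userspace
 * mapping.  'window_base' stands in for a kernel mapping window that is
 * aligned to the alias span and has one slot per colour.
 */
static uintptr_t sketch_map_kaddr(uintptr_t window_base, unsigned long index)
{
	/* The window must start on an alias-span boundary. */
	assert((window_base & (SKETCH_ALIAS_SPAN - 1)) == 0);
	return window_base + (sketch_colour(index) << SKETCH_PAGE_SHIFT);
}
```

With these made-up numbers, file page 5 has colour 1, so
sketch_map_kaddr(0x100000, 5) picks 0x101000.  The pfn alone could not have
told you that, which is why the driver can't hand back a usable kaddr.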