From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1755323AbcBCBV6 (ORCPT <rfc822;w@1wt.eu>);
	Tue, 2 Feb 2016 20:21:58 -0500
Received: from mail-wm0-f68.google.com ([74.125.82.68]:34367 "EHLO
	mail-wm0-f68.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1755128AbcBCBV4 (ORCPT
	<rfc822;linux-kernel@vger.kernel.org>);
	Tue, 2 Feb 2016 20:21:56 -0500
MIME-Version: 1.0
In-Reply-To: <20160203003416.GD3260@linux.intel.com>
References: <20160128213858.GA29114@infradead.org>
	<20160129182815.GB5224@linux.intel.com>
	<20160130052833.GY2948@linux.intel.com>
	<20160201145147.GD13740@quack.suse.cz>
	<20160201214730.GR20456@dastard>
	<CA+ZsKJ7RVX6EiMoPd62pOjgKWuSxizn8nuZwVJHqNwx3gbHrJg@mail.gmail.com>
	<CAPcyv4gGEAfVCoGmABBK4jLkkQpgcT==wBjfOMBRSq_DQXY7gw@mail.gmail.com>
	<CA+ZsKJ526NVUJ1wZCRyL142Uy6oYgN516Ua195Fnrkohb_D5JA@mail.gmail.com>
	<CAPcyv4iM9NGO3SbYqEg4HfT6f-ant-debNHpAaBDvXj+-hAjRQ@mail.gmail.com>
	<CA+ZsKJ4rrgQNnnrdvmnTP2GcrZna83+yUV_GFBhEQ6HDKqd7HA@mail.gmail.com>
	<20160203003416.GD3260@linux.intel.com>
Date: Tue, 2 Feb 2016 17:21:55 -0800
Message-ID: <CA+ZsKJ5MXN7_rAp5y_HYor9zrrAG6ZW6Grix4fou5XDegup_gg@mail.gmail.com>
Subject: Re: [PATCH 2/2] dax: fix bdev NULL pointer dereferences
From: Jared Hulbert <jaredeh@gmail.com>
To: Matthew Wilcox <willy@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>,
        Dave Chinner <david@fromorbit.com>, Jan Kara <jack@suse.cz>,
        Ross Zwisler <ross.zwisler@linux.intel.com>,
        Christoph Hellwig <hch@infradead.org>,
        LKML <linux-kernel@vger.kernel.org>,
        Alexander Viro <viro@zeniv.linux.org.uk>,
        Andrew Morton <akpm@linux-foundation.org>, Jan Kara <jack@suse.com>,
        Linux FS Devel <linux-fsdevel@vger.kernel.org>,
        linux-nvdimm <linux-nvdimm@ml01.01.org>
Content-Type: text/plain; charset=UTF-8
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Feb 2, 2016 at 4:34 PM, Matthew Wilcox <willy@linux.intel.com> wrote:
> On Tue, Feb 02, 2016 at 01:46:06PM -0800, Jared Hulbert wrote:
>> On Tue, Feb 2, 2016 at 8:51 AM, Dan Williams <dan.j.williams@intel.com> wrote:
>> >> The filesystem I'm concerned with is AXFS
>> >> (https://www.kernel.org/doc/ols/2008/ols2008v1-pages-211-218.pdf).
>> >> Which I've been planning on trying to merge again due to a recent
>> >> resurgence of interest.  The device model for AXFS is... weird.  It
>> >> can use one or two devices at a time of any mix of NOR MTD, NAND MTD,
>> >> block, and unmanaged physical memory.  It's a terribly useful model
>> >> for embedded.  Anyway AXFS is readonly so hacking in a read only
>> >> dax_fault_nodev() and dax_file_read() would work fine, looks easy
>> >> enough.  But... it would be cool if similar small embedded focused RW
>> >> filesystems were enabled.
>> >
>> > Are those also out of tree?
>>
>> Of course.  Merging embedded filesystems is like merging regular
>> filesystems except 98% of you reviewers don't want it merged.
>
> You should at least be able to get it into staging these days.  I mean,
> look at some of the junk that's in staging ... and I don't think AXFS was
> nearly as bad.

Thanks....? ;)

>> IMO you're making DAX more complex by overly coupling to the bdev and
>> I think it could bite you later.  I submit this rework of the radix
>> tree and confusion about where to get the real bdev as evidence.  I'm
>> guessing that it won't be the last time.  It's unnecessary to couple
>> it like this, and in fact is not how the vfs has been layered in the
>> past.
>
> Huh?  The rework to use the radix tree for PFNs was done with one eye
> firmly on your usage case.  Just because I had to thread the get_block
> interface through it for the moment doesn't mean that I didn't have
> the "how do we get rid of get_block entirely" question on my mind.

Oh yeah.  I think we're on the same page.  But I'm not sure Dan is.  I
get the need to phase this in too.

> Using get_block seemed like the right idea three years ago.  I didn't
> know just how fundamentally ext4 and XFS disagree on how it should be
> used.

Sure.  I can see that.

>> To look at the downside consider dax_fault().  It's called on a
>> fault to a user memory map, uses the filesystem's get_block() to look up
>> a sector so you can ask a block device to convert it to an address on
>> a DIMM.  Come on, that's awkward.  Everything around dax_fault() is
>> dripping with memory semantic interfaces, the dax_fault() calls are
>> fundamentally about memory, the pmem calls are memory, the hardware is
>> memory, and yet it directly calls bdev_direct_access().  It's out of
>> place.
>
> What was out of place was the old 'get_xip_mem' in address_space
> operations.  Returning a kernel virtual address and a PFN from a
> filesystem operation?  That looks awful.

Yes.  Yes it does!  But at least my big hack was just one line. ;)
Nobody really even seemed to notice at the time.

>  All the other operations deal
> in struct pages, file offsets and occasionally sectors.  Of course, we
> don't have a struct page, so a pfn makes sense, but the kernel virtual
> address being returned was a gargantuan layering problem.

Well yes, but it was an expedient hack.

>> The legacy vfs/mm code didn't have this layering problem either.  Even
>> filemap_fault() that dax_fault() is modeled after doesn't call any
>> bdev methods directly, when it needs something it asks the filesystem
>> with a ->readpage().  The precedent is that you ask the filesystem
>> for what you need.  Look at the get_bdev() thing you've concluded you
>> need.  It _almost_ makes my point.  I just happen to be of the opinion
>> that you don't actually want or need the bdev, you want the pfn/kaddr
>> so you can flush or map or memcpy().
>
> You want the pfn.  The device driver doesn't have enough information to
> give you a (coherent with userspace) kaddr.  That's what (some future
> arch-specific implementation of) dax_map_pfn() is for.  That's why it
> takes 'index' as a parameter, so you can calculate where it'll be mapped
> in userspace, and determine an appropriate kernel virtual address to
> use for it.

Oh.... I think I'm just beginning to catch your vision for
dax_map_pfn().  I still don't get why we can't just do semi-arch
specific flushing instead of the alignment thing.  But that just might
be epic ignorance on my part.  Either way, flush or magic alignments,
dax_(un)map_pfn() would handle it, right?
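
For illustration only, the layering argument in this thread can be modeled
in user-space C.  This is a rough sketch under stated assumptions, not
kernel code: every name below (model_bdev, model_fs, fs_get_sector,
fs_get_pfn, fault_coupled, fault_decoupled) is hypothetical, and the
sector and pfn arithmetic is invented to keep the example small.  It
contrasts a fault path that goes filesystem -> sector -> block device with
one that simply asks the filesystem for a pfn.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NO_PFN UINT64_MAX

/* Hypothetical stand-in for a block device that can turn a sector into a pfn. */
struct model_bdev {
	uint64_t base_pfn;          /* pfn backing sector 0 */
	unsigned sectors_per_page;  /* e.g. 8 for 512-byte sectors and 4 KiB pages */
};

/* Hypothetical filesystem instance; bdev is NULL for raw physical memory. */
struct model_fs {
	struct model_bdev *bdev;
	uint64_t phys_base_pfn;     /* used only when bdev == NULL */
	uint64_t data_start_page;   /* page offset of file data in the image */
};

/* Coupled style, step 1: the filesystem only knows how to name a sector. */
static uint64_t fs_get_sector(const struct model_fs *fs, uint64_t index)
{
	assert(fs->bdev != NULL);
	return (fs->data_start_page + index) * fs->bdev->sectors_per_page;
}

/* Coupled style, step 2: the block device converts the sector to a pfn. */
static uint64_t bdev_sector_to_pfn(const struct model_bdev *bdev, uint64_t sector)
{
	return bdev->base_pfn + sector / bdev->sectors_per_page;
}

/* Fault path that must find a bdev itself; it has no answer without one. */
static uint64_t fault_coupled(const struct model_fs *fs, uint64_t index)
{
	if (fs->bdev == NULL)
		return NO_PFN;
	return bdev_sector_to_pfn(fs->bdev, fs_get_sector(fs, index));
}

/* Decoupled style: the filesystem answers with a pfn, however it gets it. */
static uint64_t fs_get_pfn(const struct model_fs *fs, uint64_t index)
{
	if (fs->bdev != NULL)
		return bdev_sector_to_pfn(fs->bdev, fs_get_sector(fs, index));
	return fs->phys_base_pfn + fs->data_start_page + index;
}

/* Fault path that never touches a bdev directly. */
static uint64_t fault_decoupled(const struct model_fs *fs, uint64_t index)
{
	return fs_get_pfn(fs, index);
}
```

In this model both fault paths return the same pfn for a filesystem that
sits on a block device, but only the decoupled one has an answer when the
backing store is unmanaged physical memory with no bdev at all.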