From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755323AbcBCBV6 (ORCPT ); Tue, 2 Feb 2016 20:21:58 -0500 Received: from mail-wm0-f68.google.com ([74.125.82.68]:34367 "EHLO mail-wm0-f68.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755128AbcBCBV4 (ORCPT ); Tue, 2 Feb 2016 20:21:56 -0500 MIME-Version: 1.0 In-Reply-To: <20160203003416.GD3260@linux.intel.com> References: <20160128213858.GA29114@infradead.org> <20160129182815.GB5224@linux.intel.com> <20160130052833.GY2948@linux.intel.com> <20160201145147.GD13740@quack.suse.cz> <20160201214730.GR20456@dastard> <20160203003416.GD3260@linux.intel.com> Date: Tue, 2 Feb 2016 17:21:55 -0800 Message-ID: Subject: Re: [PATCH 2/2] dax: fix bdev NULL pointer dereferences From: Jared Hulbert To: Matthew Wilcox Cc: Dan Williams , Dave Chinner , Jan Kara , Ross Zwisler , Christoph Hellwig , LKML , Alexander Viro , Andrew Morton , Jan Kara , Linux FS Devel , linux-nvdimm Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Feb 2, 2016 at 4:34 PM, Matthew Wilcox wrote: > On Tue, Feb 02, 2016 at 01:46:06PM -0800, Jared Hulbert wrote: >> On Tue, Feb 2, 2016 at 8:51 AM, Dan Williams wrote: >> >> The filesystem I'm concerned with is AXFS >> >> (https://www.kernel.org/doc/ols/2008/ols2008v1-pages-211-218.pdf). >> >> Which I've been planning on trying to merge again due to a recent >> >> resurgence of interest. The device model for AXFS is... weird. It >> >> can use one or two devices at a time of any mix of NOR MTD, NAND MTD, >> >> block, and unmanaged physical memory. It's a terribly useful model >> >> for embedded. Anyway AXFS is readonly so hacking in a read only >> >> dax_fault_nodev() and dax_file_read() would work fine, looks easy >> >> enough. But... it would be cool if similar small embedded focused RW >> >> filesystems were enabled. >> > >> > Are those also out of tree? >> >> Of course. Merging embedded filesystems is little merging regular >> filesystems except 98% of you reviewers don't want it merged. > > You should at least be able to get it into staging these days. I mean, > look at some of the junk that's in staging ... and I don't think AXFS was > nearly as bad. Thanks....? ;) >> IMO you're making DAX more complex by overly coupling to the bdev and >> I think it could bite you later. I submit this rework of the radix >> tree and confusion about where to get the real bdev as evidence. I'm >> guessing that it won't be the last time. It's unnecessary to couple >> it like this, and in fact is not how the vfs has been layered in the >> past. > > Huh? The rework to use the radix tree for PFNs was done with one eye > firmly on your usage case. Just because I had to thread the get_block > interface through it for the moment doesn't mean that I didn't have > the "how do we get rid of get_block entirely" question on my mind. Oh yeah. I think we're on the same page. But I'm not sure Dan is. I get the need to phase this in too. > Using get_block seemed like the right idea three years ago. I didn't > know just how fundamentally ext4 and XFS disagree on how it should be > used. Sure. I can see that. >> To look at the the downside consider dax_fault(). Its called on a >> fault to a user memory map, uses the filesystems get_block() to lookup >> a sector so you can ask a block device to convert it to an address on >> a DIMM. Come on, that's awkward. Everything around dax_fault() is >> dripping with memory semantic interfaces, the dax_fault() call are >> fundamentally about memory, the pmem calls are memory, the hardware is >> memory, and yet it directly calls bdev_direct_access(). It's out of >> place. > > What was out of place was the old 'get_xip_mem' in address_space > operations. Returning a kernel virtual address and a PFN from a > filesystem operation? That looks awful. Yes. Yes it does! But at least my big hack was just one line. ;) Nobody really even seemed to notice at the time. > All the other operations deal > in struct pages, file offsets and occasionally sectors. Of course, we > don't have a struct page, so a pfn makes sense, but the kernel virtual > address being returned was a gargantuan layering problem. Well yes, but it was an expedient hack. >> The legacy vfs/mm code didn't have this layering problem either. Even >> filemap_fault() that dax_fault() is modeled after doesn't call any >> bdev methods directly, when it needs something it asks the filesystem >> with a ->readpage(). The precedence is that you ask the filesystem >> for what you need. Look at the get_bdev() thing you've concluded you >> need. It _almost_ makes my point. I just happen to be of the opinion >> that you don't actually want or need the bdev, you want the pfn/kaddr >> so you can flush or map or memcpy(). > > You want the pfn. The device driver doesn't have enough information to > give you a (coherent with userspace) kaddr. That's what (some future > arch-specific implementation of) dax_map_pfn() is for. That's why it > takes 'index' as a parameter, so you can calculate where it'll be mapped > in userspace, and determine an appropriate kernel virtual address to > use for it. Oh.... I think I'm just beginning to catch your vision for dax_map_pfn(). I still don't get why we can't just do semi-arch specific flushing instead of the alignment thing. But that just might be epic ignorance on my part. Either way flush or magic alignments dax_(un)map_pfn() would handle it, right?