Date: Tue, 2 Feb 2016 13:46:06 -0800
Subject: Re: [PATCH 2/2] dax: fix bdev NULL pointer dereferences
From: Jared Hulbert
To: Dan Williams
Cc: Dave Chinner, Jan Kara, Matthew Wilcox, Ross Zwisler,
    Christoph Hellwig, LKML, Alexander Viro, Andrew Morton,
    Linux FS Devel, linux-nvdimm

On Tue, Feb 2, 2016 at 8:51 AM, Dan Williams wrote:
> On Tue, Feb 2, 2016 at 12:05 AM, Jared Hulbert wrote:
> [..]
>> Well... as CONFIG_BLOCK was not required with filemap_xip.c for a
>> decade. This CONFIG_BLOCK dependency is a result of an incremental
>> feature from a certain point of view ;)
>>
>> The obvious 'driver' is physical RAM without a particular driver.
>> Remember please I'm talking about embedded. RAM measured in MiB and
>> funky one off hardware etc. In the embedded world there are lots of
>> ways that persistent memory has been supported in device specific ways
>> without the new fancypants NFIT and Intel instructions, so frankly
>> they don't fit in the PMEM stuff.
>> Maybe they could be supported in
>> PMEM but not without effort to bring embedded players to the table.
>
> Not sure what you're trying to say here. An ACPI NFIT only feeds the
> generic libnvdimm device model. You don't need NFIT to get pmem.

Right... I'm just not seeing how the libnvdimm device model fits, is
relevant, or is useful to a persistent SRAM in embedded. Therefore I
don't expect some of these users to have a driver at all.

>> The other drivers are the MTD drivers, probably as read-only for now.
>> But the paradigm there isn't so different from what PMEM looks like
>> with asymmetric read/write capabilities.
>>
>> The filesystem I'm concerned with is AXFS
>> (https://www.kernel.org/doc/ols/2008/ols2008v1-pages-211-218.pdf).
>> Which I've been planning on trying to merge again due to a recent
>> resurgence of interest. The device model for AXFS is... weird. It
>> can use one or two devices at a time of any mix of NOR MTD, NAND MTD,
>> block, and unmanaged physical memory. It's a terribly useful model
>> for embedded. Anyway AXFS is readonly so hacking in a read only
>> dax_fault_nodev() and dax_file_read() would work fine, looks easy
>> enough. But... it would be cool if similar small embedded focused RW
>> filesystems were enabled.
>
> Are those also out of tree?

Of course. Merging embedded filesystems is like merging regular
filesystems, except 98% of you reviewers don't want it merged.

>> I don't expect you to taint DAX with design requirements for this
>> stuff that it wasn't built for, nobody ends up happy in that case.
>> However, if enabling the filesystem to manage the bdev_direct_access()
>> interactions solves some of the "alternate device" problems you are
>> discussing here, then there is a chance we can accommodate both.
>> Sometimes that works.
>>
>> So... Forget CONFIG_BLOCK=n entirely I didn't want that to be the
>> focus anyway.
>> Does it help to support the weirder XFS and btrfs
>> device models to enable the filesystem to handle the
>> bdev_direct_access() stuff?
>
> It's not clear that it does. We just clarified with xfs and ext4 that
> we can rely on get_blocks(). That solves the immediate concern with
> multi-device filesystems.

IMO you're making DAX more complex by overly coupling it to the bdev,
and I think it could bite you later. I submit this rework of the radix
tree and the confusion about where to get the real bdev as evidence.
I'm guessing that it won't be the last time. It's unnecessary to couple
it like this, and in fact it is not how the vfs has been layered in the
past.

The trouble with vfs work has been that it straddles the line between
mm and block; unfortunately that line is a dark chasm with ill-defined
boundaries. DAX is even more exciting because it's trying to duct tape
the filesystem even closer to the mm system. One could argue it's
actually, in some respects, enabling the filesystem to bypass the mm
code.

On top of that, DAX is designed to enable block-based filesystems to
use RAM-like devices. Bolting the block device interface on to NVDIMM
is a brilliant hack and the right design choice, but it's still a hack.
The upside is that it enables the reuse of all this glorious legacy
filesystem code, which does a pretty amazing job of handling what the
pmem device applications need considering it was designed to manage
data on platters of slow spinning rust. What would DAX look like if it
were developed with a filesystem purpose-built for pmem?

To look at the downside, consider dax_fault(). It's called on a fault
to a user memory map, and uses the filesystem's get_block() to look up
a sector so you can ask a block device to convert it to an address on a
DIMM. Come on, that's awkward.
Everything around dax_fault() is dripping with memory semantic
interfaces: the dax_fault() calls are fundamentally about memory, the
pmem calls are memory, the hardware is memory, and yet it directly
calls bdev_direct_access(). It's out of place.

The legacy vfs/mm code didn't have this layering problem either. Even
filemap_fault(), which dax_fault() is modeled after, doesn't call any
bdev methods directly; when it needs something it asks the filesystem
with a ->readpage(). The precedent is that you ask the filesystem for
what you need.

Look at the get_bdev() thing you've concluded you need. It _almost_
makes my point. I just happen to be of the opinion that you don't
actually want or need the bdev; you want the pfn/kaddr so you can
flush or map or memcpy().
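
To make the layering argument concrete, here is a rough userspace
sketch in C. It is not kernel code and not a proposed patch. Every name
in it (toy_pmem, toy_inode, toy_fs_ops, its ->direct_access() hook,
toyfs_direct_access(), toy_dax_fault(), toy_demo()) is made up for
illustration and does not exist in the kernel. It only models the idea
that the generic fault path asks the filesystem for an address, much as
filemap_fault() asks for a ->readpage(), and never sees a block device.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096UL
#define TOY_NR_PAGES  8UL

/* Stand-in for a directly addressable persistent memory region. */
static unsigned char toy_pmem[TOY_NR_PAGES * TOY_PAGE_SIZE];

struct toy_inode;

/* Hypothetical per-filesystem hook: file page offset -> address. */
struct toy_fs_ops {
	void *(*direct_access)(struct toy_inode *inode, unsigned long pgoff);
};

struct toy_inode {
	const struct toy_fs_ops *ops;
	unsigned long first_page;	/* file's first page within toy_pmem */
	unsigned long nr_pages;		/* file length in pages */
};

/* Only the filesystem knows its layout and what backs the file. */
static void *toyfs_direct_access(struct toy_inode *inode, unsigned long pgoff)
{
	if (pgoff >= inode->nr_pages)
		return NULL;	/* beyond end of file */
	return toy_pmem + (inode->first_page + pgoff) * TOY_PAGE_SIZE;
}

static const struct toy_fs_ops toyfs_ops = {
	.direct_access = toyfs_direct_access,
};

/* Generic fault path: asks the filesystem, never touches a bdev. */
static void *toy_dax_fault(struct toy_inode *inode, unsigned long pgoff)
{
	return inode->ops->direct_access(inode, pgoff);
}

/* Small demo: returns 0 when the lookups behave as described. */
static int toy_demo(void)
{
	struct toy_inode inode = {
		.ops = &toyfs_ops, .first_page = 2, .nr_pages = 3,
	};
	unsigned char *kaddr = toy_dax_fault(&inode, 1);

	/* File page 1 lives at region page 2 + 1 = 3. */
	assert(kaddr == toy_pmem + 3 * TOY_PAGE_SIZE);
	memset(kaddr, 0xab, TOY_PAGE_SIZE);	/* caller can now memcpy/flush */

	/* Page offset 3 is past the 3-page file, so the lookup fails. */
	return toy_dax_fault(&inode, 3) == NULL ? 0 : 1;
}
```

In this toy model the caller of toy_dax_fault() gets back the address
it wanted in the first place. Which device backs the file, and how many
devices there are, stays a private matter of the filesystem.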