From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754136AbcBCKqE (ORCPT ); Wed, 3 Feb 2016 05:46:04 -0500 Received: from mx2.suse.de ([195.135.220.15]:52200 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751029AbcBCKp7 (ORCPT ); Wed, 3 Feb 2016 05:45:59 -0500 Date: Wed, 3 Feb 2016 11:46:11 +0100 From: Jan Kara To: Ross Zwisler Cc: Dan Williams , Jan Kara , Dave Chinner , Matthew Wilcox , Christoph Hellwig , "linux-kernel@vger.kernel.org" , Alexander Viro , Andrew Morton , Jan Kara , linux-fsdevel , linux-nvdimm Subject: Re: [PATCH 2/2] dax: fix bdev NULL pointer dereferences Message-ID: <20160203104611.GF12574@quack.suse.cz> References: <20160128213858.GA29114@infradead.org> <20160129182815.GB5224@linux.intel.com> <20160130052833.GY2948@linux.intel.com> <20160201145147.GD13740@quack.suse.cz> <20160201214730.GR20456@dastard> <20160202111723.GD12574@quack.suse.cz> <20160202164642.GE12574@quack.suse.cz> <20160202173456.GB23963@linux.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20160202173456.GB23963@linux.intel.com> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue 02-02-16 10:34:56, Ross Zwisler wrote: > On Tue, Feb 02, 2016 at 09:10:24AM -0800, Dan Williams wrote: > > On Tue, Feb 2, 2016 at 8:46 AM, Jan Kara wrote: > > > On Tue 02-02-16 08:33:56, Dan Williams wrote: > > >> On Tue, Feb 2, 2016 at 3:17 AM, Jan Kara wrote: > > >> [..] > > >> > I see, thanks for explanation. So I'm OK with changing what is stored in > > >> > the radix tree to accommodate this use case but my reservation that we IHMO > > >> > have other more pressing things to fix remains... > > >> > > >> We don't need pfns in the radix to support XFS RT configurations. > > >> Just call get_blocks() again and use the sector, or am I missing > > >> something? > > > > > > You are correct. But if you decide to pay the cost of additional > > > get_block() call, you only need the dirty tag in the radix tree and nothing > > > else. So my understanding was that the whole point of games with radix tree > > > is avoiding this extra get_block() calls for fsync(). > > > > > > > DAX-fsync() is already a potentially expensive operation to cover data > > durability guarantees for DAX-unaware applications. A DAX-aware > > application is going to skip fsync, and the get_blocks() cost, to do > > cache management itself. > > > > Willy pointed out some other potential benefits, assuming a suitable > > replacement for the protections afforded by the block-device > > percpu_ref counter can be found. However, optimizing for the > > DAX-unaware-application case seems the wrong motivation to me. > > Oh, no, the primary issue with calling get_block() in the fsync path isn't > performance. It's that we don't have any idea what get_block() function to > call. > > The fault handler calls all come from the filesystem directly, so they can > easily give us an appropriate get_block() function pointer. But the > dax_writeback_mapping_range() calls come from the generic code in > mm/filemap.c, and don't know what get_block() to pass in. > > During one iteration I had the calls to dax_writeback_mapping_range() > happening in the filesystem fsync code so that it could pass in get_block(), > but Dave Chinner pointed out that this misses other paths in the filesystem > that need to have things flushed via a call to filemap_write_and_wait_range(). Let's clear this up a bit: The problem with using ->fsync() method is that it doesn't get called for sync(2). We could use ->sync_fs() to flush caches in case of sync(2) (that's what's happening for normal storage) but the problem with PMEM is that "flush all cached data" operation effectively means iterate through all modified pages and we didn't want to implement this for DAX fsync code. So we have decided to do cache flushing for DAX at a different point - mark inodes which may have writes cached as dirty and use writeback code for the cache flushing. But looking at it now, we have actually chosen a wrong place to do the flushing in the writeback path - note that sync(2) writes data via __writeback_single_inode() -> do_writepages() and thus doesn't even get to filemap_write_and_wait(). So revisiting the decision I see two options: 1) Move the DAX flushing code from filemap_write_and_wait() into ->writepages() fs callback. There the filesystem can provide all the information it needs including bdev, get_block callback, or whatever. 2) Back out even further and implement own tracking and iteration of inodes to write. So far I still think 2) is not worth the complexity (although it would bring DAX code closer to how things behave with standard storage) so I would go for 1). Honza -- Jan Kara SUSE Labs, CR