From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S934206AbcBDUaM (ORCPT ); Thu, 4 Feb 2016 15:30:12 -0500 Received: from mx2.suse.de ([195.135.220.15]:47367 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933854AbcBDU3r (ORCPT ); Thu, 4 Feb 2016 15:29:47 -0500 Date: Thu, 4 Feb 2016 21:29:57 +0100 From: Jan Kara To: Ross Zwisler Cc: Jan Kara , Dan Williams , Dave Chinner , Matthew Wilcox , Christoph Hellwig , "linux-kernel@vger.kernel.org" , Alexander Viro , Andrew Morton , Jan Kara , linux-fsdevel , linux-nvdimm Subject: Re: [PATCH 2/2] dax: fix bdev NULL pointer dereferences Message-ID: <20160204202957.GB6895@quack.suse.cz> References: <20160130052833.GY2948@linux.intel.com> <20160201145147.GD13740@quack.suse.cz> <20160201214730.GR20456@dastard> <20160202111723.GD12574@quack.suse.cz> <20160202164642.GE12574@quack.suse.cz> <20160202173456.GB23963@linux.intel.com> <20160203104611.GF12574@quack.suse.cz> <20160204195619.GA31860@linux.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20160204195619.GA31860@linux.intel.com> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu 04-02-16 12:56:19, Ross Zwisler wrote: > On Wed, Feb 03, 2016 at 11:46:11AM +0100, Jan Kara wrote: > > On Tue 02-02-16 10:34:56, Ross Zwisler wrote: > > > On Tue, Feb 02, 2016 at 09:10:24AM -0800, Dan Williams wrote: > > > > On Tue, Feb 2, 2016 at 8:46 AM, Jan Kara wrote: > > > > > On Tue 02-02-16 08:33:56, Dan Williams wrote: > > > > >> On Tue, Feb 2, 2016 at 3:17 AM, Jan Kara wrote: > > > > >> [..] > > > > >> > I see, thanks for explanation. So I'm OK with changing what is stored in > > > > >> > the radix tree to accommodate this use case but my reservation that we IHMO > > > > >> > have other more pressing things to fix remains... > > > > >> > > > > >> We don't need pfns in the radix to support XFS RT configurations. > > > > >> Just call get_blocks() again and use the sector, or am I missing > > > > >> something? > > > > > > > > > > You are correct. But if you decide to pay the cost of additional > > > > > get_block() call, you only need the dirty tag in the radix tree and nothing > > > > > else. So my understanding was that the whole point of games with radix tree > > > > > is avoiding this extra get_block() calls for fsync(). > > > > > > > > > > > > > DAX-fsync() is already a potentially expensive operation to cover data > > > > durability guarantees for DAX-unaware applications. A DAX-aware > > > > application is going to skip fsync, and the get_blocks() cost, to do > > > > cache management itself. > > > > > > > > Willy pointed out some other potential benefits, assuming a suitable > > > > replacement for the protections afforded by the block-device > > > > percpu_ref counter can be found. However, optimizing for the > > > > DAX-unaware-application case seems the wrong motivation to me. > > > > > > Oh, no, the primary issue with calling get_block() in the fsync path isn't > > > performance. It's that we don't have any idea what get_block() function to > > > call. > > > > > > The fault handler calls all come from the filesystem directly, so they can > > > easily give us an appropriate get_block() function pointer. But the > > > dax_writeback_mapping_range() calls come from the generic code in > > > mm/filemap.c, and don't know what get_block() to pass in. > > > > > > During one iteration I had the calls to dax_writeback_mapping_range() > > > happening in the filesystem fsync code so that it could pass in get_block(), > > > but Dave Chinner pointed out that this misses other paths in the filesystem > > > that need to have things flushed via a call to filemap_write_and_wait_range(). > > > > Let's clear this up a bit: The problem with using ->fsync() method is that > > it doesn't get called for sync(2). We could use ->sync_fs() to flush caches > > in case of sync(2) (that's what's happening for normal storage) but the > > problem with PMEM is that "flush all cached data" operation effectively > > means iterate through all modified pages and we didn't want to implement > > this for DAX fsync code. > > > > So we have decided to do cache flushing for DAX at a different point - mark > > inodes which may have writes cached as dirty and use writeback code for the > > cache flushing. But looking at it now, we have actually chosen a wrong > > place to do the flushing in the writeback path - note that sync(2) writes > > data via __writeback_single_inode() -> do_writepages() and thus doesn't > > even get to filemap_write_and_wait(). > > > > So revisiting the decision I see two options: > > > > 1) Move the DAX flushing code from filemap_write_and_wait() into > > ->writepages() fs callback. There the filesystem can provide all the > > information it needs including bdev, get_block callback, or whatever. > > > > 2) Back out even further and implement own tracking and iteration of inodes > > to write. > > > > So far I still think 2) is not worth the complexity (although it would > > bring DAX code closer to how things behave with standard storage) so I > > would go for 1). > > Jan, just to clarify, are you proposing this change for v4.5 in the remaining > RCs as an alternative to the get_bdev() patch? > > https://lkml.org/lkml/2016/2/2/941 Yes, because I don't think anything like ->get_bdev() is needed at all. Look: dax_do_io(), __dax_fault(), __dax_pmd_fault(), dax_zero_page_range() don't really need bdev - we have agreed that get_block() fills that in just fine. dax_clear_blocks() has IMO just the wrong signature - it should take bdev and not inode as an argument. Because combination inode + bdev sector doesn't really make much sense. dax_writeback_mapping_range() is the only remaining offender and it can easily take bdev as an argument when called from ->writepages(). > Or can we move forward with get_bdev(), and try and figure out this new way of > calling fsync/msync for v4.6? My main concern here is that changing how the > DAX sync code gets called will affect all three filesystems as well as MM, and > that it might be too much for RC inclusion... I think changes aren't very intrusive so we can feed them in during RC phase and frankly, you have to move to using ->writepages() anyway to make sync(2) work reliably. Honza -- Jan Kara SUSE Labs, CR