From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933332AbcBDXiN (ORCPT ); Thu, 4 Feb 2016 18:38:13 -0500 Received: from mga03.intel.com ([134.134.136.65]:19779 "EHLO mga03.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932783AbcBDXiJ (ORCPT ); Thu, 4 Feb 2016 18:38:09 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.22,397,1449561600"; d="scan'208";a="877339767" Date: Thu, 4 Feb 2016 16:38:01 -0700 From: Ross Zwisler To: Jan Kara Cc: Ross Zwisler , Dave Chinner , Dan Williams , Matthew Wilcox , Christoph Hellwig , "linux-kernel@vger.kernel.org" , Alexander Viro , Andrew Morton , linux-fsdevel , linux-nvdimm Subject: Re: [PATCH 2/2] dax: fix bdev NULL pointer dereferences Message-ID: <20160204233801.GA5365@linux.intel.com> Mail-Followup-To: Ross Zwisler , Jan Kara , Dave Chinner , Dan Williams , Matthew Wilcox , Christoph Hellwig , "linux-kernel@vger.kernel.org" , Alexander Viro , Andrew Morton , linux-fsdevel , linux-nvdimm References: <20160201145147.GD13740@quack.suse.cz> <20160201214730.GR20456@dastard> <20160202111723.GD12574@quack.suse.cz> <20160202164642.GE12574@quack.suse.cz> <20160202173456.GB23963@linux.intel.com> <20160203104611.GF12574@quack.suse.cz> <20160203201328.GC29275@linux.intel.com> <20160204091558.GB4956@quack.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20160204091558.GB4956@quack.suse.cz> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Feb 04, 2016 at 10:15:58AM +0100, Jan Kara wrote: <> > Yes, you are right. filemap_write_and_wait_range() actually doesn't > guarantee data durability. That function only means all dirty data has been > sent to storage and the storage has acknowledged them. This is noop for > PMEM. So we are perfectly fine ignoring calls to > filemap_write_and_wait_range(). What guarantees data durability are only > ->fsync() and ->sync_fs() calls. But some code could get upset by seeing > that filemap_write_and_wait_range() didn't actually get rid of dirty pages > (in some special cases like inode eviction or similar). That's why I'd > choose one of the two options for consistency: > > 1) Treat inode indexes to flush as close to dirty pages as we can - this > means inode is dirty with all the tracking associated with it, radix tree > entries have dirty tag, we get rid of these in ->writepages(). We are close > to this currently. I think we're actually pretty far from this, at least for v4.5. The issue is that I don't think we can safely clear radix tree dirty entries during the DAX code that is called via ->writepages(). To do this correctly we would need to also mark the PTE as clean so that when the userspace process next writes to their mmap mapping we would get a new fault to make the page writable. This would allow us to re-dirty the DAX entry in the radix tree. I implemented code to do this in v2 of my set, but ripped it out in v3: https://lkml.org/lkml/2015/11/13/759 (DAX fsync v2) https://lkml.org/lkml/2015/12/8/583 (DAX fsync v3) The race that compelled this removal is described here: https://lists.01.org/pipermail/linux-nvdimm/2016-January/004057.html (sorry for all the links) Anyway, for v4.5 I think whatever solution we come up with must be okay with an ever growing list of dirty radix tree entries, as we currently have. Are you aware of a reason why this won't work, or was the cleaning of the radix tree entries just a good goal to have? (And I agree it is a good goal, I just don't know how to do it safely.) > 2) Completely avoid the dirty tracking and writeback code and reimplement > everything in DAX code. > > Because some hybrid between these is IMHO bound to provoke weird (and very > hard to find) bugs. > > > > So revisiting the decision I see two options: > > > > > > 1) Move the DAX flushing code from filemap_write_and_wait() into > > > ->writepages() fs callback. There the filesystem can provide all the > > > information it needs including bdev, get_block callback, or whatever. > > > > This seems fine as long as we add it to ->fsync as well since ->writepages is > > never called in that path, and as long as we are okay with skipping DAX > > writebacks on hole punch, truncate, and block relocation. > > Look at ext4_sync_file() -> filemap_write_and_wait_range() -> > __filemap_fdatawrite_range() -> do_writepages(). Except those nrpages > 0 > checks which would need to be changed. Ah, cool, I missed this path. Thank you for setting me straight.