Date: Sat, 6 Feb 2016 23:43:49 -0700
From: Ross Zwisler
To: Dave Chinner
Cc: Ross Zwisler, Jan Kara, Dan Williams, Matthew Wilcox, Christoph Hellwig,
	"linux-kernel@vger.kernel.org", Alexander Viro, Andrew Morton, Jan Kara,
	linux-fsdevel, linux-nvdimm
Subject: Re: [PATCH 2/2] dax: fix bdev NULL pointer dereferences
Message-ID: <20160207064349.GA25849@linux.intel.com>
References: <20160202111723.GD12574@quack.suse.cz>
	<20160202164642.GE12574@quack.suse.cz>
	<20160202173456.GB23963@linux.intel.com>
	<20160203104611.GF12574@quack.suse.cz>
	<20160204195619.GA31860@linux.intel.com>
	<20160204202957.GB6895@quack.suse.cz>
	<20160205222500.GA15463@linux.intel.com>
	<20160206234053.GG31407@dastard>
In-Reply-To: <20160206234053.GG31407@dastard>

On Sun, Feb 07, 2016 at 10:40:53AM +1100, Dave Chinner wrote:
> On Fri, Feb 05, 2016 at 03:25:00PM -0700, Ross Zwisler wrote:
> > On Thu, Feb 04, 2016 at 09:29:57PM +0100, Jan Kara wrote:
> > > I think changes aren't very intrusive so we can feed them in during RC
> > > phase and frankly, you have to move to using ->writepages() anyway to
> > > make sync(2) work reliably.
> > 
> > I've been looking into this a bit more, and I don't think we actually
> > want to have DAX flushing in response to sync(2) or syncfs(2).  Here's
> > some text from the BUGS section of the sync(2) man page:
> > 
> > BUGS
> >        According to the standard specification (e.g., POSIX.1-2001),
> >        sync() schedules the writes, but may return before the actual
> >        writing is done.  However, since version 1.3.20 Linux does
> >        actually wait.  (This still does not guarantee data integrity:
> >        modern disks have large caches.)
> > 
> > Based on this I don't believe that it is a requirement that sync and
> > syncfs actually flush the data durably to media before they return -
> > they just need to make sure it has been sent to the device, which is
> > always true for all writes to PMEM via DAX, even without any calls to
> > dax_writeback_mapping_range().
> 
> That's an assumption we've already pointed out as being platform
> dependent, not to mention also being undesirable from a performance
> point of view (writes are 10x slower into pmem than into the page
> cache using the same physical memory!).
> 
> Further, the ordering constraints of modern filesystems mean that
> they guarantee durability of data written back when sync() is run.
> i.e. ext4, XFS, btrfs, etc all ensure that sync guarantees data
> integrity is maintained across all the data and metadata written
> back during sync().
> 
> e.g. for XFS we do file size update transactions at IO completion.
> sync() triggers writeback of data, then runs a transaction that
> modifies the file size so that the data is valid on disk. We
> absolutely need to ensure that this transaction is durable before
> sync() returns, otherwise we lose that data if a failure occurs
> immediately after sync() returns because the size update is not on
> disk.
> 
> Users are right to complain when data written before a sync() call
> is made is not accessible after a crash/reboot because we failed
> to make it durable. That's why ->sync_fs(wait) is called at the end
> of the sync() implementation - it enables filesystems to ensure all
> data and metadata written during the sync processing is on durable
> storage.
> 
> IOWs, we can't language-lawyer or weasel-word our way out of
> providing durability guarantees for sync().

To be clear, we are only talking about user data that was mmaped.  All
I/Os initiated by the filesystem for metadata, and all user-issued I/Os,
will be durable on media before the I/O is completed.  Nothing we do or
don't do for fsync(), msync(), sync() or syncfs() will break filesystem
metadata guarantees.

I think the question then is: "Do users currently rely on data written
to an mmap to be durable on media after a sync() or syncfs(), in spite
of the BUGS section quoted above?"

If the answer to this is "yes", then I agree that we need to enhance the
current DAX fsync code to cover the sync use case.  However, as stated
previously I don't think that it is as easy as just moving the call to
dax_writeback_mapping_range() to the sync() call path.  On my system
sync() is called every 5 seconds, and flushing every page of every DAX
mmap that has ever been dirtied every 5 seconds is unreasonable.

If we need to support the sync() use case with our DAX mmap flushing
code, I think the only way to do that is to clear out radix tree entries
as we flush them, so that the sync calls that happen every 5 seconds are
only flushing new writes.

I think we're actually pretty far from this, at least for v4.5.  The
issue is that I don't think we can safely clear radix tree dirty entries
in the DAX code that is called via ->writepages().  To do this correctly
we would also need to mark the PTE as clean, so that when the userspace
process next writes to its mmap mapping we would get a new fault to make
the page writable.  That new fault would allow us to re-dirty the DAX
entry in the radix tree.

I implemented code to do this in v2 of my set, but ripped it out in v3:

https://lkml.org/lkml/2015/11/13/759  (DAX fsync v2)
https://lkml.org/lkml/2015/12/8/583   (DAX fsync v3)

The race that compelled this removal is described here:

https://lists.01.org/pipermail/linux-nvdimm/2016-January/004057.html

If this is really where we need to be, that's fine; we'll have to figure
out some way to defeat the race and get this code into v4.6.
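
For illustration only, below is a rough user-space model of the
clean-and-write-protect idea described above - it is NOT kernel code and
not the DAX implementation.  Everything in it (toy_sync(), wp_fault(),
NPAGES, the dirty[] array) is invented for the sketch: dirty[] stands in
for the radix tree dirty tags, mprotect() stands in for cleaning and
write-protecting the PTE, and the SIGSEGV handler plays the role of the
write fault that re-dirties the tracking entry.  It ignores locking, the
race linked above, and error handling, and it leans on Linux tolerating
mprotect() in a signal handler; the real flush of the pmem range is just
a comment.

/*
 * Toy model: "clean the tracking entry and write-protect the page on
 * sync, re-dirty it on the next write fault".  Not kernel code.
 */
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define NPAGES 16

static char *region;
static long page_size;
static int dirty[NPAGES];	/* stands in for the radix tree dirty tags */

/* Plays the role of the write-protect page fault handler. */
static void wp_fault(int sig, siginfo_t *si, void *ctx)
{
	long idx = ((char *)si->si_addr - region) / page_size;

	(void)sig; (void)ctx;
	if (idx < 0 || idx >= NPAGES)
		_exit(1);	/* a real segfault, not our write protection */

	/* make the page writable again and re-dirty its tracking entry */
	mprotect(region + idx * page_size, page_size, PROT_READ | PROT_WRITE);
	dirty[idx] = 1;
}

/* "sync": flush only dirty pages, then clean and write-protect them. */
static void toy_sync(void)
{
	int flushed = 0;
	long i;

	for (i = 0; i < NPAGES; i++) {
		if (!dirty[i])
			continue;	/* cleaned by an earlier toy_sync() */
		/* the kernel would flush the pmem range here */
		dirty[i] = 0;
		mprotect(region + i * page_size, page_size, PROT_READ);
		flushed++;
	}
	printf("toy_sync: flushed %d page(s)\n", flushed);
}

int main(void)
{
	struct sigaction sa = { 0 };

	sa.sa_sigaction = wp_fault;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	sigaction(SIGSEGV, &sa, NULL);

	page_size = sysconf(_SC_PAGESIZE);

	/* start read-only so the first write to each page faults */
	region = mmap(NULL, NPAGES * page_size, PROT_READ,
		      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (region == MAP_FAILED)
		return 1;

	memset(region, 1, 3 * page_size);	/* faults dirty pages 0-2  */
	toy_sync();				/* flushes 3 pages         */
	toy_sync();				/* flushes 0 - nothing new */
	region[0] = 2;				/* faults, re-dirties 0    */
	toy_sync();				/* flushes 1 page          */
	return 0;
}

The only point of the model is the behaviour of repeated syncs: the
second toy_sync() flushes nothing because the entries were cleared by
the first one, and only a new write fault makes page 0 dirty again, so a
sync() every 5 seconds stays cheap when nothing new has been written.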