From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754261AbcBEWZM (ORCPT ); Fri, 5 Feb 2016 17:25:12 -0500 Received: from mga03.intel.com ([134.134.136.65]:10373 "EHLO mga03.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751431AbcBEWZJ (ORCPT ); Fri, 5 Feb 2016 17:25:09 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.22,402,1449561600"; d="scan'208";a="897342670" Date: Fri, 5 Feb 2016 15:25:00 -0700 From: Ross Zwisler To: Jan Kara Cc: Ross Zwisler , Dan Williams , Dave Chinner , Matthew Wilcox , Christoph Hellwig , "linux-kernel@vger.kernel.org" , Alexander Viro , Andrew Morton , Jan Kara , linux-fsdevel , linux-nvdimm Subject: Re: [PATCH 2/2] dax: fix bdev NULL pointer dereferences Message-ID: <20160205222500.GA15463@linux.intel.com> Mail-Followup-To: Ross Zwisler , Jan Kara , Dan Williams , Dave Chinner , Matthew Wilcox , Christoph Hellwig , "linux-kernel@vger.kernel.org" , Alexander Viro , Andrew Morton , Jan Kara , linux-fsdevel , linux-nvdimm References: <20160201145147.GD13740@quack.suse.cz> <20160201214730.GR20456@dastard> <20160202111723.GD12574@quack.suse.cz> <20160202164642.GE12574@quack.suse.cz> <20160202173456.GB23963@linux.intel.com> <20160203104611.GF12574@quack.suse.cz> <20160204195619.GA31860@linux.intel.com> <20160204202957.GB6895@quack.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20160204202957.GB6895@quack.suse.cz> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Feb 04, 2016 at 09:29:57PM +0100, Jan Kara wrote: > I think changes aren't very intrusive so we can feed them in during RC > phase and frankly, you have to move to using ->writepages() anyway to make > sync(2) work reliably. I've been looking into this a bit more, and I don't think we actually want to have DAX flushing in response to sync(2) or syncfs(2). Here's some text from the BUGS section of the sync(2) man page: BUGS According to the standard specification (e.g., POSIX.1-2001), sync() schedules the writes, but may return before the actual writing is done. However, since version 1.3.20 Linux does actually wait. (This still does not guarantee data integrity: modern disks have large caches.) Based on this I don't believe that it is a requirement that sync and syncfs actually flush the data durably to media before they return - they just need to make sure it has been sent to the device, which is always true for all writes PMEM via DAX, even without any calls to dax_writeback_mapping_range(). The fsync(2) man page, on the other hand, *does* require that a call to fsync() will result in the data being durable on the media before the system call returns. From fsync(2): fsync() transfers ("flushes") all modified in-core data of (i.e., modified buffer cache pages for) the file referred to by the file descriptor fd to the disk device (or other permanent storage device) so that all changed information can be retrieved even after the system crashed or was rebooted. This includes writing through or flushing a disk cache if present. I think we are doing the right thing if we only call dax_writeback_mapping_range() in response to fsync() and msync(), but skip it in response to sync() and syncfs(). Practically this also simplifies things a lot because we don't need to worry about cleaning the DAX inodes. With my naive implementation where I was calling dax_writeback_mapping_range() in ext4_writepages(), we were flushing all DAX pages that had ever been dirtied every 5 seconds in response to a periodic filesystem sync(), which I'm sure is untenable. Please let me know if anyone disagrees with any of the above. Based on this, I'm off to move the calls to dax_writeback_mapping_range() back into the filesystem specific fsync()/msync() paths so that we can provide an appropriate block device.