Date: Mon, 12 Sep 2016 11:40:35 +1000
From: Dave Chinner
To: Ross Zwisler, Dan Williams, Xiao Guangrong, Dave Hansen,
    Paolo Bonzini, Andrew Morton, Michal Hocko, Gleb Natapov,
    mtosatti@redhat.com, KVM list, "linux-kernel@vger.kernel.org",
    Stefan Hajnoczi, Yumei Huang, Linux MM,
    "linux-nvdimm@lists.01.org", linux-fsdevel
Subject: Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)
Message-ID: <20160912014035.GB30497@dastard>
References: <20160908225636.GB15167@linux.intel.com>
In-Reply-To: <20160908225636.GB15167@linux.intel.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Sep 08, 2016 at 04:56:36PM -0600, Ross Zwisler wrote:
> On Wed, Sep 07, 2016 at 09:32:36PM -0700, Dan Williams wrote:
> > My understanding is that it is looking for the VM_MIXEDMAP flag which
> > is already ambiguous for determining if DAX is enabled even if this
> > dynamic listing issue is fixed. XFS has arranged for DAX to be a
> > per-inode capability and has an XFS-specific inode flag.
> > We can make that a common inode flag, but it seems we should have a
> > way to interrogate the mapping itself in the case where the inode is
> > unknown or unavailable. I'm thinking extensions to mincore to have
> > flags for DAX and possibly whether the page is part of a pte, pmd, or
> > pud mapping. Just floating that idea before starting to look into the
> > implementation, comments or other ideas welcome...
>
> I think this goes back to our previous discussion about support for the PMEM
> programming model. Really I think what NVML needs isn't a way to tell if it
> is getting a DAX mapping, but whether it is getting a DAX mapping on a
> filesystem that fully supports the PMEM programming model. This of course is
> defined to be a filesystem where it can do all of its flushes from userspace
> safely and never call fsync/msync, and that allocations that happen in page
> faults will be synchronized to media before the page fault completes.
>
> IIUC this is what NVML needs - a way to decide "do I use fsync/msync for
> everything or can I rely fully on flushes from userspace?"

"need fsync/msync" is a dynamic state of an inode, not a static
property. i.e. users can do things that change an inode behind the
back of a mapping, even if they are not aware that this might
happen. As such, a filesystem can invalidate an existing mapping
at any time and userspace won't notice because it will simply fault
in a new mapping on the next access...

> For all existing implementations, I think the answer is "you need to use
> fsync/msync" because we don't yet have proper support for the PMEM programming
> model.

Yes, that is correct.

FWIW, I don't think it will ever be possible to support this ....
wonderful "PMEM programming model" from any current or future kernel
filesystem without a very specific set of restrictions on what can
be done to a file. e.g.

    1. the file has to be fully allocated and zeroed before
       use. Preallocation/zeroing via unwritten extents is not
       allowed.
       Sparse files are not allowed. Shared extents are not
       allowed.

    2. set the "PMEM_IMMUTABLE" inode flag - filesystem must
       check the file is fully allocated before allowing it to
       be set, and caller must have CAP_LINUX_IMMUTABLE.

    3. Inode metadata is now immutable, and file data can only
       be accessed and/or modified via mmap().

    4. All non-mmap methods of inode data modification
       will now fail with EPERM.

    5. all methods of inode metadata modification will now fail
       with EPERM, timestamp updates will be ignored.

    6. PMEM_IMMUTABLE flag can only be removed if the file is
       not currently mapped and caller has CAP_LINUX_IMMUTABLE.

A flag like this /should/ make it possible to avoid fsync/msync() on
a file for existing filesystems, but it also means that such files
have significant management issues (hence the need for
CAP_LINUX_IMMUTABLE to cover its use).

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com