Date: Tue, 13 Sep 2016 11:53:11 +1000
From: Nicholas Piggin
To: Dave Chinner
Cc: Christoph Hellwig, "Oliver O'Halloran", Yumei Huang, Michal Hocko,
 Xiao Guangrong, Andrew Morton, KVM list, Linux MM, Gleb Natapov,
 "linux-nvdimm@lists.01.org", mtosatti@redhat.com,
 "linux-kernel@vger.kernel.org", Dave Hansen, Stefan Hajnoczi,
 linux-fsdevel, Paolo Bonzini
Subject: Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)
Message-ID: <20160913115311.509101b0@roar.ozlabs.ibm.com>
In-Reply-To: <20160912213435.GD30497@dastard>
References: <20160908225636.GB15167@linux.intel.com>
 <20160912052703.GA1897@infradead.org>
 <20160912075128.GB21474@infradead.org>
 <20160912180507.533b3549@roar.ozlabs.ibm.com>
 <20160912213435.GD30497@dastard>

On Tue, 13 Sep 2016 07:34:36 +1000
Dave Chinner wrote:

> On Mon, Sep 12, 2016 at 06:05:07PM +1000, Nicholas Piggin wrote:
> > On Mon, 12 Sep 2016 00:51:28 -0700
> > Christoph Hellwig wrote:
> >
> > > On Mon, Sep 12, 2016 at 05:25:15PM +1000, Oliver O'Halloran wrote:
> > > > What are the problems here? Is this a matter of existing
> > > > filesystems being unable/unwilling to support this or is it
> > > > just fundamentally broken?
> > >
> > > It's a fundamentally broken model. See Dave's post that actually
> > > was sent slightly earlier than mine for the list of required
> > > items, which is fairly unrealistic. You could probably try to
> > > architect a file system for it, but I doubt it would gain much
> > > traction.
> >
> > It's not fundamentally broken, it just doesn't fit well with
> > existing filesystems.
> >
> > Dave's post of requirements is also wrong. A filesystem does not
> > have to guarantee all of that; it only has to guarantee it for a
> > given block after that block has a mapping and the page fault has
> > returned. Other operations can be supported by invalidating
> > mappings, etc.
>
> Sure, but filesystems are completely unaware of what is mapped at
> any given time, or what constraints that mapping might have. Trying
> to make filesystems aware of per-page mapping constraints seems like

I'm not sure what you mean. The filesystem can hand out mappings and
fault them in itself. It can invalidate them.

> a fairly significant layering violation based on a flawed
> assumption, i.e. that operations on other parts of the file do not
> affect the block that requires immutable metadata.
>
> e.g. an extent operation in some other area of the file can cause a
> tip-to-root extent tree split or merge, and that moves the metadata
> that points to the mapped block that we've told userspace "doesn't
> need fsync". We now need an fsync to ensure that the metadata is
> consistent on disk again, even though that block has not physically
> been moved.

You don't, because the filesystem can invalidate existing mappings
and do the right thing when they are faulted in again.
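For concreteness, the programming model being argued about looks
roughly like this from userspace. This is only a minimal sketch: the
/mnt/pmem/data path, the flush_range() helper, the 64-byte cache line
size, and the x86 clflush/sfence intrinsics are all illustrative
assumptions, and no filesystem today promises the "no msync/fsync"
part.

        /*
         * Rough sketch of the "no fsync after fault" model under
         * discussion: once the filesystem has handed out a mapping for
         * the block and made the metadata stable, the application
         * persists its writes with CPU cache flushes alone.  Treat
         * this as an illustration of the assumed model, not a
         * supported API.  Error handling is minimal.
         */
        #include <fcntl.h>
        #include <stdint.h>
        #include <string.h>
        #include <sys/mman.h>
        #include <unistd.h>
        #include <xmmintrin.h>   /* _mm_sfence (x86) */
        #include <emmintrin.h>   /* _mm_clflush (x86) */

        static void flush_range(const void *addr, size_t len)
        {
                const size_t cl = 64;                   /* assumed cache line size */
                uintptr_t p = (uintptr_t)addr & ~(cl - 1);

                for (; p < (uintptr_t)addr + len; p += cl)
                        _mm_clflush((const void *)p);   /* or clwb where available */
                _mm_sfence();                           /* order the flushes */
        }

        int main(void)
        {
                int fd = open("/mnt/pmem/data", O_RDWR);   /* hypothetical DAX file */
                if (fd < 0)
                        return 1;

                char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                               MAP_SHARED, fd, 0);
                if (p == MAP_FAILED)
                        return 1;

                /* Fault in block 0: the FS allocates, maps, syncs metadata. */
                memcpy(p, "hello", 5);

                /*
                 * Under the proposed model this flush alone would make the
                 * data crash-safe -- no msync()/fsync() -- because the
                 * metadata for this block was made stable when the fault
                 * completed.
                 */
                flush_range(p, 5);

                munmap(p, 4096);
                close(fd);
                return 0;
        }

The whole argument is about whether the filesystem can stand behind
that last comment once the fault has completed.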
Invalidating mappings is the big^Wmedium hammer approach that can cope
with most problems. But let me understand your example in the absence
of that.

- Application mmaps a file, faults in block 0
- FS allocates the block, creates mappings, syncs metadata, sets the
  "no fsync" flag for that block, and completes the fault.
- Application writes some data to block 0, completes userspace flushes

* At this point, a crash must leave the file with the above data (or
  newer).

- Application starts writing more stuff into block 0
- Concurrently, a fault comes in for block 1
- FS starts to allocate and splits trees, including the mappings for
  block 0

* Crash

Is that right? How does your filesystem lose data before the sync
point?

> IOWs, the immutable data block updates are now not
> ordered correctly w.r.t. other updates done to the file, especially
> when we consider crash recovery....
>
> All this will expose is an unfixable problem with ordering of stable
> data + metadata operations and their synchronisation. As such, it
> seems like nothing but a major cluster-fuck to try to do mapping
> specific, per-block immutable metadata - it adds major complexity
> and even more intractable problems.
>
> Yes, we /could/ try to solve this but, quite frankly, it's far
> easier to change the broken PMEM programming model assumptions than
> it is to implement what you are suggesting. Or to do what Christoph
> suggested and just use a wrapper around something like device
> mapper to hand out chunks of unchanging, static pmem to
> applications...

If there is any huge complexity or unsolved problem, it is in XFS.
The conceptual problem is simple.
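By way of contrast, the model being defended above keeps the ordinary
msync()/fsync() step, which covers whatever metadata the filesystem
dirtied underneath the mapping. A minimal sketch, reusing the same
hypothetical file as the earlier example, error handling trimmed:

        #include <fcntl.h>
        #include <string.h>
        #include <sys/mman.h>
        #include <unistd.h>

        int main(void)
        {
                int fd = open("/mnt/pmem/data", O_RDWR);   /* hypothetical DAX file */
                char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                               MAP_SHARED, fd, 0);

                memcpy(p, "hello", 5);

                /*
                 * msync(MS_SYNC) (or fsync(fd)) asks the filesystem to
                 * make both the data and any metadata it depends on
                 * durable, regardless of what block allocation or
                 * extent-tree reshuffling happened underneath the
                 * mapping.
                 */
                msync(p, 4096, MS_SYNC);

                munmap(p, 4096);
                close(fd);
                return 0;
        }

The disagreement is only about whether applications can ever safely
drop that msync()/fsync() call for blocks the filesystem has already
made stable.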