From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1752690AbcIIJAz (ORCPT ); Fri, 9 Sep 2016 05:00:55 -0400
Received: from mga06.intel.com ([134.134.136.31]:64007 "EHLO mga06.intel.com"
        rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
        id S1750839AbcIIJAu (ORCPT ); Fri, 9 Sep 2016 05:00:50 -0400
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="5.30,304,1470726000"; d="scan'208";a="758751721"
Subject: Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)
To: Dan Williams, Ross Zwisler, Dave Hansen, Paolo Bonzini, Andrew Morton,
        Michal Hocko, Gleb Natapov, mtosatti@redhat.com, KVM list,
        "linux-kernel@vger.kernel.org", Stefan Hajnoczi, Yumei Huang, Linux MM,
        "linux-nvdimm@lists.01.org", linux-fsdevel
References: <20160908225636.GB15167@linux.intel.com>
From: Xiao Guangrong
Message-ID: <5d5ef209-e005-12c6-9b34-1fdd21e1e6e2@linux.intel.com>
Date: Fri, 9 Sep 2016 16:55:08 +0800
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.2.0
MIME-Version: 1.0
In-Reply-To:
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On 09/09/2016 07:04 AM, Dan Williams wrote:
> On Thu, Sep 8, 2016 at 3:56 PM, Ross Zwisler wrote:
>> On Wed, Sep 07, 2016 at 09:32:36PM -0700, Dan Williams wrote:
>>> [ adding linux-fsdevel and linux-nvdimm ]
>>>
>>> On Wed, Sep 7, 2016 at 8:36 PM, Xiao Guangrong wrote:
>>> [..]
>>>> However, it is not easy to handle the case that the new VMA overlays with
>>>> the old VMA already got by userspace. I think we have some choices:
>>>> 1: One way is completely skipping the new VMA region as current kernel code
>>>> does but i do not think this is good as the later VMAs will be dropped.
>>>>
>>>> 2: show the un-overlayed portion of new VMA.
>>>> In your case, we just show the region
>>>> (0x2000 -> 0x3000), however, it can not work well if the VMA is a new
>>>> created region with different attributions.
>>>>
>>>> 3: completely show the new VMA as this patch does.
>>>>
>>>> Which one do you prefer?
>>>>
>>>
>>> I don't have a preference, but perhaps this breakage and uncertainty
>>> is a good opportunity to propose a more reliable interface for NVML to
>>> get the information it needs?
>>>
>>> My understanding is that it is looking for the VM_MIXEDMAP flag which
>>> is already ambiguous for determining if DAX is enabled even if this
>>> dynamic listing issue is fixed. XFS has arranged for DAX to be a
>>> per-inode capability and has an XFS-specific inode flag. We can make
>>> that a common inode flag, but it seems we should have a way to
>>> interrogate the mapping itself in the case where the inode is unknown
>>> or unavailable. I'm thinking extensions to mincore to have flags for
>>> DAX and possibly whether the page is part of a pte, pmd, or pud
>>> mapping. Just floating that idea before starting to look into the
>>> implementation, comments or other ideas welcome...
>>
>> I think this goes back to our previous discussion about support for the PMEM
>> programming model. Really I think what NVML needs isn't a way to tell if it
>> is getting a DAX mapping, but whether it is getting a DAX mapping on a
>> filesystem that fully supports the PMEM programming model. This of course is
>> defined to be a filesystem where it can do all of its flushes from userspace
>> safely and never call fsync/msync, and that allocations that happen in page
>> faults will be synchronized to media before the page fault completes.
>>
>> IIUC this is what NVML needs - a way to decide "do I use fsync/msync for
>> everything or can I rely fully on flushes from userspace?"
>>
>> For all existing implementations, I think the answer is "you need to use
>> fsync/msync" because we don't yet have proper support for the PMEM programming
>> model.
>>
>> My best idea of how to support this was a per-inode flag similar to the one
>> supported by XFS that says "you have a PMEM capable DAX mapping", which NVML
>> would then interpret to mean "you can do flushes from userspace and be fully
>> safe". I think we really want this interface to be common over XFS and ext4.
>>
>> If we can figure out a better way of doing this interface, say via mincore,
>> that's fine, but I don't think we can detangle this from the PMEM API
>> discussion.
>
> Whether a persistent memory mapping requires an msync/fsync is a
> filesystem specific question. This mincore proposal is separate from
> that. Consider device-DAX for volatile memory or mincore() called on
> an anonymous memory range. In those cases persistence and filesystem
> metadata are not in the picture, but it would still be useful for
> userspace to know "is there page cache backing this mapping?" or "what
> is the TLB geometry of this mapping?".

I have a question about msync/fsync that goes beyond the topic of this
thread :)

Whether msync/fsync can make data persistent depends on the ADR feature of
the memory controller. If it exists, everything works well; otherwise we
need another interface, which is why the 'Flush hint table' in ACPI comes in.

The 'Flush hint table' is particularly useful for nvdimm virtualization,
where normal memory is used to emulate an nvdimm with persistent-data
semantics (the data is flushed to persistent storage, e.g. a disk).

Does the current PMEM programming model fully support the 'Flush hint table'?
Is userspace allowed to use these addresses?

Thanks!
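
For reference, a minimal sketch of the smaps-based check discussed earlier in
the thread. It assumes the "mm" entry on a VmFlags line marks VM_MIXEDMAP, as
the /proc/PID/smaps documentation describes. The function name
mixedmap_regions is hypothetical and not part of NVML. As noted above,
VM_MIXEDMAP alone does not prove that DAX is enabled, so the result is only a
hint.

```python
import re

# Header line of an smaps entry: "start-end perms offset dev inode [path]".
_VMA_HEADER = re.compile(r"^([0-9a-f]+)-([0-9a-f]+)\s+\S{4}\s")


def mixedmap_regions(smaps_text):
    """Return (start, end) pairs for VMAs whose VmFlags include 'mm'.

    Assumption: 'mm' is the smaps mnemonic for VM_MIXEDMAP. This is a hint
    only; it does not establish that the mapping is DAX.
    """
    regions = []
    current = None  # address range of the VMA entry being scanned
    for line in smaps_text.splitlines():
        header = _VMA_HEADER.match(line)
        if header:
            current = (int(header.group(1), 16), int(header.group(2), 16))
        elif line.startswith("VmFlags:") and current is not None:
            # Flags are whitespace-separated two-letter mnemonics.
            if "mm" in line.split()[1:]:
                regions.append(current)
    return regions
```

A caller would pass in the contents of /proc/self/smaps. Because the listing
can change between reads, which is the problem the original patch addresses,
any result should be treated as a snapshot.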