Subject: Re: [PATCH v3 14/15] dax: dirty extent notification
From: Dan Williams
Date: Mon, 2 Nov 2015 20:56:24 -0800
To: Dave Chinner
Cc: Jens Axboe, Jan Kara, linux-nvdimm@lists.01.org,
    linux-kernel@vger.kernel.org, Ross Zwisler, Christoph Hellwig

On Mon, Nov 2, 2015 at 5:16 PM, Dave Chinner wrote:
> On Sun, Nov 01, 2015 at 11:30:58PM -0500, Dan Williams wrote:
>> DAX-enabled block device drivers can use hints from fs/dax.c to
>> optimize their internal tracking of potentially dirty cpu cache
>> lines. If a DAX mapping is being used for synchronous operations,
>> dax_do_io(), a dax-enabled block driver knows that fs/dax.c will
>> handle immediate flushing. For asynchronous mappings, i.e. those
>> returned to userspace via mmap, the driver can track active extents
>> of the media for flushing.
>
> So, essentially, you are marking the calls into the mapping calls
> with BLKDAX_F_DIRTY when the mapping is requested for a write page
> fault? Hence allowing the block device to track "dirty pages"
> exactly?

Not pages, but larger extents (1 extent = 1/NUM_DAX_EXTENTS of the
total storage capacity), because tracking dirty mappings should be a
temporary compatibility hack and not a first-class citizen (there is
a rough sketch of the driver side further down).

> But, really, if we're going to use Ross's mapping tree patches that
> use exceptional entries to track dirty pfns, why do we need this
> special interface from DAX to the block device? Ross's changes will
> track mmap'd ranges that are dirtied at the filesystem inode level,
> and the fsync/writeback will trigger CPU cache writeback of those
> dirty ranges. This will work for block devices that are mapped by
> DAX, too, because they have an inode+mapping tree, too.
>
> And if we are going to use Ross's infrastructure (which, when we
> work the kinks out of, I think we will), we really should change
> dax_do_io() to track pfns that are dirtied this way, too. That will
> allow us to get rid of all the cache flushing from the DAX layer
> (they'll get pushed into fsync/writeback) and so we only take the
> CPU cache flushing penalties when synchronous operations are
> requested by userspace...

No, we definitely can't do that. I think your mental model of the
cache flushing is similar to the disk model, where a small buffer is
flushed after a large streaming write. Both Ross's patches and my
approach suffer from the same horror: the cache flushing is currently
O(N) in the amount of dirty data, so we don't want to make it
responsible for any more data ranges than strictly necessary.

>> We can later extend the DAX paths to indicate when an async mapping
>> is "closed", allowing the active extents to be marked clean.
>
> Yes, that's a basic feature of Ross's patches. Hence I think this
> special case DAX<->bdev interface is the wrong direction to be
> taking.

So here's my problem with the "track dirty mappings" in the core
mm/vfs approach: it's harder to unwind and delete when it turns out
no application actually needs it, or the platform gives us an O(1)
flush method that is independent of dirty pte tracking.
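To be clear about how little the driver-side tracking is, it amounts
to something like the sketch below. This is an illustration only, not
the code in this series; the struct and helper names are invented for
this mail, and only BLKDAX_F_DIRTY comes from the patches:

	/*
	 * Illustrative sketch, not the code from this series: one
	 * dirty bit per 1/NUM_DAX_EXTENTS of the device capacity.
	 */
	#include <linux/types.h>
	#include <linux/bitops.h>
	#include <linux/kernel.h>

	#define NUM_DAX_EXTENTS 256

	struct dax_extent_tracker {
		sector_t total_sectors;
		DECLARE_BITMAP(dirty, NUM_DAX_EXTENTS);
	};

	static unsigned int dax_extent_index(struct dax_extent_tracker *t,
			sector_t sector)
	{
		sector_t per_extent = t->total_sectors / NUM_DAX_EXTENTS;

		return min_t(unsigned int, sector / per_extent,
				NUM_DAX_EXTENTS - 1);
	}

	/*
	 * fs/dax.c asked for a mapping to service a write fault:
	 * remember which extent may now have dirty cachelines.
	 */
	static void dax_note_dirty(struct dax_extent_tracker *t,
			sector_t sector, unsigned long dax_flags)
	{
		if (dax_flags & BLKDAX_F_DIRTY)
			set_bit(dax_extent_index(t, sector), t->dirty);
	}

Writeback in the driver then only has to walk the set bits, flush the
cachelines behind those extents, and clear the bits.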
We have the NVML [1] library as the recommended method for
applications to interact with persistent memory, and it does not use
fsync/msync for its synchronization primitives; it manages the cache
directly (see the sketch at the end of this mail). The *only* user of
dirty DAX mapping tracking is unmodified legacy applications that do
mmap I/O and call fsync/msync.

DAX, in my opinion, is not a transparent accelerator of all existing
apps; it's a targeted mechanism for applications ready to take
advantage of byte-addressable persistent memory. This is why I'm a
big supporter of your per-inode DAX control proposal. The fact that
fsync is painful for large amounts of dirty data is a feature. It
detects inodes that should have had DAX disabled in the first place.
The only advantage of the radix tree approach is that the second
fsync after the big hit may be faster, but that still can't beat
either targeted disabling of DAX or updating the app to use NVML.

So, again, I remain to be convinced that we need to carry complexity
in the core kernel when we have the page cache to cover those cases.
The driver solution is a minimal extension of the data that
bdev_direct_access() is already sending down to the driver, and it
covers the gap without mm/fs entanglements while we figure out a
longer-term solution.

[1]: https://github.com/pmem/nvml
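For completeness, this is roughly what the NVML/libpmem model looks
like from the application side. I'm writing this from memory, so the
exact function names and signatures may not match the library release
you have, and /mnt/dax/file is just an example path:

	/*
	 * Sketch of the libpmem model: the application flushes its
	 * own stores instead of calling fsync/msync.  API details
	 * may differ between NVML releases.
	 */
	#include <libpmem.h>
	#include <string.h>
	#include <stdio.h>

	int main(void)
	{
		size_t mapped_len;
		int is_pmem;
		char *addr = pmem_map_file("/mnt/dax/file", 4096,
					   PMEM_FILE_CREATE, 0644,
					   &mapped_len, &is_pmem);
		if (!addr) {
			perror("pmem_map_file");
			return 1;
		}

		strcpy(addr, "hello, persistent world");

		if (is_pmem)
			pmem_persist(addr, mapped_len);	/* cache flush + fence */
		else
			pmem_msync(addr, mapped_len);	/* fall back to msync */

		pmem_unmap(addr, mapped_len);
		return 0;
	}

The application decides when its stores are persistent; the kernel
never needs to know which cachelines are dirty.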