From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Date: Wed, 4 Nov 2015 07:51:31 +1100
From: Dave Chinner
Subject: Re: [PATCH v3 14/15] dax: dirty extent notification
Message-ID: <20151103205131.GH19199@dastard>
References: <20151102042941.6610.27784.stgit@dwillia2-desk3.amr.corp.intel.com>
 <20151102043058.6610.15559.stgit@dwillia2-desk3.amr.corp.intel.com>
 <20151103011653.GO10656@dastard>
 <20151103054039.GQ10656@dastard>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
Sender: linux-kernel-owner@vger.kernel.org
To: Dan Williams
Cc: Jens Axboe, Jan Kara, "linux-nvdimm@lists.01.org",
 "linux-kernel@vger.kernel.org", Ross Zwisler, Christoph Hellwig
List-ID:

On Mon, Nov 02, 2015 at 11:20:49PM -0800, Dan Williams wrote:
> On Mon, Nov 2, 2015 at 9:40 PM, Dave Chinner wrote:
> > On Mon, Nov 02, 2015 at 08:56:24PM -0800, Dan Williams wrote:
> >> No, we definitely can't do that. I think your mental model of the
> >> cache flushing is similar to the disk model where a small buffer is
> >> flushed after a large streaming write. Both Ross' patches and my
> >> approach suffer from the same horror that the cache flushing is O(N)
> >> currently, so we don't want to make it responsible for more data
> >> ranges than is strictly necessary.
> >
> > I didn't see anything that was O(N) in Ross's patches. What part of
> > the fsync algorithm that Ross proposed are you referring to here?
>
> We have to issue clflush per touched virtual address rather than a
> constant number of physical ways, or a flush-all instruction.
.....
> > So don't tell me that tracking dirty pages in the radix tree is too
> > slow for DAX and that DAX should not be used for POSIX IO based
> > applications - it should be as fast as buffered IO, if not faster,
> > and if it isn't then we've screwed up real bad. And right now, we're
> > screwing up real bad.
>
> Again, it's not the dirty tracking in the radix tree I'm worried
> about, it's looping through all the virtual addresses within those
> pages...

So, let me summarise what I think you've just said. You are:

1. fine with looping through the virtual addresses doing cache flushes
synchronously when doing IO, despite it having significant latency and
performance costs.

2. happy to hack a method into DAX to bypass the filesystems by pushing
information to the block device for it to track regions that need cache
flushes, then add infrastructure to the block device to track those
dirty regions and then walk those addresses and issue cache flushes
when the filesystem issues a REQ_FLUSH IO, regardless of whether the
filesystem actually needs those cachelines flushed for that specific
IO?

3. not happy to use the generic mm/vfs-level infrastructure architected
specifically to provide the exact asynchronous cache flushing/writeback
semantics we require because it will cause too many cache flushes, even
though the number of cache flushes will be, at worst, the same as in 2).

1) will work, but as we can see it is *slow*. 3) is what Ross is
implementing - it's a tried and tested architecture that all mm/fs
developers understand, and his explanation of why it will work for pmem
is pretty solid and completely platform/hardware architecture
independent.
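To make the cost of 1) concrete, here's a rough sketch of the
per-virtual-address flushing being described - purely illustrative, not
the kernel implementation; flush_dirty_range and CACHELINE_SIZE are
names invented for the example. One clflush per cache line of dirty
data means the cost is O(N) in the amount of data written:

#include <stdint.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define CACHELINE_SIZE	64	/* assumption: typical x86 cache line size */

/* flush every cache line in [addr, addr + len) - one clflush per line */
static void flush_dirty_range(void *addr, size_t len)
{
	uintptr_t p = (uintptr_t)addr & ~(uintptr_t)(CACHELINE_SIZE - 1);
	uintptr_t end = (uintptr_t)addr + len;

	for (; p < end; p += CACHELINE_SIZE)
		asm volatile("clflush %0" : "+m" (*(volatile char *)p));
	asm volatile("mfence" ::: "memory");	/* order the flushes */
}

int main(void)
{
	size_t len = 2 * 1024 * 1024;	/* pretend this is a dirty mapping */
	char *buf = malloc(len);

	memset(buf, 0xab, len);		/* dirty the cachelines */
	flush_dirty_range(buf, len);	/* 32768 clflushes for 2MiB */
	free(buf);
	return 0;
}

That loop is the latency we are currently eating synchronously in the
IO path.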
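For contrast, the shape of 3) looks more like the sketch below - again
just an illustration, with a flat dirty-page array standing in for the
page cache radix tree and its dirty tags, and all the names
(dax_dirty_log, dax_mkwrite, dax_fsync) invented for the example. The
write fault only records the dirty page; fsync later walks just the
recorded pages, flushes their cachelines (reusing flush_dirty_range
from the previous sketch) and marks them clean again - asynchronous
writeback, with at worst the same number of flushes as the block
device scheme in 2):

#include <stdbool.h>

#define PAGE_SIZE	4096
#define MAX_PAGES	1024		/* toy limit for the example */

struct dax_dirty_log {
	void *base;			/* start of the DAX mapping */
	bool dirty[MAX_PAGES];		/* stand-in for radix tree dirty tags */
};

/* fault path (think page_mkwrite): just tag the page dirty */
static void dax_mkwrite(struct dax_dirty_log *log, size_t pgoff)
{
	log->dirty[pgoff] = true;
}

/* fsync path: flush only the pages that were tagged dirty */
static void dax_fsync(struct dax_dirty_log *log)
{
	for (size_t pgoff = 0; pgoff < MAX_PAGES; pgoff++) {
		if (!log->dirty[pgoff])
			continue;
		flush_dirty_range((char *)log->base + pgoff * PAGE_SIZE,
				  PAGE_SIZE);
		log->dirty[pgoff] = false;	/* clean until rewritten */
	}
}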
Which leaves this question: How does 2) save us anything in terms of
avoiding iterating virtual addresses and issuing cache flushes over 3)?
And is it sufficient to justify hacking a bypass into DAX and the
additional driver-level complexity of having to add dirty region
tracking, flushing and cleaning to REQ_FLUSH operations?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com