From: Dan Williams <dan.j.williams@intel.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@fb.com>, Jan Kara <jack@suse.cz>,
	"linux-nvdimm@lists.01.org" <linux-nvdimm@lists.01.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Ross Zwisler <ross.zwisler@linux.intel.com>,
	Christoph Hellwig <hch@lst.de>
Subject: Re: [PATCH v3 14/15] dax: dirty extent notification
Date: Tue, 3 Nov 2015 13:19:08 -0800
Message-ID: <CAPcyv4jyO8eVOcyPEi8Ga382UQw4DmG4gzcyqkcHS2JsFqfQ0g@mail.gmail.com>
In-Reply-To: <20151103205131.GH19199@dastard>

On Tue, Nov 3, 2015 at 12:51 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Mon, Nov 02, 2015 at 11:20:49PM -0800, Dan Williams wrote:
>> On Mon, Nov 2, 2015 at 9:40 PM, Dave Chinner <david@fromorbit.com> wrote:
>> > On Mon, Nov 02, 2015 at 08:56:24PM -0800, Dan Williams wrote:
>> >> No, we definitely can't do that.   I think your mental model of the
>> >> cache flushing is similar to the disk model where a small buffer is
>> >> flushed after a large streaming write.  Both Ross' patches and my
>> >> approach suffer from the same horror that the cache flushing is O(N)
>> >> currently, so we don't want to make it responsible for more data
>> >> ranges than is strictly necessary.
>> >
>> > I didn't see anything that was O(N) in Ross's patches. What part of
>> > the fsync algorithm that Ross proposed are you referring to here?
>>
>> We have to issue clflush per touched virtual address rather than a
>> constant number of physical ways, or a flush-all instruction.
> .....
>> > So don't tell me that tracking dirty pages in the radix tree is too
>> > slow for DAX and that DAX should not be used for POSIX IO based
>> > applications - it should be as fast as buffered IO, if not faster,
>> > and if it isn't then we've screwed up real bad. And right now, we're
>> > screwing up real bad.
>>
>> Again, it's not the dirty tracking in the radix tree I'm worried
>> about, it's looping through all the virtual addresses within those
>> pages.
>
> So, let me summarise what I think you've just said. You are
>
> 1. fine with looping through the virtual addresses doing cache flushes
>    synchronously when doing IO despite it having significant
>    latency and performance costs.

No, like I said in the blkdev_issue_zeroout thread, we need to replace
looping flushes with non-temporal stores and a delayed wmb_pmem()
wherever possible.
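
To make that concrete, here is a minimal sketch of the pattern I mean,
assuming the existing memcpy_to_pmem()/wmb_pmem() helpers; the
"struct write_range" and the function around them are made up purely
for illustration (__pmem annotations elided):

	/*
	 * Write each dirty range with non-temporal stores
	 * (memcpy_to_pmem() uses movnt on x86) and then issue a
	 * single wmb_pmem() for the whole batch, instead of
	 * clflush'ing every cacheline of every range afterwards.
	 */
	struct write_range {
		size_t offset;
		size_t len;
	};

	static void pmem_write_ranges(void *pmem, const void *src,
			const struct write_range *r, int nr)
	{
		int i;

		for (i = 0; i < nr; i++)
			memcpy_to_pmem(pmem + r[i].offset,
					src + r[i].offset, r[i].len);
		wmb_pmem();	/* one sync point for the whole batch */
	}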

> 2. Happy to hack a method into DAX to bypass the filesystems by
>    pushing information to the block device for it to track regions that
>    need cache flushes, then add infrastructure to the block device to
>    track those dirty regions and then walk those addresses and issue
>    cache flushes when the filesystem issues a REQ_FLUSH IO regardless
>    of whether the filesystem actually needs those cachelines flushed
>    for that specific IO?

I'm happier with a temporary driver-level hack than a temporary core
kernel change.  This requirement to flush by virtual address is
something that, in my opinion, must be addressed by the platform with
a reliable global flush or by walking a small, constant number of
physical cache ways.  I think we're getting ahead of ourselves by
jumping to solve this in the core kernel while the question of how to
do efficient large flushes is still open.
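
For reference, the O(N) loop I am objecting to looks roughly like this
(clflush_cache_range() shown for illustration; the wrapper function is
a made-up example, not code from this series):

	/*
	 * Flushing by virtual address is O(bytes written): one
	 * clflush per dirty cacheline.  A platform-provided global
	 * flush, or walking a small constant number of physical
	 * cache ways, costs the same no matter how much data was
	 * touched.
	 */
	static void flush_dirty_pages(struct page **pages, int nr)
	{
		int i;

		for (i = 0; i < nr; i++)
			clflush_cache_range(page_address(pages[i]),
					PAGE_SIZE);
	}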

> 3. Not happy to use the generic mm/vfs level infrastructure
>    architected specifically to provide the exact asynchronous
>    cache flushing/writeback semantics we require because it will
>    cause too many cache flushes, even though the number of cache
>    flushes will be, at worst, the same as in 2).

Correct, because if/when a platform solution arrives, the need to track
dirty pfns evaporates.
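
Put differently, with a trustworthy platform-wide flush the data-flush
step of a DAX fsync collapses to something like the sketch below.
flush_pmem_global() is a made-up name for whatever primitive the
platform would eventually provide, and the per-pfn radix walk simply
goes away:

	/*
	 * Hypothetical: the DAX data-flush step of fsync if the
	 * platform guarantees a global pmem flush.  No per-pfn
	 * bookkeeping is needed to find the dirty cachelines.
	 */
	static int dax_flush_for_fsync(struct address_space *mapping,
			loff_t start, loff_t end)
	{
		flush_pmem_global();	/* made-up platform primitive */
		return 0;
	}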

> 1) will work, but as we can see it is *slow*. 3) is what Ross is
> implementing - it's a tried and tested architecture that all mm/fs
> developers understand, and his explanation of why it will work for
> pmem is pretty solid and completely platform/hardware architecture
> independent.
>
> Which leaves this question: How does 2) save us anything in terms of
> avoiding iterating virtual addresses and issuing cache flushes
> over 3)? And is it sufficient to justify hacking a bypass into DAX
> and the additional driver level complexity of having to add dirty
> region tracking, flushing and cleaning to REQ_FLUSH operations?
>

Given that what we are talking about amounts to a hardware workaround,
I think that kind of logic belongs in a driver.  If the cache flushing
gets fixed and we stop needing to track individual cachelines, the
flush implementation will look and feel much more like that of
existing storage drivers.
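
To sketch what that driver-level workaround could look like (names
invented, locking omitted, and not the code in this series): the pmem
driver would record extents handed out for writable DAX mappings and
drain them when it sees REQ_FLUSH:

	struct pmem_dirty_extent {
		struct list_head list;
		void *vaddr;
		size_t len;
	};

	/* assumes a dirty_list field added to struct pmem_device */
	static void pmem_flush_dirty(struct pmem_device *pmem)
	{
		struct pmem_dirty_extent *ext, *next;

		list_for_each_entry_safe(ext, next, &pmem->dirty_list, list) {
			clflush_cache_range(ext->vaddr, ext->len);
			list_del(&ext->list);
			kfree(ext);
		}
		wmb_pmem();
	}

	static void pmem_make_request(struct request_queue *q, struct bio *bio)
	{
		struct pmem_device *pmem = bio->bi_bdev->bd_disk->private_data;

		if (bio->bi_rw & REQ_FLUSH)
			pmem_flush_dirty(pmem);

		/* ... existing read/write bio handling ... */
	}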
