From: Dan Williams <dan.j.williams@intel.com>
To: Mikulas Patocka <mpatocka@redhat.com>
Cc: device-mapper development <dm-devel@redhat.com>,
	Mike Snitzer <snitzer@redhat.com>,
	linux-nvdimm <linux-nvdimm@lists.01.org>
Subject: Re: [patch 4/4] dm-writecache: use new API for flushing
Date: Wed, 30 May 2018 15:39:18 -0700	[thread overview]
Message-ID: <CAPcyv4gD-ZaXyV1SFAndobUYJOGLkx9rOyb5zNbzJXB_kVCzoA@mail.gmail.com> (raw)
In-Reply-To: <CAPcyv4j3pDQ3YMK5Eije0dgGw+1CQYazPJ-zvqfvGF-9O_gVJw@mail.gmail.com>

On Wed, May 30, 2018 at 8:58 AM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Wed, May 30, 2018 at 6:07 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>>
>>
>> On Mon, 28 May 2018, Dan Williams wrote:
>>
>>> On Mon, May 28, 2018 at 6:32 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>>> >
>>> >
>>> > On Sat, 26 May 2018, Dan Williams wrote:
>>> >
>>> >> On Sat, May 26, 2018 at 12:02 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>>> >> >
>>> >> >
>>> >> > On Fri, 25 May 2018, Dan Williams wrote:
>>> >> >
>>> >> >> On Fri, May 25, 2018 at 5:51 AM, Mike Snitzer <snitzer@redhat.com> wrote:
>>> >> >> > On Fri, May 25 2018 at  2:17am -0400,
>>> >> >> > Mikulas Patocka <mpatocka@redhat.com> wrote:
>>> >> >> >
>>> >> >> >> On Thu, 24 May 2018, Dan Williams wrote:
>>> >> >> >>
>>> >> >> >> > I don't want to grow driver-local wrappers for pmem. You should use
>>> >> >> >> > memcpy_flushcache() directly, and if an architecture does not define
>>> >> >> >> > memcpy_flushcache() then don't allow building dm-writecache, i.e. this
>>> >> >> >> > driver should 'depends on CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE'. I don't
>>> >> >> >> > see a need to add a standalone flush operation if all relevant archs
>>> >> >> >> > provide memcpy_flushcache(). As for commit, I'd say just use wmb()
>>> >> >> >> > directly since all archs define it. Alternatively we could introduce
>>> >> >> >> > memcpy_flushcache_relaxed() to be the un-ordered version of the copy
>>> >> >> >> > routine and memcpy_flushcache() would imply a wmb().
>>> >> >> >>
>>> >> >> >> But memcpy_flushcache() on ARM64 is slow.
>>> >> >>
>>> >> >> Right, so again, what is wrong with memcpy_flushcache_relaxed() +
>>> >> >> wmb() or otherwise making memcpy_flushcache() ordered. I do not see
>>> >> >> that as a trailblazing requirement, I see that as typical review and a
>>> >> >> reduction of the operation space that you are proposing.
>>> >> >
>>> >> > memcpy_flushcache() on ARM64 is generally the wrong thing to do,
>>> >> > because it is slower than memcpy() followed by an explicit cache flush.
>>> >> >
>>> >> > Suppose that you want to write data to a block device and make it
>>> >> > persistent. So you send a WRITE bio and then a FLUSH bio.
>>> >> >
>>> >> > Now - how to implement these two bios on persistent memory:
>>> >> >
>>> >> > On X86, the WRITE bio does memcpy_flushcache() and the FLUSH bio does
>>> >> > wmb() - this is the optimal implementation.
>>> >> >
>>> >> > But on ARM64, memcpy_flushcache() is suboptimal. On ARM64, the optimal
>>> >> > implementation is that the WRITE bio does just memcpy() and the FLUSH bio
>>> >> > does arch_wb_cache_pmem() on the affected range.
>>> >> >
>>> >> > Why is memcpy_flushcache() suboptimal on ARM? The ARM architecture
>>> >> > doesn't have non-temporal stores. So, memcpy_flushcache() is implemented
>>> >> > as memcpy() followed by a cache flush.
>>> >> >
>>> >> > Now - if you flush the cache immediately after memcpy, the cache is full
>>> >> > of dirty lines and the cache-flushing code has to write these lines back
>>> >> > and that is slow.
>>> >> >
>>> >> > If you flush the cache some time after memcpy (i.e. when the FLUSH bio is
>>> >> > received), the processor already flushed some part of the cache on its
>>> >> > own, so the cache-flushing function has less work to do and it is faster.
>>> >> >
>>> >> > So the conclusion is - don't use memcpy_flushcache on ARM. This problem
>>> >> > cannot be fixed by a better implementation of memcpy_flushcache.
>>> >>
>>> >> It sounds like ARM might be better off with mapping its pmem as
>>> >> write-through rather than write-back, and skip the explicit cache
>>> >
>>> > I doubt it would perform well - write combining combines the writes into
>>> > larger segments - and write-through doesn't.
>>> >
>>>
>>> Last I checked write-through caching does not disable write combining.
>>>
>>> >> management altogether. You speak of "optimal" and "sub-optimal", but
>>> >> what would be more clear is fio measurements of the relative IOPs and
>>> >> latency profiles of the different approaches. The reason I am
>>> >> continuing to push here is that reducing the operation space from
>>> >> 'copy-flush-commit' to just 'copy' or 'copy-commit' simplifies the
>>> >> maintenance long term.
>>> >
>>> > I measured it (with nvme backing store) and late cache flushing has 12%
>>> > better performance than eager flushing with memcpy_flushcache().
>>>
>>> I assume what you're seeing is ARM64 over-flushing the amount of dirty
>>> data so it becomes more efficient to do an amortized flush at the end?
>>> However, that effectively makes memcpy_flushcache() unusable in the
>>> way it can be used on x86. You claimed that ARM does not support
>>> non-temporal stores, but it does, see the STNP instruction. I do not
>>> want to see arch specific optimizations in drivers, so either
>>> write-through mappings is a potential answer to remove the need to
>>> explicitly manage flushing, or just implement STNP hacks in
>>> memcpy_flushcache() like you did with MOVNT on x86.
>>>
>>> > 131836 4k iops - vs - 117016.
>>>
>>> To be clear this is memcpy_flushcache() vs memcpy + flush?
>>
>> I found out what caused the difference. I used dax_flush on the version of
>> dm-writecache that I had on the ARM machine (with the kernel 4.14, because
>> it is the last version where dax on ramdisk works) - and I thought that
>> dax_flush flushes the cache, but it doesn't.
>>
>> When I replaced dax_flush with arch_wb_cache_pmem, the performance
>> difference between early flushing and late flushing disappeared.
>>
>> So I think we can remove this per-architecture switch from dm-writecache.
>
> Great find! Thanks for the due diligence. Feel free to add:
>
>     Acked-by: Dan Williams <dan.j.williams@intel.com>
>
> ...on the reworks to unify ARM and x86.

One more note. The side effect of not using dax_flush() is that you
may end up flushing caches on systems where the platform has asserted
it will take responsibility for flushing caches at power loss. If /
when those systems become more prevalent we may want to think of a way
to combine the non-temporal optimization and the cache-flush-bypass
optimizations. However that is something that can wait for a later
change beyond 4.18.
