linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Dan Williams <dan.j.williams@intel.com>
To: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>,
	Ross Zwisler <ross.zwisler@linux.intel.com>,
	Jeff Moyer <jmoyer@redhat.com>,
	linux-nvdimm <linux-nvdimm@ml01.01.org>, X86 ML <x86@kernel.org>,
	Dave Chinner <david@fromorbit.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Ingo Molnar <mingo@redhat.com>, Jan Kara <jack@suse.com>
Subject: Re: [PATCH 0/2] "big hammer" for DAX msync/fsync correctness
Date: Sat, 7 Nov 2015 01:02:50 -0800	[thread overview]
Message-ID: <CAPcyv4gJiMPc57jAZvCBDq4e3aiuMnknpgi_OC5wPdz3N4sm8w@mail.gmail.com> (raw)
In-Reply-To: <alpine.DEB.2.11.1511070919160.4032@nanos>

On Sat, Nov 7, 2015 at 12:38 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Sat, 7 Nov 2015, Dan Williams wrote:
>> On Fri, Nov 6, 2015 at 10:50 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>> > On Fri, 6 Nov 2015, H. Peter Anvin wrote:
>> >> On 11/06/15 15:17, Dan Williams wrote:
>> >> >>
>> >> >> Is it really required to do that on all cpus?
>> >> >
>> >> > I believe it is, but I'll double check.
>> >> >
>> >>
>> >> It's required on all CPUs on which the DAX memory may have been dirtied.
>> >>  This is similar to the way we flush TLBs.
>> >
>> > Right. And that's exactly the problem: "may have been dirtied"
>> >
>> > If DAX is used on 50% of the CPUs and the other 50% are plumming away
>> > happily in user space or run low latency RT tasks w/o ever touching
>> > it, then having an unconditional flush on ALL CPUs is just wrong
>> > because you penalize the uninvolved cores with a completely pointless
>> > SMP function call and drain their caches.
>> >
>>
>> It's not wrong and pointless, it's all we have available outside of
>> having the kernel remember every virtual address that might have been
>> touched since the last fsync and sit in a loop flushing those virtual
>> address cache line by cache line.
>>
>> There is a crossover point where wbinvd is better than a clwb loop
>> that needs to be determined.
>
> This is a totally different issue and I'm well aware that there is a
> tradeoff between wbinvd() and a clwb loop. wbinvd() might be more
> efficient performance wise above some number of cache lines, but then
> again it's draining all unrelated stuff as well, which can result in a
> even larger performance hit.
>
> Now what really concerns me more is that you just unconditionally
> flush on all CPUs whether they were involved in that DAX stuff or not.
>
> Assume that DAX using application on CPU 0-3 and some other unrelated
> workload on CPU4-7. That flush will
>
>   - Interrupt CPU4-7 for no reason (whether you use clwb or wbinvd)
>
>   - Drain the cache for CPU4-7 for no reason if done with wbinvd()
>
>   - Render Cache Allocation useless if done with wbinvd()
>
> And we are not talking about a few micro seconds here. Assume that
> CPU4-7 have cache allocated and it's mostly dirty. We've measured the
> wbinvd() impact on RT, back then when the graphic folks used it as a
> big hammer. The maximum latency spike was way above one millisecond.
>
> We have similar issues with TLB flushing, but there we
>
>   - are tracking where it was used and never flush on innocent cpus
>
>   - one can design his application in a way that it uses different
>     processes so cross CPU flushing does not happen
>
> I know that this is not an easy problem to solve, but you should be
> aware that various application scenarios are going to be massively
> unhappy about that.
>

Thanks for that explanation.  Peter had alluded to it at KS, but I
indeed did not know that it was as horrible as milliseconds of
latency, hmm...

One other mitigation that follows on with Dave's plan of per-inode DAX
control, is to also track when an inode has a writable DAX mmap
established.  With that we could have a REQ_DAX flag to augment
REQ_FLUSH to potentially reduce committing violence on the cache.  In
an earlier thread I also recall an idea to have an mmap flag that an
app can use to say "yes, I'm doing a writable DAX mapping, but I'm
taking care of the cache myself".  We could track innocent cpus, but
I'm thinking that would be a core change to write-protect pages when a
thread migrates?  In general I feel there's a limit for how much
hardware workaround is reasonable to do in the core kernel vs waiting
for the platform to offer better options...

Sorry if I'm being a bit punchy, but I'm still feeling like I need to
defend the notion that DAX may just need to be turned off in some
situations.

  reply	other threads:[~2015-11-07  9:03 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-10-28 22:09 [PATCH 0/2] "big hammer" for DAX msync/fsync correctness Ross Zwisler
2015-10-28 22:09 ` [PATCH 1/2] pmem: add wb_cache_pmem() to the PMEM API Ross Zwisler
2015-10-28 22:09 ` [PATCH 2/2] pmem: Add simple and slow fsync/msync support Ross Zwisler
2015-10-28 23:02   ` Dan Williams
2015-10-28 22:24 ` [PATCH 0/2] "big hammer" for DAX msync/fsync correctness Jeff Moyer
2015-10-28 22:49   ` Dan Williams
2015-10-28 22:51   ` Ross Zwisler
2015-11-05 23:59     ` Dan Williams
2015-11-06  8:06       ` Thomas Gleixner
2015-11-06 16:04         ` Dan Williams
2015-11-06 17:35           ` Thomas Gleixner
2015-11-06 23:17             ` Dan Williams
2015-11-07  0:51               ` H. Peter Anvin
2015-11-07  6:50                 ` Thomas Gleixner
2015-11-07  8:12                   ` Dan Williams
2015-11-07  8:38                     ` Thomas Gleixner
2015-11-07  9:02                       ` Dan Williams [this message]
2015-11-07  9:24                         ` Thomas Gleixner
2015-11-06 20:25       ` H. Peter Anvin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAPcyv4gJiMPc57jAZvCBDq4e3aiuMnknpgi_OC5wPdz3N4sm8w@mail.gmail.com \
    --to=dan.j.williams@intel.com \
    --cc=david@fromorbit.com \
    --cc=hpa@zytor.com \
    --cc=jack@suse.com \
    --cc=jmoyer@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-nvdimm@ml01.01.org \
    --cc=mingo@redhat.com \
    --cc=ross.zwisler@linux.intel.com \
    --cc=tglx@linutronix.de \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).