From: Dan Williams <dan.j.williams@intel.com>
To: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>,
	Ross Zwisler <ross.zwisler@linux.intel.com>,
	Jeff Moyer <jmoyer@redhat.com>,
	linux-nvdimm <linux-nvdimm@ml01.01.org>, X86 ML <x86@kernel.org>,
	Dave Chinner <david@fromorbit.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Ingo Molnar <mingo@redhat.com>, Jan Kara <jack@suse.com>
Subject: Re: [PATCH 0/2] "big hammer" for DAX msync/fsync correctness
Date: Sat, 7 Nov 2015 01:02:50 -0800	[thread overview]
Message-ID: <CAPcyv4gJiMPc57jAZvCBDq4e3aiuMnknpgi_OC5wPdz3N4sm8w@mail.gmail.com> (raw)
In-Reply-To: <alpine.DEB.2.11.1511070919160.4032@nanos>

On Sat, Nov 7, 2015 at 12:38 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Sat, 7 Nov 2015, Dan Williams wrote:
>> On Fri, Nov 6, 2015 at 10:50 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>> > On Fri, 6 Nov 2015, H. Peter Anvin wrote:
>> >> On 11/06/15 15:17, Dan Williams wrote:
>> >> >>
>> >> >> Is it really required to do that on all cpus?
>> >> >
>> >> > I believe it is, but I'll double check.
>> >> >
>> >>
>> >> It's required on all CPUs on which the DAX memory may have been dirtied.
>> >>  This is similar to the way we flush TLBs.
>> >
>> > Right. And that's exactly the problem: "may have been dirtied"
>> >
>> > If DAX is used on 50% of the CPUs and the other 50% are plugging away
>> > happily in user space or running low-latency RT tasks w/o ever touching
>> > it, then having an unconditional flush on ALL CPUs is just wrong,
>> > because you penalize the uninvolved cores with a completely pointless
>> > SMP function call and drain their caches.
>> >
>>
>> It's not wrong and pointless; it's all we have available outside of
>> having the kernel remember every virtual address that might have been
>> touched since the last fsync and sit in a loop flushing those virtual
>> addresses cache line by cache line.
>>
>> There is a crossover point, which still needs to be determined, where
>> wbinvd is better than a clwb loop.
>
> This is a totally different issue, and I'm well aware that there is a
> tradeoff between wbinvd() and a clwb loop. wbinvd() might be more
> efficient performance-wise above some number of cache lines, but then
> again it drains all the unrelated stuff as well, which can result in an
> even larger performance hit.
>
> Now what really concerns me more is that you just unconditionally
> flush on all CPUs whether they were involved in that DAX stuff or not.
>
> Assume a DAX-using application running on CPUs 0-3 and some other
> unrelated workload on CPUs 4-7. That flush will
>
>   - Interrupt CPUs 4-7 for no reason (whether you use clwb or wbinvd)
>
>   - Drain the caches of CPUs 4-7 for no reason if done with wbinvd()
>
>   - Render Cache Allocation useless if done with wbinvd()
>
> And we are not talking about a few microseconds here. Assume that
> CPUs 4-7 have cache allocated and it's mostly dirty. We measured the
> wbinvd() impact on RT back when the graphics folks used it as a
> big hammer. The maximum latency spike was way above one millisecond.
>
> We have similar issues with TLB flushing, but there we
>
>   - are tracking where it was used and never flush on innocent CPUs
>
>   - one can design an application to use separate processes so that
>     cross-CPU flushing does not happen
>
> I know that this is not an easy problem to solve, but you should be
> aware that various application scenarios are going to be massively
> unhappy about that.
>

Thanks for that explanation.  Peter had alluded to it at KS, but I
indeed did not know that it was as horrible as milliseconds of
latency, hmm...
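
For concreteness, the unconditional flush being debated boils down to
an SMP function call that runs wbinvd() on every online CPU.  A
minimal sketch of that big hammer (illustrative only, not the actual
patch) would be:

#include <linux/smp.h>
#include <asm/special_insns.h>

/* Sketch only: interrupt every online CPU and have each one write
 * back and invalidate its entire cache hierarchy. */
static void do_wbinvd(void *unused)
{
	wbinvd();
}

static void dax_big_hammer_flush(void)
{
	/* wait == 1: do not return until every CPU has drained */
	on_each_cpu(do_wbinvd, NULL, 1);
}

...and that is exactly how CPUs that never touched the mapping end up
eating both the interrupt and the cache drain.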

One other mitigation, which follows on from Dave's plan for per-inode
DAX control, is to also track when an inode has a writable DAX mmap
established.  With that we could have a REQ_DAX flag to augment
REQ_FLUSH and potentially reduce the violence committed on the cache.
In an earlier thread I also recall an idea to have an mmap flag that an
app can use to say "yes, I'm doing a writable DAX mapping, but I'm
taking care of the cache myself".  We could track innocent CPUs, but
I'm thinking that would require a core change to write-protect pages
when a thread migrates?  In general I feel there's a limit to how much
hardware workaround is reasonable to do in the core kernel vs waiting
for the platform to offer better options...
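
By "reduce the violence" I mean falling back to the per-cache-line
writeback loop over just the affected ranges instead of wbinvd().
Roughly, and only as a sketch (something like the wb_cache_pmem()
helper patch 1/2 adds; the line size should really come from
boot_cpu_data.x86_clflush_size rather than a constant):

#include <linux/types.h>
#include <asm/special_insns.h>

#define CACHELINE_SIZE 64	/* illustrative; use x86_clflush_size */

/* Write back a virtual address range one cache line at a time.
 * clwb() writes the line back without invalidating it; the barrier
 * needed to make the data durable on pmem is omitted here. */
static void wb_cache_range(void *addr, size_t size)
{
	unsigned long p = (unsigned long)addr & ~(CACHELINE_SIZE - 1);
	unsigned long end = (unsigned long)addr + size;

	for (; p < end; p += CACHELINE_SIZE)
		clwb((void *)p);
}

Above some number of cache lines a single wbinvd() will beat this
loop; where that crossover sits is the part that still needs to be
measured.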

Sorry if I'm being a bit punchy, but I'm still feeling like I need to
defend the notion that DAX may just need to be turned off in some
situations.
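
For the record, "track innocent CPUs" would look something like the
TLB-flush scheme: a per-mapping cpumask filled in from the write-fault
path and consumed at fsync time.  A hypothetical sketch (the
dax_dirty_cpus mask and the hook points are made up for illustration):

#include <linux/cpumask.h>
#include <linux/smp.h>
#include <asm/special_insns.h>

/* Hypothetical: CPUs that have written through a writable DAX mmap
 * since the last sync. */
static struct cpumask dax_dirty_cpus;

/* Would be called when a write fault makes a DAX page writable. */
static void dax_note_dirty_cpu(void)
{
	int cpu = get_cpu();	/* disable preemption for a stable id */

	cpumask_set_cpu(cpu, &dax_dirty_cpus);
	put_cpu();
}

static void do_wbinvd(void *unused)
{
	wbinvd();
}

/* fsync/msync path: only interrupt the CPUs that dirtied data. */
static void dax_flush_dirty_cpus(void)
{
	on_each_cpu_mask(&dax_dirty_cpus, do_wbinvd, NULL, 1);
	cpumask_clear(&dax_dirty_cpus);
}

The catch is the one above: once a thread migrates, the new CPU's
cache can hold dirty lines for pages that are already writable, so
keeping the mask honest likely means write-protecting pages on
migration, which is the core change I'm wary of.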


Thread overview: 19+ messages
2015-10-28 22:09 [PATCH 0/2] "big hammer" for DAX msync/fsync correctness Ross Zwisler
2015-10-28 22:09 ` [PATCH 1/2] pmem: add wb_cache_pmem() to the PMEM API Ross Zwisler
2015-10-28 22:09 ` [PATCH 2/2] pmem: Add simple and slow fsync/msync support Ross Zwisler
2015-10-28 23:02   ` Dan Williams
2015-10-28 22:24 ` [PATCH 0/2] "big hammer" for DAX msync/fsync correctness Jeff Moyer
2015-10-28 22:49   ` Dan Williams
2015-10-28 22:51   ` Ross Zwisler
2015-11-05 23:59     ` Dan Williams
2015-11-06  8:06       ` Thomas Gleixner
2015-11-06 16:04         ` Dan Williams
2015-11-06 17:35           ` Thomas Gleixner
2015-11-06 23:17             ` Dan Williams
2015-11-07  0:51               ` H. Peter Anvin
2015-11-07  6:50                 ` Thomas Gleixner
2015-11-07  8:12                   ` Dan Williams
2015-11-07  8:38                     ` Thomas Gleixner
2015-11-07  9:02                       ` Dan Williams [this message]
2015-11-07  9:24                         ` Thomas Gleixner
2015-11-06 20:25       ` H. Peter Anvin
