Date: Sat, 7 Nov 2015 09:38:36 +0100 (CET)
From: Thomas Gleixner
To: Dan Williams
Cc: "H. Peter Anvin", Ross Zwisler, Jeff Moyer, linux-nvdimm, X86 ML,
    Dave Chinner, linux-kernel@vger.kernel.org, Ingo Molnar, Jan Kara
Subject: Re: [PATCH 0/2] "big hammer" for DAX msync/fsync correctness

On Sat, 7 Nov 2015, Dan Williams wrote:
> On Fri, Nov 6, 2015 at 10:50 PM, Thomas Gleixner wrote:
> > On Fri, 6 Nov 2015, H. Peter Anvin wrote:
> >> On 11/06/15 15:17, Dan Williams wrote:
> >> >>
> >> >> Is it really required to do that on all cpus?
> >> >
> >> > I believe it is, but I'll double check.
> >> >
> >>
> >> It's required on all CPUs on which the DAX memory may have been
> >> dirtied. This is similar to the way we flush TLBs.
> >
> > Right. And that's exactly the problem: "may have been dirtied"
> >
> > If DAX is used on 50% of the CPUs and the other 50% are plugging
> > away happily in user space or run low latency RT tasks w/o ever
> > touching it, then having an unconditional flush on ALL CPUs is
> > just wrong, because you penalize the uninvolved cores with a
> > completely pointless SMP function call and drain their caches.
>
> It's not wrong and pointless, it's all we have available outside of
> having the kernel remember every virtual address that might have
> been touched since the last fsync and sit in a loop flushing those
> virtual addresses cache line by cache line.
>
> There is a crossover point where wbinvd is better than a clwb loop
> that needs to be determined.

This is a totally different issue, and I'm well aware that there is a
tradeoff between wbinvd() and a clwb loop. wbinvd() might be more
efficient performance-wise above some number of cache lines, but then
again it drains all the unrelated stuff as well, which can result in
an even larger performance hit.

What really concerns me more is that you just unconditionally flush
on all CPUs, whether they were involved in that DAX stuff or not.

Assume a DAX-using application on CPUs 0-3 and some other, unrelated
workload on CPUs 4-7. That flush will:

 - interrupt CPUs 4-7 for no reason (whether you use clwb or wbinvd)

 - drain the cache of CPUs 4-7 for no reason if done with wbinvd()

 - render Cache Allocation useless if done with wbinvd()

And we are not talking about a few microseconds here. Assume that
CPUs 4-7 have cache allocated and it's mostly dirty. We measured the
wbinvd() impact on RT back when the graphics folks used it as a big
hammer. The maximum latency spike was way above one millisecond.
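Just to make the tradeoff concrete, here is a sketch of the two
variants. The CROSSOVER_LINES number is completely made up and would
need real measurements; clwb(), boot_cpu_data and
wbinvd_on_all_cpus() are the existing x86 primitives:

	#include <linux/smp.h>		/* wbinvd_on_all_cpus() */
	#include <asm/special_insns.h>	/* clwb() */
	#include <asm/processor.h>	/* boot_cpu_data */

	/* Flush a known-dirty virtual range line by line with CLWB. */
	static void flush_range_clwb(void *vaddr, size_t size)
	{
		unsigned long clsize = boot_cpu_data.x86_clflush_size;
		void *vend = vaddr + size;
		void *p;

		/* Align down to a cache line and walk the range */
		for (p = (void *)((unsigned long)vaddr & ~(clsize - 1));
		     p < vend; p += clsize)
			clwb(p);	/* write back, line may stay cached */
		wmb();			/* order the write backs */
	}

	/* Made-up crossover: above some number of lines the big hammer
	 * is faster on the local CPU, but it IPIs _all_ CPUs and drains
	 * _all_ caches, which is exactly the problem described above. */
	#define CROSSOVER_LINES	(32 * 1024)

	static void flush_for_fsync(void *vaddr, size_t size)
	{
		if (size / boot_cpu_data.x86_clflush_size < CROSSOVER_LINES)
			flush_range_clwb(vaddr, size);
		else
			wbinvd_on_all_cpus();	/* on_each_cpu() big hammer */
	}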
We have similar issues with TLB flushing, but there we

 - are tracking where the mapping was used and never flush on
   innocent CPUs

 - one can design the application in a way that it uses different
   processes, so cross-CPU flushing does not happen

I know that this is not an easy problem to solve, but you should be
aware that various application scenarios are going to be massively
unhappy about that.

Thanks,

	tglx
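P.S.: For comparison, the TLB side gets this right by tracking the
users: flush_tlb_mm_range() consults mm_cpumask() and only IPIs the
CPUs which actually ran the mm. A minimal sketch of the analogous
"track the users" approach for the cache flush; dax_flush_cpu() and
tying the tracking to the mm are made up for illustration, while
smp_call_function_many() and mm_cpumask() are the existing
primitives:

	#include <linux/smp.h>		/* smp_call_function_many() */
	#include <linux/mm_types.h>	/* struct mm_struct */
	#include <asm/special_insns.h>	/* wbinvd() */

	static void dax_flush_cpu(void *info)
	{
		wbinvd();	/* or a clwb loop over the tracked ranges */
	}

	/* Interrupt only the CPUs which ran tasks of this mm, the same
	 * way the TLB flush code does. Innocent CPUs are left alone. */
	static void dax_flush_tracked(struct mm_struct *mm)
	{
		preempt_disable();
		smp_call_function_many(mm_cpumask(mm), dax_flush_cpu,
				       NULL, 1);
		dax_flush_cpu(NULL);	/* _many() skips the calling CPU */
		preempt_enable();
	}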