From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1034020AbbKFXRa (ORCPT );
	Fri, 6 Nov 2015 18:17:30 -0500
Received: from mail-wi0-f180.google.com ([209.85.212.180]:35161 "EHLO
	mail-wi0-f180.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1033956AbbKFXR2 (ORCPT );
	Fri, 6 Nov 2015 18:17:28 -0500
MIME-Version: 1.0
In-Reply-To:
References: <1446070176-14568-1-git-send-email-ross.zwisler@linux.intel.com>
	<20151028225112.GA30284@linux.intel.com>
Date: Fri, 6 Nov 2015 15:17:27 -0800
Message-ID:
Subject: Re: [PATCH 0/2] "big hammer" for DAX msync/fsync correctness
From: Dan Williams
To: Thomas Gleixner
Cc: Ross Zwisler, Jeff Moyer, linux-nvdimm, X86 ML, Dave Chinner,
	"linux-kernel@vger.kernel.org", Ingo Molnar, "H. Peter Anvin",
	Jan Kara
Content-Type: text/plain; charset=UTF-8
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Nov 6, 2015 at 9:35 AM, Thomas Gleixner wrote:
> On Fri, 6 Nov 2015, Dan Williams wrote:
>> On Fri, Nov 6, 2015 at 12:06 AM, Thomas Gleixner wrote:
>> > Just for the record. Such a flush mechanism with
>> >
>> >     on_each_cpu()
>> >        wbinvd()
>> >     ...
>> >
>> > will make that stuff completely unusable on Real-Time systems. We've
>> > been there with the big hammer approach of the intel graphics
>> > driver.
>>
>> Noted.  This means RT systems either need to disable DAX or avoid
>> fsync.  Yes, this is a wart, but not an unexpected one in a first
>> generation persistent memory platform.
>
> And it's not just RT.  The folks who are aiming for 100% undisturbed
> user space (NOHZ_FULL) will be massively unhappy about that as well.
>
> Is it really required to do that on all cpus?

I believe it is, but I'll double check.
I assume the folks who want undisturbed userspace are OK with the
mitigation of modifying their applications to flush individual cache
lines if they want to use DAX without fsync, at least until the
platform can provide a cheaper fsync implementation.  The option to
drive cache flushing from the radix tree is at least interruptible,
but it may be long running depending on how much virtual address space
is dirty.

Altogether, the options in the current generation are:

1/ wbinvd driven: quick flush, O(size of cache), but long
   interrupt-off latency

2/ radix driven: long flush, O(size of dirty range), but at least
   preempt-able

3/ DAX without calling fsync: userspace takes direct responsibility
   for cache management of DAX mappings

4/ DAX disabled: fsync incurs the standard page cache writeback
   latency

We could argue about 1 vs 2 ad nauseam, but I wonder if there is room
to punt it to a configuration option, or to make it dynamic?  My
stance is to do 1, with the hope of riding options 3 and 4 until the
platform provides a better alternative.