From: Jan Kara <jack@suse.cz>
To: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: linux-fsdevel@vger.kernel.org, linux-block@vger.kernel.org,
	linux-nvdimm@lists.01.org, lsf-pc@lists.linux-foundation.org,
	linux-mm@kvack.org
Subject: Re: [Lsf-pc] [LSF/MM TOPIC] Future direction of DAX
Date: Tue, 17 Jan 2017 16:59:10 +0100	[thread overview]
Message-ID: <20170117155910.GU2517@quack2.suse.cz> (raw)
In-Reply-To: <20170114002008.GA25379@linux.intel.com>

On Fri 13-01-17 17:20:08, Ross Zwisler wrote:
> - The DAX fsync/msync model was built for platforms that need to flush dirty
>   processor cache lines in order to make data durable on NVDIMMs.  There exist
>   platforms, however, that are set up so that the processor caches are
>   effectively part of the ADR safe zone.  This means that dirty data can be
>   assumed to be durable even in the processor cache, obviating the need to
>   manually flush the cache during fsync/msync.  These platforms still need to
>   call fsync/msync to ensure that filesystem metadata updates are properly
>   written to media.  Our first idea on how to properly support these platforms
>   would be for DAX to be made aware that in some cases it doesn't need to
>   metadata about dirty cache lines.  A similar issue exists for volatile uses
>   of DAX such as with BRD or with PMEM and the memmap command line parameter,
>   and we'd like a solution that covers them all.

Well, we still need the radix tree entries for locking. And you still need
to keep track of which file offsets are writeably mapped (which we
currently implicitly track via dirty radix tree entries) so that you can
write-protect them if needed (during filesystem freezing, for reflink, ...).
So I think the biggest gain by far would be simply to avoid doing the
writeback at all in such situations.
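
To make the trade-off concrete, here is a toy userspace model (Python, purely
illustrative; all names are invented and none of them match kernel code) of
keeping the radix tree entries for locking and write-protection while skipping
the flush step entirely on platforms whose caches sit in the ADR safe zone:

```python
# Toy model of per-file dirty tracking for DAX writeback.
# All names here are invented for illustration only.

class DaxEntry:
    def __init__(self, offset):
        self.offset = offset   # file offset of the mapped page
        self.dirty = True      # writably mapped / needs writeback attention
        self.locked = False    # entries double as fault-locking slots

class DaxMapping:
    def __init__(self, caches_are_adr_safe):
        self.caches_are_adr_safe = caches_are_adr_safe
        self.entries = {}      # offset -> DaxEntry (stand-in for radix tree)
        self.flushed = []      # record of cache-line flushes we issued

    def fault_write(self, offset):
        # Even on ADR-safe platforms we must insert an entry: it is what
        # lets us find and write-protect mappings later (freeze, reflink).
        self.entries.setdefault(offset, DaxEntry(offset))

    def writeback(self):
        for entry in self.entries.values():
            if not entry.dirty:
                continue
            if not self.caches_are_adr_safe:
                self.flushed.append(entry.offset)  # emulate clwb/clflushopt
            # Entry stays present (for locking), but is no longer dirty.
            entry.dirty = False

    def write_protect_all(self):
        # Filesystem freeze: we can still find every writably-mapped offset.
        return sorted(self.entries)
```

On an ADR-safe platform `writeback()` issues no flushes at all, yet
`write_protect_all()` still sees every mapped offset, which is the point
being made above: the entries must stay, only the flushing can go.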

> - If I recall correctly, at one point Dave Chinner suggested that we change
>   DAX so that I/O would use cached stores instead of the non-temporal stores
>   that it currently uses.  We would then track pages that were written to by
>   DAX in the radix tree so that they would be flushed later during
>   fsync/msync.  Does this sound like a win?  Also, assuming that we can find a
>   solution for platforms where the processor cache is part of the ADR safe
>   zone (above topic) this would be a clear improvement, moving us from using
>   non-temporal stores to faster cached stores with no downside.

I guess this needs measurements. But it is worth a try.
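
The bookkeeping difference between the two store strategies can be sketched as
follows (a toy Python model with invented names, not the actual I/O path):

```python
# Toy comparison of cached vs. non-temporal stores on the DAX I/O path.
# Invented names, illustration only.

class DaxFile:
    def __init__(self, use_cached_stores):
        self.use_cached_stores = use_cached_stores
        self.dirty_pages = set()  # radix-tree tags for later flushing
        self.flushed = []

    def write(self, page):
        if self.use_cached_stores:
            # Plain stores leave dirty lines in the CPU cache, so the
            # page must be remembered for fsync/msync to flush later.
            self.dirty_pages.add(page)
        # Non-temporal stores go straight to media: nothing to track.

    def fsync(self):
        for page in sorted(self.dirty_pages):
            self.flushed.append(page)  # emulate the cache-line flush loop
        self.dirty_pages.clear()
```

The measurement question is whether the faster cached writes outweigh the
extra radix-tree dirtying at write time plus the flush loop at fsync time;
on ADR-safe platforms the flush loop disappears and only the win remains.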

> - Jan suggested [2] that we could use the radix tree as a cache to service DAX
>   faults without needing to call into the filesystem.  Are there any issues
>   with this approach, and should we move forward with it as an optimization?

Yup, I'm still for it.
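
The idea reduces to a lookup cache in front of the filesystem's block-mapping
callback; a minimal sketch (invented names, illustration only):

```python
# Toy sketch of servicing repeat DAX faults from cached radix-tree
# entries instead of calling back into the filesystem each time.
# Invented names, illustration only.

class DaxFaultCache:
    def __init__(self, fs_get_block):
        self.fs_get_block = fs_get_block  # expensive filesystem callback
        self.cache = {}                   # offset -> block (radix tree)
        self.fs_calls = 0

    def fault(self, offset):
        block = self.cache.get(offset)
        if block is None:
            # First fault on this offset: ask the filesystem, then cache.
            self.fs_calls += 1
            block = self.fs_get_block(offset)
            self.cache[offset] = block
        return block
```

Repeat faults on an already-mapped offset never re-enter the filesystem; the
open issues are invalidation (truncate, hole punch) and keeping the cached
entries coherent with the filesystem's extent state.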

> - Whenever you mount a filesystem with DAX, it spits out a message that says
>   "DAX enabled. Warning: EXPERIMENTAL, use at your own risk".  What criteria
>   needs to be met for DAX to no longer be considered experimental?

So from my POV I'd be OK with removing the warning, but the code is still
new, so there are clearly bugs lurking ;).

> - When we msync() a huge page, if the range is less than the entire huge page,
>   should we flush the entire huge page and mark it clean in the radix tree, or
>   should we only flush the requested range and leave the radix tree entry
>   dirty?

If you do a partial msync(), then you have the problem that msync(0, x)
followed by msync(x, EOF) will not yield a clean file, which may surprise
somebody. So I'm slightly skeptical.
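
The surprise is easy to demonstrate with a toy model of a single huge-page
radix entry (invented names, illustration only): with only one dirty bit per
huge page, partial flushes can never mark the entry clean.

```python
# Toy model of the two partial-msync policies for a huge-page entry.
# Invented names, illustration only.

HUGE = 2 * 1024 * 1024  # 2 MiB huge page

class HugeEntry:
    def __init__(self):
        self.dirty = True  # one dirty bit for the whole huge page

def msync(entry, start, length, partial):
    if partial:
        # Flush only [start, start+length) but leave the entry dirty:
        # a single entry cannot record sub-page cleanliness.
        pass
    else:
        # Flush the whole huge page and mark the entry clean.
        entry.dirty = False

# Two partial msyncs that together cover the whole page still leave
# the entry dirty -- the surprising outcome described above.
e = HugeEntry()
msync(e, 0, HUGE // 2, partial=True)
msync(e, HUGE // 2, HUGE // 2, partial=True)
```

The whole-page policy avoids the surprise at the cost of flushing more than
the caller asked for; the partial policy flushes less but can leave a fully
synced file looking dirty.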
 
> - Should we enable 1 GiB huge pages in filesystem DAX?  Does anyone have any
>   specific customer requests for this or performance data suggesting it would
>   be a win?  If so, what work needs to be done to get 1 GiB sized and aligned
>   filesystem block allocations, to get the required enabling in the MM layer,
>   etc?

I'm not convinced it is worth it now. Maybe later...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

  parent reply	other threads:[~2017-01-17 15:59 UTC|newest]

Thread overview: 50+ messages
2017-01-14  0:20 [LSF/MM TOPIC] Future direction of DAX Ross Zwisler
2017-01-14  8:26 ` Darrick J. Wong
2017-01-16  0:19   ` Viacheslav Dubeyko
2017-01-16 20:00   ` Jeff Moyer
2017-01-17  1:50     ` Darrick J. Wong
2017-01-17  2:42       ` Dan Williams
2017-01-17  7:57       ` Christoph Hellwig
2017-01-17 14:54         ` Jeff Moyer
2017-01-17 15:06           ` Christoph Hellwig
2017-01-17 16:07             ` Jeff Moyer
2017-01-17 15:59 ` Jan Kara [this message]
2017-01-17 16:56   ` Dan Williams
2017-01-18  0:03   ` Kani, Toshimitsu
2017-01-18  5:25 ` willy
2017-01-18  6:01   ` Dan Williams
2017-01-18  6:07     ` willy
2017-01-18  6:25       ` Dan Williams
2017-01-18 17:22   ` Ross Zwisler