linux-fsdevel.vger.kernel.org archive mirror
* [LSF/MM TOPIC] Future direction of DAX
@ 2017-01-14  0:20 Ross Zwisler
  2017-01-14  8:26 ` Darrick J. Wong
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Ross Zwisler @ 2017-01-14  0:20 UTC (permalink / raw)
  To: lsf-pc; +Cc: linux-fsdevel, linux-nvdimm, linux-block, linux-mm

This past year has seen a lot of new DAX development.  We have added support
for fsync/msync, moved to the new iomap I/O data structure, introduced radix
tree based locking, re-enabled PMD support (twice!), and have fixed a bunch of
bugs.

We still have a lot of work to do, though, and I'd like to propose a discussion
around what features people would like to see enabled in the coming year as
well as what use cases their customers have that we might not be aware of.

Here are a few topics to start the conversation:

- The current plan to allow users to safely flush dirty data from userspace is
  built around the PMEM_IMMUTABLE feature [1].  I'm hoping that by LSF/MM we
  will have at least started work on PMEM_IMMUTABLE, but I'm guessing there
  will be more to discuss.

- The DAX fsync/msync model was built for platforms that need to flush dirty
  processor cache lines in order to make data durable on NVDIMMs.  There exist
  platforms, however, that are set up so that the processor caches are
  effectively part of the ADR safe zone.  This means that dirty data can be
  assumed to be durable even in the processor cache, obviating the need to
  manually flush the cache during fsync/msync.  These platforms still need to
  call fsync/msync to ensure that filesystem metadata updates are properly
  written to media.  Our first idea on how to properly support these platforms
  would be for DAX to be made aware that in some cases it doesn't need to keep
  metadata about dirty cache lines.  A similar issue exists for volatile uses
  of DAX such as with BRD or with PMEM and the memmap command line parameter,
  and we'd like a solution that covers them all.

- If I recall correctly, at one point Dave Chinner suggested that we change
  DAX so that I/O would use cached stores instead of the non-temporal stores
  that it currently uses.  We would then track pages that were written to by
  DAX in the radix tree so that they would be flushed later during
  fsync/msync.  Does this sound like a win?  Also, assuming that we can find a
  solution for platforms where the processor cache is part of the ADR safe
  zone (above topic) this would be a clear improvement, moving us from using
  non-temporal stores to faster cached stores with no downside.

- Jan suggested [2] that we could use the radix tree as a cache to service DAX
  faults without needing to call into the filesystem.  Are there any issues
  with this approach, and should we move forward with it as an optimization?

- Whenever you mount a filesystem with DAX, it spits out a message that says
  "DAX enabled. Warning: EXPERIMENTAL, use at your own risk".  What criteria
  needs to be met for DAX to no longer be considered experimental?

- When we msync() a huge page, if the range is less than the entire huge page,
  should we flush the entire huge page and mark it clean in the radix tree, or
  should we only flush the requested range and leave the radix tree entry
  dirty?

- Should we enable 1 GiB huge pages in filesystem DAX?  Does anyone have any
  specific customer requests for this or performance data suggesting it would
  be a win?  If so, what work needs to be done to get 1 GiB sized and aligned
  filesystem block allocations, to get the required enabling in the MM layer,
  etc?
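To make the fsync/msync model in the topics above concrete, here is a minimal userspace sketch of the pattern an application follows today: store through a shared mapping with ordinary cached stores, then call msync() so the kernel flushes whatever it tracked as dirty. The helper name `mapped_write_sync` and its parameters are made up for illustration, and the sketch works on any file; it only exercises the DAX path when the file lives on a DAX-mounted filesystem.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy `len` bytes to offset `off` of `path` through a
 * MAP_SHARED mapping, then ask the kernel to make them durable with msync().
 * Assumes off + len <= map_len.  Returns 0 on success, -1 on error.
 */
int mapped_write_sync(const char *path, size_t map_len, size_t off,
                      const void *buf, size_t len)
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;

    /* Make sure the file is large enough to back the whole mapping. */
    if (ftruncate(fd, (off_t)map_len) != 0) {
        close(fd);
        return -1;
    }

    char *p = mmap(NULL, map_len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }

    memcpy(p + off, buf, len);            /* ordinary cached stores */
    int rc = msync(p, map_len, MS_SYNC);  /* kernel flushes dirty ranges */

    munmap(p, map_len);
    close(fd);
    return rc;
}
```

A caller would invoke it as `mapped_write_sync("/mnt/pmem0/file", 4096, 128, "hello", 5)`; the path here is only an example.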

Thanks,
- Ross

[1] https://lkml.org/lkml/2016/12/19/571
[2] https://lkml.org/lkml/2016/10/12/70

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [LSF/MM TOPIC] Future direction of DAX
  2017-01-14  0:20 [LSF/MM TOPIC] Future direction of DAX Ross Zwisler
@ 2017-01-14  8:26 ` Darrick J. Wong
  2017-01-16  0:19   ` Viacheslav Dubeyko
  2017-01-16 20:00   ` Jeff Moyer
  2017-01-17 15:59 ` [Lsf-pc] " Jan Kara
  2017-01-18  5:25 ` willy
  2 siblings, 2 replies; 18+ messages in thread
From: Darrick J. Wong @ 2017-01-14  8:26 UTC (permalink / raw)
  To: Ross Zwisler, lsf-pc, linux-fsdevel, linux-nvdimm, linux-block, linux-mm

On Fri, Jan 13, 2017 at 05:20:08PM -0700, Ross Zwisler wrote:
> This past year has seen a lot of new DAX development.  We have added support
> for fsync/msync, moved to the new iomap I/O data structure, introduced radix
> tree based locking, re-enabled PMD support (twice!), and have fixed a bunch of
> bugs.
> 
> We still have a lot of work to do, though, and I'd like to propose a discussion
> around what features people would like to see enabled in the coming year as
> well as what use cases their customers have that we might not be aware of.
> 
> Here are a few topics to start the conversation:
> 
> - The current plan to allow users to safely flush dirty data from userspace is
>   built around the PMEM_IMMUTABLE feature [1].  I'm hoping that by LSF/MM we
>   will have at least started work on PMEM_IMMUTABLE, but I'm guessing there
>   will be more to discuss.

Yes, probably. :)

> - The DAX fsync/msync model was built for platforms that need to flush dirty
>   processor cache lines in order to make data durable on NVDIMMs.  There exist
>   platforms, however, that are set up so that the processor caches are
>   effectively part of the ADR safe zone.  This means that dirty data can be
>   assumed to be durable even in the processor cache, obviating the need to
>   manually flush the cache during fsync/msync.  These platforms still need to
>   call fsync/msync to ensure that filesystem metadata updates are properly
>   written to media.  Our first idea on how to properly support these platforms
>   would be for DAX to be made aware that in some cases it doesn't need to keep
>   metadata about dirty cache lines.  A similar issue exists for volatile uses
>   of DAX such as with BRD or with PMEM and the memmap command line parameter,
>   and we'd like a solution that covers them all.
> 
> - If I recall correctly, at one point Dave Chinner suggested that we change
>   DAX so that I/O would use cached stores instead of the non-temporal stores
>   that it currently uses.  We would then track pages that were written to by
>   DAX in the radix tree so that they would be flushed later during
>   fsync/msync.  Does this sound like a win?  Also, assuming that we can find a
>   solution for platforms where the processor cache is part of the ADR safe
>   zone (above topic) this would be a clear improvement, moving us from using
>   non-temporal stores to faster cached stores with no downside.
> 
> - Jan suggested [2] that we could use the radix tree as a cache to service DAX
>   faults without needing to call into the filesystem.  Are there any issues
>   with this approach, and should we move forward with it as an optimization?
> 
> - Whenever you mount a filesystem with DAX, it spits out a message that says
>   "DAX enabled. Warning: EXPERIMENTAL, use at your own risk".  What criteria
>   needs to be met for DAX to no longer be considered experimental?

For XFS I'd like to get reflink working with it, for starters.  We
probably need a bunch more verification work to show that file IO
doesn't adopt any bad quirks having turned on the per-inode DAX flag.

Some day we'll start designing a pmem-native fs, I guess. :P

> - When we msync() a huge page, if the range is less than the entire huge page,
>   should we flush the entire huge page and mark it clean in the radix tree, or
>   should we only flush the requested range and leave the radix tree entry
>   dirty?
> 
> - Should we enable 1 GiB huge pages in filesystem DAX?  Does anyone have any
>   specific customer requests for this or performance data suggesting it would
>   be a win?  If so, what work needs to be done to get 1 GiB sized and aligned
>   filesystem block allocations, to get the required enabling in the MM layer,
>   etc?

<giggle> :)

--D

> 
> Thanks,
> - Ross
> 
> [1] https://lkml.org/lkml/2016/12/19/571
> [2] https://lkml.org/lkml/2016/10/12/70
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [LSF/MM TOPIC] Future direction of DAX
  2017-01-14  8:26 ` Darrick J. Wong
@ 2017-01-16  0:19   ` Viacheslav Dubeyko
  2017-01-16 20:00   ` Jeff Moyer
  1 sibling, 0 replies; 18+ messages in thread
From: Viacheslav Dubeyko @ 2017-01-16  0:19 UTC (permalink / raw)
  To: Darrick J. Wong, Ross Zwisler, lsf-pc, linux-fsdevel,
	linux-nvdimm, linux-block, linux-mm

On Sat, 2017-01-14 at 00:26 -0800, Darrick J. Wong wrote:

<skipped>

> Some day we'll start designing a pmem-native fs, I guess. :P

There are research efforts in this direction already ([1]-[15]). The
latest one is NOVA, as far as I can see. But, frankly speaking, I
believe that we need a new hardware paradigm/architecture and a new OS
paradigm for the next generation of NVM memory. DAX is a simple
palliative, a temporary solution. But, from my point of view, a
pmem-native fs is also not a good direction because the memory
subsystem will be affected significantly anyway. And, finally, the
evolution of the memory subsystem will reveal something completely
different from what we can imagine right now.

Thanks,
Vyacheslav Dubeyko. 

[1] http://pages.cs.wisc.edu/~swift/papers/eurosys14-aerie.pdf
[2] https://www.researchgate.net/publication/282792714_A_User-Level_File_System_for_Fast_Storage_Devices
[3] https://people.eecs.berkeley.edu/~dcoetzee/publications/Better%20IO%20Through%20Byte-Addressable,%20Persistent%20Memory.pdf
[4] https://www.computer.org/csdl/proceedings/msst/2013/0217/00/06558440.pdf
[5] https://users.soe.ucsc.edu/~scott/papers/MASCOTS04b.pdf
[6] http://ieeexplore.ieee.org/document/4142472/
[7] https://cseweb.ucsd.edu/~swanson/papers/FAST2016NOVA.pdf
[8] http://cesg.tamu.edu/wp-content/uploads/2012/02/MSST13.pdf
[9] http://ieeexplore.ieee.org/document/5487498/
[10] https://pdfs.semanticscholar.org/544c/1ddf24b90c3dfba7b1934049911b869c99b4.pdf
[11] http://pramfs.sourceforge.net/tech.html
[12] https://pdfs.semanticscholar.org/2981/b5abcbe1023b9f3cd962b0be7ef8bd45acfd.pdf
[13] http://ieeexplore.ieee.org/document/6232378/
[14] http://ieeexplore.ieee.org/document/7304365/
[15] http://ieeexplore.ieee.org/document/6272446/


* Re: [LSF/MM TOPIC] Future direction of DAX
  2017-01-14  8:26 ` Darrick J. Wong
  2017-01-16  0:19   ` Viacheslav Dubeyko
@ 2017-01-16 20:00   ` Jeff Moyer
  2017-01-17  1:50     ` Darrick J. Wong
  1 sibling, 1 reply; 18+ messages in thread
From: Jeff Moyer @ 2017-01-16 20:00 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Ross Zwisler, lsf-pc, linux-fsdevel, linux-nvdimm, linux-block, linux-mm

"Darrick J. Wong" <darrick.wong@oracle.com> writes:

>> - Whenever you mount a filesystem with DAX, it spits out a message that says
>>   "DAX enabled. Warning: EXPERIMENTAL, use at your own risk".  What criteria
>>   needs to be met for DAX to no longer be considered experimental?
>
> For XFS I'd like to get reflink working with it, for starters.

What do you mean by this, exactly?  When Dave outlined the requirements
for PMEM_IMMUTABLE, it was very clear that metadata updates would not be
possible.  And would you really consider this a barrier to marking dax
fully supported?  I wouldn't.

> We probably need a bunch more verification work to show that file IO
> doesn't adopt any bad quirks having turned on the per-inode DAX flag.

Can you be more specific?  We have ltp and xfstests.  If you have some
mkfs/mount options that you think should be tested, speak up.  Beyond
that, if it passes ./check -g auto and ltp, are we good?

-Jeff


* Re: [LSF/MM TOPIC] Future direction of DAX
  2017-01-16 20:00   ` Jeff Moyer
@ 2017-01-17  1:50     ` Darrick J. Wong
  2017-01-17  2:42       ` Dan Williams
  2017-01-17  7:57       ` Christoph Hellwig
  0 siblings, 2 replies; 18+ messages in thread
From: Darrick J. Wong @ 2017-01-17  1:50 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Ross Zwisler, lsf-pc, linux-fsdevel, linux-nvdimm, linux-block, linux-mm

On Mon, Jan 16, 2017 at 03:00:41PM -0500, Jeff Moyer wrote:
> "Darrick J. Wong" <darrick.wong@oracle.com> writes:
> 
> >> - Whenever you mount a filesystem with DAX, it spits out a message that says
> >>   "DAX enabled. Warning: EXPERIMENTAL, use at your own risk".  What criteria
> >>   needs to be met for DAX to no longer be considered experimental?
> >
> > For XFS I'd like to get reflink working with it, for starters.
> 
> What do you mean by this, exactly?  When Dave outlined the requirements
> for PMEM_IMMUTABLE, it was very clear that metadata updates would not be
> possible.  And would you really consider this a barrier to marking dax
> fully supported?  I wouldn't.

For PMEM_IMMUTABLE files, yes, reflink cannot be supported.

I'm talking about supporting reflink for DAX files that are /not/
PMEM_IMMUTABLE, where user programs can mmap pmem directly but write
activity still must use fsync/msync to ensure that everything's on disk.

I wouldn't consider it a barrier in general (since ext4 also prints
EXPERIMENTAL warnings for DAX), merely one for XFS.  I don't even think
it's that big of a hurdle -- afaict XFS ought to be able to achieve this
by modifying iomap_begin to allocate new pmem blocks, memcpy the
contents, and update the memory mappings.  I think.
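The three steps named above (allocate new blocks, memcpy the contents, update the mappings) can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `struct block`, `struct file_map`, and `cow_before_write` are hypothetical names, heap memory stands in for pmem, and nothing here reflects the actual XFS or iomap code.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define BLOCK_SIZE 4096

/* Hypothetical stand-in for a pmem block shared between reflinked files. */
struct block {
    int refcount;
    unsigned char data[BLOCK_SIZE];
};

/* Hypothetical per-file mapping of one file offset to one block. */
struct file_map {
    struct block *blk;
};

/*
 * Called before a write reaches the block.  If the block is shared, give
 * this file a private copy: allocate, memcpy, then repoint the mapping.
 * Returns 0 on success, -1 if the allocation fails.
 */
int cow_before_write(struct file_map *m)
{
    struct block *old = m->blk;

    if (old->refcount == 1)
        return 0;                       /* already exclusive, write in place */

    struct block *copy = malloc(sizeof(*copy));
    if (!copy)
        return -1;

    memcpy(copy->data, old->data, BLOCK_SIZE);  /* duplicate the contents */
    copy->refcount = 1;
    old->refcount--;                            /* drop our share of the old block */
    m->blk = copy;                              /* update the mapping */
    return 0;
}
```

In a real DAX setting the "update the mapping" step would also have to invalidate existing user mappings of the old block, which this model leaves out.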

> > We probably need a bunch more verification work to show that file IO
> > doesn't adopt any bad quirks having turned on the per-inode DAX flag.
> 
> Can you be more specific?  We have ltp and xfstests.  If you have some
> mkfs/mount options that you think should be tested, speak up.  Beyond
> that, if it passes ./check -g auto and ltp, are we good?

That's probably good -- I simply wanted to know if we'd at least gotten
to the point that someone had run both suites with and without DAX and
not seen any major regressions between the two.

--D

> 
> -Jeff

* Re: [LSF/MM TOPIC] Future direction of DAX
  2017-01-17  1:50     ` Darrick J. Wong
@ 2017-01-17  2:42       ` Dan Williams
  2017-01-17  7:57       ` Christoph Hellwig
  1 sibling, 0 replies; 18+ messages in thread
From: Dan Williams @ 2017-01-17  2:42 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Jeff Moyer, linux-nvdimm, linux-block, Linux MM, linux-fsdevel, lsf-pc

On Mon, Jan 16, 2017 at 5:50 PM, Darrick J. Wong
<darrick.wong@oracle.com> wrote:
> On Mon, Jan 16, 2017 at 03:00:41PM -0500, Jeff Moyer wrote:
>> "Darrick J. Wong" <darrick.wong@oracle.com> writes:
>>
>> >> - Whenever you mount a filesystem with DAX, it spits out a message that says
>> >>   "DAX enabled. Warning: EXPERIMENTAL, use at your own risk".  What criteria
>> >>   needs to be met for DAX to no longer be considered experimental?
>> >
>> > For XFS I'd like to get reflink working with it, for starters.
>>
>> What do you mean by this, exactly?  When Dave outlined the requirements
>> for PMEM_IMMUTABLE, it was very clear that metadata updates would not be
>> possible.  And would you really consider this a barrier to marking dax
>> fully supported?  I wouldn't.
>
> For PMEM_IMMUTABLE files, yes, reflink cannot be supported.
>
> I'm talking about supporting reflink for DAX files that are /not/
> PMEM_IMMUTABLE, where user programs can mmap pmem directly but write
> activity still must use fsync/msync to ensure that everything's on disk.
>
> I wouldn't consider it a barrier in general (since ext4 also prints
> EXPERIMENTAL warnings for DAX), merely one for XFS.  I don't even think
> it's that big of a hurdle -- afaict XFS ought to be able to achieve this
> by modifying iomap_begin to allocate new pmem blocks, memcpy the
> contents, and update the memory mappings.  I think.
>
>> > We probably need a bunch more verification work to show that file IO
>> > doesn't adopt any bad quirks having turned on the per-inode DAX flag.
>>
>> Can you be more specific?  We have ltp and xfstests.  If you have some
>> mkfs/mount options that you think should be tested, speak up.  Beyond
>> that, if it passes ./check -g auto and ltp, are we good?
>
> That's probably good -- I simply wanted to know if we'd at least gotten
> to the point that someone had run both suites with and without DAX and
> not seen any major regressions between the two.

Yes, xfstests is part of the DAX development flow. The hard part has been
maintaining a blacklist of tests that fail in both the DAX and non-DAX
cases, or false negatives due to DAX disabling delayed allocation.


* Re: [LSF/MM TOPIC] Future direction of DAX
  2017-01-17  1:50     ` Darrick J. Wong
  2017-01-17  2:42       ` Dan Williams
@ 2017-01-17  7:57       ` Christoph Hellwig
  2017-01-17 14:54         ` Jeff Moyer
  1 sibling, 1 reply; 18+ messages in thread
From: Christoph Hellwig @ 2017-01-17  7:57 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Jeff Moyer, Ross Zwisler, lsf-pc, linux-fsdevel, linux-nvdimm,
	linux-block, linux-mm

On Mon, Jan 16, 2017 at 05:50:33PM -0800, Darrick J. Wong wrote:
> I wouldn't consider it a barrier in general (since ext4 also prints
> EXPERIMENTAL warnings for DAX), merely one for XFS.  I don't even think
> it's that big of a hurdle -- afaict XFS ought to be able to achieve this
> by modifying iomap_begin to allocate new pmem blocks, memcpy the
> contents, and update the memory mappings.  I think.

Yes, and I have a working prototype for that.  I'm just way too busy
with lots of bugfixing at the moment but I plan to get to it in this
merge window.  I also agree that we can't mark a feature as fully
supported until it doesn't conflict with other features.

And I'm not going to get started on the PMEM_IMMUTABLE bullshit, please
don't even go there folks, it's a dead end.


* Re: [LSF/MM TOPIC] Future direction of DAX
  2017-01-17  7:57       ` Christoph Hellwig
@ 2017-01-17 14:54         ` Jeff Moyer
  2017-01-17 15:06           ` Christoph Hellwig
  0 siblings, 1 reply; 18+ messages in thread
From: Jeff Moyer @ 2017-01-17 14:54 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Darrick J. Wong, Ross Zwisler, lsf-pc, linux-fsdevel,
	linux-nvdimm, linux-block, linux-mm

Christoph Hellwig <hch@infradead.org> writes:

> On Mon, Jan 16, 2017 at 05:50:33PM -0800, Darrick J. Wong wrote:
>> I wouldn't consider it a barrier in general (since ext4 also prints
>> EXPERIMENTAL warnings for DAX), merely one for XFS.  I don't even think
>> it's that big of a hurdle -- afaict XFS ought to be able to achieve this
>> by modifying iomap_begin to allocate new pmem blocks, memcpy the
>> contents, and update the memory mappings.  I think.

Ah, I wasn't even thinking about non PMEM_IMMUTABLE usage.

> Yes, and I have a working prototype for that.  I'm just way too busy
> with lots of bugfixing at the moment but I plan to get to it in this
> merge window.  I also agree that we can't mark a feature as fully
> supported until it doesn't conflict with other features.

Fair enough.

> And I'm not going to get started on the PMEM_IMMUTABLE bullshit, please
> don't even go there folks, it's a dead end.

I spoke with Dave before the holidays, and he indicated that
PMEM_IMMUTABLE would be an acceptable solution to allowing applications
to flush data completely from userspace.  I know this subject has been
beaten to death, but would you mind just summarizing your opinion on
this one more time?  I'm guessing this will be something more easily
hashed out at LSF, though.

Thanks,
Jeff


* Re: [LSF/MM TOPIC] Future direction of DAX
  2017-01-17 14:54         ` Jeff Moyer
@ 2017-01-17 15:06           ` Christoph Hellwig
  2017-01-17 16:07             ` Jeff Moyer
  0 siblings, 1 reply; 18+ messages in thread
From: Christoph Hellwig @ 2017-01-17 15:06 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Christoph Hellwig, Darrick J. Wong, Ross Zwisler, lsf-pc,
	linux-fsdevel, linux-nvdimm, linux-block, linux-mm

On Tue, Jan 17, 2017 at 09:54:27AM -0500, Jeff Moyer wrote:
> I spoke with Dave before the holidays, and he indicated that
> PMEM_IMMUTABLE would be an acceptable solution to allowing applications
> to flush data completely from userspace.  I know this subject has been
> beaten to death, but would you mind just summarizing your opinion on
> this one more time?  I'm guessing this will be something more easily
> hashed out at LSF, though.

Come up with a prototype that doesn't suck and allows all fs features to
actually work.  And show an application that actually cares and shows
benefits on publicly available real hardware.  Until then go away and
stop wasting everyone's time.


* Re: [Lsf-pc] [LSF/MM TOPIC] Future direction of DAX
  2017-01-14  0:20 [LSF/MM TOPIC] Future direction of DAX Ross Zwisler
  2017-01-14  8:26 ` Darrick J. Wong
@ 2017-01-17 15:59 ` Jan Kara
  2017-01-17 16:56   ` Dan Williams
  2017-01-18  0:03   ` Kani, Toshimitsu
  2017-01-18  5:25 ` willy
  2 siblings, 2 replies; 18+ messages in thread
From: Jan Kara @ 2017-01-17 15:59 UTC (permalink / raw)
  To: Ross Zwisler; +Cc: lsf-pc, linux-fsdevel, linux-block, linux-mm, linux-nvdimm

On Fri 13-01-17 17:20:08, Ross Zwisler wrote:
> - The DAX fsync/msync model was built for platforms that need to flush dirty
>   processor cache lines in order to make data durable on NVDIMMs.  There exist
>   platforms, however, that are set up so that the processor caches are
>   effectively part of the ADR safe zone.  This means that dirty data can be
>   assumed to be durable even in the processor cache, obviating the need to
>   manually flush the cache during fsync/msync.  These platforms still need to
>   call fsync/msync to ensure that filesystem metadata updates are properly
>   written to media.  Our first idea on how to properly support these platforms
>   would be for DAX to be made aware that in some cases it doesn't need to keep
>   metadata about dirty cache lines.  A similar issue exists for volatile uses
>   of DAX such as with BRD or with PMEM and the memmap command line parameter,
>   and we'd like a solution that covers them all.

Well, we still need the radix tree entries for locking. And you still need
to keep track of which file offsets are writeably mapped (which we
currently implicitly keep via dirty radix tree entries) so that you can
writeprotect them if needed (during filesystem freezing, for reflink, ...).
So I think what is going to gain the most by far is simply to avoid doing
the writeback at all in such situations.

> - If I recall correctly, at one point Dave Chinner suggested that we change
>   DAX so that I/O would use cached stores instead of the non-temporal stores
>   that it currently uses.  We would then track pages that were written to by
>   DAX in the radix tree so that they would be flushed later during
>   fsync/msync.  Does this sound like a win?  Also, assuming that we can find a
>   solution for platforms where the processor cache is part of the ADR safe
>   zone (above topic) this would be a clear improvement, moving us from using
>   non-temporal stores to faster cached stores with no downside.

I guess this needs measurements. But it is worth a try.

> - Jan suggested [2] that we could use the radix tree as a cache to service DAX
>   faults without needing to call into the filesystem.  Are there any issues
>   with this approach, and should we move forward with it as an optimization?

Yup, I'm still for it.

> - Whenever you mount a filesystem with DAX, it spits out a message that says
>   "DAX enabled. Warning: EXPERIMENTAL, use at your own risk".  What criteria
>   needs to be met for DAX to no longer be considered experimental?

So from my POV I'd be OK with removing the warning but still the code is
new so there are clearly bugs lurking ;).

> - When we msync() a huge page, if the range is less than the entire huge page,
>   should we flush the entire huge page and mark it clean in the radix tree, or
>   should we only flush the requested range and leave the radix tree entry
>   dirty?

If you do partial msync(), then you have the problem that msync(0, x),
msync(x, EOF) will not yield a clean file which may surprise somebody. So
I'm slightly skeptical.
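The surprise can be shown with a toy model of the two policies. It is only a sketch: it assumes a single dirty bit per 2 MiB huge page entry, and `struct huge_entry`, `enum flush_policy`, and `sketch_msync` are hypothetical names, not kernel interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define HUGE_PAGE_SIZE (2UL * 1024 * 1024)  /* assumed PMD size */

enum flush_policy { FLUSH_WHOLE_ENTRY, FLUSH_RANGE_ONLY };

/* Hypothetical tracking entry: one dirty bit covers the whole huge page. */
struct huge_entry {
    bool dirty;
    size_t bytes_flushed;
};

/* Model of msync() on the byte range [start, end) within one huge page. */
void sketch_msync(struct huge_entry *e, size_t start, size_t end,
                  enum flush_policy policy)
{
    if (end > HUGE_PAGE_SIZE)
        end = HUGE_PAGE_SIZE;
    if (!e->dirty || start >= end)
        return;

    bool covers_all = (start == 0 && end == HUGE_PAGE_SIZE);

    if (policy == FLUSH_WHOLE_ENTRY || covers_all) {
        e->bytes_flushed += HUGE_PAGE_SIZE;  /* flush everything */
        e->dirty = false;                    /* entry can be marked clean */
    } else {
        e->bytes_flushed += end - start;     /* flush only what was asked */
        /* The single dirty bit cannot record a partial flush: stays dirty. */
    }
}
```

With FLUSH_RANGE_ONLY, calling `sketch_msync(&e, 0, x, ...)` and then `sketch_msync(&e, x, HUGE_PAGE_SIZE, ...)` flushes every byte yet leaves the entry dirty, whereas FLUSH_WHOLE_ENTRY cleans it on the first call at the cost of flushing more than requested.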
 
> - Should we enable 1 GiB huge pages in filesystem DAX?  Does anyone have any
>   specific customer requests for this or performance data suggesting it would
>   be a win?  If so, what work needs to be done to get 1 GiB sized and aligned
>   filesystem block allocations, to get the required enabling in the MM layer,
>   etc?

I'm not convinced it is worth it now. Maybe later...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [LSF/MM TOPIC] Future direction of DAX
  2017-01-17 15:06           ` Christoph Hellwig
@ 2017-01-17 16:07             ` Jeff Moyer
  0 siblings, 0 replies; 18+ messages in thread
From: Jeff Moyer @ 2017-01-17 16:07 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Darrick J. Wong, Ross Zwisler, lsf-pc, linux-fsdevel,
	linux-nvdimm, linux-block, linux-mm

Christoph Hellwig <hch@infradead.org> writes:

> On Tue, Jan 17, 2017 at 09:54:27AM -0500, Jeff Moyer wrote:
>> I spoke with Dave before the holidays, and he indicated that
>> PMEM_IMMUTABLE would be an acceptable solution to allowing applications
>> to flush data completely from userspace.  I know this subject has been
>> beaten to death, but would you mind just summarizing your opinion on
>> this one more time?  I'm guessing this will be something more easily
>> hashed out at LSF, though.
>
> Come up with a prototype that doesn't suck and allows all fs features to
> actually work.

OK, I'll take this to mean that PMEM_IMMUTABLE is a non-starter.
Perhaps synchronous page faults (or whatever you want to call it) would
work, but...

> And show an application that actually cares and shows benefits on
> publicly available real hardware.

This is the crux of the issue.

> Until then go away and stop wasting everyone's time.

Fair enough.  It seems fairly likely that this sort of functionality
would provide a big benefit.  But I agree we should have a real-world
use case as proof.

Thanks,
Jeff


* Re: [Lsf-pc] [LSF/MM TOPIC] Future direction of DAX
  2017-01-17 15:59 ` [Lsf-pc] " Jan Kara
@ 2017-01-17 16:56   ` Dan Williams
  2017-01-18  0:03   ` Kani, Toshimitsu
  1 sibling, 0 replies; 18+ messages in thread
From: Dan Williams @ 2017-01-17 16:56 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ross Zwisler, linux-fsdevel, linux-block, linux-nvdimm, lsf-pc, Linux MM

On Tue, Jan 17, 2017 at 7:59 AM, Jan Kara <jack@suse.cz> wrote:
> On Fri 13-01-17 17:20:08, Ross Zwisler wrote:
>> - The DAX fsync/msync model was built for platforms that need to flush dirty
>>   processor cache lines in order to make data durable on NVDIMMs.  There exist
>>   platforms, however, that are set up so that the processor caches are
>>   effectively part of the ADR safe zone.  This means that dirty data can be
>>   assumed to be durable even in the processor cache, obviating the need to
>>   manually flush the cache during fsync/msync.  These platforms still need to
>>   call fsync/msync to ensure that filesystem metadata updates are properly
>>   written to media.  Our first idea on how to properly support these platforms
>>   would be for DAX to be made aware that in some cases it doesn't need to keep
>>   metadata about dirty cache lines.  A similar issue exists for volatile uses
>>   of DAX such as with BRD or with PMEM and the memmap command line parameter,
>>   and we'd like a solution that covers them all.
>
> Well, we still need the radix tree entries for locking. And you still need
> to keep track of which file offsets are writeably mapped (which we
> currently implicitly keep via dirty radix tree entries) so that you can
> writeprotect them if needed (during filesystem freezing, for reflink, ...).
> So I think what is going to gain the most by far is simply to avoid doing
> the writeback at all in such situations.

I came to the same conclusion when taking a look at this. I have some
patches that simply make the writeback optional, but do not touch any
of the other dirty tracking infrastructure. I'll send them out shortly
after a bit more testing. This also dovetails with the request from
Linus to push pmem flushing routines into the driver and stop abusing
__copy_user_nocache.
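As a rough illustration of the idea of making only the writeback optional, the following toy model keeps per-page tracking in all cases and skips just the cache flush when the platform reports that the CPU cache is already durable. It is not the patch set mentioned above; `struct dax_sketch`, `cache_is_durable`, and the function names are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_PAGES 8

/* Hypothetical per-file state; none of these names come from the kernel. */
struct dax_sketch {
    bool cache_is_durable;           /* CPU cache is inside the ADR safe zone */
    bool writably_mapped[NR_PAGES];  /* still needed for write-protecting */
    size_t flushes;                  /* cache flushes actually issued */
};

/* A write fault records that the page is now writably mapped. */
void sketch_write_fault(struct dax_sketch *s, size_t page)
{
    s->writably_mapped[page] = true;
}

/* Number of pages currently tracked as writably mapped. */
size_t sketch_tracked(const struct dax_sketch *s)
{
    size_t n = 0;
    for (size_t i = 0; i < NR_PAGES; i++)
        if (s->writably_mapped[i])
            n++;
    return n;
}

/* fsync path: tracking is always processed, the flush is conditional. */
void sketch_fsync(struct dax_sketch *s)
{
    for (size_t i = 0; i < NR_PAGES; i++) {
        if (!s->writably_mapped[i])
            continue;
        if (!s->cache_is_durable)
            s->flushes++;               /* flush cache lines for this page */
        s->writably_mapped[i] = false;  /* write-protect; next store refaults */
    }
}
```

The point of the split is that freezing or reflink can still walk `writably_mapped` to write-protect pages, even on platforms where no flush is ever issued.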


* Re: [Lsf-pc] [LSF/MM TOPIC] Future direction of DAX
  2017-01-17 15:59 ` [Lsf-pc] " Jan Kara
  2017-01-17 16:56   ` Dan Williams
@ 2017-01-18  0:03   ` Kani, Toshimitsu
  1 sibling, 0 replies; 18+ messages in thread
From: Kani, Toshimitsu @ 2017-01-18  0:03 UTC (permalink / raw)
  To: ross.zwisler, jack
  Cc: linux-mm, linux-nvdimm, linux-block, lsf-pc, linux-fsdevel

On Tue, 2017-01-17 at 16:59 +0100, Jan Kara wrote:
> On Fri 13-01-17 17:20:08, Ross Zwisler wrote:
 :
> > - If I recall correctly, at one point Dave Chinner suggested that
> >   we change DAX so that I/O would use cached stores instead of the
> >   non-temporal stores that it currently uses.  We would then track
> >   pages that were written to by DAX in the radix tree so that they
> >   would be flushed later during fsync/msync.  Does this sound like
> >   a win?  Also, assuming that we can find a solution for platforms
> >   where the processor cache is part of the ADR safe zone (above
> >   topic) this would be a clear improvement, moving us from using
> >   non-temporal stores to faster cached stores with no downside.
> 
> I guess this needs measurements. But it is worth a try.

Brian Boylston did some measurements before.
http://oss.sgi.com/archives/xfs/2016-08/msg00239.html

I updated his test program to skip pmem_persist() for the cached copy
case.

                        dst = dstbase;
+ #if 0
                        /* see note above */
                        if (mode == 'c')
                                pmem_persist(dst, dstsz);
+ #endif
                }

Here are sample runs:

$ numactl -N0 time -p ./memcpyperf c /mnt/pmem0/file 1000000
INFO: dst 0x7f1d00000000 src 0x601200 dstsz 2756509696 cpysz 16384
real 3.28
user 3.27
sys 0.00

$ numactl -N0 time -p ./memcpyperf n /mnt/pmem0/file 1000000
INFO: dst 0x7f6080000000 src 0x601200 dstsz 2756509696 cpysz 16384
real 1.01
user 1.01
sys 0.00

$ numactl -N1 time -p ./memcpyperf c /mnt/pmem0/file 1000000
INFO: dst 0x7fe900000000 src 0x601200 dstsz 2756509696 cpysz 16384
real 4.06
user 4.06
sys 0.00

$ numactl -N1 time -p ./memcpyperf n /mnt/pmem0/file 1000000
INFO: dst 0x7f7640000000 src 0x601200 dstsz 2756509696 cpysz 16384
real 1.27
user 1.27
sys 0.00

In this simple test, using non-temporal copy is still faster than using
cached copy.
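
For readers unfamiliar with the two copy modes, the following is a minimal
sketch of the difference. It is not the memcpyperf source; the function names
cached_copy and nt_copy are made up for illustration. It assumes an x86 build
with SSE2 and falls back to memcpy elsewhere, so on other architectures both
functions behave the same.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h> /* _mm_loadu_si128, _mm_stream_si128, _mm_sfence */
#endif

/* Cached copy: ordinary stores, the data lands in the CPU cache first. */
void cached_copy(void *dst, const void *src, size_t len)
{
    memcpy(dst, src, len);
}

/*
 * Non-temporal copy: streaming stores that bypass the cache.
 * Precondition for this sketch: dst is 16-byte aligned and len is a
 * multiple of 16.
 */
void nt_copy(void *dst, const void *src, size_t len)
{
    assert(((uintptr_t)dst % 16) == 0 && (len % 16) == 0);
#if defined(__SSE2__)
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;

    for (size_t i = 0; i < len / 16; i++)
        _mm_stream_si128(&d[i], _mm_loadu_si128(&s[i]));
    _mm_sfence(); /* order the streaming stores before later stores */
#else
    memcpy(dst, src, len); /* no streaming stores available in this build */
#endif
}
```

Both functions produce the same destination contents; they differ only in
whether the written data passes through the cache on the way out.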

Thanks,
-Toshi


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [LSF/MM TOPIC] Future direction of DAX
  2017-01-14  0:20 [LSF/MM TOPIC] Future direction of DAX Ross Zwisler
  2017-01-14  8:26 ` Darrick J. Wong
  2017-01-17 15:59 ` [Lsf-pc] " Jan Kara
@ 2017-01-18  5:25 ` willy
  2017-01-18  6:01   ` Dan Williams
  2017-01-18 17:22   ` Ross Zwisler
  2 siblings, 2 replies; 18+ messages in thread
From: willy @ 2017-01-18  5:25 UTC (permalink / raw)
  To: Ross Zwisler, lsf-pc, linux-fsdevel, linux-nvdimm, linux-block, linux-mm

On Fri, Jan 13, 2017 at 05:20:08PM -0700, Ross Zwisler wrote:
> We still have a lot of work to do, though, and I'd like to propose a discussion
> around what features people would like to see enabled in the coming year as
> well as what use cases their customers have that we might not be aware of.

+1 to the discussion

> - Jan suggested [2] that we could use the radix tree as a cache to service DAX
>   faults without needing to call into the filesystem.  Are there any issues
>   with this approach, and should we move forward with it as an optimization?

Ahem.  I believe I proposed this at last year's LSFMM.  And I sent
patches to start that work.  And Dan blocked it.  So I'm not terribly
amused to see somebody else given credit for the idea.

It's not just an optimisation.  It's also essential for supporting
filesystems which don't have block devices.  I'm aware of at least two
customer demands for this in different domains.

1. Embedded uses with NOR flash
2. Cloud/virt uses with multiple VMs on a single piece of hardware

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [LSF/MM TOPIC] Future direction of DAX
  2017-01-18  5:25 ` willy
@ 2017-01-18  6:01   ` Dan Williams
  2017-01-18  6:07     ` willy
  2017-01-18 17:22   ` Ross Zwisler
  1 sibling, 1 reply; 18+ messages in thread
From: Dan Williams @ 2017-01-18  6:01 UTC (permalink / raw)
  To: willy
  Cc: Ross Zwisler, lsf-pc, linux-fsdevel, linux-nvdimm, linux-block, Linux MM

On Tue, Jan 17, 2017 at 9:25 PM,  <willy@bombadil.infradead.org> wrote:
> On Fri, Jan 13, 2017 at 05:20:08PM -0700, Ross Zwisler wrote:
>> We still have a lot of work to do, though, and I'd like to propose a discussion
>> around what features people would like to see enabled in the coming year as
>> well as what use cases their customers have that we might not be aware of.
>
> +1 to the discussion
>
>> - Jan suggested [2] that we could use the radix tree as a cache to service DAX
>>   faults without needing to call into the filesystem.  Are there any issues
>>   with this approach, and should we move forward with it as an optimization?
>
> Ahem.  I believe I proposed this at last year's LSFMM.  And I sent
> patches to start that work.  And Dan blocked it.  So I'm not terribly
> amused to see somebody else given credit for the idea.
>

I "blocked" moving the phys to virt translation out of the driver
since that mapping lifetime is device specific.

However, I think caching the file offset to physical sector/address
result is a great idea.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [LSF/MM TOPIC] Future direction of DAX
  2017-01-18  6:01   ` Dan Williams
@ 2017-01-18  6:07     ` willy
  2017-01-18  6:25       ` Dan Williams
  0 siblings, 1 reply; 18+ messages in thread
From: willy @ 2017-01-18  6:07 UTC (permalink / raw)
  To: Dan Williams
  Cc: Ross Zwisler, lsf-pc, linux-fsdevel, linux-nvdimm, linux-block, Linux MM

On Tue, Jan 17, 2017 at 10:01:30PM -0800, Dan Williams wrote:
> >> - Jan suggested [2] that we could use the radix tree as a cache to service DAX
> >>   faults without needing to call into the filesystem.  Are there any issues
> >>   with this approach, and should we move forward with it as an optimization?
> >
> > Ahem.  I believe I proposed this at last year's LSFMM.  And I sent
> > patches to start that work.  And Dan blocked it.  So I'm not terribly
> > amused to see somebody else given credit for the idea.
> 
> I "blocked" moving the phys to virt translation out of the driver
> since that mapping lifetime is device specific.

The problem is that DAX currently assumes that there *is* a block driver,
and it might be a char device or no device at all (the two examples I
gave earlier).

> However, I think caching the file offset to physical sector/address
> result is a great idea.

OK, great.  The lifetime problem I think you care about (hotplug) can be
handled by removing all the cached entries for every file
on that block device ... I know there were prototype patches for that;
did they ever get merged?

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [LSF/MM TOPIC] Future direction of DAX
  2017-01-18  6:07     ` willy
@ 2017-01-18  6:25       ` Dan Williams
  0 siblings, 0 replies; 18+ messages in thread
From: Dan Williams @ 2017-01-18  6:25 UTC (permalink / raw)
  To: willy
  Cc: Ross Zwisler, lsf-pc, linux-fsdevel, linux-nvdimm, linux-block, Linux MM

On Tue, Jan 17, 2017 at 10:07 PM,  <willy@bombadil.infradead.org> wrote:
> On Tue, Jan 17, 2017 at 10:01:30PM -0800, Dan Williams wrote:
>> >> - Jan suggested [2] that we could use the radix tree as a cache to service DAX
>> >>   faults without needing to call into the filesystem.  Are there any issues
>> >>   with this approach, and should we move forward with it as an optimization?
>> >
>> > Ahem.  I believe I proposed this at last year's LSFMM.  And I sent
>> > patches to start that work.  And Dan blocked it.  So I'm not terribly
>> > amused to see somebody else given credit for the idea.
>>
>> I "blocked" moving the phys to virt translation out of the driver
>> since that mapping lifetime is device specific.
>
> The problem is that DAX currently assumes that there *is* a block driver,
> and it might be a char device or no device at all (the two examples I
> gave earlier).
>
>> However, I think caching the file offset to physical sector/address
>> result is a great idea.
>
> OK, great.  The lifetime problem I think you care about (hotplug) can be
> handled by removing all the cached entries for every file
> on that block device ... I know there were prototype patches for that;
> did they ever get merged?

No, they didn't. The last review comment was from Al. He wanted the
mechanism converted from explicit calls at del_gendisk() time into a
notifier chain since it's not just filesystems that may want to
register for a block-device end-of-life event.
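
To make the difference from explicit calls concrete, here is a small
user-space sketch of a notifier chain for a device end-of-life event. It does
not use the kernel's notifier API; the names eol_chain, eol_notifier,
eol_register, eol_notify and counting_subscriber are all hypothetical. The
point is only that the code tearing down the disk walks a list of registered
callbacks instead of calling each interested subsystem by name.

```c
#include <assert.h>
#include <stddef.h>

/* One registered callback; subscribers embed this in their own state. */
struct eol_notifier {
    void (*fn)(struct eol_notifier *self, int disk_id);
    struct eol_notifier *next;
};

/* Hypothetical chain head, one per event type. */
struct eol_chain {
    struct eol_notifier *head;
};

/* Any subsystem (filesystem or not) can register interest. */
void eol_register(struct eol_chain *chain, struct eol_notifier *n)
{
    n->next = chain->head;
    chain->head = n;
}

/* Called once at device teardown; returns how many subscribers ran. */
int eol_notify(struct eol_chain *chain, int disk_id)
{
    int called = 0;

    for (struct eol_notifier *n = chain->head; n != NULL; n = n->next) {
        n->fn(n, disk_id);
        called++;
    }
    return called;
}

/* Example subscriber: records the event, e.g. before dropping mappings. */
struct counting_subscriber {
    struct eol_notifier n; /* must stay the first member for the cast below */
    int events;
    int last_disk;
};

void count_event(struct eol_notifier *self, int disk_id)
{
    struct counting_subscriber *s = (struct counting_subscriber *)self;

    s->events++;
    s->last_disk = disk_id;
}
```

A filesystem would register one subscriber to drop its cached mappings, and
any other interested subsystem would register its own, without the teardown
path having to know about either.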

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [LSF/MM TOPIC] Future direction of DAX
  2017-01-18  5:25 ` willy
  2017-01-18  6:01   ` Dan Williams
@ 2017-01-18 17:22   ` Ross Zwisler
  1 sibling, 0 replies; 18+ messages in thread
From: Ross Zwisler @ 2017-01-18 17:22 UTC (permalink / raw)
  To: willy
  Cc: Ross Zwisler, lsf-pc, linux-fsdevel, linux-nvdimm, linux-block, linux-mm

On Tue, Jan 17, 2017 at 09:25:33PM -0800, willy@bombadil.infradead.org wrote:
> On Fri, Jan 13, 2017 at 05:20:08PM -0700, Ross Zwisler wrote:
> > We still have a lot of work to do, though, and I'd like to propose a discussion
> > around what features people would like to see enabled in the coming year as
> > well as what use cases their customers have that we might not be aware of.
> 
> +1 to the discussion
> 
> > - Jan suggested [2] that we could use the radix tree as a cache to service DAX
> >   faults without needing to call into the filesystem.  Are there any issues
> >   with this approach, and should we move forward with it as an optimization?
> 
> Ahem.  I believe I proposed this at last year's LSFMM.  And I sent
> patches to start that work.  And Dan blocked it.  So I'm not terribly
> amused to see somebody else given credit for the idea.
> 
> It's not just an optimisation.  It's also essential for supporting
> filesystems which don't have block devices.  I'm aware of at least two
> customer demands for this in different domains.
> 
> 1. Embedded uses with NOR flash
> 2. Cloud/virt uses with multiple VMs on a single piece of hardware

Yea, I didn't mean the full move to having PFNs in the tree, just using the
sector number in the radix tree instead of calling into the filesystem.
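
A minimal sketch of that narrower idea, as a user-space model: a small
direct-mapped table stands in for the radix tree, and fs_map_block is a
hypothetical stand-in for calling into the filesystem. None of these names
come from the kernel. A hit returns the cached sector; a miss asks the
"filesystem" once and remembers the answer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define CACHE_SLOTS 64 /* tiny table standing in for the radix tree */

struct offset_cache {
    bool     valid[CACHE_SLOTS];
    uint64_t pgoff[CACHE_SLOTS];
    uint64_t sector[CACHE_SLOTS];
    unsigned fs_calls; /* how often the "filesystem" had to be asked */
};

/* Hypothetical block mapping: 8 sectors of 512 bytes per 4 KiB page. */
uint64_t fs_map_block(uint64_t pgoff)
{
    return 1000 + pgoff * 8;
}

/* Fault-path lookup: use the cached sector if present, else fill it in. */
uint64_t lookup_sector(struct offset_cache *c, uint64_t pgoff)
{
    unsigned slot = (unsigned)(pgoff % CACHE_SLOTS);

    if (c->valid[slot] && c->pgoff[slot] == pgoff)
        return c->sector[slot]; /* hit: no filesystem call */

    c->fs_calls++;
    c->pgoff[slot] = pgoff;
    c->sector[slot] = fs_map_block(pgoff);
    c->valid[slot] = true;
    return c->sector[slot];
}

/* Drop every cached translation, e.g. when the mapping may have changed. */
void cache_invalidate_all(struct offset_cache *c)
{
    memset(c->valid, 0, sizeof(c->valid));
}
```

The cache only stores sector numbers, so the physical-to-virtual translation
stays wherever it lives today; the entries have to be invalidated whenever
the underlying block mapping can change.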

My apologies if you feel I didn't give you proper credit.

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2017-01-18 17:22 UTC | newest]

Thread overview: 18+ messages
2017-01-14  0:20 [LSF/MM TOPIC] Future direction of DAX Ross Zwisler
2017-01-14  8:26 ` Darrick J. Wong
2017-01-16  0:19   ` Viacheslav Dubeyko
2017-01-16 20:00   ` Jeff Moyer
2017-01-17  1:50     ` Darrick J. Wong
2017-01-17  2:42       ` Dan Williams
2017-01-17  7:57       ` Christoph Hellwig
2017-01-17 14:54         ` Jeff Moyer
2017-01-17 15:06           ` Christoph Hellwig
2017-01-17 16:07             ` Jeff Moyer
2017-01-17 15:59 ` [Lsf-pc] " Jan Kara
2017-01-17 16:56   ` Dan Williams
2017-01-18  0:03   ` Kani, Toshimitsu
2017-01-18  5:25 ` willy
2017-01-18  6:01   ` Dan Williams
2017-01-18  6:07     ` willy
2017-01-18  6:25       ` Dan Williams
2017-01-18 17:22   ` Ross Zwisler
